Machined Learnings: November 2022

When I was coming up, conferences were a mature technology for disseminating high-quality scientific results at higher speed than journals. The combination of smaller units of work (“one paper, one idea”) and focused peer review were part of the key ingredients. Unfortunately the process is now broken. Arxiv halps fill the gap, but remains caveat emptor; and high quality journals are still slow.

The Inevitability of Game Theory

I've played arguably too much poker in my life, but I did learn some transferable skills. At the poker table, you love to see your opponent upset, because people get upset when they face an adverse condition and have no effective counterstrategy. Unfortunately, in a world where private violence is forbidden, getting upset is counterproductive because you lose to the ability to think clearly in exchange for preparation for a physical altercation that never materializes. If you get upset, you lose.

Clearly people are upset about conferences.

But instead of getting mad at the low quality of peer review exhibited at every major conference nowadays, let's instead try to understand what's happening.

Early in the career path (graduate school, postdoc, pre-tenure), aspiring professionals need credentials to certify their skill.

Research skill is difficult to assess early in a career, and
everybody wants to use objective criterion to avoid bias anyway.

An accept from a famous conference (NeurIPS, ICML, ICLR, etc.) provides credential value.

MSR is like everybody else, our job listings always say “X number of publications in top-tier journals or conferences”.

Accepts are public and visible, while rejects are either private or not salient.
The asymmetric cost/benefits incent everyone to over-submit.
Thus conferences that develop credential value become overwhelmed with submissions and start demanding authors participate in peer review.
A phalanx of participants with neither the time, inclination, or acumen to properly peer review are thrust into the process, resulting in low quality reviews.
The incentives for being a great reviewer are minimal so the situation persists.

e.g., free conference reg. But my employer pays anyway!

Some solutions

Ok so at least now we can generate some ideas.

Remove credential value from accepted submissions.

This is what CODE@MIT did inadvertently, by not publishing proceedings and only recording invited talks. Their goal is a space for preliminary results without fear of embarrassment, but nonetheless they have removed credential value. Thus the conference stays small and high quality (feels like a big multi-day workshop actually, really great, I highly recommend this conference), and will never get spammed with submissions.
While this strategy creates venues for mature professionals who don't need credentials, it doesn't help those early in their career.

Create credential value from high quality reviewing.

Conferences could make a much bigger deal about who the good reviewers.
It needs to be a public visible decision “recognized good reviewer”, with a rate in the low double digit percentage just like accepted paper rates; and
Job postings need to list “X number of recognized good reviewing from top-tier journals or conferences” as a requirement.

Gate submissions on reviewing.

First you need to get recognized as providing good reviews, and this gives you a transferable credit redeemable for some number of submissions. Those who are further along in their career will be incented to provide high quality reviews to accumulate credit transferable to those they are mentoring (e.g., industrial researchers with interns, faculty with graduate students or postdocs).

Bad idea: Create negative credential value from rejected submissions.

Anybody who lived through NeurIPS rejecting the deep learning speech recognition workshop knows science is too human for this to make sense.

Translate Feeling into Thought

Feelings are functional, but only when connected to reason and action. Do you have new ideas? Speak up on social media, not with complaints about how dumb reviewers are, but with actionable suggestions about how to fix the situation. The above suggestions might not work, but we have to start experimenting with new ideas because the situation is getting worse: personally, journals are looking increasingly attractive as an alternative, which is a bad sign.

This slide is from Leon Bottou's amazing ICML 2015 keynote.
You should contemplate this image on a regular basis.

In the above slide Leon Bottou outlines three distinct, valid, and complementary ways of attempting to understand the world. In AI all of these approaches are used, but with different emphasis as trends change. Arguably, there was too much of the left hand side in the “if it's not convex it's crap” days, but then the deep learning revolution pushed many all the way to right. In 2014 I asked a famous theoretician (who will remain anonymous) at a Q&A panel at ICML what he thought the role of theory was given recent developments, and he said “the role of theory is to predict experiments, but now we can just run experiments, so we don't need theory.”

Years later I now realize, the best theory doesn't predict experiments, it suggests them. In other words, the best theory prompts ideas that otherwise would not be contemplated. (Of course, there's still plenty of theory that attempts to explain experiments that are already run, and that's fine: the left column (“Exact Sciences”) of above image needs to digest the output of the other two columns. This is necessary activity, but the real payoff comes when the left hand side suggests something that would have taken a very long time for the other two columns to consider.)

The Decision-Estimation coefficient (DEC) is amazingly good theory, because it suggests algorithm designs that otherwise would not be apparent: already I've published results on infinite action contextual bandits with arbitrary function classes, infinite action neural-bilinear contextual bandits, and risk-averse contextual bandits. The last two are already implemented in Vowpal Wabbit and commercially available in Azure Personalizer Service. In other words, these are not “COLT algorithms”, these are practical algorithms you can run on real problems. (The first one, it works, but as stated in the original paper it doesn't generate causal data exhaust suitable for subsequent offline analysis. We have a fix for this we will publish and then that version will go into the product.) For risk-averse contextual bandits, I was confused for 18 months on how to achieve an online result (the exploration distribution affects the risk measure in a subtle way), and then Dylan Foster explained the DEC to me, and clarity soon followed.

Now I will pay it forward by explaining the DEC to you.

A Tale of Minnie and Max

The DEC is part of the rich tradition of minimax approaches to learning theory: excellent introductions are available from Lafferty et. al., Rakhlin and Sridharan, and Krishnamurthy. For a more modern example, Duchi et. al. provide a minimax characterization of local differential privacy. In classic minimax theory, one bounds the worst case value of a risk functional (i.e., something that says how good your decision rule is) subject to some constraints (e.g., local differential privacy). The DEC is a bit different and fun because it operates as a machine learning reduction to regression, and therefore it bounds the worst case difference between performance on the original problem (“decision”) and the performance on the reduced problem (“estimation”).

Basic Formulation

Here's one version suitable for contextual bandit algorithm design: $$ \text{DEC} = \min_P \max_{Q,f}\ \underbrace{\mathbb{E}_{a \sim P}\left[f(a)\right] - \mathbb{E}_{a \sim Q}\left[f(a)\right]}_{\text{CB regret}} - \gamma\ \underbrace{\mathbb{E}_{a \sim P}\left[ \left(f(a) - \hat{f}(a)\right)^2 \right]}_{\text{Estimation error}} $$ where

$P$ is Player's distribution over actions. In the basic version, Player can pick any distribution over actions, but lately I've been having fun restricting the Player in various ways.
$Q$ is Nature's distribution over actions. In the basic version, Nature can pick any distribution over actions, but constraining Nature is the basis for the smoothed infinite action result.
$f$ is Nature's choice of (expected) reward function. By constraining Nature to pick from a limited complexity function class (e.g., neural-bilinear) you can get nontrivial bounds in infinite action spaces.
$\hat{f}$ is the output of the regression oracle, i.e., our guess about what the true reward function is.
Estimation error here is measured in squared loss, but using different divergences gives different results, e.g., triangular discrimination yields an $L^*$ result for contextual bandits, and expectile loss yields a online risk averse contextual bandit.
$\gamma$ can be interpreted as a learning rate which grows as the decision horizon (total number of rounds of decision making) grows. It can also be interpreted as a Lagrange multiplier: in more recent (currently unpublished) formulations of the DEC the estimation error is constrained rather than subtracted.
You can add a context $x$ to all the components of the DEC (e.g., $\hat{f}(x, a)$), but it doesn't change anything since we evaluate the DEC example-by-example, so we elide it.

Closing the Loop

Let's understand why a DEC bound implies an effective reduction. Informally, a bound on the DEC ensures that one of two things happens: 1) you do well on the original problem you care about or 2) you discover a mistake in your estimator. Because we reduce to online no-regret learning and assume realizability, #2 eventually runs out of steam.

More precisely, suppose we have a bound on the $\text{DEC} \leq C \gamma^{-\alpha}$. After $T$ rounds we have $$\begin{aligned}\text{CB regret} &\leq T C \gamma^{-\alpha} + \gamma \sum_{t=1}^T \left(f_t(a) - \hat{f}_t(a)\right)^2 \\ &= T C \gamma^{-\alpha} + \gamma \text{Reg}(T) \\ &= C^{\frac{1}{1 + \alpha}} \alpha^{-\frac{\alpha}{1 + \alpha}} (1 + \alpha) T^{\frac{1}{1 + \alpha}} \text{Reg}(T)^{\frac{\alpha}{1 + \alpha}} \biggr|_{\gamma =\left(\frac{C T \alpha}{\text{Reg}(T)}\right)^{\frac{1}{1 + \alpha}}} \end{aligned}$$ and in particular

(strongly observable) $\alpha = 1$ implies $\text{CB Regret} = O\left(\sqrt{T \text{Reg}(T)}\right)$ and $\gamma = O\left(\sqrt{\frac{T}{\text{Reg}(T)}}\right)$,
(weakly observable) $\alpha = \frac{1}{2}$ implies $\text{CB Regret} = O(\text{Reg}(T)^{1/3} T^{2/3})$ with $\gamma = O\left(\left(\frac{T}{\text{Reg}(T)}\right)^{2/3}\right)$,
(unobservable) $\alpha \to 0$ implies unbounded DEC and $\text{CB Regret} = O(T)$.

Of course, if $\text{Reg}(T) = T$ then you are hosed. Fortunately for the version of the DEC presented here, with an arbitrary finite function class $\mathcal{F}$, we can use Vovk's aggregation algorithm to achieve $\text{Reg}(T) = O\left(\log \left| \mathcal{F} \right|\right)$, which yields the SquareCB result. In general $\text{Reg}(T)$ will have $T$ dependence which degrades the overall CB regret bound.

Calling All Constructivists

Here's the important part: Foster et. al. show the DEC characterizes the statistical complexity of interactive decision making, but except for a few special cases, the reasoning is nonconstructive. More precisely, they swap the min and max to get an upper bound on the DEC objective value (aka Bayesian reasoning) which does not yield a constructive algorithm. That means if you produce a constructive solution to the DEC for a particular scenario, you will have a novel algorithm which exhibits (near-)optimal minimax performance. (Instance-dependent results are trickier, but possible.).

Consequently, I've been spending lots of time trying to find as many constructive solutions to the DEC as I can. Now you know the secret so you can do the same.

Machined Learnings

Sunday, November 20, 2022

The Conference Problem