However there is already a cloud on the horizon: constraints. Here are two situations I've already encountered:

- In SSP without recourse, the graph structure for a particular realization might not be fully connected. I want to ensure the underlying CSMC doesn't select an infeasible edge.
- In the Ad Selection Problem, the main challenge was ensuring that the same ad was not picked repeatedly.

I've been wondering what the right formulation of constrained CSMC would be, and have thought of two scenarios. The first ``average formulation'' would be appropriate for when the set of feasible choices is related to the problem instances in a stationary fashion. In this case, there is a distribution $D = D_x \times D_{\omega|x} \times D_{c|\omega,x}$, where $c: K \to \mathbf{R}$ takes values in the extended reals $\mathbf{R} = \mathbb{R} \cup \{ \infty \}$, and the components of $c$ which are $\infty$-valued for a particular instance are revealed as part of the problem instance via $\omega \in \mathcal{P} (K)$ (i.e., $\omega$ is a subset of $K$). The regret of a particular classifier $h: X \times \mathcal{P} (K) \to K$ is given by \[ r_{av} (h) = E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ E_{c \sim D_{c|\omega,x}} \left[ c (h (x, \omega)) - \min_{k \in K}\; E_{c \sim D_{c|\omega,x}} \left[ c (k) \right] \right] \right]. \] The second ``minimax formulation'' would be appropriate for when the set of feasible choices will vary in an ad-hoc fashion which is unpredictable from past data. In this case, there is a distribution $D = D_x \times D_{\tilde c|x}$, where $\tilde c: K \to \mathbb{R}$ takes values in the regular reals $\mathbb{R}$. Then, an adversary comes in and manufactures a cost vector $c: K \to \mathbf{R}$ in the extended reals $\mathbf{R}$ by setting some of the components to $\infty$; these choices are revealed via $\omega$ prior to a decision being elicited. In this case the regret of a particular classifier is given by \[ r_{mm} (h) = E_{x \sim D_x} \left[ E_{\tilde c \sim D_{\tilde c|x}} \left[ \max_{\omega \in \mathcal{P} (K)}\; \left\{ c (h (x, \omega)) - \min_{k \in K}\; E_{c \sim D_{\tilde c|x}} \left[ c (k) \right] \right\} \right] \right]. \] Intuitively it feels like SSP without recourse would better fit the average formulation better, and ad serving would better fit the minimax formulation.

Now for some speculation. I suspect that the relative efficiency of the filter tree reduction of CSMC to binary is because it does not handle constraints well. In other words, reduction to regression is prepared to handle arbitrary constraints by limiting the scope of the argmax, but one pays for this additional flexibility with a worse regret bound. If there is a rule of thumb in machine learning, it's ``always solve the easiest version of your problem'', and there are lots of CSMC problems which are unconstrained (i.e., all decisions are allowed on all instances), so this additional flexibility is not warranted.

So the next step is to look at the various reductions of CSMC to regression or binary classification and analyze how well they perform on ``constrained CSMC''.