Particle Optimization for Demand Estimation
1 Purpose
This note develops the gradient-flow view of particle optimization for random-coefficient demand estimation. The setting is deliberately modest: low-dimensional product characteristics, roughly K \le 4, where adaptive support points can plausibly improve on a fixed grid without running into the worst high-dimensional optimizer pathologies.
The basic message is:
- FKRB-style estimators turn nonparametric random coefficients into a convex regression problem once the coefficient support is fixed.
- The fixed-support step is also the bottleneck: if the grid misses the economically relevant regions of taste space, no amount of convex weight fitting can recover the distribution well.
- Particle optimization relaxes the fixed-support assumption by moving the support points themselves.
- The resulting method is no longer convex, but its gradients have a useful interpretation: particles move toward coefficient values that make observed choices more likely, while their weights learn how much mass those regions deserve.
The goal here is not to sell particle optimization as a finished estimator. It is to explain why the gradient flow is natural, where it helps, and why the low-dimensional restriction matters.
2 1. Random coefficients demand
Consider a multinomial logit demand model with an outside option. Consumer i chooses among inside products j=1,\ldots,J and an outside option normalized to utility zero. For inside products,
u_{ij} = x_{ij}'\beta_i + \varepsilon_{ij},
where:
- x_{ij} \in \mathbb{R}^K is a vector of product characteristics,
- \beta_i \in \mathbb{R}^K is a consumer-specific taste vector,
- \varepsilon_{ij} is Type-I extreme value noise.
Conditional on \beta, the probability of choosing product j is
g_j(x_i,\beta) = \frac{\exp(x_{ij}'\beta)}{1 + \sum_{\ell=1}^J \exp(x_{i\ell}'\beta)}.
The outside option probability is the leftover mass,
g_0(x_i,\beta) = \frac{1}{1 + \sum_{\ell=1}^J \exp(x_{i\ell}'\beta)}.
Let F denote the population distribution of random coefficients. The unconditional inside-good choice probability is
P_j(x_i) = \int g_j(x_i,\beta) \, dF(\beta).
The econometric problem is to learn F from observed choices. Nonparametrically, this is hard because F is infinite-dimensional and enters through nonlinear integrals.
3 2. The FKRB fixed-support idea
Bajari, Fox, Ryan, and related work propose a useful simplification. Approximate the unknown coefficient distribution by a discrete distribution on a pre-specified support:
F(\beta) \approx \sum_{r=1}^R \theta_r \delta_{b_r}(\beta), \qquad \theta_r \ge 0, \qquad \sum_{r=1}^R \theta_r = 1.
The support points b_1,\ldots,b_R are chosen before estimation. They may come from a grid, random draws, or preliminary estimates. Once these points are fixed, the choice probabilities are linear in the unknown weights:
P_j(x_i) \approx \sum_{r=1}^R \theta_r g_j(x_i,b_r).
Let y_{ij} be an indicator that consumer i chose product j. Define the basis matrix Z(B) by stacking the model-implied probabilities generated by each support point:
Z_{r,(i,j)}(B) = g_j(x_i,b_r),
where B=(b_1,\ldots,b_R). With fixed B, the FKRB estimator solves
\hat\theta(B) = \arg\min_{\theta \in \Delta^{R-1}} \frac{1}{NJ}\sum_{i=1}^N\sum_{j=1}^J \left(y_{ij} - \sum_{r=1}^R \theta_r g_j(x_i,b_r)\right)^2,
where \Delta^{R-1} is the probability simplex.
This is the central computational advantage. Conditional on B, the problem is a convex constrained least-squares problem. There is no likelihood surface over continuous random-coefficient parameters. There are no MCMC draws. There is no simulation noise inside a nonlinear optimizer.
3.1 2.1 What fixed support buys
The fixed-support approach gives three valuable properties:
- Convexity in weights. The objective is quadratic in \theta and the constraints are linear.
- Nonparametric flexibility. A sufficiently rich support can approximate complicated coefficient distributions.
- Computational transparency. The estimator is just a constrained regression on basis functions g_j(x_i,b_r).
3.2 2.2 What fixed support costs
The price is that all flexibility is mediated by the chosen support. If the support misses the important regions of coefficient space, the estimator can only compensate by putting weight on nearby or extreme points.
In two dimensions, a grid can be visually and computationally reasonable. If K=2 and we place 20 points along each axis, R=400 support points are manageable. But a literal tensor grid grows as C^K. With C=20 and K=4, the grid already has 160,000 points. Most of these points are economically irrelevant empty space.
For low-but-not-tiny dimensions, the tension is clear:
- a coarse grid is computationally convenient but may miss modes;
- a fine grid covers space but wastes many atoms;
- random support is cheaper but can leave holes near important regions.
Particle optimization is a way to keep the finite-mixture representation while letting the support points adapt.
4 3. From fixed support to particles
The fixed-support objective can be written as
L(\theta,B) = \frac{1}{NJ}\left\|Y - Z(B)'\theta\right\|^2, \qquad \theta \in \Delta^{R-1}.
FKRB fixes B and optimizes \theta. Particle optimization treats both as unknown:
(\hat\theta,\hat B) = \arg\min_{\theta\in\Delta^{R-1},\,B} L(\theta,B).
The support points b_r are now particles. Each particle represents a candidate consumer type. The weights \theta_r represent the mass assigned to each type.
This changes the problem qualitatively. The objective is no longer convex because Z(B) depends nonlinearly on B. But it also changes what the estimator can do: instead of hoping a grid point is already close to a true taste type, particles can move toward taste types that explain choices well.
5 4. Gradient flow for the support points
The useful way to understand particle optimization is not as a black-box neural optimizer. The gradients have an economic interpretation.
Let
\hat y_{ij} = \sum_{r=1}^R \theta_r g_j(x_i,b_r)
be the model-implied probability for product j and consumer i, and let
e_{ij} = y_{ij} - \hat y_{ij}
be the residual. The mean squared error objective is
L(\theta,B) = \frac{1}{NJ}\sum_{i,j} e_{ij}^2.
For a single particle r, the gradient with respect to its location is
\nabla_{b_r} L = -\frac{2\theta_r}{NJ} \sum_{i=1}^N\sum_{j=1}^J e_{ij}\, \nabla_{b_r} g_j(x_i,b_r).
Thus the particle update
b_r \leftarrow b_r - \eta_b \nabla_{b_r}L
can be written as
b_r \leftarrow b_r + \frac{2\eta_b\theta_r}{NJ} \sum_{i,j} e_{ij}\,\nabla_{b_r} g_j(x_i,b_r).
This formula is the core gradient-flow point.
5.1 4.1 Interpreting the residual-weighted score
For a product that is chosen, y_{ij}=1. If the current mixture underpredicts that choice, then e_{ij}>0. The update pushes particles in directions that increase g_j(x_i,b_r).
For a product that is not chosen, y_{ij}=0. If the current mixture overpredicts that choice, then e_{ij}<0. The update pushes particles in directions that decrease g_j(x_i,b_r).
So each particle receives a residual-weighted score signal:
- move toward coefficient values that raise probabilities of underpredicted chosen products;
- move away from coefficient values that raise probabilities of overpredicted unchosen products;
- move more strongly when the particle has larger mass \theta_r.
The last point is important. A high-weight particle is interpreted as representing many consumers, so the loss is more sensitive to its location. A low-weight particle has little effect on predictions and therefore receives weak location gradients.
5.2 4.2 The logit derivative
The derivative of the logit probability makes the movement more concrete. Let
\bar x_i(b) = \sum_{\ell=1}^J g_\ell(x_i,b)x_{i\ell},
where the outside option contributes zero characteristics. Then
\nabla_b g_j(x_i,b) = g_j(x_i,b)\left(x_{ij} - \bar x_i(b)\right).
Substituting into the particle gradient gives
\nabla_{b_r} L = -\frac{2\theta_r}{NJ} \sum_{i,j} e_{ij}\, g_j(x_i,b_r)\left(x_{ij} - \bar x_i(b_r)\right).
This says that a particle moves in the direction of characteristics that distinguish underpredicted choices from the particle’s own predicted average choice bundle.
If a particle currently thinks consumer i should choose products with average characteristics \bar x_i(b_r), but the residuals say the model needs more probability on product j, then x_{ij}-\bar x_i(b_r) tells the particle how to adjust tastes to make product j more attractive.
5.3 4.3 Continuous-time picture
Ignoring optimizer details, the gradient descent dynamics approximate the flow
\frac{d b_r(t)}{dt} = -\nabla_{b_r}L(\theta(t),B(t)),
or
\frac{d b_r(t)}{dt} = \frac{2\theta_r(t)}{NJ} \sum_{i,j} e_{ij}(t)\, \nabla_{b_r} g_j(x_i,b_r(t)).
In this picture, particles are not random simulation draws. They are moving basis functions. The estimator begins with a rough dictionary of possible taste types and then deforms the dictionary toward regions that reduce aggregate prediction error.
This is why the method can help in low dimensions. Instead of needing a dense grid everywhere, we need enough particles to receive gradient signals and migrate toward the relevant regions.
6 5. Weight dynamics and particle collapse
The weights solve a different problem. Holding B fixed, the gradient with respect to \theta_r is
\frac{\partial L}{\partial \theta_r} = -\frac{2}{NJ}\sum_{i,j} e_{ij} g_j(x_i,b_r).
Particles whose basis functions align with positive residuals get more weight; particles that worsen predictions lose weight.
The natural projected-gradient update is
\theta \leftarrow \Pi_{\Delta} \left(\theta - \eta_\theta \nabla_\theta L\right),
where \Pi_\Delta projects onto the simplex.
This is also where the failure mode appears. If one particle is slightly better early on, it gains weight. Because the location gradient is multiplied by \theta_r, that particle then moves faster. Nearby particles lose weight and stop moving. This creates a positive feedback loop:
\text{better fit} \Rightarrow \text{higher weight} \Rightarrow \text{larger location gradient} \Rightarrow \text{even better fit}.
This feedback is useful when the leading particle is heading toward a genuine mode. It is harmful when it kills exploration too early.
6.1 5.1 Stabilizing the flow
For the low-dimensional setting, several stabilizers are natural:
- Warm up locations with uniform weights. For an initial phase, set \theta_r=1/R and update only B. This lets all particles move before the simplex weights start killing weak particles.
- Box-project particle locations. Restrict b_r \in [a,b]^K to prevent extreme coefficients from fitting finite-sample noise.
- Use separate learning rates. The scale of \theta and B gradients differs. A single Adam optimizer for both is unnecessarily brittle.
- Scale location learning rates with dimension. A step size proportional to 1/\sqrt{K} is a simple way to prevent location updates from growing too aggressive as K rises.
- Add entropy or softmax temperature. Entropy regularization, or a softmax parameterization of weights, keeps particles alive longer.
A practical algorithm is therefore:
- Initialize R particles uniformly over a plausible bounded region.
- For T_0 iterations, hold \theta_r=1/R and update particle locations.
- For T_1 iterations, jointly update \theta and B.
- Project \theta to the simplex and B to the coefficient box after each update.
- Monitor effective sample size, max |b_r|, loss, and mode recovery diagnostics.
7 6. Why low dimensions are the right sandbox
The low-dimensional restriction is not just computational convenience. It changes whether the gradient signal is usable.
When K grows, several problems compound:
- Dot products x_{ij}'b_r can grow in magnitude, saturating logit probabilities.
- The same number of particles covers exponentially less of the coefficient space.
- More directions are available for overfitting finite-sample residuals.
- The top weighted particles can duplicate a single mode while missing others.
For K\le4, the geometry is still inspectable. We can plot two-dimensional slices, monitor where particles move, and run enough restarts to see whether mode recovery is genuine. This is the right place to understand the estimator before trying to scale it.
8 7. Numerical illustration
This section makes the discussion concrete with a small self-contained Monte Carlo exercise. The design is intentionally simple and low-dimensional.
For each K\in\{2,3,4\}, generate N=1200 consumers and J=10 inside goods. Characteristics satisfy
x_{ij} \sim \mathrm{Uniform}([-1,2]^K),
and random coefficients come from a two-point mixture,
\beta_i = \begin{cases} (-2,\ldots,-2), & \text{with probability } 0.6,\\ (1,\ldots,1), & \text{with probability } 0.4. \end{cases}
Choices are generated from the multinomial logit model with an outside option. Each particle estimator uses R=50 particles and 1600 optimizer iterations. The diagnostic loss reported below is mode recovery error: after estimation, take the top two estimated support points, match them to the two true modes by minimum total distance, and report the true-weighted matched distance. Lower is better.
Before the summary table, it is useful to look at the particle geometry in two dimensions. The first plot uses a simple two-point mixture with modes at (-2,-2) and (1,1). The fixed-support estimator spreads mass over the grid, while particle optimization moves support points toward the relevant regions.
The second plot uses the correlated bimodal BFR-style design. This is a harder case: the target distribution is not just two point masses, but two correlated components with unequal weights. The figure compares the true density, the fixed-support estimate, and the particle estimate.
The next plot compares mode recovery error for the vanilla finite-support estimator and several particle variants. For K=2, the fixed-support baseline is a literal finite grid. For K>2, it uses a fixed random support of 100 atoms, which is the computationally comparable version of the FKRB idea used here.
The table reports averages over three seeds. ESS is the effective number of particles, 1/\sum_r\theta_r^2.
| K | method | mode error | ESS | max |\beta| | max weight |
|---|---|---|---|---|---|
| 2 | fixed support | 2.36 | 4.18 | 4.00 | 0.36 |
| 2 | baseline projected-simplex particles | 1.49 | 8.54 | 10.50 | 0.21 |
| 2 | boxed particles + scaled \beta LR | 1.40 | 8.33 | 4.00 | 0.23 |
| 2 | uniform weights, boxed particles | 0.85 | 50.00 | 4.00 | 0.02 |
| 2 | staged warmup + boxed particles | 3.55 | 50.00 | 4.00 | 0.02 |
| 2 | high-entropy boxed particles | 2.18 | 23.19 | 3.71 | 0.10 |
| 3 | fixed support | 2.91 | 7.03 | 3.99 | 0.25 |
| 3 | baseline projected-simplex particles | 6.23 | 8.43 | 17.46 | 0.21 |
| 3 | boxed particles + scaled \beta LR | 3.24 | 8.26 | 4.00 | 0.19 |
| 3 | uniform weights, boxed particles | 3.22 | 50.00 | 4.00 | 0.02 |
| 3 | staged warmup + boxed particles | 3.18 | 48.66 | 4.00 | 0.03 |
| 3 | high-entropy boxed particles | 2.52 | 31.43 | 4.00 | 0.05 |
| 4 | fixed support | 3.42 | 8.16 | 3.99 | 0.20 |
| 4 | baseline projected-simplex particles | 7.12 | 7.63 | 19.99 | 0.23 |
| 4 | boxed particles + scaled \beta LR | 2.88 | 6.49 | 4.00 | 0.27 |
| 4 | uniform weights, boxed particles | 5.39 | 50.00 | 4.00 | 0.02 |
| 4 | staged warmup + boxed particles | 3.34 | 46.31 | 4.00 | 0.03 |
| 4 | high-entropy boxed particles | 3.86 | 33.67 | 4.00 | 0.05 |
Several patterns are useful.
First, the finite-support baseline is a serious benchmark. It is not dominated by the naive particle method: at K=3 and K=4, the unconstrained particle baseline is worse because particles run to extreme coefficient magnitudes. Adaptive support only helps once the particle dynamics are stabilized.
Second, simply bounding the particles helps a lot. At K=4, box projection plus a 1/\sqrt K location learning-rate scale reduces average mode error from 7.12 to 2.88 and slightly improves on the fixed-support baseline’s 3.42, while keeping coefficients inside the intended domain.
Third, there is a tradeoff between mode accuracy and particle diversity. The boxed method with learned simplex weights has low mode error but low ESS. The staged method keeps nearly all particles alive, but can be slightly worse on the top-two-mode diagnostic. This suggests a useful two-stage workflow: use warmup/staging to explore support, then allow weights to concentrate once particles have reached plausible regions.
Fourth, the uniform-weight variant does surprisingly well at K=2. This is a warning that the weight dynamics are not always helping in tiny samples. In low-dimensional exploratory work, it is worth reporting the fixed-support estimator, the adaptive-weight particle estimator, and a uniform-weight particle baseline.
The main takeaway is that the poor scaling visible in the naive algorithm is partly fixable, but the finite-support baseline should always be shown. The low-dimensional recipe should use bounded particles, separate optimizers, and diagnostics for both coefficient runaway and particle collapse. The estimator is not plug-and-play, but the gradient-flow idea is empirically useful once the dynamics are stabilized.
9 8. Product-market shocks and first-stage \xi
The experiments above deliberately isolate the support-selection problem. That is useful, but it leaves out a central applied-demand complication: product-market demand shocks \xi_{jt}. In real market-share data, prices and product characteristics usually do not explain most of the variation in shares. If the constructed type-specific shares omit \xi_{jt}, then the FKRB regression is not just noisy; its regressors are wrong.
With product-market shocks, the type-specific basis function should be
s^r_{jt} = \frac{\exp(x_{jt}'b_r + \xi_{jt})} {1 + \sum_\ell \exp(x_{\ell t}'b_r + \xi_{\ell t})}.
If \xi_{jt} is unobserved, the practical Meeker-style fix is to estimate a first-stage simple logit,
\log(s_{jt}/s_{0t}) = x_{jt}'\bar\beta + \xi_{jt},
and plug the residual \hat\xi_{jt} into the constructed shares. A slightly richer first stage uses the Salanié-Wolak FRAC approximation. In the price-only heterogeneity case,
\log(s_{jt}/s_{0t}) = x_{jt}'\bar\beta + \sigma_\alpha^2 K_{jt} + \xi_{jt} + O(\sigma_\alpha^4),
where
K_{jt} = (p_{jt}/2 - e_t)p_{jt}, \qquad e_t = \sum_j s_{jt}p_{jt}.
This suggests a useful extension of the particle estimator: estimate \hat\xi first, then run either fixed-support FKRB or particle optimization using x_{jt}'b_r + \hat\xi_{jt} in the logit index. The oracle version, using true \xi, is an upper benchmark.
I added a preliminary sandbox script, xi_experiment.py, that simulates markets with endogenous-ish prices correlated with \xi, compares no correction, logit first-stage \hat\xi, FRAC first-stage \hat\xi, and oracle \xi, and reports mean absolute error in own-price elasticities. The current small run is intentionally not definitive, but it is informative:
The first-pass evidence cuts both ways. Oracle \xi is almost always excellent for particles, so the dark-matter shocks really do contain information the support cannot fully manufacture. But without oracle information, sufficiently rich heterogeneity plus adaptive particles often competes with or beats the simple logit/FRAC corrections on elasticity error. This is especially visible when heterogeneity is large: the particle estimator can use beta dispersion to soak up share patterns that fixed support attributes poorly.
That does not mean \xi is irrelevant. It means there are two substitutable margins in finite samples: unobserved product quality and flexible taste heterogeneity. The next experiment should vary the relative size of \sigma_\beta and \sigma_\xi more systematically, add more seeds/restarts, and separate prediction fit from structural elasticity accuracy. A useful diagnostic is precisely where the two margins become observationally confounded.
9.1 8.1 Replicating Brand’s blog simulation with particles
I also replicated the simulation design from Brand’s post using the FKRB.jl source as a reference. The data-generating process uses 500 two-product markets and 500 four-product markets, random coefficients on price and x with mean (-1,1) and standard deviations (0.3,0.3), product-market shocks with standard deviation 0.3, endogenous-ish prices
p_{jt} = 2(z_{jt} + u_{jt}) + \xi_{jt},
and market fixed effects. The plotted statistic is the average own-price elasticity across products and markets. I ran the exact DGP for 80 Monte Carlo replications and added a tuned particle estimator using the FRAC-style first-stage \hat\xi.
The ranking matches the blog’s qualitative message and gives particles a small additional gain. Mean average own-price elasticities in this run were:
| method | mean elasticity | bias vs truth |
|---|---|---|
| Truth | -0.798 | 0.000 |
| FKRB, no \xi | -0.571 | 0.227 |
| FKRB + logit \hat\xi | -0.599 | 0.199 |
| FKRB + FRAC \hat\xi | -0.612 | 0.186 |
| Tuned particles + FRAC \hat\xi | -0.620 | 0.178 |
This is not yet a decisive win for particles; it is a modest improvement on the FRAC-corrected fixed-support estimator under Brand’s DGP. The more important result is that the particle estimator moves in the right direction without changing the first-stage correction. The remaining gap to truth is still large, which suggests that in this DGP the first-stage \xi problem is more binding than support adaptation alone.
10 9. How particle optimization helps here
The FKRB estimator is strongest once support is good. Particle optimization is a support-improvement device.
In a low-dimensional demand problem, the true taste distribution may have a few economically meaningful regions: price-sensitive consumers, quality-sensitive consumers, brand loyalists, and so on. A fixed grid spreads atoms across the whole coefficient box. Particle optimization lets the atoms concentrate where the data indicate such types might live.
The benefit is clearest when:
- the true distribution is multimodal but low-dimensional;
- a coarse grid places too few atoms near a mode;
- the researcher has a plausible bounded coefficient region;
- the goal is exploratory recovery of taste heterogeneity rather than a final production estimator.
The method should be viewed as an adaptive dictionary for choice-probability basis functions. FKRB estimates the mixture weights on a given dictionary. Particle optimization tries to learn a better dictionary.
11 10. What to report in experiments
For the low-dimensional experiments, report more than a single error curve. The useful diagnostics are:
- final prediction loss;
- mode recovery error when true modes are known;
- effective sample size 1/\sum_r \theta_r^2;
- number of active particles above thresholds like \theta_r>0.01;
- maximum |b_r|;
- sensitivity to seeds and restarts;
- plots of particle locations and weights.
These diagnostics separate three different failure modes:
- Bad support: particles never reach the relevant regions.
- Collapse: particles reach a region but all mass concentrates on too few atoms.
- Runaway coefficients: particles fit noise with extreme taste values.
The fixes above target different failures: warmup fights collapse, box projection fights runaway coefficients, and restarts fight bad initialization.
12 11. Bottom line
For K\le4, particle optimization is a promising extension of fixed-support FKRB estimation. Its appeal is not that it magically solves nonparametric demand estimation. Its appeal is that it turns support selection into a gradient-based adaptive basis problem.
The gradient flow has a clean interpretation: particles move toward taste vectors that raise probabilities of underpredicted choices and lower probabilities of overpredicted choices. In low dimensions, this can recover useful support points with far fewer atoms than a dense grid.
The open problem is to make the dynamics robust. The most promising practical recipe is bounded particles, staged weight learning, separate optimizers, and explicit diagnostics for collapse and coefficient runaway.
13 References
- Bajari, P., Fox, J. T., & Ryan, S. P. (2007). Linear regression estimation of discrete choice models with nonparametric distributions of random coefficients. American Economic Review, 97(2), 459-463.
- Fox, J. T., Kim, K., Ryan, S. P., & Bajari, P. (2011). A simple estimator for the distribution of random coefficients. Quantitative Economics, 2(3), 381-418.