Which Nash Equilibrium? Solver-Dependent Selection on Zero-Sum Nash Polytopes

Technical Synopsis§

This paper systematically studies the phenomenon that different Nash equilibrium solvers converge to different members of the convex set of equilibria (the Nash polytope) in zero-sum games. The authors construct a tabular testbed with analytically known Nash sets, including a 2D polytope and Kuhn poker, and compare algorithms from two families: regularized last-iterate methods (R-NaD, magnetic mirror descent) and regret-averaging methods (CFR, CFR+, fictitious play).

Core Methodology§

The Nash polytope for a zero-sum game is defined as: $$\mathcal{N} = \{ (\mathbf{x}, \mathbf{y}) \mid \mathbf{x} \in \Delta^m, \mathbf{y} \in \Delta^n, \mathbf{x}^\top A \mathbf{y} = V^ \}$$ where $V^$ is the game value. The key insight is that regularized methods with a uniform reference distribution $\mathbf{p} = (1/m, \dots, 1/m)$ converge to the maximum-entropy equilibrium: $$\mathbf{x}^ = \arg\max_{\mathbf{x} \in \mathcal{N}} H(\mathbf{x}), \quad \mathbf{y}^ = \arg\max_{\mathbf{y} \in \mathcal{N}} H(\mathbf{y})$$ which is equivalent to the information projection (I-projection) of the uniform reference onto the Nash set: $$\mathbf{x}^* = \arg\min_{\mathbf{x} \in \mathcal{N}} D_{KL}(\mathbf{x} \| \mathbf{p})$$

Algorithms Studied§

R-NaD: Regularized Nash Dynamics. Updates using softmax with temperature decay, adding an entropy bonus to the payoff.
Magnetic Mirror Descent (MMD): Mirror descent with a strongly convex regularizer that penalizes divergence from a reference.
CFR/CFR+: Counterfactual Regret Minimization. Accumulates regret and uses regret-matching to select actions.
Fictitious Play (FP): Best response to empirical average of opponent's actions.

Key Findings§

1. Algorithm-dependent selection: Within each family, seed randomness does not cause variation; the deterministic part of the algorithm dictates the selected equilibrium. Regularized methods converge to the maximum-entropy member (verified analytically on 2D polytope, >99.7% on Kuhn). Regret methods converge to lower-entropy faces (CFR+ attains max entropy in only ~6% of 180-game ensemble).

2. Consequences for robustness: The maximum-entropy equilibrium is a stricter hedge against suboptimal opponents, especially in games with hidden information (e.g., Kuhn poker). In matrix games, differences exist but no strict dominance.

3. Negative results: Removing CFR's $\max(R, 0)$ projection does not eliminate drift away from the center of the polytope. R-NaD's selection is anchor-following (depends on reference distribution), not initialization-independent.

Implementation Details§

The authors provide an exact tabular solver for small games (e.g., 3x3 matrix, Kuhn poker with 6 card states). The Nash polytope is computed via linear programming (vertex enumeration). The codebase (in Python) allows analytic verification of solver outputs against the known equilibrium set.

import numpy as np
from scipy.optimize import linprog

def nash_polytope(payoff_matrix):
    # compute V* via linear programming
    m, n = payoff_matrix.shape
    c = np.zeros(m + 1)
    c[-1] = -1
    A_ub = np.hstack([payoff_matrix, -np.ones((m, 1))])
    b_ub = np.zeros(m)
    A_eq = np.ones((1, m + 1))
    b_eq = np.array([1.0])
    bounds = [(0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds, method='highs')
    V_star = -res.fun
    # then compute full polytope via vertex enumeration (omitted for brevity)
    return V_star

def rnad_update(strategy, payoff, temperature, reference):
    # single step of regularized Nash dynamics
    logits = payoff @ strategy
    strategy_new = np.exp((logits + temperature * np.log(reference))) 
    strategy_new /= np.sum(strategy_new)
    return strategy_new

Open Conjecture§

The maximum-entropy / I-projection characterization is stated as a strongly supported conjecture, with analytic verification on small polytopes and empirical backing on 180 games.

Abstract

Technical Analysis & Implementation

Technical Synopsis§

Core Methodology§

Algorithms Studied§

Key Findings§

Implementation Details§

Open Conjecture§

Related Research

Second-Order KKT Guarantees for Bregman ADMM in Nonconvex and Non-Lipschitz Optimization

DexCompose: Reusing Dexterous Policies for Multi-Task Manipulation with a Single Hand

Surprises in Proper Positive-Only Learning