Multi-armed bandit vs A/B-тест: когда что выбирать в 2026

Q: Какой algorithm выбрать для начала?

Thompson Sampling. Best performance в большинстве cases, прост для implement.

Q: Сколько arms нужно для MAB полезен?

3+. Меньше — обычный A/B проще и точнее.

Q: Что лучше для conversion vs revenue?

- Conversion (binary): Thompson Sampling с Beta distribution - Revenue (continuous): Gaussian Thompson или Bayesian linear bandits

Q: Можно ли совмещать A/B и MAB?

Да, и нужно. A/B для validation важных hypotheses, MAB для optimization.

Q: Бывает что MAB хуже A/B?

Когда need стат значимости A/B даст более clear answer. MAB optimize'ит value, не truth.

Multi-armed bandit (MAB) — алгоритм online optimization, который одновременно exploit'ит лучшие варианты И explore'ит новые. В отличие от A/B-теста, где трафик жёстко делится 50/50 на фиксированный период, MAB динамически переключает трафик к winning вариантам. В этом гайде — когда что выбирать, 3 главных MAB алгоритма с Python кодом, production-кейсы.

Главное различие

**A/B**: explore первое, exploit потом (sequential phases). **MAB**: explore и exploit одновременно. **A/B** даёт статистическую правду. **MAB** даёт business value во время теста.

Когда A/B, когда MAB

A/B-тест лучше когда:

✅ Нужна statistical significance для решения (publish/no-publish)

✅ Тест важный (impact на core flow)

✅ Variants стабильны (не меняются в течение теста)

✅ Достаточный traffic для A/B

✅ Regulatory / compliance requirements (бренд, регуляторика)

MAB лучше когда:

✅ Много variants для тестирования одновременно (5-10+)

✅ Short-term campaigns (промо, баннеры)

✅ Cold-start ситуации (новый product, нет prior data)

✅ Cost of suboptimal traffic высокий (упускаем revenue)

✅ Variants могут меняться (новые добавляются)

3 главных MAB алгоритма

Epsilon-Greedy — самый простой

С вероятностью ε explore (random arm), с 1-ε exploit (best arm so far).

\\\python


import numpy as np

class EpsilonGreedy:
def __init__(self, n_arms, epsilon=0.1):
self.n_arms = n_arms
self.epsilon = epsilon
self.counts = np.zeros(n_arms)
self.values = np.zeros(n_arms)

def select_arm(self):
if np.random.random() < self.epsilon:
return np.random.randint(self.n_arms) # explore
return np.argmax(self.values) # exploit

def update(self, arm, reward):
self.counts[arm] += 1
n = self.counts[arm]
self.values[arm] = ((n - 1) / n) * self.values[arm] + (1 / n) * reward

\\\

Pros: простой, легко understand, deterministic exploration rate.

Cons: не использует uncertainty (даже если arm clearly bad, продолжает explore с probability ε).

UCB (Upper Confidence Bound) — uncertainty-aware

Выбирает arm с наибольшим upper confidence bound на estimated reward.

UCB_i = X̄_i + √((2 ln t)/(n_i))

\\\python


class UCB:
def __init__(self, n_arms):
self.n_arms = n_arms
self.counts = np.zeros(n_arms)
self.values = np.zeros(n_arms)
self.t = 0

def select_arm(self):
self.t += 1
# Try untried arms first
for arm in range(self.n_arms):
if self.counts[arm] == 0:
return arm
ucb_values = self.values + np.sqrt(2 * np.log(self.t) / self.counts)
return np.argmax(ucb_values)

def update(self, arm, reward):
self.counts[arm] += 1
n = self.counts[arm]
self.values[arm] = ((n - 1) / n) * self.values[arm] + (1 / n) * reward

\\\

Pros: автоматически explore arms с большой uncertainty.

Cons: требует tunable parameter (constant в UCB formula).

Thompson Sampling — Bayesian, the king

Для каждого arm храним posterior distribution на reward. На каждом шаге sample от distributions, выбираем arm с highest sampled value.

\\\python


class ThompsonSampling:
def __init__(self, n_arms):
self.n_arms = n_arms
# Beta(alpha, beta) для binary rewards (conversion)
self.alpha = np.ones(n_arms) # successes + 1
self.beta = np.ones(n_arms) # failures + 1

def select_arm(self):
samples = np.random.beta(self.alpha, self.beta)
return np.argmax(samples)

def update(self, arm, reward):
if reward > 0:
self.alpha[arm] += 1
else:
self.beta[arm] += 1

\\\

Pros: state-of-the-art performance в большинстве benchmarks. Naturally Bayesian — incorporates prior knowledge.

Cons: complex для interpretation. Distribution choice matters (Beta для conversions, Gaussian для continuous).

Benchmark: regret comparison

Regret = разница между optimal strategy и наш algorithm.

\\\python


def simulate_bandit(algorithm, true_means, n_trials=10000):
n_arms = len(true_means)
rewards = []
for _ in range(n_trials):
arm = algorithm.select_arm()
reward = np.random.binomial(1, true_means[arm])
algorithm.update(arm, reward)
rewards.append(reward)
return np.cumsum(rewards)

# 3 arms with conversion rates 5%, 8%, 12%
true_means = [0.05, 0.08, 0.12]

eps_rewards = simulate_bandit(EpsilonGreedy(3), true_means)
ucb_rewards = simulate_bandit(UCB(3), true_means)
ts_rewards = simulate_bandit(ThompsonSampling(3), true_means)

\\\

Typical результаты на 10K trials:

Optimal (always picks best): 1200 conversions
Thompson Sampling: ~1180
UCB: ~1140
Epsilon-Greedy: ~1100
Random: ~830

Thompson Sampling обычно лучший на real-world применениях.

Production кейсы

Yandex Market: ranker variants

Yandex Market использует Thompson Sampling для ranking model selection. Несколько ML models онлайн, MAB динамически allocate'ит traffic к best performing.

Booking.com: hotel image selection

Booking.com MAB выбирает primary hotel photo (из 50+ candidates). Photos с higher click-through автоматически получают больше impressions.

Netflix: artwork personalization

Netflix использует contextual bandits (extension MAB на context features) для выбора thumbnails per user.

LinkedIn: feed ranking

LinkedIn использует MAB для balancing exploration of new content types vs exploitation of known good types.

Hybrid approach

Production реальности часто A/B + MAB:

A/B тест 2-3 variants на достаточном sample size для statistical significance
MAB для дальнейшего optimization среди winners
Continuous experimentation через MAB на long-tail variants

Common mistakes

❌ MAB на core flow без safety nets

MAB может «застрять» на subtly bad arm. Always monitor key metrics + manual stop button.

❌ Игнорировать non-stationary effects

User preferences меняются (seasonality, trend). Pure MAB не учитывает. Use sliding window или forget-factor.

❌ Использовать MAB для big decisions

Statistical significance важна для важных decisions. MAB не даёт guaranteed statistical правду — он optimizing.

❌ Cold start без exploration

Если starting from scratch, нужно forced exploration phase (5-10% randomized traffic).

FAQ

Какой algorithm выбрать для начала?

Thompson Sampling. Best performance в большинстве cases, прост для implement.

Сколько arms нужно для MAB полезен?

3+. Меньше — обычный A/B проще и точнее.

Что лучше для conversion vs revenue?

Conversion (binary): Thompson Sampling с Beta distribution
Revenue (continuous): Gaussian Thompson или Bayesian linear bandits

Можно ли совмещать A/B и MAB?

Да, и нужно. A/B для validation важных hypotheses, MAB для optimization.

Бывает что MAB хуже A/B?

Когда need стат значимости A/B даст более clear answer. MAB optimize'ит value, не truth.

Что дальше

Источники

Sutton & Barto «Reinforcement Learning: An Introduction» — bandit chapter
Russo et al. «A Tutorial on Thompson Sampling» (2018)
exp-platform.com — Microsoft A/B platform

Multi-armed bandit vs A/B-тест: когда что выбирать в 2026

Когда A/B, когда MAB

A/B-тест лучше когда:

MAB лучше когда:

3 главных MAB алгоритма

Epsilon-Greedy — самый простой

UCB (Upper Confidence Bound) — uncertainty-aware

Thompson Sampling — Bayesian, the king

Benchmark: regret comparison

Production кейсы

Yandex Market: ranker variants

Booking.com: hotel image selection

Netflix: artwork personalization

LinkedIn: feed ranking

Hybrid approach

Common mistakes

❌ MAB на core flow без safety nets

❌ Игнорировать non-stationary effects

❌ Использовать MAB для big decisions

❌ Cold start без exploration

FAQ

Какой algorithm выбрать для начала?

Сколько arms нужно для MAB полезен?

Что лучше для conversion vs revenue?

Можно ли совмещать A/B и MAB?

Бывает что MAB хуже A/B?

Что дальше

Источники

Похожие статьи