论文标题
具有凹奖励的上下文匪徒,以及对公平排名的应用
Contextual bandits with concave rewards, and an application to fair ranking
论文作者
论文摘要
我们考虑具有凹奖励(CBCR)的上下文匪徒,这是一个多目标匪徒问题,在该问题中,奖励之间所需的权衡取决于已知的凹面目标函数,奖励向量取决于观察到的随机上下文。我们介绍了第一个算法,对CBCR的遗憾逐渐消失,而无需限制政策空间,而先前的工作仅限于有限的政策空间或表格表示。我们的解决方案基于对CBCR算法的几何解释,作为在所有随机策略所跨越的凸的预期奖励集上的优化算法。在弗兰克·沃尔夫(Frank-Wolfe)的基础上,我们在凸出的凸优化方面进行了分析,我们从CBCR的遗憾中得出了一种新颖的降低,到了标量奖励强盗问题的遗憾。我们说明了如何在非组合作用的情况下应用降低现成以获得具有线性和一般奖励函数的CBCR算法。在推荐方面,我们描述了一个具有排名和公平意识目标的CBCR的特殊情况,从而导致了第一种算法,并遗憾地保证了上下文组合匪徒,并具有公平的曝光。
We consider Contextual Bandits with Concave Rewards (CBCR), a multi-objective bandit problem where the desired trade-off between the rewards is defined by a known concave objective function, and the reward vector depends on an observed stochastic context. We present the first algorithm with provably vanishing regret for CBCR without restrictions on the policy space, whereas prior works were restricted to finite policy spaces or tabular representations. Our solution is based on a geometric interpretation of CBCR algorithms as optimization algorithms over the convex set of expected rewards spanned by all stochastic policies. Building on Frank-Wolfe analyses in constrained convex optimization, we derive a novel reduction from the CBCR regret to the regret of a scalar-reward bandit problem. We illustrate how to apply the reduction off-the-shelf to obtain algorithms for CBCR with both linear and general reward functions, in the case of non-combinatorial actions. Motivated by fairness in recommendation, we describe a special case of CBCR with rankings and fairness-aware objectives, leading to the first algorithm with regret guarantees for contextual combinatorial bandits with fairness of exposure.