Paper Title
Value Variance Minimization for Learning Approximate Equilibrium in Aggregation Systems
Paper Authors
Paper Abstract
Aggregation systems have been extremely successful at matching resources (e.g., taxis, food, bikes, shopping items) to customer demand. In aggregation systems, a central entity (e.g., Uber, Food Panda, Ofo) aggregates supply (e.g., drivers, delivery personnel) and matches demand to supply on a continuous basis (sequential decisions). Because the central entity's objective is to maximize its own profit, individual suppliers can get sacrificed, creating an incentive for individuals to leave the system. In this paper, we consider the problem of learning approximate equilibrium solutions (win-win solutions) in aggregation systems, so that individuals have an incentive to remain in the aggregation system. Unfortunately, such systems have thousands of agents, must account for demand uncertainty, and the underlying problem is a (Partially Observable) Stochastic Game. Given the significant complexity of learning or planning in a stochastic game, we make three key contributions: (a) To exploit the infinitesimally small contribution of each agent and the anonymity of interactions (rewards and transitions depend on agent counts), we represent this as a Multi-Agent Reinforcement Learning (MARL) problem that builds on insights from the non-atomic congestion games model; (b) We provide a novel variance reduction mechanism for moving the joint solution towards a Nash Equilibrium, which exploits the infinitesimally small contribution of each agent; and finally (c) We provide detailed results on three different domains to demonstrate the utility of our approach in comparison to state-of-the-art methods.
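To make contribution (b) concrete, the following is a minimal sketch of the value-variance idea suggested by the title and abstract: score a joint solution by the mean per-agent value while penalizing the variance of values across agents, so that solutions in which some agents are sacrificed are disfavored. The function name, the penalty weight `lam`, and the exact form of the objective are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch: value-variance-penalized objective for homogeneous agents.
# The penalty form and the name `lam` are assumptions, not the paper's method.
import numpy as np


def variance_penalized_objective(agent_values: np.ndarray, lam: float = 0.1) -> float:
    """Score a joint solution by mean per-agent value minus a variance penalty.

    agent_values: estimated long-term value of each individual agent under
                  the current joint policy (shape: [num_agents]).
    lam:          weight on the variance term; larger values push the learner
                  harder towards solutions where no agent is sacrificed.
    """
    mean_value = agent_values.mean()          # social-welfare proxy
    value_variance = agent_values.var()       # dispersion across agents
    return mean_value - lam * value_variance  # reward welfare, penalize inequality


if __name__ == "__main__":
    # Two candidate joint solutions for 4 agents: one unequal, one even.
    # With a large enough lam, the penalized objective prefers the even one.
    unequal = np.array([10.0, 10.0, 1.0, 1.0])
    even = np.array([6.0, 6.0, 6.0, 6.0])
    print(variance_penalized_objective(unequal, lam=0.1))  # ~3.48
    print(variance_penalized_objective(even, lam=0.1))     # 6.0
```

In a MARL training loop, such a term would be one way to bias updates towards win-win (approximate equilibrium) joint solutions; the paper's actual mechanism may differ.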