Exact marginal inference in Latent Dirichlet Allocation

Paper title

Exact marginal inference in Latent Dirichlet Allocation

Authors

Maennel, Hartmut

Abstract

Assume we have potential "causes" $z \in Z$, which produce "events" $w$ with known probabilities $\beta(w|z)$. Having observed $w_1, w_2, \ldots, w_n$, what can we say about the distribution of the causes? A Bayesian estimate assumes a prior on distributions on $Z$ (we assume a Dirichlet prior) and calculates a posterior. An average over that posterior then gives a distribution on $Z$, which estimates how much each cause $z$ contributed to our observations. This is the setting of Latent Dirichlet Allocation, which can be applied, for example, to topics "producing" words in a document. In that setting the number of observed words is usually large, but the number of potential topics is small. Here we are interested in applications with many potential "causes" (e.g. locations on the globe), but only a few observations. We show that, for a given upper bound on $n$, the exact Bayesian estimate can be computed in linear time (and constant space) in $|Z|$ with a surprisingly simple formula. We generalize this algorithm to the case of sparse probabilities $\beta(w|z)$, in which we only need to assume that the treewidth of an "interaction graph" on the observations is limited. On the other hand, we also show that without such a limitation the problem is NP-hard.
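
To make the estimated quantity concrete, the following is a minimal sketch of the posterior mean computed by naive enumeration over all assignments of causes to observations. It is not the linear-time formula announced in the abstract: it costs $|Z|^n$ steps and is only usable for tiny inputs. It assumes the usual single-document LDA generative model (a distribution $\theta$ on $Z$ drawn from a Dirichlet prior with parameters `alpha`, each cause $z_i$ drawn from $\theta$, each event $w_i$ drawn from $\beta(\cdot|z_i)$). The names `rising`, `posterior_mean_bruteforce`, `alpha` and `likelihoods` are hypothetical and chosen for illustration only.

```python
from itertools import product
from math import prod


def rising(a, k):
    """Rising factorial a (a+1) ... (a+k-1), i.e. Gamma(a+k) / Gamma(a)."""
    return prod(a + j for j in range(k))


def posterior_mean_bruteforce(alpha, likelihoods):
    """Posterior mean of theta given the observations, by full enumeration.

    alpha[z]          -- Dirichlet prior parameter of cause z (assumed > 0)
    likelihoods[i][z] -- beta(w_i | z) for the i-th observed event
    """
    num_causes = len(alpha)
    n = len(likelihoods)
    total_alpha = sum(alpha)
    weighted = [0.0] * num_causes
    evidence = 0.0
    for assignment in product(range(num_causes), repeat=n):
        counts = [0] * num_causes
        for z in assignment:
            counts[z] += 1
        # Unnormalised p(assignment, observations): Dirichlet-multinomial
        # prior on the assignment times the known event probabilities.
        # The constant Gamma(A) / Gamma(A + n) cancels in the final ratio.
        weight = prod(rising(alpha[z], counts[z]) for z in range(num_causes))
        weight *= prod(likelihoods[i][z] for i, z in enumerate(assignment))
        evidence += weight
        # Given the assignment, the posterior on theta is Dirichlet(alpha + counts).
        for z in range(num_causes):
            weighted[z] += weight * (alpha[z] + counts[z]) / (total_alpha + n)
    return [v / evidence for v in weighted]


# One observation that only cause 0 can produce, under a uniform prior.
print(posterior_mean_bruteforce([1, 1, 1], [[1.0, 0.0, 0.0]]))  # [0.5, 0.25, 0.25]
```

When every cause explains the observations equally well, the estimate falls back to the prior mean `alpha[z] / sum(alpha)`, which is a convenient sanity check for any faster implementation.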
