Paper Title

Deep Learning based Topic Analysis on Financial Emerging Event Tweets

Authors

Aryaman, Shaan, Yen, Nguwi Yok

Abstract

Financial analyses of stock markets rely heavily on quantitative approaches that attempt to predict subsequent market movements from historical prices and other measurable metrics. Such analyses can miss unquantifiable aspects, such as sentiment and speculation, that also affect the market. Analyzing large amounts of qualitative text data to understand public opinion on social media platforms is one way to address this gap. This work carried out topic analysis on 28,264 financial tweets [1] via clustering to discover emerging events in the stock market. Three main topics were found to be discussed frequently within the period. First, the financial ratio EPS was a measure frequently discussed by investors. Second, short selling of shares was discussed heavily and was often mentioned together with Morgan Stanley. Third, the oil and energy sectors were often discussed together with policy. The tweets were clustered semantically in several steps. The word2vec algorithm was first used to obtain word embeddings that map words to vectors, and semantic word clusters were formed from these embeddings. Each tweet was then vectorized using the Term Frequency-Inverse Document Frequency (TF-IDF) values of its words and the clusters those words belonged to. The tweet vectors were converted to compressed representations by training a deep autoencoder, and K-means clusters were formed on the compressed representations. In contrast to the usual Vector Space Model, this method reduces dimensionality and produces dense vectors. Topic modelling with Latent Dirichlet Allocation (LDA) and the most frequent words were used to analyze the clusters and reveal emerging events.
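The tweet vectorization step is the least explicit part of the pipeline, so a minimal sketch may help. It assumes that a tweet's vector has one dimension per semantic word cluster and that each dimension holds the summed TF-IDF weight of the tweet's words in that cluster. The abstract does not state how the two are combined, so this aggregation rule is an assumption, as are the names `tfidf_weights`, `cluster_vectors`, and `word_to_cluster`. In the paper the word clusters come from word2vec embeddings; here a small hand-made mapping stands in for them.

```python
import math
from collections import Counter


def tfidf_weights(tokenized_tweets):
    """Return one {word: tf-idf} dict per tweet, using tf * log(N / df)."""
    n_docs = len(tokenized_tweets)
    doc_freq = Counter()
    for tokens in tokenized_tweets:
        doc_freq.update(set(tokens))  # count each word once per tweet
    weights = []
    for tokens in tokenized_tweets:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


def cluster_vectors(tokenized_tweets, word_to_cluster, n_clusters):
    """Build one vector per tweet with one dimension per word cluster.

    Assumption: each dimension is the sum of the TF-IDF weights of the
    tweet's words that fall in that cluster. Words without a cluster
    assignment are ignored.
    """
    vectors = []
    for weight_map in tfidf_weights(tokenized_tweets):
        vec = [0.0] * n_clusters
        for word, weight in weight_map.items():
            cluster = word_to_cluster.get(word)
            if cluster is not None:
                vec[cluster] += weight
        vectors.append(vec)
    return vectors


# Hypothetical toy data; real cluster ids would come from clustering
# word2vec embeddings rather than from a hand-written mapping.
tweets = [
    ["eps", "beat", "estimates"],
    ["short", "selling", "morgan", "stanley"],
    ["oil", "energy", "policy"],
]
word_to_cluster = {
    "eps": 0, "estimates": 0,
    "short": 1, "selling": 1,
    "oil": 2, "energy": 2, "policy": 2,
}
vectors = cluster_vectors(tweets, word_to_cluster, n_clusters=3)
```

The vector length equals the number of word clusters rather than the vocabulary size, which is where the dimensionality reduction relative to a plain Vector Space Model comes from. In the described pipeline these vectors would then be compressed further by the autoencoder before K-means is applied.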
