RuSentiment: An Enriched Sentiment Analysis Dataset for Social Media in Russian
This paper presents RuSentiment, a new dataset for sentiment analysis of social media posts in Russian. RuSentiment is currently the largest in its class for Russian, with 30,521 posts annotated with Fleiss' kappa of 0.58 (3 annotations per post). To diversify the dataset, 6,749 posts were pre-selected with an active learning-style strategy. We report baseline classification results, and we also release the best-performing embeddings trained on 3.2B tokens of Russian VKontakte posts.
RuSentiment was developed with a new set of comprehensive annotation guidelines that enable lightweight, quick and consistent annotation. These guidelines can be extended for development of comparable social media datasets in other languages, and we release them in both Russian and English versions.
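For readers unfamiliar with the agreement figure quoted above, the following is a minimal sketch of how Fleiss' kappa can be computed for a fixed number of annotations per post. The function name `fleiss_kappa`, the toy labels, and the toy ratings are illustrative assumptions; this is not the authors' evaluation code, and it does not reproduce the reported 0.58.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each rated by the same number of annotators.

    `ratings` is a list of label lists, e.g. [["pos", "pos", "neg"], ...].
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number of annotations")

    counts = [Counter(item) for item in ratings]
    categories = {label for c in counts for label in c}

    # Observed agreement: share of agreeing annotator pairs, averaged over items.
    per_item = [
        (sum(v * v for v in c.values()) - n_raters) / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Expected agreement by chance, from the overall label proportions.
    proportions = [
        sum(c[label] for c in counts) / (n_items * n_raters) for label in categories
    ]
    p_expected = sum(p * p for p in proportions)

    if p_expected == 1.0:  # only one label was ever used; agreement is trivially perfect
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy data (hypothetical labels, 3 annotations per post as in the dataset).
toy = [
    ["pos", "pos", "pos"],
    ["neg", "neg", "neu"],
    ["neu", "neu", "neu"],
    ["pos", "neg", "neu"],
]
print(round(fleiss_kappa(toy), 4))  # 0.3617
```

Kappa corrects the raw agreement rate for the agreement expected by chance, which is why it is lower than the share of posts on which annotators agree.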
Update: the dataset is no longer available at the above link, at the request of VKontakte. However, a ready-to-use sentiment analysis model for Russian, pre-trained on RuSentiment, is available.
Baseline classifier performance
Classifier | F1 | Precision | Recall |
---|---|---|---|
Logistic Regression | 0.6884 | 0.6953 | 0.6946 |
Linear SVC | 0.6856 | 0.6946 | 0.6925 |
Gradient Boosting | 0.6848 | 0.6963 | 0.6919 |
NN classifier | 0.7164 | 0.7199 | 0.7215 |

(5-class classification, average over 10 runs)
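The table does not state how the per-class scores were combined into a single number. As a rough illustration only, the sketch below assumes support-weighted averaging over classes; the function name `weighted_prf` and the toy labels are hypothetical. It also shows why an averaged F1 can fall below both the averaged precision and the averaged recall, as it does in the first three rows.

```python
from collections import Counter


def weighted_prf(y_true, y_pred):
    """Support-weighted precision, recall and F1 over all classes.

    Assumption: weighting by class frequency in y_true; the source does not
    say which averaging scheme produced the reported scores.
    """
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    total = len(y_true)

    precision = recall = f1 = 0.0
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        actual = support[label]

        p_cls = tp / predicted if predicted else 0.0
        r_cls = tp / actual if actual else 0.0
        f_cls = 2 * p_cls * r_cls / (p_cls + r_cls) if (p_cls + r_cls) else 0.0

        weight = actual / total
        precision += weight * p_cls
        recall += weight * r_cls
        f1 += weight * f_cls
    return precision, recall, f1


# Toy predictions with hypothetical labels (not RuSentiment data).
y_true = ["pos", "pos", "neg", "neu", "neu", "neu"]
y_pred = ["pos", "neg", "neg", "neu", "neu", "pos"]
p, r, f = weighted_prf(y_true, y_pred)
print(f"precision={p:.4f} recall={r:.4f} f1={f:.4f}")
```

Because F1 is computed per class before averaging, the averaged F1 is not the harmonic mean of the averaged precision and recall. Averaging over 10 runs would then amount to taking the mean of each returned value across runs.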
Publications
@inproceedings{rogers2018rusentiment,
  title={RuSentiment: An Enriched Sentiment Analysis Dataset for Social Media in Russian},
  author={Rogers, Anna and Romanov, Alexey and Rumshisky, Anna and Volkova, Svitlana and Gronas, Mikhail and Gribov, Alex},
  booktitle={Proceedings of the 27th International Conference on Computational Linguistics},
  pages={755--763},
  year={2018}
}