RuSentiment: An Enriched Sentiment Analysis Dataset for Social Media in Russian

Anna Rogers, Alexey Romanov, Anna Rumshisky, Svitlana Volkova, Mikhail Gronas, Alex Gribov

This paper presents RuSentiment, a new dataset for sentiment analysis of social media posts in Russian. RuSentiment is currently the largest dataset in its class for Russian: it contains 30,521 posts with 3 annotations per post and an inter-annotator agreement (Fleiss' kappa) of 0.58. To diversify the dataset, 6,749 of the posts were pre-selected with an active learning-style strategy. We report baseline classification results, and we also release the best-performing embeddings, trained on 3.2B tokens of Russian VKontakte posts.
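For readers unfamiliar with the agreement statistic quoted above, the following is a minimal sketch of how Fleiss' kappa is computed when every item receives the same number of annotations. The function name `fleiss_kappa`, the label strings, and the toy ratings are illustrative assumptions; they are not taken from the RuSentiment annotation data or tooling.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for items that each received the same number of labels.

    ratings: list of lists, one inner list of labels per item.
    categories: every label that annotators could assign.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number of annotations")

    counts = [Counter(item) for item in ratings]

    # Observed agreement: for each item, the share of annotator pairs that agree.
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement: based on how often each category is used overall.
    proportions = [
        sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories
    ]
    p_expected = sum(p ** 2 for p in proportions)

    return (p_observed - p_expected) / (1 - p_expected)


# Toy example (hypothetical labels): 4 posts, 3 annotations each.
toy_ratings = [
    ["positive", "positive", "positive"],
    ["negative", "negative", "negative"],
    ["positive", "negative", "positive"],
    ["negative", "negative", "positive"],
]
print(fleiss_kappa(toy_ratings, ["positive", "negative"]))
```

A value of 1 means perfect agreement and a value of 0 means agreement no better than chance, so the reported 0.58 lies between the two.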

RuSentiment was developed with a new set of comprehensive annotation guidelines that enable lightweight, quick and consistent annotation. These guidelines can be extended for development of comparable social media datasets in other languages, and we release them in both Russian and English versions.

Update: the dataset is no longer available at the above link due to a request from VKontakte. However, a ready-to-use sentiment analysis model for Russian, pre-trained on RuSentiment, is available.

Baseline classifier performance

Classifier	F1	Precision	Recall
Logistic Regression	0.6884	0.6953	0.6946
Linear SVC	0.6856	0.6946	0.6925
Gradient Boosting	0.6848	0.6963	0.6919
NN classifier	0.7164	0.7199	0.7215
(5-class classification, average over 10 runs)
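The table reports scores averaged over 10 runs, but this page does not say which multi-class averaging scheme was used. The sketch below assumes support-weighted averaging of per-class precision, recall and F1, followed by a plain mean over runs. The function names `weighted_prf` and `average_over_runs` and the toy labels are hypothetical, not the authors' evaluation code.

```python
from collections import Counter


def weighted_prf(y_true, y_pred):
    """Support-weighted precision, recall and F1 for a single run."""
    support = Counter(y_true)
    total = len(y_true)
    precision = recall = f1 = 0.0
    for label in sorted(support):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        p_c = tp / predicted if predicted else 0.0
        r_c = tp / support[label]
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = support[label] / total  # share of gold items with this label
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return precision, recall, f1


def average_over_runs(runs):
    """runs: list of (y_true, y_pred) pairs, one pair per training run."""
    scores = [weighted_prf(y_true, y_pred) for y_true, y_pred in runs]
    return tuple(sum(s[i] for s in scores) / len(scores) for i in range(3))


# Toy example with two hypothetical runs on the same four gold labels.
gold = ["positive", "positive", "negative", "negative"]
run_a = ["positive", "negative", "negative", "negative"]
run_b = ["positive", "positive", "negative", "positive"]
print(average_over_runs([(gold, run_a), (gold, run_b)]))
```

Under this kind of averaging, the aggregate F1 is a weighted mean of per-class F1 scores rather than the harmonic mean of the aggregate precision and recall, so it can be lower than both, as in the rows above.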

Publications

A. Rogers, A. Romanov, A. Rumshisky, S. Volkova, M. Gronas, A. Gribov. RuSentiment: An Enriched Sentiment Analysis Dataset for Social Media in Russian. Proceedings of COLING 2018.

@inproceedings{rogers2018rusentiment,
title={RuSentiment: An Enriched Sentiment Analysis Dataset for Social Media in Russian},
author={Rogers, Anna and Romanov, Alexey and Rumshisky, Anna and Volkova, Svitlana and Gronas, Mikhail and Gribov, Alex},
booktitle={Proceedings of the 27th International Conference on Computational Linguistics},
pages={755--763},
year={2018}
}