RuSentiment: An Enriched Sentiment Analysis Dataset for Social Media in Russian

Anna Rogers Alexey Romanov Anna Rumshisky Svitlana Volkova Mikhail Gronas Alex Gribov

We present RuSentiment, a new dataset for sentiment analysis of social media posts in Russian. RuSentiment is currently the largest dataset in its class for Russian: 30,521 posts with three annotations per post and inter-annotator agreement of 0.58 (Fleiss’ kappa). To diversify the dataset, 6,749 of the posts were pre-selected with an active learning-style strategy. We report baseline classification results and release the best-performing embeddings, trained on 3.2B tokens of Russian VKontakte posts.
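For readers unfamiliar with the agreement statistic quoted above, the following is a minimal sketch of how Fleiss' kappa is computed for a fixed number of raters per item. The function name `fleiss_kappa`, the toy labels, and the toy ratings are illustrative assumptions and are not taken from the RuSentiment data or tooling.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for items that each received the same number of ratings.

    ratings: list of lists; each inner list holds the labels that the
             raters assigned to one item (hypothetical toy format).
    categories: list of all possible labels.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])

    # How many raters chose each category, per item.
    counts = [Counter(item) for item in ratings]

    # Observed agreement for each item, then averaged over items.
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Agreement expected by chance, from overall category proportions.
    proportions = [
        sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories
    ]
    p_expected = sum(p * p for p in proportions)

    return (p_observed - p_expected) / (1 - p_expected)


# Toy example: 4 posts, 3 annotations each (labels are placeholders).
toy_labels = ["pos", "neg", "neu"]
toy_ratings = [
    ["pos", "pos", "pos"],
    ["neg", "neg", "neg"],
    ["pos", "neg", "neu"],
    ["neu", "neu", "pos"],
]
print(round(fleiss_kappa(toy_ratings, toy_labels), 4))
```

A value of 1 indicates perfect agreement and a value near 0 indicates agreement no better than chance.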

RuSentiment was developed with a new set of comprehensive annotation guidelines that enable lightweight, quick and consistent annotation. These guidelines can be extended for development of comparable social media datasets in other languages, and we release them in both Russian and English versions.

RuSentiment is freely available for non-commercial use.

Baseline classifier performance

Classifier           F1      Precision  Recall
Logistic Regression  0.6884  0.6953     0.6946
Linear SVC           0.6856  0.6946     0.6925
Gradient Boosting    0.6848  0.6963     0.6919
NN classifier        0.7164  0.7199     0.7215

(5-class classification, average over 10 runs)
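In every row of the table the F1 score is slightly lower than both precision and recall. That is possible when per-class scores are averaged across classes, because the average of per-class F1 values is not the harmonic mean of the averaged precision and recall. This page does not state which averaging scheme was used, so the sketch below assumes support-weighted averaging purely for illustration. The function name `weighted_prf` and the toy labels are hypothetical.

```python
def weighted_prf(y_true, y_pred):
    """Support-weighted precision, recall and F1 over all classes in y_true.

    Assumption: each class is weighted by its share of the true labels.
    """
    n = len(y_true)
    precision = recall = f1 = 0.0
    for label in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)

        # Per-class scores, with 0.0 when a denominator is empty.
        p_c = tp / (tp + fp) if tp + fp else 0.0
        r_c = tp / (tp + fn) if tp + fn else 0.0
        f_c = 2 * p_c * r_c / (p_c + r_c) if p_c + r_c else 0.0

        weight = (tp + fn) / n  # share of true instances of this class
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return precision, recall, f1


# Toy two-class example (placeholder labels, not RuSentiment classes).
y_true = ["a", "a", "a", "a", "b", "b"]
y_pred = ["a", "b", "b", "b", "b", "b"]
p, r, f = weighted_prf(y_true, y_pred)
print(f"precision={p:.4f} recall={r:.4f} f1={f:.4f}")
```

In this toy case the averaged F1 is below both the averaged precision and the averaged recall, the same pattern seen in the table.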

Publications

A. Rogers, A. Romanov, A. Rumshisky, S. Volkova, M. Gronas, A. Gribov. RuSentiment: An Enriched Sentiment Analysis Dataset for Social Media in Russian. Proceedings of COLING, 2018.

@inproceedings{rogers2018rusentiment,
  title={RuSentiment: An Enriched Sentiment Analysis Dataset for Social Media in Russian},
  author={Rogers, Anna and Romanov, Alexey and Rumshisky, Anna and Volkova, Svitlana and Gronas, Mikhail and Gribov, Alex},
  booktitle={Proceedings of the 27th International Conference on Computational Linguistics},
  pages={755--763},
  year={2018}
}