Projects

TBD

This study presents a triad-based neural network system that generates affinity scores between entity mentions for coreference resolution.
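For illustration only, here is a minimal sketch of how a triad-based affinity scorer could look: a small feed-forward network maps a concatenated triad of mention embeddings to a score in [0, 1]. The class name, dimensions, and layer sizes are assumptions, not the system's actual architecture.

```python
# Minimal sketch (not the lab's code): a feed-forward scorer that maps a
# triad of mention embeddings to an affinity score in [0, 1].
# All dimensions and layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class TriadAffinityScorer(nn.Module):
    def __init__(self, mention_dim: int = 300, hidden_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 * mention_dim, hidden_dim),  # concatenated triad
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid(),                            # affinity in [0, 1]
        )

    def forward(self, anchor, candidate, context_mention):
        # Each argument: (batch, mention_dim) embedding of one mention.
        triad = torch.cat([anchor, candidate, context_mention], dim=-1)
        return self.net(triad).squeeze(-1)

scorer = TriadAffinityScorer()
a, b, c = (torch.randn(4, 300) for _ in range(3))
print(scorer(a, b, c))  # four affinity scores
```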

We use neural network models to recover temporal relations among events and time expressions from text.

RuSentiment is a new high-quality dataset for sentiment analysis in Russian, enriched with active learning. We also present a lightweight annotation scheme for social media that ensures high speed and consistency, and can be applied to other languages (Russian and English versions released).

We apply deep learning to the normalization of medical records, i.e., mapping clinical terms in medical notes to standardized medical vocabularies.
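As a toy illustration of the normalization task (not the lab's model), the sketch below maps free-text mentions to the nearest concept in a small made-up vocabulary; a TF-IDF character n-gram encoder stands in for the learned deep representation.

```python
# Illustrative sketch only: nearest-neighbour normalization of clinical
# mentions against a standardized vocabulary. The concept list is a toy
# example; a character n-gram encoder stands in for a learned model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

vocabulary = {
    "C0020538": "hypertension",
    "C0011849": "diabetes mellitus",
    "C0004238": "atrial fibrillation",
}
concept_names = list(vocabulary.values())

vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
concept_matrix = vectorizer.fit_transform(concept_names)

def normalize(mention: str) -> str:
    """Map a free-text mention to the closest concept identifier."""
    sims = cosine_similarity(vectorizer.transform([mention]), concept_matrix)
    return list(vocabulary)[sims.argmax()]

print(normalize("diabetes type 2"))  # expected to resolve to C0011849
print(normalize("afib"))             # expected to resolve to C0004238
```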

Word embeddings are the most widely used kind of distributional meaning representation in both industrial and academic NLP systems, and they can make a dramatic difference in system performance. However, the absence of a reliable intrinsic evaluation metric makes it hard to choose among dozens of models and their parameters. This work presents Linguistic Diagnostics (LD), a new methodology for evaluation, error analysis and development of word embedding models, implemented in an open-source Python library. In a large-scale experiment with 14 datasets, LD successfully highlights differences in the output of the GloVe and word2vec algorithms that correlate with their performance on different NLP tasks.
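The snippet below is a toy illustration of one kind of intrinsic diagnostic, not the LD library's API: it measures how much the top-k neighborhoods of two embedding models agree, with random matrices standing in for trained GloVe and word2vec vectors.

```python
# Toy sketch of one intrinsic diagnostic: how much the top-k neighbourhoods
# of two embedding models agree. Not the LD library's API; random matrices
# stand in for trained GloVe / word2vec vectors.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["cat", "dog", "run", "blue", "paris"]
model_a = rng.normal(size=(len(vocab), 50))   # stand-in for GloVe
model_b = rng.normal(size=(len(vocab), 50))   # stand-in for word2vec

def top_k_neighbors(matrix: np.ndarray, idx: int, k: int = 2) -> set:
    """Indices of the k nearest neighbours of word idx by cosine similarity."""
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    sims = unit @ unit[idx]
    sims[idx] = -np.inf                       # exclude the word itself
    return set(np.argsort(sims)[-k:])

for i, word in enumerate(vocab):
    agreement = len(top_k_neighbors(model_a, i) & top_k_neighbors(model_b, i)) / 2
    print(f"{word}: neighbourhood agreement {agreement:.0%}")
```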

This tutorial covered current proposals for representing and interpreting semantic features in word-level embeddings, representing morphological information, building sentence representations, and encoding abstract linguistic structures that are necessary for grammar but hard to capture distributionally. For each problem, we discussed the existing evaluation datasets and ways to improve them.

This project presents a novel recurrent neural network model to automate the analysis of students' computational thinking in problem-solving dialogues for computational microgenetic learning analytics.

This shared task aimed to stimulate the development of novel methods of humor detection that do not treat humor as a binary variable and that take its subjectivity into account. The task was run on a new dataset based on humorous responses submitted to a Comedy Central TV show.

Clinical Named Entity Recognition system (CliNER) is an open-source natural language processing system for named entity recognition in the clinical text of electronic health records. It supports (1) a traditional machine learning architecture for named entity recognition with a CRF classifier, and (2) a deep learning architecture using an LSTM recurrent neural network for sequence labeling.
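As a rough sketch of the second architecture (not CliNER's actual code), a bidirectional LSTM tagger that assigns a BIO concept label to each token might look like this; the vocabulary size, embedding size, and label set are toy assumptions.

```python
# Minimal sketch of an LSTM sequence labeller for clinical concepts
# (not CliNER's actual implementation); all sizes are toy values.
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size=5000, emb_dim=100, hidden=128, n_labels=7):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_labels)   # B-/I-/O tag scores

    def forward(self, token_ids):
        # token_ids: (batch, sequence_length) integer-encoded tokens
        states, _ = self.lstm(self.embed(token_ids))
        return self.out(states)                      # per-token label scores

tagger = BiLSTMTagger()
tokens = torch.randint(0, 5000, (2, 12))             # two sentences of 12 tokens
print(tagger(tokens).shape)                          # torch.Size([2, 12, 7])
```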

Text Machine Lab's PI took part in the development of TARSQI, a modular system for automatic temporal annotation that adds time expressions, events and temporal relations to news texts.

In the interest of designing better metrics for language generation tasks such as machine translation and paraphrasing, we have developed a unit testing framework. Given a dataset, a metric, and a set of corruptions, our code takes the corpus, generates corrupted sentences, and checks whether the metric can identify the true sentence based on a set of references.
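A minimal sketch of this testing loop is shown below (not the framework's actual code): a toy word-overlap metric and a word-swap corruption stand in for real metrics and corruptions, and the test reports how often the metric prefers the true sentence over its corrupted version.

```python
# Illustrative sketch of the corruption-based unit test described above;
# the overlap "metric" and the swap corruption are toy stand-ins.
import random

def overlap_metric(candidate: str, references: list[str]) -> float:
    """Toy metric: best unigram overlap with any reference."""
    cand = set(candidate.split())
    return max(len(cand & set(r.split())) / max(len(cand), 1) for r in references)

def swap_corruption(sentence: str, rng: random.Random) -> str:
    """Toy corruption: swap two random words."""
    words = sentence.split()
    i, j = rng.sample(range(len(words)), 2)
    words[i], words[j] = words[j], words[i]
    return " ".join(words)

def run_unit_test(corpus, metric, corruption, seed=0):
    """Fraction of sentences where the metric prefers the true sentence."""
    rng = random.Random(seed)
    passed = 0
    for sentence, references in corpus:
        corrupted = corruption(sentence, rng)
        if metric(sentence, references) > metric(corrupted, references):
            passed += 1
    return passed / len(corpus)

corpus = [("the cat sat on the mat", ["a cat sat on the mat"])]
print(run_unit_test(corpus, overlap_metric, swap_corruption))
```

In this toy example the bag-of-words metric is insensitive to word order, so it fails the swap corruption; exposing exactly that kind of weakness is the point of the framework.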

This work emphasizes interpretable topic features for post-ICU mortality prediction. It facilitates the analysis of mortality predictions and the investigation of the complex relationship between mortality and disease.

This workshop focused on advancing the state of the art in clinical NLP, with an invited talk by Dr. Timothy Baldwin and a best paper award sponsored by Philips North America. We accepted 14 high-quality submissions advancing areas of clinical NLP such as normalization of medical mentions, outcome prediction, clinical data de-identification, and others.

TwitterHawk is an open-source natural language processing system for Twitter sentiment analysis. The system was developed for SemEval-2015 Task 10: Sentiment Analysis in Twitter.

The Knowledge Evolution project is an experiment in tracking and mapping the evolution of knowledge domains as well as the reputations and intellectual networks of the past. The project uses the history of Library of Congress book acquisitions and classification, and the text of historical and contemporary editions of Encyclopedia Britannica and Wikipedia.

This paper demonstrates the effectiveness of a Long Short-Term Memory language model in our initial efforts to generate unconstrained rap lyrics. The goal of this model is to generate lyrics that are similar in style to that of a given rapper, but not identical to existing lyrics: this is the task of ghostwriting. Unlike previous work, which defines explicit templates for lyric generation, our model defines its own rhyme scheme, line length, and verse length. Our experiments show that a Long Short-Term Memory language model produces better “ghostwritten” lyrics than a baseline model.
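For reference, here is a minimal LSTM language model with sampling, in the spirit of the model described above but not the paper's implementation; the vocabulary and layer sizes are toy values.

```python
# Minimal sketch of an LSTM language model with token sampling
# (not the paper's implementation); all sizes are toy assumptions.
import torch
import torch.nn as nn

class LyricsLM(nn.Module):
    def __init__(self, vocab_size=8000, emb_dim=128, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, tokens, state=None):
        states, state = self.lstm(self.embed(tokens), state)
        return self.out(states), state

    @torch.no_grad()
    def sample(self, start_token: int, length: int = 16) -> list[int]:
        """Draw each next token from the model's softmax distribution."""
        tokens, state = [start_token], None
        for _ in range(length):
            logits, state = self(torch.tensor([[tokens[-1]]]), state)
            probs = torch.softmax(logits[0, -1], dim=-1)
            tokens.append(torch.multinomial(probs, 1).item())
        return tokens

model = LyricsLM()
print(model.sample(start_token=1))  # token ids; untrained, so random-looking
```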

This tutorial provided a detailed introduction to Deep Semantic Annotation, along with practical guidance on decomposing complex deep semantic tasks so that they can be annotated with Mechanical Turk.

Text Machine Lab developed the 2012 Temporal Relations Challenge dataset as part of the Informatics for Integrating Biology & the Bedside (i2b2) project.

As part of the SHARP project, Text Machine Lab developed Multi-Scrubber (an ensemble system for de-identification of protected health information) and investigated knowledge-based vs. bottom-up methods for word sense disambiguation in clinical notes. Registration is required to access the project data.

The PI of Text Machine Lab was involved in the development of CPA (Corpus Pattern Analysis), a technique for mapping meaning onto words in text that is based on the Theory of Norms and Exploitations. In CPA, meanings are associated with prototypical sentence contexts, which makes it a promising approach for automatic word sense disambiguation.