Projects

Clinical Named Entity Recognition system (CliNER) is an open-source natural language processing system for named entity recognition in clinical text of electronic health records. It supports:

  1. a traditional machine learning architecture for named entity recognition with a CRF classifier
  2. a deep learning architecture using a recurrent neural network with LSTM for sequence labelling

Word embeddings are the most widely used kind of distributional meaning representations in both industrial and academic NLP systems, and they can make a dramatic difference in the performance of a system. However, the absence of a reliable intrinsic evaluation metric makes it hard to choose between dozens of models and their parameters. This work presents Linguistic Diagnostics (LD), a new methodology for evaluation, error analysis and development of word embedding models that is implemented in an open-source Python library. In a large-scale experiment with 14 datasets, LD successfully highlights the differences in the output of the GloVe and word2vec algorithms that correlate with their performance on different NLP tasks.

In the interest of designing better metrics for language generation tasks such as machine translation and paraphrasing, we have developed a unit testing framework. Given a dataset, a metric, and a set of corruptions, our code takes the corpus, generates corrupted sentences, and checks whether the metric is able to identify the true sentence based on a set of references.
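The idea of unit-testing a metric can be illustrated with a minimal Python sketch. It is not the framework's actual code: the function names (`unigram_f1`, `drop_last_word`, `reverse_words`, `run_metric_tests`), the toy dataset, and the pass criterion (the metric must score the original sentence strictly higher than its corrupted version) are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

# Hypothetical type aliases: a metric scores a candidate against references,
# a corruption maps a sentence to a damaged version of it.
Metric = Callable[[str, List[str]], float]
Corruption = Callable[[str], str]


def unigram_f1(candidate: str, references: List[str]) -> float:
    """Toy metric (an assumption): best unigram F1 over all references."""
    cand = Counter(candidate.lower().split())
    best = 0.0
    for ref in references:
        ref_counts = Counter(ref.lower().split())
        overlap = sum((cand & ref_counts).values())
        if overlap == 0:
            continue
        precision = overlap / sum(cand.values())
        recall = overlap / sum(ref_counts.values())
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


def drop_last_word(sentence: str) -> str:
    """Corruption: remove the final token."""
    return " ".join(sentence.split()[:-1])


def reverse_words(sentence: str) -> str:
    """Corruption: reverse the word order."""
    return " ".join(reversed(sentence.split()))


def run_metric_tests(
    dataset: List[Tuple[str, List[str]]],
    metric: Metric,
    corruptions: Dict[str, Corruption],
) -> Dict[str, float]:
    """Return, per corruption, the fraction of sentences for which the
    metric scores the original strictly higher than the corrupted one."""
    results = {}
    for name, corrupt in corruptions.items():
        passed = sum(
            metric(sentence, refs) > metric(corrupt(sentence), refs)
            for sentence, refs in dataset
        )
        results[name] = passed / len(dataset)
    return results


# Toy data (invented for this sketch): each item is (sentence, references).
DATASET = [
    ("the cat sat on the mat",
     ["a cat sat on the mat", "the cat was sitting on the mat"]),
    ("she reads a book every night",
     ["she reads a book each night", "every night she reads a book"]),
]
CORRUPTIONS = {"drop_last_word": drop_last_word, "reverse_words": reverse_words}

for name, rate in run_metric_tests(DATASET, unigram_f1, CORRUPTIONS).items():
    print(f"{name}: pass rate {rate:.2f}")
```

In this sketch the bag-of-words metric never detects the word-order corruption, because reversing a sentence leaves its unigram counts unchanged. Exposing this kind of blind spot is what such a test suite is meant to do.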

We are organizing a shared task on normalization of clinical concepts in provider notes to standardized vocabularies. This task was supposed to run at SemEval 2019 as Task 11, but has been delayed due to data access issues. It is now slated to run in the spring of 2019 as an i2b2 spin-off shared task.

Check out our task video here: https://www.youtube.com/watch?v=uYZTqYxo9AU

This project presents a novel recurrent neural network model to automate the analysis of students’ computational thinking in problem-solving dialogues for computational microgenetic learning analytics.

This work emphasizes interpretable topic features for post-ICU mortality prediction. It facilitates the analysis of mortality predictions and the investigation of the complex relationship between mortality and diseases.

The Knowledge Evolution project is an experiment in tracking and mapping the evolution of knowledge domains as well as the reputations and intellectual networks of the past. The project uses the history of the Library of Congress book acquisitions and classification, and the text of historical and contemporary editions of Encyclopedia Britannica and Wikipedia.

This task aimed to stimulate the development of novel methods of humor detection that would not treat humor as a binary variable and would take into account its subjectivity. The shared task was run with a new dataset based on humorous responses submitted to a Comedy Central TV show.

This workshop focused on advancing the state of the art in clinical NLP, with an invited talk by Dr. Timothy Baldwin and a best paper award sponsored by Philips North America. We accepted 14 high-quality submissions advancing such areas of clinical NLP as normalization of medical mentions, outcome prediction, clinical data de-identification and others.

This paper demonstrates the effectiveness of a Long Short-Term Memory language model in our initial efforts to generate unconstrained rap lyrics. The goal of this model is to generate lyrics that are similar in style to those of a given rapper, but not identical to existing lyrics: this is the task of ghostwriting. Unlike previous work, which defines explicit templates for lyric generation, our model defines its own rhyme scheme, line length, and verse length. Our experiments show that a Long Short-Term Memory language model produces better “ghostwritten” lyrics than a baseline model.

We apply deep learning to the problem of normalization of medical records, i.e. mapping of clinical terms in medical notes to standardized medical vocabularies.

As part of the SHARP project, Text Machine Lab developed Multi-Scrubber (an ensemble system for de-identification of protected health information), and also investigated knowledge-based vs. bottom-up methods for word sense disambiguation in clinical notes. Registration is required for accessing the project data.

We use neural network models to recover temporal relations among events and time expressions from text.

Text Machine Lab developed the 2012 Temporal Relations Challenge dataset as part of the Informatics for Integrating Biology & the Bedside (i2b2) project.

The PI of Text Machine Lab was involved in the development of CPA (Corpus Pattern Analysis), a technique for mapping meaning onto words in text that is based on the Theory of Norms and Exploitations. In CPA, meanings are associated with prototypical sentence contexts, which makes it a promising approach for automatic word sense disambiguation.

This tutorial provided a detailed introduction to Deep Semantic Annotation, and practical guidance on decomposing complex deep annotation tasks so that they can be annotated with Mechanical Turk.

This study presents a triad-based neural network system that generates affinity scores between entity mentions for coreference resolution.

Text Machine Lab’s PI took part in the development of TARSQI, a modular system for automatic temporal annotation that adds time expressions, events and temporal relations to news texts.

TwitterHawk is an open-source natural language processing system for Twitter sentiment analysis. This system was developed for SemEval-2015 Task 10: Sentiment Analysis in Twitter.

This tutorial covered the current proposals for representation and interpretation of semantic features in word-level embeddings, representation of morphological information, building sentence representations, and encoding abstract linguistic structures that are necessary for grammar but hard to capture distributionally. For each problem we discussed the existing evaluation datasets and ways to improve them.