
A new frontier in data mining

A new service, developed by Birkbeck’s Department of Computer Science and Information Systems, can automatically annotate huge volumes of text data to better serve the needs of researchers, businesses and individuals.

The digital economy is growing at an exponential rate, making it increasingly difficult to understand or make use of large quantities of data. News reports, social media posts and digitised documents in archives all contain information about people, places, products and important events that is of interest to researchers, businesses and society. Extracting that information in a useful way from such a large collection of sources is difficult, however, both because it is hard to know where to look in any given dataset and because people may talk about the same subject in different ways.

Now, researchers from Birkbeck’s Department of Computer Science and Information Systems have simplified the process of understanding and using data from text with a new service that automatically annotates digital text documents. Called Samtla API, the service uses machine learning and text mining technologies to label words and phrases with named entities and sentiments.

The initiative has been spearheaded by Birkbeck academics Dr Mark Levene, Dr Martyn Harris and Dr Andrius Mudinus, in response to the growing need for easily understandable annotations on the large volumes of text data generated by media, businesses and individuals all over the world.

The researchers explained: “Labelled data is now a key commodity for many businesses, as well as in research, particularly those adopting machine learning approaches. Machine learning algorithms require a very large body of data to train for a particular application. The training data is often annotated by human annotators in-house or through crowdsourcing, but this approach is not sustainable and is costly.

“The Samtla API addresses many of these issues by providing a service to automatically label natural language text with annotations. These can be used to gain insights about the content of a collection of documents, with very little input from the user.”

Anyone interested in Samtla API can register their interest for early access to the service.
