Birkbeck Knowledge Lab Seminar: Principal Phrase Mining and Clustering Text

Venue: Birkbeck Main Building, Malet Street

Extracting frequent words from a collection of texts is done at large scale in many fields. Extracting phrases, on the other hand, is uncommon because of inherent complications, the most significant being double-counting, where words or phrases are counted when they appear inside longer phrases that are themselves also counted. Several papers on phrase mining describe solutions to this issue; however, they either require a list of so-called quality phrases to be available to the extraction process, or they require human interaction to identify those quality phrases during the process.

This talk will introduce a method that eliminates double-counting without the need to identify lists of quality phrases. In the context of a set of texts, we define a principal phrase as a phrase that does not cross punctuation marks, does not start with a stop word (with the exception of the stop words “not” and “no”), does not end with a stop word, is frequent within those texts without being double-counted, and is meaningful to the user. Our method identifies such principal phrases without human input and can extract them from any texts within a reasonable amount of time.

In addition, we will present a new distance measure called text distance, which is particularly suited to measuring the distance between texts represented in a term-document matrix, together with a new clustering technique that uses this measure to sort a collection of texts into a predefined number of clusters. Finally, we will show an application of principal phrase mining, text distance, and text clustering to a set of abstracts obtained from the PubMed database of medical research papers. Principal phrase mining, the text distance measure, and the accompanying text clustering technique have been implemented in the R package phm.
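The structural rules in the principal-phrase definition above (no crossing of punctuation, no leading stop word other than “not” or “no”, no trailing stop word) can be illustrated with a minimal Python sketch. This is not the method presented in the talk and not the API of the phm package: the stop-word list, the function names candidate_phrases and naive_counts, and the max_len limit are all assumptions made for illustration. The sketch deliberately counts candidates naively, so it exhibits the double-counting problem rather than solving it.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the list used by phm).
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "or",
              "is", "are", "was", "to", "with", "not", "no"}
# Stop words that the definition allows at the start of a phrase.
ALLOWED_LEADING = {"not", "no"}


def candidate_phrases(text, max_len=4):
    """Yield candidate phrases (tuples of words) that satisfy the structural rules."""
    # Splitting on punctuation guarantees that no phrase crosses a punctuation mark.
    for segment in re.split(r"[^\w\s'-]+", text.lower()):
        words = segment.split()
        for i, first in enumerate(words):
            # Rule: no leading stop word, except "not" and "no".
            if first in STOP_WORDS and first not in ALLOWED_LEADING:
                continue
            for j in range(i + 1, min(i + max_len, len(words)) + 1):
                # Rule: no trailing stop word.
                if words[j - 1] in STOP_WORDS:
                    continue
                yield tuple(words[i:j])


def naive_counts(texts, max_len=4):
    """Count every candidate phrase, including those nested in longer phrases."""
    counts = Counter()
    for text in texts:
        counts.update(candidate_phrases(text, max_len))
    return counts


texts = ["The blood pressure was not normal, and the blood pressure rose."]
counts = naive_counts(texts)
# "blood" occurs only inside "blood pressure", yet both are counted twice:
print(counts[("blood",)], counts[("blood", "pressure")])
```

In this toy input the word "blood" never occurs outside "blood pressure", yet the naive count credits both equally. That is the double-counting the abstract describes; removing it requires the method introduced in the talk, which this sketch does not attempt to reproduce.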

Contact name:

  • Dr. Ellie Small

    Dr. Ellie Small is an assistant professor at Drew University in Madison, NJ. She is an alumna of Birkbeck College, where she obtained a first-class bachelor’s degree in mathematics with statistics and computer science in 1992. She attended Rutgers University in New Brunswick, NJ, from 2014 to 2019, where she studied statistics with a specialization in data science. After obtaining her PhD in 2019, she joined Drew University, where she mostly teaches students in the Master’s in Data Analytics program. She is currently on her first sabbatical until January 2023.