Analysis of Astronomy Abstracts Using Natural Language Processing (NLP)
Nikolai Shaposhnikov, Ph.D
Back in my days as an X-ray astronomer, my notion of data was based on some sort of quantitative structure such as a vector, an array or a tensor. I was studying spectra, lightcurves and images, and had only a vague idea that computational methods could be applied to texts. As I found out when I switched to data science, data scientists more often than not deal with unstructured data. Natural Language Processing (NLP) is the field of data science that provides algorithms for digitizing and analyzing text data.
In this post I bring together two of my passions: astronomy and data science. Applying NLP methods to the abstracts of astronomy papers published in various scientific journals, I show how this data can be digitized and used for predictive analytics.
Writing papers is a necessary part of any researcher's life. When a scientist makes a discovery or simply obtains an interesting result, the first order of business is to publish it. By publishing the results of your work, you establish ownership of them, inform the scientific community of the new findings and receive feedback. This is very important in defining the direction of new research projects. Without proper publication, new findings have limited value when seeking funding or as arguments in a scientific dispute.
The most challenging part of publishing scientific papers in a refereed journal is the review process, when the journal editor sends the paper to an expert scientist in your research field for evaluation. Based on the reviewer's recommendation, the editor may accept the paper, reject it, or request revisions before publishing the work.
“The Astrophysical Journal” and “Astronomy & Astrophysics” are refereed journals. The most common examples of non-refereed publications are conference proceedings, which scientists typically use to report intermediate results of ongoing research. These carry much less impact. Refereed journals also differ in importance, which is commonly measured by the journal impact factor (https://en.wikipedia.org/wiki/Impact_factor).
Publication in a refereed journal is a stamp of quality from the community of your peers. However, it is often a tedious and lengthy process. Is there some way to make it easier? A seasoned scientist usually knows in advance, based on previous experience, which journal to submit a new article to. For a researcher with less experience, however, choosing a journal can be rather challenging. As part of my experimentation with NLP analysis, I explore the possibility of predicting which journal a paper was published in, based on the text of its abstract. Such a model can serve as a way to recommend the most appropriate journal for submission. Submitting a paper to the right journal can save a lot of time and energy by making the review process easier and faster.
Data: Astronomy Abstracts
An abstract is a short paragraph summarizing the main results of the presented work. It is read much more often than any other part of the paper. It is the most important and informative part of a publication. For this project, I extracted data from the NASA Abstract Service, the primary resource for astronomers to search publications. I downloaded all abstracts for papers published in the refereed astronomical journals for the period from January 2017 to December 2018. There is a total of 45,800 abstracts published in 359 different journals. For the purpose of my analysis I retain only journals that have 1000 or more publications. This leaves us with 31,645 abstracts in 15 journals. The figure below gives the number of abstracts for each journal used in model training.
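The filtering step described above can be sketched in a few lines of Python. This is only an illustration: the function name `keep_frequent_journals`, the `(journal, abstract)` record layout and the toy records are assumptions, not the code or data used for this post.

```python
from collections import Counter


def keep_frequent_journals(records, min_count=1000):
    """Keep only records whose journal has at least `min_count` abstracts.

    `records` is a list of (journal_name, abstract_text) pairs; this layout
    is an assumption made for the example.
    """
    # Count how many abstracts each journal contributes.
    counts = Counter(journal for journal, _ in records)
    frequent = {journal for journal, n in counts.items() if n >= min_count}
    return [(journal, text) for journal, text in records if journal in frequent]


# Invented toy records, filtered with a threshold of 2 instead of 1000.
toy_records = [
    ("Icarus", "Impact craters on icy moons."),
    ("Icarus", "Seasonal changes in the atmosphere of Titan."),
    ("Physical Review D", "Constraints on dark matter models."),
]
filtered = keep_frequent_journals(toy_records, min_count=2)
print(len(filtered), "records kept")
```

With the real data, the same call with `min_count=1000` would correspond to the cut that reduces the 359 journals to 15.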
Data Processing and Modeling
Our goal is to create a classification model that predicts the journal in which a paper was published from the text of its abstract. Generally, a classifier takes as input a set of fixed-length numerical feature vectors and a set of corresponding labels. However, an abstract is a sequence of words of varying length. To transform a text object into a numerical vector, I create a pipeline consisting of several standard NLP processing steps:
* Turn the journal names to numerical labels
* Tokenize the text, i.e. break the text into sentences and words
* Put the words into lowercase
* Apply part of speech (POS) tagging
* Lemmatize, i.e. reduce the words to their base form by removing plural endings, etc.
* Apply the [N-gram](https://en.wikipedia.org/wiki/N-gram) model, keeping 1-gram and 2-gram sequences
* Apply term frequency–inverse document frequency ([TF-IDF](https://en.wikipedia.org/wiki/Tf–idf)) statistics to vectorize the data
* Randomly split the data into the train and test set using 75%/25% train-test ratio
* Finally, feed the vectors and corresponding journal labels from the training set to the stochastic gradient descent (SGD) classifier
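To make the N-gram and TF-IDF steps more concrete, here is a hand-rolled sketch in plain Python. It uses the textbook weight tf × log(N / df); libraries typically apply smoothed and normalized variants, so the numbers will not match a production vectorizer. The helper names and the three tiny tokenized "documents" are invented for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n_max=2):
    """Return all 1-grams up to n_max-grams as space-joined strings."""
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def tfidf(documents):
    """Compute textbook TF-IDF weights, tf * log(N / df), per document."""
    n_docs = len(documents)
    doc_terms = [Counter(ngrams(tokens)) for tokens in documents]
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for terms in doc_terms for term in terms)
    weights = []
    for terms in doc_terms:
        total = sum(terms.values())
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in terms.items()
        })
    return weights


# Invented, already tokenized and lowercased toy documents.
docs = [
    ["black", "hole", "spectrum"],
    ["black", "hole", "merger"],
    ["planet", "orbit"],
]
weights = tfidf(docs)
for term, weight in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:15s} {weight:.3f}")
```

In the first toy document, "spectrum" receives a higher weight than "black" because it appears in only one document, which is exactly the property that makes TF-IDF useful for telling texts apart.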
I implement these steps as a processing pipeline using the Python packages [NLTK](http://www.nltk.org) and [scikit-learn](http://scikit-learn.org).
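A minimal sketch of what such a pipeline might look like in scikit-learn is shown below. It is not the original code: the eight abstracts are invented stand-ins, the hyperparameters are library defaults, and the NLTK steps (POS tagging and lemmatization) are left out for brevity. One way to add them would be to pass a custom callable through the `tokenizer` parameter of `TfidfVectorizer`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder

# Invented toy data standing in for the real abstracts.
journals = ["The Astrophysical Journal"] * 4 + ["Icarus"] * 4
abstracts = [
    "We analyze the X-ray spectrum of an accreting black hole.",
    "The light curve of the neutron star shows quasi-periodic oscillations.",
    "We measure the mass of a galaxy cluster with weak lensing.",
    "Spectra of distant quasars reveal absorption by intergalactic gas.",
    "Impact craters on the icy moon record its resurfacing history.",
    "Seasonal dust storms change the atmosphere of Mars.",
    "Isotope ratios in meteorites constrain the age of the solar system.",
    "Lava flows on Venus suggest recent volcanic activity.",
]

# Turn the journal names into numerical labels.
encoder = LabelEncoder()
labels = encoder.fit_transform(journals)

# Random 75%/25% split; stratifying keeps both journals in each subset.
X_train, X_test, y_train, y_test = train_test_split(
    abstracts, labels, test_size=0.25, random_state=0, stratify=labels
)

model = Pipeline([
    # Tokenize, lowercase, build 1-grams and 2-grams, apply TF-IDF.
    ("tfidf", TfidfVectorizer(lowercase=True, ngram_range=(1, 2))),
    # Linear classifier trained with stochastic gradient descent.
    ("sgd", SGDClassifier(random_state=0)),
])
model.fit(X_train, y_train)

predicted = model.predict(X_test)
print(encoder.inverse_transform(predicted))
```

Wrapping the vectorizer and the classifier in a single `Pipeline` means the TF-IDF vocabulary is learned from the training set only, so no information from the test abstracts leaks into the model.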
After the model has been fit to the training data, I apply it to the abstracts in the test set. The model performance is summarized in the chart below using standard metrics, i.e. precision, recall and F1-score (https://en.wikipedia.org/wiki/Precision_and_recall):
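For readers less familiar with these metrics, the sketch below computes them for a single journal directly from their definitions. The function name and the five true/predicted labels are made up for the example and are not results from the model.

```python
def per_class_scores(y_true, y_pred, label):
    """Return (precision, recall, F1) for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # true positives
    fp = sum(t != label and p == label for t, p in pairs)  # false positives
    fn = sum(t == label and p != label for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


APJ = "The Astrophysical Journal"
ICARUS = "Icarus"

# Invented labels: three ApJ abstracts and two Icarus abstracts.
y_true = [APJ, APJ, APJ, ICARUS, ICARUS]
y_pred = [APJ, APJ, ICARUS, ICARUS, APJ]

apj_scores = per_class_scores(y_true, y_pred, APJ)
icarus_scores = per_class_scores(y_true, y_pred, ICARUS)
print("ApJ:    precision=%.2f recall=%.2f f1=%.2f" % apj_scores)
print("Icarus: precision=%.2f recall=%.2f f1=%.2f" % icarus_scores)
```

Precision answers "of the abstracts assigned to this journal, how many really belong to it?", recall answers "of the abstracts that belong to this journal, how many did the model find?", and the F1-score is their harmonic mean.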
The overall average precision is 67%. In other words, the model correctly identifies the journal for about two out of three abstracts. This is a good result, though it may not appear all that impressive on an absolute scale: the model is tasked with predicting one of fifteen categories based on very complex and noisy data. There is no obvious reason for the abstracts in, for example, “The Astrophysical Journal” to differ much from those published in “Astronomy & Astrophysics”. However, the model has found a way to distinguish journals by picking up on such subtle aspects as editing styles, possible differences in terminology, regional differences, etc.
It is instructive to look at the confusion matrix:
There are at least two major groups of journals: an astrophysical group (“Astronomy & Astrophysics”, “The Astrophysical Journal”, “Monthly Notices of the Royal Astronomical Society” and their Letters counterparts) and a planetary and Earth sciences group (“Earth and Planetary Science Letters”, “Geochimica et Cosmochimica Acta”, “Geophysical Research Letters”, “Icarus”). In addition, “Classical and Quantum Gravity” and “Physical Review D” may form a small group related to general physics. With high probability, the journal predicted for an abstract falls into the same group as the journal in which the paper was actually published. This shows that the model is able to distinguish between texts that belong to different subdivisions of astronomy.
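One way to quantify this group structure is to ask how often a misclassified abstract at least lands in the right group. The sketch below does this for a handful of invented (true, predicted) pairs; the group assignment follows the text above, while the function name, the short group labels and the pairs themselves are assumptions made for illustration.

```python
# Journal-to-group mapping taken from the groups described in the text.
GROUPS = {
    "Astronomy & Astrophysics": "astrophysics",
    "The Astrophysical Journal": "astrophysics",
    "Monthly Notices of the Royal Astronomical Society": "astrophysics",
    "Earth and Planetary Science Letters": "planetary",
    "Geochimica et Cosmochimica Acta": "planetary",
    "Geophysical Research Letters": "planetary",
    "Icarus": "planetary",
    "Classical and Quantum Gravity": "physics",
    "Physical Review D": "physics",
}


def same_group_rate(pairs):
    """Fraction of misclassified abstracts predicted within the true group."""
    errors = [(true, pred) for true, pred in pairs if true != pred]
    if not errors:
        return 0.0
    hits = sum(GROUPS[true] == GROUPS[pred] for true, pred in errors)
    return hits / len(errors)


# Invented (true journal, predicted journal) pairs.
toy_pairs = [
    ("The Astrophysical Journal", "The Astrophysical Journal"),  # correct
    ("The Astrophysical Journal", "Astronomy & Astrophysics"),   # same group
    ("Icarus", "Geophysical Research Letters"),                  # same group
    ("Icarus", "Physical Review D"),                             # other group
    ("Physical Review D", "Classical and Quantum Gravity"),      # same group
]
rate = same_group_rate(toy_pairs)
print(f"{rate:.0%} of the errors stay within the true group")
```

Applied to the real test set, a value close to 100% would support the reading that most confusion happens between journals of the same subfield.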
It is also interesting to look at the most and the least important words in the corpus of astronomy abstracts, shown in the chart below.
The green and red bars show the feature “weights” of the most and the least important features in the model prediction, respectively. We can see that terms with a more specific meaning, like “satellite”, “navigation”, “orbit”, and abbreviations like GPS or IRI (International Reference Ionosphere), are among the most important model features, while general terms like “source”, “light”, “instrument”, “scale” are among the least important.
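The post does not spell out how the feature "weight" is defined, so the following is only one plausible sketch: for a linear classifier, rank each term by its largest absolute coefficient across all journals. The coefficient values below are invented. In scikit-learn, a fitted linear model exposes such a matrix as `coef_`, and the vectorizer returns the matching term names from `get_feature_names_out()`.

```python
def rank_features(feature_names, coef_rows):
    """Sort features by their largest absolute coefficient across classes.

    `coef_rows` holds one row of coefficients per class (journal) and one
    column per feature, mirroring the layout of a linear model's weights.
    """
    importance = {
        name: max(abs(row[j]) for row in coef_rows)
        for j, name in enumerate(feature_names)
    }
    return sorted(importance, key=importance.get, reverse=True)


# Invented coefficients for four terms and two journals.
feature_names = ["gps", "orbit", "source", "scale"]
coef_rows = [
    [2.1, -0.4, 0.05, 0.01],
    [-1.8, 1.5, -0.02, 0.03],
]
ranked = rank_features(feature_names, coef_rows)
print("most important: ", ranked[:2])
print("least important:", ranked[-2:])
```

Under this definition, specific terms that strongly push an abstract toward or away from a journal rank high, while generic terms with coefficients near zero for every journal rank low.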
In this post we applied Natural Language Processing to a set of scientific abstracts. We have created a model that is able to predict and identify the journals which have published specific papers. This shows the power of NLP as a tool to automate analysis of professional texts, which otherwise require deep domain expertise in fundamental sciences, health and medicine, law, security and other areas. At Epigen Technology we put NLP to heavy use to help our clients utilize text analytics in their information systems.