Natural Language Processing (NLP) Tasks

Industries: Healthcare & Life Sciences
Expertise: Artificial Intelligence & Machine Learning
Technologies: Python, R

Client

Our client is a multinational Fortune 500 pharmaceutical company that aims to provide high-quality products and services to its customers while maintaining high standards of responsibility toward patients and everyone who uses its products.

Business Challenge

Internal audits generate large volumes of data, much of it free text entered by human users: findings, CAPAs (corrective and preventive actions), and quality investigations. This data is analyzed to discover existing and emerging issues. The client wants to use NLP (natural language processing) to help people discover such patterns.

Solution

We developed Topic Modeling and Text Search applications for discovering common topics and for running similarity searches across audit findings. Both are implemented in R, with Shiny used for the UI.

For topic modeling, we used LDA (Latent Dirichlet Allocation), which treats each document as a mixture of a relatively small number of topics (e.g. 20). In our case, the documents are the free-text descriptions of findings. LDA tries to identify commonalities across the set of documents and outputs the proportion of each topic within each document. A human then reviews these proportions in the Shiny app to check whether the discovered topics are cohesive and whether insights can be drawn from them.
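To make the idea of "proportions of each topic within each document" concrete, here is a minimal Python sketch of LDA fitted with a collapsed Gibbs sampler. It is a toy illustration only, not the R implementation used in the project: the function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, and the sample findings are all assumptions made for the example.

```python
import random

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative, not production code).

    docs: list of tokenized documents (lists of words).
    Returns one list of topic proportions per document.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]             # tokens per (doc, topic)
    topic_word = [[0] * n_words for _ in range(n_topics)]  # tokens per (topic, word)
    topic_total = [0] * n_topics                           # tokens per topic

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][wid] -= 1
                topic_total[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wid] += 1
                topic_total[k] += 1

    # Smoothed topic proportions for each document (each row sums to 1).
    return [
        [(count + alpha) / (len(doc) + n_topics * alpha) for count in counts]
        for counts, doc in zip(doc_topic, docs)
    ]

# Hypothetical, already tokenized findings (invented for this example).
docs = [
    ["training", "records", "missing", "operator"],
    ["operator", "training", "overdue"],
    ["calibration", "equipment", "overdue"],
    ["equipment", "calibration", "label", "missing"],
]
theta = lda_gibbs(docs, n_topics=2)
for proportions in theta:
    print([round(p, 2) for p in proportions])
```

Each printed row is one document's topic mixture. With a real corpus, a reviewer would look at these proportions alongside the top words per topic to judge whether a topic is cohesive.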

For similarity search, both user queries and documents are represented as numeric vectors (document embeddings). The documents whose vectors are most similar to the query vector are returned to the user as the search results. Shiny is again used for the UI.
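The retrieval step described above can be sketched in a few lines of Python. This is a hedged illustration: the case study does not say which embedding or similarity measure the application uses, so the sketch assumes cosine similarity and substitutes simple bag-of-words counts for real document embeddings. The names `embed`, `cosine`, and `search`, and the sample findings, are hypothetical.

```python
import math
from collections import Counter

def embed(text):
    """Stand-in embedding (assumption): bag-of-words term counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def search(query, documents, top_n=3):
    """Return the top_n (score, document) pairs most similar to the query."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(doc)), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]

# Hypothetical audit findings (invented for this example).
findings = [
    "training records missing for two operators",
    "equipment calibration overdue in packaging line",
    "operator training not documented",
]
results = search("missing training records", findings, top_n=2)
for score, doc in results:
    print(f"{score:.2f}  {doc}")
```

Swapping `embed` for a learned embedding model changes only how the vectors are produced; the ranking logic stays the same.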

Libraries used: tm, topicmodels, stm, quanteda, stopwords, textclean.

In parallel, we are developing more sophisticated approaches to pre-processing and document embeddings using Python and libraries from its NLP ecosystem. We have obtained document embeddings using BERT and ULMFiT. One use of the embeddings is document clustering, i.e. identifying groups of similar documents. Another is predicting some of the categorical variables that users typically enter manually, e.g. which compliance topic a finding belongs to. These predictions can be offered to the user as hints during data entry.
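As a rough illustration of the data-entry hints, the sketch below ranks candidate categories for a new finding by comparing its embedding with the average embedding of each category. The case study does not state which classifier is used, so this nearest-centroid approach is an assumption chosen for brevity. The three-dimensional vectors, the labels, and the function names `fit_centroids` and `suggest` are invented; real BERT or ULMFiT embeddings have hundreds of dimensions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def fit_centroids(embeddings, labels):
    """Average the embeddings of each label into one centroid per label."""
    grouped = {}
    for vec, label in zip(embeddings, labels):
        grouped.setdefault(label, []).append(vec)
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in grouped.items()
    }

def suggest(vec, centroids, top_n=2):
    """Rank labels by similarity of their centroid to the new embedding."""
    ranked = sorted(
        centroids, key=lambda label: cosine(vec, centroids[label]), reverse=True
    )
    return ranked[:top_n]

# Hypothetical embeddings of past findings and their manually entered topics.
embeddings = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.9, 0.2],
    [0.0, 0.8, 0.3],
]
labels = ["Training", "Training", "Calibration", "Calibration"]

centroids = fit_centroids(embeddings, labels)
hints = suggest([0.7, 0.3, 0.0], centroids)
print(hints)
```

Returning a ranked list rather than a single label fits the hinting use case: the user still makes the final choice, and the model only narrows the options.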

Libraries used: NLTK, Flair, spaCy.

Results & Benefits

Our solution helps non-technical users discover patterns in large amounts of textual data, which makes issues easier to identify. It reduces the human effort required to work with the findings database and helps surface unusual findings.
