Big Data ETL

Customer

Our client is a major international pharmaceutical company that conducts research and development across a wide range of human medical conditions, including mental illness, neurological disorders, anesthesia and analgesia, gastrointestinal disorders, fungal infections, allergies, and cancer.

Challenge

The amount of data that healthcare researchers need to do their work grows daily, and our customer's databases contain terabytes of data.

This makes it a challenge to complete operations on the data (loading new datasets, transforming data, and querying the resulting databases) in manageable time.

It is also critical to make sure that the data is of adequate quality, since the results of the medical research depend on it.

Another challenge is to lower the cost of the ETL process by automating as much of the work as possible.

Solution

The team handles database and application design, development, and quality assurance; performs data extraction, loading, and transformation; administers databases and optimizes queries; supports users; and tracks issues.

To operate on huge amounts of data, the team uses advanced SQL Server Enterprise Edition features such as page-level data compression, horizontal range partitioning, and partition switching.
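To illustrate the idea behind range partitioning and partition switching (this is a conceptual sketch, not the customer's schema or SQL Server itself): each row is routed to a partition by a sorted list of boundary values, and a "switch" swaps a fully loaded staging partition into the main table without copying rows, which is why SQL Server can do it as a near-instant metadata-only operation.

```python
import bisect

# Hypothetical RANGE RIGHT boundaries: partition 0 holds keys < 2020,
# partition 1 holds 2020 <= key < 2021, and so on.
BOUNDARIES = [2020, 2021, 2022]

def partition_of(key):
    """Return the index of the partition a key falls into."""
    return bisect.bisect_right(BOUNDARIES, key)

# A "table" here is just a list of partitions; each partition is a list of rows.
def empty_table():
    return [[] for _ in range(len(BOUNDARIES) + 1)]

def load_row(table, row, key):
    table[partition_of(key)].append(row)

def switch_partition(table, staging, index):
    """Mimic ALTER TABLE ... SWITCH: exchange one partition between two
    tables without touching the rows themselves."""
    table[index], staging[index] = staging[index], table[index]

table = empty_table()
load_row(table, {"patient": 1, "year": 2019}, 2019)

# Bulk-load new data into a staging table, then switch it in.
staging = empty_table()
load_row(staging, {"patient": 2, "year": 2020}, 2020)
switch_partition(table, staging, partition_of(2020))
```

In real SQL Server the same pattern lets a nightly load land in a staging table with matching partitioning, so the production table only sees a brief metadata change instead of a long insert.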

Some databases were migrated to the Analytic Platform System, a massively parallel database system, which significantly improved query performance.

Data quality is enforced by the Validation Analysis System, a utility that collects statistics on loaded data and detects anomalies in it, together with a web site that displays the results.
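The internals of the Validation Analysis System are not described in the case study, but a typical statistics-and-anomalies check can be sketched as follows; the column names, threshold, and z-score approach here are illustrative assumptions, not the actual tool.

```python
from statistics import mean, stdev

def column_stats(values):
    """Collect simple summary statistics for one numeric column."""
    return {"count": len(values), "mean": mean(values), "stdev": stdev(values)}

def find_anomalies(values, z_threshold=2.5):
    """Flag values whose z-score exceeds the threshold.

    A robust production tool would likely use median/MAD or per-load
    historical baselines; a plain z-score keeps the sketch short.
    """
    m, s = mean(values), stdev(values)
    if s == 0:
        return []
    return [v for v in values if abs(v - m) / s > z_threshold]

# Hypothetical patient-age column with one data-entry error.
ages = [34, 41, 29, 37, 45, 38, 33, 40, 36, 420]
anomalies = find_anomalies(ages)
```

Statistics like these can be collected after every load and compared against previous loads, so a sudden shift in row counts or value distributions is surfaced before researchers query bad data.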

The majority of ETL operations are automated by SQL Server Integration Services packages and the CDM Builder application.

The CDM Builder is a multithreaded, distributed application that transforms a number of popular databases into the Common Data Model (CDM). The purpose of the CDM is to standardize the format and content of observational data.
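The core of such a transformation is mapping heterogeneous source records onto one standard schema. The sketch below assumes an OMOP-style CDM person table; the source field names and the thread-pool structure are illustrative assumptions rather than the real CDM Builder code (the OMOP gender concept ids 8507/8532 are standard, though).

```python
from concurrent.futures import ThreadPoolExecutor

# OMOP standard concept ids for gender (8507 = male, 8532 = female).
GENDER_CONCEPTS = {"M": 8507, "F": 8532}

def to_cdm_person(src):
    """Transform one hypothetical vendor-format row into a CDM-style
    person record; unknown genders map to concept id 0."""
    return {
        "person_id": src["patient_id"],
        "gender_concept_id": GENDER_CONCEPTS.get(src["sex"], 0),
        "year_of_birth": int(src["birth_date"][:4]),
    }

def build_cdm(rows, workers=4):
    """Transform rows in parallel, echoing the multithreaded design;
    results come back in the original row order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(to_cdm_person, rows))

source = [
    {"patient_id": 1, "sex": "F", "birth_date": "1975-03-02"},
    {"patient_id": 2, "sex": "M", "birth_date": "1988-11-19"},
]
people = build_cdm(source)
```

Once every source database is expressed in the same model, the same analytic queries and validation checks can run against all of them unchanged.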

Results / Benefits

Data load times improved significantly, in some cases from days or weeks down to hours.

The quality of the data was improved with the help of the Validation Analysis System.

The data integration process is now largely automated, requiring little to no human attention.

Technologies
