Big Data ETL

Healthcare & Life Sciences
Data Engineering & Business Intelligence
SQL Server, .NET

Our client is a major international pharmaceutical company conducting research and development related to a wide range of human medical disorders, including mental illness, neurological disorders, anaesthesia and analgesia, gastrointestinal disorders, fungal infection, allergies, and cancer.

Business Challenge

The amount of data required for healthcare researchers to be able to do their work increases daily. Our client’s databases contain terabytes of data.

This makes it a challenge to perform operations on data — loading new datasets, transforming data, and querying resulting databases — in manageable time.

It is also critical to make sure that the data is of adequate quality, as the results of the medical research depend on it.

Another challenge is to lower the costs of ETL process by automating as much work as possible. 


The team performed database and application design, development, quality assurance; data extractions, loads, and transformations; administering of databases and query optimization; users support and issue tracking.

In order to be able to operate on huge amounts of data, the latest advanced SQL Server Enterprise edition features were used, such as page-level data compression, horizontal range partitioning, and partition switching.

Some databases are migrated to the massively parallel database system — Analytic Platform System, which improved query performance significantly.

Data quality was enforced by Validation Analysis System — a utility that collects statistics and finds anomalies within loaded data, and a website displaying results.

The majority of ETL operations were automated by Integration Services packages and CDM Builder application.

CDM Builder is a multithreaded & distributed application for transforming a number of popular databases to the Common Data Model. The purpose of the Common Data Model (CDM) is to standardize the format and content of the observational data.

Results & Benefits

Data load time was significantly improved — in some cases from days and weeks down to hours.

The quality of data was improved with the help of Validation Analysis System.

Data integration process was made mostly automated, requiring little to no human attention.

Related Cases

Read all

Online Robotics Simulation Application

An educational robotics kit—a browser app simulating the whole process of building, programming and testing a robot.

Implementing LTI 1.3 for LMS

Implementation of the latest version of the standard, LTI 1.3 and in particular LTI Advantage.

OneRoster 1.2 Integration for LMS

A solution for passing grade information from the LMS to a student information system (SIS)