The BIG DATA in the Malaysian Cohort Project

Everyone is talking about big data nowadays, and many want to get their hands on it. One of the major sources of big data is the health industry. Hospitals that have implemented an electronic medical record (EMR) system already have big data stored in their databases. Hospitals that still use paper-based medical records should move towards EMR; otherwise their ‘big data’ cannot be harnessed and analysed effectively.

The simplest definition of big data is data that is too large to fit on a laptop or desktop computer, even one with the largest memory. In most cases, the data is stored in large databases with a proper back-up and security framework. Then, of course, there are the famous 5Vs of big data: volume, velocity, variety, veracity and value. The last V, value, is perhaps the one that matters most.

The Malaysian Cohort (TMC) project certainly has big data. To start with, we have the baseline recruitment data for 106,527 participants. By all accounts, this is the largest collection of health-related data in Malaysia. Each participant has more than 2,000 data points collected from the questionnaire, biophysical measurements and blood tests. The address of each participant is also mapped onto a geographical information system (GIS).

The current amount of data from the TMC project stored in our database is about 50 terabytes. A terabyte (TB) is a measure of computer storage capacity; in the binary convention used here it is 2 to the 40th power bytes, or about 1.1 trillion bytes. Put another way, one TB is 1,024 gigabytes (GB).
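To make the arithmetic concrete, here is a minimal Python sketch of the conversion described above. It assumes the binary (power-of-two) convention given in the text, and it treats the "about 50 terabytes" figure as exactly 50 for illustration. The constant and function names are our own, chosen for this example only.

```python
# Binary (power-of-two) storage units, following the definition in the text.
BYTES_PER_GB = 2 ** 30
BYTES_PER_TB = 2 ** 40


def terabytes_to_bytes(terabytes):
    """Convert a size in terabytes to bytes (hypothetical helper)."""
    return terabytes * BYTES_PER_TB


# How many gigabytes make up one terabyte under this convention.
gb_per_tb = BYTES_PER_TB // BYTES_PER_GB

# Illustrative only: the text says "about 50 terabytes", so 50 is an assumption.
tmc_bytes = terabytes_to_bytes(50)

print(f"1 TB = {BYTES_PER_TB:,} bytes = {gb_per_tb:,} GB")
print(f"50 TB = {tmc_bytes:,} bytes")
```

Under this convention, 50 TB comes to roughly 55 trillion bytes.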

Our TMC researchers are now starting to mine and analyse the big data from the project. The biospecimens themselves are also a resource for research with large numbers. Imagine asking what happened to the pre-diabetic participants (about 10% of the cohort, or roughly 10,000 people) after 4-6 years of follow-up. Or imagine a cancer biomarker discovery project that runs a proteomic analysis on the pre-symptomatic samples of the 500 participants who have succumbed to cancer. This is just the beginning, and we believe the sky is the limit for the kind of research we can do with the data and biospecimens from The Malaysian Cohort.