DATA : LAB = 2 : 1

Written by Katarina Kovač | Aug 27, 2015 1:17:04 PM

The amount of time scientists spend analyzing their data has increased steeply since the late 1990s and early 2000s.

I remember starting to spend more time in the office when I began to use "modern" molecular biology techniques, real-time PCR and DNA microarrays in particular. Before that I was more involved in conventional biotechnology and biochemistry research, which was far more laboratory-intensive. After this, office time increased and caused quite a problem for the organization I was working for, as it meant that an increasing number of researchers needed permanent office space and their own PCs to keep up with the ever-increasing demand for data analysis.

With lab techniques becoming more and more data-intensive, researchers had no choice but to dig into data analysis themselves. This is how a lot of bioinformaticians were born: learning basic programming skills for organizing, filtering and annotating data in tables containing tens of thousands of rows. And the charming, seductive and misleadingly simple names of programming environments such as R or Python did very little to help.
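
To give a flavour of what "organizing, filtering and annotating" a table looks like in practice, here is a minimal Python sketch. Everything in it is made up for illustration: the column names, the gene IDs, the annotation lookup and the thresholds are assumptions, not taken from any real dataset or from the workflow described above. A real table would have tens of thousands of rows rather than four.

```python
import csv
import io

# Hypothetical expression table: gene ID, log2 fold change, p-value.
RAW_TABLE = """gene_id,log2_fold_change,p_value
geneA,2.4,0.001
geneB,-0.2,0.640
geneC,-1.9,0.004
geneD,1.1,0.200
"""

# Hypothetical annotation lookup (gene ID -> description).
ANNOTATIONS = {
    "geneA": "heat shock protein",
    "geneC": "transcription factor",
}


def filter_and_annotate(table_text, annotations, min_abs_change=1.0, max_p=0.05):
    """Keep rows passing both thresholds, add a description, sort by effect size."""
    rows = []
    for row in csv.DictReader(io.StringIO(table_text)):
        change = float(row["log2_fold_change"])
        p_value = float(row["p_value"])
        # Filter: keep only strong and statistically significant changes.
        if abs(change) >= min_abs_change and p_value <= max_p:
            rows.append({
                "gene_id": row["gene_id"],
                "log2_fold_change": change,
                "p_value": p_value,
                # Annotate: attach a description, or "unknown" if none exists.
                "description": annotations.get(row["gene_id"], "unknown"),
            })
    # Organize: strongest change first.
    rows.sort(key=lambda r: abs(r["log2_fold_change"]), reverse=True)
    return rows


if __name__ == "__main__":
    for hit in filter_and_annotate(RAW_TABLE, ANNOTATIONS):
        print(hit["gene_id"], hit["log2_fold_change"], hit["description"])
```

The thresholds here (an absolute log2 fold change of at least 1 and a p-value of at most 0.05) are placeholders; the appropriate cut-offs depend on the experiment.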

Then came Next Generation Sequencing (NGS). Compared to DNA microarrays, NGS took the time required for bioinformatic data processing to the next level. While DNA-microarray data analysis could still be handled by a "bio-person" with basic knowledge of, or at least an affinity for, computers and programming, NGS is a completely different story. The amount of data produced by NGS is far greater than that of any preceding molecular biology technique. NGS brought a clear demand for a new breed of researchers, bioinformaticians, who spend next to no time in a lab. In a way, this means that researchers were relieved of the burden of bioinformatics.

However, this blessing came at a price: although there is no clear rule as to what the background of a bioinformatician really should be, most of them come from computer science. This has resulted in a huge language barrier, as completely different vocabularies are used even for overlapping concepts. It is like having to learn to speak again.

The times when (sequence) data could be represented in a few lines are long gone. Now NGS data have to be transported on physical hard drives rather than attached to an e-mail or otherwise sent via the Internet, as they are simply too big.

A lot of data means all kinds of new problems. One is data storage. In the end, data need to be stored locally, which means more investment in data storage. The second problem with large data is that they are often generated by distant service providers and need to be transported back to the lab. I used "transported" intentionally, because NGS data are mostly shipped physically on portable disks via "snail mail", as the Internet is much too slow. These are just a few examples of how big a problem the data explosion in molecular biology is.

It seems that data are becoming more and more important compared to laboratory experiments. Will there eventually be so much data that people will stop doing experiments in a lab and instead focus on getting answers from existing data? For that to become even remotely possible, one important prerequisite will have to be fulfilled: accessible, curated data. Dumping data into public databases is not the problem. The problem is that the experimental part (the data-generating process) will have to become much more standardized, or at least much better documented, so that the quality of experiments is obvious and poorly designed or executed experiments can be identified. The old saying "garbage in, garbage out", which one learns at the beginning of a bioinformatics career, still holds true.

The fact that companies like IBM, Google and Samsung are stepping into the area of bioinformatics, or at least data processing and storage in biology, medicine and the like, is a very clear signal that data is already unimaginably important and has every chance of superseding laboratory experimentation.

By Matjaž Hren, PhD, COO and Head Research and Development, BioSistemika LLC
