The journal Cell recently published an article describing a detailed simulation of the full life cycle of the bacterium M. genitalium (Karr et al. 2012. A Whole-Cell Computational Model Predicts Phenotype from Genotype. Cell, 150: 389). This has been hailed as the “First Complete Computer Model of a Living Organism”. It remains an open question whether all the important processes in the life of the cell have really been captured and described accurately.
Regardless, the scope of the work is pretty astounding. The authors tried to capture the workings of the cell at a level that would represent all of the processes that we believe are necessary for its function. The authors are careful to say that much remains unknown, and that this is more of a first draft than an accurate model of the organism.
A simulation of this size reveals a number of problems faced by computational biologists. First, there is the problem of fitting the model – in this case there were over 1900 parameters that were determined using results from over 900 papers. This brings to mind von Neumann, who said that “With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.” How do you constrain a model with 1900 parameters? The authors chose to fit the model using some of the literature, and then compare its predictions to results they had not used in the fitting. The model seems to make quite accurate predictions – something we would not expect if it had been chosen at random.
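The logic of this validation strategy can be sketched in a toy example (the data and models here are invented for illustration, not taken from the paper): fit parameters on one half of the observations, then check prediction error on the held-out half. A model whose fitted parameters carry real information about the process should beat arbitrarily chosen parameters on data it never saw.

```python
import random

random.seed(0)

# A toy "true" process: a linear relationship plus measurement noise.
def observe(x):
    return 2.0 * x + 1.0 + random.gauss(0, 0.1)

data = [(i / 10, observe(i / 10)) for i in range(20)]
train, held_out = data[:10], data[10:]   # fit on half, validate on the rest

# Fit a line to the training half by ordinary least squares (closed form).
n = len(train)
sx = sum(x for x, _ in train)
sy = sum(y for _, y in train)
sxx = sum(x * x for x, _ in train)
sxy = sum(x * y for x, y in train)
slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
intercept = (sy - slope * sx) / n

# Mean squared prediction error on a set of points.
def mse(m, b, pts):
    return sum((m * x + b - y) ** 2 for x, y in pts) / len(pts)

fitted_err = mse(slope, intercept, held_out)
arbitrary_err = mse(-3.0, 4.0, held_out)  # arbitrarily chosen parameters

print(f"fitted model error:    {fitted_err:.4f}")
print(f"arbitrary model error: {arbitrary_err:.4f}")
```

The fitted model's held-out error stays close to the noise level, while the arbitrary parameters fail badly – the same contrast, scaled up enormously, is what makes the whole-cell model's out-of-sample accuracy meaningful despite its 1900 parameters.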
However, I find a few quotes from the press release somewhat perplexing. For instance:
“Right now, running a simulation for a single cell to divide only one time takes around 10 hours and generates half a gigabyte of data,” Dr. Covert wrote. “I find this fact completely fascinating, because I don’t know that anyone has ever asked how much data a living thing truly holds. We often think of the DNA as the storage medium, but clearly there is more to it than that.”
This is an interesting question. However, I don’t know what Dr. Covert means here. Suppose we simulate the trajectory of every single molecule in the cell, as well as the conformational changes that they undergo. If we track this information over the lifetime of a cell, it would generate far more than a gigabyte of data. But how much of this information is really meaningful, or necessary to describe the process? For instance, think of the trajectory of a molecule in a cell as a constrained random walk. The amount of information we need to store this trajectory depends on how we choose to represent it – and could be made as large as we want. But is all this information really relevant, or is it only important that the molecule made it from point A to point B? Similarly, perhaps a lot of the information that is generated by the simulation is really “noise”. The true information that describes the parts of the process that are relevant to the cell may be much smaller.
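The random-walk point can be made concrete with a small sketch (a one-dimensional walk invented for illustration): storing the full trajectory costs roughly one bit per step, so the data volume grows without bound as we track more steps, while the biologically interesting fact – where the molecule ended up – fits in a handful of bytes.

```python
import random

random.seed(42)

# A 1-D constrained random walk: unit steps, reflected at +/- bound.
def walk(n_steps, bound=50):
    pos, path = 0, [0]
    for _ in range(n_steps):
        pos = max(-bound, min(bound, pos + random.choice((-1, 1))))
        path.append(pos)
    return path

n = 100_000
path = walk(n)

# Storing every step: each step is (roughly) a binary choice, so about
# one bit per step -- and we can inflate this at will by simulating at
# finer time resolution.
trajectory_bits = n

# Storing only the outcome ("the molecule made it from A to B"):
# one small integer.
endpoint_bits = 32

print(f"full trajectory: ~{trajectory_bits} bits")
print(f"endpoint only:   ~{endpoint_bits} bits, final position {path[-1]}")
```

The half-gigabyte per cell division may be largely of this first kind: a representation-dependent record of the walk, most of which is noise relative to the much smaller description of what the cell actually accomplished.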
A related question is: How complex does a computational model have to be to describe the life cycle of an organism? If we really need models with 1900 parameters, and a host of components that interact in complex and irreducible ways, we may not be able to fully understand even this simplest of organisms.