Supercomputers get lots of press. With colorful names like Yellowstone, Stampede and Blue Waters, their high speed, huge storage capacity and ability to produce high-resolution simulations make them ideal tools for scientific problem solving in a wide range of fields, including drug discovery. But all that computing power is useless without software to make sense of all the data that's generated, usually in the form of numbers.
This is where scientists such as Chandrajit Bajaj, PhD, computational applied mathematics chair in visualization and director of the Center for Computational Visualization, and his team at The University of Texas at Austin (UT), come in. They're taking big data—another term that's getting a lot of media play in all realms—and making it meaningful for laboratory science and technology professionals working in life science R&D.
One aspect of Bajaj's research involves "structure determination from imaging": essentially, a combination of computer modeling, simulation, elucidation analysis and visualization achieved through the application of geometric and signal processing algorithms that run on UT's supercomputers. At the dedication of the university's Bill & Melinda Gates Computer Science Complex and Dell Computer Science Hall, Bill Gates extolled the promise Bajaj's work holds for combating infectious diseases such as HIV. Like others working to improve screening results with computational methods, Bajaj aims to take advantage of computer power to streamline the drug-discovery process without sacrificing accuracy.
Bajaj and his team have developed computer algorithms that, simply put, enable the identification of targets and potential treatments using less data than was previously required. That's important, Bajaj explains, because before comprehensive research can start on targeting a candidate molecule implicated in a disease, investigators need a detailed computer model that captures the form and properties of the target molecule, an informed collection of complementary molecules that might bind to the target, as well as any prior information on characteristics and potential binding sites. "Generally, molecular interaction modeling work requires a 3-D atomistic model of both the target and associated protein cells of a virus—but currently available tools aren't always up to the task," he observes.
"The main workhorse for structure determination close to the atomistic level is X-ray crystallography. However that wouldn't work in our case," Bajaj says. His group has been targeting a pre-infection molecule on the envelope of HIV that triggers helper T-cells and allows the virus to inject its RNA into those cells. However, "envelope proteins such as those on HIV are extremely difficult to crystallize; they're highly fluid and they're multi-proteins, so you can't get them into crystal lattice forms, which is a pre-determinant to X-ray diffraction."
Another option is electron tomography, which has been around for a couple of decades but only recently began to come into its own in protein modeling, Bajaj notes. "Electron tomography is very much like computed tomography imaging, but down to the nanometer scale," he explains. On the positive side, the technique enables scientists to build up three-dimensional models that include both the surface and volumetric information about what is inside the molecule and of its surrounding environment. "However, the nanoscale images that emerge from this method are very noisy—that is, the signal-to-noise ratios are pretty much equivalent—and sometimes you don't know what is a signal and what isn't," he says. In addition, although the goal is to refine a molecule's structure to as close to the atomic level as possible, "that's only possible when you get down to angstroms, which are a tenth of a nanometer. This structure determination method isn't quite there yet."
Working with another computational modeling and image refinement technique, Bajaj devised algorithms that could be used to create simpler "3-D quasi-atomistic models" for pre-infection molecules of HIV, for which 3-D atomistic models are not currently available. He additionally reported in 2012 that his newer structure elucidation algorithms were able to detect secondary structures—alpha-helices and beta-sheets—of proteins using 3-D maps reconstructed at resolutions of 6 to 10 angstroms, and supersecondary structures such as small collections of helices/sheets at coarser (10–15 angstroms) resolutions using 3-D maps reconstructed primarily from single particle cryo-electron microscopy.
"Only when we have a detailed model of the molecule implicated in a virus's life cycle can we find inhibitors that might interrupt its infection or formation," Bajaj emphasizes. "At this point, using a combination of ours and published algorithms, we can resolve structures down to about four to five angstroms, a level of detail that allows us to predict the comparative strength of binding affinity, and thereby order or rank potential drug candidates."
It has taken 10 years and a good deal of interdisciplinary collaboration to arrive at the current algorithms, Bajaj observes. "We've had to rely on good acquisition of electron microscope data from groups such as Sriram Subramaniam and his colleagues at the U.S. National Cancer Institute and then work with other computational applied mathematicians and biophysicists at UT and elsewhere who understand both the imaging process and the mathematization of that process," he says. "Doing so has enabled us to come up with a computational procedure by which we can do model construction or refinement of target structures for a range of molecules implicated in infectious diseases.
"You just can't throw hardware at a problem when you're trying to identify a potential drug target," Bajaj continues. "You need an algorithm that can be operationalized and also is efficient. And the efficiency clearly is secondary; the main thing is accuracy – how reliable is your reconstruction from all those giga bits and tera bytes of data? We know our models will never be 100 percent reliable; there always will be some bounded uncertainty built in, which is why we're always refining our methods. But we know we have a sequence of robust processes with quantified uncertainity, so when we get to the next computational step of making predictions about what might bind to this target model, we can state that prediction with a measure of confidence, and be dubious that we may be way out in left field somewhere."
Even with a detailed target model, that next step is by no means simple, Bajaj acknowledges. "The next major challenge is to comprehensively search for inhibitor molecules that effectively bind to this target molecular model, and that involves another set of computational drug prediction procedures. The caveat is that the drug screening functions or the scoring functions for binding affinities are all approximate mathematical models once again. So we again need to come up with new biochemical and biophysical mathematical models of molecular-molecular recognition, in the form of new scoring functions of molecular affinities that can accurately predict with quantified uncertainty when molecules would bind to targets in vivo, and additionally, verify this predictive understanding by computational means (a.k.a. computational drug trials)."
That said, "It's not enough to have a model or an algorithm; you also have to implement it," Bajaj stresses. That's especially important now because the procedure he used to develop the HIV model can be applied to molecules implicated in other diseases, such as the ribosome of Trypanosoma brucei, the parasite responsible for African sleeping sickness, and the Machupo virus, which is common in Bolivia and elsewhere in South America. To hasten the discovery process further, Bajaj is offering all the structural and mathematical models developed in his group at no charge to any group that wants to use them, even as his group continues to work on improving the prediction process. To request these models, email Bajaj at firstname.lastname@example.org.
"Whether we are looking for an antiviral or an anti-ribosomal compound, or discovering an antigen that might work for developing a vaccine, we need to do constant remodeling from imaging, and through a better mathematization of the molecular interactions," Bajaj says. Moreover, as the targeted viruses mutate, "we need to re-exercise the structure-determination process for different strains. Structure-determination from microscopy is not just a one-shot thing, where we build a model and that's it. We need to continually look at how whatever we modeled changes, and quickly update the target and the potential predictive models based on that."
Asked whether some people are skeptical about the results that can be achieved using computational methods, Bajaj says, "I think the biggest skeptic is me—and that's what really motivates our group. My interest in computational science blossomed as a result of a number of serendipitous encounters with respected scientists in this area, including Tinsley Oden, who encouraged me to move to UT, Ron Milligan and Arthur Olson at Scripps, Andy McCammon at UCSD, Wah Chiu at Baylor College of Medicine, Joachim Frank at Columbia and Timothy Baker at University of California, San Diego, amongst others. These researchers awakened me to the challenges and potential for scientific discovery through computation, and they taught me that one of the main challenges is accepting that our modeling and prediction algorithms are never perfect."
When he first began working in the drug discovery area, Bajaj was struck by the fact that many computational scientists would publish a method or a model "without trying to prove anything or provide any guarantee of how reliably it could work," he says. "What our group tries to do is to always quantify the uncertainty associated with our models, and include that along with the prediction methods based on those models that we publish. That way, other scientists know what they are getting, and what they need to keep in mind as to its shortcomings."
Before Bajaj even thinks of publishing, he puts his algorithms and computations through his own rounds of rigorous testing. "Recovering structure from imaging is error prone because there are so many pitfalls. For example, we might think the data we're working with was accurately acquired, but it could have been mishandled. And so we must be able to somehow verify the algorithm against the possibility that it may not be correct," Bajaj says. Therefore, each time Bajaj or a member of his group identifies a potential target or inhibitor, "we go through routine sets of what I call 'self controls' and progressive checks. Our model includes an assumption about the input errors, and if we go back and check that the input is, indeed, within a reasonable margin of error, then we can say, 'our prediction is based on this bound on error, and therefore our prediction has a certain level of confidence.' It's important to include this information, because there are other notable scientists who are as skeptical as I am, and will ask for those details rather than simply look at a final result."
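The kind of "self control" Bajaj describes can be illustrated with a toy example. The Python sketch below is not the group's code; it assumes, purely for illustration, that a prediction responds roughly linearly to input error, so that an assumed bound on the input error translates into a bound on the prediction. The names `checked_prediction`, `sensitivity` and the numbers used are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    value: float        # the predicted quantity (hypothetical units)
    error_bound: float  # bound on the prediction error implied by the input assumption
    input_ok: bool      # True if the measured input error stayed within the assumed bound


def checked_prediction(value, sensitivity, assumed_input_error, measured_input_error):
    """Attach an error bound to a prediction and record whether the
    input-error assumption behind that bound actually held.

    Assumption of this sketch: the prediction error is at most
    |sensitivity| * input error (a first-order, linear approximation).
    """
    bound = abs(sensitivity) * assumed_input_error
    input_ok = measured_input_error <= assumed_input_error
    return Prediction(value, bound, input_ok)


# Hypothetical numbers: the input error was assumed to be at most 0.5,
# and a later check of the data measured it at 0.3.
result = checked_prediction(value=10.0, sensitivity=2.0,
                            assumed_input_error=0.5, measured_input_error=0.3)
print(result)
```

If the check fails (`input_ok` is `False`), the stated bound no longer applies and the prediction should be reported without that level of confidence.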
Skepticism or not, computational science is here to stay in drug discovery and development. According to the Texas Advanced Computing Center at UT, "computational science has become the third pillar of scientific discovery, complementing theory and physical experimentation, allowing scientists to explore phenomena that are too big, small, fast or dangerous to investigate in the laboratory."
Although computer science is coming into its own in drug discovery, the purpose is not to replace other experimental methods, but "to work along with them," Bajaj says. "Ten years ago, if I were asked how much computation was used in the biological sciences, I'd say less than five percent. It's growing now, and that's because people are refining their methods so the predictions are more reliable. We don't want someone finding when they test the algorithms using cell assays or animal studies that most of the computational predictions don't work in even the simplest laboratory test, because then instead of helping to create an accelerated discovery pipeline, we've slowed them down even more."
When used properly, with algorithms that are as precise as possible, it's difficult to ignore the value of computational techniques. "Virtual (computational) screening involves working with large databases of potential chemical compounds, and with the target testing combinatorics being so huge, I don't see any way that mechanical biological testing can ever be accelerated other than by using computational methods in conjunction with robotics automation to do some pre-screening," Bajaj says. "While the final testing and validation has to be done with the necessary laboratory and field trials, computational screening is increasingly essential to whittle down possible drug leads from millions to a few hundred."
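To make the idea of whittling down a compound library concrete, here is a minimal Python sketch. It is an illustration only, not a description of any screening software mentioned in this article. It assumes each candidate already carries a predicted binding score (higher meaning stronger predicted binding) and an uncertainty estimate, and it ranks candidates by a conservative score, the prediction minus its uncertainty. The compound names, scores and the `shortlist` function are hypothetical.

```python
import heapq


def shortlist(candidates, keep):
    """Return the `keep` candidates with the best conservative score.

    candidates: iterable of (name, predicted_score, uncertainty) tuples.
    Assumption of this sketch: a higher predicted_score means stronger
    predicted binding, and the conservative score is score - uncertainty.
    """
    return heapq.nlargest(keep, candidates, key=lambda c: c[1] - c[2])


# A tiny, made-up "library"; a real one could hold millions of entries.
library = [
    ("cmpd-A", 8.1, 0.4),
    ("cmpd-B", 9.0, 2.5),  # highest raw score, but very uncertain
    ("cmpd-C", 7.9, 0.1),
    ("cmpd-D", 5.2, 0.3),
]

for name, score, uncertainty in shortlist(library, keep=2):
    print(name, score, uncertainty)
```

Ranking on the conservative score rather than the raw prediction is one simple way to keep a highly uncertain prediction from crowding out better-supported leads; the shortlist would still go on to laboratory testing.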
"The bottom line," says Bajaj, "is that computational techniques have improved significantly, and some of these techniques are integral to high-throughput drug screening methods that are already in use, as part of a computer-aided scenario. New and hybrid strategies are now possible because everything else is coming together, too. Our understanding of molecular form and function is increasing. Our super computers are getting better and more available, and it would be a shame not to use them to capacity. Hardware has taken a big lead over software, but software development is catching up. And most of all, teams of people from many different disciplines have been working together, as evidenced by the work being done in our Institute for Computational Engineering and Sciences, the Texas Advanced Computing Center and several other places across the United States and beyond. Computational data science is no longer in the minor leagues. The time has come to take full advantage of its capabilities."
July 29, 2013