
Small Molecule Profiling: Dealing with the Data

"Although high-throughput screening has been going on for almost two decades, only now are substantial public datasets emerging and being made available for cross analysis—and they're large enough and well annotated enough to make a difference in our work," says Paul Clemons, Ph.D., director, Computational Chemical Biology Research at the Broad Institute.

"Many researchers are now beginning to run high-content screens, but they're using old methods to analyze the data," adds Darren Green, Ph.D., director, Computational Chemistry-UK, GlaxoSmithKline (GSK). "Yet new techniques make it possible to pull out so much more information from them. We all should become familiar with these innovative approaches and start using them."

The two trends—more accessible data and new ways to mine that data—underpin the selection of articles for the recently released Journal of Biomolecular Screening (JBS) special issue, Knowledge from Small-Molecule Screening and Profiling Data, for which Clemons and Green served as guest editors. The articles describe ongoing efforts to standardize data and novel methods of data analysis, many of which involve collaboration and data sharing across organizations.

"If we had put this special issue together 10 years ago, many of the papers would have dealt with how to pick your hits, how to monitor the quality of your screens and similar concerns," Green observes. "We have a couple of examples of these classical approaches because they are still needed, but the primary focus is on data."

The Big Picture

The technology and the science of small-molecule profiling now are at a point where "big data" approaches can be used to work with compounds on a larger scale and gain insights sooner, says Clemons. For example, working with Matthias Wawer and colleagues, Clemons applied a machine-learning and pattern-recognition approach to identifying structure-activity relationships. They exposed a collection of some 30,000 compounds to high-throughput gene-expression and image-based profiling, a process that produced about 2,000 different readouts in parallel. Using automated tools, the team was able to connect chemical structural features of the compounds to patterns in their biological activity profiles and use that information to prioritize groups of compounds for further study.
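As a loose illustration of that pattern-recognition idea (the compounds, features and readout values below are invented for this sketch and are not the Broad team's actual data or pipeline), one can score each structural feature by how coherently the compounds containing it behave across many parallel readouts:

```python
# Minimal sketch: given binary structural fingerprints and multi-readout
# activity profiles for a handful of hypothetical compounds, test whether
# compounds sharing a structural feature also share an activity pattern.

def pearson(a, b):
    """Pearson correlation between two equal-length readout vectors."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a) ** 0.5
    vb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (va * vb)

# fingerprints[i][f] == 1 means compound i contains structural feature f
fingerprints = [
    [1, 0],  # compounds 0-1 share feature 0
    [1, 0],
    [0, 1],  # compounds 2-3 share feature 1
    [0, 1],
]
# profiles[i] = readouts from several parallel assays (illustrative numbers)
profiles = [
    [2.0, 1.8, 0.1, 0.2],
    [2.1, 1.9, 0.0, 0.3],
    [0.1, 0.2, 2.2, 1.9],
    [0.2, 0.1, 2.0, 2.1],
]

def feature_coherence(feature):
    """Mean pairwise profile correlation among compounds with a feature."""
    members = [i for i, fp in enumerate(fingerprints) if fp[feature]]
    pairs = [(i, j) for i in members for j in members if i < j]
    return sum(pearson(profiles[i], profiles[j]) for i, j in pairs) / len(pairs)

# Features whose member compounds behave alike get prioritized for follow-up
scores = {f: feature_coherence(f) for f in range(2)}
```

At real scale the same comparison runs over thousands of features and readouts, but the ranking logic is the same: structural features whose compounds cluster in activity space are candidates for genuine structure-activity relationships.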

"An analogy would be when you buy something on Amazon, and you see 'people who bought this product also bought the following....' or 'because you bought this, you may also want to buy this,'" Clemons explains. "We use the same basic methodology to say, 'If you have this chemical structure, then you're likely to have this pattern of activity.'" By contrast, "a more traditional approach would be to do a high-throughput screen with a single readout, narrow the compounds down to a much smaller collection—maybe a few hundred, at most—and then do the profiling."

That said, other groups are working on ways to bring more value to results from single screens. For example, Aurelie Bornot and colleagues at AstraZeneca use known bioactivity data from screened compounds to infer putative targets, pathways and biological processes that are consistent with an observed phenotypic response. "This knowledge-based approach, which uses existing data to help inform decision making after you get your readout, complements the pattern-recognition approach," notes Green. "Both methods, and others, are likely to be utilized going forward."
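A minimal sketch of the knowledge-based idea (the compound names and target annotations below are invented, and this is not AstraZeneca's actual method) is to rank known targets by how often they recur among the hits of a phenotypic screen:

```python
# Hedged sketch: given hits from a phenotypic screen and known
# compound -> target annotations, rank targets by how many active
# compounds are annotated against them. Targets enriched among hits
# are putative drivers of the observed phenotype.

from collections import Counter

known_targets = {  # invented annotations for illustration
    "cmpd1": {"EGFR", "SRC"},
    "cmpd2": {"EGFR"},
    "cmpd3": {"MTOR"},
    "cmpd4": {"EGFR", "MTOR"},
}

def rank_putative_targets(hits):
    """Count how often each annotated target recurs among the hits."""
    counts = Counter(t for c in hits for t in known_targets.get(c, ()))
    return counts.most_common()

ranked = rank_putative_targets(["cmpd1", "cmpd2", "cmpd4"])
```

A production version would weight by potency and correct for how promiscuously each target is annotated, but the core move is the same: use existing bioactivity knowledge to explain a new readout.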

Before that can happen, however, "the approaches we see in papers need to get embedded into robust, industrial-strength software that people can run routinely in their labs," Green says. "The papers in this JBS special issue are like the tip of the iceberg—data scientists, computational scientists and statisticians are doing most of the analyses. To really drive these new techniques into screening labs, we need products that are reliable and work across platforms." Although some software companies are providing products with these capabilities, these are essentially "vertical solutions, so it's difficult to swap algorithms in and out," Green explains. "Companies need to make their software compatible with a wider range of products and platforms so it can be adopted on a bigger scale."

The Need for Standards

Even if software for analyzing data were to become more modular and compatible across platforms, the challenge of standardizing the data remains—a challenge that is common to other areas of drug discovery R&D, as well. "We can't stress the importance of standards enough," says Green. "We need standard vocabularies and ontologies so we can pull data from different sources together and do meta-analyses. We also need standards so we can compare methods, rather than, say, my proposing something based on my data that someone else can't integrate with their data."

"Standards also will allow drug discovery scientists to more rapidly prioritize compounds, and to potentially identify new relationships between compounds and targets or targets and pathways or pathways and diseases that were not previously appreciated," adds Clemons. "That's where an initiative such as BARD (BioAssay Research Database) comes into play. BARD has controlled vocabularies associated with assay descriptions, so that when researchers use the same reagents, for example, we can say with confidence that those assays are similar."

BARD is a public chemical biology resource with a query portal that spans multiple organizations, locations and disciplines. It continues to evolve, with the aim of facilitating the use of data generated by the NIH Roadmap Molecular Libraries Program (MLP) and the Molecular Libraries Probe Production Centers Network, which ended in May 2014. So say two of the resource's architects, authors of the manuscript featured on the front cover of this JBS special issue: Thomas (T.C.) Chung, director of outreach and project manager at the Sanford-Burnham Medical Research Institute's screening center, and Andrea de Souza, a consultant and former general manager of the Center for the Science of Therapeutics at the Broad Institute.

"The chemical biology community recognized that we had a very rich dataset as a result of the $100 million plus NIH Common Fund initiative, but we had not come up with standards for how the data should be annotated," explains de Souza. "Once we were into year three of the program, we realized we couldn't interpret and mine the data as effectively as we would like to. So we had a choice. We could continue to do things the way we'd always done them and simply put the data into PubChem. Or, if we wanted to preserve the richness and integrity of the data, we could learn from our mistakes and put in place the standards and the vocabulary that would enable us to make more sense of what we had."

de Souza, Chung and colleagues decided to collaborate with their MLP partners to curate and annotate the data, deposit it into a next-generation technical repository and open-source the code. de Souza also conceptualized the Assay Definition Standard and has set the stage for its adoption, which should lead to the meaningful exchange of assay information between industry and academia. In its present iteration, BARD uses "structured assay and result annotations that leverage the BioAssay Ontology (BAO) and other industry-standard ontologies, and a core hierarchy of assay definition terms and data standards defined specifically for small-molecule assay data," according to the JBS special issue cover feature that describes how the interdisciplinary BARD team is collaborating to implement the standards.
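The value of such controlled vocabularies can be sketched in a few lines (the field names and terms below are invented for illustration and are not actual BAO identifiers): annotations drawn from a shared vocabulary can be validated and compared mechanically, which free-text assay descriptions do not allow.

```python
# Toy sketch: a controlled vocabulary lets a repository reject
# nonconforming annotations at deposit time, and lets downstream users
# decide mechanically whether two assays are directly comparable.

VOCAB = {  # invented vocabulary, not real BAO terms
    "assay_format": {"biochemical", "cell-based"},
    "detection": {"fluorescence", "luminescence", "absorbance"},
}

def validate(annotation):
    """Reject annotations that use terms outside the vocabulary."""
    for field, value in annotation.items():
        if field not in VOCAB or value not in VOCAB[field]:
            raise ValueError(f"unrecognized term: {field}={value}")
    return annotation

def similar(a, b):
    """Assays agreeing on every shared annotated field are comparable."""
    shared = set(a) & set(b)
    return bool(shared) and all(a[f] == b[f] for f in shared)

a1 = validate({"assay_format": "cell-based", "detection": "fluorescence"})
a2 = validate({"assay_format": "cell-based", "detection": "fluorescence"})
```

With free text ("FLINT readout in HeLa" vs. "cellular fluorescence assay"), neither the validation step nor the comparison is possible without a human in the loop, which is exactly the bottleneck the BARD team describes.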

"What we're really involved in is a change initiative aimed at getting scientists to understand the value of annotation and curation," de Souza says. "Our success will depend on how well BARD is adopted. Migrating data from PubChem to BARD, re-annotating it and getting it to the standard we've developed is challenging. We're dealing with a huge volume of data, and it takes a lot of time and effort, but we're slowly and steadily making progress towards our goal: to build BARD into a single resource that biologists and chemists can trust for reliably comprehensive query results."

In parallel with BARD, Stephan Schürer, associate professor and principal investigator in the Department of Molecular and Cellular Pharmacology at the University of Miami School of Medicine, and colleagues are continuing to evolve BAO, which they created in 2010 and which shares with BARD its metadata terms and definitions for describing data from drug and probe screening assays. Schürer also is heading up an effort to create similar specifications for the highly diverse data being generated by the NIH Library of Integrated Network-based Cellular Signatures (LINCS) program. The LINCS pilot program, which is now transitioning into its next phase, included two data-generation centers, four technology development centers, four computational centers and an interim data coordinating center, all of which work together as a network, Schürer explains. In contrast to the MLP, the LINCS program is generating a more complex data matrix that extends into multiple dimensions: perturbations (e.g., small-molecule, genetic, environmental); model systems (e.g., cell lines, primary cells, induced pluripotent stem cells); and measured response profiles (e.g., genome-wide transcription, kinome-wide protein binding, cell cycle state, cell viability and other phenotypes, and metabolic parameters). In addition, LINCS is generating computational and software tools. The goal, according to the team's report in JBS, is to create "a sustainable, widely applicable and readily accessible systems biology knowledge resource."

Given the vast amount of complex data generated by the LINCS consortium, simply creating and pulling information from a large central database wouldn't be practical, Schürer says. Instead, his team is creating tools based on their newly developed metadata standards and various controlled vocabularies and ontologies that respond to queries by providing an integrated view of data from multiple sources. For example, one tool, the LINCS Information FramEwork (LIFE), is being developed as a "knowledge-based search engine." Users can search for molecular entities or other concepts—e.g., specific compounds, diseases or proteins—to identify relevant data, or they can browse the data by category, e.g., "bioassays," "cell lines by organ," "small molecules," "kinase proteins" or "cell lines by disease." Integrating complementary technologies to store and process complex datasets is the foundation for a distributed knowledge-based information management and search system that is different from the classical central data warehouse approach. This helps avoid data replication and places less of a burden on data producers, according to Schürer.

The goal is to develop tools that do more than simply deliver information based on keywords in the way that classical search engines do, Schürer explains. The new tools will not only retrieve results relevant to a specific query but also represent and model the underlying knowledge, helping to create new knowledge by inferring connections among those results. For example, the system knows which disease is related to which model system, enabling researchers to identify compounds that show activity related to a specific cancer subtype and then find those kinases that are also inhibited by the same or very similar compounds. As more knowledge is formalized and more data is connected to that knowledge, the system's capabilities will "exponentially increase," Schürer says. "Common data standards will help make the development of such tools possible."
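The kind of inference Schürer describes can be pictured with a toy knowledge graph (all of the entities below are invented placeholders): follow typed relations from a disease to its model systems, to compounds active in those systems, to the kinases those compounds inhibit.

```python
# Toy illustration of knowledge-graph inference: a handful of
# (subject, relation, object) triples and a traversal that chains
# disease -> model system -> active compound -> inhibited kinase.

triples = [  # invented entities for illustration
    ("diseaseX", "modeled_by", "cell_line_1"),
    ("compoundA", "active_in", "cell_line_1"),
    ("compoundA", "inhibits", "kinase_K1"),
    ("compoundB", "active_in", "cell_line_2"),
    ("compoundB", "inhibits", "kinase_K2"),
]

def objects(subject, relation):
    """All objects linked from a subject by a given relation."""
    return {o for s, r, o in triples if s == subject and r == relation}

def subjects(relation, obj):
    """All subjects linked to an object by a given relation."""
    return {s for s, r, o in triples if r == relation and o == obj}

def kinases_for_disease(disease):
    """Disease -> model systems -> active compounds -> inhibited kinases."""
    kinases = set()
    for cell_line in objects(disease, "modeled_by"):
        for compound in subjects("active_in", cell_line):
            kinases |= objects(compound, "inhibits")
    return kinases
```

A keyword search engine would return documents mentioning "diseaseX"; the traversal above instead answers a question no single record contains, which is why the formalized relations—and the shared vocabularies that make them line up across sources—matter.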

A Community Effort

BARD, LINCS and similar endeavors are efforts to turn increasingly large amounts of data into meaningful information that can be used by research teams across industry, academia and government. "What I like about them is that they all represent ways that the scientific community is coming to grips with a problem that has been around for a long time," Green says. "We've been moaning about the lack of standards in our own offices for years, and so it's great to see these kinds of initiatives taking hold."

These collaborative undertakings are especially meaningful now that industry has begun to participate, Green adds. After 23 years in the industry, he is witnessing "a clear shift to more pre-competitive collaboration," he says. "One example is GSK's recent agreement with the European Bioinformatics Institute and the Sanger Center to share target validation information in a pre-competitive way."

Industry-led initiatives such as the Pistoia Alliance also are working toward the development of data standards, Green notes. "We've been able to put down our swords and are doing quite well at sharing experiences." In addition, GSK and other industry leaders are involved in Open PHACTS, a European initiative that aims to "reduce the barriers to drug discovery in industry, academia and for small businesses." The result will be an online platform with publicly available pharmacological data, as well as freely available software and tools for querying and visualizing that data, according to the initiative's website.

"The biggest thing we can all do now is to get involved in these projects, and where we have emerging standards, try to adopt them as quickly as possible," Clemons urges. "In the end, everyone will benefit—companies, government, academia and ultimately, patients—and that's what really matters."

Learn More in the JBS Special Issue

In addition to the contributions discussed in this article, the JBS special issue, Knowledge from Small-Molecule Screening and Profiling Data, features review articles and additional original reports on new methods for interpreting and mining high-throughput screening data.

June 30, 2014