Construction and data mining of chemical and biological databases
Bastienne Wentzel

30 April 2009, Interface

Accurately predicting the biological activity of compounds is of great importance in the development of new drugs. Bioinformatics methods are used to predict, for example, how a receptor interacts with a drug, while cheminformatics methods are used to predict the possible toxicity of a potential drug molecule. Jeroen Kazius constructed extensive databases and designed new data mining methods that predict such activities more accurately.

Jeroen Kazius' thesis is not about bioinformatics alone. It is also about cheminformatics, the chemical counterpart of biological informatics. The two are similar yet different, he explains. "Both methods use algorithms that look for patterns in a data set. But the type of data is very different. Bioinformatics tools often handle sequences of letters to encode genes or proteins. Cheminformatics is about analysing small molecules. These are mostly represented by a two-dimensional structure. Therefore the algorithms used are very different."
Kazius constructed and used biological and chemical databases that did not exist before. He then designed ways to predict certain properties of a new structure more accurately by using information from these databases. The bioinformatics part of the research focuses on a particularly important protein superfamily, G protein-coupled receptors (GPCRs). About half of the known drugs interact with these proteins, which makes them an important target for the pharmaceutical industry.

Collecting data
Kazius collected GPCR data from electronic sources and from printed literature and constructed a new database called NaVa (from natural variants). "Selecting this data is time-consuming because it has to be checked manually," Kazius explains. "Data can be erroneous or outdated. We developed software to check data automatically but some of it still needs to be done by humans." The researchers initially focused on single nucleotide polymorphisms (SNPs), but chose to also include rare mutations. Later, the database was extended with further types of variants, such as copy number polymorphisms (CNPs). Kazius observed that few of the single nucleotide polymorphisms were convincingly linked to disease susceptibility, in contrast to rare mutations with a drastic effect on phenotype.
The aim of the database is to provide an overview of variants that occur naturally in human GPCRs. Furthermore, the variants were linked to health data and to frequency data, which makes it possible to distinguish rare disease mutations from common, often harmless polymorphisms. The database now contains almost 80,000 variants. "When we mapped these variants on the 3D structure of GPCRs, we found certain trends," Kazius says. "For example, mutations in the membrane domain of the receptor lead to disease more often than in the sites outside the membrane. Also, most disease-causing mutations were found on the inside of this conserved 3D structure of GPCRs whereas mutations on the outside rarely lead to diseases."
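The kind of analysis described above can be illustrated with a minimal Python sketch. It is not the NaVa software: the records, the region names and the 1% frequency cutoff are all assumptions made for illustration (1% is a commonly used convention for calling a variant a polymorphism, not a value taken from the thesis).

```python
from collections import defaultdict

# Hypothetical records, invented for illustration:
# (variant id, allele frequency, receptor region, linked to disease?)
VARIANTS = [
    ("v1", 0.25, "extracellular", False),
    ("v2", 0.0004, "membrane", True),
    ("v3", 0.08, "membrane", False),
    ("v4", 0.0001, "membrane", True),
    ("v5", 0.002, "intracellular", False),
]

# Assumed cutoff: 1% is a common convention, not a value from the thesis.
RARE_THRESHOLD = 0.01


def classify(frequency, threshold=RARE_THRESHOLD):
    """Label a variant as a rare mutation or a common polymorphism."""
    return "rare mutation" if frequency < threshold else "polymorphism"


def disease_fraction_by_region(variants):
    """Return, per receptor region, the fraction of variants linked to disease."""
    totals = defaultdict(int)
    linked = defaultdict(int)
    for _, _, region, is_disease in variants:
        totals[region] += 1
        linked[region] += int(is_disease)
    return {region: linked[region] / totals[region] for region in totals}


if __name__ == "__main__":
    for name, freq, region, _ in VARIANTS:
        print(name, region, classify(freq))
    print(disease_fraction_by_region(VARIANTS))
```

With real data, the per-region fractions would be the starting point for the comparison between membrane and non-membrane sites that Kazius mentions.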

By investigating genetic variants for disease-causing features, the efficacy of certain drugs on GPCRs may be explained. But developing new drugs also requires information about the chemical structure of these compounds. Affinity is only one of the properties that a drug candidate needs. Promising compounds are commercially unviable when they are, for example, mutagenic, meaning that they can introduce harmful mutations into DNA. The accuracy of mutagenicity predictions is therefore important to the pharmaceutical, food, and cosmetic industries.
Kazius investigated the chemical features that correlate with biological activity by constructing a large dataset to identify and analyse toxic and nontoxic substructures that are useful for prediction. The aim is to find and predict the relationship between toxicity and chemical structure of a molecule. "The overall aim of our study is older than I am," explains Kazius. "But I feel that we have shown that new computational methods can really contribute to our understanding of biologically relevant chemical features. The speed of this new method enabled us to consider far larger, and more intricate chemical substructures."
The researchers introduced a more complex data mining method. Such substructure mining algorithms are important drug discovery tools. The method involves looking at all possible substructures in a dataset instead of only small, linear fragments. The algorithm for the substructure data mining method was developed by Kazius' colleague Siegfried Nijssen, also a PhD student. Kazius: "I combined his algorithm with extra chemical information. The result is a system that can predict biological activity a lot better." In one of the tests, the error in the prediction of mutagenicity was reduced from 30% using only linear test structures to 21% using all complex patterns. Furthermore, new toxic substructures were discovered, such as the polycyclic planar system, as well as new toxicity-reducing substructures, like a trifluoromethyl substituent.
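To make the idea of fragment-based analysis concrete, here is a small Python sketch of the simpler baseline mentioned above: enumerating small, linear fragments and counting how often each one occurs in mutagenic molecules. It is not Nijssen's algorithm, which also considers branched and cyclic substructures. The molecule encoding (heavy atoms as element symbols, bonds as index pairs, bond orders ignored) and the toy labels are assumptions made for illustration.

```python
def linear_fragments(atoms, bonds, max_atoms=3):
    """Enumerate linear atom paths of up to max_atoms atoms as 'A-B-C' strings."""
    neighbours = {i: set() for i in range(len(atoms))}
    for a, b in bonds:
        neighbours[a].add(b)
        neighbours[b].add(a)

    fragments = set()

    def extend(path):
        labels = tuple(atoms[i] for i in path)
        # A path and its reverse are the same fragment; keep one canonical form.
        fragments.add("-".join(min(labels, labels[::-1])))
        if len(path) < max_atoms:
            for nxt in neighbours[path[-1]]:
                if nxt not in path:
                    extend(path + [nxt])

    for start in range(len(atoms)):
        extend([start])
    return fragments


def fragment_statistics(dataset, max_atoms=3):
    """Map each fragment to (mutagenic molecules containing it, all molecules containing it)."""
    stats = {}
    for atoms, bonds, is_mutagenic in dataset:
        for frag in linear_fragments(atoms, bonds, max_atoms):
            hits, total = stats.get(frag, (0, 0))
            stats[frag] = (hits + int(is_mutagenic), total + 1)
    return stats


# Toy dataset with invented labels: (atoms, bonds, mutagenic?)
TOY_DATASET = [
    (["C", "N", "O", "O"], [(0, 1), (1, 2), (1, 3)], True),
    (["C", "C", "O"], [(0, 1), (1, 2)], False),
    (["C", "C", "N", "O", "O"], [(0, 1), (1, 2), (2, 3), (2, 4)], True),
    (["C", "C", "N"], [(0, 1), (1, 2)], False),
]

if __name__ == "__main__":
    for frag, (hits, total) in sorted(fragment_statistics(TOY_DATASET).items()):
        print(f"{frag}: {hits}/{total} mutagenic")
```

In this toy set the fragment O-N-O occurs only in the molecules labelled mutagenic, while C-C is spread over both classes. A linear path cannot describe a ring system or a branched pattern as a single feature, which is the limitation that mining all substructures is meant to address.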
"Nevertheless, there is always room for improvement. For one, many people underestimate the complexity and restrictions for the use of the available chemical databases for the prediction of toxicity or other biological activities. When you predict toxicity of a new heterocyclic compound, the value of your toxicity prediction will be next to worthless if your prediction was derived from steroids. For that reason it is very important to indicate the reliability of the prediction. It was not possible to also tackle this issue during my time as a PhD student, but I am working actively on this now."
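One common way to indicate how far a new compound lies from the data a prediction was derived from is to measure its similarity to the nearest training compound. The Python sketch below shows this idea with Tanimoto similarity on sets of fragments. It is a general illustration, not the approach Kazius is working on, and the fragment names are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of fragments."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def nearest_training_similarity(query, training_sets):
    """Highest similarity between the query and any training compound.

    A low value suggests the prediction is an extrapolation and should be
    treated with caution.
    """
    return max((tanimoto(query, t) for t in training_sets), default=0.0)


if __name__ == "__main__":
    # Hypothetical fragment sets, invented for illustration.
    training = [{"C-C", "C-O", "C-C-O"}, {"C-C", "C-C-C"}]
    similar_query = {"C-C", "C-O"}
    distant_query = {"N-O", "O-N-O"}
    print(nearest_training_similarity(similar_query, training))
    print(nearest_training_similarity(distant_query, training))
```

A compound that shares no fragments with the training set scores zero, which corresponds to the situation in the quote where a prediction for a heterocyclic compound is derived from steroids.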

After his thesis defense, Kazius started the company Curios-IT. Kazius is introducing speed improvements and features to a new method of substructure mining. "But not all work is scientific. I expect the graphical interface to be ready within months," he says. "Then I hope to start selling the tool." There are about forty interested parties already. Most of these are foreign companies, not only in the pharmaceutical industry, as Kazius expected, but also in the food and cosmetics industries.
"Pharmaceutical researchers could use this tool for prediction at any stage of drug discovery, but they could also use it more actively with new data from high throughput screening. But apparently, it is also very important for other industries to accurately predict toxicity. More than I expected. They too need biological activity but not toxicity." A special feature of the company is its link with the Research For Charity Foundation, which enables all of Curios-IT's profits to be donated to charities such as WWF, Red Cross and Oxfam.
The research described here resulted from the close cooperation between the LACDR and LIACS research centers in Leiden. Jeroen Kazius was a PhD student from 2003 to 2007 at the Medicinal Chemistry group of Leiden University and the Leiden/Amsterdam Center for Drug Research (LACDR), supervised by professor Ad IJzerman. He worked closely together with Siegfried Nijssen, who was a PhD student at the Algorithms Cluster of the Leiden Institute of Advanced Computer Science (LIACS), supervised by professor Joost Kok. The former institute is mainly involved in drug research. At the latter, mathematical and computer research is the core business. Kazius says: "It was so easy to drop in and discuss ideas or problems thanks to the small physical distance between the collaborating institutes. This way of working is very stimulating, and I feel that it has given us many possibilities."

Name: Jeroen Kazius
University: Leiden University
Promotors: Prof. Dr. A.P. IJzerman, Prof. Dr. J. N. Kok
Thesis title: Computers and drug discovery - construction and data mining of chemical and biological databases
Obtained on: 11 June 2008

Kazius, J., et al. (2008) GPCR NaVa database: natural variants in human G protein-coupled receptors. Human Mutation, 29(1), 39-44.

Kazius, J., et al. (2005) Derivation and validation of toxicophores for mutagenicity prediction. Journal of Medicinal Chemistry, 48(1), 312-320.

Helma, C., Kazius, J. (2006) Artificial intelligence and data mining for toxicity prediction. Current Computer-Aided Drug Design, 2(2), 123-133.

Kazius, J., et al. (2006) Substructure mining using elaborate chemical representation. Journal of Chemical Information and Modeling, 46(2), 597-605.

Published in Interface Bioinformatics, issue 3, May 2009