Vladimir Bajic

Linking text-mining, sequence information and gene expression for disease gene prediction

There are a number of successful developments of text-mining systems that support research in the biomedical field. However, very few systems utilize a combination of text-mining with the expression behaviour of genes, or with specific information based on sequence composition. If properly utilized, such combined information can significantly improve the accuracy of extracted information, and even generate new knowledge. In this talk I will present some of the recent developments in my laboratory aimed at narrowing down the candidate disease gene list by making strong use of three sources of relevant information: biomedical text-mining, promoter sequence composition and gene expression behaviour. Linkage analysis typically produces several hundred candidate disease genes that require further detailed functional analysis, which may in turn lead to identification of the relevant genes. The experiments with our method suggest that the candidate gene list can shrink to approximately 25% of its original size and still retain the correct disease genes in almost 90% of cases.

Vladimir Bajic is Extraordinary Professor of Bioinformatics at the South African National Bioinformatics Institute (SANBI) and Head of the Knowledge Extraction Lab of the Institute for Infocomm Research, Singapore. He is a RIKEN invited scientist for 2005. He is a member of the Academy of Nonlinear Sciences (Russia) and several other scientific and professional organizations. In the past, he was President of the SA Branch of the Academy of Nonlinear Sciences and Chairman of the Africa Chapter (SIG) of the International Neural Network Society (INNS). Vlad has authored or co-authored over 350 fully refereed research papers, edited volumes and software products. He was/is on the editorial boards of eight international journals, serves as a reviewer for many international publishers and governmental agencies, and has chaired and co-chaired several international conferences and workshops. Vlad is the designer of two world-leading systems for promoter recognition: Dragon Promoter Finder and Dragon Gene Start Finder, integrated into the TRANSPLORER package of Biobase, Germany. He designed a number of unique systems that efficiently discover estrogen target genes and reveal their functional properties by in silico methods; the fastest system for finding short DNA motifs by ab-initio methods; a series of text-mining tools for biomedical literature and for the generation of scientific hypotheses; and several integrated bioinformatics systems such as Dragon TF Integrator, Dragon Regulome (HS) database, etc. Several wet-lab experiments have confirmed biologically active sites predicted by systems he has designed, including a protein marker for early detection of epithelial ovarian cancer.


Nigel Collier

What's in a name?
(Presentation material)

It is now widely accepted that computational support is necessary to aid researchers in accessing and synthesizing the knowledge held in the vast repository of unstructured scientific texts that forms what has been called the 'bibliome'. While scientists have a clear idea of what they want to do, they are increasingly overwhelmed by the volume of electronic literature they have to access, especially as searching often needs to be done in related but unfamiliar fields. Creating solutions to these problems has brought together researchers not only from natural language processing (NLP) and molecular and systems biology but also terminologists, philosophers and linguists to address the problems of defining meaning and then mapping form to meaning. In this talk I will present a review of progress over the last eight years in terminology identification and classification, one of the fundamental tasks in text mining the bibliome and a subject of intensive investigation throughout that period. Known commonly as the named entity recognition (NER) task, it targets the names of genes and gene products and, as shown in shared task evaluations such as BioCreative and JNLPBA, has typically been approached with a mixture of rule-based and machine learning techniques utilizing large-scale biomedical terminology resources and ontologies. By systematic analysis of features we have found that NER in the biomedical domain is surprisingly difficult compared to its application in the news domain, and I would like to conclude by offering some thoughts on why this might be so and by reflecting on the wider implications of our experience in NER for text mining in biomedicine as a whole.

Nigel Collier is Associate Professor in the Information Foundations Division at the National Institute of Informatics (NII) in Japan. He has been actively involved in Natural Language Processing and Knowledge Acquisition for over ten years. Before coming to NII he received a BSc. in Computer Science from Leeds University (UK) in 1992, an MSc. in Machine Translation from UMIST (UK) in 1994 and a Ph.D. in Language Engineering from UMIST in 1996. From 1996 to 1998 he was a Toshiba Fellow working at their human interface laboratories on knowledge acquisition for machine translation, and from 1998 to 2000 he worked on information extraction in the GENIA project at the University of Tokyo as a JSPS research associate. His current research interests include semantic interpretation of biomedical texts using empirical corpus-based methods, and exploring the integration of the shallow semantics typically used in text mining with the deeper representations that are possible in ontologies. He currently leads the ZAISA project, which aims to develop robust discourse-level analysis of scientific texts through rhetorical zone annotation.


Udo Hahn
Jena University

Language Processing in Super-Hard Domains: Lessons from Biology and Medicine

Large-scale application-driven natural language processing systems have typically used newspaper or newswire data as textual input for tasks such as document retrieval, information extraction, text-based question answering or text summarization. When documents from the life science domain are dealt with, however, we systematically observe significant performance drops for virtually all of these applications in various evaluation scenarios. Why is this so? In this keynote, evidence is presented that the problems we encounter for natural language processors running on biomedical input can mainly be attributed to inherent super-hardness properties of the language used in, and the background knowledge referred to in, the life science domain and, to a lesser degree (and not so surprisingly), to the unbalanced amounts of effort language technologists have spent on building up comprehensive domain-specific resources and tools for the biological and medical domain.

Udo Hahn is currently a Full Professor of Computational Linguistics at the Jena University Language and Information Engineering (JULIE) Lab in Germany (http://www.coling.uni-jena.de) and one of the initiators of the German Virtual Centre for Text Mining in Biomedicine (http://www.biotem.org). He was affiliated with Freiburg University as an Associate Professor for more than a decade and with Passau University as an Assistant Professor for three years. He received his Ph.D. in Information Science from Constance University, and his Master's Degree in Linguistics from Mainz University. His research interests include various forms of automatic text analysis (information extraction, text mining, text summarization, cross-language document retrieval) and the methodologies required for such tasks (language technology, ontology engineering, knowledge representation and automatic knowledge acquisition). He has been working on biomedical language processing and knowledge management for almost a decade.


Satoshi Kobayashi

Tree Grammars for RNA Secondary Structure Analysis

Non-coding RNAs (ncRNAs) are functional RNAs which do not code for protein. Recent discoveries have revealed ncRNAs to be much more numerous than previously thought and to play significant roles. Grammatical devices have been shown to be quite useful for finding new members of an ncRNA gene family in a large genome database. Stochastic context-free grammars are one such important grammatical device; they were applied to the modeling of tRNA gene families by Eddy and Durbin and also, independently, by Sakakibara et al. Eddy's method automatically constructs a stochastic context-free grammar from a given multiple alignment of a gene family with secondary structure annotation. However, it is known that context-free grammars cannot model pseudo-knotted structures. Motivated by the importance of pseudo-knots in the function of ncRNAs, several grammatical devices have been proposed to model RNA secondary structures including pseudo-knots. Tree grammars are one of the candidates for such modeling. This talk reviews the tree grammatical approach to RNA secondary structure analysis.
In this talk, the author first introduces a subclass, TAGRNA, of tree adjoining grammar (TAG) proposed by the author and his co-workers. It is a TAG specialized for modeling RNA secondary structure. We discuss its generative capacity and compare it to other approaches. Then, the author reports the current state of the art of the technologies concerning (tree) grammatical approaches to RNA secondary structure analysis. The author will discuss (1) the relationship between RNA secondary structure prediction and grammar parsing, (2) the relationship between the alignment of RNA secondary structures and that of derivation processes of grammars, (3) the automatic generation of a tree grammar from an RNA multiple alignment with secondary structure annotation, and (4) the problem of approximating tree grammars by regular grammars.
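To make the pseudo-knot limitation concrete, here is a minimal sketch, not taken from the talk, that checks whether a set of base pairs contains a crossing. It assumes a structure is given as (i, j) index pairs; the function name `has_pseudoknot` is hypothetical. Nested or side-by-side pairs are the kind of structure a context-free grammar can describe, whereas crossing pairs (i < k < j < l) are what is called a pseudo-knot here.

```python
from itertools import combinations
from typing import Iterable, Tuple


def has_pseudoknot(pairs: Iterable[Tuple[int, int]]) -> bool:
    """Return True if any two base pairs cross, i.e. i < k < j < l.

    `pairs` holds (i, j) sequence positions of paired bases; this input
    representation is an assumption made for illustration.
    """
    # Put each pair in (low, high) order, then sort by the opening position
    # so that the first pair of every combination opens no later than the second.
    normalized = sorted((min(p), max(p)) for p in pairs)
    for (i, j), (k, l) in combinations(normalized, 2):
        if i < k < j < l:
            return True
    return False


if __name__ == "__main__":
    nested = [(0, 9), (1, 8), (2, 7)]   # a simple hairpin stem
    crossing = [(0, 5), (3, 8)]         # the two pairs interleave
    print(has_pseudoknot(nested), has_pseudoknot(crossing))
```

The check is quadratic in the number of pairs, which is adequate for illustration but not tuned for long sequences.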

Satoshi Kobayashi has been associated with the Univ. of Electro-Communications since 2000. He received his Dr.E. degree from the University of Tokyo in 1993. His research interests include computational learning theory, grammatical inference, theory of molecular computation, and RNA secondary structure analysis using grammars.


Gary Geunbae Lee

A Bio Text Mining Workbench combined with Active Machine Learning
(Presentation material)

Named-entity recognition is one of the most elementary and core problems in biomedical information management and text mining. In this paper we introduce the POSBIOTM/NER workbench, which is intended to assist biologists in effectively and efficiently annotating biomedical named entities in text for future processing applications. The workbench consists of a suite of tools for gathering, managing, analyzing and annotating biomedical texts. We employ the Conditional Random Fields (Lafferty et al. 2001) machine learning technique to automatically identify different sets of biological entities in the text. To enhance the recognition performance through machine learning, we use the workbench idea of re-using the annotated data and re-training the NER (Named Entity Recognition) module. To minimize the human labeling effort to a larger extent, we employ the active learning method to select the most informative samples for training. We propose a new active learning paradigm which considers not only the uncertainty of the classifier but also the diversity of the corpus. A three-level hierarchy is used to represent the sentence structure, and the sentence structure similarity is measured to represent the diversity. This paper also presents an empirical study on measuring the effectiveness of training example selections in the active learning paradigm for the POSBIOTM/NER workbench.
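The idea of selecting samples by both uncertainty and diversity can be sketched as a greedy batch selection. This is only an illustration under stated assumptions, not the POSBIOTM/NER method: the uncertainty scores are taken as given, the token-set Jaccard similarity is a simple stand-in for the three-level sentence structure similarity described above, and the names `select_batch`, `jaccard` and `weight` are hypothetical.

```python
from typing import List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Token-set overlap; a stand-in for a real sentence-structure similarity."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def select_batch(sentences: List[str], uncertainty: List[float],
                 k: int, weight: float = 0.5) -> List[int]:
    """Greedily pick up to k sentence indices for annotation.

    Each candidate is scored by its uncertainty minus its similarity to
    the closest sentence already chosen, so near-duplicates are penalized.
    `weight` (assumed in [0, 1]) balances the two terms.
    """
    tokens = [set(s.lower().split()) for s in sentences]
    chosen: List[int] = []
    candidates = set(range(len(sentences)))
    while candidates and len(chosen) < k:
        def score(i: int) -> float:
            redundancy = max((jaccard(tokens[i], tokens[j]) for j in chosen),
                             default=0.0)
            return weight * uncertainty[i] - (1.0 - weight) * redundancy
        # Iterate in index order so ties are broken deterministically.
        best = max(sorted(candidates), key=score)
        chosen.append(best)
        candidates.remove(best)
    return chosen


if __name__ == "__main__":
    pool = ["p53 binds mdm2", "p53 binds mdm2 strongly", "il-2 activates t cells"]
    scores = [0.9, 0.85, 0.6]  # hypothetical classifier uncertainties
    print(select_batch(pool, scores, k=2))
```

With `weight=1.0` the selection reduces to plain uncertainty sampling; lowering it favours sentences that differ from those already chosen.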

Gary Geunbae Lee received his B.S. and M.S. degrees in Computer Engineering from Seoul National University in 1984 and 1986 respectively. He received his Ph.D. in Computer Science from UCLA in 1991 and was a research scientist at UCLA. At POSTECH, he was an assistant professor from 1992 to 1996 and an associate professor from 1997; in 2004, he was promoted to full professor. He has authored more than 100 papers in international journals and conferences and has served as a technical committee member and reviewer for several international conferences such as ACL, COLING, ACM SIGIR, IRAL, EMNLP, etc. His current research interests include natural language processing, biological text mining, spoken language understanding, and TTS systems.


See-Kiong Ng

BioContrasts: Protein-Protein Contrasts Extracted from the Biomedical Literature for Knowledge Discovery

Contrasts are useful conceptual vehicles for learning processes and exploratory research of the unknown. For example, contrastive information between proteins can reveal what similarities, divergences, and relations there are between the two proteins, leading to invaluable insights for a better understanding of the proteins. Such contrastive information is reported in the biomedical literature. However, there have been no reported attempts in current biomedical text mining work to systematically extract and present such useful contrastive information from the literature for knowledge discovery. In a joint work between I2R and KAIST, we have developed a BioContrasts system that extracts protein-protein contrastive information from MEDLINE abstracts and presents the information to biologists in a web application for knowledge discovery. Contrastive information is identified in the text abstracts with contrastive negation patterns such as "A but not B". A total of 799,169 pairs of contrastive expressions were successfully extracted from a 2.5 million-abstract corpus. By grounding contrastive protein names to Swiss-Prot entries, we were able to produce 41,471 contrasts between Swiss-Prot protein entries. This contrastive information is then presented via a user-friendly interactive web portal for knowledge discovery, such as refining the functional roles of similar proteins in biological pathways.
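A contrastive negation pattern such as "A but not B" can be illustrated with a small regular expression. This is a rough sketch and not the BioContrasts implementation: it assumes each name is a single alphanumeric token (hyphens allowed), does no protein name recognition or Swiss-Prot grounding, and the names `NAME`, `PATTERN` and `extract_contrasts` are hypothetical.

```python
import re
from typing import List, Tuple

# Assumption: a "name" is one token starting with a letter, e.g. NAT1 or IL-2.
NAME = r"[A-Za-z][A-Za-z0-9\-]*"
PATTERN = re.compile(rf"\b({NAME}) but not ({NAME})\b")


def extract_contrasts(text: str) -> List[Tuple[str, str]]:
    """Return (A, B) pairs for every "A but not B" expression in the text."""
    return PATTERN.findall(text)


if __name__ == "__main__":
    sentence = "NAT1 but not NAT2 was detected, and cells expressed IL-2 but not IL-4."
    for a, b in extract_contrasts(sentence):
        print(a, "<->", b)
```

A surface pattern like this also matches ordinary words ("present but not active"), so a practical system would still have to decide which matched tokens are protein names.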

See-Kiong Ng is currently the Department Manager of the Knowledge Discovery Department at the Institute for Infocomm Research, Singapore. He is also Adjunct Associate Professor at the School of Computer Engineering, Nanyang Technological University, Singapore. See-Kiong obtained both his BS and PhD degrees in Computer Science from Carnegie Mellon University (Pittsburgh, Pennsylvania, USA) and his master's degree from the University of Pennsylvania (Philadelphia, Pennsylvania, USA). In terms of research, See-Kiong is currently interested in unravelling the underlying functional mechanisms of protein interaction networks, using computational techniques from the diverse research areas of machine learning, information extraction, and natural language processing.


Alfonso Valencia
C.N.B. - C.S.I.C.

Text mining in Bioinformatics
(Presentation material)

Retrieving information from text has become an important area in bioinformatics and, not too surprisingly, all the main journals and conferences have published a considerable number of papers in this area.
Many of the requirements of text mining in molecular biology are similar to those in other areas of Bioinformatics:
- Availability of high quality input information
- A set of objective metrics for the comparison of different methods
- The need to develop systems useful for the experimental biologist, keeping the focus on developing applications that are suitable for end users and biological databases.
These issues have been addressed in other fields of bioinformatics such as protein structure prediction, in part thanks to the organization of CASP (Critical Assessment of techniques for protein Structure Prediction) during the last 10 years.
I will discuss two developments in which my group has been involved and which address the needs mentioned above: the BioCreAtIvE competition and the development of the public iHOP system.

Alfonso Valencia is a Full Professor at CNB-CSIC. He has also been Executive Editor of the journal Bioinformatics (co-executive with Alex Bateman) since July 2004. His main scientific interest is how proteins adapt to perform specific functions, and in particular, how they co-operate to perform biological functions. The approach is closely related to the study of the processes of evolution, especially the co-evolution of molecular systems.


Limsoon Wong

Protein Function Prediction from Protein Interactions
(Presentation material)

Protein interaction information has been one of the major sources of motivation in the application of natural language processing and text mining technologies to biology and medicine. What can protein interaction information be used for? In this talk, I will describe some uses, in particular the inference of protein functions.

Limsoon Wong has recently joined the National University of Singapore as a professor. Prior to his present appointment, he served the A*STAR Institute for Infocomm Research for 17 years, rising to the position of Deputy Executive Director. He is currently working mostly on knowledge discovery technologies and is especially interested in their application to biomedicine. Limsoon has written about 100 research papers, a few of which are among the best cited of their respective fields. He serves on the editorial boards of several journals, including Bioinformatics (OUP), JBCB (ICP), and Drug Discovery Today (Elsevier). He is also a scientific advisor to a number of companies in the US, India, and Singapore. He received his BSc(Eng) from Imperial College London and his PhD from the University of Pennsylvania.