User:KVerspoor
From Open Annotation Collaboration
Karin Verspoor
Research Assistant Professor
Center of Computational Pharmacology
University of Colorado Denver
Use Case: Annotating the biomedical literature with OBO and linguistic concepts
Our work explores the challenges associated with achieving automated annotation of web content using techniques from natural language processing. We specifically focus on the scientific literature in the biomedical domain. We have several active projects that center on annotation of biomedical publications. These efforts will ultimately support both biological curators – people whose jobs involve formalizing facts from the unstructured biomedical literature into structured databases – and biological analysts directly. There is already extensive interest in annotation of scientific text, both from publishers such as Elsevier and from biological users themselves. The users struggle with the huge and ever-growing body of publications in the biomedical domain: over 2500 new publications are indexed in PubMed daily.
First, we are building a manually annotated corpus called CRAFT (Colorado Richly Annotated Full Text) of approximately 750k words of molecular biology (mouse genomics) texts derived from the PubMed Central Open Access subset, augmented with multiple layers of annotation ranging from full linguistic analyses (syntax parse trees) to several kinds of biological entities and concepts corresponding to terms in existing community ontologies from the Open Biomedical Ontologies Foundry (http://www.obofoundry.org/). This effort will additionally include co-reference annotation for entities and events, several relational constructs, and annotation of scientific discourse structure.
Second, our text mining efforts involve construction of automated tools that annotate and extract biological concepts, both entities such as genes and gene products and events such as protein-protein interactions. Our group has significant depth of expertise in Biomedical Natural Language Processing (BioNLP), as well as in the most widely used text mining platform in the BioNLP community, the Unstructured Information Management Architecture (UIMA, http://uima.apache.org/).
In addition, we are currently building a resource called KaBOB (the Knowledge Base of Biology), an RDF triple store that aims to link formal knowledge from a large number of biological databases. KaBOB will be used to enable reasoning over biological knowledge, to support deeper information extraction from the literature by using the background knowledge as a basis for interpreting new texts, and to enable knowledge integration of biological data at a scale not previously possible. We would ultimately like to support bi-directional dynamic interaction between the KaBOB resource and text mining: using the background knowledge to improve natural language understanding algorithms, and augmenting the background knowledge with new facts extracted from the texts.
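To make the triple-store idea concrete, here is a minimal sketch in plain Python. It is not KaBOB's actual schema or implementation: the `ex:` identifiers, the predicate names, and the `match` helper are all hypothetical, and a real deployment would use an RDF store and query language rather than an in-memory set. It only illustrates how records from two databases can be linked through a shared entity, and how a fact extracted from text can be added to the same store.

```python
# Hypothetical illustration only: every identifier below is a made-up
# placeholder, not a real KaBOB, OBO, or database identifier.

def match(store, s=None, p=None, o=None):
    """Return all (subject, predicate, object) triples matching the pattern.

    A value of None acts as a wildcard for that position.
    """
    return sorted(
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )

# Background knowledge: two database records that denote the same gene.
store = {
    ("ex:dbA/record1", "ex:denotes", "ex:gene/G1"),
    ("ex:dbB/record9", "ex:denotes", "ex:gene/G1"),
    ("ex:gene/G1", "ex:participatesIn", "ex:process/P1"),
}

# Knowledge integration: find every record linked to the shared gene.
linked = [s for s, _, _ in match(store, p="ex:denotes", o="ex:gene/G1")]

# Text mining feeding back into the store: add a newly extracted fact.
store.add(("ex:gene/G1", "ex:interactsWith", "ex:gene/G2"))
about_g1 = match(store, s="ex:gene/G1")
```

Keeping background knowledge and extracted facts in the same triple form is what would allow the two directions described above to share one representation.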
Because the biomedical literature comes in many source formats (PDF, PubMed Central XML, publisher- or journal-specific XML, HTML, even plain text in the case of abstracts of publications), our text annotation efforts currently produce stand-off annotations of documents stored in a local text repository. However, we are very interested in moving towards a model of annotation of the biomedical literature that would allow these annotations to be exposed to users of the publications through publishers' websites, and that would support integration of annotations derived from multiple sources over the same text. For instance, we have previously explored the use of consensus annotation for the problem of annotating gene mentions in text, and we have participated in the development of a system (U-compare, http://u-compare.org/) that explicitly considers multiple independent sources of annotations in parallel.
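As a rough illustration of stand-off annotation and of combining several annotation sources over the same text, consider the sketch below. It is not our UIMA pipeline or the U-compare implementation: the `Annotation` class, the tool names, the example sentence, and the simple majority-vote rule are assumptions made for the example. Each annotation refers to the text only by character offsets, so the source document is never modified.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Annotation:
    """A stand-off annotation: offsets into the text, not inline markup."""
    start: int    # offset of the first character
    end: int      # offset one past the last character
    concept: str  # label assigned to the span (hypothetical label set)
    source: str   # which annotator produced it (hypothetical names)

def covered_text(text, ann):
    """Recover the annotated string from the untouched source text."""
    return text[ann.start:ann.end]

def consensus(annotations, min_votes):
    """Keep (start, end, concept) spans proposed by at least min_votes sources."""
    unique = set(annotations)  # a source votes at most once per span
    votes = Counter((a.start, a.end, a.concept) for a in unique)
    return sorted(span for span, n in votes.items() if n >= min_votes)

text = "Brca1 interacts with Rad51 in mouse cells."

annotations = [
    Annotation(0, 5, "gene", "tool_a"),
    Annotation(21, 26, "gene", "tool_a"),
    Annotation(0, 5, "gene", "tool_b"),
    Annotation(0, 5, "gene", "tool_c"),
    Annotation(21, 26, "gene", "tool_c"),
    Annotation(30, 35, "gene", "tool_c"),  # only one tool proposes this span
]

agreed = consensus(annotations, min_votes=2)
mentions = [text[start:end] for start, end, _ in agreed]
```

Here the span proposed by only one tool is dropped, while the two spans that at least two tools agree on are kept. Real consensus schemes may also need to reconcile overlapping but non-identical spans, which this sketch does not attempt.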
We are currently working on integrating our off-line automatic annotation of documents, which is based on natural language processing within the UIMA framework, with web-based annotation efforts. We have begun this integration together with the developers of the Annotation Framework, using the Annotation Ontology, as our first attempt to make our UIMA-based annotations available in the context of the semantic web, and ultimately to enable humans to judge and revise the automatically generated annotations. We are interested in exploring the use of the OAC as well, both for comparison and to engage a broader community. We would ultimately like to settle on an annotation model that allows us to represent the full complexity of the text that we work with: from individual concepts, to linguistic structures, to complex event and discourse structures.