Curation for Human Health and Nutrition

Odd number posters will be presented on Monday, 8th April and even numbered posters on Tuesday, 9th April.

Posters 94 - 110.

94 COSMIC: integrating and interpreting the world's knowledge of somatic mutations in cancer

Ponting, Laura
Wellcome Trust Sanger Institute

COSMIC, the Catalogue Of Somatic Mutations In Cancer (http://cancer.sanger.ac.uk), is a comprehensive online resource for exploring the full range of curated genetic mutations across all human cancers. Our latest release (November 2018) describes nearly 6 million coding point mutations in approximately 1.5 million tumour samples, across most human genes. Expert, high-quality, detailed manual curation combined with data from large-scale sequencing studies gives COSMIC an unrivalled breadth and depth of coverage. Annotation of cancer phenotypes enables researchers and clinicians to determine the distribution of specific mutations across different tumour types. Curation of numerous other clinical data points, relating to the individual, tumour and screened sample, provides further context for studying the landscape of mutations in cancer. Website tools such as the Cancer and Genome Browsers facilitate exploration of the disease and mutation data, respectively. Additionally, defined mutation subsets, including gene fusions and drug resistance mutations, are highlighted. This core of COSMIC data is enhanced with several specialised projects. The Cancer Gene Census provides a curated catalogue of over 700 genes driving every form of human cancer, and Hallmarks of Cancer describes the functional changes in cells in terms of ten biological processes that drive cancer. COSMIC-3D, developed with Astex Pharmaceuticals, allows analysis of cancer mutations in the context of three-dimensional protein structure, with implications for druggability. COSMIC aims to be an ever-expanding, relevant resource, supporting and committed to the global cancer research community. The curated COSMIC data can be freely accessed via the COSMIC website, while download files and database dumps are also free to academia and available under licence to industry.

 

95 The BioGRID Interaction Database: Curation of Genetic, Protein and Chemical Interactions and Post-Translational Modifications

Boucher, Lorrie
Lunenfeld-Tanenbaum Research Institute

The Biological General Repository for Interaction Datasets (BioGRID) is an open-access resource that curates and freely disseminates protein, genetic and chemical interaction data for all major model organisms and humans. As of December 2018, BioGRID contains over 1,650,600 biological interactions manually curated from more than 57,650 publications for 71 species. Curation is undertaken at a species level and at a themed project level. Complete coverage of the literature for Saccharomyces cerevisiae is maintained and now covers 738,430 interactions. Collaborations with PomBase, WormBase, FlyBase and other model organism databases help minimize curation redundancy and improve literature coverage. For example, a recent collaboration with the Bio-Analytic Resource for Plants (BAR) database has increased the collection of Arabidopsis thaliana records in BioGRID to a total of ~56,000 interactions. Project-level curation is used to achieve deep coverage in specific areas of biological interest. In a pilot project, the interactions for all kinases and phosphatases in S. cerevisiae have been consolidated as a unified dataset of 97,397 interactions and 3,853 post-translational modifications (PTMs) curated from over 4,700 publications (see yeastkinome.thebiogrid.org). In another major project, BioGRID has systematically curated >220,000 interactions for >1200 ubiquitin-proteasome system (UPS) genes/proteins in humans, as well as for the UPS of S. cerevisiae. The UPS project has also captured >100,000 sites of ubiquitin modification and 141 chemical inhibitor-UPS enzyme relationships. Chemical-protein interactions for many other human drug targets have also been drawn from DrugBank and other databases, as well as directly curated from the literature. All data in BioGRID are made freely available through the BioGRID website (thebiogrid.org) and via model organism database and meta-database partners. Supported by the NIH Office of Research Infrastructure Programs [R01OD010929 to M.T., K.D.].
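As a brief illustration of programmatic access to these data, the sketch below queries the BioGRID REST webservice for interactions of a single yeast kinase. The endpoint path, parameter names and response fields are assumptions based on the public webservice at webservice.thebiogrid.org and are not described in the abstract; a free access key must be requested from the BioGRID website.

import requests

ACCESS_KEY = "YOUR_ACCESS_KEY"  # placeholder; request a free key from thebiogrid.org

params = {
    "accessKey": ACCESS_KEY,
    "geneList": "CDC28",        # example S. cerevisiae kinase
    "searchNames": "true",
    "taxId": 559292,            # S. cerevisiae S288C
    "format": "json",
}
resp = requests.get("https://webservice.thebiogrid.org/interactions/", params=params)
resp.raise_for_status()

# The JSON response is keyed by BioGRID interaction ID; print a few records
for interaction_id, rec in list(resp.json().items())[:5]:
    print(rec["OFFICIAL_SYMBOL_A"], "--", rec["OFFICIAL_SYMBOL_B"],
          "|", rec["EXPERIMENTAL_SYSTEM"])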

 

96 The Utilization of Public Health Services and Its Influence Factors among Migrants in China’s Cities

Cui, Jiawei
Institute of Medical Information & Library, Chinese Academy of Medical Sciences

With the implementation of Chinese economic reform and rapid urbanization, restrictions on migration have been loosened, resulting in large-scale migration across China. By studying the utilization of public health services among migrants and exploring its main influencing factors, we can provide a scientific basis for improving the utilization rate of public health services among migrants, promoting the equalization of basic public health services, and strengthening migrant health. The data used in this study are from the 2016 China Migrants Dynamic Survey (CMDS). Based on the “2016 China City Business Charm Ranking”, we divided the data into four samples according to city level and quantitatively analyzed the differences in individual, family, economic, social and migration characteristics, and in the utilization of public health services, among the migrants in each sample. For continuous and categorical variables, p values were calculated using Student's t test and the chi-square test, respectively. Logistic regression was employed to estimate the association of public health service utilization with each kind of characteristic. The results show significant differences in the population characteristics of migrants across cities of different levels. The above five characteristics all strongly influence migrants' utilization of public health services, with different factors exerting different intensities of influence, such as education level, number of household members in the inflow area, employment status and migration range. Based on a dissection of the reasons for these influences, we propose the following recommendations: transform the service conception, improve supply capacity and expand population coverage; innovate working methods, take advantage of new media and utilize Internet resources; and perfect existing systems, achieve basic guarantees and promote diversification.
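To make the analysis pipeline concrete, the sketch below applies the three statistical steps named above (Student's t test, chi-square test, logistic regression) to synthetic data. The variable names and data are hypothetical placeholders and do not reproduce the CMDS analysis.

import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "used_phs": rng.integers(0, 2, n),         # 1 = used public health services
    "education_years": rng.normal(10, 3, n),
    "employed": rng.integers(0, 2, n),
    "migration_range": rng.integers(1, 4, n),  # e.g. 1 = intra-city ... 3 = inter-province
})

# Continuous variable: compare education between users and non-users (t test)
t_stat, p_t = stats.ttest_ind(df.loc[df.used_phs == 1, "education_years"],
                              df.loc[df.used_phs == 0, "education_years"])

# Categorical variable: employment status vs utilization (chi-square test)
chi2, p_chi, _, _ = stats.chi2_contingency(pd.crosstab(df.employed, df.used_phs))

# Logistic regression: association of each characteristic with utilization
X = sm.add_constant(df[["education_years", "employed", "migration_range"]])
model = sm.Logit(df["used_phs"], X).fit(disp=0)

print(p_t, p_chi)
print(model.params)  # log-odds coefficients for each characteristic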

 

97 Curating with the Clinical Community: Gene Panel Annotation in PanelApp

Foulger, Rebecca
Genomics England

Analysis of whole genome sequences from participants in the 100,000 Genomes Project, by the Bioinformatics Team at Genomics England, includes the use of virtual gene panels comprised of genes with diagnostic-grade evidence for causation of rare disease or cancer. Panels are curated in PanelApp (https://panelapp.genomicsengland.co.uk), a publicly available knowledge base reviewed by >200 international experts. The >190 high-quality panels can be queried directly in PanelApp or through the API. PanelApp data is integrated into OpenTargets and the community variant project VarSome. DECIPHER provides links to PanelApp gene pages, and Ensembl's Variant Effect Predictor now also accepts PanelApp data. PanelApp links out to identifiers in Ensembl, Gene2Phenotype, ClinVar and OMIM, and from our curation >40% of OMIM phenotypes are captured for diagnostic-grade genes on panels from the 100,000 Genomes Project. The PanelApp team is active in the curation community. Genomics England is a driver project of the Global Alliance for Genomics and Health (GA4GH), and as part of this we have collaborated with the Australian Genomics group to harmonise curation efforts and make PanelApp open source. We are active members of the Gene Curation Coalition, set up to standardise gene-disease curation. Aiding these efforts, PanelApp annotations and reviews are publicly visible to ensure transparency of underlying evidence and promote open discussion. We have recently developed new PanelApp features: improved tracking of panel changes, Panel Types to denote projects, and Super Panels, where larger gene panels are created from existing child panels. PanelApp has also been extended to allow curation of Short Tandem Repeats (STRs) and Copy Number Variants (CNVs), expanding genomic content for diagnostic analysis and providing an advantage over traditional gene-only panels. These additional curation features are critical for PanelApp to form a key part of the new NHS Genomic Medicine Service.
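The sketch below shows one way the API mentioned above might be queried to list panels and retrieve diagnostic-grade genes. The endpoint paths and response fields are assumptions based on the public v1 API (https://panelapp.genomicsengland.co.uk/api/v1/); consult the API documentation before relying on them.

import requests

BASE = "https://panelapp.genomicsengland.co.uk/api/v1"

# List available panels (the response is paginated)
panels = requests.get(f"{BASE}/panels/").json()
for p in panels.get("results", [])[:5]:
    print(p["id"], p["name"], p["version"])

# Fetch one panel and list its diagnostic-grade ("green") genes
panel_id = panels["results"][0]["id"]
panel = requests.get(f"{BASE}/panels/{panel_id}/").json()
green_genes = [g["gene_data"]["gene_symbol"]
               for g in panel.get("genes", [])
               if g.get("confidence_level") == "3"]  # assumed: "3" = green/diagnostic grade
print(len(green_genes), "green genes on panel", panel_id)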

 

98 OncoMX: a cancer biomarker resource leveraging published literature and genomics data

Holmes, Evan
George Washington University

The massive, multiform datasets generated by cancer genomics studies are invaluable resources for the scientific community. However, the size and heterogeneity of these datasets present challenges to interpreting biological significance across datasets. A variety of technical characteristics, including file formats, attribute names and reference data, often differ between databases. While these distinctive technical characteristics are designed to facilitate highly specific research, they hamper re-use in more general studies. OncoMX is designed to broadly support the scientific community by unifying datasets to remove such challenges. The four major use-case perspectives driving the OncoMX web portal interface development are (1) exploration of cancer biomarkers, (2) evaluation of mutations and expression in an evolutionary context, (3) side-by-side exploration of published literature-mined data for mutations and expression in cancer, and (4) exploration of a specific gene or biomarker within a pathway context. To this end, OncoMX integrates and unifies sequence-based mutation data from BioMuta, cancer/normal differential expression data from BioXpress, normal expression data across organisms from Bgee, literature mining evidence for mutation and expression in cancer with DiMeX and DEXTER, biomarker data from EDRN, and pathway data from Reactome. Data integrated in OncoMX will be supplemented by functional annotations, scRNA-seq data, and additional FDA-approved biomarker data, as available. OncoMX is a collaboration between The George Washington University, NASA's Jet Propulsion Laboratory, the SIB Swiss Institute of Bioinformatics, and the University of Delaware. The multifarious research foci of this international collaboration support diverse end-user research interests, ultimately shaping the OncoMX data model and web portal. OncoMX is therefore projected to broadly support cancer research through relevant cancer biomarker data aggregation.

 

99 Abstract withdrawn

 

100 EWAS Atlas: a curated knowledgebase of epigenome-wide association studies

Li, Mengwei
Beijing Institute of Genomics

Epigenome-Wide Association Studies (EWAS) have become increasingly significant in identifying associations between epigenetic variations and different biological traits. In this study, we develop EWAS Atlas (http://bigd.big.ac.cn/ewas), a curated knowledgebase that provides a comprehensive collection of EWAS knowledge. Unlike extant data-oriented epigenetic resources, EWAS Atlas features manual curation of EWAS knowledge from extensive publications. In the current implementation, EWAS Atlas focuses on DNA methylation, one of the key epigenetic marks; it integrates 299,016 high-quality EWAS associations, involving 113 tissues/cell lines and covering 293 traits, 1,914 cohorts and 390 ontology entities, all based on manual curation of 713 studies reported in 414 publications. In addition, it is equipped with a powerful trait enrichment analysis tool, which is capable of profiling trait-trait and trait-epigenome relationships. Future developments include regular curation of recent EWAS publications, incorporation of more epigenetic marks and possible integration of EWAS with GWAS. Collectively, EWAS Atlas is dedicated to the curation, integration and standardization of EWAS knowledge and has great potential to help researchers dissect the molecular mechanisms of epigenetic modifications associated with biological traits.

 

101 UniProtKB and Alzheimer’s Disease: Linking molecular defects to disease phenotype

Lussi, Yvonne
European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI)

Alzheimer's disease (AD) is a progressive, neurodegenerative disease and the most common form of dementia in elderly people. Linkage analysis was the first milestone in unravelling the mutations in APP, PSEN1 and PSEN2 that cause early-onset AD. The development of next-generation sequencing methods over the last decade has increased the detection of genetic variants and the identification of disease-associated genes. However, establishing the relationship between variants and disease phenotype remains a challenge. In this context, UniProtKB aims to link genetic and medical resources to protein sequences and associated biological knowledge. In a joint effort across the UniProt sites, over 180 proteins have been identified, by text mining and by input from experts in the field, as associated with AD. By manual curation, information from the peer-reviewed literature will be annotated on molecular function and involvement in disease, including disease-associated variant positions and variant characterization. By focusing our curation efforts on proteins involved in AD, we hope to shed light on the mechanisms leading to this devastating disease. We focus on a thorough review of available information on sequence variants and associated AD information, as well as the normal function of proteins associated with the disease. The information on variants, together with variant functional descriptions, protein molecular function, structural data and protein-protein interactions, should help researchers in the field of neurodegeneration, clinicians and biomedical researchers gain a global view of the relation between variants and disease and help elucidate disease mechanisms.

 

102 Atlas of Cancer Signaling Network: a resource of multi-scale biological maps to study disease mechanisms

Monraz Gomez, Luis Cristobal
Institut Curie

We present here the second edition of the Atlas of Cancer Signaling Network (ACSN 2.0, https://acsn.curie.fr). ACSN is a web-based resource of multi-scale biological maps depicting molecular processes in the cancer cell and the tumor microenvironment. The core of the Atlas is a set of interconnected cancer-related signaling and metabolic network maps. Molecular mechanisms are depicted on the maps at the level of biochemical interactions, forming a large seamless network of over 8,000 reactions covering close to 3,000 proteins and 800 genes, based on more than 4,500 scientific publications. Constructing and updating ACSN involves careful manual curation of the molecular biology literature and the participation of experts in the corresponding fields. The maps of ACSN 2.0 are interconnected, and the regulatory loops within the cancer cell and between the cancer cell and the tumor microenvironment are systematically depicted. The cross-talk between signaling mechanisms and metabolic processes in cancer cells is explicitly depicted thanks to a new feature of the Atlas: ACSN 2.0 is now connected to the RECON metabolic network, the largest graphical representation of human metabolism. The Atlas is a "geographic-like" interactive "world map" of the molecular interactions leading to the hallmarks of cancer as described by Hanahan and Weinberg. The Atlas is created using systems biology standards and is amenable to computational analysis. As of today, ACSN 2.0 is composed of 13 comprehensive maps of molecular interactions: six maps cover signaling processes in the cancer cell and four maps describe the tumor microenvironment. In addition, there are three cell type-specific maps describing signaling within the different cell types that frequently surround and interact with cancer cells. This feature of ACSN 2.0 reflects the complexity of the tumor microenvironment. The resource includes tools for map navigation, visualization and analysis of molecular data in the context of signaling network maps.

 

103 Large scale variant annotation in UniProt and tools for interpreting the molecular mechanisms of disease

Nightingale, Andrew
EMBL-EBI

Understanding the molecular mechanism(s) of a disease is an essential step towards the development of a cure. The origin of a disease is now routinely investigated using genome sequencing technologies that provide good evidence for the genetics underlying a disease but cannot easily be related to the molecular consequences, especially if the driving genetic variant is in a non-coding region of the genome. UniProt aims to support the scientific community, computational biologists and clinical researchers by providing comprehensive, high-quality functional annotation. This includes a catalogue of protein-altering variation data coupled with information about how these variants affect protein function. These variant data are provided either as expert-reviewed annotations curated from the scientific literature or as large-scale variation data imported from a variety of sources including ClinVar, TCGA, COSMIC, ExAC and ESP. To facilitate access, visualization and interpretation of variant data, UniProt provides a suite of tools for programmatic access (the Proteins REST API) and a track-based graphical protein sequence viewer (ProtVista). These tools are freely available for integration into external tools and interpretation workflows such as the Protein Variant Effect Predictor, a collaboration between UniProt, Ensembl, PDBe and the Janet Thornton research group. Here we illustrate these tools and services and how they can support clinical researchers by enhancing understanding of the link between variation and protein function.
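As a brief illustration of the programmatic access mentioned above, the sketch below retrieves protein-level variants for one accession. The endpoint path and response field names are assumptions based on the public EBI Proteins API (https://www.ebi.ac.uk/proteins/api/doc/); check the documentation before relying on them.

import requests

accession = "P05067"  # example: amyloid-beta precursor protein (APP)
resp = requests.get(
    f"https://www.ebi.ac.uk/proteins/api/variation/{accession}",
    headers={"Accept": "application/json"},
)
resp.raise_for_status()

# Each feature describes one variant: position, amino acid change, cross-references
for feature in resp.json().get("features", [])[:10]:
    print(feature.get("begin"),
          feature.get("wildType"), "->", feature.get("alternativeSequence"),
          "|", [x.get("name") for x in feature.get("xrefs", [])][:2])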

 

104 PathoPhenoDB: linking human pathogens to their disease phenotypes in support of infectious disease research

Schofield, Paul
University of Cambridge

Understanding the relationship between the pathophysiology of infectious disease, the biology of the causative agent and the development of therapeutic and diagnostic approaches depends on the synthesis of a wide range of types of information. A comprehensive and integrated disease phenotype knowledgebase has the potential to provide novel and orthogonal sources of information for understanding infectious agent pathogenesis, and to support research on disease mechanisms. We have developed PathoPhenoDB, a database containing pathogen-to-phenotype associations. PathoPhenoDB relies on manual curation of pathogen-disease relations, and on ontology-based text mining combined with manual curation to associate phenotypes with infectious diseases. Using Semantic Web technologies, PathoPhenoDB also links to knowledge about drug resistance mechanisms and drugs used in the treatment of infectious diseases. PathoPhenoDB is accessible at http://patho.phenomebrowser.net/, and the data are freely available through a public SPARQL endpoint.
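The sketch below shows the general pattern of querying a SPARQL endpoint such as the one mentioned above. The endpoint path and the graph vocabulary (prefixes and predicates) are hypothetical placeholders; the actual endpoint and schema are documented at http://patho.phenomebrowser.net/.

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://patho.phenomebrowser.net/sparql")  # assumed endpoint path
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    PREFIX ex: <http://example.org/pathophenodb/>   # placeholder vocabulary
    SELECT ?pathogen ?phenotype WHERE {
        ?pathogen ex:hasPhenotype ?phenotype .       # placeholder predicate
    } LIMIT 10
""")

# Print pathogen-phenotype pairs returned by the endpoint
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["pathogen"]["value"], row["phenotype"]["value"])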

 

105 BioModels, a repository of curated mathematical models

Sheriff, Rahuman
EMBL-EBI

Computational modelling of biological processes has become increasingly common in biological research. Models of cell signalling, metabolic and gene regulatory networks have been shown to divulge mechanistic insight into cellular regulation. To provide a platform supporting universal sharing, easy accessibility and model reproducibility, BioModels (www.ebi.ac.uk/biomodels/), a repository for mathematical models, was established in 2006 (Chelliah et al. 2015; Glont et al. 2018). Models submitted to BioModels are curated to verify the computational representation of the biological process and the reproducibility of the simulation results in the reference publication. With gradual growth in content over the years, BioModels currently hosts over 2,000 models from the published literature, in addition to patient-derived genome-scale metabolic models and computationally generated path2models models. With over 700 curated models, BioModels has become the world's largest repository of curated models and has emerged as the third most used data resource, after PubMed and Google Scholar, among scientists who use modelling in their research (Stanford et al. 2015; Szigeti et al. 2018). Models are encoded in the standard SBML format and annotated with controlled vocabularies following the MIRIAM guidelines (Le Novère et al. 2005). Model entities are cross-referenced to several data resources (such as UniProt, Ensembl gene and taxonomy) as well as ontologies (such as the Gene Ontology, ChEBI, the Mathematical Modelling Ontology, the Systems Biology Ontology and the Brenda Tissue Ontology). Targeted curations are carried out to enrich literature-based models on diabetes (Lloret-Villas et al. 2017), neurodegenerative diseases (Ajmera et al. 2013), blood coagulation, cell cycle and immuno-oncology. Thus, BioModels benefits modellers by providing access to reliable and semantically enriched curated models in standard formats that are easy to share, reproduce and reuse.
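To illustrate how a downloaded BioModels entry in SBML format can be inspected programmatically, the sketch below uses the python-libsbml bindings; the file name is a placeholder for any model exported from the repository.

import libsbml

doc = libsbml.readSBMLFromFile("BIOMD0000000001.xml")  # placeholder file name
if doc.getNumErrors() > 0:
    doc.printErrors()

model = doc.getModel()
print("Species:  ", model.getNumSpecies())
print("Reactions:", model.getNumReactions())

# Inspect the MIRIAM-style annotation (cross-references) on the first species
species = model.getSpecies(0)
print(species.getId(), species.getName())
print(species.getAnnotationString()[:200])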

 

106 CCDR: a Corpus for Chemical Disease Semantic Relations in Chinese

Sun, Yueping
Institute of Medical Information & Library, Chinese Academy of Medical Sciences/Peking Union Medical College, Beijing, China

In biomedical data mining and knowledge discovery, high-quality manually curated corpora remain the most valuable resources for the supervised learning of relations. To provide a corpus for curating disease-centered relations from the Chinese biomedical literature, we constructed a Chinese biomedical semantic relation corpus (CCDR) from a collection of Chinese biomedical abstracts, in which both disease-centered entities and relations were collaboratively curated. The CCDR corpus construction followed existing domain guidelines, including the NCBI disease annotation guideline and the BioCreative CDR annotation guideline, and used a web-based annotation tool, the Chinese Biomedical Semantic Annotation System (CBSAS), to support a standardized annotation process. The resulting corpus contains 339 Chinese biomedical articles with 2,367 annotated chemicals, 2,113 diseases, 237 symptoms, 164 chemical-induce-disease relations, 163 chemical-induce-symptom relations, and 805 chemical-treat-disease relations. Each entity annotation includes both the mention text spans and normalized concept identifiers, while each relation annotation includes the related normalized concept identifiers. Corpus quality is measured by the F score, which indicates inter-annotator agreement. CCDR achieves inter-annotator agreement scores of 0.883 for chemical entities, 0.791 for disease entities, and 0.479 for chemical-treat-disease relations. Quality and distribution analyses show that the corpus is well suited to support model training for chemical, disease and symptom entity recognition and for chemical-treat-disease relation extraction.
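The sketch below illustrates the F-score style of inter-annotator agreement described above: one annotator's entity spans are treated as the reference and the other's as predictions, with exact-match scoring. The spans and types are hypothetical examples, not CCDR data.

def f_score_agreement(ann_a, ann_b):
    """Pairwise F1 between two sets of (start, end, type) annotations (exact match)."""
    a, b = set(ann_a), set(ann_b)
    tp = len(a & b)
    if tp == 0:
        return 0.0
    precision = tp / len(b)
    recall = tp / len(a)
    return 2 * precision * recall / (precision + recall)

annotator_1 = {(0, 4, "Chemical"), (10, 18, "Disease"), (25, 31, "Symptom")}
annotator_2 = {(0, 4, "Chemical"), (10, 18, "Disease"), (40, 45, "Symptom")}
print(round(f_score_agreement(annotator_1, annotator_2), 3))  # 0.667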

107 Systemic analysis and targeted curation of blood coagulation cascade models

Tiwari, Krishna
Babraham Institute

Blood coagulation is a highly dynamic and complex process in mammalian hemostasis. The intricate interplay between multiple enzymes and proteins makes it relevant in diseases such as von Willebrand disease and hemophilia A and B. The complexity of this phenomenon warrants investigation from multiple perspectives, including disease and drugs. Quantitative mathematical modelling provides insight into blood coagulation and its control parameters, as well as tools to computationally test hypotheses about biological mechanisms and to optimise drug dosing strategies to control disease. Here, we review the blood coagulation modelling landscape by performing targeted curation of mathematical models published over the last three decades. We focused on curation of ordinary differential equation (ODE) models of blood coagulation pathways. Over 50 ODE models were identified through a survey of the scientific literature, encoded in Systems Biology Markup Language (SBML) format, and manually curated to reproduce the published simulation results. The models were further semantically enriched by cross-referencing model entities to ontologies (GO, SBO, ChEBI) and databases (UniProt, Ensembl). All curated models were deposited in the BioModels repository (https://www.ebi.ac.uk/biomodels/) and are publicly available to the broader scientific community. Prominent curated models include Wajima et al. 2009, Chatterjee et al. 2010 and Diamond et al. 2016. We reviewed all models and identified the need to model the link between the blood coagulation cascade, platelet activation pathways and the effect of approved and under-trial drugs. This work provides a detailed overview of the blood coagulation modelling landscape as well as a wealth of curated resources, including models and model parameters, available via the BioModels repository to facilitate blood coagulation research.
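As a toy illustration of the kind of ODE model curated in this project, the sketch below simulates a two-step activation cascade (activated factor X converting prothrombin to thrombin) with made-up rate constants; it is purely illustrative and is not one of the curated BioModels entries.

import numpy as np
from scipy.integrate import solve_ivp

k_act, k_inh = 0.5, 0.05   # hypothetical activation and inhibition rates (1/s)

def cascade(t, y):
    fxa, prothrombin, thrombin = y
    activation = k_act * fxa * prothrombin
    return [-k_inh * fxa,                    # factor Xa slowly inhibited
            -activation,                     # prothrombin consumed
            activation - k_inh * thrombin]   # thrombin generated, then inhibited

sol = solve_ivp(cascade, (0, 60), y0=[1.0, 10.0, 0.0], dense_output=True)
t = np.linspace(0, 60, 7)
print(np.round(sol.sol(t)[2], 2))  # thrombin concentration over time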

 

108 LOINC2HPO: Curation of Phenotype Data from the Electronic Health Records using the Human Phenotype Ontology

Vasilevsky, Nicole
Oregon Health & Science University

Electronic Health Record (EHR) data in the United States are often encoded using Logical Observation Identifier Names and Codes (LOINC), a universal standard for coding medical laboratory observations in EHRs. LOINC-encoded clinical tests can be interpreted as phenotypic outcomes, offering the potential for secondary reuse of EHR data for patient phenotyping. The Human Phenotype Ontology (HPO) has been widely used for deep phenotyping in research and for diagnostic purposes. It contains over 13,500 classes that represent phenotypic abnormalities encountered in human diseases. Mapping LOINC codes to HPO terms provides an opportunity to structure this phenotypic information and enable automatic extraction of detailed deep phenotypic profiles from laboratory results for downstream studies. Multiple LOINC codes can code for lab tests that can be represented as a single phenotype; for example, multiple codes exist for measurements of nitrate levels in urine, which could be interpreted as HP:0031812 Nitrituria. To harmonize lab test data from EHRs with the HPO, we developed a curation tool that converts EHR observations into HPO terms. To date, over 2,400 LOINC codes have been mapped to HPO terms and our mapping library is freely available online (https://w3id.org/loinc2hpo/annotations). To demonstrate the utility of these mapped codes, we performed a pilot study with de-identified data from asthma patients' health records. We were able to convert 83% of real-world laboratory tests into HPO-encoded phenotypes. Analysis of the LOINC2HPO-encoded data showed that known and new phenotypes, such as eosinophilia and several other abnormal laboratory measurements, are enriched in asthma patients. This preliminary evidence suggests that LOINC data converted to HPO can be used for machine learning approaches to support genomic phenotype-driven diagnostics for rare disease patients, and to perform EHR-based mechanistic research.
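The sketch below illustrates the LOINC-to-HPO conversion idea described above: a lab observation (LOINC code plus an abnormal-flag interpretation) is looked up in a mapping table to yield an HPO term. The mapping entries and the observation format are hypothetical illustrations, not the published annotation file.

from dataclasses import dataclass

@dataclass
class Observation:
    loinc: str
    interpretation: str   # e.g. "H" (high/positive), "L" (low), "N" (normal)

# Hypothetical fragment of a LOINC -> {interpretation: HPO term} mapping table
LOINC2HPO = {
    "5802-4": {"H": ("HP:0031812", "Nitrituria")},     # urine nitrite test strip
    "711-2":  {"H": ("HP:0001880", "Eosinophilia")},   # blood eosinophil count
}

def to_hpo(obs: Observation):
    """Return the HPO term implied by an observation, or None if unmapped or normal."""
    return LOINC2HPO.get(obs.loinc, {}).get(obs.interpretation)

print(to_hpo(Observation("5802-4", "H")))   # ('HP:0031812', 'Nitrituria')
print(to_hpo(Observation("711-2", "N")))    # None (normal result, no phenotype)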

 

109 Curating the authorship of clinical records and biomedical abstracts

Vishnyakova, Dina
F. Hoffmann-La Roche Ltd

The work of biomedical curation usually involves publication triage and annotation. We describe a type of biocuration task in which curators are required to take a more proactive role in both seeking information and using deductive reasoning. The task involves determining whether the authors of two different types of scientific biomedical documents are in fact the same person. The documents involved are MEDLINE abstracts and, analyzed for the first time in this study, ClinicalTrials.gov records. Determining the scientific authorship of a clinical finding is important to certify its validity or to gather additional information concerning it. Scientific author names, however, can be highly ambiguous, and information about affiliation is often lacking in both MEDLINE and ClinicalTrials.gov. Thus, for this task we encouraged curators to seek information on the internet and make judgments based on the outcome of those searches. For their preparation we gave them methodological training in finding adequate information resources and reasoning over all available information. In setting up the task, we evaluated both crowdsourcing and expert curators. Crowdsourcing curators performed rather poorly, even those with a trusted track record in past curation tasks. Expert curators with appropriate training, on the other hand, were able to seek information on the internet effectively and performed with over 94% agreement. We additionally checked their judgments by emailing a set of scientific authors directly, and the responses we received agreed with those of our curators in 98% of cases. Thus, our experience shows that even in this apparently simple assignment, motivated curators are necessary to independently gather appropriate information resources and produce correct annotations in a proactive fashion. Parts of this work have been accepted for publication in the journal JAMIA.

 

110 Study on the Construction of Knowledge Graph of Tumor based on Chinese Electronic Medical Records

Xiu, Xiaolei
Institute of Medical Information & Library, Chinese Academy of Medical Sciences

Clinicians urgently need to use the large number of existing electronic medical records to summarize experience and mine potentially effective diagnosis and treatment relationships, in order to assist the treatment of subsequent patients and ultimately reduce tumor mortality. Therefore, in view of these urgent clinical needs and the lack of semantic considerations in existing medical knowledge graphs, this paper explores a complete framework for constructing a tumor knowledge graph from Chinese electronic medical records (CEMRs), in order to provide experience for the construction of knowledge graphs in specific fields. The construction framework of the tumor knowledge graph based on Chinese electronic medical records is shown in Figure 1. In this paper, a framework for a tumor knowledge graph based on CEMRs is constructed by referring to methods for building medical knowledge graphs in China and abroad. Firstly, a knowledge model of tumor concepts with five kinds of entities and eight kinds of relationships is constructed using relevant terminology thesauri. Secondly, combined with the structure, content and language characteristics of tumor electronic medical records, named entity recognition (NER) and relation extraction (RE) are carried out using a constructed external feature set. Then, the extracted tumor entities and relationships are stored in a graph database (Neo4j) and the knowledge graph is drawn. Finally, taking Chinese electronic medical records of renal tumors as an example, the constructed framework is experimentally verified. This paper explores the construction of a knowledge graph of renal tumors, which could assist doctors in diagnosis and treatment, and the construction framework provides an empirical reference for building knowledge graphs in specific fields.
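The sketch below illustrates the storage step described above: writing an extracted entity-relationship pair into Neo4j with the official neo4j Python driver. The connection URI, credentials, node labels, relationship type and the example relation are all hypothetical placeholders, not content from the paper's knowledge model.

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def add_treats(tx, drug, disease):
    # MERGE avoids duplicate nodes/edges when the same relation is extracted twice
    tx.run(
        "MERGE (d:Drug {name: $drug}) "
        "MERGE (t:Disease {name: $disease}) "
        "MERGE (d)-[:TREATS]->(t)",
        drug=drug, disease=disease,
    )

with driver.session() as session:
    # hypothetical relation extracted from a renal-tumor medical record
    session.execute_write(add_treats, "sunitinib", "renal cell carcinoma")

driver.close()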