Database Journal

Odd-numbered posters will be presented on Monday, 8th April, and even-numbered posters on Tuesday, 9th April.

Posters 111–122.

111 ccPDB 2.0: An updated version of datasets created and compiled from Protein Data Bank

(Note: 112 in Abstract book)

Agrawal, Piyush
CSIR-Institute of Microbial Technology

ccPDB 2.0 (http://webs.iiitd.edu.in/raghava/ccpdb) is an updated version of the manually curated database ccPDB, which maintains datasets required for developing methods to predict the structure and function of proteins. The number of datasets compiled from the literature has increased from 45 to 141 in ccPDB 2.0. Similarly, the number of protein structures used for creating datasets has increased from ~74,000 to ~137,000 (PDB March 2018 release). ccPDB 2.0 provides the same web services and flexible tools that were present in the previous version of the database. In the updated version, links to a number of methods developed in the past few years have also been incorporated. This updated resource is built on responsive templates that are compatible with smartphones (mobile, iPhone, iPad, tablets, etc.) and large-screen devices. In summary, ccPDB 2.0 is a user-friendly web-based platform that provides comprehensive and up-to-date information about datasets. Database URL: http://webs.iiitd.edu.in/raghava/ccpdb

 

112 Annotation of gene product function from high-throughput studies using the Gene Ontology

(Note: 111 in Abstract book)

Attrill, Helen
FlyBase, University of Cambridge

High-throughput studies constitute an essential and valued source of information for researchers. However, high-throughput experimental workflows are often complex, with multiple data sets that may contain large numbers of false positives. The representation of high-throughput data in the Gene Ontology (GO) therefore presents a challenging annotation problem, given that the overarching goal of GO curation is to provide the most precise view of a gene's role in biology. To address this, representatives from annotation teams within the GO Consortium reviewed high-throughput data annotation practices. We present an annotation framework for high-throughput studies that will facilitate good standards in GO curation and, through the use of new high-throughput evidence codes, increase the visibility of these annotations to the research community.

 

113 Building Deep Learning Models for Evidence Classification from the Open Access Biomedical Literature

Burns, Gully A
Information Sciences Institute

We investigate the application of deep learning to biocuration tasks that involve classification of text associated with biomedical evidence in primary research articles. We developed a large-scale corpus of molecular papers derived from PubMed and PMC open-access records and used it to train deep learning word embeddings under the GloVe, FastText and ELMo algorithms. We applied those models to a distantly supervised method-classification task based on text from figure captions, or from fragments surrounding references to figures in the main text, using a variety of models and parameterizations. We then developed document classification (triage) methods for molecular interaction papers by using deep learning attention mechanisms to aggregate classification-based decisions over selected paragraphs in the document. We obtained triage performance with an accuracy of 0.82 using a combined convolutional neural network (CNN) and bidirectional long short-term memory (LSTM) architecture, augmented by attention, to produce a single triage decision. With this work, we hope to encourage developers of biocuration systems to apply deep learning methods to their specialized tasks by repurposing large-scale word embeddings for their data.
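
To make the architecture concrete, the following is a minimal PyTorch sketch of a CNN + bidirectional LSTM classifier with an attention layer producing a single triage decision. All layer sizes and identifiers are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch of a CNN + BiLSTM + attention triage classifier
# (illustrative layer sizes; not the authors' exact architecture).
import torch
import torch.nn as nn

class TriageClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, conv_channels=64,
                 lstm_hidden=64, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)  # init from GloVe/FastText
        self.conv = nn.Conv1d(embed_dim, conv_channels, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(conv_channels, lstm_hidden,
                            batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * lstm_hidden, 1)   # scores each position
        self.out = nn.Linear(2 * lstm_hidden, num_classes)

    def forward(self, tokens):                      # tokens: (batch, seq_len)
        x = self.embed(tokens)                      # (batch, seq, embed)
        x = torch.relu(self.conv(x.transpose(1, 2))).transpose(1, 2)
        h, _ = self.lstm(x)                         # (batch, seq, 2*hidden)
        w = torch.softmax(self.attn(h).squeeze(-1), dim=1)  # attention weights
        doc = (w.unsqueeze(-1) * h).sum(dim=1)      # weighted sum -> doc vector
        return self.out(doc)                        # single triage decision
```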

 

114 Validation of protein-protein interactions in databases and resources: the need to identify interaction detection methods that provide binary or indirect experimental evidence

de Las Rivas, Javier
Bioinformatics and Functional Genomics Group, Cancer Research Center (CiC-IBMCC, CSIC/USAL)

The collection and integration of all the known protein–protein physical interactions within a proteome framework are critical to allow proper exploration of the protein interaction networks that drive biological processes in cells at the molecular level. APID Interactomes is a public resource of biological data (http://apid.dep.usal.es) that provides a comprehensive and curated collection of 'protein interactomes' for more than 1100 organisms, including 30 species with more than 500 interactions, derived from the integration of experimentally detected protein-to-protein physical interactions (PPIs). We have updated the APID database, redefining several key properties of the PPIs to provide more precise data integration and to avoid falsely duplicated records. This includes the unification of all the PPIs from five primary databases of molecular interactions (BioGRID, DIP, HPRD, IntAct and MINT), plus information from two original systematic sources of human data and from experimentally resolved 3D structures (i.e. Protein Data Bank (PDB) files in which more than two distinct proteins have been identified). Thus, APID provides PPIs reported in published research articles (with traceable PMIDs) and detected by valid experimental interaction methods that give evidence for such protein interactions (following the ontology and controlled vocabulary developed by HUPO PSI-MI: www.ebi.ac.uk/ols/ontologies/mi). Within this data-mining framework, all interaction detection methods have been grouped into two main types: (i) 'binary' physical direct detection methods and (ii) 'indirect' methods. As a result of these redefinitions, APID provides unified protein interactomes that include the specific experimental evidence supporting each PPI, indicating whether an interaction can be considered 'binary' (i.e. supported by at least one binary detection method) or not.
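
The binary/indirect grouping reduces to a simple rule: a PPI is flagged 'binary' if at least one supporting experiment used a binary detection method. A minimal Python sketch, with an illustrative (not exhaustive) subset of method names and placeholder PMIDs:

```python
# Sketch of the 'binary' flag described above: a PPI counts as binary if
# at least one supporting experiment used a binary (direct) detection
# method. The method names are an illustrative subset, not the full
# PSI-MI controlled vocabulary.
BINARY_METHODS = {"two hybrid", "fluorescence resonance energy transfer"}

def is_binary_ppi(evidence_list):
    """Return True if any supporting evidence uses a binary method."""
    return any(e["method"] in BINARY_METHODS for e in evidence_list)

ppi_evidence = [
    {"pmid": "00000000", "method": "tandem affinity purification"},  # indirect
    {"pmid": "00000000", "method": "two hybrid"},                    # binary
]
print(is_binary_ppi(ppi_evidence))  # True: one binary method suffices
```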

 

115 An enhanced workflow for variant interpretation in UniProtKB/Swiss-Prot improves consistency and reuse in ClinVar

Famiglietti, Maria Livia
SIB Swiss Institute of Bioinformatics

Personalized genomic medicine depends on integrated analyses that combine genetic and phenotypic data from individual patients with reference knowledge of the functional and clinical significance of sequence variants. Sources of this reference knowledge include the ClinVar repository of human genetic variants, a community resource that accepts submissions from external groups, and UniProtKB/Swiss-Prot, an expert-curated resource of protein sequences and functional annotation. UniProtKB/Swiss-Prot provides knowledge on the functional impact and clinical significance of over 30,000 human protein-coding sequence variants, curated from peer-reviewed literature reports. Here we present a pilot study that lays the groundwork for the integration of curated knowledge of protein sequence variation from UniProtKB/Swiss-Prot with ClinVar. We show that existing interpretations of variant pathogenicity in UniProtKB/Swiss-Prot and ClinVar are highly concordant: 88% of variants common to the two resources have interpretations of clinical significance that agree. Re-curation of a subset of UniProtKB/Swiss-Prot variants according to ACMG guidelines using ClinGen tools further increases this level of agreement, mainly owing to the reclassification of supposedly pathogenic variants as benign on the basis of newly available population frequency data. We have now incorporated ACMG guidelines and ClinGen tools into the UniProtKB curation workflow, and we routinely submit variant data from UniProtKB/Swiss-Prot to ClinVar. These efforts will increase the usability and utilization of UniProtKB variant data and facilitate the continuing (re)evaluation of clinical variant interpretations as datasets and knowledge evolve.
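
The 88% concordance figure reflects a comparison of interpretations for variants shared by the two resources. A toy Python sketch of such a comparison, with hypothetical variant identifiers and labels:

```python
# Toy sketch of the concordance comparison: for variants present in both
# resources, count matching interpretations (identifiers and labels are
# hypothetical).
uniprot = {"VAR_A": "pathogenic", "VAR_B": "benign", "VAR_C": "pathogenic"}
clinvar = {"VAR_A": "pathogenic", "VAR_B": "benign", "VAR_C": "uncertain"}

shared = uniprot.keys() & clinvar.keys()
agree = sum(uniprot[v] == clinvar[v] for v in shared)
print(f"concordance: {agree}/{len(shared)} = {agree / len(shared):.0%}")  # 2/3 = 67%
```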

 

116 Increased Interactivity and Improvements to the GigaScience Database, GigaDB

Hunter, Christopher
GigaScience, BGI

With a large increase in the volume and type of data archived in GigaDB since its launch in 2011, we have studied usage metrics and user patterns to assess the aspects most important to current and future use. This has led to new front-end developments and enhanced interactivity and functionality that greatly improve the user experience. In this article, we present an overview of current practices, including the biocuration role of GigaDB staff and the broad usage metrics of GigaDB datasets, together with an update on how the GigaDB platform has been overhauled to improve the stability and functionality of the codebase. Finally, we report on future directions for the GigaDB resource. Database URL: http://gigadb.org/

 

117 Towards comprehensive annotation of Drosophila melanogaster enzymes in FlyBase

Marygold, Steven
FlyBase, University of Cambridge

The catalytic activities of enzymes can be described using Gene Ontology (GO) terms and Enzyme Commission (EC) numbers. These annotations are available from numerous biological databases and are routinely accessed by researchers and bioinformaticians to direct their work. However, enzyme data may not be congruent between different resources, while the origin, quality and genomic coverage of these data within any one resource is often unclear. GO/EC annotations are assigned either manually by expert curators or inferred computationally, and there is potential for errors in both types of annotation. If such errors remain unchecked, false positive annotations may be propagated across multiple resources, significantly degrading the quality and usefulness of these data. Similarly, the absence of annotations (false negatives) from any one resource can lead to incorrect inferences or conclusions. We are systematically reviewing and enhancing the functional annotation of the enzymes of Drosophila melanogaster, focusing on improvements within the FlyBase (www.flybase.org) database. To date, we have reviewed four major enzyme groups: oxidoreductases, lyases, isomerases and ligases. Herein, we describe our review workflow, the improvement in the quality and coverage of enzyme annotations within FlyBase, and the wider impact of our work on other related databases.
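
One way to surface candidate false positives and false negatives is to cross-check EC annotations between resources and flag discrepancies for curator review. A minimal Python sketch with toy data; the second EC number for Gapdh1 is a deliberately spurious entry for illustration, not a real discrepancy:

```python
# Sketch of cross-checking EC annotations between two resources to flag
# candidate errors for curator review (toy data; the extra EC number for
# Gapdh1 is deliberately spurious).
flybase = {"Idh": {"1.1.1.42"}, "Gapdh1": {"1.2.1.12"}}
other_db = {"Idh": {"1.1.1.42"}, "Gapdh1": {"1.2.1.12", "2.6.1.1"}}

for gene in flybase.keys() & other_db.keys():
    only_other = other_db[gene] - flybase[gene]
    if only_other:
        print(f"{gene}: review EC {sorted(only_other)} (absent from FlyBase)")
```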

 

118 Curating Gene Sets: Challenges and Opportunities for Integrative Analysis

Mukherjee, Gaurab
The Jackson Laboratory

Genomic data interpretation often requires analyses that move from a gene-by-gene focus to a focus on sets of genes associated with biological phenomena such as molecular processes, phenotypes, diseases, drug interactions or environmental conditions. Unique challenges exist in the curation of gene sets beyond those in the curation of individual genes. Here we highlight a literature curation workflow whereby gene sets are curated from peer-reviewed published data into GeneWeaver (GW), a data repository and analysis platform. We describe the system features that allow for a flexible yet precise curation procedure. We illustrate the value of curation by gene sets through analysis of independently curated sets relating to the integrated stress response, showing that sets curated from independent sources all share significant Jaccard similarity. A suite of reproducible analysis tools is provided in GeneWeaver as services to carry out interactive functional investigation of user-submitted gene sets within the context of over 150,000 gene sets constructed from publicly available resources and published gene lists. A curation interface supports the ability of users to design and maintain curation workflows of gene sets, including assigning, reviewing and releasing gene sets within a curation project context.
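
The Jaccard similarity used above to compare independently curated sets is |A ∩ B| / |A ∪ B|. A short worked example in Python, with gene symbols chosen for illustration only:

```python
# Jaccard similarity between two gene sets: |A & B| / |A | B|.
# Gene symbols below are an arbitrary illustrative example.
def jaccard(a, b):
    return len(a & b) / len(a | b)

set_a = {"ATF4", "DDIT3", "EIF2AK4", "PPP1R15A"}
set_b = {"ATF4", "DDIT3", "XBP1"}
print(jaccard(set_a, set_b))  # 0.4  (2 shared / 5 in the union)
```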

 

119 ImmunoSPdb: An Archive of Immunosuppressive Peptides

Usmani, Salman Sadullah
CSIR-Institute of Microbial Technology

Immunosuppression has proved to be a valuable therapy in several autoimmune disorders and asthma, as well as in organ transplantation. Immunosuppressive peptides specifically reduce the efficacy of the immune system and have a wide range of therapeutic applications. ImmunoSPdb is a comprehensive, manually curated database of around 500 experimentally verified immunosuppressive peptides compiled from the literature. The current version comprises 553 entries providing extensive information including peptide name, sequence, chirality, chemical modification, origin, nature of the peptide, its target and mechanism of action, amino acid frequency and composition, etc. Data analysis revealed that most immunosuppressive peptides are linear (91%), short in length, i.e. up to 20 amino acids (62%), and composed of L-form amino acids (81%). In addition, 29% of the peptides are chemically modified or carry terminal modifications. Most of the peptides are either derived from proteins (41%) or naturally occurring (27%). Blockade of potassium ion channels (24%) is one of the major targets of immunosuppressive peptides. In addition, we have annotated tertiary structures using PEPstrMOD and I-TASSER. Many user-friendly, web-based tools have been integrated to facilitate searching, browsing and analyzing the data. We have developed a user-friendly, responsive website to assist a wide range of users. Database URL: http://webs.iiitd.edu.in/raghava/immunospdb/
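
Summary statistics like those quoted (fraction linear, fraction up to 20 residues, fraction L-form) can be computed directly from a peptide table. A minimal pandas sketch with hypothetical columns and rows:

```python
# Sketch of computing summary statistics like those quoted above from a
# peptide table (column names and rows are hypothetical).
import pandas as pd

df = pd.DataFrame({
    "sequence":  ["GVLDI", "CIIRSAEEK", "ALWKTMLKY"],
    "topology":  ["linear", "linear", "cyclic"],
    "chirality": ["L", "L", "D"],
})
print(f"linear:   {(df['topology'] == 'linear').mean():.0%}")
print(f"<=20 aa:  {(df['sequence'].str.len() <= 20).mean():.0%}")
print(f"L-form:   {(df['chirality'] == 'L').mean():.0%}")
```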

 

120 Integrated curation and data mining for disease and phenotype models at the Rat Genome Database

Wang, Shur-Jen
Marquette University and Medical College of Wisconsin

Rats have been used as models in biomedical research for over 150 years. Rat disease models arise from naturally occurring mutations, selective breeding and, more recently, genome manipulation. Through innovations in genome-editing technology, genome-modified rats provide precision models of disease in which targeted genes are disrupted or complemented. To facilitate the use of data produced from rat disease models, the Rat Genome Database (RGD) organizes rat strains and annotates them with disease and qualitative phenotype terms as well as quantitative phenotype measurements. From the curated quantitative data, expected phenotype profile ranges were established through a meta-analysis pipeline using inbred rat strains under control conditions. The disease and qualitative phenotype annotations are propagated to the associated genes and alleles where applicable. Currently, RGD has curated nearly 1300 rat strains with disease/phenotype annotations, and about 11% of them have known allele associations. All of the annotations (disease and phenotype) are integrated and displayed on the strain, gene and allele report pages. Disease and phenotype models can be found at RGD by searching for terms in the ontology browser, browsing the disease or phenotype ontology branches, or entering keywords in the general search. Use cases are provided to show different targeted searches of rat strains at RGD.
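
The abstract does not specify the meta-analysis pipeline; purely as an illustration, an expected phenotype range could be derived from pooled control-strain measurements with a simple mean ± 2 SD rule, as in this Python sketch with made-up values:

```python
# Sketch of deriving an expected phenotype range from control-condition
# measurements (a simple mean +/- 2 SD rule; the actual RGD meta-analysis
# pipeline is not described in the abstract).
from statistics import mean, stdev

# hypothetical systolic blood pressure (mmHg) for inbred control strains
controls = [112.0, 118.5, 109.3, 121.2, 115.8]
mu, sd = mean(controls), stdev(controls)
low, high = mu - 2 * sd, mu + 2 * sd
print(f"expected range: {low:.1f}-{high:.1f} mmHg")
```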

 

121 Integration of Macromolecular Complex Data into the Saccharomyces Genome Database

Wong, Edith
Stanford University

Proteins seldom function individually; instead, they interact with other proteins or nucleic acids to form stable macromolecular complexes that play key roles in important cellular processes and pathways. One of the goals of the Saccharomyces Genome Database (SGD; www.yeastgenome.org) is to provide a complete picture of budding yeast biological processes. To this end, we have collaborated with the Molecular Interactions team that provides the Complex Portal database at EMBL-EBI to manually curate the complete yeast complexome. These data, covering a total of 589 complexes, were previously available only in SGD's YeastMine data warehouse (yeastmine.yeastgenome.org) and the Complex Portal (www.ebi.ac.uk/complexportal). We have now incorporated these macromolecular complex data into the SGD core database and designed complex-specific report pages to make them easily available to researchers. These web pages contain referenced summaries focused on the composition and function of individual complexes. In addition, detailed information about how subunits interact within a complex, their stoichiometry and the physical structure is displayed when such information is available. Finally, we generate network diagrams displaying subunits and Gene Ontology (GO) annotations that are shared between complexes. Information on macromolecular complexes will continue to be updated in collaboration with the Complex Portal team and curated as more data become available. Website URL: www.yeastgenome.org
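
The shared-annotation network diagrams can be thought of as a graph in which complexes are nodes and an edge records the GO terms two complexes share. A minimal networkx sketch with hypothetical complex names (the GO IDs are real terms but paired arbitrarily for illustration):

```python
# Sketch of building a network linking complexes that share GO
# annotations (complex names and the complex-to-GO pairing are
# hypothetical).
import networkx as nx
from itertools import combinations

annotations = {
    "complex_A": {"GO:0006412", "GO:0003735"},
    "complex_B": {"GO:0006412"},
    "complex_C": {"GO:0005840"},
}
g = nx.Graph()
g.add_nodes_from(annotations)
for a, b in combinations(annotations, 2):
    shared = annotations[a] & annotations[b]
    if shared:
        g.add_edge(a, b, shared=sorted(shared))
print(g.edges(data=True))  # complex_A -- complex_B share GO:0006412
```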

 

122 Using Deep Learning to Identify Translational Research in Genomic Medicine Beyond Bench to Bedside

TBA

Tracking scientific research publications on the evaluation, utility and implementation of genomic applications is critical for translating basic research into clinical and population health impact. In this work, we use state-of-the-art machine learning approaches to identify translational research in genomics beyond bench to bedside in the biomedical literature. We apply convolutional neural networks (CNNs) and support vector machines (SVMs) to bench/bedside article classification on the weekly manual annotation data of the Public Health Genomics Knowledge Base. Both classifiers use salient features to estimate the probability that a publication is curation-eligible, which can effectively reduce the workload of the manual triage and curation process. On an independent test set (n = 400), the CNN and SVM models achieved F-measures of 0.80 and 0.74, respectively. We further tested the CNN, which performed better, in the routine annotation pipeline for two weeks; it significantly reduced curation effort and retrieved more appropriate research articles. Our approach provides direct insight into the automated curation of genomic translational research beyond bench to bedside, and the machine learning classifiers help annotators improve the efficiency of manual curation.
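
The F-measure reported here is the harmonic mean of precision and recall, F1 = 2PR / (P + R). A short Python sketch with hypothetical confusion counts chosen to reproduce the CNN's score:

```python
# F-measure (F1) used to compare the two classifiers:
# F1 = 2 * precision * recall / (precision + recall).
def f_measure(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# hypothetical confusion counts on a 400-article test set
print(round(f_measure(tp=160, fp=40, fn=40), 2))  # 0.8
```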