Odd-numbered posters will be presented on Monday, 8th April and even-numbered posters on Tuesday, 9th April.
Posters 162 - 194.
162 In the Know About GO: A Newly Redesigned Website for the Gene Ontology
The Gene Ontology resource (GO; http://geneontology.org) is a major bioinformatics initiative to provide a comprehensive, computational representation of our evolving knowledge of the biological functions of genes and gene products in all organisms. GO has been cited in over 148,000 published papers. In order to coordinate the efforts of the GO Consortium with thousands of users worldwide, it is necessary to keep the website current and concise. We have recently restructured the GO website in order to simplify the user interface, clarify citation and standardization practices for the various GO data products, and make educational documentation a prominent aspect of the website. Here we present the newly overhauled GO site, which has been specifically targeted toward two groups of users that are of particular importance: (1) novice users and (2) researchers who need a specific GO annotation or ontology file. Novice users will find training documentation and introductory material to ensure proper use and understanding of GO data products, including annotation and ontology file formats. More experienced users are still able to download GO files, but now through a more intuitive interface. Current GO release dates are prominently featured on the website and emphasis has been placed on the citation policy of GO, allowing for greater reproducibility in GO enrichment analyses and other uses of GO. Additionally, updated training documentation and guidelines are available to encourage and enable outside groups to contribute or suggest modifications to annotations and the ontology. Overall, the redesigned website is expected to enhance involvement of the research community with GO, and is a welcome improvement to an already well-known and heavily used resource.
163 Because my favorite protein paper should be in UniProt
Center for Bioinformatics and Computational Biology, University of Delaware
UniProt Knowledgebase (UniProtKB) is a publicly available database with access to a vast amount of protein sequence and function information. Expert curation at UniProt includes a critical review of experimental data from the literature and predicted data from a range of sequence analysis tools. A representative set of literature articles is selected for annotation to maximize entry information content. Thus, many related articles with potentially useful content may not be included (e.g., those with information overlapping with existing annotations, or those whose main focus is out of scope for UniProt). Also, the UniProt expert curation effort focuses on selected species; therefore, proteins from some organisms for which experimental data may exist are not actively annotated. To facilitate access to the additional literature related to an entry, UniProt compiles and organizes publications from external sources (biological databases) and text mining results. These combined sources add a total of 37,483,000 UniProt AC-PMID pairs, covering over 918,000 unique papers and over 367,000 protein entries (Release 2018_11). This bibliography is classified, via a neural network-based method or based on the source databases, into different topics in the entry, similar to the curated references. Publications are available in the Publications section of the UniProt entry, under “Computationally mapped”. Still, many experts request that articles and, in some cases, suggested annotations be added to entries. To respond to this need, we are developing a “Community” section where researchers are able to directly add the articles that they deem relevant to a protein entry, along with performing several optional tasks, such as classifying the article. ORCIDs will be used to validate and credit the contributors. In this way, UniProt will provide access to a more comprehensive set of articles for each protein. We expect this community contribution will also facilitate the curation effort.
164 Combining text mining and author participation to improve curation at WormBase
California Institute of Technology
Biological databases rely on expert biocuration of the primary research literature to maintain an up-to-date collection of data organized in a machine-readable form. In order to enter information into databases, curators need to: i) identify papers containing relevant data for curation, a process called triaging; ii) recognize named entities; and iii) extract data. WormBase (WB), the authoritative repository for research data on the biology, genetics and genomics of Caenorhabditis elegans and other nematodes, uses various Text Mining (TM) approaches on full text to automate, or semi-automate, each aspect of the curatorial pipeline. In addition, we engage authors, via an Author First Pass (AFP) pipeline, to help classify data types and entities in their recently published papers. We will present how we are combining TM and AFP into a single application to enhance community curation at WB. Specifically, we use string-searching algorithms and statistical methods (e.g. Support Vector Machines (SVMs)) to extract biological entities and flag data types, respectively, and present the results of these methods to authors in a web-based form. With this new approach, authors simply need to validate the results of machine-based pipelines, rather than enter all information de novo. By combining these previously separate pipelines, we hope to lessen the participatory burden for authors, while at the same time receiving valuable feedback on the performance of our TM tools. Wherever possible, we are also providing links from the new AFP interface to structured data submission forms, e.g. phenotypes and expression patterns, giving authors the opportunity to contribute more detailed curation that can be incorporated into WB with minimal curator review. Our approach is generalizable and thus could readily be applied to additional literatures and databases that would like to engage their user community in assisting with curation.
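As an illustration only, the string-searching step described in abstract 164 can be sketched as a lexicon lookup over the paper text. This is not the WormBase pipeline: the lexicon contents, the function name find_entities and the word-boundary rule are all assumptions made for the example.

```python
import re

# Hypothetical mini-lexicon; a real pipeline would load entity lists from the database.
ENTITY_LEXICON = {
    "gene": ["daf-16", "unc-22", "lin-12"],
    "species": ["Caenorhabditis elegans", "C. elegans"],
}


def find_entities(text, lexicon=ENTITY_LEXICON):
    """Return {entity_type: sorted list of lexicon terms found in text}."""
    found = {}
    for entity_type, terms in lexicon.items():
        hits = set()
        for term in terms:
            # Lookarounds stop "daf-16" from matching inside "daf-160".
            pattern = r"(?<![\w-])" + re.escape(term) + r"(?![\w-])"
            if re.search(pattern, text):
                hits.add(term)
        if hits:
            found[entity_type] = sorted(hits)
    return found
```

The returned dictionary is the kind of pre-filled result that an author could then confirm or correct in a web form, rather than entering every entity by hand.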
165 Towards comprehensive quality assessment of proteome space
European Bioinformatics Institute (EMBL-EBI)
Dramatic advances in whole-genome sequencing, coupled with drastically reduced costs, have led to an unprecedented rise in the release rate of de novo genomes. At UniProt we provide comprehensive protein-centric views of such genomes through the Proteomes portal (http://www.uniprot.org/proteomes/). To achieve this, we have put in place pipelines that gather data accurately and consistently from source databases, followed by analyses to determine the quality and completeness of the final datasets. The vast majority of currently available proteomes (189,365 proteomes, UniProt release 2018_11) are based on the translation of genomes submitted to the EMBL/GenBank/DDBJ databases. As there are currently no stringent standards required of genome submissions by databases or journals, there is substantial variability in the quality and completeness of the resulting proteomes. Over the last few years we have established procedures to identify and incorporate new, publicly available genomes by carefully evaluating proteome size ranges within phylogenetic neighbourhoods. We are working on extending this metric to include both assembly-specific (contig and scaffold N50 measures) and annotation-specific (species-specific transcript and protein sequences) parameters. Additionally, analysis of core/conserved gene sets will aid in evaluating evolutionary outliers, including parasitic genomes that have undergone extreme reduction in genome size. We hope that these will aid in the better selection of datasets for large-scale comparative studies and benefit sequencing groups, annotators and the larger research community. In addition to the import of new proteomes, maintenance and updating of existing proteomes to reflect improvements in genome assemblies and genebuild procedures is a vital part of the proteomes project. We will present recent developments in these areas and discuss ongoing work within the proteomes database aimed at maintaining the high quality and comprehensiveness of proteome data.
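For readers unfamiliar with the N50 measure mentioned in abstract 165, a minimal sketch of its usual definition follows: the length of the shortest contig or scaffold in the smallest set of longest sequences that together cover at least half of the assembly. The function name n50 is an assumption for this example, and nothing here is taken from the UniProt pipelines.

```python
def n50(lengths):
    """Return the N50 of a list of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
```

For lengths 2 to 10 (total 54), the three longest sequences (10, 9, 8) reach 27, exactly half of the total, so the N50 is 8.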
166 How Structural Biologists and the Protein Data Bank Contributed to Recent US FDA New Drug Approvals
RCSB Protein Data Bank
Discovery and development of 210 new molecular entities [NMEs, new drugs] approved by the US Food and Drug Administration 2010-2016 was facilitated by 3D structural information generated by structural biologists worldwide and distributed on an open access basis by the Protein Data Bank [PDB]. The molecular targets for 94% of these NMEs are known. The PDB archive contains 5,914 structures containing one of the known targets and/or a new drug, providing structural coverage for 88% of the recently approved NMEs across all therapeutic areas. More than half of the 5,914 structures were published and made available by the PDB at no charge with no restrictions on usage >10 years before drug approval. Citation analyses revealed that these 5,914 PDB structures significantly impacted the very large body of publicly-funded research reported in publications on the NME targets that motivated biopharmaceutical company investment in discovery and development programs that produced the NMEs. RCSB PDB is jointly funded by NSF, NIGMS, NCI, and DOE (DBI-1338415). Keywords: Protein Data Bank, PDB, Worldwide Protein Data Bank, wwPDB, Research Collaboratory for Structural Bioinformatics Protein Data Bank, RCSB PDB, Macromolecular Crystallography, FAIR Principles, Open Access, Data Deposition, New Molecular Entities, Drugs, Biopharmaceutical Industry
167 Gephebase, a Database of Genotype-Phenotype Relationships for natural and domesticated variation in Eukaryotes
CNRS - Institut Jacques Monod
We will present Gephebase, a manually-curated database compiling published data about the genes and the mutations responsible for evolutionary changes in all Eukaryotes (mostly animals, plants and yeasts). Biology researchers can easily browse published data for their topic of interest and perform searches at various levels – phenotypic, genetic, taxonomic or bibliographic (transposable elements, epigenetic mutations, snakes, carotenoid content, etc.). This database allows users to perform meta-analyses to extract global trends about evolution and genetics, as well as about the sociology of the field of evolutionary genetics. Gephebase should also help breeders, conservationists and others identify the most promising target genes for traits of interest, with potential applications in many fields (crop improvement, parasite and pest control, bioconservation, genetic diagnostics). It is freely available at www.gephebase.org.
168 PHI-Canto: introducing the concept of the meta-genotype to curate information on multi-species interactions
The Canto community curation tool (Rutherford et al., (2014) doi: 10.1093/bioinformatics/btu103) was developed to enable the literature curation of biochemical and phenotype data by the publication authors with little or no curation training. We have extended Canto with configurable files to capture data for the pathogen-host interactions database PHI-base (www.phi-base.org) by creating PHI-Canto. PHI-Canto enables the curation of pathogen-host interaction phenotypes (i.e. alterations in pathogenicity and virulence) connected to the underlying genome-level changes. These Canto adaptations involved ‘originating’ the concept of a ‘multi-species genotype’ defined as the ‘meta-genotype’ to facilitate the capture of changes in pathogenicity observed by alterations to pathogen and host genes singly, or in combination. A second major change was the ability to handle multiple species and natural strains simultaneously and unambiguously. A formal logically defined precomposed ontology named phipo (pathogen-host interaction phenotype ontology) is registered with the OBO foundry (http://www.obofoundry.org/ontology/phipo.html) and is currently being developed using the ODK (ontology development kit). These species neutral phenotype terms will be used within PHI-Canto and are described in our other poster ‘Capturing phenotypes for inclusion in a multi-species interaction database’. The PHI-Canto rationale, implementation, and resulting curation workflow are described. The PomBase Canto tool is funded by the Wellcome Trust (104967/Z/14/Z). PHI-base and PHI-Canto development is supported by the UK Biotechnology and Biological Sciences Research Council (BBSRC) (BB/I/001077/1, BB/K020056/1). PHI-base receives additional support from the BBSRC as a National Capability (BB/J/004383/1).
169 The ELIXIR Data Platform in 2019
ELIXIR (https://www.elixir-europe.org/) unites Europe’s leading life science organisations in managing and safeguarding the increasing volume of data being generated by publicly funded research. It coordinates, integrates and sustains bioinformatics resources across its member states and enables users in academia and industry to access services that are vital for their research. There are currently 23 Nodes in ELIXIR, and we work together using a ‘Hub and Nodes’ model. ELIXIR's activities are coordinated across five 'Platforms': Data, Tools, Interoperability, Compute and Training. The goal of the ELIXIR Data Platform (https://www.elixir-europe.org/platforms/data) is to drive the use, re-use and value of life science data. It aims to do this by providing users with robust, long-term sustainable data resources within a coordinated, scalable and connected data ecosystem. This presentation will outline the initiatives currently underway in the ELIXIR Data Platform. Topics will include the ELIXIR Core Data Resources, selected on the basis of a set of Indicators that demonstrate their fundamental importance to the wider life-science community, and the related set of Deposition Databases for the long-term preservation of biological data. Our work on Literature-Data Integration and Scalable Curation for biocurators, which builds on our text mining work with EuropePMC, will be summarised. Our commitment to Long Term Sustainability of life science data resources, including our contribution to the Global Biodata Coalition, will also be covered. Work on all these topics will continue through the new ELIXIR Scientific Programme set for 2019-2023. Lastly, the Data Platform is currently engaged in seven Implementation Studies, involving fifteen ELIXIR Nodes working with 40 Data Resources across Europe. These studies are due to be completed mid-2019, and the tasks they are engaged in will be introduced.
170 Introducing Project FREYA: opportunities for biocurators
FREYA is an EU-funded initiative seeking to illustrate the importance of connected, open identifiers for discovery, access and use of research resources. Persistent identifiers (PIDs) such as DOIs for articles, accession numbers for datasets, and ORCIDs for researchers, are the lynchpin for the global information e-infrastructure that is the ultimate vision. FREYA works alongside organisations like the RDA who seek to promote data sharing. Project partners include twelve ‘big data’ organisations across Europe such as EMBL-EBI, CERN, and PANGAEA, who are working to engage and increase PID users across the research community and to build on the growing PID infrastructure put in place through the efforts of predecessor projects such as THOR (https://project-thor.readme.io/). The programme aims are necessarily ambitious: to grow the reach of existing systems e.g. by encouraging further ORCID uptake, improving literature-data integration, promoting uptake of established services in additional disciplines; working to fill gaps in connectivity to include research organisations, facilities such as synchrotrons and research vessels, grant information, samples; collating user stories to prioritise new PID services, gathering requirements for the most promising of these and developing prototypes. The presentation will include a status update on several of these initiatives, and will illustrate the local PID graph we’re developing and extending around Europe PMC, a biomedical literature repository based at the EBI. We invite the audience to consider how they can apply the emerging PIDs and services in their own work. FREYA partners view the biocuration community as influential stakeholders in the PID forum and key to sustaining the PID infrastructure beyond the lifetime of the project.
171 Integration and Presentation of Glycobiology Resources in GlyGen
George Washington University
The growing body of glycobiology knowledge is underutilized if it cannot be easily searched and integrated with related curated resources. To improve the ability of researchers to find and build on this growing body of data, we launched the GlyGen glycobiology knowledgebase in September 2018 (http://glygen.org). GlyGen harmonizes and integrates data from a diverse set of collaborators (including EMBL-EBI, NCBI, UniCarbKB, GlyTouCan, PDB, and academic partners) under one easy-to-navigate interface. The data, ranging from genes to glycoproteins to disease and function annotations to glycan compositions, are stored in a unified extensible data model which facilitates mappings between distinct sources. A key strength of this approach is the ability to develop ‘quick search’ features or advanced database queries which reduce the number of searches required to answer complicated questions. Examples include: “Which proteins have been shown to bear glycan X and which site is this glycan attached to?” and “Which glycosyltransferases are known to be involved in disease X?” We encourage users to inform us of further queries which would be of use to them. All data sets collected and integrated by GlyGen are available for access through web-based APIs as well as for download through http://data.glygen.org. In order to improve transparency and reproducibility, all harmonization and quality control processes required to integrate a given data table have been compiled in BioCompute Objects.
172 Downloading Data from SGD
Data at the Saccharomyces Genome Database (SGD; http://www.yeastgenome.org/) is accessed through a dynamic faceted search backed by Elasticsearch technology, a full-text search and analytics engine. Elasticsearch provides a powerful window into SGD’s complex data. Traditionally, SGD users used the File Transfer Protocol (FTP) server to download data files, but we are moving towards a more flexible approach by leveraging our search tool. File metadata has been loaded into the database and integrated into the greater SGD search. As a result, users can search for files the same way they might search for a gene or phenotype data. Results are matched based on metadata, including the file name, description, keywords, and the PMID if the file is associated with a published reference. The files themselves are stored using the AWS (Amazon Web Services) S3 storage service. Our new approach to downloading SGD data files provides our users with a more customized, end-to-end experience.
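The metadata matching described in abstract 172 can be pictured with a toy example. This sketch is not SGD's implementation and does not use Elasticsearch; it only shows how a query can be matched against a file's name, description, keywords and PMID. The FileRecord class, the search_files function, the file names and the PMID value are all hypothetical placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class FileRecord:
    """Metadata for one downloadable file (hypothetical schema)."""
    name: str
    description: str
    keywords: list = field(default_factory=list)
    pmid: str = ""  # empty when the file has no associated publication


def search_files(records, query):
    """Return records whose metadata contains the query, ignoring case."""
    q = query.lower()
    hits = []
    for rec in records:
        # Concatenate every searchable metadata field into one string.
        haystack = " ".join(
            [rec.name, rec.description, " ".join(rec.keywords), rec.pmid]
        ).lower()
        if q in haystack:
            hits.append(rec)
    return hits
```

A production search engine would add tokenisation, ranking and facets; the point here is only that a file becomes findable through the same kinds of terms a user would enter for a gene or phenotype.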
173 Protein Data Bank in Europe Knowledge Base (PDBe-KB) - a new community-driven resource for functional annotations of macromolecular structures
The Protein Data Bank is the single global archive of macromolecular structures and associated data. Since its inception in 1971, it has maintained open access policies and more recently has been striving to follow the FAIR principles of Findability, Accessibility, Interoperability and Reusability. Over the years, the PDB data has been used to derive a variety of structural and functional annotations, such as structural domains, analysis of ligand binding sites, predictions of other functional sites or key functional residues, effects of mutations and genetic variants and many more. These annotations are housed in a large number of smaller specialist resources and research groups, which causes the fragmentation of this rich data, impeding its findability, accessibility, interoperability and reusability. In an effort to bring this structural bioinformatics research community together and make this rich annotation data more FAIR, the Protein Data Bank in Europe Knowledge Base (PDBe-KB; https://pdbe-kb.org) was established in 2018. The PDBe-KB resource also aims to increase the visibility of partner resources. Active PDBe-KB partners are based in the UK, USA, India, Czech Republic, Italy, Belgium and Israel, with other prospective partners having expressed an interest in joining the consortium. The contributed annotations should originate from a method described in a peer-reviewed publication. Software-generated predictions should be updated at least once a year, while high-quality manual curations could be a one-off contribution. All partners are invited to regular workshops where the project progress is presented, the governance is discussed and data exchange standards are developed and agreed. PDBe-KB is a use case for the new ELIXIR 3DBioInfo community.
174 User-driven PomBase website redesign improves knowledge search, display, analysis, and reuse
University of Cambridge
PomBase, the model organism database for fission yeast, has completely redeveloped its web presence to provide a more fully integrated, better-performing service. The new infrastructure supports daily data updates as well as fast, efficient querying, and smoother navigation within and between pages. New pages for publications and genotypes provide routes to all data curated from a single source, and to all phenotypes associated with a specific genotype, respectively. Improved displays of ontology-based annotations balance comprehensive data coverage with ease of use. The default view now uses ontology structure to provide a concise, non-redundant summary that can be expanded to reveal underlying details and metadata, and phenotype annotations can be filtered by ontology structure or supporting evidence. New front page features highlight recent fission yeast research, PomBase features, and community-curated publications. PomBase has also integrated an instance of the JBrowse genome browser, facilitating straightforward loading of and intuitive access to new genome-scale datasets. PomBase's newest tool, QuiLT, displays gene lists graphically based on multiple orthogonal annotation types. Taken together, the new PomBase implementation enables users to probe connections among different types of data to form a comprehensive view of fission yeast biology, and provides a rich set of modular, reusable code that can be deployed to create new or enhance existing organism-specific databases.
175 ELIXIR 5 years on: Providing a coordinated European Infrastructure for Life Science Data and Services
Life Science data is complex and fragmented, with diverse formats and metadata standards, creating a significant barrier to data integration and reuse. Since 2014, ELIXIR, the European Life-science Infrastructure for Biological Information, has worked to address these challenges. It consolidates Europe’s national centres, services, and core bioinformatics resources into a single, coordinated infrastructure. There are currently 23 countries involved in ELIXIR, bringing together more than 200 institutes and 600 scientists. ELIXIR’s activities are coordinated across five areas called 'Platforms'. The Data Platform has developed a process to identify data resources that are of fundamental importance to research and committed to long term preservation of data, known as core data resources. The Tools Platform has services to help find appropriate software tools, workflows and benchmarking, as well as a BioContainers registry. The Compute Platform has services to store, share and analyse large data sets and has developed the Authorization and Authentication Infrastructure (AAI) single sign-on service across ELIXIR. The Interoperability Platform develops and encourages adoption of standards, and the Training Platform helps scientists and developers find the training they need via the Training e-Support System (TeSS). ELIXIR has also established a number of ‘Communities’, based around a specific research area, major technology, or specialist user interest group. The Communities drive and support the development and integration of strategically important areas with the Platforms. ELIXIR also has the capability to fund short technical pilot studies, called Implementation Studies, with the aim of driving service development and standards adoption. Successful Implementation Studies, such as Beacons and AAI, can lead to adoption and further collaborations with the wider research communities, for example with the Global Alliance for Genomics and Health.
176 International Mouse Phenotyping Consortium: Capturing Multidimensional Large-Scale Phenotyping Data
MRC Harwell Institute
The International Mouse Phenotyping Consortium (IMPC) is a collaborative project designed to knock out each protein-coding mouse gene and characterise the resulting mutants through broad-based, high-throughput phenotyping pipelines. To date, phenotyping data is available for over 6000 knockout lines from 12 centres worldwide. The data is managed by the Mouse Phenotyping Informatics Infrastructure (MPI2) consisting of the Data Co-ordination Centre (DCC) at MRC Harwell Institute, the Core Data Archive (CDA) at the European Bioinformatics Institute and a team at Queen Mary University of London (QMUL). Aspects of mouse biology are assessed through a battery of tests including fertility, viability, expression and a range of in vivo and terminal tests. In vivo tests include behavioural, metabolic and morphological assessments among others. In vivo and terminal tests are carried out on 9-16 week-old mice and, more recently, from 52 weeks onwards as part of the late adult pipeline. Due to the large quantities of data, stringent procedures have been introduced to ensure the consistency and quality of the data produced. All of the procedures are defined on IMPReSS (https://www.mousephenotype.org/impress/), a publicly available database, specifying how and what data is collected. When data is submitted to the DCC it goes through a validation process to ensure the structure of the data is as expected. The preliminary data is then made available to the public (www.mousephenotype.org) while it goes through a rigorous quality control process. Any potential issues are reported back to the phenotyping centres for review and correction where needed. Quality confirmed data is regularly exported from the DCC to the CDA where up-to-date statistical analysis is applied. Human disease associations are also explored at QMUL. This analysed data is then made available to the public via the web portal as distinct data releases.
177 STRENDA DB – monitoring the completeness of enzyme function data
Scientific research is in part a creative act producing new results or theories through observation and experimentation, and in part verification through reproduction and comparison. This requires the experimental data to be reported completely, including all necessary meta-data. Discussions with scientists have revealed many deficiencies in the way that the data are currently reported, often resulting in incomplete and even unusable data sets that are not suitable for subsequent research and knowledge generation. For more than a decade, the Beilstein-Institut has supported the STRENDA Commission (Standards for Reporting Enzymology Data), which has developed community-based recommendations to authors for the reporting of enzymology data – the STRENDA Guidelines. Today, more than 55 biochemistry journals recommend that their authors refer to these guidelines when reporting enzyme kinetics data. In parallel, the Commission has developed STRENDA DB, a robust web-based validation and storage system for functional enzyme data that incorporates the STRENDA Guidelines, allowing authors to easily check manuscript data for compliance with the Guidelines prior to or during the publication process. The data is stored in an open access database which is going to be incorporated in a network of domain-specific repositories which will provide a knowledgebase for researchers. This talk highlights the benefits of STRENDA DB for journals, authors, reviewers, and readers.
178 Facilitating community-based curation of transcription factor binding profiles in JASPAR
Centre for Molecular Medicine Norway (NCMM), University of Oslo, Norway
JASPAR (http://jaspar.genereg.net) is an open-access database of curated and non-redundant transcription factor (TF)-binding profiles stored as position frequency matrices (PFMs) and TF flexible models (TFFMs) for TFs across species in six taxonomic groups. The JASPAR database has been extensively updated in the previous seven releases. Currently, the JASPAR CORE collection includes 1,404 non-redundant PFMs (579 for vertebrates, 489 for plants, 176 for fungi, 133 for insects, 26 for nematodes, and 1 for urochordata). Furthermore, it contains 446 TFFMs (225 for vertebrates, 218 for plants and 3 for insects). The 2018 release incorporates clusters of similar PFMs in each taxon and each TF class per taxon. Furthermore, we have used PFMs from the JASPAR CORE collection to predict TF-binding sites in several species, which are available through UCSC and Ensembl Genome Browser track hubs. Finally, the seventh release comes with a new web framework with an interactive and responsive user interface, along with a Representational State Transfer (REST) application programming interface (API) to access the JASPAR database programmatically. In the eighth release, we are planning to introduce community-based curation for profiles that were not added into JASPAR due to lack of experimental evidence from existing literature. We will encourage researchers to perform experiments and/or point us to literature that our curators missed in order to support these profiles. All the underlying data can be browsed and downloaded using the website and can be retrieved programmatically using the RESTful API (http://jaspar.genereg.net/api) and through the R/Bioconductor package.
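To make the role of a position frequency matrix in binding-site prediction concrete, the sketch below converts a PFM into log-odds weights and scores a candidate site. It is a generic textbook-style conversion, not the method used by JASPAR: the function names, the pseudocount of 0.8, the uniform background of 0.25 and the two-position toy matrix are all assumptions for illustration.

```python
import math

BASES = "ACGT"


def pfm_to_pwm(pfm, pseudocount=0.8, background=0.25):
    """Convert a PFM to log2-odds weights.

    pfm maps each base to a list of counts, one per motif position.
    The pseudocount (an assumed value) avoids log2(0) for unseen bases.
    """
    width = len(pfm["A"])
    pwm = {base: [] for base in BASES}
    for i in range(width):
        column_total = sum(pfm[base][i] for base in BASES)
        for base in BASES:
            prob = (pfm[base][i] + pseudocount * background) / (column_total + pseudocount)
            pwm[base].append(math.log2(prob / background))
    return pwm


def score_site(pwm, site):
    """Sum the per-position weights for a candidate site of motif width."""
    return sum(pwm[base][i] for i, base in enumerate(site))


# Toy two-position matrix (hypothetical counts, not a JASPAR profile).
toy_pfm = {"A": [8, 0], "C": [0, 0], "G": [0, 8], "T": [0, 0]}
toy_pwm = pfm_to_pwm(toy_pfm)
```

With this toy matrix, the site "AG" receives a higher score than any other dinucleotide, which is the basis on which candidate sites in a genome sequence would be ranked.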
179 Measuring the value of data curation as a part of the publishing process
Journals and publishers have an important role to play in the drive to increase the reproducibility of published science. Since its launch in 2014, the Nature Research journal Scientific Data has established a reputation for publishing data papers (‘Data Descriptors’) that are highly reusable, as evidenced by a strong citation record. One of the ways in which Scientific Data ensures maximum reusability of published data is via the in-house data curation workflow applied to every Data Descriptor. In 2017, Springer Nature launched its Research Data Support (RDS) service to provide data curation expertise to researchers publishing at other Springer Nature journals. During curation at Scientific Data and RDS, our data editors familiarise themselves with the related manuscript and perform a thorough check of each data archive. This ensures the descriptions in the manuscript match the metadata and data at the data repositories. The curation process facilitates the identification of any discrepancies between the manuscript text and the information held at the data repository. Over the last year, the curation team have been recording the types of discrepancies rectified as a direct result of our curation process. At Scientific Data approximately 10% of the discrepancies the team find are significant enough to potentially have warranted a formal correction had the issue not been resolved prior to publication.
180 The Gene Regulation Consortium (GRECO) and the COST Action GREEKC
The GREEKC COST Action (www.greeck.org) aims to develop one standardised “Gene Regulation Knowledge Commons” (GRKC) for the Life Sciences. The Knowledge Commons is a collection of freely accessible information resources, with data quality criteria meeting standards that allow seamless integration and interoperability as well as automated computational access with third party software. GREEKC is an initiative of GRECO, the Gene Regulation Consortium, which is a global consortium (www.thegreco.org) that focuses on the development of curated resources for the study of gene regulation processes. GREEKC started its coordination work in 2017, and by April 2019 it will have organized 6 workshops, with contributions not only from the European biocuration community and users of the GRKC but also from selected experts from outside the EU. Many domain experts who would not normally meet have come together in GREEKC workshops to share and discuss their views on the future of constructing and using curated knowledge resources. The presentation will focus on the consortium, the main objectives of the COST Action, its first results, the main challenges and the plans until September 2020.
181 PomBase at a Glance
University College London
A graphical overview of the fission yeast research community, literature, and data curation in PomBase.
182 Facilitating researcher engagement with curation during the data paper publication
One thing that differentiates the Nature Research journal Scientific Data from other data journals is our commitment to provide extensive, expert curation for datasets associated with each Data Descriptor (our principal article type) we publish. In doing this, our goal is to create a rich, structured and machine-readable metadata record of the data generation workflow which renders the data more Findable and Accessible. Our metadata utilises standardised terminology (from a trusted selection of ontologies) to describe and link the data to other, similar data, and we interact with the curation community to adhere to formal and up-to-date policies (e.g., Identifiers.org). One of our data curators confirms that each data file is available in an acceptable format and that no data is missing or inconsistent with what is described in the Data Descriptor. However, such an in-depth in-house curation process is time-consuming and prone to human error, since the curator may not be familiar with the experimental technique. The manuscript author is better placed to generate metadata about their data, but our experience shows that authors often do not understand how best to formulate structured metadata. This presentation describes our curation process, including how we engage with authors to produce the metadata and what tools and policies we have designed to overcome the challenges of efficient data curation. We provide details of the openly available metadata templates we created to facilitate formalised metadata generation by authors, and how these have been received. We also present statistics surrounding author engagement in our data curation process and how this impacts curation times and quality. We hope that by sharing our experiences, we can demonstrate our commitment to working with the curation community to achieve best practices, and champion the benefits of curated Data Descriptors to the scientific community.
183 Involving researchers in the biocuration of plant genes and pathways
Oregon State University
Curated genomic resources and databases that provide systems-level frameworks for visualization and analysis of high-throughput omics data cannot keep pace with an ongoing explosion in the generation of genomic data without the active engagement and support of the research community and other stakeholders, such as academic institutions, grant agencies, and publishers. Researchers who participate in biocuration activities can acquire Big Data literacy and additional analytical skills useful for conducting and publishing their own research. Based on a few on-site and online biocuration activities organized by Plant Reactome (http://plantreactome.gramene.org) curators, we will discuss the strategy, workflow, and outcomes of our efforts in involving researchers and database users in the curation of plant genes and pathways. The Plant Reactome database is funded by NSF award #IOS-1127112 to the Gramene project. It is produced with intellectual and infrastructure support provided by the Human Reactome award (NIH: P41 HG003751, ENFIN LSHG-CT-2005-518254), the Ontario Research Fund, and the European Bioinformatics Institute (EBI) Industry Programme.
184 microPublication – incentivizing authors to publish research findings in a machine readable format
California Institute of Technology
A significant proportion of useful biomedical research output is not published in a timely fashion, or ever, because of the pressure to publish complete narratives. In addition, the effort, and hence cost, of curation at archival repositories can inhibit getting these data into publicly funded community knowledge bases. Authors are in the best position to curate their own data; however, researchers need guidance in order to attach appropriate controlled vocabulary terms and associated metadata to their observations. We are capturing these “lost” data both by depositing them into archival repositories and by publishing the findings as micropublications in the peer-reviewed, open access journal “microPublication Biology”. Our platform encourages researchers to share datasets in a metadata-driven fashion by giving them credit for participation through a citable publication that is peer-reviewed. At the same time, we guide authors through curation by having them submit their findings through carefully designed forms that enable the use of standard vocabularies via autocomplete fields. Peer review ensures that content in the database is of high quality. The microPublication platform is changing the way in which scholarly and scientific information is communicated and shared, and speeds the rate at which data get from the bench to the public. Further, our platform aims to maximize efficiency in data curation and accessibility of research findings according to the findable, accessible, interoperable, and reusable (FAIR) data principles. We have launched our journal, microPublication Biology, within the C. elegans field and are quickly expanding to other model organism research communities and beyond. We will share our experience with submissions to our new journal from the C. elegans, Drosophila, and Xenopus communities.
185 Coordination and Collection of Data for a Community Global Biodiversity Initiative
European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI)
The UniEuk project (https://unieuk.org/project/) is an open, inclusive, community-based and expert-driven international initiative to build a flexible, adaptive universal taxonomic framework for eukaryotes. It unites three complementary modules, EukBank, EukMap and EukRef, which use environmental metabarcoding surveys, phylogenetic markers and expert knowledge to inform the taxonomic framework. I will focus here on the EukBank module, the aim of which is to standardise observations of global eukaryotic diversity across biomes (e.g., saturation, relative frequencies, phylogeny), and allow identification and preliminary naming of novel eukaryotic lineages of ecological and phylogenetic relevance. This module involves analysis of high-throughput metabarcoding datasets using V4 regions of 18S rRNA from various ecosystems. The European Nucleotide Archive (ENA; https://www.ebi.ac.uk/ena) has been directly involved in the collection and coordination of data for EukBank. In late 2017 a letter was circulated among the protist community calling for 18S V4 rRNA metabarcoding datasets. Data submitters were encouraged to contact ENA and associate their public and confidential data with a “datahub” at ENA accessible by the EukBank team. A dedicated sample checklist was created for this project to ensure the capture of high-quality, standardised metadata. There was an excellent response from the community with 190 contributors submitting approximately 160 datasets. The data have geographic coverage across various planetary biomes from oceans to freshwater environments, soils, and forests. Preliminary analysis has begun on the data and final results are expected to be available by early 2019.
In this presentation, I will cover the mechanisms put into place to mobilise a community and their data around a particular scientific challenge; describe how these leverage an existing core data resource and community standards and deliver FAIR practice; and preview the key outputs so far from the analysis.
186 Data Integration and Visualization at the 4D Nucleome Data Portal
Harvard Medical School
The 4D Nucleome Consortium aims to elucidate the nature of DNA conformation in the nucleus both spatially and temporally. The 4DN Data Portal (https://data.4dnucleome.org/) was created as a community resource to collect and disseminate data generated by the Consortium with complete experimental metadata following a FAIR approach. The data portal contains many innovative experiment types that probe the 3D interactions of chromatin, including Hi-C methods, DamID-seq, Repli-seq, super-resolution microscopy, and more. In addition to making data available, a key goal in our data portal design is to allow users to easily visualize and compare datasets interactively. HiGlass (http://higlass.io) is a tool that allows visualization of pairwise contact matrices (a primary output of chromatin conformation experiments) and 1D genomics tracks (such as ChIP-seq peak calls). HiGlass allows users to compare multiple files at once, as well as to zoom in and out or navigate to other genomic regions by dragging the mouse, much like Google Maps. A HiGlass data server has been incorporated into the 4DN Data Portal and we have implemented a new HiGlass Workspace feature. Combining the responsive interactive visualization capabilities of HiGlass with the portal query interface based on the curated metadata is an incentive to ensure that the data is well curated and directly adds value to the curation work. The workspace starts with a blank display, and allows users to select files to visualize. Then after navigating to a region of interest, the display can be saved, preserving the view with location and zoom level. This becomes an easy and engaging way to compare results of different Hi-C experiments, as well as a means to compare Hi-C contact matrices to results from other types of experiments like DamID-seq and Repli-seq. Thus, the HiGlass Workspace represents a crucial step towards the ability of our data portal to provide and represent data integration from various experiment types.
187 BioStudies – database of biological studies
BioStudies is a new database at the European Bioinformatics Institute (EBI) that aims to address current limitations within the traditional structured data archives available to scientists. BioStudies is able to accept and store data from new and emergent technologies whose output cannot be supported by the current EBI data resources. BioStudies is also able to link to data in other databases; this is particularly advantageous in multiomic studies where data have been deposited in a number of repositories but with no central description. Owing to its unstructured nature, BioStudies is also able to store the supplementary data associated with publications. A simple tab-delimited text format, PageTab, has been developed to capture all the information about a study, including metadata, protocols, links and files; it allows for a hierarchical structure of study parts, if required. Users can make submissions through an online tool that allows the submitter to enter metadata (including the data release date), upload files directly, link to already deposited data and add associated publication information. The tool enables users to maintain and edit their own BioStudies records. The online tool can be customised with a template for specific projects. BioStudies is currently working with the community to develop templates to capture biological image data. The archiving of most biological image data is in its infancy, and as such BioStudies and the community are developing a resource to ensure that the metadata associated with biological image studies are consistent and enable other members of the community to filter, download and reanalyse biological image data.
188 User-focused development of the NHGRI-EBI Genome-Wide Association Studies Catalog
European Molecular Biology Laboratory, European Bioinformatics Institute
The NHGRI-EBI GWAS Catalog is a comprehensive and widely-used database of published genome-wide association studies (GWAS), providing links between genetic loci and complex traits. As of December 2018, the Catalog contains over 3,600 publications and 81,000 unique SNP-trait associations. The increasing complexity of GWAS publications in recent years presents challenges for the accurate and user-friendly display of Catalog data. To meet this challenge, we engaged with the GWAS community to develop a new user interface, released in September 2018. We analysed the most common search terms and conducted user interviews to define and prioritise improvements. We also published a “labs” site for users to test proposed features and conducted face-to-face user testing throughout the development process. Following release, we sought feedback and continued to make incremental changes in response to users’ needs. The new interface enables more specific and intuitive access to Catalog data. Individual pages for each publication, study, trait, variant and gene allow us to make available information tailored to each of these key concepts. Each page includes supporting information to provide context, a table of associations and links to external resources. New visualisations support further analyses of Catalog data. A plot of the linkage disequilibrium landscape allows users to investigate a region of association tagged by a GWAS variant and prioritise causal variants. Users can also visualise associations across the genome by trait in a LocusZoom plot. The GWAS Catalog now hosts full p-value summary statistics in addition to curated associations. Access to summary statistics is now integrated within the user interface, providing users with direct links to available files, in addition to API access. These changes respond to developments in the GWAS field as well as user feedback and will ensure that the GWAS Catalog continues to be a valuable resource for the scientific community.
189 TriTrypDB: A web-based resource offering the improvement of structural gene models of pathogenic kinetoplastids through community annotation using Apollo.
University of Liverpool
Kinetoplastids, a diverse group of protists including Trypanosoma and Leishmania, are the aetiological agents of sleeping sickness, Chagas disease and visceral and cutaneous leishmaniasis. Transmitted by invertebrate vectors, kinetoplastids have multiple life stage forms and cause mortality and morbidity worldwide, particularly in tropical and subtropical regions. TriTrypDB (http://TriTrypDB.org) is a free, online functional genomics database that supports translational research. TriTrypDB integrates a broad array of current and historical datasets with advanced search and data visualization tools for mining data without command line programming. The resource is a component site of the larger Eukaryotic Pathogen Bioinformatics Resource Center (http://EuPathDB.org) in collaboration with GeneDB and is supported by the Wellcome Trust (UK) and NIH/NIAID (USA). The database hosts over 45 organism genomes featuring the important parasites Trypanosoma brucei, Trypanosoma cruzi, Leishmania spp. and closely related species. Genomic sequence and annotation are integrated with data types that include transcriptomics, proteomics, epigenomics, metabolomics, population resequencing, clinical data, and host-pathogen interactions as well as data from the TrypTag, TrypanoCyc and LeishCyc projects. In-house functional genome curation is supplemented with community annotation through the Comment platform. TriTrypDB will shortly offer the Apollo platform for community genome structural annotation. With Apollo, users can improve current gene models based on multi-omic data presented on JBrowse tracks. User-generated improvements are reviewed by the TriTrypDB curatorial team and incorporated into the genome annotation where appropriate. Community-based Apollo structural annotation will take advantage of the expertise within the kinetoplastid community to improve the accuracy of these important genomes.
190 FlyBase community curation and outreach activities
Urbano, Jose M.
FlyBase-University of Cambridge
FlyBase (www.flybase.org), the primary genetics and genomics resource for Drosophila melanogaster, is an essential online resource for the fruit fly research community. The role of FlyBase extends beyond the distribution of curated data. We also engage with our users to promote new database content (e.g. Newsletters, homepage commentaries, Twitter), to educate about FlyBase tools (e.g. video tutorials), to seek feedback and input (e.g. surveys, community curation), and to facilitate communication between database users (e.g. researcher directory, FlyBase forum). Here, we present these different outreach and communication activities. Our ultimate aim is to improve the utility of FlyBase for our core community of Drosophila researchers as well as to attract additional users.
191 CNGBdb: China National GeneBank DataBase
Giga Science Database, BGI-Hong Kong
The China National GeneBank (CNGB), based in Shenzhen, was established in January 2011 and officially opened on 22 September 2016 as a platform for integrating resources and capabilities to support the life sciences and bio-economy in China. Based on underlying big data and cloud computing technologies, CNGB is able to provide a variety of data services, including: a sequence archive (CNSA); a controlled access database (CDA) for restricted access biological data; analysis (e.g. BLAST, DISSECT); and knowledge search (CNGBdb Search). The CNSA (CNGB Nucleotide Sequence Archive) accepts sequence read data submissions from global researchers and includes data synchronization to the internationally recognized NCBI and EBI SRA databases. By December 2018, CNSA had supported 3035 research projects with an archived data volume of nearly 600 TB. The metadata in CNSA are accessible and indexed in a searchable resource (CNGBdb). To improve the exposure and citation rate of data, CNSA is also able to offer a DOI service to submitters of data that meet required standards. Here, we present CNGBdb (https://db.cngb.org/), a unified platform for biological big data sharing and application services in both Chinese and English. CNGBdb aggregates data from multiple sources and adopts and promotes the data structures and standards of relevant international organizations in omics, health, and medicine, such as INSDC, the Global Alliance for Genomics and Health (GA4GH), the Global Genome Biodiversity Network (GGBN), and the American College of Medical Genetics and Genomics (ACMG). CNGBdb follows the FAIR Data Principles and the FORCE Data Citation Principles. CNGBdb has already integrated a large amount of biological data from CNGB, NCBI, EBI and other sources, and will continue to expand to fulfil the goal of becoming a global resource for life science researchers and industry. All public data and data services provided by CNGBdb are freely available to all users worldwide.
192 Maximising community participation in the FAIR-sharing of data from small-scale publications
University of Cambridge
Two major outputs of biological research are the data produced, and the knowledge gained from these data. Integration of standardized published data into online databases is a widely acknowledged mechanism to accelerate knowledge distribution, and thereby increase the value of research efforts and funding. To make data integration more comprehensive and efficient, PomBase has developed Canto, an intuitive online curation tool that enables publication authors to participate directly in the semantically standardized "encoding" of detailed functional data. Critically, curation in Canto leads to the FAIR-sharing of data from small-scale publications where newly acquired, detailed biological knowledge from hypothesis-driven experiments would otherwise remain buried. Community curation also helps PomBase increase knowledge integration and dissemination despite current constraints on funding for biocuration activities. As the fission yeast community's curation response rate approaches 50%, we review the project's procedures and metrics, the unanticipated added value from co-curation by publication authors and professional curators, and incentives and "nudges" we deploy to maximize participation.
193 Research on the Collaborative Mechanisms for Journals and Data Repositories in the Integrated Publication of Papers and Scientific Data
Institute of Medical Information & Library, Chinese Academy of Medical Sciences
OBJECTIVE: To clarify the collaborative mechanisms for journals and data repositories in the integrated publication of papers and scientific data by investigating well-known journals and data repositories in the field of biomedicine, and to provide a reference for optimizing these mechanisms, thereby promoting the integration, management and sharing of scientific data. METHODS: Ten well-known journals were selected for the survey, including BioMed Central, Nature and Scientific Data. Ten typical data repositories were selected as well, including GenBank, UniProt, Dryad and Genome Sequence Archive. The research compares and analyzes how different journals and data repositories carry out the integrated submission, associated review, linked release, synchronized update, and joint referencing of papers and data through mutual authentication and collaboration. RESULTS: Three different collaborative mechanisms were summarized, and the integrated publication workflow of papers and data in three different situations will be detailed and comparatively analyzed in the text. CONCLUSION: We provide suggestions for further cooperation between journals and repositories, such as establishing an integrated submission platform that allows authors to submit data and papers through a single entry point; conducting a dual data review in a semi-manual or semi-automatic manner; collecting as much detailed metadata as possible and automatically building bidirectional links between papers and data; and improving joint reference specifications for papers and data.
194 Enabling Findability, Accessibility, Interoperability, and Reusability with Improved Data Representation of Carbohydrates in the Protein Data Bank
RCSB Protein Data Bank
The Protein Data Bank (PDB) is the single global repository for experimentally determined 3D macromolecular structures and their ligand complexes. The Worldwide Protein Data Bank (wwPDB) is the international collaboration that manages the PDB Core Archive according to the FAIR Principles: Findability, Accessibility, Interoperability, and Reusability. The PDB archive now holds > 145,000 structures of biological macromolecules, ~10% of which contain carbohydrates. A major focus of the wwPDB is maintaining archival consistency and accuracy. As the PDB grows, developments in structure determination methods and technologies can challenge how structures are represented. The wwPDB addresses these challenges with remediation efforts to improve data representation. Carbohydrates play key roles in generation of energy, cell signaling, and recognition of markers. Understanding the structure of carbohydrates is critical to comprehending their biological roles in health and disease. The PDB is the preeminent repository of 3D structural data, which are important for advancing the discovery and development of therapeutic agents and for generating insight into biological processes. However, the complex nature of carbohydrates places unique demands on data representation that were not envisioned when the PDB was established in 1971. In collaboration with the glycoscience community, carbohydrate-appropriate annotation tools are being developed and implemented within the wwPDB OneDep system for deposition, validation, and biocuration. These software tools will provide standard nomenclature following IUPAC/IUBMB and uniform oligosaccharide representation. The carbohydrate remediation project is a wwPDB collaborative project being carried out principally by RCSB PDB at Rutgers, The State University of New Jersey with funding from NIGMS Grant U01 CA221216 in collaboration with the Complex Carbohydrate Research Center at the University of Georgia. RCSB PDB is jointly funded by NSF, NIGMS, NCI, and DOE (DBI-1338415).