PubTator provides automatic annotations of biomedical concepts such as genes and mutations in PubMed abstracts and PMC full-text articles [3-4]. Annotations can be viewed in a web interface or downloaded through a RESTful API or by FTP. Downloaded annotations are provided in BioC JSON and BioC XML formats [5] (full-text articles) and in PubTator format (title and abstract), as described here.

PubTator annotations

PubTator annotations are created by machine-learning concept recognition systems and then disambiguated with deep learning for improved accuracy. Each identified concept is linked to one of several biomedical resources:

  1. Genes and proteins are annotated by GNormPlus and linked to NCBI Gene.
  2. Chemicals are annotated by a concept recognition system using BlueBERT, an extension of the BERT deep learning transformer model, and linked to Medical Subject Headings (MeSH).
  3. Diseases are annotated by TaggerOne and linked to the MEDIC disease vocabulary, which includes both Medical Subject Headings (MeSH) and OMIM.
  4. Cell lines are annotated by TaggerOne and linked to Cellosaurus.
  5. Species are annotated by SR4GN and linked to NCBI Taxonomy.
  6. Genomic variants are annotated by tmVar and linked to dbSNP.

NOTE: While our annotation tools are state-of-the-art, all automated tools are imperfect and their annotations will contain some errors.
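
The tool and resource pairings listed above can be kept as a small lookup table when post-processing downloaded annotations. The following is a minimal sketch in Python. The dictionary keys, the `AnnotationSource` type, and the `describe` helper are hypothetical names chosen for illustration. They are not part of PubTator and may not match the type strings used in the annotation files.

```python
from typing import NamedTuple


class AnnotationSource(NamedTuple):
    """Tool that produces an annotation and the resource it links to."""
    tool: str
    resource: str


# Keys are illustrative labels for this sketch (an assumption); they are not
# necessarily the concept type strings that appear in PubTator files.
ANNOTATION_SOURCES = {
    "gene": AnnotationSource("GNormPlus", "NCBI Gene"),
    "chemical": AnnotationSource("BlueBERT-based recognizer", "MeSH"),
    "disease": AnnotationSource("TaggerOne", "MEDIC (MeSH and OMIM)"),
    "cell_line": AnnotationSource("TaggerOne", "Cellosaurus"),
    "species": AnnotationSource("SR4GN", "NCBI Taxonomy"),
    "variant": AnnotationSource("tmVar", "dbSNP"),
}


def describe(concept_type: str) -> str:
    """Return a one-line summary for a concept type label (case-insensitive)."""
    source = ANNOTATION_SOURCES[concept_type.lower()]
    return f"{concept_type}: annotated by {source.tool}, linked to {source.resource}"


if __name__ == "__main__":
    for label in ANNOTATION_SOURCES:
        print(describe(label))
```

Keeping the pairings in one table makes it easy to report which tool produced a given annotation, which is useful when reviewing the errors that automated annotation inevitably introduces.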

