Gene names in different databases

Motivation

The GeneID labels differ across databases. This is a potential problem when gene age inference has been performed from protein sequences obtained from one database (e.g. UniProtKB) and the RNA-seq quantification has been performed on genes from another database (e.g. ENSEMBL).

Solution 1

In phylomapr, we try to mitigate this issue using the helper function convertID(), which essentially wraps around the function biomart() in the biomartr package. In this article, we give an example of how C. elegans phylomaps can be converted from the uniprot format to the wormbase format.

Typically, the gene IDs can be converted to the ENSEMBLE gene IDs, whose biomartr attributes are ensembl_gene_id and ensembl_peptide_id by default.

# default run of convertID
Caenorhabditis_elegans.PhyloMap.ENSEMBL <- 
  phylomapr::convertID(
    phylomap = phylomapr::Caenorhabditis_elegans.PhyloMap,
    mart = "ENSEMBL_MART_ENSEMBL",
    dataset = "celegans_gene_ensembl",
    filters = "uniprot_gn_id"
    )

If you run this, the gene IDs are named as e.g. WBGene00000001. However, gene IDs in other RNA-seq experiments employ other gene names. In the case of C. elegans this may be wormbase IDs (e.g. 2L52.1a). How can we convert the phylomaps to accommodate this?

First, the organismFilters function in biomartr can be used to obtain the attributes. Here, we ask for the topic worm because we are interested in wormbase IDs

c_elegans_BM.filt <- organismFilters(organism = "Caenorhabditis elegans", topic = "worm")

c_elegans_BM.filt

#> # A tibble: 72 × 4
#>    name                     description                                             dataset               mart                
#>    <chr>                    <chr>                                                   <chr>                 <chr>               
#>  1 with_wormbase_cds        With WormBase CDS ID(s)                                 celegans_gene_ensembl ENSEMBL_MART_ENSEMBL
#>  2 with_wormbase_gene       With WormBase Gene ID(s)                                celegans_gene_ensembl ENSEMBL_MART_ENSEMBL
#>  3 with_wormbase_gseqname   With WormBase Gene Sequence-name ID(s)                  celegans_gene_ensembl ENSEMBL_MART_ENSEMBL
#>  4 with_wormbase_locus      With WormBase Locus ID(s)                               celegans_gene_ensembl ENSEMBL_MART_ENSEMBL
#>  5 with_wormbase_transcript With WormBase Transcript ID(s)                          celegans_gene_ensembl ENSEMBL_MART_ENSEMBL
#>  6 with_wormpep_id          With Wormpep ID ID(s)                                   celegans_gene_ensembl ENSEMBL_MART_ENSEMBL
#>  7 wormbase_cds             WormBase CDS ID(s) [e.g. 2L52.1a]                       celegans_gene_ensembl ENSEMBL_MART_ENSEMBL
#>  8 wormbase_gene            WormBase Gene ID(s) [e.g. WBGene00000001]               celegans_gene_ensembl ENSEMBL_MART_ENSEMBL
#>  9 wormbase_gseqname        WormBase Gene Sequence-name ID(s) [e.g. WBGene00000001] celegans_gene_ensembl ENSEMBL_MART_ENSEMBL
#> 10 wormbase_locus           WormBase Locus ID(s) [e.g. WBGene00000001]              celegans_gene_ensembl ENSEMBL_MART_ENSEMBL
#> # ℹ 62 more rows

After scrolling through the results, we observe that the attribute wormbase_cds should work as the IDs are formatted in the 2L52.1a.1-form.

Next, we convert the GeneID labels from uniprot to wormbase using "wormbase_cds" as the attribute.

# here we use the dataset phylomapr::Caenorhabditis_elegans.PhyloMap
Celegans.PhyloMap <- phylomapr::Caenorhabditis_elegans.PhyloMap

Celegans.PhyloMap.wormbase <-
  phylomapr::convertID(
    phylomap = Celegans.PhyloMap,
    mart = "ENSEMBL_MART_ENSEMBL",
    dataset = "celegans_gene_ensembl",
    attributes = c("wormbase_cds"),
    filters    = "uniprot_gn_id")

#> Starting Gene ID conversion...

#> Starting BioMart query ...

#> 
#> 
#> Please cite: Drost HG, Paszkowski J. Biomartr: genomic data retrieval with R. Bioinformatics (2017) 33(8): 1216-1217. doi:10.1093/bioinformatics/btw821.

> Celegans.PhyloMap
# A tibble: 19,818 × 2
   Phylostratum GeneID                   
          <dbl> <chr>                    
 1            2 sp|A0A061ACU2|PIEZ1_CAEEL
 2            3 sp|A0A078BQN7|FKH3_CAEEL 
 3            1 sp|A0A078BQP2|GCY25_CAEEL
 4           10 sp|A0A0K3AUE4|SEA2_CAEEL 
 5            1 sp|A0A0K3AUJ9|PRDX_CAEEL 
 6            1 sp|A0A0K3AV08|MLK1_CAEEL 
 7            1 sp|A0A0K3AWM6|MOM5_CAEEL 
 8            3 sp|A0A0K3AXH1|ARID1_CAEEL
 9            1 sp|A0A131MBU3|IRG7_CAEEL 
10            1 sp|A0A131MCZ8|CNNM3_CAEEL
# ℹ 19,808 more rows
# ℹ Use `print(n = ...)` to see more rows

> Celegans.PhyloMap.wormbase
# A tibble: 16,319 × 2
   Phylostratum GeneID  
          <dbl> <chr>   
 1            2 2RSSE.1a
 2            2 2RSSE.1b
 3            2 2RSSE.1c
 4           14 3R5.1a  
 5           14 3R5.1b  
 6            1 4R79.1a 
 7            1 4R79.1b 
 8            1 AC3.10  
 9           11 AC3.1a  
10           11 AC3.1b  
# ℹ 16,309 more rows
# ℹ Use `print(n = ...)` to see more rows

Tada! While the conversion issue between databases are mitigated, there are still some aspects to be improved. For example, we can see a drop in the number of genes in the phylomap. Furthermore, a bit of fidgeting with biomartr is needed. However, this should still be faster than running gene age inference from scratch!

Solution 2

If the helper function convertID() in phylomapr doesn’t work out well, there are other solutions involving fast and sensitive sequence alignment, though the sequences must be downloaded. To achieve this, diamond_protein_to_protein_best_hits() or diamond_protein_to_protein_best_reciprocal_hits() in rdiamond (or diamond_best() or diamond_rec() in orthologr) can be used. Here, we show the example with rdiamond::diamond_protein_to_protein_best_hits().

Setting up the example

To set up the example, we download the fasta sequences of each proteome using biomartr.

file_path.ENSEMBL <- biomartr::getProteome(
  db       = "ensembl",
  organism = "Caenorhabditis elegans",
  path     = file.path("_ensembl_downloads","proteomes") )

file_path.UNIPROT <- biomartr::getProteome(
  db       = "uniprot",
  organism = "Caenorhabditis elegans",
  path     = file.path("_uniprot_downloads","proteomes") )

Running solution for C. elegans

Now run we run rdiamond::diamond_protein_to_protein_best_hits() to obtain the best hits

# obtain the best hit from ENSEMBL in the UNIPROT dataset
best_rec_hits <- rdiamond::diamond_protein_to_protein_best_hits(
  query = file_path.ENSEMBL,
  subject = file_path.UNIPROT,
  sensitivity_mode = "fast", # change to ultra-sensitive to increase time.
  cores = 8)

# select only subject and query
ENSEMBL_to_UNIPROT <- best_rec_hits %>% 
  dplyr::filter(perc_identity == 100) %>% # filter hits with percent identity 100% (assuming no misalignments have occured)
  dplyr::select(1,2) %>%
  dplyr::rename(ensembl = 1) %>%
  dplyr::rename(GeneID = 2)

# replace previous UNIPROT IDs with the new ENSEMBL IDs
Celegans.PhyloMap.UNIPROT <- phylomapr::Caenorhabditis_elegans.PhyloMap

Celegans.PhyloMap.ENSEMBL <- Celegans.PhyloMap.UNIPROT %>%
  dplyr::left_join(ENSEMBL_to_UNIPROT) %>% # join the ENSEMBL IDs to UNIPROT PhyloMap 
  tidyr::drop_na() %>% # remove rows that do not have corresponding IDs
  dplyr::select(-GeneID) %>% # remove the previous UNIPROT IDs
  dplyr::rename(GeneID = ensembl) # rename the ensembl column as the GeneID

Tada! Previously, the phylomap had UNIPROT IDs, i.e.

> Celegans.PhyloMap.UNIPROT
# A tibble: 19,818 × 2
   Phylostratum GeneID                   
          <dbl> <chr>                    
 1            2 sp|A0A061ACU2|PIEZ1_CAEEL
 2            3 sp|A0A078BQN7|FKH3_CAEEL 
 3            1 sp|A0A078BQP2|GCY25_CAEEL
 4           10 sp|A0A0K3AUE4|SEA2_CAEEL 
 5            1 sp|A0A0K3AUJ9|PRDX_CAEEL 
 6            1 sp|A0A0K3AV08|MLK1_CAEEL 
 7            1 sp|A0A0K3AWM6|MOM5_CAEEL 
 8            3 sp|A0A0K3AXH1|ARID1_CAEEL
 9            1 sp|A0A131MBU3|IRG7_CAEEL 
10            1 sp|A0A131MCZ8|CNNM3_CAEEL
# ℹ 19,808 more rows
# ℹ Use `print(n = ...)` to see more rows

Now, we have ENSEMBL gene IDs.

> Celegans.PhyloMap.ENSEMBL
# A tibble: 26,425 × 2
   Phylostratum GeneID      
          <dbl> <chr>       
 1            2 C10C5.1g.1  
 2            2 C10C5.1i.1  
 3            2 C10C5.1k.1  
 4            2 C10C5.1l.1  
 5            1 Y105C5B.2a.1
 6           10 K10G6.3a.1  
 7           10 K10G6.3c.1  
 8           10 K10G6.3e.1  
 9           10 K10G6.3f.1  
10           10 K10G6.3g.1  
# ℹ 26,415 more rows
# ℹ Use `print(n = ...)` to see more rows

The main issue with solution 2 is the increased number of dependencies, including the sequence aligner DIAMOND, though one can imagine that other aligners could be used such as MMSeqs2 or BLAST. The most important thing here is that the corresponding IDs can be obtained and that users cite the package/software that has been used.

2023-10-23

Motivation

Solution 1

Solution 2

Setting up the example

Running solution for C. elegans