
Adding phylomaps to phylomapr
2023-10-23
Adding_phylomaps.RmdThis package aims to provide easy access to pre-computed phylomaps.
But the tree of life is huge (there are at least millions of
eukaryotic species) and so the production of phylomaps, be it for
evolutionary transcriptomics or other
genomics/transcriptomics/proteomics studies, is a joint effort. For
advanced gene age mappers who have generated their own phylomaps and
would like to contribute to phylomapr, here is a short
tutorial to do just that!
Easy option
The easiest way (easy here for you, not for me) to contribute is to
just post a github issue with the link to where the phylomap is hosted
or the paper that contains the phylomap and describe the method that was
employed to obtain it (e.g. the version of gene age inference tools such
as GenEra and parameters of the sequence aligner)! The
maintainers of phylomapr can then add the phylomaps manually from the
link and description provided.
Advanced option
For even more advanced phylomappers, here is how you can contribute.
Note: this might change if we move the site where phylomaps are hosted.
- Fork the
phylomaprrepository and clone it in R (see tutorial here: read until the end ofB. Rstudiosection).
If a link to where the phylomap is hosted exists:
- Open the
DATASET.Rin the directorydata-rawand add the download line, e.g.
######### Manley et al., 2023 #########
# download the Phylostratigraphic Maps from Manley et al., 2023
# Rhizophagus irregularis
download.file( url = "https://zenodo.org/record/7713976/files/Rhizophagus_irregularis_DAOM197198_1432141_phyloranks.tsv",
destfile = "data-raw/Rhizophagus_irregularis_DAOM197198_1432141_phyloranks.tsv")Here you would describe the dataset (in #) differently
and have a different download link.
- In
DATASET.R, create a[Species].PhyloMapobject from the downloaded file. There are many ways to do this and this is just an example. Make sure that the name of the[Species].PhyloMapobject isn’t duplicated. If so, add some more information e.g. the geneID convention i.e.[Species].ENSEMBL.PhyloMap.
# load package readxl
library(readr)
### Phylostratigraphic Maps
# Rhizophagus irregularis
Rhizophagus_irregularis.data <-readr::read_tsv("data-raw/Rhizophagus_irregularis_DAOM197198_1432141_phyloranks.tsv")
Rhizophagus_irregularis.PhyloMap <-
dplyr::select(
Rhizophagus_irregularis.data,
Phylostratum = PS,
GeneID
)The most important thing is that the [Species].PhyloMap
object has the first column with phylostratum titled
Phylostratum and the second column with the GeneID titled
GeneID.
> phylomapr::Rhizophagus_irregularis.PhyloMap
# A tibble: 31,217 × 2
Phylostratum GeneID
<dbl> <chr>
1 8 g100-T1
2 6 g1000-T1
3 2 g10000-T1
4 9 g10001-T1
5 1 g10002-T1
6 1 g10003-T1
7 1 g10004-T1
8 1 g10005-T1
9 1 g10006-T1
10 1 g10007-T1
# ℹ 31,207 more rows
# ℹ Use `print(n = ...)` to see more rows- At the bottom of
DATASET.R, runusethis::use_data()on the[Species].PhyloMapobject you’ve created, e.g.
usethis::use_data(Rhizophagus_irregularis.PhyloMap2, overwrite = TRUE)which leads to this:
✔ Saving 'Rhizophagus_irregularis.PhyloMap' to 'data/Rhizophagus_irregularis.PhyloMap.rda'
• Document your data (see 'https://r-pkgs.org/data.html')- Run the lines you’ve added in
DATASET.R
For the next steps go to the section Documenting the new phylomap
If you want to load your phylomaps (.tsv or .csv files etc.) manually:
Add the raw output of gene age inference (e.g.
[taxid]_gene_ages.tsvfrom GenEra) to the directorydata-raw.In
DATASET.R, create a[Species].PhyloMapobject from the added raw file. There are many ways to do this and this is just an example. Make sure that the name of the[Species].PhyloMapobject isn’t duplicated. If so, add some more information e.g. the geneID convention i.e.[Species].ENSEMBL.PhyloMap.
######### Strongylocentrotus purpuratus GenEra test #########
Strongylocentrotus_purpuratus.data <-readr::read_tsv("data-raw/7668_gene_ages.tsv")
Strongylocentrotus_purpuratus.PhyloMap <-
dplyr::select(
Strongylocentrotus_purpuratus.data,
Phylostratum = rank,
GeneID = `#gene`
)The most important thing is that the [Species].PhyloMap
object has the first column with phylostratum titled
Phylostratum and the second column with the GeneID titled
GeneID.
> phylomapr::Strongylocentrotus_purpuratus.PhyloMap
# A tibble: 38,475 × 2
Phylostratum GeneID
<dbl> <chr>
1 1 NP_001001474.1
2 1 NP_001001475.1
3 1 NP_001001476.1
4 2 NP_001001477.1
5 1 NP_001001478.1
6 4 NP_001001768.2
7 1 NP_001001906.1
8 2 NP_001003798.1
9 1 NP_001005725.1
10 1 NP_001008790.1
# ℹ 38,465 more rows
# ℹ Use `print(n = ...)` to see more rows- At the bottom of
DATASET.R, runusethis::use_dataon the[Species].PhyloMapobject you’ve created, e.g.
usethis::use_data(Strongylocentrotus_purpuratus.PhyloMap, overwrite = TRUE)✔ Saving 'Strongylocentrotus_purpuratus.PhyloMap' to 'data/Strongylocentrotus_purpuratus.PhyloMap.rda'
• Document your data (see 'https://r-pkgs.org/data.html')- Run the lines you’ve added in
DATASET.R
For the next steps go to the section Documenting the new phylomap
Documenting the new phylomap
- Add the description for your new phylomapr using
usethis::use_r(), i.e.usethis::use_r("[Species].R")(note: the format used inphylomaprdoesn’t contain the.PhyloMapsuffix). For example, if I want to add the description for Strongylocentrotus purpuratus
usethis::use_r("Strongylocentrotus_purpuratus.R")The description file contains how gene ages are inferred, any notes
on the parameters, the number of rows in the data frame, the number of
variables, the source (under @source) and finally
"[Species].PhyloMap".
#' Phylomap of Strongylocentrotus purpuratus
#'
#' Gene ages inferred using [GenEra](https://github.com/josuebarrera/GenEra) on reference protein sequences from NCBI protein coding genes (https://www.ncbi.nlm.nih.gov/datasets/gene/taxon/7668/?gene-type=Protein-coding).
#' Note: [DIAMOND](https://github.com/bbuchfink/diamond) was run using the `ultra-sensitive mode`.
#'
#' @format A tibble with 38,475 rows and 2 variables:
#' \describe{
#' \item{Phylostratum}{dbl Phylostratum (or gene age) assignment}
#' \item{GeneID}{proteinID annotation from NCBI}
#' }
#' @source
#' Bermejo (2023) _Unpublished_
#' \url{https://github.com/LotharukpongJS/phylomapr}
"Strongylocentrotus_purpuratus.PhyloMap"
Document your new phylomap using
devtools::document()orroxygen2::roxygenise(). Tada! The phylomap is documented such that users can run?[Species].PhyloMapand obtain the source and other information about the phylomap! Optionally, rundevtools::check()to ensure that the package is looking good after the changes.Under the
Gittab in Rstudio (where the other tabs include Environment, History, Connection, Buld, etc.), stage the changed files, commit the changes (with description) and push your changes to your fork. The staging, commit and push can alternatively be done on the command line. Then in GitHub, make a pull request.
Done!