Adding phylomaps to phylomapr
2023-10-23
Adding_phylomaps.Rmd
This package aims to provide easy access to pre-computed phylomaps.
But the tree of life is huge (there are at least millions of
eukaryotic species) and so the production of phylomaps, be it for
evolutionary transcriptomics or other
genomics/transcriptomics/proteomics studies, is a joint effort. For
advanced gene age mappers who have generated their own phylomaps and
would like to contribute to phylomapr, here is a short
tutorial to do just that!
Easy option
The easiest way (easy here for you, not for me) to contribute is to
post a GitHub issue with a link to where the phylomap is hosted,
or to the paper that contains the phylomap, and describe the method that was
used to obtain it (e.g. the version of the gene age inference tool, such
as GenEra, and the parameters of the sequence aligner). The
maintainers of phylomapr can then add the phylomap manually from the
link and description provided.
Advanced option
For even more advanced phylomappers, here is how you can contribute.
Note: this might change if we move the site where phylomaps are hosted.
- Fork the phylomapr repository and clone it in R (see tutorial here: read until the end of the B. Rstudio section).
If a link to where the phylomap is hosted exists:
- Open DATASET.R in the directory data-raw and add the download line, e.g.
######### Manley et al., 2023 #########
# download the Phylostratigraphic Maps from Manley et al., 2023
# Rhizophagus irregularis
download.file(
  url = "https://zenodo.org/record/7713976/files/Rhizophagus_irregularis_DAOM197198_1432141_phyloranks.tsv",
  destfile = "data-raw/Rhizophagus_irregularis_DAOM197198_1432141_phyloranks.tsv"
)
Here you would describe your own dataset (in the # comment lines)
and use your own download link.
- In DATASET.R, create a [Species].PhyloMap object from the downloaded file. There are many ways to do this; the code below is just one example. Make sure that the name of the [Species].PhyloMap object isn’t already in use. If it is, add more information to the name, e.g. the gene ID convention, as in [Species].ENSEMBL.PhyloMap.
# load package readr
library(readr)
### Phylostratigraphic Maps
# Rhizophagus irregularis
Rhizophagus_irregularis.data <- readr::read_tsv("data-raw/Rhizophagus_irregularis_DAOM197198_1432141_phyloranks.tsv")
# keep only the two columns phylomapr expects, renaming PS to Phylostratum
Rhizophagus_irregularis.PhyloMap <-
  dplyr::select(
    Rhizophagus_irregularis.data,
    Phylostratum = PS,
    GeneID
  )
The most important thing is that the [Species].PhyloMap object has the phylostratum as its first column, titled Phylostratum, and the gene ID as its second column, titled GeneID.
> phylomapr::Rhizophagus_irregularis.PhyloMap
# A tibble: 31,217 × 2
   Phylostratum GeneID
          <dbl> <chr>
 1            8 g100-T1
 2            6 g1000-T1
 3            2 g10000-T1
 4            9 g10001-T1
 5            1 g10002-T1
 6            1 g10003-T1
 7            1 g10004-T1
 8            1 g10005-T1
 9            1 g10006-T1
10            1 g10007-T1
# ℹ 31,207 more rows
# ℹ Use `print(n = ...)` to see more rows
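The column layout described above can be checked before saving. The following is a minimal sketch in base R, not part of phylomapr: the helper name check_phylomap is hypothetical, and the requirements that Phylostratum is numeric without missing values and that gene IDs are unique are assumptions added here for illustration.

```r
# Hypothetical helper (not part of phylomapr): check the two-column layout.
check_phylomap <- function(x) {
  stopifnot(
    is.data.frame(x),
    ncol(x) == 2,
    # first column Phylostratum, second column GeneID, in that order
    identical(names(x), c("Phylostratum", "GeneID")),
    # assumptions for illustration: numeric ranks, no NA, unique gene IDs
    is.numeric(x$Phylostratum),
    !anyNA(x$Phylostratum),
    anyDuplicated(x$GeneID) == 0
  )
  invisible(TRUE)
}

# Toy data for illustration only (first rows of the printed example).
toy.PhyloMap <- data.frame(
  Phylostratum = c(8, 6, 2),
  GeneID = c("g100-T1", "g1000-T1", "g10000-T1")
)
check_phylomap(toy.PhyloMap)
```

The helper stops with an error if any condition fails, so it could be placed in DATASET.R just before the object is saved.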
- At the bottom of DATASET.R, run usethis::use_data() on the [Species].PhyloMap object you’ve created, e.g.
usethis::use_data(Rhizophagus_irregularis.PhyloMap, overwrite = TRUE)
which leads to this:
✔ Saving 'Rhizophagus_irregularis.PhyloMap' to 'data/Rhizophagus_irregularis.PhyloMap.rda'
• Document your data (see 'https://r-pkgs.org/data.html')
- Run the lines you’ve added in DATASET.R.
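After running these lines, it can be reassuring to confirm that the saved .rda file loads back unchanged. The sketch below uses base R save() and load() on a temporary file as a stand-in for the file that usethis::use_data() writes under data/. The object Toy_species.PhyloMap and its gene IDs are made up for illustration.

```r
# Toy object standing in for a real [Species].PhyloMap (illustration only).
Toy_species.PhyloMap <- data.frame(
  Phylostratum = c(1, 2, 1),
  GeneID = c("geneA", "geneB", "geneC")
)

# Save to a temporary file instead of data/ so the sketch has no side effects.
rda_path <- file.path(tempdir(), "Toy_species.PhyloMap.rda")
save(Toy_species.PhyloMap, file = rda_path)

# Load into a fresh environment and compare with the original object.
loaded <- new.env()
load(rda_path, envir = loaded)
round_trip_ok <- isTRUE(all.equal(loaded$Toy_species.PhyloMap, Toy_species.PhyloMap))
round_trip_ok
```

For a real phylomap, the same comparison could be run against the file in data/ by pointing rda_path at it.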
For the next steps, go to the section Documenting the new phylomap.
If you want to load your phylomaps (.tsv or .csv files etc.) manually:
- Add the raw output of gene age inference (e.g. [taxid]_gene_ages.tsv from GenEra) to the directory data-raw.
- In DATASET.R, create a [Species].PhyloMap object from the added raw file. There are many ways to do this; the code below is just one example. Make sure that the name of the [Species].PhyloMap object isn’t already in use. If it is, add more information to the name, e.g. the gene ID convention, as in [Species].ENSEMBL.PhyloMap.
######### Strongylocentrotus purpuratus GenEra test #########
Strongylocentrotus_purpuratus.data <- readr::read_tsv("data-raw/7668_gene_ages.tsv")
Strongylocentrotus_purpuratus.PhyloMap <-
  dplyr::select(
    Strongylocentrotus_purpuratus.data,
    Phylostratum = rank,
    GeneID = `#gene`
  )
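Since there are many ways to build the object, here is one alternative sketch that uses only base R instead of readr and dplyr. It writes a tiny made-up file to a temporary directory so that it runs on its own. The column names `#gene` and `rank` are taken from the example above; the file name and the object names are hypothetical.

```r
# Toy stand-in for a GenEra-style output file; real files have many more rows.
tsv_path <- file.path(tempdir(), "toy_gene_ages.tsv")
writeLines(
  c("#gene\trank", "NP_001001474.1\t1", "NP_001001477.1\t2"),
  tsv_path
)

# check.names = FALSE keeps the literal `#gene` header;
# read.delim() does not treat `#` as a comment character by default.
toy.data <- read.delim(tsv_path, check.names = FALSE, stringsAsFactors = FALSE)

# Put Phylostratum first and GeneID second.
Toy_species.PhyloMap <- data.frame(
  Phylostratum = toy.data[["rank"]],
  GeneID = toy.data[["#gene"]]
)
Toy_species.PhyloMap
```

The result is a plain data frame rather than a tibble; wrapping it in tibble::as_tibble() would match the printed examples more closely.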
The most important thing is that the [Species].PhyloMap object has the phylostratum as its first column, titled Phylostratum, and the gene ID as its second column, titled GeneID.
> phylomapr::Strongylocentrotus_purpuratus.PhyloMap
# A tibble: 38,475 × 2
   Phylostratum GeneID
          <dbl> <chr>
 1            1 NP_001001474.1
 2            1 NP_001001475.1
 3            1 NP_001001476.1
 4            2 NP_001001477.1
 5            1 NP_001001478.1
 6            4 NP_001001768.2
 7            1 NP_001001906.1
 8            2 NP_001003798.1
 9            1 NP_001005725.1
10            1 NP_001008790.1
# ℹ 38,465 more rows
# ℹ Use `print(n = ...)` to see more rows
- At the bottom of DATASET.R, run usethis::use_data() on the [Species].PhyloMap object you’ve created, e.g.
usethis::use_data(Strongylocentrotus_purpuratus.PhyloMap, overwrite = TRUE)
✔ Saving 'Strongylocentrotus_purpuratus.PhyloMap' to 'data/Strongylocentrotus_purpuratus.PhyloMap.rda'
• Document your data (see 'https://r-pkgs.org/data.html')
- Run the lines you’ve added in DATASET.R.
For the next steps, go to the section Documenting the new phylomap.
Documenting the new phylomap
- Add the description for your new phylomap using usethis::use_r(), i.e. usethis::use_r("[Species].R") (note: the file name format used in phylomapr doesn’t contain the .PhyloMap suffix). For example, to add the description for Strongylocentrotus purpuratus:
usethis::use_r("Strongylocentrotus_purpuratus.R")
The description file states how the gene ages were inferred, any notes
on the parameters, the number of rows in the data frame, the number of
variables, the source (under @source) and finally the name of the object
as a string, "[Species].PhyloMap".
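The row and variable counts in the @format line are easy to get wrong when typed by hand. As a minimal sketch, they can be generated from the object itself. The helper name format_line and the toy data are hypothetical and not part of phylomapr.

```r
# Hypothetical helper (not part of phylomapr): build the roxygen @format line
# from the object so the counts match the data.
format_line <- function(x) {
  sprintf(
    "#' @format A tibble with %s rows and %d variables:",
    format(nrow(x), big.mark = ","),
    ncol(x)
  )
}

# Toy data for illustration only.
toy.PhyloMap <- data.frame(
  Phylostratum = rep(1, 1200),
  GeneID = paste0("gene", seq_len(1200))
)
writeLines(format_line(toy.PhyloMap))
# prints: #' @format A tibble with 1,200 rows and 2 variables:
```

The resulting line can be pasted into the description file, as in the Strongylocentrotus purpuratus example that follows.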
#' Phylomap of Strongylocentrotus purpuratus
#'
#' Gene ages inferred using [GenEra](https://github.com/josuebarrera/GenEra) on reference protein sequences from NCBI protein coding genes (https://www.ncbi.nlm.nih.gov/datasets/gene/taxon/7668/?gene-type=Protein-coding).
#' Note: [DIAMOND](https://github.com/bbuchfink/diamond) was run using the `ultra-sensitive mode`.
#'
#' @format A tibble with 38,475 rows and 2 variables:
#' \describe{
#' \item{Phylostratum}{dbl Phylostratum (or gene age) assignment}
#' \item{GeneID}{proteinID annotation from NCBI}
#' }
#' @source
#' Bermejo (2023) _Unpublished_
#' \url{https://github.com/LotharukpongJS/phylomapr}
"Strongylocentrotus_purpuratus.PhyloMap"
- Document your new phylomap using devtools::document() or roxygen2::roxygenise(). Tada! The phylomap is documented such that users can run ?[Species].PhyloMap and obtain the source and other information about the phylomap! Optionally, run devtools::check() to ensure that the package is looking good after the changes.
- Under the Git tab in RStudio (where the other tabs include Environment, History, Connections, Build, etc.), stage the changed files, commit the changes (with a description) and push your changes to your fork. The staging, commit and push can alternatively be done on the command line. Then, on GitHub, make a pull request.
Done!