These functions remap the sequence IDs in a FASTA file from RefSeq IDs or UniProt accessions to gene symbols. If a single gene matches more than one protein sequence, only the longest sequence is retained.
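The "longest sequence per gene" rule can be illustrated with a small base R sketch. This is not the package's implementation: the data frame, its column names, and the accessions below are made up for illustration, and how the package breaks ties between equally long sequences is not stated here.

```r
# Toy data: accession, gene symbol, sequence (all values are illustrative)
proteins <- data.frame(
  accession = c("NP_000001.1", "NP_000002.1", "NP_000003.1"),
  symbol    = c("GeneA", "GeneA", "GeneB"),
  sequence  = c("MKT", "MKTAYIAK", "MSLQ"),
  stringsAsFactors = FALSE
)

# Sort by sequence length, longest first
ord <- order(nchar(proteins$sequence), decreasing = TRUE)
proteins <- proteins[ord, ]

# Keep only the first (longest) row for each gene symbol
longest <- proteins[!duplicated(proteins$symbol), ]
```

Here `GeneA` maps to two sequences, and only the 8-residue one survives in `longest`.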
Usage
remap_accessions_refseq_to_gene_fasta(
path_to_FASTA,
organism_name,
conversion_table
)
remap_accessions_uniprot_to_gene_fasta(path_to_FASTA)
Arguments
- path_to_FASTA
(character) path to FASTA file.
- organism_name
(character) scientific name of organism (e.g., "Homo sapiens", "Rattus norvegicus", "Mus musculus", etc.). Not required if conversion_table is supplied.
- conversion_table
(data.frame) optional data frame with two columns: REFSEQ or UNIPROT and SYMBOL (in that order). If not provided, it will be created with fetch_conversion_table.
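As a rough sketch of what a user-supplied conversion_table might look like for the RefSeq variant, the data frame below follows the two-column layout described above. The accessions and symbols are hypothetical placeholders, not real mappings.

```r
# Hypothetical RefSeq-to-symbol mapping; values are illustrative only
conversion_table <- data.frame(
  REFSEQ = c("NP_000001.1", "NP_000002.1"),
  SYMBOL = c("GeneA", "GeneB"),
  stringsAsFactors = FALSE
)

# Column order matters: accession column first, SYMBOL second
stopifnot(identical(colnames(conversion_table), c("REFSEQ", "SYMBOL")))

# With a table supplied, organism_name can be omitted
# (not run here; requires a real FASTA file):
# remap_accessions_refseq_to_gene_fasta(
#   path_to_FASTA,
#   conversion_table = conversion_table
# )
```

Supplying the table up front may be useful when the mapping has already been prepared or needs to be held fixed across runs.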
Examples
if (FALSE) {
  path_to_FASTA <- system.file(
    "extdata/Rattus_norvegicus_NCBI_RefSeq_2018-04-10.fasta.gz",
    package = "PlexedPiperTestData"
  )
  temp_work_dir <- tempdir() # can be set to "." or getwd(), if done carefully
  file.copy(path_to_FASTA, temp_work_dir)
  path_to_FASTA <- file.path(temp_work_dir, basename(path_to_FASTA))
  library(Biostrings)
  readAAStringSet(path_to_FASTA) # RefSeq IDs
  path_to_new_FASTA <- remap_accessions_refseq_to_gene_fasta(
    path_to_FASTA, organism_name = "Rattus norvegicus"
  )
  readAAStringSet(path_to_new_FASTA) # gene symbols
}