These functions remap the sequence IDs in a FASTA file from RefSeq IDs or UniProt accessions to gene symbols. If a single gene matches more than one protein sequence, only the longest sequence is retained.
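
The "longest sequence per gene" rule can be illustrated with a small base R sketch. This is not the package's internal implementation. The data frame `seqs`, its columns, and the sequences are hypothetical toy values, and how the functions break ties between equally long sequences is not stated here.

```r
# Hypothetical toy data: three protein sequences mapped to two gene symbols.
seqs <- data.frame(
  SYMBOL = c("Actb", "Actb", "Gapdh"),
  sequence = c("MDDDIAALVV", "MDDDIA", "MVKVGVNGFG"),
  stringsAsFactors = FALSE
)

# Sequence length in residues.
seqs$width <- nchar(seqs$sequence)

# Sort by gene symbol, then by decreasing length, so that the longest
# sequence for each gene comes first.
seqs <- seqs[order(seqs$SYMBOL, -seqs$width), ]

# Keep only the first (longest) row for each gene symbol.
longest <- seqs[!duplicated(seqs$SYMBOL), ]
longest
```

Here the shorter of the two "Actb" entries is dropped, leaving one row per gene symbol.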

Usage

remap_accessions_refseq_to_gene_fasta(
  path_to_FASTA,
  organism_name,
  conversion_table
)

remap_accessions_uniprot_to_gene_fasta(path_to_FASTA)

Arguments

path_to_FASTA

(character) path to FASTA file.

organism_name

(character) scientific name of the organism (e.g. "Homo sapiens", "Rattus norvegicus", "Mus musculus"). Not required if conversion_table is supplied.

conversion_table

(data.frame) optional data frame with two columns, in this order: the accession column (REFSEQ or UNIPROT) and SYMBOL. If not provided, it will be created with fetch_conversion_table.
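
A user-supplied conversion table might look like the sketch below. The accessions and symbols are made-up placeholders, not real RefSeq records, and whether accession version suffixes are expected is not stated here. The call is wrapped in `if (FALSE)` because it needs the package and a real FASTA file in `path_to_FASTA`.

```r
# Hypothetical two-column table: accession column first, then SYMBOL.
conversion_table <- data.frame(
  REFSEQ = c("NP_000001", "NP_000002"), # placeholder accessions
  SYMBOL = c("GeneA", "GeneB"),         # placeholder gene symbols
  stringsAsFactors = FALSE
)

if (FALSE) {
# organism_name can be omitted because conversion_table is supplied.
path_to_new_FASTA <- remap_accessions_refseq_to_gene_fasta(
  path_to_FASTA,
  conversion_table = conversion_table
)
}
```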

Value

Path to remapped FASTA file.

Examples

if (FALSE) {
# Locate the example RefSeq FASTA file shipped with PlexedPiperTestData
path_to_FASTA <- system.file(
  "extdata/Rattus_norvegicus_NCBI_RefSeq_2018-04-10.fasta.gz",
  package = "PlexedPiperTestData"
)
# Work on a copy of the file in a temporary directory
temp_work_dir <- tempdir() # can be set to "." or getwd(), if done carefully
file.copy(path_to_FASTA, temp_work_dir)
path_to_FASTA <- file.path(temp_work_dir, basename(path_to_FASTA))
library(Biostrings)
readAAStringSet(path_to_FASTA) # RefSeq IDs
# Remap RefSeq IDs to gene symbols; the path to the new FASTA file is returned
path_to_new_FASTA <- remap_accessions_refseq_to_gene_fasta(
  path_to_FASTA, organism_name = "Rattus norvegicus"
)
readAAStringSet(path_to_new_FASTA) # gene symbols
}