4.3 Conversion Using FASTA Headers
When converting specifically to gene symbols, we recommend using the information in the headers of the FASTA file that was used for the database search. In UniProt headers, the gene symbol is given by the GN=... field, so we can use a regular expression to extract it. For UniProt FASTA files, the MSnID function parse_FASTA_names extracts the components of the FASTA headers and returns them as a data.frame.
## Read FASTA file
fst_path <- system.file("extdata/uniprot_rat_small.fasta.gz",
                        package = "MSnID")
conv_tbl3 <- parse_FASTA_names(path_to_FASTA = fst_path)
head(conv_tbl3)
## feature database uniprot_acc isoform entry_name
## 1 sp|P63088|PP1G_RAT sp P63088 NA PP1G_RAT
## 2 sp|Q4FZV7|TMUB2_RAT sp Q4FZV7 NA TMUB2_RAT
## 3 sp|O55159|EPCAM_RAT sp O55159 NA EPCAM_RAT
## 4 sp|Q80VJ4|GPCP1_RAT sp Q80VJ4 NA GPCP1_RAT
## 5 sp|Q66MI6|T10IP_RAT sp Q66MI6 NA T10IP_RAT
## 6 sp|O70453|HMOX3_RAT sp O70453 NA HMOX3_RAT
## description
## 1 Serine/threonine-protein phosphatase PP1-gamma catalytic subunit
## 2 Transmembrane and ubiquitin-like domain-containing protein 2
## 3 Epithelial cell adhesion molecule
## 4 Glycerophosphocholine phosphodiesterase GPCPD1
## 5 Testis-specific protein 10-interacting protein
## 6 Putative heme oxygenase 3
## organism organism_id gene protein_existence sequence_version
## 1 Rattus norvegicus 10116 Ppp1cc 1 1
## 2 Rattus norvegicus 10116 Tmub2 2 1
## 3 Rattus norvegicus 10116 Epcam 1 1
## 4 Rattus norvegicus 10116 Gpcpd1 1 1
## 5 Rattus norvegicus 10116 Tsga10ip 2 2
## 6 Rattus norvegicus 10116 Hmox3 5 1
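
The regular-expression approach mentioned at the start of this section can be sketched in base R as follows. This is a minimal illustration, not how parse_FASTA_names is implemented. The first header is rebuilt from the fields shown in the output above; the second (accession X00000) is a hypothetical entry with no GN= field, included to show that such headers should yield NA rather than a wrong symbol. The object names headers, has_gn, gene, acc, and gene_tbl are placeholders.

```r
## Example headers in UniProt format.
## The second entry is hypothetical and has no GN= field.
headers <- c(
  paste(">sp|P63088|PP1G_RAT Serine/threonine-protein phosphatase",
        "PP1-gamma catalytic subunit OS=Rattus norvegicus OX=10116",
        "GN=Ppp1cc PE=1 SV=1"),
  paste(">sp|X00000|FAKE_RAT Hypothetical protein",
        "OS=Rattus norvegicus OX=10116 PE=4 SV=1")
)

## Flag headers that contain a gene field
has_gn <- grepl(" GN=", headers, fixed = TRUE)

## Capture the non-whitespace text that follows "GN=";
## headers without the field get NA
gene <- ifelse(has_gn,
               sub(".* GN=(\\S+).*", "\\1", headers, perl = TRUE),
               NA_character_)

## Capture the accession between the first and second "|"
acc <- sub("^>?[a-z]{2}\\|([^|]+)\\|.*", "\\1", headers, perl = TRUE)

## Two-column conversion table
gene_tbl <- data.frame(uniprot_acc = acc, gene = gene,
                       stringsAsFactors = FALSE)
gene_tbl
```

Checking for the field before substituting matters because sub returns its input unchanged when the pattern does not match, which would otherwise place the whole header in the gene column.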