4.3 Conversion Using FASTA Headers

When converting specifically to gene symbols, we recommend using the information in the headers of the FASTA file that was used for the database search. In UniProt-style headers, the gene symbol is given by GN=... (the field is omitted for entries without an assigned gene name), so a regular expression can extract it. For UniProt FASTA files, the MSnID function parse_FASTA_names extracts the components of the FASTA headers and returns them as a data.frame.

## Read FASTA file
fst_path <- system.file("extdata/uniprot_rat_small.fasta.gz",
                        package = "MSnID")
conv_tbl3 <- parse_FASTA_names(path_to_FASTA = fst_path)
head(conv_tbl3)
##               feature database uniprot_acc isoform entry_name
## 1  sp|P63088|PP1G_RAT       sp      P63088      NA   PP1G_RAT
## 2 sp|Q4FZV7|TMUB2_RAT       sp      Q4FZV7      NA  TMUB2_RAT
## 3 sp|O55159|EPCAM_RAT       sp      O55159      NA  EPCAM_RAT
## 4 sp|Q80VJ4|GPCP1_RAT       sp      Q80VJ4      NA  GPCP1_RAT
## 5 sp|Q66MI6|T10IP_RAT       sp      Q66MI6      NA  T10IP_RAT
## 6 sp|O70453|HMOX3_RAT       sp      O70453      NA  HMOX3_RAT
##                                                        description
## 1 Serine/threonine-protein phosphatase PP1-gamma catalytic subunit
## 2     Transmembrane and ubiquitin-like domain-containing protein 2
## 3                                Epithelial cell adhesion molecule
## 4                   Glycerophosphocholine phosphodiesterase GPCPD1
## 5                   Testis-specific protein 10-interacting protein
## 6                                        Putative heme oxygenase 3
##            organism organism_id     gene protein_existence sequence_version
## 1 Rattus norvegicus       10116   Ppp1cc                 1                1
## 2 Rattus norvegicus       10116    Tmub2                 2                1
## 3 Rattus norvegicus       10116    Epcam                 1                1
## 4 Rattus norvegicus       10116   Gpcpd1                 1                1
## 5 Rattus norvegicus       10116 Tsga10ip                 2                2
## 6 Rattus norvegicus       10116    Hmox3                 5                1
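
The regular-expression approach mentioned above can be sketched in base R for cases where only the gene symbol is needed. This is a minimal illustration, not part of MSnID: the first header is reconstructed from the first row of the table above, the second is a hypothetical entry with no GN= field, and the pattern assumes gene symbols contain no whitespace.

```r
## Example headers in UniProt format (leading ">" removed).
## The second entry is hypothetical and has no GN= field.
headers <- c(
  paste("sp|P63088|PP1G_RAT Serine/threonine-protein phosphatase",
        "PP1-gamma catalytic subunit OS=Rattus norvegicus OX=10116",
        "GN=Ppp1cc PE=1 SV=1"),
  paste("sp|X00000|FAKE_RAT Uncharacterized protein",
        "OS=Rattus norvegicus OX=10116 PE=4 SV=1")
)

## Match the run of non-space characters that follows "GN="
m <- regexpr("(?<=GN=)\\S+", headers, perl = TRUE)

## Start from NA so that headers without a GN= field stay missing
genes <- rep(NA_character_, length(headers))
genes[m != -1] <- regmatches(headers, m)
genes
```

Headers without a GN= field yield NA rather than an error, so the result stays aligned with the input vector and can be added to a conversion table as a new column.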