29 Nov 2009, admin

File formats

This page serves as a simple reference on file formats commonly used by sequence analysis programs. Each entry is accompanied by properly formatted DNA and amino acid sequences where appropriate.

ASN.1

Abstract Syntax Notation One (ASN.1) is the computer-readable form of the data used by NCBI. All database entries are available from Entrez in this format.

EMBL UniProt

EMBL is the nucleotide sequence database of EBI. UniProt is the collaborative protein sequence database of EBI, SIB and PIR.

FASTA

The definition line and sequence character format used by NCBI. All database entries from Entrez are available in this format. A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. It is recommended that all lines of text be shorter than 80 characters to facilitate viewing and editing.
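The layout described above (a ">" description line followed by sequence lines kept under 80 characters) can be sketched in a few lines of Python. This is a minimal illustration, not a reference parser: the function names `parse_fasta` and `format_fasta` and the default line width of 70 are assumptions made for this example.

```python
from typing import Iterator, List, Optional, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (description, sequence) pairs from FASTA-formatted text."""
    description: Optional[str] = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if description is not None:
                yield description, "".join(chunks)
            description = line[1:].strip()
            chunks = []
        elif description is not None:
            chunks.append(line)
    if description is not None:
        yield description, "".join(chunks)


def format_fasta(description: str, sequence: str, width: int = 70) -> str:
    """Write one record, wrapping sequence lines at `width` characters."""
    lines = [">" + description]
    for start in range(0, len(sequence), width):
        lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"
```

For example, `list(parse_fasta(">seq1 test\nACGT\nACGT\n"))` joins the two sequence lines into a single record for `seq1 test`.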

Sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid codes, with these exceptions: lower-case letters are accepted and are mapped into upper-case; a single hyphen or dash can be used to represent a gap of indeterminate length; and in amino acid sequences, U (selenocysteine) and * (translation stop) are acceptable letters. Before submitting a request, any numerical digits in the query sequence should either be removed or replaced by appropriate letter codes (e.g., N for unknown nucleic acid residue or X for unknown amino acid residue).
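The clean-up rules in this paragraph can be expressed as a small helper. The sketch below upper-cases the input, drops digits and whitespace, and rejects any remaining character outside the expected alphabet. The function name `clean_sequence` and the exact contents of the two alphabet sets are assumptions for illustration; check them against the code table of the program you are submitting to.

```python
# Assumed alphabets: IUPAC nucleic acid codes plus the gap character, and
# IUPAC amino acid codes plus U, * and the gap character.
NUCLEOTIDE_CODES = set("ACGTURYKMSWBDHVN-")
AMINO_ACID_CODES = set("ABCDEFGHIKLMNPQRSTUVWXYZ*-")


def clean_sequence(raw: str, molecule: str = "dna") -> str:
    """Return an upper-case sequence with digits and whitespace removed.

    `molecule` is "dna" for nucleic acids; any other value selects the
    amino acid alphabet. Raises ValueError on unexpected characters.
    """
    allowed = NUCLEOTIDE_CODES if molecule == "dna" else AMINO_ACID_CODES
    cleaned = "".join(
        ch for ch in raw.upper() if not ch.isdigit() and not ch.isspace()
    )
    unexpected = sorted(set(cleaned) - allowed)
    if unexpected:
        raise ValueError("unexpected characters: " + "".join(unexpected))
    return cleaned
```

Here digits are removed rather than replaced, which suits position numbers copied along with a sequence; replacing them with N or X would be the alternative the text mentions.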

GenBank/GenPept

The nucleotide (GenBank) and protein (GenPept) database entries are available from Entrez in this format.

NEXUS

This is the file format used by many popular programs like GARLI, GDA, MacClade, Mesquite, ModelTest, MrBayes and PAUP*. Nexus file names often have a .nxs or .nex extension.

A formal description of the NEXUS format can be found in Maddison et al. (1997).

To convert an interleaved NEXUS file to a non-interleaved NEXUS file, execute the file in PAUP* and export it as a non-interleaved NEXUS file. You can also type the command:

export file=yourfile.nex format=nexus interleaved=no;

PHYLIP

The PHYLIP format comes from Joe Felsenstein's phylogeny inference package and is now used by several phylogenetics programs. PHYLIP file names often have a .phy or .ph extension.

NBRF and PIR

The National Biomedical Research Foundation (NBRF) maintains nucleotide and protein sequence databases. PIR file names often have a .pir extension. The header line of a nucleotide sequence file in PIR format begins with a greater-than sign (">") followed by DL.

The Protein Information Resource (PIR) is an annotated, non-redundant and cross-referenced database of protein sequences at the NBRF. The header line of a protein sequence file in PIR format begins with a greater-than sign (">") followed by P1.
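The two header prefixes described above are enough to tell nucleotide and protein entries apart. The following sketch reads the two-character code after ">" and maps it to a sequence type. The function name `pir_sequence_type` and the example headers (including the semicolon and identifier after the code) are assumptions for illustration, and only the DL and P1 codes named on this page are handled.

```python
# Only the two type codes mentioned in the text are covered here.
PIR_TYPE_CODES = {"DL": "nucleotide", "P1": "protein"}


def pir_sequence_type(header: str) -> str:
    """Classify a PIR/NBRF header line as 'nucleotide' or 'protein'."""
    if not header.startswith(">") or len(header) < 3:
        raise ValueError("not a PIR header line: " + repr(header))
    code = header[1:3].upper()
    if code not in PIR_TYPE_CODES:
        raise ValueError("unsupported PIR type code: " + code)
    return PIR_TYPE_CODES[code]
```

A header such as `>P1;example_id` would be classified as protein, and `>DL;example_id` as nucleotide.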

Sequence Alignment/Map (SAM) and BAM Format

The SAM format is a generic nucleotide alignment format that describes the alignment of query sequences or sequencing reads to a reference sequence or assembly. SAM is a tab-delimited text format. It is easy to understand, easy to parse, and easy to generate and check for errors. However, SAM is relatively slow to parse. BAM, a binary equivalent of SAM, was developed to address this and is used for intensive data processing. BAM is useful in most production pipelines, while SAM may be useful for interconversion with external applications and for exploratory analysis. A full description of SAM/BAM can be found in the SAM format specification (PDF).
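Because SAM is tab-delimited text, a single alignment line can be split with ordinary string handling. The sketch below assumes the eleven mandatory columns defined in the SAM specification (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL) and treats lines starting with "@" as header lines. The names `SamRecord` and `parse_sam_line` are hypothetical, and optional tags are kept as raw strings rather than decoded.

```python
from typing import List, NamedTuple, Optional


class SamRecord(NamedTuple):
    qname: str   # query (read) name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # CIGAR string
    rnext: str   # reference name of the mate/next read
    pnext: int   # position of the mate/next read
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # base qualities
    tags: List[str]  # optional fields, left undecoded


def parse_sam_line(line: str) -> Optional[SamRecord]:
    """Parse one SAM line; return None for header or empty lines."""
    line = line.rstrip("\r\n")
    if not line or line.startswith("@"):
        return None
    fields = line.split("\t")
    if len(fields) < 11:
        raise ValueError("expected at least 11 tab-separated fields")
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]),
        int(fields[4]), fields[5], fields[6], int(fields[7]),
        int(fields[8]), fields[9], fields[10], fields[11:],
    )
```

This kind of line-by-line parsing is fine for exploratory work on small files; for BAM files or large data sets a dedicated library or tool is the more practical choice.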