Workshop on Molecular Evolution  
Centers for Disease Control and Prevention


File Formats

This page serves as a simple reference on file formats commonly used by sequence analysis programs. Each entry is accompanied by properly formatted DNA and amino acid sequences where appropriate.

ASN.1

Abstract Syntax Notation 1 form, the computer-readable form of the data used by NCBI. All database entries are available from Entrez in this format.

EMBL UniProt

EMBL is the nucleotide database of EBI. UniProt is the collaborative amino acid database of EBI, SIB and PIR.

FASTA

The definition line and sequence character format used by NCBI. All database entries from Entrez are available in this format. A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. It is recommended that all lines of text be shorter than 80 characters to facilitate viewing and editing.

Sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid codes, with these exceptions: lower-case letters are accepted and are mapped into upper-case; a single hyphen or dash can be used to represent a gap of indeterminate length; and in amino acid sequences, U and * are acceptable letters. Before submitting a request, any numerical digits in the query sequence should either be removed or replaced by appropriate letter codes (e.g., N for unknown nucleic acid residue or X for unknown amino acid residue).
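The rules above can be sketched as a small reader. This is a minimal illustration, not a reference parser: the function name parse_fasta and the example records are hypothetical, and the sketch only handles the description line, the mapping of lower-case letters to upper-case, and multi-line sequence data. It does not validate residues against the IUB/IUPAC codes or remove digits.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) tuples from FASTA text.

    Hypothetical helper for illustration only.
    """
    records = []
    description = None
    chunks = []
    for raw_line in text.splitlines():
        line = raw_line.rstrip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A ">" in the first column starts a new record.
            if description is not None:
                records.append((description, "".join(chunks)))
            description = line[1:].strip()
            chunks = []
        elif description is not None:
            # Sequence data: drop internal whitespace, map to upper-case.
            chunks.append("".join(line.split()).upper())
    if description is not None:
        records.append((description, "".join(chunks)))
    return records


# Made-up example data; the hyphen stands for a gap.
example = """>seq1 hypothetical example
acgt-ac
GTNN
>seq2
ACGT
"""
records = parse_fasta(example)
```

Each record comes back as a description string and a single upper-case sequence string, with the line breaks of the input removed.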

GCG

GCG-MSF format is recognised by one of the following:

  • the word PileUp at the start of the file.
  • the word !!AA_MULTIPLE_ALIGNMENT or !!NA_MULTIPLE_ALIGNMENT at the start of the file.
  • the word MSF on the first line of the file, and the characters .. at the end of this line.

These file names usually have a .msf extension.

GCG-RSF format is recognised by the word !!RICH_SEQUENCE at the beginning of the file. These files have a .rsf extension.
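The recognition rules for the two GCG formats can be sketched as a simple check on the first line of a file. This is an illustrative sketch only: the function name detect_gcg_format is hypothetical, the sketch assumes that the characters ending the MSF line are the two dots (..) that close a GCG header line, and it treats the first non-blank line as the start of the file.

```python
def detect_gcg_format(text):
    """Return "MSF", "RSF", or None for the given file contents.

    Hypothetical helper; applies only the recognition rules listed above.
    """
    lines = text.lstrip().splitlines()
    if not lines:
        return None
    first = lines[0].strip()
    if first.startswith("!!RICH_SEQUENCE"):
        return "RSF"
    if first.startswith("PileUp"):
        return "MSF"
    if first.startswith(("!!AA_MULTIPLE_ALIGNMENT", "!!NA_MULTIPLE_ALIGNMENT")):
        return "MSF"
    # Assumption: the line containing the word MSF ends with two dots.
    if "MSF" in first and first.endswith(".."):
        return "MSF"
    return None
```

A check like this only identifies the format from its first line; it does not confirm that the rest of the file is well formed.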

GenBank/GenPept

The nucleotide (GenBank) and protein (GenPept) database entries are available from Entrez in this format.

NEXUS

This is the file format used by many popular programs, such as GARLI, GDA, MacClade, Mesquite, ModelTest, MrBayes and PAUP*. NEXUS file names often have a .nxs or .nex extension.

A formal description of the NEXUS format can be found in Maddison et al. (1997).

To convert an interleaved NEXUS file to a non-interleaved NEXUS file, execute the file in PAUP* and export it as a non-interleaved NEXUS file. You can also type the command:

export file=yourfile.nex format=nexus interleaved=no;

PHYLIP

The PHYLIP format comes from Joe Felsenstein's phylogeny inference package and is now used by several phylogenetics programs. PHYLIP file names often have a .phy or .ph extension.

NBRF and PIR

The National Biomedical Research Foundation (NBRF) maintains nucleotide and protein sequence databases. PIR file names often have a .pir extension. The header line of a nucleotide sequence file in PIR format begins with a greater-than sign (">") followed by DL.

The Protein Information Resource (PIR) is an annotated, non-redundant and cross-referenced database of protein sequences at the NBRF. The header line of a protein sequence file in PIR format begins with a greater-than sign (">") followed by P1.
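The two header conventions can be told apart with a short check. This is a minimal sketch: the function name pir_sequence_type and the identifiers in the usage lines are hypothetical, and only the DL and P1 codes described here are recognised.

```python
def pir_sequence_type(header_line):
    """Classify a PIR header line as "nucleotide", "protein", or None.

    Hypothetical helper; knows only the DL and P1 codes.
    """
    if header_line.startswith(">DL"):
        return "nucleotide"
    if header_line.startswith(">P1"):
        return "protein"
    return None


# Made-up header lines for illustration.
dna_kind = pir_sequence_type(">DL;EXAMPLE_DNA")
protein_kind = pir_sequence_type(">P1;EXAMPLE_PROTEIN")
```

A FASTA description line also begins with a greater-than sign, so a header that matches neither code returns None rather than being guessed at.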


Maintained by Adam Bazinet
Direct questions and comments to Michael Cummings