multiple sequence alignment activity

expected learning outcomes

In this activity you will work with a partner to learn some features of several multiple alignment and alignment visualization programs, including data input and output, basic functions, alignment options, and differences between nucleotide and amino acid alignments. Although a large number of alignment programs have been developed, we will focus on two of them: ClustalW2 and MAFFT. ClustalW2 (and its graphical user interface version, ClustalX2) is the latest version of a very popular alignment program, and MAFFT is by several measures one of the best-performing alignment programs according to BAliBASE benchmark tests. For visualization we will use SeaView, from which you can also run alignment programs. We will use the data set from the NCBI activity and two additional data sets.

instructions for integrating SeaView and MAFFT on Mac

If you have a Mac, please download the following file: fixmafft.command

Then simply browse to the file and double-click on it.

After doing this you will need to reboot your computer.

getting started

  1. Download these data sets from the course web site, making note of where you put them:
    1. Atg5.fasta: file you should have obtained at the end of the NCBI activity.
    2. 1ped.fasta: nucleotide sequences of alcohol dehydrogenase from a variety of organisms; modified from BAliBASE.
    3. bacRNApol.fasta: amino acid sequences of the RNA polymerase subunit beta from several bacteria and mitochondria.
  2. Start Seaview:
    1. Mac users: Double-click on the SeaView icon in the Molecular Evolution Apps folder.
    2. Linux users: type seaview in the terminal window.
  3. Add MAFFT to the list of alignment programs in SeaView (ClustalW2 is installed by default):
    1. In SeaView, select Align > Alignment options > Add external method.
    2. Click on Select External Program.
    3. In the Name field, write /your/path/to/mafft. On Ubuntu Linux, this will be /usr/local/bin/mafft. On Mac OS X, this will be /Users/[your-username]/molevol/bin/mafft
    4. Type --auto %f.pir > %f.out in the Arguments field. Click OK. The auto option lets MAFFT determine the best sequence alignment algorithm for the given data set.

exercise 1: basic functions in SeaView

In these exercises you will be working in pairs. Each member of the pair will be either A or B (identify yourselves). Each person will follow only those steps listed under A or B. The two groups will use different alignment options and parameters for the same data set which will enable you to see how they affect the alignment.

Note that SeaView lacks an undo function. Do not worry if you mess up the data set from the NCBI activity in Exercise 1; we will not need it again. However, if you make mistakes in Exercises 2, 3 or 5, please close the file without saving and repeat the last steps. We recommend saving often.

Both A and B:

  1. Open SeaView.
  2. Go to File > Open and open the data set from the NCBI activity, Atg5.fasta. Have a look at the data. Is it aligned?
  3. Try some of the basic commands:
    1. To select a taxon, click on any taxon name in the left panel.
    2. To select all sequences at once, click Edit > Select all or type Command-A (on Macs) or Control-A (on Linux).
    3. To deselect all sequences at once, shift-click on the left panel (on Linux/Ubuntu: shift-click, then drag the cursor up and down to deselect).
    4. To select multiple taxa, drag through a range of them.
    5. To move selected sequences to another point in the data set, highlight a taxon, go to where you want to move the taxon, and hold down the control key and click (This function cannot be performed in Linux/Ubuntu).
    6. Use the < and > keys to move your view frame 50 characters left or right, or the [ and ] keys to move your view frame by 5 characters.
  4. Sequences can be edited manually:
    1. Go to any position in the data set, and click the spacebar. The spacebar allows you to insert a gap at that particular site.
    2. The backspace/delete key will remove gaps to the left of the cursor.
    3. To insert a particular base or amino acid, you will need to turn on the option Props > allow seq. edition and then type the appropriate letter at any position in the data set. The base will be inserted to the right of the cursor.
    4. To add a gap to all sequences but one, click anywhere in the sequence that should stay unchanged and press the + key. The _ (underscore) key will remove a gap from all sequences.
    5. If you want to edit multiple bases at once, type a number before typing the command. For example, type 100 and press the delete key to delete 100 bases to the left of the cursor. If you would like to delete bases in all sequences at once, select all before executing the command. Again, note that the option Props > allow seq. edition must be activated.
  5. Close the file with or without saving.
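
The question in step 2 ("Is it aligned?") can also be checked outside SeaView. The following is a minimal sketch, not part of SeaView: it assumes a plain FASTA file and uses two hypothetical helper names, read_fasta and looks_aligned. Equal sequence lengths are a necessary sign of an alignment, not proof of a good one.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (name, sequence) pairs."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def looks_aligned(records):
    """Return True if all sequences have the same length (gaps included)."""
    lengths = {len(seq) for _, seq in records}
    return len(lengths) <= 1


# Toy data for illustration only; this is not taken from Atg5.fasta.
example = """>seq1
ATGGCTA
>seq2
ATGGC
"""
print(looks_aligned(read_fasta(example)))  # prints False
```

To try it on your own file, read the file contents into a string (for example with open("Atg5.fasta").read()) and pass that string to read_fasta.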

exercise 2: comparison of two different alignment programs (ClustalW2 and MAFFT) using nucleotide sequences

Both A and B:

  1. Open SeaView.
  2. Open data set 1ped.fasta in SeaView.
A only:
  1. Make ClustalW2 the default alignment algorithm by clicking Align > Alignment options > clustalw2.
  2. Perform a basic alignment with ClustalW2 by clicking Align > align all.
  3. Once the alignment process is completed, an OK will appear on the bottom right of the pop-up window. Click OK.
B only:
  1. Switch to MAFFT as default alignment algorithm by clicking Align > Alignment options > /your/path/to/mafft
  2. Perform a basic alignment with MAFFT by clicking Align > align all.
  3. Once the alignment process is completed, an OK will appear on the bottom right of the pop-up window. Click OK.
Both A and B:
  1. Compare the times needed for completion of the alignment. Is MAFFT faster than ClustalW2? (When doing this comparison, take into account that the speed of your computers may differ.)
  2. Compare the alignments of group A and B. Are they different? Which one do you prefer, the MAFFT or the ClustalW2 alignment? Why? (Hint: these are protein coding genes.)
  3. Export the nucleotide alignment in NEXUS format by clicking File > Save. Choose a new filename, and NEXUS as format in the opening dialog. Do not close the alignment window.
  4. Click Props > View as proteins. (Note that these sequences have been edited so that all of them begin with the first base of a triplet codon. Thus, by translating them to proteins, they are in the right translation frame.)
  5. Compare the alignments of group A and B again. Which one is better? What criteria are you using? What do the "X"s and asterisks in one of the alignments mean?
  6. Make sure Props > View as proteins is still checked. Export the amino acid alignment in NEXUS format by clicking File > Save prot alignment. Choose a new filename and select NEXUS as the format in the dialog that opens.
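
When comparing the ClustalW2 and MAFFT alignments, a few simple numbers can complement visual inspection. The sketch below is one possible way to summarize an alignment; the function name alignment_summary and the toy sequences are hypothetical, and the chosen statistics (alignment length, gap characters, fully conserved columns) are only a starting point, not a measure of which alignment is correct.

```python
def alignment_summary(seqs):
    """Return basic statistics for a list of equal-length aligned sequences."""
    if not seqs:
        raise ValueError("no sequences given")
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences differ in length; is this an alignment?")
    # Total number of gap characters over all sequences.
    gaps = sum(s.count("-") for s in seqs)
    # Columns without gaps in which every sequence has the same character.
    conserved = sum(
        1 for col in zip(*seqs)
        if "-" not in col and len({c.upper() for c in col}) == 1
    )
    return {"columns": length, "gap_characters": gaps, "conserved_columns": conserved}


# Two toy alignments of the same three sequences (illustration only).
aln_a = ["ATG-CA", "ATGGCA", "ATG-CT"]
aln_b = ["ATGC-A", "ATGGCA", "ATGC-T"]
print(alignment_summary(aln_a))
print(alignment_summary(aln_b))
```

In this toy case both alignments have the same length and number of gaps, but they differ in the number of fully conserved columns, which shows how gap placement alone changes the picture.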

exercise 3: comparison of alignments of nucleotide and protein sequences

A only:

  1. Switch back to nucleotide view by unchecking Props > View as proteins.
  2. Click Edit > Select All
  3. Click Align > De-align selection and confirm Remove gaps in the pop-up window to undo the alignment.
  4. Click Props > View as proteins
  5. To switch to MAFFT as the default alignment program, click Align > Alignment options > /your/path/to/mafft
  6. Click Align > Align all
  7. Click OK in the pop-up window once the alignment is completed.
  8. Uncheck Props > View as proteins to revert to nucleotide view (otherwise you cannot save the alignment as nucleotides).
  9. Save the alignment file (click File > Save) and close SeaView.
B only:
  1. Sit back and relax, but leave your SeaView window with the results of Exercise 2 open until group A has finished.
Both A and B:
  1. Compare both alignments. Which one do you prefer? Does it make sense to align protein-coding sequences using the protein translation, or should you instead build alignments from nucleotide sequences?
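
Aligning in protein view and then reverting to nucleotide view amounts to placing each codon under its aligned amino acid, so that gaps always come in multiples of three. The sketch below illustrates that idea under stated assumptions: the function name thread_codons is hypothetical, the nucleotide sequence is assumed to be unaligned, in frame, and a multiple of three in length, and this is not SeaView's actual code.

```python
def thread_codons(aligned_protein, nucleotides):
    """Place the codons of an in-frame nucleotide sequence under an aligned protein."""
    if len(nucleotides) % 3 != 0:
        raise ValueError("nucleotide length is not a multiple of three")
    codons = [nucleotides[i:i + 3] for i in range(0, len(nucleotides), 3)]
    residues = sum(1 for aa in aligned_protein if aa != "-")
    if residues != len(codons):
        raise ValueError("protein and nucleotide sequences do not match in length")
    out, next_codon = [], iter(codons)
    for aa in aligned_protein:
        # A gap in the protein becomes a three-base gap; a residue takes the next codon.
        out.append("---" if aa == "-" else next(next_codon))
    return "".join(out)


print(thread_codons("M-AK", "ATGGCTAAA"))  # prints ATG---GCTAAA
```

Because gaps are inserted codon by codon, the reading frame is preserved, which is usually not guaranteed when the nucleotide sequences are aligned directly.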

exercise 4: exploring the MAFFT settings

We will now run MAFFT from the command line, where we can change its settings more easily than within SeaView. We will first run MAFFT in interactive mode, and then by passing arguments to the program directly when we execute it.

Both A and B:

  1. Open the data set bacRNApol.fasta in SeaView to see how it looks unaligned.
  2. Close SeaView and open the Terminal.
  3. In the Terminal, navigate to the location where you saved the file bacRNApol.fasta by typing cd, followed by a space and the path to the file.
  4. Start MAFFT on the command line by typing mafft. This will open MAFFT in interactive mode.
  5. When MAFFT asks for an input file, type bacRNApol.fasta.
  6. Type the name of the output file (e.g., bacRNApol_mft.fasta) and press Enter.
  7. You will be asked which strategy to use; the default is labeled FFT-NS-2. Type 3 and press Enter to select this default. This is followed by a number of other parameters/arguments. Confirm the default values by hitting Enter. Hit Enter again to start the alignment process.
  8. Once the alignment process has finished, examine the alignment in SeaView and leave it open.
  9. Return to the Terminal window. You should already be located in a folder containing your input file. We will now pass arguments directly to MAFFT, and we are going to change gap penalties.
A only:
  1. Type mafft --auto --op 20 bacRNApol.fasta > bacRNApol_mft20.fasta. This setting will run MAFFT using the auto mode, but incurring a very high penalty for opening gaps inside the alignment (--op 20 vs. default value 1.53).
B only:
  1. Type mafft --auto --op 0.1 bacRNApol.fasta > bacRNApol_mft01.fasta. This setting will run MAFFT using the auto mode, but incurring a very low penalty for opening gaps inside the alignment (--op 0.1 vs. default value 1.53).
Both A and B:
  1. Use SeaView to open the new alignment (File > Open and browse for bacRNApol_mft20.fasta if you are A, or bacRNApol_mft01.fasta if you are B) without closing the alignment based on default parameters. Compare this alignment to the default one, and compare the alignments with increased (A) and decreased (B) gap penalties to each other. Which one of the three alignments do you prefer, and why?
  2. Optional: Try to set and combine other gap parameters (ep, lep, lop... see the MAFFT manual for details) and compare results.
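
If you want to try several gap penalties in one go, the commands above can be generated from a short script. This is a sketch only: the helper names mafft_command and run_mafft and the output file names are hypothetical, and it uses just the --auto, --op and --ep options mentioned in this exercise. It assumes mafft is on your PATH and that bacRNApol.fasta is in the current folder.

```python
import shutil
import subprocess


def mafft_command(infile, op=1.53, ep=None):
    """Build a MAFFT argument list using --auto, --op and (optionally) --ep."""
    cmd = ["mafft", "--auto", "--op", str(op)]
    if ep is not None:
        cmd += ["--ep", str(ep)]
    cmd.append(infile)
    return cmd


def run_mafft(infile, outfile, **penalties):
    """Run MAFFT and write the alignment it prints on standard output to outfile."""
    if shutil.which("mafft") is None:
        raise RuntimeError("mafft was not found on the PATH")
    with open(outfile, "w") as handle:
        subprocess.run(mafft_command(infile, **penalties), stdout=handle, check=True)


# Print the commands for a low, the default, and a high gap opening penalty.
for op in (0.1, 1.53, 20):
    outfile = f"bacRNApol_op{op}.fasta"  # hypothetical output file names
    print(" ".join(mafft_command("bacRNApol.fasta", op=op)), ">", outfile)
```

The loop only prints the commands. To run them, call run_mafft("bacRNApol.fasta", outfile, op=op) inside the loop, then open the resulting files in SeaView as before.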

exercise 5: regional alignments

Regional alignment refers to locally aligning a particular region of a sequence that is poorly aligned. In SeaView, regional alignment is done through the Sites menu.

Both A and B:

  1. Open SeaView again, if it is not already open.
  2. Open the aligned nucleotide data set 1ped.fasta again (you will have saved it with a new name).
  3. Revert to protein view by clicking Props > View as proteins.
  4. Click Sites > Create set. A window will appear. Give your new set of sites an appropriate name, or keep the default "all sites", and click OK.
  5. A new sequence line of white "X"s will appear at the bottom of your data. This means all sites are currently selected. Shift-click anywhere in the line of white "X"s to deselect all sites.
  6. Now find a region of the alignment that you think is poorly aligned (for example, sites 119-159) and would like to try to improve. Both partners, A and B, should have this part of alignment showing on their screen.
  7. Once you have located a region to improve, A keeps the original alignment unchanged. B clicks on the left-hand (upstream) end of the poor region in the line of white sites and drags the cursor to its right-hand end. A line of "X"s will appear delineating the region, and the corresponding part of the alignment will be shaded, further indicating what you have selected.
  8. Make sure all sequences are selected (Edit > Select all or Command-A). Then execute MAFFT on that region by selecting Align > Align selected sites (not to be confused with Align selected sequences!). A new window will appear asking you to choose a Reference Sequence. The reference sequence refers to the sequence in which the original gaps will be preserved and propagated into the subsequent alignment. For this purpose, select 496117 as the reference.
  9. Compare the original with the new aligned region.
  10. Regional realignments can be applied to multiple poor regions of an alignment. Regional alignment can leave columns of gaps in your alignment. To remove these, click Edit > Delete gap-only sites.
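
The effect of removing gap-only sites can be illustrated with a few lines of code. This is a minimal sketch of the idea, using a hypothetical function name drop_gap_only_columns and toy sequences; it is not how SeaView implements the menu command.

```python
def drop_gap_only_columns(seqs):
    """Remove every alignment column that contains nothing but gap characters."""
    # Keep the index of each column in which at least one sequence has a residue.
    keep = [i for i, col in enumerate(zip(*seqs)) if any(c != "-" for c in col)]
    return ["".join(s[i] for i in keep) for s in seqs]


print(drop_gap_only_columns(["AT--G", "A---G", "AT--C"]))  # prints ['ATG', 'A-G', 'ATC']
```

Only columns that are gaps in every sequence are removed, so the relative positions of all residues stay the same.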