Table of contents
- expected learning outcomes
- exercise 1: basic biological analysis and molecular sequence retrieval from NCBI databases
- exercise 2: beach mice
expected learning outcomes
The objective of this activity is to help you become familiar with some of the resources and tools from the NCBI (National Center for Biotechnology Information). The NCBI creates and maintains public databases, conducts research in computational biology, develops software tools for analyzing genome data, and disseminates biomedical information. The NCBI site is constantly being updated, and some of the changes include new databases and tools for data mining.
For more information, see:
NCBI Resources
NCBI Tutorial
During this activity you will become familiar with NCBI, and learn how to search molecular databases, limit your searches, and save the results.
At the end of this activity, you should know some basic biological information about the atg5 gene found in eukaryotes and the protein it encodes, including the taxonomic distribution of the gene, the cellular location and function of the protein product and the protein domain architecture. In addition, you will have assembled a small subset of molecular sequence data ready for further analysis.
In many cases there are multiple ways to obtain the answers to the questions below. The tutorial recommends one of many paths available to answer these types of questions.
exercise 1: basic biological analysis and molecular sequence retrieval from NCBI databases
Access the main NCBI page and search all databases (top of the page) for the gene "atg5".
You will see a summary of the number of times the term atg5 occurs in each NCBI database. From here you should be able to answer some very basic questions about the atg5 gene. For example, how many articles have been published containing the term atg5? How many fully sequenced genomes contain the atg5 gene? This type of information, while simple, is useful for quickly assessing how much published work and molecular data are available for your gene or protein of interest.
Navigate to the Nucleotide database. From here, you will find more general information about the atg5 gene.
- How many entries for atg5 are available for Homo sapiens? Bos taurus?
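The same kind of count can also be requested programmatically. The following is a minimal sketch, assuming the NCBI E-utilities ESearch endpoint and the Entrez field tags [Gene Name] and [Organism]; the helper names esearch_url and gene_in_organism are made up for this illustration. It only builds the request URLs and does not contact NCBI, and the counts returned by such a query may not match the web interface exactly.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities (assumed here; check the NCBI docs).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def gene_in_organism(gene, organism):
    """Combine a gene name and an organism into one Entrez query string."""
    return f"{gene}[Gene Name] AND {organism}[Organism]"


def esearch_url(db, term, retmax=0):
    """Build an ESearch URL; retmax=0 asks for the hit count without IDs."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# Print one URL per organism from the question above.
for organism in ("Homo sapiens", "Bos taurus"):
    print(esearch_url("nucleotide", gene_in_organism("atg5", organism)))
```

Opening one of the printed URLs in a browser should return a small JSON document whose count field corresponds to the number of matching Nucleotide entries.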
From the Nucleotide results you can readily reach the NCBI Gene database. Select the Gene database entry for Drosophila melanogaster atg5 and answer the following questions.
- What is the taxonomic order for D. melanogaster?
- How many 3D structures are available in the NCBI database for D. melanogaster? How many genome sequences?
- What two genes are immediately upstream from atg5 in the D. melanogaster genome?
- What biological pathway is the atg5 protein involved in?
- What cellular compartment is the atg5 protein located in: nucleus, membrane, cytoplasm, or other?
In the "Genomic regions, transcripts and products" section on the atg5 Gene page, select the protein (labeled NP_572390.1 in red on the graphic), choose FASTA and wait for the FASTA version of the protein sequence to load on a new page.
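As an alternative to clicking through the Gene page, the FASTA record can be requested directly by accession. This is a hedged sketch, assuming the E-utilities EFetch endpoint with rettype=fasta; the function name efetch_fasta_url is hypothetical, and the code only constructs the URL rather than downloading anything.

```python
from urllib.parse import urlencode


def efetch_fasta_url(db, accessions):
    """Build an EFetch URL that asks for plain-text FASTA records."""
    params = {
        "db": db,
        "id": ",".join(accessions),  # several accessions may be comma-joined
        "rettype": "fasta",
        "retmode": "text",
    }
    return (
        "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?"
        + urlencode(params)
    )


# The protein accession shown on the atg5 Gene page in the step above.
print(efetch_fasta_url("protein", ["NP_572390.1"]))
```

Pasting the printed URL into a browser should display the same FASTA text that the web interface shows.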
- What conserved domain does the atg5 protein contain?
- How many eukaryotic proteins have a similar domain architecture?
From the page displaying the FASTA version of the atg5 protein sequence, access BLAST (the link is at the right of the page) and perform a BLASTp search with the default settings.
- How many hits were obtained that are found in Bos taurus?
- How many hits for arthropods?
Select the top 8 hits from the BLASTp search and select "Get Selected Sequences".
Change the default "Display" setting to FASTA and the "Send to" to Text. Select all of the resulting text and copy/paste it into a new text file. Name this file whatever you would like and save it to a folder of your choice. This file will be used for the multiple sequence alignment activity.
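Before moving on to the alignment activity, it can help to confirm that the saved file contains the number of sequences you expect. The sketch below is one simple way to do this in Python; the function read_fasta is written for this illustration, and the two short records are invented placeholders, not real atg5 sequences.

```python
def read_fasta(text):
    """Parse FASTA text into a dict mapping each header to its sequence."""
    parts = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            header = line[1:]
            parts[header] = []
        elif header is not None:
            parts[header].append(line)  # sequence may span several lines
    return {h: "".join(chunks) for h, chunks in parts.items()}


# Placeholder records standing in for the pasted BLASTp hits.
example = """>seq1 hypothetical protein A
MKV
LLA
>seq2 hypothetical protein B
MRT
"""

records = read_fasta(example)
for header, sequence in records.items():
    print(header.split()[0], len(sequence))
```

To check your own file, read its contents with open(...).read() and pass the text to read_fasta; after the step above, the dictionary should hold 8 records.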
exercise 2: beach mice
Let us say you are attempting a comparative study of color variation in beach mice (Peromyscus polionotus). These interesting creatures inhabit the Southeastern United States. They differ from their inland relatives in having reduced pigmentation on their faces, flanks and tails (Steiner et al. 2009). You are interested in the genes and proteins involved in fur color changes associated with different environments. After the first exercise, you know that you can use NCBI to find much of the available information about a specific gene. So now you may want to try some different searches on your own.
- Open the web browser and navigate to PubMed at the NCBI website. Search for the following terms and note how many records you find: coloration, beach mice, coloration mice, mice coloration (does this give different results than the previous searches?)
- Using boolean operators (AND, OR, NOT) through the advanced search link, do a search for studies on mice coloration other than those in beach mice. How many papers are there?
- You seem to remember some work was done on coloration in pocket mice. Now do a search for coloration in beach mice or pocket mice. Try searching multiple ways using boolean operators and parentheses to see how this affects your search results.
- After this search you realize that Hopi Hoekstra's lab has done a lot of research in this area. Using either advanced search, your search history, or search limits, look for some of Hopi Hoekstra's recent papers (say in the last 5 years). Next limit your search to first author: Steiner CC. Focus on this specific article: Steiner CC, Römpler H, Boettger LM, Schöneberg T, Hoekstra HE. 2009. The genetic basis of phenotypic convergence in beach mice: similar pigment patterns but different genes. Mol Biol Evol 26(1):35-45.
- Does this article have links to other databases? Which databases?
- What loci did they use? Which one is the major contributor to the coloration pattern?
- Using what you have learned from the last exercise, see if you can find this nuclear locus in the house mouse (Mus musculus) genome. To which chromosome does this locus map?
- Returning to Steiner's 2009 study on beach mice from above, download the melanocortin-1 receptor (Mc1r) nuclear DNA sequences in FASTA format and create a data set that you could later align and analyze using your favorite phylogenetic software. Hint: The accession numbers are FJ389418-FJ389441.
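The Boolean PubMed searches in the steps above can be sketched as query strings. This is only an illustration of how parentheses group terms: the helper functions are made up for this example, the [1au] first-author tag is assumed from the PubMed field tag list, and none of these strings is claimed to be the one correct answer to the questions.

```python
def phrase(text):
    """Quote a multi-word term so it is searched as a phrase."""
    return f'"{text}"'


def all_of(*terms):
    """Join terms with AND inside parentheses."""
    return "(" + " AND ".join(terms) + ")"


def any_of(*terms):
    """Join terms with OR inside parentheses."""
    return "(" + " OR ".join(terms) + ")"


def exclude(query, term):
    """Keep records matching query but drop those matching term."""
    return f"({query} NOT {term})"


# Mice coloration studies other than those on beach mice.
not_beach = exclude(all_of("mice", "coloration"), phrase("beach mice"))

# Coloration in beach mice or pocket mice, with the OR grouped explicitly.
beach_or_pocket = all_of(
    "coloration", any_of(phrase("beach mice"), phrase("pocket mice"))
)

# Restricting to a first author (field tag assumed, see above).
first_author = "Steiner CC[1au]"

for query in (not_beach, beach_or_pocket, first_author):
    print(query)
```

Without the inner parentheses, a query such as coloration AND "beach mice" OR "pocket mice" is, to the best of our understanding, read left to right, so it would return every record mentioning pocket mice whether or not it concerns coloration. Pasting both forms into the PubMed search box is a quick way to see the difference.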
Once you have completed this exercise, you will have a data set that you can align in a multiple sequence alignment program and then use as an input file for your favorite phylogenetic software.
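The accession range given in the hint (FJ389418-FJ389441) is shorthand for a run of consecutive accession numbers. A small sketch of expanding such a range into an explicit list is shown below; the function name expand_accession_range is hypothetical, and it assumes both ends share the same letter prefix and digit width.

```python
import re


def expand_accession_range(spec):
    """Expand 'FJ389418-FJ389441' into a list of individual accessions."""
    start, end = spec.split("-")
    first = re.fullmatch(r"([A-Z]+)(\d+)", start)
    last = re.fullmatch(r"([A-Z]+)(\d+)", end)
    if not first or not last or first.group(1) != last.group(1):
        raise ValueError(f"cannot expand accession range: {spec!r}")
    prefix = first.group(1)
    width = len(first.group(2))  # keep any leading zeros in the number
    low, high = int(first.group(2)), int(last.group(2))
    if high < low:
        raise ValueError(f"range end precedes range start: {spec!r}")
    return [f"{prefix}{number:0{width}d}" for number in range(low, high + 1)]


accessions = expand_accession_range("FJ389418-FJ389441")
print(len(accessions), accessions[0], accessions[-1])
```

The resulting list can be pasted, comma-separated, into the Nucleotide search box, which should retrieve the Mc1r records so they can be saved together as one FASTA file.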