# Genomic Databases - In Class Exercises

Imagine that you are a grad student who has just begun work in a mouse lab. Your advisor, Dr. Stoker, works on a novel mouse phenotype, which has been dubbed Vampiric. Mice with this phenotype have the physiological feature of exceptionally sharp teeth and the behavioral feature of biting other mice. Dr. Stoker has developed a genetic mutagenesis screen for this phenotype using sunlight as a negative selector. In one strain of Vampiric mice, the mutation has been narrowed to an approximately 250-kilobase (kb) region of chromosome 12. Your first task is to investigate what is known about candidate genes in this region.

**Part I. Explore the basic functionality of the UCSC Genome Browser**

1. Go to http://genome.ucsc.edu in your favorite browser. Take a moment to read the “About the UCSC Genome Bioinformatics Site” section and browse the “News” updates on the main page.
2. Click the link at the top left that reads “Genomes” to access the genomes.
Your queries should go in the text box under “position” or “search term”. Read through the “Sample Position Queries” further down the page for an explanation of how to search for a particular region.
3. Choose a different species from the “genome” drop-down list.
Note that the entire page changes.

**Part II. Investigate the genes in Dr. Stoker’s candidate region.**

1.	Search for the region on chromosome 12 between bases 56,532,042 and 56,785,902 in the mm10 assembly of the mouse genome as follows: choose “Mouse” from the “genome” drop-down list and “mm10” from the “assembly” drop-down list. Type or copy-paste “chr12:56,532,042-56,785,902” into the “search term” box. Click “submit”.
2. You should be taken to a page with an image displaying tracks of annotation information for this region. Take a moment to explore this page. Controls for shifting the display region or zooming in/out are above the track image. Some tracks can be expanded or compressed by clicking on them. In other cases, individual annotations on tracks (such as genes) can be clicked for more details. Below the track image is a list of the available tracks for the genome assembly that you are viewing. You can control which tracks are shown by changing the drop-down boxes next to each track name and then clicking “refresh” (the button on the page, not on the browser toolbar). You can also guarantee that you are looking at the normal set of tracks by clicking the “default tracks” button. Base your answers to the following questions on information in the available tracks.
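
The position query used in step 1 can also be checked by hand. The following is a minimal sketch, not part of the Genome Browser itself: the helper names `parse_position` and `region_length` are hypothetical, and the sketch assumes the query uses 1-based, inclusive coordinates, as the browser's position box does. It splits the query into its parts and confirms that the region is roughly 250 kb.

```python
import re


def parse_position(query: str) -> tuple[str, int, int]:
    """Split a 'chrom:start-end' query into (chrom, start, end), ignoring commas."""
    match = re.fullmatch(r"\s*(\w+):([\d,]+)-([\d,]+)\s*", query)
    if match is None:
        raise ValueError(f"not a chrom:start-end position: {query!r}")
    chrom, start_text, end_text = match.groups()
    start = int(start_text.replace(",", ""))
    end = int(end_text.replace(",", ""))
    if start > end:
        raise ValueError("start must not exceed end")
    return chrom, start, end


def region_length(start: int, end: int) -> int:
    """Number of bases in a 1-based, inclusive interval (assumed convention)."""
    return end - start + 1


chrom, start, end = parse_position("chr12:56,532,042-56,785,902")
print(chrom, start, end, region_length(start, end))
```

The printed length, 253,861 bases, agrees with the "approximately 250-kilobase" description of the candidate region.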


To answer the following 3 questions, scroll down to “Genes and Gene Prediction Tracks” and set RefSeq Genes to “pack” and Genscan Genes to “dense”. Then scroll down to “mRNA and EST Tracks” and set Mouse mRNAs to “pack”. You may want to hide other tracks to simplify the view.

Q1.	What RefSeq Mouse Genes are in this region?

RefSeq genes are a good place to start your search for candidate genes. However, it is possible that the mutation for Vampiric is in an unknown gene. Note that the Mouse mRNA track has many more annotations than the RefSeq track. RefSeq is a manually curated, non-redundant gene database, while GenBank is a larger database with potentially redundant experimental data. For each gene in RefSeq there are typically one or more corresponding mRNAs in GenBank that align to the same position in the genome. However, some mRNAs in GenBank may not be confirmed as real genes and therefore have no corresponding RefSeq entry. Such mRNAs would be further candidates for your mutation.

Q2.	How many GenBank mRNAs in this region do not correspond to RefSeq genes? What are they? (Hint: Zoom in to see which mRNAs overlap with the smaller genes)
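
The overlap check behind Q2 is done by eye in the browser, but the same idea can be sketched in code. This is only an illustration under stated assumptions: the `Feature` type, the helper names, and all coordinates below are made-up placeholders, not real annotations from this region, and the check ignores strand and exon structure.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    name: str
    start: int  # 1-based, inclusive (assumed convention)
    end: int    # 1-based, inclusive


def overlaps(a: Feature, b: Feature) -> bool:
    """True when two inclusive intervals share at least one base."""
    return a.start <= b.end and b.start <= a.end


def unmatched(mrnas: list[Feature], genes: list[Feature]) -> list[str]:
    """Names of mRNAs that overlap none of the given genes."""
    return [m.name for m in mrnas if not any(overlaps(m, g) for g in genes)]


# Placeholder features for illustration only.
genes = [Feature("geneA", 100, 500), Feature("geneB", 900, 1200)]
mrnas = [
    Feature("mrna1", 120, 480),    # lies inside geneA
    Feature("mrna2", 600, 750),    # falls between the two genes
    Feature("mrna3", 1150, 1400),  # overlaps the end of geneB
]
print(unmatched(mrnas, genes))
```

With these placeholder intervals, only `mrna2` is reported as having no overlapping gene.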

You should also consider the possibility that your mutation is in a gene that has never been characterized experimentally. GenScan is a computational tool for predicting genes, which we will discuss in more detail later in this course. For now, just take a look at the annotation track for genes predicted by GenScan. You should always take these predictions with a grain of salt, but they may be useful if you don’t get any interesting results from known genes or mRNAs.

Q3.	How many GenScan predicted genes are in this region?

**Part III. Get detailed information about one of the candidate genes, Pax9.**

Set the UCSC Genes track to “pack” and click on Pax9 to get more information.
Now it’s time to look at what is known about our candidate genes. The RefSeq genes will have the most useful information as they are often based on multiple experiments and have been validated in some way.

Q4.	What is the RefSeq Accession Number for Pax9?

Q5.	What is the genomic size (the number of base pairs in the entire transcribed region, including introns and untranslated regions) of Pax9?

Q6.	 Is Pax9 on the forward or reverse strand of chromosome 12?

Explore the links listed under “Links to sequence”. This is one way that you can download the protein, mRNA, or genomic sequence data in fasta format.
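
Once a sequence has been downloaded in FASTA format, its length is easy to check. The sketch below is a minimal reader written for illustration: the function name `read_fasta` is hypothetical, and the two records are toy sequences, not Pax9 data. It assumes each record starts with a `>` header line followed by one or more sequence lines.

```python
def read_fasta(text: str) -> dict[str, str]:
    """Map each record's identifier (first word of its header) to its sequence."""
    records: dict[str, str] = {}
    name = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            name = header.split()[0] if header else ""
            records[name] = ""
        elif name is not None:
            records[name] += line  # sequences may wrap across lines
    return records


# Toy records for illustration only.
example = """>toy_mrna hypothetical record
ATGGCCAAG
TTTTAA
>toy_protein
MAKF
"""

for name, sequence in read_fasta(example).items():
    print(name, len(sequence))
```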

Click on the “Genome Browser” link.
This will take you back to the track view, now zoomed in on this specific gene. The intron/exon structure of the gene is clearly shown in the RefSeq track. The thickest blocks represent coding exons, the medium blocks represent untranslated regions, and the thin lines represent introns, with arrows indicating the direction of transcription.

Q7.	How many exons are in Pax9?

Return to the Pax9 information page, and click on the link embedded in the gene id under “Entrez Gene”.
This brings you to an NCBI page with more detailed information on this gene. The NCBI Entrez Gene database is linked to GenBank, PubMed, and many other useful databases. Scroll down to the section marked “Genomic regions, transcripts, and products”. You should see an image with a green line for genomic information, a purple line for mRNA information, and a red line for protein information. If you only see green lines (genes), click on them and each green line will expand to show the purple and red lines. You can click on “Configure” in the image to find other tracks. Reconfirm the genomic length and the number of exons in Pax9 by mousing over the green line.

Q8.	How many nucleotides are there in the Pax9 mRNA? If there are multiple isoforms, just pick one. How many amino acids are there in the Pax9 protein? Do they follow the 3:1 ratio (as 3 mRNA nucleotides code for 1 amino acid)? If not, do you know which biological process causes the discrepancy?
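
The arithmetic for Q8 can be sketched as follows once you have looked up the two lengths. The function names are hypothetical and the numbers are made-up placeholders, not the values for Pax9. The sketch assumes a coding sequence of 3 nucleotides per amino acid plus one 3-nucleotide stop codon; explaining any nucleotides beyond that is left to your answer.

```python
def coding_nucleotides(protein_length: int) -> int:
    """Nucleotides needed to encode a protein: 3 per amino acid plus a stop codon."""
    return 3 * protein_length + 3


def compare_lengths(mrna_length: int, protein_length: int) -> dict[str, float]:
    """Summarize how an mRNA length compares with the length of its protein."""
    expected = coding_nucleotides(protein_length)
    return {
        "ratio": mrna_length / protein_length,  # observed nucleotides per amino acid
        "coding_nt": expected,                  # nucleotides accounted for by coding
        "extra_nt": mrna_length - expected,     # nucleotides left to explain
    }


# Placeholder lengths for illustration only; substitute the values you find.
print(compare_lengths(mrna_length=1500, protein_length=300))
```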

Right-click on the green line and go to the “View & Tools” option. You will see many options for displaying the genomic, mRNA, and protein information of Pax9 in different formats. Follow the link to FASTA View; you should obtain the same sequence information as found on the Genome Browser site.

Scroll down and look at the other information on this page.
Note that the GeneRIF, Gene Ontology, and Interaction sections provide useful information on what is known about the function of this gene. Use the GeneRIF, Gene Ontology, and Interactions sections to answer the following questions about Pax9:

Q9.	What evidence (if any) supports Pax9 as a likely candidate gene for the Vampiric phenotype?

Q10.	What other basic biological function (if any) does Pax9 have?

Q11.	What genes or proteins (if any) does Pax9 interact with? (Hint: look under heading Interactions)

**Part IV. Get information on Pax9 in other (non-mouse) species.**

Imagine that you are able to confirm that Pax9 is in fact responsible for the Vampiric phenotype in Mouse. Now Dr. Stoker wants to try inducing this phenotype in other organisms. He asks you to find out which species have Pax9 genes in GenBank.
1.	Go to: http://www.ncbi.nlm.nih.gov/Genbank/
Read the introductory information for GenBank. One common way to search GenBank is to submit a sequence as a BLAST search. However, we will cover BLAST and other sequence homology tools later on in the class. In the meantime, we will search the Gene database using keywords.
2.	Go to: http://www.ncbi.nlm.nih.gov/gene
Read the “Help” section for extra hints on ways you can search.
3.	Search for “Pax9” in the search bar at the top of the page.
You will notice that you get a list of many genes, but not all of them are called Pax9. This is because you have searched all fields in the database and are also seeing genes related to Pax9. Narrow your search by clicking “Advanced” search and changing your search field to “Gene Name”. You can accomplish the same thing by changing your search string to Pax9[sym], where “sym” stands for the gene symbol. Species are typically listed in brackets on the first line of the item summary (e.g., [Gallus gallus]).
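
The same fielded query can also be expressed as a URL for NCBI's E-utilities `esearch` service, which is not part of this exercise and is shown here only as an assumption about how one might script the search. The sketch below only builds the URL and does not contact NCBI; the helper name `gene_search_url` is hypothetical, and the optional `[orgn]` organism filter is an addition beyond the `Pax9[sym]` string used above.

```python
from typing import Optional
from urllib.parse import urlencode

# Base address of the E-utilities search endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def gene_search_url(symbol: str, organism: Optional[str] = None) -> str:
    """Build an esearch URL for a gene symbol in the Gene database."""
    term = f"{symbol}[sym]"
    if organism:
        term += f" AND {organism}[orgn]"  # restrict to one species
    return EUTILS_ESEARCH + "?" + urlencode({"db": "gene", "term": term})


print(gene_search_url("Pax9"))
print(gene_search_url("Pax9", organism="Gallus gallus"))
```

Square brackets and spaces are percent-encoded in the printed URLs, which is expected for query strings.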

Dr. Stoker suggests trying to induce the Vampiric phenotype in human subjects but you point out that that would be against ethical guidelines. He agrees and suggests instead that you use the species Gallus gallus (Chicken).

Q12.	What chromosome is Pax9 on in Chicken? 

Q13.	If you wanted to go back and view this region of the chicken genome in UCSC Genome Browser, what search string would you use? This is a trick question!!!

**List of Genomic Databases**

NCBI Entrez - http://www.ncbi.nlm.nih.gov/sites/gquery - huge database that encompasses other databases, including:
- PubMed for Journal Articles - http://www.ncbi.nlm.nih.gov/pubmed/
- GenBank for Raw Sequence - http://www.ncbi.nlm.nih.gov/genbank/
- RefSeq for Non-Redundant Sequence - http://www.ncbi.nlm.nih.gov/RefSeq/
- OMIM for Genetic Diseases - http://www.ncbi.nlm.nih.gov/omim?db=omim
- dbSNP for Polymorphisms - http://www.ncbi.nlm.nih.gov/snp?db=snp
- GEO for Gene Expression Data - http://www.ncbi.nlm.nih.gov/geo/

ExPASy - http://expasy.org/ - Another large database encompassing other databases:
- Uniprot for Protein Sequence/Annotation - http://www.uniprot.org/
- PROSITE for Protein Sequence Patterns - http://prosite.expasy.org/

ENSEMBL - http://useast.ensembl.org/index.html - An alternative to RefSeq and UniProt
Genome Browser - http://genome.ucsc.edu/ - Track-based portal to databases of genomic sequence and annotations
GeneCards - http://www.genecards.org/ - Gene-centered portal to information from many other databases
ENCODE - http://www.genome.gov/10005107 - Encyclopedia of DNA Elements
HapMap - http://hapmap.ncbi.nlm.nih.gov/ - Database of human variation across populations
Gene Ontology (GO) - http://www.geneontology.org/ - Hierarchy of gene annotations
MGED - http://www.mged.org/ - Database of gene expression/microarray results 

This list is by no means complete. For more databases, see the most recent Database Summary Paper Alpha List: http://www.oxfordjournals.org/nar/database/a/

# Homework exercise (**10 Points**)

Your colleague has just finished an extensive karyotyping study across samples from many different types of human cancers. She specifically looked for regions of the genome that have a statistically significant rate of chromosomal aberrations (including inversions, deletions, and translocations). She has asked you to help her analyze her results, starting with a region she identified on chromosome 6, ranging from base pairs 108,510,000 to 109,500,000 using NCBI build 36 (hg18). Use UCSC Genome Browser and/or other public databases to view information about known genes in this region.
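
Because the coordinates are given for hg18, it matters that the browser is set to that assembly rather than a newer one. As a minimal sketch, the region can be opened directly through the browser's `hgTracks` address with `db` and `position` parameters; the helper name `ucsc_region_url` is hypothetical, and the sketch only builds the URL.

```python
from urllib.parse import urlencode

# Address of the UCSC Genome Browser track display.
UCSC_TRACKS = "https://genome.ucsc.edu/cgi-bin/hgTracks"


def ucsc_region_url(assembly: str, chrom: str, start: int, end: int) -> str:
    """Build a Genome Browser URL for one region of one assembly."""
    query = {"db": assembly, "position": f"{chrom}:{start}-{end}"}
    # Keep ':' unescaped so the position stays readable in the URL.
    return UCSC_TRACKS + "?" + urlencode(query, safe=":")


print(ucsc_region_url("hg18", "chr6", 108_510_000, 109_500_000))
```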

Q1. What are the genes in this region?  (3 points) 

Q2. Hypothesize which gene you think is the most likely candidate to be related to human cancers, and provide evidence from at least 3 different public databases. Be sure to include the URL to each database entry on which you base your answer. (7 points) 
[Hint: There is a “Phenotype and Disease Associations” category in the UCSC Genome Browser.]
