
Jupyter notebook 01_Databases/Online_database_in_class.ipynb

Kernel: Python 2 (SageMath)

Genomic Databases - In Class Exercises

Imagine that you are a grad student who has just begun work in a mouse lab. Your advisor, Dr. Stoker, works on a novel mouse phenotype, which has been dubbed Vampiric. Mice with this phenotype have the physiological feature of exceptionally sharp teeth and the behavioral feature of biting other mice. Dr. Stoker has developed a genetic mutagenesis screen for this phenotype using sunlight as a negative selector. In one strain of Vampiric mice, the mutation has been narrowed to an approximately 250-kilobase (kb) region of chromosome 12. Your first task is to investigate what is known about candidate genes in this region.

** Part I. Explore the basic functionality of the UCSC Genome Browser **

  1. Go to http://genome.ucsc.edu in your favorite browser. Take a moment to read the “About the UCSC Genome Bioinformatics Site” section and browse the “News” updates on the main page.

  2. Click the link at the top left that reads “Genomes” to access the genomes. Your queries should go in the text box under “position” or “search term”. Read through the “Sample Position Queries” further down the page for an explanation of how to search for a particular region.

  3. Choose a different species from the “genome” drop-down list. Note that the entire page changes.

** Part II. Investigate the genes in Dr. Stoker’s candidate region.**

  1. Search for the region on chromosome 12 between bases 56,532,042 and 56,785,902 in the most recent assembly of the Mouse genome as follows: choose "Mouse" from the “genome” drop-down list and "mm10" from the "assembly" drop-down list. Type or copy-paste "chr12:56,532,042-56,785,902" into the "search term" box. Click "submit".

  2. You should be taken to a page with an image displaying tracks of annotation information for this region. Take a moment to explore this page. Controls for shifting the display region or zooming in/out are above the track image. Some tracks can be expanded or compressed by clicking on them. In other cases, individual annotations on tracks (such as genes) can be clicked for more details. Below the track image is a list of the available tracks for the genome assembly that you are viewing. You can control which tracks are shown by changing the drop-down boxes next to each track name and then clicking “refresh” (the button on the page, not on the browser toolbar). You can also guarantee that you are looking at the normal set of tracks by clicking the “default tracks” button. Base your answers to the following questions on information in the available tracks.
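The position string from step 1 can also be handled programmatically. The following is a minimal sketch, not part of the original exercise: the helper names `parse_position`, `region_length`, and `browser_url` are hypothetical, and the `hgTracks` URL with `db` and `position` parameters is assumed to be the Genome Browser's usual deep-link form.

```python
import re


def parse_position(position):
    """Split a 'chrN:start-end' string into (chrom, start, end)."""
    match = re.match(r"^(chr\w+):([\d,]+)-([\d,]+)$", position.strip())
    if match is None:
        raise ValueError("not a chrom:start-end position: %r" % position)
    chrom, start, end = match.groups()
    # Thousands separators are allowed in the search box, so strip them.
    return chrom, int(start.replace(",", "")), int(end.replace(",", ""))


def region_length(start, end):
    """Length in bases, treating both endpoints as included."""
    return end - start + 1


def browser_url(db, chrom, start, end):
    """Build a Genome Browser link (assumed hgTracks URL layout)."""
    return ("https://genome.ucsc.edu/cgi-bin/hgTracks?db=%s&position=%s:%d-%d"
            % (db, chrom, start, end))


chrom, start, end = parse_position("chr12:56,532,042-56,785,902")
print(region_length(start, end))  # 253861 bases, i.e. roughly 250 kb
print(browser_url("mm10", chrom, start, end))
```

The computed length agrees with the "approximately 250-kilobase" region described in the scenario.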

To answer the following 3 questions, scroll down to the “Genes and Gene Prediction Tracks” and set RefSeq Genes to “pack” and Genscan Genes to “dense”. Also scroll down to “mRNA and EST Tracks” and set Mouse mRNAs to “pack”. You may want to hide other tracks to simplify the view.

Q1. What RefSeq Mouse Genes are in this region?

nkx2.1, C87198, nkx2.9, pax9, slc25a21

RefSeq genes are a good place to start your search for candidate genes; however, it is possible that the mutation for Vampiric is in an unknown gene. Note that the Mouse mRNA track has many more annotations than the RefSeq track. RefSeq is a manually curated, non-redundant gene database, while GenBank is a larger database with potentially redundant experimental data. For each gene in RefSeq there are typically one or more corresponding mRNAs in GenBank that align to the same position in the genome. However, some mRNAs in GenBank may be unconfirmed as real genes, and therefore do not have a corresponding RefSeq entry. Such mRNAs would be further candidates for your mutation.
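One way to think about "mRNAs without a corresponding RefSeq entry" is as an interval-overlap check. The sketch below is only an illustration: the names and coordinates are made up, and simple positional overlap is an assumption that ignores strand and exon structure, so it approximates rather than reproduces what you see in the browser.

```python
def overlaps(a, b):
    """True when two inclusive (start, end) intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def unmatched_mrnas(mrnas, refseq_genes):
    """Return the names of mRNAs that overlap no RefSeq gene interval."""
    return [name for name, span in mrnas
            if not any(overlaps(span, gene_span) for _, gene_span in refseq_genes)]


# Hypothetical toy coordinates, not taken from the mouse chr12 region.
refseq_genes = [("geneA", (100, 500)), ("geneB", (900, 1400))]
mrnas = [("mrna1", (120, 480)), ("mrna2", (600, 750)), ("mrna3", (1350, 1600))]

print(unmatched_mrnas(mrnas, refseq_genes))  # ['mrna2']
```

Here `mrna2` falls in the gap between the two toy genes, so it is the only one reported as a further candidate.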

Q2. How many GenBank mRNAs in this region do not correspond to RefSeq genes? What are they? (Hint: Zoom in to see which mRNAs overlap with the smaller genes)

9

You should also consider the possibility that your mutation is in a gene that has never been characterized experimentally. GenScan is a computational tool for predicting genes, which we will discuss in more detail later in this course. For now, just take a look at the annotation track for genes predicted by GenScan. You should always take these predictions with a grain of salt, but they may be useful if you don’t get any interesting results from known genes or mRNAs.

Q3. How many GenScan predicted genes are in this region?

8

** Part III. Get detailed information about one of the candidate genes, Pax9.**

Set the UCSC Genes track to “pack” and click on Pax9 to get more information. Now it’s time to look at what is known about our candidate genes. The RefSeq genes will have the most useful information, as they are often based on multiple experiments and have been validated in some way.

Q4. What is the RefSeq Accession Number for Pax9?

NM_011041.3

Q5. What is the genomic size (the number of base pairs in the entire transcribed region, including introns and untranslated regions) of Pax9?

17352

Q6. Is Pax9 on the forward or reverse strand of chromosome 12?

Forward

Explore the links listed under “Links to sequence”. This is one way that you can download the protein, mRNA, or genomic sequence data in FASTA format.

Click on the “Genome Browser” link. This will take you back to the track view, but will now be zoomed in on this specific gene. The intron/exon structure of this gene is clearly shown in the RefSeq track. The thickest lines represent exons, the medium lines represent untranslated regions, and the thin lines represent introns, with arrows indicating the direction of transcription.
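The exon/intron picture in the track corresponds to a simple list of block coordinates. As a rough sketch, assuming a record laid out like the UCSC gene tables (comma-separated `exonStarts` and `exonEnds` strings with a trailing comma, 0-based half-open coordinates), the genomic size and exon count can be derived as follows. The function names and the toy record are hypothetical and are not the real Pax9 coordinates.

```python
def parse_blocks(exon_starts, exon_ends):
    """Turn 'a,b,c,' style strings into a list of (start, end) exon blocks."""
    starts = [int(x) for x in exon_starts.strip(",").split(",")]
    ends = [int(x) for x in exon_ends.strip(",").split(",")]
    return list(zip(starts, ends))


def gene_structure(tx_start, tx_end, exon_starts, exon_ends):
    """Summarize a transcript: genomic size, exon count, exon and intron lengths."""
    exons = parse_blocks(exon_starts, exon_ends)
    # An intron spans from the end of one exon to the start of the next.
    introns = [(exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1)]
    return {
        "genomic_size": tx_end - tx_start,  # half-open coordinates, so no +1
        "exon_count": len(exons),
        "exon_lengths": [end - start for start, end in exons],
        "intron_lengths": [end - start for start, end in introns],
    }


# Hypothetical three-exon transcript used only to illustrate the arithmetic.
record = gene_structure(1000, 2000, "1000,1400,1800,", "1100,1500,2000,")
print(record["genomic_size"], record["exon_count"])
print(record["exon_lengths"], record["intron_lengths"])
```

Exon and intron lengths together add up to the genomic size, which is a handy sanity check when counting exons by eye.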

Q7. How many exons are in Pax9?

4

Return to the Pax9 information page, and click on the link embedded in the gene id under “Entrez Gene”. This brings you to an NCBI page with more detailed information on this gene. The NCBI Entrez Gene database is linked to GenBank, PubMed, and many other useful databases. Scroll down to the section marked “Genomic regions, transcripts, and products”. You should see an image with a green line for genomic information, purple lines for mRNA information, and red lines for protein information. If you only see green lines (genes), click on them and each green line will expand into a purple line and a red line. You can click “Configure” on the image to find other tracks. Reconfirm the genomic length and the number of exons in Pax9 by mousing over the green line.

Q8. How many nucleotides are there in Pax9 mRNA? If there are multiple isoforms, just pick one. How many amino acids are there in Pax9 protein? Do they follow the 3:1 ratio (as 3 mRNA nucleotides code for 1 amino acid)? If not, do you know which biological process causes the discrepancy?

The Pax9 mRNA is 4437 nucleotides long, and this particular Pax9 protein isoform has 342 amino acids. They do not follow the 3:1 ratio: 4437/3 = 1,479, which is far greater than 342. The discrepancy arises because not all nucleotides in the mature mRNA are protein coding; the 5′ and 3′ untranslated regions (UTRs) are not translated.
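The arithmetic behind this answer can be checked in a few lines. This is a small sketch using only the two figures quoted above (4437 nucleotides and 342 amino acids); the function name `coding_nucleotides` is hypothetical, and counting one stop codon as part of the coding sequence is an assumption of the sketch.

```python
def coding_nucleotides(protein_length, include_stop=True):
    """Nucleotides needed to encode a protein, optionally counting the stop codon."""
    return 3 * protein_length + (3 if include_stop else 0)


mrna_length = 4437     # nucleotides in the chosen Pax9 mRNA isoform
protein_length = 342   # amino acids in the corresponding protein

cds_length = coding_nucleotides(protein_length)
untranslated = mrna_length - cds_length

print(cds_length)      # 1029 nucleotides of coding sequence
print(untranslated)    # 3408 nucleotides left over for the UTRs
```

Under these assumptions, roughly three quarters of the mRNA is untranslated, which is why the simple 3:1 ratio does not hold for the full transcript.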

Right-click on the green line and go to the “View & Tools” option. You will see many options to display the genomic, mRNA, and protein information of Pax9 in different formats. Follow the link to FASTA View; you should obtain the same sequence information as found on the Genome Browser site.

Scroll down and look at the other information on this page. Note that the GeneRIF, Gene Ontology, and Interaction sections provide useful information on what is known about the function of this gene. Use the GeneRIF, Gene Ontology, and Interactions sections to answer the following questions about Pax9:

Q9. What evidence (if any) supports Pax9 as a likely mutant related to the Vampiric phenotype?

Functions in the expansion of taste progenitor fields in taste papillae (affinity for blood?), and has a genetic interaction with Msx1 in regulating several stages of tooth morphogenesis (promotes teeth development -- sharp teeth?).

Q10. What other basic biological function (if any) does Pax9 have?

DNA binding and transcriptional regulation

Q11. What genes or proteins (if any) does Pax9 interact with? (Hint: look under heading Interactions)

Msx1, Hoxa1, Tbx21

** Part IV. Get information on Pax9 in other (non-mouse) species. **

Imagine that you are able to confirm that Pax9 is in fact responsible for the Vampiric phenotype in Mouse. Now Dr. Stoker wants to try inducing this phenotype in other organisms. He asks you to find out which species have Pax9 genes in GenBank.

  1. Go to: http://www.ncbi.nlm.nih.gov/Genbank/ Read the introductory information for GenBank. One common way to search GenBank is to submit a sequence as a BLAST search. However, we will cover BLAST and other sequence homology tools later on in the class. In the meantime, we will search the Gene database using keywords.

  2. Go to: http://www.ncbi.nlm.nih.gov/gene Read the “Help” section for extra hints on ways you can search.

  3. Search for “Pax9” in the search bar at the top of the page. You will notice that you get a list of many genes but not all of them are called Pax9. This is because you have searched all fields in the database and are also seeing genes related to Pax9. Narrow your search by clicking “Advanced” search and changing your search field to “Gene Name”. You can also accomplish the same thing by changing your search string to Pax9[sym], where “sym” stands for the gene symbol. Species are typically listed on the first line of the item summary in brackets (e.g., [Gallus gallus]).
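The fielded search string from step 3 can also be assembled in code. The sketch below only builds strings and does not contact NCBI. The helper names are hypothetical, the `[orgn]` organism tag is an addition not used in the steps above, and the E-utilities `esearch.fcgi` endpoint with `db` and `term` parameters is assumed to be the appropriate programmatic counterpart of the search box.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode        # Python 2


def gene_query(symbol, organism=None):
    """Build a Gene search term restricted to the gene symbol field."""
    term = "%s[sym]" % symbol
    if organism:
        # Assumed organism field tag; narrows the hits to a single species.
        term += ' AND "%s"[orgn]' % organism
    return term


def esearch_url(term):
    """Build (but do not fetch) an E-utilities search URL for the Gene database."""
    base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    return base + "?" + urlencode([("db", "gene"), ("term", term)])


print(gene_query("Pax9"))
print(gene_query("Pax9", "Gallus gallus"))
print(esearch_url(gene_query("Pax9")))
```

Pasting the first printed term into the search bar should give the same narrowed list as the “Advanced” search route.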

Dr. Stoker suggests trying to induce the Vampiric phenotype in human subjects but you point out that that would be against ethical guidelines. He agrees and suggests instead that you use the species Gallus gallus (Chicken).

Q12. What chromosome is Pax9 on in Chicken?

Chromosome 5

Q13. If you wanted to go back and view this region of the chicken genome in UCSC Genome Browser, what search string would you use? This is a trick question!!!

chr5:36761995-36780257 in the chicken genome annotation (Dec 2015 (Gallus_gallus-5.0/galGal5) build)

** List of Genomic Databases **

NCBI Entrez - http://www.ncbi.nlm.nih.gov/sites/gquery - huge database that encompasses other databases, including:

ExPASy - http://expasy.org/ - Another large database encompassing other databases:

ENSEMBL - http://useast.ensembl.org/index.html - An alternative to RefSeq and UniProt

Genome Browser - http://genome.ucsc.edu/ - Track-based portal to databases of genomic sequence and annotations

GeneCards - http://www.genecards.org/ - Gene-centered portal to information from many other databases

ENCODE - http://www.genome.gov/10005107 - Encyclopedia of DNA Elements

HapMap - http://hapmap.ncbi.nlm.nih.gov/ - Database of human variation across populations

Gene Ontology (GO) - http://www.geneontology.org/ - Hierarchy of gene annotations

MGED - http://www.mged.org/ - Database of gene expression/microarray results

This list is by no means complete; for more databases, see the most recent Database Summary Paper Alpha List: http://www.oxfordjournals.org/nar/database/a/

Homework exercise (10 Points)

Your colleague has just finished an extensive karyotyping study across samples from many different types of human cancers. She specifically looked for regions of the genome that have a statistically significant rate of chromosomal aberrations (including inversions, deletions, and translocations). She has asked you to help her analyze her results, starting with a region she identified on chromosome 6, ranging from base pairs 108,510,000 to 109,500,000 using NCBI build 36 (hg18). Use UCSC Genome Browser and/or other public databases to view information about known genes in this region.
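As a starting point, the region can be turned into a browser-style position string. This is a minimal sketch: the helper name `format_position` is hypothetical, and the `chr6` prefix assumes the UCSC chromosome naming convention used in the in-class exercise.

```python
def format_position(chrom, start, end):
    """Format a region as 'chrom:start-end' with thousands separators."""
    if start > end:
        raise ValueError("start must not exceed end")
    return "%s:%s-%s" % (chrom, "{:,}".format(start), "{:,}".format(end))


start, end = 108510000, 109500000
position = format_position("chr6", start, end)
width = end - start + 1  # both endpoints included

print(position)  # chr6:108,510,000-109,500,000
print(width)     # 990001 bases, a little under 1 Mb
```

Remember to select the hg18 assembly before searching, since the same coordinates point to different sequence in other builds.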

Q1. What are the genes in this region? (3 points)

REFSEQ: NR2E1, SNX3, LACE1, FOXO3, LINC00222 (lincRNA), ARMC2, ARMC2-AS1 (antisense transcript), SESN1

Q2. Hypothesize which gene you think is the most likely candidate to be related to human cancers, and provide evidence from at least 3 different public databases. Be sure to include the URL to each database entry on which you base your answer. (7 points) [Hint: There is a “Phenotype and Disease Associations” category on the UCSC Genome Browser.]

SESN1. The NCBI Gene database reports that SESN1 is induced by the tumor suppressor p53, and p53 responses tend to be downregulated in many cancers. Thus we can hypothesize that this gene may also be involved in tumor suppression downstream of p53 (https://www.ncbi.nlm.nih.gov/gene?cmd=Retrieve&dopt=Graphics&list_uids=27244). The ClinGen CNVs track in the UCSC Genome Browser, under "Phenotype and Disease Associations", suggests that this entire chromosomal region is pathogenic. OMIM reports that SESN1 maps to 6q21, a region associated with deletions in a number of cancers, and that SESN1 is induced by p53 or by genotoxic agents in a p53-dependent manner in a colon carcinoma cell line, consistent with our hypothesis (http://omim.org/entry/606103). According to the Gene Expression Omnibus, SESN1 expression is also induced in response to genotoxic stress, such as X-rays or other types of radiation (https://www.ncbi.nlm.nih.gov/gds/?term=SESN1). Mutations that make SESN1 non-functional likely promote DNA damage, leading to more of the chromosomal aberrations that are associated with human cancers. There are human variants associated with SESN1 (https://www.ncbi.nlm.nih.gov/snp/?term=SESN1), so it may be worth checking the 1000 Genomes database or other databases to see whether there is an increased rate of SESN1 mutations associated with cancers.