Pre-lab: Stats and Genomic Databases

I. Statistics Primer and Review

Often in bioinformatics, we need to remind ourselves to think somewhat in statistical or probabilistic terms.

If you need a refresher, head to Canvas and review the slides and recorded lecture "Statistical Inference in Bioinformatics" (recorded by Shane Jensen, statistics). This is a very cursory review, not designed to be comprehensive. Then, answer the following questions.

Q1. Describe, in statistical terms, the concept of a "null hypothesis". What are some ways a null hypothesis can be utilized?

Q2. What is a test statistic, and how is it used?

Q3. If we reject the null hypothesis at the alpha = 5% level, what does that mean?

Q4. Explain the issue of multiple hypothesis testing and its implications. Give two statistical procedures that address this concern, and describe how to employ them.
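
As a starting point for thinking about Q4, the following minimal sketch shows how quickly chance rejections accumulate when many tests are run. It assumes that every null hypothesis is true, that the tests are independent, and that p-values are uniform on [0, 1] under the null. The function names are illustrative and not part of the course materials.

```python
import random


def familywise_error_rate(num_tests, alpha=0.05):
    """Probability of at least one false positive among num_tests
    independent tests when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** num_tests


def simulate_false_positives(num_tests, alpha=0.05, seed=0):
    """Draw one p-value per test under a true null (uniform on [0, 1])
    and count how many fall below alpha purely by chance."""
    rng = random.Random(seed)
    p_values = [rng.random() for _ in range(num_tests)]
    return sum(p < alpha for p in p_values)


if __name__ == "__main__":
    for m in (1, 20, 1000):
        print(m, round(familywise_error_rate(m), 3), simulate_false_positives(m))
```

Try changing the number of tests and alpha to see how the chance of at least one false rejection behaves before you answer the question.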

II. Databases

One of the first steps in any computational project is to determine what data already exists that one can utilize to address scientific questions or gather information. Take 10 minutes to browse the web and investigate one or more of the databases given below.

List of Genomic Databases

NCBI Entrez - http://www.ncbi.nlm.nih.gov/sites/gquery - A large database that encompasses other databases, including:

PubMed for Journal Articles - http://www.ncbi.nlm.nih.gov/pubmed/
GenBank for Raw Sequence - http://www.ncbi.nlm.nih.gov/genbank/
RefSeq for Non-Redundant Sequence - http://www.ncbi.nlm.nih.gov/RefSeq/
OMIM for Genetic Diseases - http://www.ncbi.nlm.nih.gov/omim?db=omim
dbSNP for Polymorphisms - http://www.ncbi.nlm.nih.gov/snp?db=snp
GEO for Gene Expression Data - http://www.ncbi.nlm.nih.gov/geo/

ExPASy - http://expasy.org/ - Another large database encompassing other databases:

UniProt for Protein Sequence/Annotation - http://www.uniprot.org/
PROSITE for Protein Sequence Patterns - http://prosite.expasy.org/

Ensembl - http://useast.ensembl.org/index.html - An alternative to RefSeq and UniProt
GeneCards - http://www.genecards.org/ - Gene-centered portal to information from many other databases
ENCODE - http://www.genome.gov/10005107 - Encyclopedia of DNA Elements
HapMap - http://hapmap.ncbi.nlm.nih.gov/ - Database of human variation across populations
ExAC - http://exac.broadinstitute.org - Database of human coding mutational variation across populations
Gene Ontology (GO) - http://www.geneontology.org/ - Hierarchy of gene annotations
MGED - http://www.mged.org/ - Database of gene expression/microarray results

This list is by no means complete. For more databases, see the most recent Database Summary Paper Alpha List: http://www.oxfordjournals.org/nar/database/a/
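
Besides browsing in a web browser, many of these resources can also be queried programmatically. The sketch below is an illustration under stated assumptions, not part of the original exercise: it assumes NCBI's E-utilities ESearch endpoint, and it only constructs a search URL without sending any request. The helper name `build_esearch_url` and the example search term are hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint: NCBI E-utilities ESearch (not mentioned in the handout).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(database, term, max_results=5):
    """Build (but do not send) an ESearch URL for an Entrez database."""
    params = {
        "db": database,         # Entrez database name, e.g. "pubmed"
        "term": term,           # free-text search term
        "retmax": max_results,  # maximum number of IDs to return
        "retmode": "json",      # request JSON instead of the default XML
    }
    return EUTILS_ESEARCH + "?" + urlencode(params)


if __name__ == "__main__":
    # Paste the printed URL into a browser to inspect the raw response.
    print(build_esearch_url("pubmed", "BRCA1 breast cancer"))
```

Pasting the resulting URL into a browser should return a list of matching record IDs, which is one way to compare what a web search and a programmatic search over the same database return.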

In particular, the in-class activity will focus on the UCSC Genome Browser

http://genome.ucsc.edu/

a portal that uses a track-based system to summarize information from many databases of genomic sequence and annotations.

Spend 5 minutes on your own exploring this rich resource.