Pre-lab: Stats and Genomic Databases

I. Statistics Primer and Review

Often in bioinformatics, we need to remind ourselves to think somewhat in statistical or probabilistic terms.

If you need a refresher, head to Canvas and review the slides and recorded lecture "Statistical Inference in Bioinformatics" (recorded by Shane Jensen, statistics). This is a very cursory review, not designed to be comprehensive. Then, answer the following questions.

Q1. Describe, in statistical terms, the concept of a "null hypothesis". What are some ways a null hypothesis can be utilized?

The null hypothesis is the default hypothesis that there is no effect, no change, or no relationship between the two samples being tested. We use it as the baseline against which the data are judged: based on a test statistic and its p-value, we either reject the null hypothesis or fail to reject it, and in this way decide whether the data provide evidence of a relationship between the two samples.

Q2. What is a test statistic, and how is it used?

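A test statistic is a single number computed from the sample data that measures how far the data depart from what we would expect if the null hypothesis were true. From it we obtain a p-value: the probability, when the null hypothesis is true, of seeing a test statistic at least as extreme as the one observed. If the p-value is smaller than the chosen significance level alpha, we reject the null hypothesis; if it is larger, we fail to reject the null hypothesis.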
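
To make the link between a test statistic and a p-value concrete, here is a minimal sketch of a two-sided permutation test, which is one of many possible tests and not necessarily the one used in lecture. The test statistic is the difference in group means. The function name permutation_test and the data values are invented for illustration only.

```python
import random
from statistics import mean

def permutation_test(a, b, n_perm=10000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    observed = mean(a) - mean(b)          # the test statistic
    pooled = list(a) + list(b)
    n_a = len(a)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)               # relabel samples as if the null were true
        stat = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(stat) >= abs(observed):    # at least as extreme as what we observed
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly above zero
    p_value = (extreme + 1) / (n_perm + 1)
    return observed, p_value

# Hypothetical measurements, invented purely for illustration
control = [4.1, 3.9, 4.3, 4.0, 4.2, 3.8]
treated = [5.0, 5.2, 4.9, 5.3, 5.1, 4.8]
alpha = 0.05

stat, p = permutation_test(control, treated)
print(f"statistic = {stat:.3f}, p-value = {p:.4f}, reject null: {p < alpha}")
```

Because the two made-up groups do not overlap at all, very few random relabelings produce a difference as large as the observed one, so the p-value falls well below alpha.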

Q3. If we reject the null hypothesis at alpha = 5% level, what does that mean?

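It means the p-value was below 0.05, so we reject the null hypothesis in favor of the alternative hypothesis. Alpha is the Type I error rate we are willing to accept: if the null hypothesis were actually true, a test run this way would wrongly reject it 5% of the time. It does not mean that there is a 5% probability that the null hypothesis is true.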
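
One way to see what alpha = 5% controls is to simulate many experiments in which the null hypothesis is true by construction and count how often a test rejects it anyway. The sketch below assumes a two-sample z-test with a known standard deviation of 1, chosen only because it needs nothing beyond the Python standard library; the sample size and number of experiments are arbitrary.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def z_test_p_value(a, b, sigma=1.0):
    """Two-sided two-sample z-test p-value, assuming a known common sigma."""
    se = sigma * sqrt(1 / len(a) + 1 / len(b))
    z = (fmean(a) - fmean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

rng = random.Random(42)
alpha = 0.05
n_experiments = 5000
false_rejections = 0

for _ in range(n_experiments):
    # Both samples come from the same distribution, so the null is true
    a = [rng.gauss(0, 1) for _ in range(20)]
    b = [rng.gauss(0, 1) for _ in range(20)]
    if z_test_p_value(a, b) < alpha:
        false_rejections += 1     # a Type I error

rate = false_rejections / n_experiments
print(f"fraction of true nulls rejected: {rate:.3f}")
```

The printed fraction should land close to 0.05. That is the long-run rate of false rejections when the null is true, which is a different quantity from the probability that any one rejected null is actually true.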

Q4. Explain the issue of multiple hypothesis testing, and the implication. Give two statistical procedures to address this concern, and describe how to employ them.

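If we set alpha = 5% for each single test when testing many hypotheses, the probability of making at least one false rejection is much higher than 5%. For example, with 20 independent tests it is 1 - 0.95^20, or about 64%. The implication is that ordinary single-test procedures produce many false positives when applied to multiple hypothesis testing. Two procedures that address this problem are the Bonferroni correction and False Discovery Rate (FDR) control. For Bonferroni, when testing m hypotheses, reject only those with a p-value at or below alpha/m; the probability of making even one false rejection across all tests (the family-wise error rate) is then at most alpha. For FDR, we instead control the expected proportion of rejected hypotheses that are false positives, which is less stringent than Bonferroni. A common way to do this is the Benjamini-Hochberg procedure: sort the m p-values from smallest to largest, find the largest rank k whose p-value is at or below (k/m) * alpha, and reject the k hypotheses with the smallest p-values.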
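
The two corrections can be sketched in a few lines of Python. This is a minimal illustration, assuming the Benjamini-Hochberg step-up procedure as the FDR method; the function names and the p-values are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject hypotheses whose p-value is at or below alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure for FDR control."""
    m = len(p_values)
    # Indices of the hypotheses, ordered from smallest to largest p-value
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k_max = rank          # largest rank that passes its threshold
    reject = [False] * m
    for i in order[:k_max]:       # reject the k_max smallest p-values
        reject[i] = True
    return reject

# Hypothetical p-values from six tests, invented for illustration
p_values = [0.041, 0.001, 0.74, 0.012, 0.20, 0.008]

print("Bonferroni:        ", bonferroni(p_values))
print("Benjamini-Hochberg:", benjamini_hochberg(p_values))
```

On these made-up values Bonferroni rejects two hypotheses (p = 0.001 and p = 0.008), while Benjamini-Hochberg also rejects the one with p = 0.012, which shows how FDR control is the less stringent of the two.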

II. Databases

One of the first steps in any computational project is to determine what data already exists that one can utilize to address scientific questions or gather information. Take 10 minutes to web browse and investigate one or more of the databases given below.

** List of Genomic Databases **

NCBI Entrez - http://www.ncbi.nlm.nih.gov/sites/gquery - huge database that encompasses other databases, including:

PubMed for Journal Articles - http://www.ncbi.nlm.nih.gov/pubmed/
GenBank for Raw Sequence - http://www.ncbi.nlm.nih.gov/genbank/
RefSeq for Non-Redundant Sequence - http://www.ncbi.nlm.nih.gov/RefSeq/
OMIM for Genetic Diseases - http://www.ncbi.nlm.nih.gov/omim?db=omim
dbSNP for Polymorphisms - http://www.ncbi.nlm.nih.gov/snp?db=snp
GEO for Gene Expression Data - http://www.ncbi.nlm.nih.gov/geo/

ExPASy - http://expasy.org/ - Another large database encompassing other databases:

UniProt for Protein Sequence/Annotation - http://www.uniprot.org/
PROSITE for Protein Sequence Patterns - http://prosite.expasy.org/
ENSEMBL - http://useast.ensembl.org/index.html - An alternative to RefSeq and UniProt
GeneCards - http://www.genecards.org/ - Gene-centered portal to information from many other databases
ENCODE - http://www.genome.gov/10005107 - Encyclopedia of DNA Elements
HapMap - http://hapmap.ncbi.nlm.nih.gov/ - Database of human variation across populations
ExAC - http://exac.broadinstitute.org - Database of human coding mutational variation across populations
Gene Ontology (GO) - http://www.geneontology.org/ - Hierarchy of gene annotations
MGED - http://www.mged.org/ - Database of gene expression/microarray results

This list is by no means complete; for more databases, see the most recent Database Summary Paper Alpha List: http://www.oxfordjournals.org/nar/database/a/

In particular, the in-class activity will focus on the UCSC Genome Browser

http://genome.ucsc.edu/

a portal that uses a track-based system to summarize information from many databases of genomic sequence and annotations.

Spend 5 minutes on your own exploring this rich resource.