Kernel: Python 2 (SageMath)

Regex – Introduction

  • Keep it simple: Operators

  • Example:

    • Find EcoRI restriction enzyme site in sequence:

string = "TGCATAGCGAATTCGGACGT"
"GAATTC" in string
True
  • Find Eco13kl restriction site in sequence

  • CCNGG --> CCAGG, CCCGG, CCGGG or CCTGG

string = "CCTGGAGCCCAGGGGACGT"
"CCNGG" in string
False
string = "CCTGGAGCCCAGGGGACGT"
"CCAGG" in string
"CCTGG" in string
"CCCGG" in string
"CCGGG" in string

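The four separate membership tests can be folded into one check. A minimal sketch, using only the in operator; the names variants and found are illustrative and not part of the original notebook:

```python
string = "CCTGGAGCCCAGGGGACGT"

# Expand the ambiguous base N in CCNGG into its four concrete sites.
variants = ["CC" + base + "GG" for base in "ACGT"]

# For each concrete site, record whether it occurs in the sequence.
found = {site: site in string for site in variants}

# print(x) with a single argument works in Python 2 and Python 3.
print(found)
```

This works, but the list of variants grows quickly when a site contains several ambiguous bases, which is where regular expressions become more convenient.
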
Regex – re.findall – exercise

  • Use re.findall to find:

EcoRI_site = "GAATTC"

  • sequence = "TGCATAGCGAATTCGAGCGT"

AG_nucl = "AG"

  • sequence = "TGCATAGCGAATTCGAGCGT"

Eco13kl_site = "CCNGG"

  • sequence = "CCTGGAGCCCAGGGAGCGT"

# Type here your code
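
For reference, re.findall(pattern, string) returns a list with every non-overlapping match. A small sketch on a made-up sequence (toy is hypothetical and is not one of the exercise sequences):

```python
import re

# Hypothetical toy sequence containing the site AAGCTT twice.
toy = "ATGAAGCTTAAGCTT"

site_hits = re.findall("AAGCTT", toy)   # literal pattern, two matches
ag_hits = re.findall("AG", toy)         # a shorter literal pattern
n_hits = re.findall("AANCTT", toy)      # N is taken literally, so no match

# Matches never overlap: "AAAA" contains "AA" only twice, not three times.
overlap_hits = re.findall("AA", "AAAA")

print(site_hits)
print(len(ag_hits))
print(n_hits)
```
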

Regex – metacharacters

Metacharacters are characters with a special meaning in a regex: instead of matching themselves, they stand for one or more characters, or for a position, that you want to search for in a string.

Some examples of metacharacters:

  • ^ Matches beginning of line

  • $ Matches end of line

  • . Matches any single character except newline

  • [...] Matches any single character in brackets

  • [^...] Matches any single character not in brackets

  • a | b Matches either a or b

  • Now repeat the Eco13kl_site question using [...]

# Type here your code
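
A short sketch of how the metacharacters listed above behave, on a made-up sequence (toy is hypothetical, chosen so that each pattern gives a different result):

```python
import re

# Hypothetical toy sequence, not one of the exercise sequences.
toy = "ACCGGGTTCCAGGA"

bracket_hits = re.findall("CC[ACGT]GG", toy)   # one base out of A, C, G, T
dot_hits = re.findall("CC.GG", toy)            # any single character
negated_hits = re.findall("CC[^G]GG", toy)     # any character except G
either_hits = re.findall("CCAGG|CCTGG", toy)   # one alternative or the other
start_hits = re.findall("^CC.GG", toy)         # must start at the beginning
end_hits = re.findall("GA$", toy)              # must end at the end

print(bracket_hits)
print(negated_hits)
print(start_hits)
```
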

Regex – exercise

Explore the regex pairs listed below using the script that follows. For each pair, try to find out what the difference is and why:

  1. CC vs ^CC

  2. G*G vs G.*G

  3. GT$ vs GT

  4. [AC] vs [^AC]

  5. GAG|GAC vs CAG|GAG

  6. TGA|TGG vs TG[AG]

  7. CC* vs CC+

  8. CC{1,2} vs CC{1,}

  9. \w\w\w vs \w\w\s

  10. \d\d\S vs \d\d\D

import re

line = "CCTGGAG123CCCCAGGTGACGT\nTGT"
# Replace REGEX with each pattern from the list above.
find_output = re.findall("REGEX", line)
print find_output
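
To compare the two patterns of a pair side by side, a small helper can save typing. A sketch, assuming a hypothetical function name compare that is not part of the original notebook, shown here on pair 7:

```python
import re

line = "CCTGGAG123CCCCAGGTGACGT\nTGT"

def compare(pattern_a, pattern_b, text):
    """Return the re.findall results of two patterns as a pair of lists."""
    return re.findall(pattern_a, text), re.findall(pattern_b, text)

# CC* is a C followed by zero or more C; CC+ needs at least one extra C.
star_hits, plus_hits = compare("CC*", "CC+", line)

print(star_hits)
print(plus_hits)
```

The lone C near the end of the first line is matched by CC* but not by CC+, which is the whole difference between the two quantifiers.
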

Regex - Raw string notation

>>> find_output = re.findall("\\\\",line)

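Without raw string notation, Python processes the escapes first: each \\ in the string literal becomes one \, so the regex engine receives \\ and searches for a single literal backslash.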

>>> find_output = re.findall(r"\\",line)

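With raw string notation, Python leaves the backslashes alone: the regex engine receives \\ exactly as written and again searches for a single literal backslash.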

line = "this\nis\na\\ntest"
find_output = re.findall("\\\\", line)
print find_output

line = "this\nis\na\\ntest"
find_output = re.findall(r"\\", line)
print find_output
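
That both notations describe the same pattern can be checked directly. A minimal sketch (the names plain and raw are illustrative):

```python
import re

line = "this\nis\na\\ntest"

plain = "\\\\"   # four typed backslashes, two after Python's escape processing
raw = r"\\"      # two typed backslashes, kept as they are

# Both literals are the same two-character string: backslash, backslash.
same_pattern = (plain == raw)
pattern_length = len(plain)

# The regex engine reads that pair as "one literal backslash".
hits = re.findall(raw, line)

print(same_pattern)
print(pattern_length)
print(len(hits))
```

The line contains two real newlines and one literal backslash (followed by the letter n), so the search finds exactly one match.
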

Regex – other “problems” with strings

Execute the regex below. What does it find?

line = "CctGGAGccCAggGGacGT"
find_output = re.findall("CC[ACTG]GG", line)
print find_output

Regex – FLAGS – exercise

Now try the ignore-case flag (re.I, also written re.IGNORECASE). What does it find now?

Remember that you can also always use string.upper() or .lower()

# Type here your code
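
A sketch of how the flag changes matching, on a made-up mixed-case sequence (toy is hypothetical, not the exercise line):

```python
import re

# Hypothetical toy sequence with mixed upper and lower case.
toy = "ccTGGaCCagg"

# Without a flag the pattern is case sensitive.
strict_hits = re.findall("CC[ACTG]GG", toy)

# With re.IGNORECASE both the literals and the bracket set ignore case;
# the matches keep the case they have in the input.
relaxed_hits = re.findall("CC[ACTG]GG", toy, flags=re.IGNORECASE)

# Converting the input first gives the same sites, but in upper case.
upper_hits = re.findall("CC[ACTG]GG", toy.upper())

print(strict_hits)
print(relaxed_hits)
print(upper_hits)
```
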

Regex – FLAGS – exercise – re.S, re.M

  • Apply re.S on the example below

line = "CCTGGAGCCC\nAGGGGACGT"
find_output = re.findall("CC.AGG", line)
print find_output

Regex – FLAGS – exercise – re.S, re.M

  • Apply re.M on the example below, and after that combine both the re.S and re.M flag on this example.

line = "a\nmultiline test\nto\ntest the multi\nline flag"
find_output = re.findall("^test.*", line)
print find_output
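
As a reminder of what the two flags do: re.S (re.DOTALL) lets "." also match a newline, and re.M (re.MULTILINE) lets "^" and "$" match at every line start and end. A sketch on a made-up three-line string (text is hypothetical); flags are combined with "|":

```python
import re

# Hypothetical toy string with three lines: GAT, TACA and GAC.
text = "GAT\nTACA\nGAC"

dot_default = re.findall("T.T", text)            # "." stops at the newline
dot_dotall = re.findall("T.T", text, re.S)       # "." may be the newline

caret_default = re.findall("^GA.", text)         # "^" only at string start
caret_multiline = re.findall("^GA.", text, re.M) # "^" at every line start

line_only = re.findall("^TA.*", text, re.M)          # ".*" ends at the newline
to_the_end = re.findall("^TA.*", text, re.M | re.S)  # ".*" crosses newlines

print(dot_dotall)
print(caret_multiline)
print(to_the_end)
```
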

Regex – re.sub – exercise

  • In the example below correct the sentence using re.sub

line = "the hedgehog is teh most dangerous animal in teh world"
# Type here your code

Regex – re.sub – exercise

  • In the example below replace the two “colors” by red using re.sub and regex

line = "My computer should be grey and my car should also be gray"
# Type here your code
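
For reference, re.sub(pattern, replacement, string) returns a new string in which every match is replaced; the optional count argument limits the number of replacements. A sketch on made-up sentences (not the exercise lines):

```python
import re

# Hypothetical example sentences.
text = "I recieve what you recieve"

fixed_all = re.sub("recieve", "receive", text)             # every match
fixed_first = re.sub("recieve", "receive", text, count=1)  # first match only

# A bracket set covers two spellings with one pattern.
both_spellings = re.sub("analy[sz]e", "study", "analyse or analyze")

print(fixed_all)
print(fixed_first)
print(both_spellings)
```

The original string is never modified; re.sub always returns a new one.
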

Regex – re.split – exercise

  • Split the line below on the numbers, including the spaces around them

  • What happened to the spaces and the numbers within the output?

line = "You 1 should pay attention 2 will pay attention 3 and otherwise you will fail"
# Type here your code
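
For reference, re.split(pattern, string) cuts the string at every match and drops the matched text. A sketch on a made-up line (text is hypothetical), comparing a pattern with and without the surrounding spaces:

```python
import re

# Hypothetical toy line, not the exercise line.
text = "alpha 1 beta 2 gamma"

# Split on the digit only: the spaces stay attached to the pieces.
digit_only = re.split(r"\d", text)

# Split on space, digit, space: the separator is removed completely.
with_spaces = re.split(r"\s\d\s", text)

print(digit_only)
print(with_spaces)
```
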

Regex – re.split – groups – exercise

  • In the previous exercise you could split the line; however, the numbers and spaces themselves were "lost".

  • To keep the split parts of the string we can use groups

Exercise:

  • Split the line again, but now use "(\s*\d\s*)". What happens?

  • And what happens if you use "(\s*)(\d)(\s*)"?

line = "You 1 should pay attention 2 will pay attention 3 and otherwise you will fail"
# Type here your code

line = "You 1 should pay attention 2 will pay attention 3 and otherwise you will fail"
# Type here your code
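
The general rule: when the split pattern contains groups, the text captured by each group is returned in the result list as well. A sketch on a made-up line (text is hypothetical):

```python
import re

# Hypothetical toy line, not the exercise line.
text = "alpha 1 beta 2 gamma"

# No group: the separators are dropped.
no_group = re.split(r"\s\d\s", text)

# One group around the whole separator: each separator is kept as one item.
one_group = re.split(r"(\s\d\s)", text)

# Three groups: each separator is kept as three separate items.
three_groups = re.split(r"(\s)(\d)(\s)", text)

print(no_group)
print(one_group)
print(three_groups)
```
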

Regex – re.sub – groups – exercise

  • Groups are also very handy for substitutions

  • See what happens when you use grouping on the line below:

  • \g<1> stands for group 1 = the first group between ()

  • \g<2> stands for group 2, etc.

line = "My computer should be grey and my car should also be gray"
# Raw strings keep \g<1> intact for the regex engine.
find_output = re.sub(r"(gr[ae]y)", r"\g<1>blue", line)
print find_output
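
A sketch of what group references can do in a replacement, on made-up inputs (the variable names are illustrative): re-inserting a captured part, swapping two parts, and why the \g<1> form is safer than \1 when a digit follows.

```python
import re

# Mark the EcoRI cut position G|AATTC by putting "|" between two groups.
sequence = "TGCATAGCGAATTCGAGCGT"
marked = re.sub(r"(G)(AATTC)", r"\g<1>|\g<2>", sequence)

# Swap two captured words.
swapped = re.sub(r"(\w+) (\w+)", r"\g<2> \g<1>", "hello world")

# Append a literal 0 after group 1. Written as \10 this would be read as
# a reference to group 10, so the \g<1> form is needed here.
padded = re.sub(r"(\d)", r"\g<1>0", "a1b2")

print(marked)
print(swapped)
print(padded)
```
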

Regex – re.sub – groups – exercise

  • Try to understand what happens in the example below

line = "My computer should be grey and my car should also be gray"
find_output = re.sub(r"(gr[ae]y)(\D*)(gr[ae]y)", r"\g<1>blue\g<2>not\g<3>black", line)
print find_output

Regex – re.search

  • example:

line = "TGCATAGCGAATTCGAGCGT"
match_output = re.search("GAATTC", line)
print match_output

if match_output:
    print "GAATTC site found!"
    print match_output.group()
    print match_output.start()
    print match_output.end()
    print match_output.span()
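
Two details worth knowing about re.search: it returns None when nothing matches, and it reports only the first match. To get the position of every match, re.finditer yields one match object per hit. A sketch on the same sequence (the variable names are illustrative):

```python
import re

line = "TGCATAGCGAATTCGAGCGT"

# No match: re.search returns None, which is why the "if" test is needed.
missing = re.search("CCCGGG", line)

# Only the first AG is reported by re.search.
first_span = re.search("AG", line).span()

# re.finditer gives a match object for every non-overlapping AG.
all_spans = [m.span() for m in re.finditer("AG", line)]

print(missing)
print(first_span)
print(all_spans)
```
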

Regex – final exercise

  • We are going to digest the DNA sequence below with two restriction enzymes

    • BamHI G|GATCC

    • AccI GT|MKAC (M=A/C, K=G/T)

  • It is forbidden to use str.split(), str.lower(), str.upper()!

Q1: How many times is the recognition site of each restriction enzyme found?

Q2: After digestion, how many DNA fragments are there and what is the length of each product (provide a list)?

Challenge:

Try to answer the questions in as few lines as possible: use groups and nesting

dna = "CGTGACCTTGGACCTCACTCACCATGTAGTACTCCTCTGAGAGGAATTGTACTAGAGGTGAAAACCGATAAGAAATCACAGTCTGATATGCGTGTGTGTCGACATGCATAATGTATACCCCTTACTGAGTCGTATGGGAATATCCGGCATGACGGGAGAAGCCGTAGACCAAAGGTGTGAGTGAGCATCGTTGTGAACAGTCTGGGTAAACGCGCATATGTAATGTAGTGGATCCTGACACACTCTGGACAAGGGCTCTCTGGGGAACTTGATTTTACTAATGGACTCCAAGAAGCGACGCGCACTCGGTTATGGCGCGCACACTAAAGCGAGGGATCCTAAAAGCTCATGAAGAGGTTCGATCGCTGACTAGTATGGTTATACCCGACACCGCACTGTCGCGTAGACCGCTCCTAGGATTAAATGATCACCCGCACATTGATGCGCGCGTTGCGGGTGAAAGTAGTGAACCCAAGAGTACTTGCCCGTCCGTGGCTCTAGCGTGCATACGTTACATTTTGACGCCTAAAGGTGTCTTGTCAGAGCACGTCCGGGCACAGTAGCAGATACCGGATATCTCATACGTCCGGAGCAGCGCGCGTACTCAAAGTGTGCCCAAGCTCGCATCCGAATTCGGATCCTGCCTTGCTCCCCTACACAAACTATCACGAATAAGCGCATATAAAGCGTCCACCACCTGTAACTTTACTGACCAAAGCATGTCGAGGCGATTAAAGTGGCCGTATGGACATCACAGCCCGTGCCCGACCATTATTAGCGCCGCTACTTCTCCGCGCGCATGTTGACGCTTCTGATGTAGGGTGTGCGGGTCCCAATTGATATATTTATTCGGAGTTACAAAACTGGTACAGAGGCTGTCCGTGCTCTA"
# Type here your code
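
One possible way to structure a digestion, shown on a made-up sequence rather than on the exercise DNA. This is only a sketch under stated assumptions: toy, enzymes, cuts and fragments are hypothetical names, and the approach (collect cut positions with re.finditer, then slice) is one option among several.

```python
import re

# Hypothetical toy sequence with one G|GATCC site and one GT|MKAC site.
toy = "AAGGATCCTTGTCGACAA"

# Pattern -> offset of the cut inside the match
# (G|GATCC cuts after 1 base, GT|MKAC after 2; M = A/C, K = G/T).
enzymes = {"GGATCC": 1, "GT[AC][GT]AC": 2}

# Absolute cut positions of all sites, in sequence order.
cuts = sorted(
    m.start() + offset
    for pattern, offset in enzymes.items()
    for m in re.finditer(pattern, toy)
)

# Slice between consecutive cut positions to obtain the fragments.
edges = [0] + cuts + [len(toy)]
fragments = [toy[a:b] for a, b in zip(edges, edges[1:])]
lengths = [len(f) for f in fragments]

print(cuts)
print(fragments)
print(lengths)
```

Note that re.finditer reports non-overlapping matches only, so overlapping recognition sites would need extra care.
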