Regex – Introduction
Keep it simple: Operators
Example:
Find EcoRI restriction enzyme site in sequence:
Find Eco13kl restriction site in sequence
CCNGG --> CCAGG, CCCGG, CCGGG or CCTGG
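For a fixed site, plain string operators are often enough. A minimal sketch, assuming the two sequences used in the findall exercise; the variable names are illustrative only:

```python
sequence = "TGCATAGCGAATTCGAGCGT"

# EcoRI has one fixed site, so the "in" operator is sufficient.
ecori_found = "GAATTC" in sequence

# CCNGG stands for four concrete sites, so each one has to be tested in turn.
eco13kl_variants = ["CCAGG", "CCCGG", "CCGGG", "CCTGG"]
sequence2 = "CCTGGAGCCCAGGGAGCGT"
eco13kl_hits = [site for site in eco13kl_variants if site in sequence2]

print(ecori_found)
print(eco13kl_hits)
```

The second search already needs a list and a loop, which is where a regex becomes the simpler tool.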
Regex – re.findall – exercise
Use re.findall to find:
EcoRI_site = "GAATTC"
sequence = "TGCATAGCGAATTCGAGCGT"
AG_nucl = "AG"
sequence = "TGCATAGCGAATTCGAGCGT"
Eco13kl_site = "CCNGG"
sequence = "CCTGGAGCCCAGGGAGCGT"
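One possible way to approach the three searches, shown as a sketch rather than the intended solution. The second sequence is renamed `eco13kl_sequence` here (an assumption) so that both sequences can live in one script:

```python
import re

EcoRI_site = "GAATTC"
sequence = "TGCATAGCGAATTCGAGCGT"
ecori_hits = re.findall(EcoRI_site, sequence)

AG_nucl = "AG"
ag_hits = re.findall(AG_nucl, sequence)

# "N" is an ordinary letter to the regex engine, so the literal pattern finds nothing.
Eco13kl_site = "CCNGG"
eco13kl_sequence = "CCTGGAGCCCAGGGAGCGT"
literal_hits = re.findall(Eco13kl_site, eco13kl_sequence)

# Without metacharacters, each of the four concrete sites is searched separately.
eco13kl_hits = []
for site in ["CCAGG", "CCCGG", "CCGGG", "CCTGG"]:
    eco13kl_hits += re.findall(site, eco13kl_sequence)

print(ecori_hits, ag_hits, literal_hits, eco13kl_hits)
```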
Regex – metacharacters
Metacharacters are characters with a special meaning in a pattern: instead of matching themselves, they stand for one or more characters (or a position) you want to search for in a string.
Some examples of metacharacters:
^ Matches beginning of line
$ Matches end of line
. Matches any single character except newline
[...] Matches any single character in brackets
[^...] Matches any single character not in brackets
a | b Matches either a or b
Now repeat the Eco13kl_site question using [...]
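A short sketch of how some of these metacharacters behave on the Eco13kl sequence from the findall exercise; the result variable names are illustrative:

```python
import re

sequence = "CCTGGAGCCCAGGGAGCGT"

# [ACGT] accepts exactly one nucleotide at the N position.
class_hits = re.findall("CC[ACGT]GG", sequence)

# "." also works here, but it would accept any character, not only nucleotides.
dot_hits = re.findall("CC.GG", sequence)

start_hits = re.findall("^CC", sequence)   # CC only at the very beginning
end_hits = re.findall("GT$", sequence)     # GT only at the very end
not_acg = re.findall("[^ACG]", sequence)   # every character that is not A, C or G

print(class_hits, dot_hits, start_hits, end_hits, not_acg)
```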
Regex – exercise
Explore the regex pairs listed here using the script below, and try to find out how the two patterns in each pair differ and why:
CC vs ^CC
G*G vs G.*G
GT$ vs GT
[AC] vs [^AC]
GAG|GAC vs CAG|GAG
TGA|TGG vs TG[AG]
CC* vs CC+
CC{1,2} vs CC{1,}
\w\w\w vs \w\w\s
\d\d\S vs \d\d\D
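A possible exploration script, sketched under assumptions: the helper `compare` and the test string `line` are hypothetical and not part of the original exercise. Swap in any string you like.

```python
import re

def compare(pattern_a, pattern_b, text):
    """Return the re.findall results of two patterns on the same text."""
    return re.findall(pattern_a, text), re.findall(pattern_b, text)

# Hypothetical test line; change it to probe each pattern pair.
line = "CCGAGTGACCC 12 GT"

pairs = [
    ("CC", "^CC"),
    ("G*G", "G.*G"),
    ("GT$", "GT"),
    ("[AC]", "[^AC]"),
    ("GAG|GAC", "CAG|GAG"),
    ("TGA|TGG", "TG[AG]"),
    ("CC*", "CC+"),
    ("CC{1,2}", "CC{1,}"),
    (r"\w\w\w", r"\w\w\s"),
    (r"\d\d\S", r"\d\d\D"),
]

for pattern_a, pattern_b in pairs:
    hits_a, hits_b = compare(pattern_a, pattern_b, line)
    print(f"{pattern_a!r:>12} -> {hits_a}")
    print(f"{pattern_b!r:>12} -> {hits_b}")
```

Printing both results side by side makes it easier to see which characters each variant does or does not consume.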
Regex – Raw string notation
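>>> find_output = re.findall("\\\\",line)
Without raw string notation every backslash has to be escaped twice: Python first reduces "\\\\" to \\, and the regex engine then reads \\ as one literal backslash (\).
>>> find_output = re.findall(r"\\",line)
With raw string notation Python leaves the backslashes untouched, so the regex engine receives the same \\ and again searches for one literal backslash (\).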
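The equivalence of the two notations can be checked directly. A small sketch; the content of `line` is a made-up example:

```python
import re

# Hypothetical line containing two literal backslashes.
line = "C:\\data\\seq.fasta"

normal = re.findall("\\\\", line)   # Python string "\\\\" is the 2-character pattern \\
raw = re.findall(r"\\", line)       # raw string r"\\" is the same 2-character pattern

print(normal == raw)
print(len("\\\\"), len(r"\\"))
```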
Regex – other “problems” with strings
Execute the regex below. What does it find?
Regex – FLAGS – exercise
Now try to use the ignore-case flag (re.I / re.IGNORECASE). What does it find now?
Remember that you can also always convert the string first with str.upper() or str.lower().
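A minimal sketch of the difference the flag makes, using a made-up line with mixed-case EcoRI sites (the string is an assumption, not the original exercise data):

```python
import re

# Hypothetical input with the same site in three different casings.
line = "gaattc GAATTC GaAtTc"

case_sensitive = re.findall("GAATTC", line)
ignore_case = re.findall("GAATTC", line, re.IGNORECASE)   # re.I is the short form

# Alternative without a flag: normalise the string before searching.
upper_hits = re.findall("GAATTC", line.upper())

print(case_sensitive, ignore_case, upper_hits)
```

Note that the flag returns the matches in their original casing, whereas converting the string first does not.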
Regex – FLAGS – exercise – re.S, re.M
Apply re.S on the example below
Regex – FLAGS – exercise – re.S, re.M
Apply re.M on the example below, and after that combine both the re.S and re.M flag on this example.
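The effect of the two flags can be sketched on a small multi-line string. The FASTA-like `text` below is a hypothetical stand-in for the example on the slide:

```python
import re

# Hypothetical two-record, FASTA-like text.
text = ">seq1\nACGT\n>seq2\nGGCC"

# No flags: ^ matches only at the start of the string, "." stops at a newline.
plain = re.findall("^>.*", text)

# re.M (MULTILINE): ^ also matches directly after every newline.
multi = re.findall("^>.*", text, re.M)

# re.S (DOTALL): "." also matches newlines, so .* runs to the end of the text.
dotall = re.findall("^>.*", text, re.S)

# Both flags combined with |: the greedy .* still swallows everything from the first ">".
both = re.findall("^>.*", text, re.S | re.M)

print(plain, multi, dotall, both, sep="\n")
```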
Regex – re.sub – exercise
In the example below, correct the sentence using re.sub.
Regex – re.sub – exercise
In the example below, replace the two “colors” with red using re.sub and a regex.
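For reference, the general shape of a re.sub call is re.sub(pattern, replacement, string). A sketch on a made-up sentence (not the one from the exercise):

```python
import re

# Hypothetical sentence with two color words.
sentence = "The sky is blue and the grass is green"

# One pattern with alternation replaces both words in a single call.
all_red = re.sub("blue|green", "red", sentence)

# The optional count argument limits how many matches are replaced.
first_only = re.sub("blue|green", "red", sentence, count=1)

print(all_red)
print(first_only)
```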
Regex – re.split – exercise
Split the line below on the numbers, including the spaces around them.
What happened to the spaces and the numbers within the output?
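A minimal sketch of such a split, assuming a made-up line in which single digits separate words:

```python
import re

# Hypothetical line; the digits and surrounding spaces act as separators.
line = "exon 1 intron 2 exon"

# \s*\d\s* matches one digit plus any whitespace on either side of it.
parts = re.split(r"\s*\d\s*", line)

print(parts)
```

Everything the pattern matched is treated as a separator and is therefore absent from the resulting list.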
Regex – re.split – groups – exercise
In the previous exercise you could split the line, but the numbers and spaces themselves were "lost".
To keep the parts of the string that were split on, we can use groups.
Exercise:
Split the line again, but now use "(\s*\d\s*)". What happens?
And what happens if you use "(\s*)(\d)(\s*)"?
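The behaviour of groups in re.split can be sketched on a made-up line (the string is an assumption, not the exercise data):

```python
import re

# Hypothetical line; the digits and surrounding spaces act as separators.
line = "exon 1 intron 2 exon"

# One group: each separator is kept as a single list element.
one_group = re.split(r"(\s*\d\s*)", line)

# Three groups: each separator is kept as three elements, one per group.
three_groups = re.split(r"(\s*)(\d)(\s*)", line)

print(one_group)
print(three_groups)
```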
Regex – re.sub – groups – exercise
Groups are also very handy for substitutions.
See what happens when you use grouping on the line below:
\g<1> stands for group 1 = the first group between ()
\g<2> stands for group 2, etc.
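A short sketch of group references in a replacement string. The first example reuses the EcoRI sequence from the findall exercise; the second string is made up:

```python
import re

sequence = "TGCATAGCGAATTCGAGCGT"

# EcoRI cuts between G and AATTC; the two groups are written back
# with a "|" marker inserted between them.
marked = re.sub(r"(G)(AATTC)", r"\g<1>|\g<2>", sequence)

# Groups can also be written back in a different order.
swapped = re.sub(r"(\w+) (\w+)", r"\g<2> \g<1>", "world hello")

print(marked)
print(swapped)
```

Using a raw string for the replacement keeps the backslash in `\g<1>` from being interpreted by Python itself.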
Regex – re.sub – groups – exercise
Try to understand what happens in the example below
Regex – re.search
example:
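A minimal sketch of re.search, which returns a match object for the first hit, or None when nothing is found. The sequence is the EcoRI one from the findall exercise:

```python
import re

sequence = "TGCATAGCGAATTCGAGCGT"

match = re.search("GAATTC", sequence)
if match:
    # group() is the matched text, start()/end() are its positions in the string.
    print(match.group(), match.start(), match.end())

# A pattern that does not occur gives None instead of a match object.
no_match = re.search("GGATCC", sequence)
print(no_match)
```

Unlike re.findall, re.search stops at the first match, so it is mainly useful when the position of a hit matters.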
Regex – final exercise
We are going to digest the DNA sequence below with two restriction enzymes
BamHI G|GATCC
AccI GT|MKAC (M=A/C, K=G/T)
It is forbidden to use str.split(), str.lower(), str.upper()!
Q1: How many times is each restriction enzyme found?
Q2: After digestion, how many DNA fragments are there and what is the length of each product (provide a list)?
Challenge:
Try to answer the questions in as few lines as possible: use groups and nesting
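One possible approach, sketched on a short made-up sequence rather than the exercise data. It assumes linear DNA and non-overlapping sites; `dna` and all variable names are hypothetical:

```python
import re

# Hypothetical sequence with two G|GATCC sites and one GT|MKAC site.
dna = "AAGGATCCTTGTAGACCCGGATCCAA"

# Q1: count the sites. M = A/C and K = G/T become character classes.
first_enzyme_count = len(re.findall("GGATCC", dna))
acci_count = len(re.findall("GT[AC][GT]AC", dna))

# Q2: mark each cut position with "|" via groups, then split on the marker.
# The two re.sub calls are nested inside re.split, so str.split() is not needed.
fragments = re.split(
    r"\|",
    re.sub(r"(GT)([AC][GT]AC)", r"\g<1>|\g<2>",
           re.sub(r"(G)(GATCC)", r"\g<1>|\g<2>", dna)),
)
lengths = [len(fragment) for fragment in fragments]

print(first_enzyme_count, acci_count)
print(fragments, lengths)
```

If the two recognition sites could overlap, inserting the first marker might hide a site from the second substitution, so this sketch would need to be adapted.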