Regex¶
regular expression = a sequence of characters that specifies a search pattern
presenter: Juliana Gretz
date: 8.10.2021
character¶
command | meaning |
---|---|
. | any character except a newline |
\s | any whitespace character |
\S | any non-whitespace character |
\w | any word character (letter, digit or underscore) |
\W | any non-word character |
\d | any digit |
\D | any non-digit |
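A short sketch of the character classes above, using Python's `re` module. The sample strings are made up for illustration.

```python
import re

sample = "pKa 7.4\tat 25 C"  # hypothetical sample text

# \d matches single digits
digits = re.findall(r"\d", sample)       # ['7', '4', '2', '5']
# \w matches letters, digits and the underscore
words = re.findall(r"\w+", sample)       # ['pKa', '7', '4', 'at', '25', 'C']
# \s matches whitespace such as spaces and tabs
spaces = re.findall(r"\s", sample)       # [' ', '\t', ' ', ' ']
# . matches any character except a newline (unless re.DOTALL is set)
dot_matches = re.findall(r"7.4", "7.4 7x4 7\n4")  # ['7.4', '7x4']
```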
range¶
command | meaning |
---|---|
[abc] | one of the characters a, b or c |
[^abc] | any character except a, b or c |
[a-z] | character in the range a-z |
[^a-z] | character not in the range a-z |
[0-9] | character in the range 0-9 |
[^0-9] | character not in the range 0-9 |
(a|b) | a or b |
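A minimal sketch of sets and ranges on a made-up string. Note that a pipe inside square brackets is matched literally; alternation between expressions is written outside brackets, here with a non-capturing group `(?:...)`.

```python
import re

text = "a|b-C3"  # hypothetical sample text

in_set = re.findall(r"[abc]", text)        # ['a', 'b']
not_in_set = re.findall(r"[^abc]", text)   # ['|', '-', 'C', '3']
lower = re.findall(r"[a-z]", text)         # ['a', 'b']
digits = re.findall(r"[0-9]", text)        # ['3']

# Inside brackets the pipe is an ordinary character, not "or"
bracket_pipe = re.findall(r"[a|b]", text)  # ['a', '|', 'b']
# Alternation between whole expressions
alternation = re.findall(r"(?:psKa|pKa)", "pKa psKa pka")  # ['pKa', 'psKa']
```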
quantifiers¶
command | meaning |
---|---|
a? | zero or one of a |
a* | zero or more of a |
a+ | one or more of a |
a{3} | exactly 3 of a |
a{3,} | 3 or more of a |
a{3,6} | between 3 and 6 of a |
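The quantifiers can be compared side by side with `re.fullmatch`, which only succeeds if the whole string fits the pattern. The helper `fully_matching` and the candidate strings are illustrative, not part of the original notebook.

```python
import re

candidates = ["", "a", "aaa", "aaaaaaa"]  # 0, 1, 3 and 7 times "a"

def fully_matching(pattern):
    """Return the candidates that the pattern matches from start to end."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

optional = fully_matching(r"a?")          # zero or one
zero_or_more = fully_matching(r"a*")      # zero or more
one_or_more = fully_matching(r"a+")       # one or more
exactly_three = fully_matching(r"a{3}")   # exactly 3
three_or_more = fully_matching(r"a{3,}")  # 3 or more
three_to_six = fully_matching(r"a{3,6}")  # between 3 and 6
```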
boundary¶
command | meaning |
---|---|
^ | start of string |
$ | end of string |
\b | word boundary |
\B | position that is not a word boundary |
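A small sketch of the anchors on a made-up string. `\b` and `\B` match positions, not characters, so the example records where each match starts.

```python
import re

text = "pKa psKa pKa2"  # hypothetical sample text

# ^ and $ anchor the pattern to the start and end of the string
starts_with_pKa = bool(re.search(r"^pKa", text))  # True
ends_with_pKa = bool(re.search(r"pKa$", text))    # False, the string ends in "pKa2"

# \b on both sides: "pKa" only as a whole word
whole_words = [m.start() for m in re.finditer(r"\bpKa\b", text)]  # [0]
# \B in front: "Ka" only when it continues a word
inner = [m.start() for m in re.finditer(r"\BKa", text)]           # [1, 6, 10]
```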
group¶
command | meaning |
---|---|
(xxx) | capturing group around the expression xxx |
(?P<name>xxx) | named capturing group around the expression xxx |
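A minimal sketch of how groups are read back from a match object. The group names `row` and `column` and the sample string are assumptions chosen for illustration.

```python
import re

text = "well D03"  # hypothetical sample text

# Numbered groups: group(0) is the whole match, groups() lists the captures
numbered = re.search(r"([A-Z])(\d{2})", text)
whole = numbered.group(0)          # 'D03'
row, column = numbered.groups()    # 'D', '03'

# Named groups: access by name or as a dictionary
named = re.search(r"(?P<row>[A-Z])(?P<column>\d{2})", text)
row_by_name = named.group("row")   # 'D'
as_dict = named.groupdict()        # {'row': 'D', 'column': '03'}
```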
Examples¶
In [1]:
import re
import pandas as pd
In [2]:
datafiles = [
    '20F-22003_DC8_D03_1032246_P1_UV-metric psKa.t3r',
    '20F-29001_DC8_D03_1032246_P1_W_UV-metric psKa.t3r',
    '20F-26012_DC29_F21_1088043_P1_UV-metric psKa.t3r',
    '20G-03010_DC60_L03_1366347_P1_UV-metric pKa.t3r',
    '20G-04001_DC28_F19_1087258_P1_W_UV-metric psKa.t3r',
    '21I-27003_DC245_D13_7018930044_P1_UV-metric psKa.t3r',
    '20H-26004_DC124_J03_1216895_P2_UV-metric pKa.t3r',
    '20I-03024_DC168_L12_105750012_P2_UV-metric psKa.t3r',
    '20J-05001_DC132_K06_1298433_P2_W_UV-metric psKa.t3r',
    '21I-14006_DC218_F07_1056516_P3_UV-metric psKa.t3r',
]
just fitting a regular expression¶
r"\d{2}\w-\d{5}_DC\d{1,3}_\w\d{2}_\d*_P[1-3]_W*_*UV-metric ps*Ka\.t3r"
https://regex101.com/r/K0vABT/4
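To read such a pattern piece by piece, it can be written with `re.VERBOSE`, which ignores whitespace in the pattern and allows comments. The sketch below is a slightly stricter variant, not the exact pattern above: it uses `(?:W_)?` and `ps?Ka`, and the comment describing the leading part of the file name is an interpretation.

```python
import re

FILENAME = re.compile(
    r"""
    \d{2}\w-\d{5}   # leading label, e.g. 20F-22003
    _DC\d{1,3}      # code, e.g. DC8
    _\w\d{2}        # well, e.g. D03
    _\d*            # numeric id
    _P[1-3]         # plate P1, P2 or P3
    _(?:W_)?        # separator, then an optional W_ part
    UV-metric[ ]    # literal text; [ ] protects the space in verbose mode
    ps?Ka           # psKa or pKa
    \.t3r           # file extension, with the dot escaped
    """,
    re.VERBOSE,
)

plain = FILENAME.fullmatch("20F-22003_DC8_D03_1032246_P1_UV-metric psKa.t3r")
with_w = FILENAME.fullmatch("20J-05001_DC132_K06_1298433_P2_W_UV-metric psKa.t3r")
```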
In [3]:
# Keep the matched text (group 0) for every file name
pattern = r"\d{2}\w-\d{5}_DC\d{1,3}_\w\d{2}_\d*_P[1-3]_W*_*UV-metric ps*Ka\.t3r"
df = pd.DataFrame({'filenames': datafiles})
df['match'] = [re.search(pattern, file)[0] for file in df.filenames]
df
creating groups¶
r"\d{2}\w-\d{5}_(DC\d{1,3})_(\w\d{2})_(\d{7,10})_(P[1-3])_(W_)*UV-metric (psKa|pKa)\.t3r"
https://regex101.com/r/UkpsCd/3
In [4]:
df = pd.DataFrame({'filenames': datafiles})
pattern = r"\d{2}\w-\d{5}_(DC\d{1,3})_(\w\d{2})_(\d{7,10})_(P[1-3])_(W_)*UV-metric (psKa|pKa)\.t3r"
df['matches'] = [re.search(pattern, file) for file in df.filenames]
# groups() is 0-indexed, so column group0 holds the first capture group
number_of_groups = len(df.matches[0].groups())
for i in range(number_of_groups):
    df[f'group{i}'] = [match.groups()[i] for match in df.matches]
df
naming groups¶
r"\d{2}\w-\d{5}_(?P<code>DC\d{1,3})_(?P<well_adress>\w\d{2})_(?P<id>\d*)_(?P<plate>P[1-3])_(?P<wdh>W_)*UV-metric (?P<solvent>psKa|pKa)\.t3r"
https://regex101.com/r/IOqHvt/2
In [5]:
df = pd.DataFrame({'filenames': datafiles})
pattern = (r"\d{2}\w-\d{5}_(?P<code>DC\d{1,3})_(?P<well_adress>\w\d{2})_(?P<id>\d*)_(?P<plate>P[1-3])_"
           r"(?P<wdh>W_)*UV-metric (?P<solvent>psKa|pKa)\.t3r")
df['matches'] = [re.search(pattern, file) for file in df.filenames]
# One column per named group; wdh is None when the optional W_ part is absent
for variable in ['code', 'well_adress', 'id', 'plate', 'solvent', 'wdh']:
    df[variable] = [match.group(variable) for match in df.matches]
df
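If some file names might not fit the pattern, `re.search` returns `None` and `match.group(...)` fails. One possible way to guard against that is sketched below. The helper `parse_filename` is hypothetical, and the pattern is a variant of the one above (compiled once, `?` instead of `*` for the optional `W_` part, escaped dot).

```python
import re

# Same named groups as in the notebook pattern, compiled once
PATTERN = re.compile(
    r"\d{2}\w-\d{5}_(?P<code>DC\d{1,3})_(?P<well_adress>\w\d{2})_(?P<id>\d*)"
    r"_(?P<plate>P[1-3])_(?P<wdh>W_)?UV-metric (?P<solvent>psKa|pKa)\.t3r"
)

def parse_filename(filename):
    """Return the named groups as a dict, or None if the name does not fit."""
    match = PATTERN.fullmatch(filename)
    return match.groupdict() if match else None

good = parse_filename("20J-05001_DC132_K06_1298433_P2_W_UV-metric psKa.t3r")
bad = parse_filename("notes.txt")  # hypothetical non-matching name, gives None
```

A list of such dicts (after filtering out the `None` entries) can be passed to `pd.DataFrame` to get one column per named group.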