Regex¶
regular expression = a sequence of characters that specifies a search pattern
presenter: Juliana Gretz
date: 8.10.2021
character¶
| command | meaning |
|---|---|
| . | any character |
| \s | any whitespace character |
| \S | any non-whitespace character |
| \w | any word character (letter, digit or underscore) |
| \W | any non-word character |
| \d | any digit |
| \D | any non-digit |
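A small sketch of how these character classes behave with `re.findall`. The sample string is made up for illustration and is not part of the original data.

```python
import re

sample = "Plate P1, well D03"  # hypothetical sample text

# \d matches one digit at a time
digits = re.findall(r"\d", sample)
# \w+ collects runs of word characters (letters, digits, underscore)
words = re.findall(r"\w+", sample)
# \s matches each whitespace character
spaces = re.findall(r"\s", sample)
# . matches any character except a newline (unless re.DOTALL is set)
triples = re.findall(r"...", "abc\ndef")

print(digits)   # ['1', '0', '3']
print(words)    # ['Plate', 'P1', 'well', 'D03']
print(spaces)   # [' ', ' ', ' ']
print(triples)  # ['abc', 'def']
```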
range¶
| command | meaning |
|---|---|
| [abc] | one character of: a, b or c |
| [^abc] | one character except: a, b or c |
| [a-z] | one character in the range: a-z |
| [^a-z] | one character not in the range: a-z |
| [0-9] | one character in the range: 0-9 |
| [^0-9] | one character not in the range: 0-9 |
| (a\|b) | a or b (inside [ ] the \| is a literal character, so [a\|b] matches a, \| or b) |
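The difference between a character set and alternation is easy to check directly. The following sketch uses a made-up test string; note that inside square brackets the pipe is an ordinary character.

```python
import re

text = "a|b-c9"  # hypothetical test string

# [abc] matches exactly one of the listed characters
in_set = re.findall(r"[abc]", text)
# [^abc] matches any single character that is not listed
not_in_set = re.findall(r"[^abc]", text)
# inside brackets the pipe is literal, so "|" is matched as well
pipe_in_set = re.findall(r"[a|b]", text)
# outside brackets the pipe means alternation: a or b
alternation = re.findall(r"a|b", text)

print(in_set)       # ['a', 'b', 'c']
print(not_in_set)   # ['|', '-', '9']
print(pipe_in_set)  # ['a', '|', 'b']
print(alternation)  # ['a', 'b']
```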
quantifiers¶
| command | meaning |
|---|---|
| a? | zero or one of a |
| a* | zero or more of a |
| a+ | one or more of a |
| a{3} | 3 of a |
| a{3,} | 3 or more of a |
| a{3,6} | between 3 and 6 of a |
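A quick way to try out quantifiers is `re.fullmatch`, which only succeeds when the whole string fits the pattern. The helper name `fits` is hypothetical and only used for this sketch.

```python
import re

def fits(pattern, text):
    """Return True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# ? makes the preceding character optional
print(fits(r"ps?Ka", "pKa"))        # True
print(fits(r"ps?Ka", "psKa"))       # True
print(fits(r"ps?Ka", "pssKa"))      # False
# * accepts zero repetitions, + needs at least one
print(fits(r"a*", ""))              # True
print(fits(r"a+", ""))              # False
# {n}, {n,} and {n,m} count repetitions
print(fits(r"a{3}", "aaa"))         # True
print(fits(r"a{3,}", "aaaaa"))      # True
print(fits(r"a{3,6}", "aaaaaaa"))   # False, seven is too many
```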
boundary¶
| command | meaning |
|---|---|
| ^ | start of string |
| $ | end of string |
| \b | word boundary |
| \B | non-word boundary |
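Boundaries match positions, not characters. A minimal sketch with a made-up string shows how they restrict where a pattern may match.

```python
import re

text = "cat concat cats"  # hypothetical test string

# without boundaries, every occurrence is found
anywhere = re.findall(r"cat", text)
# \b on both sides keeps only the standalone word
whole_word = re.findall(r"\bcat\b", text)
# \B requires that the match does NOT start at a word boundary
inside_word = re.findall(r"\Bcat", text)
# ^ and $ anchor the pattern to the start and end of the string
starts_with = re.search(r"^cat", text) is not None
ends_with = re.search(r"cat$", text) is not None

print(len(anywhere))  # 3
print(whole_word)     # ['cat']
print(inside_word)    # ['cat'] (the one inside "concat")
print(starts_with)    # True
print(ends_with)      # False, the string ends with "cats"
```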
group¶
| command | meaning |
|---|---|
| (xxx) | parentheses around the expression xxx create a numbered group |
| (?P\<name>xxx) | as above, but the group can also be accessed by its name |
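How groups are read back from a match object can be sketched as follows. The input is a shortened filename fragment, and the group name `well` is chosen only for this illustration.

```python
import re

# one numbered group and one named group
match = re.search(r"(DC\d{1,3})_(?P<well>\w\d{2})", "20F-22003_DC8_D03_1032246")

print(match.group(0))       # 'DC8_D03', the whole match
print(match.group(1))       # 'DC8', first group by number
print(match.group("well"))  # 'D03', second group by name
print(match.groups())       # ('DC8', 'D03')
print(match.groupdict())    # {'well': 'D03'}, named groups only
```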
Examples¶
In [1]:
import re
import pandas as pd
In [2]:
datafiles = [
    '20F-22003_DC8_D03_1032246_P1_UV-metric psKa.t3r',
    '20F-29001_DC8_D03_1032246_P1_W_UV-metric psKa.t3r',
    '20F-26012_DC29_F21_1088043_P1_UV-metric psKa.t3r',
    '20G-03010_DC60_L03_1366347_P1_UV-metric pKa.t3r',
    '20G-04001_DC28_F19_1087258_P1_W_UV-metric psKa.t3r',
    '21I-27003_DC245_D13_7018930044_P1_UV-metric psKa.t3r',
    '20H-26004_DC124_J03_1216895_P2_UV-metric pKa.t3r',
    '20I-03024_DC168_L12_105750012_P2_UV-metric psKa.t3r',
    '20J-05001_DC132_K06_1298433_P2_W_UV-metric psKa.t3r',
    '21I-14006_DC218_F07_1056516_P3_UV-metric psKa.t3r',
]
just matching a regular expression¶
"\d{2}\w-\d{5}_DC\d{1,3}_\w\d{2}_\d*_P[1-3]_W*_*UV-metric ps*Ka.t3r"g
https://regex101.com/r/K0vABT/4
In [3]:
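df = pd.DataFrame({'filenames': datafiles})
# [0] returns the whole matched text; the dot before t3r is escaped so it only matches a literal "."
df['match'] = [re.search(r"\d{2}\w-\d{5}_DC\d{1,3}_\w\d{2}_\d*_P[1-3]_W*_*UV-metric ps*Ka\.t3r", file)[0]
               for file in df.filenames]
df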
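`re.search` returns `None` when a string does not match, and indexing `None` with `[0]` raises a `TypeError`. A minimal sketch of a guard, assuming non-matching filenames should simply yield `None`; the helper name `matched_text` and the file `notes.txt` are hypothetical, and the dot before `t3r` is escaped here on the assumption that a literal dot is intended.

```python
import re

# compile once so the pattern can be reused for every filename
pattern = re.compile(
    r"\d{2}\w-\d{5}_DC\d{1,3}_\w\d{2}_\d*_P[1-3]_W*_*UV-metric ps*Ka\.t3r"
)

def matched_text(filename):
    """Return the matched part of filename, or None if nothing matches."""
    match = pattern.search(filename)
    return match[0] if match else None

good = matched_text('20F-22003_DC8_D03_1032246_P1_UV-metric psKa.t3r')
bad = matched_text('notes.txt')  # hypothetical non-matching name

print(good)  # 20F-22003_DC8_D03_1032246_P1_UV-metric psKa.t3r
print(bad)   # None
```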
creating groups¶
r"\d{2}\w-\d{5}_(DC\d{1,3})_(\w\d{2})_(\d{7,10})_(P[1-3])_(W_)*UV-metric (psKa|pKa).t3r"g
https://regex101.com/r/UkpsCd/3
In [4]:
df = pd.DataFrame({'filenames': datafiles})
# keep the match objects so the captured groups can be read afterwards
df['matches'] = [re.search(r"\d{2}\w-\d{5}_(DC\d{1,3})_(\w\d{2})_(\d*)_(P[1-3])_(W_)*UV-metric (psKa|pKa)\.t3r", file)
                 for file in df.filenames]
number_of_groups = len(df.matches[0].groups())
for i in range(number_of_groups):
    # groups() is a 0-based tuple, so column group0 holds regex group 1
    df[f'group{i}'] = [match.groups()[i] for match in df.matches]
df
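A pattern with this many groups gets hard to read. One option, not used in the original notebook, is the `re.VERBOSE` flag, which ignores whitespace in the pattern and allows comments. The sketch below assumes the groups mean what the named-group version calls them (code, well address, id, plate, wdh, solvent).

```python
import re

grouped = re.compile(r"""
    \d{2}\w-\d{5}_      # prefix such as 20F-22003
    (DC\d{1,3})_        # group 1: code
    (\w\d{2})_          # group 2: well address
    (\d*)_              # group 3: id
    (P[1-3])_           # group 4: plate
    (W_)*               # group 5: optional W_ marker
    UV-metric[ ]        # in verbose mode a literal space needs brackets or a backslash
    (psKa|pKa)          # group 6: solvent
    \.t3r               # extension with an escaped dot
""", re.VERBOSE)

match = grouped.search('20G-04001_DC28_F19_1087258_P1_W_UV-metric psKa.t3r')
print(match.groups())
# ('DC28', 'F19', '1087258', 'P1', 'W_', 'psKa')
```

For filenames without the `W_` marker, group 5 does not take part in the match and is returned as `None`.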
naming groups¶
r"\d{2}\w-\d{5}_(?P<code>DC\d{1,3})_(?P<well_adress>\w\d{2})_(?P<id>\d*)_(?P<plate>P[1-3])_(?P<wdh>W_)*UV-metric (?P<solvent>psKa|pKa).t3r"g
https://regex101.com/r/IOqHvt/2
In [5]:
df = pd.DataFrame({'filenames': datafiles})
# the pattern is split over two raw strings; Python joins adjacent string literals
df['matches'] = [re.search(r"\d{2}\w-\d{5}_(?P<code>DC\d{1,3})_(?P<well_adress>\w\d{2})_(?P<id>\d*)_(?P<plate>P[1-3])_"
                           r"(?P<wdh>W_)*UV-metric (?P<solvent>psKa|pKa)\.t3r", file)
                 for file in df.filenames]
# one column per named group; wdh is None where the W_ marker is absent
for variable in ['code', 'well_adress', 'id', 'plate', 'solvent', 'wdh']:
    df[variable] = [match.group(variable) for match in df.matches]
df
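As an alternative to looping over match objects, pandas can apply a pattern with named groups directly via `Series.str.extract`, which returns one column per group. This is a different technique from the cell above, shown here as a sketch on three of the filenames.

```python
import pandas as pd

pattern = (r"\d{2}\w-\d{5}_(?P<code>DC\d{1,3})_(?P<well_adress>\w\d{2})_(?P<id>\d*)_"
           r"(?P<plate>P[1-3])_(?P<wdh>W_)*UV-metric (?P<solvent>psKa|pKa)\.t3r")

df = pd.DataFrame({'filenames': [
    '20F-22003_DC8_D03_1032246_P1_UV-metric psKa.t3r',
    '20G-04001_DC28_F19_1087258_P1_W_UV-metric psKa.t3r',
    '20H-26004_DC124_J03_1216895_P2_UV-metric pKa.t3r',
]})

# the named groups become the column names of the extracted frame
parts = df['filenames'].str.extract(pattern)
df = df.join(parts)
print(df[['code', 'well_adress', 'id', 'plate', 'wdh', 'solvent']])
```

Rows that do not match, and groups that do not take part in a match (such as `wdh` without the `W_` marker), come back as missing values instead of raising an error.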