return two values regex python - python

I have a data frame where the values I want are in the same cell, surrounded by text, like this:
depth: 3230 m - 3750 m
I'm trying to write a regex that returns the first number and then the second into a new data frame.
So far, I can get the values with this:
top_depthdf=df[0].str.extract(r'depth:\s(\d+(?:\.\d+)?)', flags=re.I).astype(float)
base_depthdf=df[0].str.extract(r'-\s(\d+(?:\.\d+)?)', flags=re.I).astype(float)
My issue is that these patterns are not unique in this data, especially the base depth one. Other numbers follow a similar pattern, and my script returns them instead of the base depth when they occur before the depth row. Is there a way to write base_depthdf so that it looks for the 'depth:' part first and only then looks for that pattern?

You can capture these numbers with two named capturing groups into two columns at once:
df_depth = df[0].str.extract(r'depth:\s*(?P<top_depth>\d+(?:\.\d+)?)(?:\s*\w+)?\s*-\s*(?P<base_depth>\d+(?:\.\d+)?)')
The named groups (?P<top_depth>...) and (?P<base_depth>...) capture the two numbers into separate columns, and the group names become the column names.
I used (?:\s*\w+)?\s* to match a single optional word between the two patterns, but you may just use .*? if you are not sure what can appear between the two:
df_depth = df[0].str.extract(r'depth:\s*(?P<top_depth>\d+(?:\.\d+)?).*?-\s*(?P<base_depth>\d+(?:\.\d+)?)')
Pandas test:
import pandas as pd
df = pd.DataFrame({'c':['depth: 3230 m - 3750 m']})
df_depth = df['c'].str.extract(r'depth:\s*(?P<top_depth>\d+(?:\.\d+)?)(?:\s*\w+)?\s*-\s*(?P<base_depth>\d+(?:\.\d+)?)')
print(df_depth.to_string())
Output:
  top_depth base_depth
0      3230       3750
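To see why anchoring on 'depth:' fixes the original problem, here is a minimal sketch using only the re module. The cell text with a decoy "- 12" in front of the depth entry is a made-up example, not data from the question.

```python
import re

# Same pattern as in the answer above: the match has to start at "depth:".
DEPTH_RE = re.compile(
    r'depth:\s*(?P<top_depth>\d+(?:\.\d+)?).*?-\s*(?P<base_depth>\d+(?:\.\d+)?)'
)

# Hypothetical cell text: a similar "- number" pattern appears before the depth entry.
cell = "offset - 12 m; depth: 3230 m - 3750 m"

# The unanchored pattern from the question picks up the decoy first.
decoy = re.search(r'-\s(\d+(?:\.\d+)?)', cell).group(1)

# The anchored pattern only looks at what follows "depth:".
m = DEPTH_RE.search(cell)
depths = {name: float(value) for name, value in m.groupdict().items()}

print(decoy)
print(depths)
```

The lazy .*? stops at the first hyphen after the top depth, so the base depth is always the number that follows the depth entry's own hyphen.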

Related

python remove variable substring patterns from a long string in a python dataframe column

I have a column in my dataframe, containing very large strings.
here is a short sample of the string
FixedChar{3bf3423 Data to keep}, FixedChar{5e0d20 Data to keep}, FixedChar{6cb86d9 Data to keep}, ...
I need to remove the recurring static "FixedChar{" and the variable substring after it that has static length of 6 and also "}"
and just keep the "Data to keep" strings that have variable lengths.
what is the best way to remove this recurring variable pattern?
It was easier than I thought.
At first I used re.sub() from the re library.
The regex \w* matches all the word characters (letters, digits and underscores) after "FixedChar{", and the argument flags=re.I makes the match case-insensitive.
import re
re.sub(r"FixedChar\{\w*", "", dataFrame.Column[row], flags=re.I)
But I found str.replace() more useful and replaced the values in my dataframe using loc, as I needed to filter my dataframe because this pattern shows up only in specific rows.
# replace ':' with a boolean row filter to restrict the replacement to specific rows
dataFrame.loc[:, 'Column'] = dataFrame['Column'].str.replace(r"FixedChar\{\w* ", '', regex=True)
dataFrame.loc[:, 'Column'] = dataFrame['Column'].str.replace("}", '', regex=False)
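As a possible alternative to the two replacements above, a single substitution with a capturing group can drop the prefix, the variable code and the closing brace in one pass. This is only a sketch on the sample string from the question; the variable names are made up for illustration.

```python
import re

# Sample string from the question (shortened to three entries).
s = ("FixedChar{3bf3423 Data to keep}, "
     "FixedChar{5e0d20 Data to keep}, "
     "FixedChar{6cb86d9 Data to keep}")

# FixedChar\{   literal prefix
# \w+           the variable code, followed by one space
# ([^}]*)       group 1: everything up to the closing brace (the data to keep)
# \}            literal closing brace
cleaned = re.sub(r"FixedChar\{\w+ ([^}]*)\}", r"\1", s, flags=re.I)

print(cleaned)
```

The same pattern and replacement string should also work with Series.str.replace(..., regex=True), since pandas passes them on to re.sub.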

Filtering keywords/sentences in a dataframe pandas

Currently I have a dataframe. Here is an example of my dataframe:
I also have a list of keywords/ sentences. I want to match it to the column 'Content' and see if any of the keywords or sentences match.
Here is what I've done
# instructions_list is just the list of keywords and key sentences
instructions_list = instructions['Key words & sentence search'].tolist()
pattern = '|'.join(instructions_list)
bureau_de_sante[bureau_de_sante['Content'].str.contains(pattern, regex = True)]
While it is giving me the results, it is also giving me this UserWarning : UserWarning: This pattern has match groups. To actually get the groups, use str.extract.
return func(self, *args, **kwargs).
Questions:
How can I prevent the userwarning from showing up?
After checking whether a match is in the column, how can I print the specific match in a new column?
You are supplying a regex to search the dataframe. If you have parentheses in your instruction list (as is the case in your example), they constitute a match group. To avoid this, you have to escape them (i.e. add \ in front of them, so that (Critical risk) becomes \(Critical risk\)). You will probably also want to escape all other regex special characters, such as \ . * + ? [ ] etc.; re.escape() does this for you.
Now, you can use these groups to extract the match from your data. Here is an example:
import pandas as pd
df = pd.DataFrame(["Hello World", "Foo Bar Baz", "Goodbye"], columns=["text"])
pattern = "(World|Bar)"
# .str is an accessor of a Series, so select the column first
print(df["text"].str.extract(pattern))
#        0
# 0  World
# 1    Bar
# 2    NaN
You can add this column to your dataframe with a simple assignment (e.g. df["result"] = df["text"].str.extract(pattern, expand=False))
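The escaping step described above can be automated. The sketch below uses a made-up keyword list and plain re calls. It builds one pattern without groups (the kind that would not trigger the "match groups" warning in str.contains) and one pattern wrapped in a single group for extracting the matched keyword.

```python
import re

# Hypothetical keyword list; the parentheses and the dot are regex metacharacters.
keywords = ["(Critical risk)", "high risk", "3.5 mg"]

# re.escape makes every keyword match literally.
escaped = "|".join(re.escape(k) for k in keywords)

contains_pattern = re.compile(escaped)               # no groups: for yes/no checks
extract_pattern = re.compile("(" + escaped + ")")    # one group: for extraction


def first_match(text):
    """Return the first keyword found in text, or None."""
    m = extract_pattern.search(text)
    return m.group(1) if m else None


texts = [
    "Flagged as (Critical risk) today",
    "dose of 3.5 mg",
    "dose of 305 mg",      # must not match: the dot is literal now
    "nothing here",
]
matches = [first_match(t) for t in texts]
print(matches)
```

With pandas, the string form of the first pattern could be passed to str.contains and the second to str.extract.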

Selecting patterns in character sequence using regex

I need to select all the accounts where 3 (or more) consecutive characters are identical and/or the name includes digits, for example:
Account
aaa12
43qas
42134dfsdd
did
Output
Account
aaa12
43qas
42134dfsdd
I am considering using a regex for this: [a-zA-Z]{3,} , but I am not sure of the approach. Also, this does not include the and/or condition on the digits. I am interested in selecting accounts with at least one of these:
repeated identical characters,
numbers in the name.
Give this a try
n = 3 #for 3 chars repeating
pat = f'([a-zA-Z])\\1{{{n-1}}}|(\\d)+' #need `{{` to pass a literal `{`
df_final = df[df.Account.str.findall(pat).astype(bool)]
Out[101]:
      Account
0       aaa12
1       43qas
2  42134dfsdd
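A small check of that pattern outside pandas, as a sketch: the backreference \1 is what enforces "the same letter again", which a plain [a-zA-Z]{3} would not do. The account list is the one from the question.

```python
import re

n = 3  # number of identical consecutive letters required

# ([a-zA-Z]) captures one letter, \1{n-1} requires that same letter n-1 more times;
# the alternative \d accepts any name that contains a digit.
pat = re.compile(rf'([a-zA-Z])\1{{{n - 1}}}|\d')

accounts = ["aaa12", "43qas", "42134dfsdd", "did"]
selected = [a for a in accounts if pat.search(a)]

print(pat.pattern)
print(selected)
```

"did" is dropped because it has three letters but no three identical ones in a row and no digit.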
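Can you try:
import re
# the pattern must be a string; \1{2} requires the same letter three times in a row
x = re.search(r'([a-zA-Z])\1{2}|\d', string)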

Python: zfill with rsub padding zeroes

I wrote a script to standardize a bunch of values pulled from a data bank using (mostly) re.sub. I am having a hard time incorporating zfill to pad the numerical values to 5 digits.
Input
FOO5864BAR654FOOBAR
Desired Output
FOO_05864-BAR-00654_FOOBAR
Using re.sub I have so far
FOO_5864-BAR-654_FOOBAR
One option was to do re.sub w/ capturing groups for each possible format [i.e. below], which works, but I don't think that's the correct way to do it.
(\d) sub 0000\1
(\d\d) sub 000\1
(\d\d\d) sub 00\1
(\d\d\d\d) sub 0\1
Assuming your inputs are all of the form letters-numbers-letters-numbers-letters (one or more of each), you just need to zero-fill the second and fourth groups from the match:
import re
s = 'FOO5864BAR654FOOBAR'
pattern = r'(\D+)(\d+)(\D+)(\d+)(\D+)'
m = re.match(pattern, s)
out = '{}_{:0>5}-{}-{:0>5}_{}'.format(*m.groups())
print(out) # -> FOO_05864-BAR-00654_FOOBAR
You could also do this with str.zfill(5), but the str.format method is just much cleaner.
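For comparison, here is a sketch of the zfill route: re.sub accepts a function as the replacement, so each run of digits can be padded without fixing the overall shape of the input. The helper name pad_numbers is made up for illustration.

```python
import re


def pad_numbers(text, width=5):
    """Left-pad every run of digits in text with zeros up to width."""
    return re.sub(r'\d+', lambda m: m.group().zfill(width), text)


# Applied to the intermediate result the question already produces with re.sub.
print(pad_numbers('FOO_5864-BAR-654_FOOBAR'))
```

Unlike the five-group match above, this does not assume a fixed letters-numbers-letters-numbers-letters layout, but it also does not insert the separators.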

Python Regular Expressions Findall

To look through data, I am using regular expressions. One of my regular expressions is (they are dynamic and change based on what the computer needs to look for --- using them to search through data for a game AI):
O,2,([0-9],?){0,},X
After the 2, there can (and most likely will) be other numbers, each followed by a comma.
To my understanding, this will match:
O,2,(any amount of numbers - can be 0 in total, each followed by a comma),X
This is fine, and works (in RegExr) for:
O,4,1,8,6,7,9,5,3,X
X,6,3,7,5,9,4,1,8,2,T
O,2,9,6,7,11,8,X # matches this
O,4,6,9,3,1,7,5,O
X,6,9,3,5,1,7,4,8,O
X,3,2,7,1,9,4,6,X
X,9,2,6,8,5,3,1,X
My issue is that I need to match all the numbers after the original, provided number. So, I want to match (in the example) 9,6,7,11,8.
However, implementing this in Python:
import re
pattern = re.compile("O,2,([0-9],?){0,},X")
matches = pattern.findall(s) # s is the above string
matches is ['8'], the last number, but I need to match all of the numbers after the given (so '9,6,7,11,8').
Note: I need to use pattern.findall because there will be more than one match (I shortened my list of strings, but there are actually around 20 thousand strings), and I need to find the shortest one (as this would be the shortest way for the AI to win).
Is there a way to match the entire string (or just the last numbers after those I provided)?
Thanks in advance!
Use this:
O,2,((?:[0-9],?){0,}),X
See it in action: http://regex101.com/r/cV9wS1
import re
s = '''O,4,1,8,6,7,9,5,3,X
X,6,3,7,5,9,4,1,8,2,T
O,2,9,6,7,11,8,X
O,4,6,9,3,1,7,5,O
X,6,9,3,5,1,7,4,8,O
X,3,2,7,1,9,4,6,X
X,9,2,6,8,5,3,1,X'''
pattern = re.compile("O,2,((?:[0-9],?){0,}),X")
matches = pattern.findall(s) # s is the above string
print(matches)
Outputs:
['9,6,7,11,8']
Explained:
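Wrapping everything between "2," and ",X" in an outer pair of parentheses captures the whole list of numbers. The inner group becomes a non-capturing group (?: ), because a repeated capturing group only keeps its last repetition, which is why the original pattern returned ['8'].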
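The difference between the two patterns, and the "shortest match" step the question asks for, can be sketched as follows. The three-line input is a made-up miniature of the real data.

```python
import re

# Hypothetical miniature data set: two lines start with "O,2," and end with ",X".
s = "O,2,9,6,7,11,8,X\nO,2,5,3,X\nX,3,2,7,X"

# Repeated capturing group: only the last repetition of each match is kept.
last_only = re.findall(r"O,2,([0-9],?){0,},X", s)

# Outer capturing group around a non-capturing repetition: the whole list is kept.
full = re.findall(r"O,2,((?:[0-9],?){0,}),X", s)

# Shortest sequence = fewest numbers between "O,2," and ",X".
shortest = min(full, key=lambda seq: len(seq.split(",")))

print(last_only)
print(full)
print(shortest)
```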
You don't have to use regex:
split the string on commas into a list
check that item 0 == 'O' and item 1 == '2'
check that the last item == 'X'
check that each item in items[2:-1] is a number (str.isdigit)
That's all.
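The steps above can be sketched as a small function. The function name and its parameters are made up for illustration; the defaults correspond to the "O,2,...,X" case from the question.

```python
def moves_after(line, player="O", first="2", winner="X"):
    """Return the numbers between 'player,first,' and ',winner', or None."""
    items = line.split(",")
    if len(items) < 3:
        return None
    # Check the fixed prefix and the final item.
    if items[0] != player or items[1] != first or items[-1] != winner:
        return None
    middle = items[2:-1]
    # Every item in between has to be a number.
    if not all(item.isdigit() for item in middle):
        return None
    return middle


lines = [
    "O,4,1,8,6,7,9,5,3,X",
    "O,2,9,6,7,11,8,X",
    "O,4,6,9,3,1,7,5,O",
]
results = [moves_after(line) for line in lines]
print(results)
```

The length of each returned list gives the number of moves directly, which makes picking the shortest sequence a simple min() over the non-None results.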
