Pandas: extractall - writing a capture group with or condition [duplicate]


I have columns in my dataframe (~2 million rows) that look like this:
column
1/20/1"ADAF"
1/4/551BSSS
1/2/1AAAA
1/565/1 "AAA="
And I want to extract only:
1/20/1
1/4/551
1/2/1
1/565/1
I have tried with:
df['wanted_column'] = df['column'].str.extract(r'((\d+)/(\d+)/(\d+))', expand=True)
But I got an error:
ValueError: Wrong number of items passed 4, placement implies 1
Does anyone know what I am doing wrong? If there is a better and faster solution for this, I would be thankful for a suggestion.
Thanks in advance.

If you want to extract a single part of a string into a single column, make sure your regex only contains a single capturing group. Remove all other capturing groups (if they are redundant) or convert them into non-capturing ones (if they are used as simple groupings for pattern sequences, e.g. (\W+\w+){0,3} -> (?:\W+\w+){0,3}).
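Here, your pattern has four capturing groups (the outer one plus the three inner ones), so str.extract returns four columns, which cannot be assigned to a single column. Use a single group instead: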
df['wanted_column'] = df['column'].str.extract(r'(\d+/\d+/\d+)', expand=True)
The point is to only use a single capturing group in the regex when you use it with str.extract to extract a value into a single column.
Mind that r'((\d+)/(\d+)/(\d+))' could also be rewritten as r'((?:\d+)/(?:\d+)/(?:\d+))' for this use case, but these non-capturing groups would be redundant, as each of them only wraps a single \d+ pattern.
If you need to extract values into several columns, mind that the number of columns must equal the number of capturing groups in the pattern, e.g.
df[['Val1', 'Val2', 'Val3']] = df['column'].str.extract(r'(\d+)/(\d+)/(\d+)', expand=True)
# Val1, Val2 and Val3 receive capturing groups 1, 2 and 3, respectively.
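As a minimal sketch of the two cases above, assuming a small DataFrame built from the sample rows in the question (the variable name parts is only illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {"column": ['1/20/1"ADAF"', "1/4/551BSSS", "1/2/1AAAA", '1/565/1 "AAA="']}
)

# One capturing group -> one value per row.
# With expand=False and a single group, str.extract returns a Series.
df["wanted_column"] = df["column"].str.extract(r"(\d+/\d+/\d+)", expand=False)

# Three capturing groups -> a DataFrame with three columns.
parts = df["column"].str.extract(r"(\d+)/(\d+)/(\d+)", expand=True)
parts.columns = ["Val1", "Val2", "Val3"]

print(df["wanted_column"].tolist())
print(parts)
```

The extracted values are strings; convert them with astype(int) if numbers are needed.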

Related

Need help splitting a column in my DataFrame (Python)

I have a pandas DataFrame "dt". One of its columns, "betName", is filled with objects that sometimes have +/- numbers after the names. I'm trying to figure out how to separate "betName" into two columns, "betName" and "line", where "betName" is just the name and "line" holds the +/- number or plain number.
Please see screenshots, thank you for helping!
example of problem and desired result
dt["betName"]

Try this (updated) code:
df2=df['betName'].str.split(r' (?=[+-]\d{1,}\.?\d{,}?)', expand=True).astype('str')
Explanation: str.split can split the text in each row into two or more columns on a regular expression. The pattern used here is a space followed by a lookahead:
 (?=[+-]\d{1,}\.?\d{,}?)
' ' - a literal space comes first. It is the only character the split consumes.
() - indicates the start and end of a group.
?= - lookahead assertion. Matches if ... matches next, but doesn't consume any of the string.
[+-] - a set of characters. It matches + or -.
\d{1,} - \d is a digit from 0 to 9, and {min,max} sets how many times it repeats. Here it means one or more digits: 1, 200, 4000, etc.
\.? - \. is a literal dot, and ? means 0 or 1 repetitions of the preceding group or symbol.
\d{,}? - zero or more further digits (the part after the decimal point), matched lazily.
str.split(pat=None, n=-1, expand=False)
pat - string or regular expression to split on. If not specified, split on whitespace.
n - number of splits in output. None, 0 and -1 will be interpreted as return all splits.
expand - expand the split strings into separate columns.
True places the split parts into separate columns.
False returns a Series/Index containing lists of strings.
Calling .astype('str') converts the resulting DataFrame to string type.
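The lookahead split can be tried on plain strings with the standard re module before applying it to a column. The sample bet names below are made up for illustration, since the original data was only shown in a screenshot:

```python
import re

# Same pattern as in the str.split call: split on a space only when it is
# followed by a sign and at least one digit.
pattern = re.compile(r" (?=[+-]\d{1,}\.?\d{,}?)")

samples = ["Lakers +5.5", "Los Angeles Lakers -3", "Draw"]  # hypothetical values
split_rows = [pattern.split(s) for s in samples]

for row in split_rows:
    print(row)
```

Spaces inside the name are left alone because no sign and digit follow them, and rows without a signed number are not split at all.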

EDIT: Added a split before applying the regex, so the regex is applied only to the text that comes after the last whitespace.
I think you need to extract the bet information with a regular expression.
df["line"] = df["betName"].apply(lambda x: x.split()[-1]).str.extract('([0-9.+-]+)')
Here's how the regex works: the () sets up a capture group, i.e. specifies what information you want to extract.
The part inside the square brackets is a character class, so here it matches any digit from 0-9, a + or - sign, or a full stop.
The plus sign after the square brackets means: match one or more repetitions of anything in the character class.
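The same idea can be sketched without pandas, which makes it easy to check edge cases. The helper name extract_line and the sample values are hypothetical:

```python
import re

token_re = re.compile(r"([0-9.+-]+)")

def extract_line(bet_name):
    """Return the numeric part of the last whitespace-separated token, or None."""
    last_token = bet_name.split()[-1]
    match = token_re.search(last_token)
    return match.group(1) if match else None

samples = ["Lakers +5.5", "Over 210.5", "Draw"]  # hypothetical values
lines = [extract_line(s) for s in samples]
print(lines)
```

One caveat: the character class also accepts a lone sign or dot, so a last token such as "A-Team" would yield "-". A stricter pattern may be needed if such names occur in the data.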

Finding a regx expression in pyspark?

I have a column in a pyspark dataframe which contains values separated by ;
+----------------------------------------------------------------------------------+
|name |
+----------------------------------------------------------------------------------+
|tppid=dfc36cc18bba07ae2419a1501534aec6fdcc22e0dcefed4f58c48b0169f203f6;xmaslist=no|
+----------------------------------------------------------------------------------+
Any number of key-value pairs can appear in this column. If I use
df.withColumn('test', regexp_extract(col('name'), '(?<=tppid=)(.*?);', 1)).show(1,False)
I can extract the tppid, but when tppid is the last key-value pair in a row, the extraction fails. I want a regex that can extract the value of a key wherever it appears in the row.

You may use a negated character class [^;] to match any char but ;:
tppid=([^;]+)
Since the third argument to regexp_extract is 1 (accessing Group 1 contents), you may discard the lookbehind construct and use tppid= as part of the consuming pattern.
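The difference between the two patterns can be checked quickly with Python's re module. This is only a sketch: Spark uses Java regular expressions, and the assumption here is that both patterns behave the same in the two engines, since they use only common syntax. The short key values are made up:

```python
import re

old_pattern = re.compile(r"(?<=tppid=)(.*?);")  # pattern from the question
new_pattern = re.compile(r"tppid=([^;]+)")      # negated character class

def extract_tppid(pattern, text):
    """Return group 1 of the first match, or None if there is no match."""
    match = pattern.search(text)
    return match.group(1) if match else None

first = "tppid=abc123;xmaslist=no"  # hypothetical: key in first position
last = "xmaslist=no;tppid=abc123"   # hypothetical: key in last position

print(extract_tppid(old_pattern, first), extract_tppid(old_pattern, last))
print(extract_tppid(new_pattern, first), extract_tppid(new_pattern, last))
```

The original pattern needs a trailing ; and therefore finds nothing when the key is last, while [^;]+ simply stops at the next ; or at the end of the string.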

In addition to Wiktor Stribiżew's answer, you can use anchors: $ denotes the end of the string.
tppid=\w+(?=;|\s|$)
This regex extracts only the value, without the tppid= part:
(?<=tppid=)\w+(?=;|\s|$)

delete a part of string before a specific pattern

I have a pandas dataframe with a column from which I have to retrieve specific names. The problem is that those names are not always in the same place, and the values in that column do not all have the same length, so I cannot use the split function. However, I have noticed that the names are always preceded by a combination of 4 to 7 digits. I believe it's the identifier for the name.
So how can I use a regular expression to go through that column and retrieve the names I need?
Here is an example from the Jupyter notebook:
df['info']
csx_Gb009_broken screen_231400_Iphone 7
000345_SamsungS8_tfes_Vodafone_is56t34_3G
Ins45_56003_Huawei P8_
What I want is something like this:
df['Phones']
Iphone 7
SamsungS8
Huawei P8
I want to have something like the above, knowing that each name follows a combination of 4 to 7 digits and an underscore.

You may use
df['Phones'] = df['info'].str.extract(r'\d{4}_([^_]+)')
The pattern matches:
\d{4} - 4 digits
_ - an underscore
([^_]+) - Capturing group 1 (this value will be returned by str.extract): one or more chars other than _.
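A quick check of this pattern against the three sample rows from the question, using the standard re module (the helper name extract_phone is illustrative):

```python
import re

pattern = re.compile(r"\d{4}_([^_]+)")

infos = [
    "csx_Gb009_broken screen_231400_Iphone 7",
    "000345_SamsungS8_tfes_Vodafone_is56t34_3G",
    "Ins45_56003_Huawei P8_",
]

def extract_phone(text):
    """Return the text between the first 'four digits + underscore' and the next underscore."""
    match = pattern.search(text)
    return match.group(1) if match else None

phones = [extract_phone(s) for s in infos]
print(phones)
```

Because \d{4}_ only requires four digits directly before the underscore, it also matches the tail of longer runs such as 231400_, which is why identifiers of 4 to 7 digits are all covered.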

Python Regex behaviour with Square Brackets []

This is the text file abc.txt
abc.txt
aa:s0:education.gov.in
bb:s1:defence.gov.in
cc:s2:finance.gov.in
I'm trying to parse this file by tokenizing (correct me if this is the incorrect term :) ) at every ":" using the following regular expression.
parser.py
import re,sys,os,subprocess
# Raw string, so that "\a" is not read as an escape sequence
path = r"C:\abc.txt"
site_list = open(path,'r')
for line in site_list:
    site_line = re.search(r'(\w)*:(\w)*:([\w\W]*\.[\W\w]*\.[\W\w]*)',line)
    print('Regex found that site_line.group(2) = '+str(site_line.group(2)))
Why is the output
Regex found that site_line.group(2) = 0
Regex found that site_line.group(2) = 1
Regex found that site_line.group(2) = 2
Can someone please help me understand why it matches only the last character of the second group? I think it's matching 0 from s0, 1 from s1 and 2 from s2.
But why?

Let's show a simplified example:
>>> re.search(r'(.)*', 'asdf').group(1)
'f'
>>> re.search(r'(.*)', 'asdf').group(1)
'asdf'
If you have a repetition operator around a capturing group, the group stores the last repetition. Putting the group around the repetition operator does what you want.
If you were expecting to see data from the third group, that would be group(3). group(0) is the whole match, and group(1), group(2), etc. count through the actual parenthesized capturing groups.
That said, as the comments suggest, regexes are overkill for this.
>>> 'aa:s0:education.gov.in'.split(':')
['aa', 's0', 'education.gov.in']

Note that group(0), not group(1), is the entire match by default. From the documentation:
If a groupN argument is zero, the corresponding return value is the entire matching string.
So you should skip it, and check group(3) if you want the last group.
Also, you should compile the regexp before the for loop. It can improve the performance of your parser.
And you can replace (\w)* with (\w*) if you want to match all the characters between the colons.
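Putting these suggestions together, a minimal sketch might look like the following. It reads from an in-memory list instead of the file, and the third group is simplified to (.*); both are assumptions made to keep the example self-contained:

```python
import re

# Compiled once, outside the loop. The quantifier sits inside each group,
# so a group captures the whole field rather than only its last character.
line_re = re.compile(r"(\w*):(\w*):(.*)")

lines = [
    "aa:s0:education.gov.in",
    "bb:s1:defence.gov.in",
    "cc:s2:finance.gov.in",
]

parsed = []
for line in lines:
    match = line_re.search(line)
    if match:
        parsed.append((match.group(1), match.group(2), match.group(3)))

for fields in parsed:
    print(fields)

# For comparison, the quantifier outside the group keeps only the last repetition.
last_char = re.search(r"(\w)*:(\w)*:", lines[0]).group(2)
print(last_char)
```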

repeating multiple characters regex

Is there a way using a regex to match a repeating set of characters? For example:
ABCABCABCABCABC
ABC{5}
I know that's wrong. But is there anything to match that effect?
Update:
Can you use nested capture groups? So Something like (?<cap>(ABC){5}) ?

Enclose the regex you want to repeat in parentheses. For instance, if you want 5 repetitions of ABC:
(ABC){5}
Or if you want any number of repetitions (0 or more):
(ABC)*
Or one or more repetitions:
(ABC)+
edit to respond to update
Parentheses in regular expressions do two things; they group together a sequence of items in a regular expression, so that you can apply an operator to an entire sequence instead of just the last item, and they capture the contents of that group so you can extract the substring that was matched by that subexpression in the regex.
You can nest parentheses; they are counted from the first opening paren. For instance:
>>> re.search('[0-9]* (ABC(...))', '123 ABCDEF 456').group(0)
'123 ABCDEF'
>>> re.search('[0-9]* (ABC(...))', '123 ABCDEF 456').group(1)
'ABCDEF'
>>> re.search('[0-9]* (ABC(...))', '123 ABCDEF 456').group(2)
'DEF'
If you would like to group without capturing, you can use (?:...). This is helpful when the parentheses only group a sequence so that an operator applies to it, and you don't want them to change the numbering of your captured groups. It is also faster.
>>> re.search('[0-9]* (?:ABC(...))', '123 ABCDEF 456').group(1)
'DEF'
So to answer your update, yes, you can use nested capture groups, or even avoid capturing with the inner group at all:
>>> re.search('((?:ABC){5})(DEF)', 'ABCABCABCABCABCDEF').group(1)
'ABCABCABCABCABC'
>>> re.search('((?:ABC){5})(DEF)', 'ABCABCABCABCABCDEF').group(2)
'DEF'

ABC{5} matches ABCCCCC. To match 5 ABC's, you should use (ABC){5}. Parentheses are used to group a set of characters. You can also set an interval for the number of occurrences, like (ABC){3,5}, which matches ABCABCABC, ABCABCABCABC, and ABCABCABCABCABC.
(ABC){1,} means 1 or more repetitions, which is exactly the same as (ABC)+.
(ABC){0,} means 0 or more repetitions, which is exactly the same as (ABC)*.
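These claims are easy to verify in Python. The helper full below is just a convenience wrapper around re.fullmatch:

```python
import re

def full(pattern, text):
    """True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

print(full(r"ABC{5}", "ABCCCCC"))      # True: {5} applies to C only
print(full(r"ABC{5}", "ABC" * 5))      # False
print(full(r"(ABC){5}", "ABC" * 5))    # True: {5} applies to the group
print(full(r"(ABC){3,5}", "ABC" * 2))  # False: fewer than 3 repetitions
print(full(r"(ABC){3,5}", "ABC" * 4))  # True
```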

(ABC){5} Should work for you

Parentheses "()" are used to group characters and expressions within larger, more complex regular expressions. Quantifiers that immediately follow the group apply to the whole group.
(ABC){5}

As to the update to the question-
You can nest capture groups. The capture group index is incremented per open paren.
(((ABC)*)(DEF)*)
Feeding that regex ABCABCABCDEFDEFDEF, capture group 0 matches the whole thing, 1 is also the whole thing, 2 is ABCABCABC, 3 is ABC, and 4 is DEF (because the star is outside of the capture group).
If you have variation inside a capture group and a repeat just outside, then things can get a little wonky if you're not expecting it...
(a[bc]*c)*
when fed abbbcccabbc will return the last match as capture group 1, in this example just the abbc, since the capture group gets reset with the repeat operator.
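Both examples can be reproduced with Python's re module. If every repetition is needed rather than only the last one, one option is to drop the outer repeat and use re.findall on the inner pattern, as sketched here:

```python
import re

nested = re.fullmatch(r"(((ABC)*)(DEF)*)", "ABCABCABCDEFDEFDEF")
print(nested.group(0))  # whole match
print(nested.groups())  # groups 1 to 4, numbered by opening parenthesis

repeated = re.fullmatch(r"(a[bc]*c)*", "abbbcccabbc")
print(repeated.group(1))  # only the last repetition is kept

# Collect every repetition by matching the inner pattern on its own.
all_parts = re.findall(r"a[bc]*c", "abbbcccabbc")
print(all_parts)
```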
