delete a part of string before a specific pattern - python

I have a pandas dataframe with a column from which I have to retrieve specific names. The problem is that those names are not always in the same place, and the values in that column do not all have the same length, so I cannot use the split function. However, I have noticed that before each name there is always a combination of 4 to 7 digits. I believe it's the identifier for the name.
So how can I use a regular expression to go through that column and retrieve the names I need?
Here is an example from the Jupyter notebook:
df['info']
csx_Gb009_broken screen_231400_Iphone 7
000345_SamsungS8_tfes_Vodafone_is56t34_3G
Ins45_56003_Huawei P8_
What I want is something like this:
df['Phones']
Iphone 7
SamsungS8
Huawei P8
I want to have something like the above, knowing that those names come right after a combination of 4 to 7 digits and an underscore, and end at the next underscore.

You may use
df['Phones'] = df['info'].str.extract(r'\d{4}_([^_]+)')
The pattern matches:
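\d{4} - 4 digits (this is enough for runs of 4 to 7 digits, because the pattern is not anchored and the last 4 digits of the run sit right before the underscore)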
_ - an underscore
([^_]+) - Capturing group 1 (this value will be returned by str.extract): one or more chars other than _.
See the regex demo.
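As a minimal sketch, the same idea can be tried on the three sample rows from the question. The quantifier {4,7} is used here instead of {4} only to mirror the "4 to 7 digits" wording; that change, and the use of expand=False to get a Series back, are choices made for this illustration rather than part of the original answer.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "info": [
        "csx_Gb009_broken screen_231400_Iphone 7",
        "000345_SamsungS8_tfes_Vodafone_is56t34_3G",
        "Ins45_56003_Huawei P8_",
    ]
})

# \d{4,7}_ : a run of 4 to 7 digits followed by an underscore
# ([^_]+)  : the name, i.e. everything up to the next underscore (or the end)
# expand=False returns a Series, which suits a single capturing group.
df["Phones"] = df["info"].str.extract(r"\d{4,7}_([^_]+)", expand=False)

print(df["Phones"].tolist())
```

Shorter digit runs such as the 009 in Gb009 or the 45 in Ins45 are skipped because they have fewer than 4 digits before the underscore.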

Related

Pandas: extractall - writing a capture group with or condition [duplicate]

I have columns in my dataframe (~2 million rows) that look like this:
column
1/20/1"ADAF"
1/4/551BSSS
1/2/1AAAA
1/565/1 "AAA="
And I want to extract only:
1/20/1
1/4/551
1/2/1
1/565/1
I have tried with:
df['wanted_column'] = df['column'].str.extract(r'((\d+)/(\d+)/(\d+))', expand=True)
But I got an error:
ValueError: Wrong number of items passed 4, placement implies 1
Does anyone know where I went wrong? And if there is a better and faster solution for this, I would be thankful for a suggestion.
Thanks in advance.
If you want to extract a single part of a string into a single column, make sure your regex only contains a single capturing group. Remove all other capturing groups (if they are redundant) or convert them into non-capturing ones (if they are used as simple groupings for pattern sequences, e.g. (\W+\w+){0,3} -> (?:\W+\w+){0,3}).
Here, you can use
df['wanted_column'] = df['column'].str.extract(r'(\d+/\d+/\d+)', expand=True)
The point is to only use a single capturing group in the regex when you use it with str.extract to extract a value into a single column.
Mind that r'((\d+)/(\d+)/(\d+))' could also be rewritten as r'((?:\d+)/(?:\d+)/(?:\d+))' for this use case, but these non-capturing groups would be redundant, as each of them only wraps a single \d+ pattern.
If you need to extract values into several columns, mind that the number of columns must equal the number of capturing groups in the pattern, e.g.
df[['Val1', 'Val2', 'Val3']] = df['column'].str.extract(r'(\d+)/(\d+)/(\d+)', expand=True)
#       1       2       3                                 ^ 1 ^ ^ 2 ^ ^ 3 ^
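To see both cases side by side, here is a small sketch built on the four sample values from the question. The DataFrame construction and the use of expand=False for the single-group case are assumptions made for this illustration.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame(
    {"column": ['1/20/1"ADAF"', "1/4/551BSSS", "1/2/1AAAA", '1/565/1 "AAA="']}
)

# One capturing group -> one column (expand=False yields a Series).
df["wanted_column"] = df["column"].str.extract(r"(\d+/\d+/\d+)", expand=False)

# Three capturing groups -> three columns, assigned in one step.
df[["Val1", "Val2", "Val3"]] = df["column"].str.extract(
    r"(\d+)/(\d+)/(\d+)", expand=True
)

print(df[["wanted_column", "Val1", "Val2", "Val3"]])
```

The extracted values are strings; convert them with astype(int) if numbers are needed.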

Need help splitting a column in my DataFrame (Python)

I have a pandas DataFrame "dt". One of its columns, "betName", is filled with objects that sometimes have +/- numbers after the names. I'm trying to figure out how to separate "betName" into two columns, "betName" and "line", where "betName" is just the name and "line" holds the +/- number or plain number.
Please see screenshots, thank you for helping!
example of problem and desired result
dt["betName"]
Try this (updated) code:
df2=df['betName'].str.split(r' (?=[+-]\d{1,}\.?\d{,}?)', expand=True).astype('str')
Explanation. You can use str.split with a regular expression to split the text in each row into 2 or more columns:
(?=[+-]\d{1,}\.?\d{,}?)
' ' - A space character comes first.
() - Parentheses mark the start and end of a group.
?= - Lookahead assertion. Matches if ... matches next, but doesn't consume any of the string.
[+-] - A set of characters. It will match + or -.
\d{1,} - \d is a digit from 0 to 9 and {min,max} sets the number of repetitions. Here it means one or more digits: 1, 200, 4000 etc.
\.? - \. is a literal dot and ? means 0 or 1 repetitions of the preceding group or symbol.
\d{,}? - Zero or more further digits (the decimals), matched lazily.
str.split(pat=None, n=-1, expand=False)
pat - string or regular expression to split on. If not specified, split on whitespace.
n - maximum number of splits. None, 0 and -1 are all interpreted as "return all splits".
expand - expand the split strings into separate columns.
True places the split pieces into separate columns.
False returns a Series/Index containing lists of strings.
.astype('str') converts the resulting DataFrame to string type.
The output.
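The screenshots from the question are not reproduced here, so the following is only a minimal sketch with made-up bet names (an assumption, not the asker's data). It uses a shortened lookahead, (?=[+-]\d), which should split at the same places because the lookahead only has to confirm that a sign and a digit follow the space. It also skips .astype('str') so that rows without a line stay empty instead of becoming the string 'None'.

```python
import pandas as pd

# Hypothetical bet names; the real data was only shown in screenshots.
df = pd.DataFrame({"betName": ["Lakers +5.5", "Celtics -3", "Draw"]})

# Split on a space only when a sign and a digit follow it.
# n=1 keeps the result to at most two pieces per row.
parts = df["betName"].str.split(r" (?=[+-]\d)", n=1, expand=True)
parts.columns = ["betName", "line"]

print(parts)
```

Rows that contain no signed number (such as "Draw" here) are left unsplit, and their "line" value is missing.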
EDIT: Added a split before doing the regex. This applies the regex only to the cell information that comes after the last white space.
I think you need to extract the bet information with a regular expression.
df["line"] = df["betName"].apply(lambda x: x.split()[-1]).str.extract('([0-9.+-]+)')
Here's how the regex works: the () sets up a capture group, i.e. specifies what information you want to extract.
The part inside the square brackets is a character class, so here it matches any digit from 0 to 9, a + or - sign, or a full stop.
The plus sign after the square brackets means: match one or more repetitions of anything in the character class.
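A short sketch of this approach on made-up values (the bet names are assumptions for illustration). It uses the vectorised .str.split().str[-1] in place of apply with a lambda, which is a stylistic swap and not part of the original answer.

```python
import pandas as pd

# Hypothetical bet names, for illustration only.
df = pd.DataFrame({"betName": ["Lakers +5.5", "Over 210.5", "Draw"]})

# Take the last whitespace-separated token of each cell ...
last_token = df["betName"].str.split().str[-1]

# ... and keep it only if it contains digits, dots or signs.
df["line"] = last_token.str.extract(r"([0-9.+-]+)", expand=False)

print(df)
```

Because the character class also contains the dot and the signs, a last token such as a trailing abbreviation with a dot would produce a match too, so the result may need a sanity check on real data.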

Python Regex ignore specific string to find next example

I have the following code that runs through the current column and creates a secondary column containing just the code in parentheses. This works wonderfully in examples 2 and 3. However, in example 1 the date is picked up instead, because it is also in parentheses. Is there a way to rework the code to ignore anything in parentheses that is a datestamp and keep looking for something else within that record? For example, in scenario 1: scan record one, ignore (2018-03) and select (256). The datasets we work with have 3, 4, 5 and various other kinds of record codes, but this date format is unique and can be removed.
Code:
df1['Doc ID'] = df['Folder Path'].str.extract('.*\((.*)\).*',expand=True)
Data Table:
   current column                                      new column
1  /reports/support + admin. (256)/ Global (2018-03)   (2018-03)
2  /reports/limit/sector(139)/2017                     (139)
3  /reports/sector/region(147,189 and 132)/2018        (147,189 and 132)
You may use
df['Folder Path'].str.extract(r'\((?!\d{4}-\d{2}\)|Data Only\))([^()]*)\)',expand=True)
The regex matches
\( - an open parenthesis
(?!\d{4}-\d{2}\)|Data Only\)) - a negative lookahead that fails the match if, immediately to the right, there is
\d{4}-\d{2}\) - 4 digits, a hyphen, 2 digits and a )
| - or
Data Only\) - a Data Only) substring
([^()]*) - Group 1: any 0 or more chars other than open/close parentheses
\) - a close parenthesis
See the regex demo.
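The pattern can be checked outside pandas with the standard re module. The sketch below runs it over the three paths from the question plus one extra path, which is hypothetical and added only to show a date group being skipped when it comes first.

```python
import re

paths = [
    # Paths copied from the question's table.
    "/reports/support + admin. (256)/ Global (2018-03)",
    "/reports/limit/sector(139)/2017",
    "/reports/sector/region(147,189 and 132)/2018",
    # Hypothetical path: the date group comes before the code.
    "/archive (2018-03)/final (77)",
]

# Same pattern as in the answer: skip "(YYYY-MM)" and "(Data Only)" groups.
pattern = re.compile(r"\((?!\d{4}-\d{2}\)|Data Only\))([^()]*)\)")

doc_ids = []
for path in paths:
    match = pattern.search(path)
    doc_ids.append(match.group(1) if match else None)

print(doc_ids)
```

Group 1 holds the text without the surrounding parentheses.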

Using regex to detect a sku

I'm new to regex and I have some trouble detecting the SKUs (unique ids) of products in a column.
My skus can take any form: all they have in common basically is:
to be words made of a combination of letters and numbers
to have 6 characters
Here is an example of what I have in my column:
LX0051
N41554
shoes
handbag
1B1F25
1V1F8L
store near me
M90947
M90844
How can I identify the rows that contain a sku using regex?
If I understand correctly, you mean that it must have at least one digit and at least one letter, and be exactly 6 characters long. Try
^(?=.*\d)(?=.*[a-z])[a-z\d]{6}$
It uses two lookaheads to ensure there's at least one digit and one letter in the string. Then it simply matches 6 characters. (Remember the i flag if both lowercase and uppercase letters should be allowed.)
See it here at regex101.
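As a quick check, here is a minimal sketch that applies the pattern to the sample values from the question with Python's re module. The re.IGNORECASE flag stands in for the i flag mentioned above; the variable names are made up for this illustration.

```python
import re

# Sample values copied from the question.
values = [
    "LX0051", "N41554", "shoes", "handbag", "1B1F25",
    "1V1F8L", "store near me", "M90947", "M90844",
]

# At least one digit, at least one letter, exactly 6 alphanumeric characters.
sku_pattern = re.compile(r"^(?=.*\d)(?=.*[a-z])[a-z\d]{6}$", re.IGNORECASE)

skus = [value for value in values if sku_pattern.match(value)]

print(skus)
```

With a pandas column, the same pattern could be passed to Series.str.match together with flags=re.IGNORECASE to get a boolean mask.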

Look Around and re.sub()

I want to know how re.sub() works.
The following example is in a book I am reading.
I want "1234567890" to be "1,234,567,890".
pattern = re.compile(r"\d{1,3}(?=(\d{3})+(?!\d))")
pattern.sub(r"\g<0>,", "1234567890")
"1,234,567,890"
Then, I changed "\g<0>" to "\g<1>" and it did not work.
The result was "890,890,890,890".
Why?
I want to know exactly how capturing and replacing work in re.sub(), and how the lookahead mechanism works.
You get 890 repeated because \g<1> refers to Group 1, and each match (a chunk of 1 to 3 digits) is replaced with the value Group 1 held for that match, which is always 890.
The reason is that (\d{3})+ inside the lookahead captures three-digit chunks one by one up to the end of the number (because of the (?!\d) condition), and a repeated capturing group keeps only its last capture. So Group 1 is 890 for every match, and that is what replaces each matched chunk. The final 890 itself is never matched, since no three-digit chunk follows it, so it stays in place.
See visualization at regex101.com.
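The behaviour can also be inspected directly in Python. This sketch lists what each match consumed (group 0) next to what Group 1 captured inside the lookahead, and then runs both replacements from the question.

```python
import re

pattern = re.compile(r"\d{1,3}(?=(\d{3})+(?!\d))")
text = "1234567890"

# Each match consumes 1 to 3 digits; group 1 lives inside the lookahead and
# ends up holding the last three-digit chunk that the lookahead scanned.
matches = [(m.group(0), m.group(1)) for m in pattern.finditer(text)]
print(matches)  # [('1', '890'), ('234', '890'), ('567', '890')]

with_group_0 = pattern.sub(r"\g<0>,", text)
with_group_1 = pattern.sub(r"\g<1>,", text)

print(with_group_0)  # 1,234,567,890
print(with_group_1)  # 890,890,890,890
```

Only three matches are found, and the trailing 890 is copied through unchanged in both results.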
