Python Regex ignore specific string to find next example - python

I have the following code that strips the data in the current column and creates a secondary column containing just the code in parentheses. This works wonderfully in examples 2 and 3. However, in example 1, the date is picked up instead because it is also in parentheses. Is there a way to rework the code to ignore anything in parentheses that is a datestamp and continue to look for something else within that record? For example, in scenario 1: scan record one, ignore (2018-03), and select (256). The datasets we work with have 3, 4, 5 and various other record codes, but this date type is unique and can be removed.
Code:
df1['Doc ID'] = df['Folder Path'].str.extract(r'.*\((.*)\).*', expand=True)
Data Table:
# | current column | new column
1 | /reports/support + admin. (256)/ Global (2018-03) | (2018-03)
2 | /reports/limit/sector(139)/2017 | (139)
3 | /reports/sector/region(147,189 and 132)/2018 | (147,189 and 132)

You may use
df['Folder Path'].str.extract(r'\((?!\d{4}-\d{2}\)|Data Only\))([^()]*)\)',expand=True)
The regex matches
\( - an open parenthesis
(?!\d{4}-\d{2}\)|Data Only\)) - a negative lookahead that fails the match if there is
\d{4}-\d{2}\) - 4 digits, hyphen, 2 digits, )
| - or
Data Only\) - Data Only) substring
([^()]*) - Group 1: any 0 or more chars other than open/close parentheses
\) - a close parenthesis
See the regex demo.
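As a quick check, the pattern above can be run against the three sample paths with the standard `re` module. This is only a minimal sketch: `first_code` is a hypothetical helper name, and `re.search` is used as a stand-in for what `str.extract` returns per row (group 1 of the first match).

```python
import re

# Pattern from the answer: skip "(YYYY-MM)" and "(Data Only)" groups,
# capture the content of any other parenthesised group.
PATTERN = re.compile(r'\((?!\d{4}-\d{2}\)|Data Only\))([^()]*)\)')

# Sample paths taken from the question's data table.
paths = [
    "/reports/support + admin. (256)/ Global (2018-03)",
    "/reports/limit/sector(139)/2017",
    "/reports/sector/region(147,189 and 132)/2018",
]

def first_code(path):
    """Return the first non-date code in parentheses, or None (hypothetical helper)."""
    match = PATTERN.search(path)
    return match.group(1) if match else None

codes = [first_code(p) for p in paths]
print(codes)
```

Note that group 1 holds the text without the surrounding parentheses, so row 1 yields 256 rather than (256).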

Related

Regex to check if sequence of characters exists before or after delimiter

I have a string such as ID123456_SIT,UAT where ID###### will always be hardcoded.
I need a python regex that will allow me to check whether ID123456_ and (SIT or UAT) exists before (without a comma) or after a comma in a particular string.
Scenarios:
ID123456_SIT,UAT - should match with regex
ID123456_UAT,SIT - should match with regex
ID123456_SIT - should match with regex
ID123456_UAT - should match with regex
ID123456_TRA,SIT,UAT - should match with regex
As of right now, the following only works if exactly one comma is present (scenarios 1 & 2). It does not work for single values with no comma (scenarios 3 & 4), nor when more than one comma is present, in which case I should be checking whether the word exists between any of the commas (scenario 5):
(^ID123456_)(SIT|UAT),(SIT|UAT) - works for Scenarios 1 & 2 only
Also open to other suggestions for solving the same problem: checking if ID123456 & SIT/UAT is present in a pandas column's values.
Thanks in advance!
You can use
^ID123456_(?=.*(?:SIT|UAT)).*
See the regex demo.
This matches
^ - start of string
ID123456_ - text that the string should start with
(?=.*(?:SIT|UAT)) - a positive lookahead: there must be either SIT or UAT after any zero or more chars other than line break chars, as many as possible
.* - the rest of the line.
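The pattern can be tried on the five scenarios from the question with plain `re`. The sketch below also adds two extra inputs that are not in the question, and a hypothetical token-based helper, `has_env`, as one possible "other suggestion": it splits on commas and compares whole tokens, which is an assumption about what counts as a match.

```python
import re

PATTERN = re.compile(r'^ID123456_(?=.*(?:SIT|UAT)).*')

samples = [
    "ID123456_SIT,UAT",
    "ID123456_UAT,SIT",
    "ID123456_SIT",
    "ID123456_UAT",
    "ID123456_TRA,SIT,UAT",
    "ID123456_TRA",      # extra case: neither SIT nor UAT present
    "ID123456_TRANSIT",  # extra case: "SIT" appears only inside another word
]

regex_hits = [bool(PATTERN.match(s)) for s in samples]

def has_env(value, prefix="ID123456_", wanted=("SIT", "UAT")):
    """Hypothetical alternative: require SIT or UAT as a whole comma-separated token."""
    if not value.startswith(prefix):
        return False
    tokens = value[len(prefix):].split(",")
    return any(token in wanted for token in tokens)

token_hits = [has_env(s) for s in samples]

for s, r, t in zip(samples, regex_hits, token_hits):
    print(f"{s:22} regex={r} tokens={t}")
```

The two approaches agree on the five scenarios; they differ only on the last input, where the lookahead accepts SIT as a substring of a longer word.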

Need help splitting a column in my DataFrame (Python)

I have a Python DataFrame "dt". One of its columns, "betName", is filled with objects that sometimes have +/- numbers after the names. I'm trying to figure out how to separate "betName" into 2 columns, "betName" & "line", where "betName" is just the name and "line" has the +/- number or regular number.
Please see screenshots, thank you for helping!
example of problem and desired result
dt["betName"]
Try this (updated) code:
df2=df['betName'].str.split(r' (?=[+-]\d{1,}\.?\d{,}?)', expand=True).astype('str')
Explanation. You can use str.split to split a text in the rows into 2 or more columns by regular expression:
(?=[+-]\d{1,}\.?\d{,}?)
' ' - Space char is the first.
() - Indicates the start and end of a group.
?= - Lookahead assertion. Matches if ... matches next, but doesn’t consume any of the string.
[+-] - a set of characters. It will match + or -.
\d{1,} - \d is a digit from 0 to 9, and {min,max} sets how many times it repeats. Here it means one or more digits: 1, 200, 4000, etc.
\.? - \. for a dot and ? - 0 or 1 repetitions of the preceding expression group or symbol.
\d{,}? - zero or more digits, as few as possible (the trailing ? makes the repetition lazy).
str.split(pat=None, n=-1, expand=False)
pat - string or regular expression to split on. If not specified, split on whitespace.
n - limit on the number of splits in the output. None, 0 and -1 are interpreted as "return all splits".
expand - whether to expand the split strings into separate columns.
True places the split pieces into different columns.
False returns a Series/Index containing lists of strings.
.astype('str') converts the resulting DataFrame to string type.
The output.
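The splitting idea can be sketched with `re.split` on a few made-up values (the real data was only shown in a screenshot, so the sample names are hypothetical). The sketch uses a shortened lookahead, `(?=[+-]\d)`: since a lookahead only tests what follows, the optional decimal part in the original pattern does not change where the split happens.

```python
import re

# Hypothetical sample values standing in for the "betName" column.
bet_names = ["Lakers -3.5", "Celtics +7", "Over 45.5", "Draw"]

# Split on a space only when it is followed by a sign and a digit.
SPLIT_RE = re.compile(r' (?=[+-]\d)')

rows = [SPLIT_RE.split(name) for name in bet_names]
for name, parts in zip(bet_names, rows):
    print(name, "->", parts)
```

Values without a signed number, such as "Over 45.5" here, are left unsplit by this pattern.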
EDIT: Added a split before doing the regex. This applies the regex only to the cell information that comes after the last white space.
I think you need to extract the bet information with a regular expression.
df["line"] = df["betName"].apply(lambda x: x.split()[-1]).str.extract('([0-9.+-]+)')
Here's how the regex works: the () sets up a capture group, i.e. specifies what information you want to extract.
The stuff inside the square brackets is a character class, so here it matches any digit from 0-9, + or - signs and a full stop.
The plus sign after the square brackets means match one or more repetitions of anything in the character class.
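The same two steps (take the last whitespace-separated token, then pull out the numeric part) can be sketched without pandas. `extract_line` is a hypothetical helper and the sample values are made up, since the original data was only shown in a screenshot.

```python
import re

LINE_RE = re.compile(r'([0-9.+-]+)')

def extract_line(bet_name):
    """Return the numeric part of the last token, or None (hypothetical helper)."""
    last_token = bet_name.split()[-1]   # assumes a non-empty string
    match = LINE_RE.search(last_token)
    return match.group(1) if match else None

# Hypothetical sample values standing in for the "betName" column.
bet_names = ["Lakers -3.5", "Celtics +7", "Over 45.5", "Draw"]
lines = [extract_line(name) for name in bet_names]
print(lines)
```

Unlike the lookahead split, this variant also picks up unsigned numbers. In pandas, rows without a match come back as NaN rather than None.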

delete a part of string before a specific pattern

I have a pandas dataframe with a column from which I have to retrieve specific names. The only problem is that those names are not always in the same place, and the values in that column do not all have the same length, so I cannot use the split function. However, I have noticed that before those names there is always a combination of 4 to 7 digits. I believe it's the identifier for the name.
So how can I use regular expression to go through that column and retrieve the names I need.
Here is a example from the jupyter notebook:
df['info']
csx_Gb009_broken screen_231400_Iphone 7
000345_SamsungS8_tfes_Vodafone_is56t34_3G
Ins45_56003_Huawei P8_
What I want is something like this:
df['Phones']
Iphone 7
SamsungS8
Huawei P8
I want to have something like the above, knowing that those names come after a combination of 4 to 7 digits followed by an underscore.
You may use
df['Phones'] = df['info'].str.extract(r'\d{4}_([^_]+)')
The pattern matches:
\d{4} - 4 digits
_ - an underscore
([^_]+) - Capturing group 1 (this value will be returned by str.extract): one or more chars other than _.
See the regex demo.
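A short check of the pattern against the three sample rows from the question, using plain `re` (`re.search` stands in for the per-row behaviour of `str.extract`; `phone_name` is a hypothetical helper name):

```python
import re

PATTERN = re.compile(r'\d{4}_([^_]+)')

# Sample rows taken from the question.
info = [
    "csx_Gb009_broken screen_231400_Iphone 7",
    "000345_SamsungS8_tfes_Vodafone_is56t34_3G",
    "Ins45_56003_Huawei P8_",
]

def phone_name(value):
    """Return the text after the first '<4 digits>_' up to the next underscore."""
    match = PATTERN.search(value)
    return match.group(1) if match else None

phones = [phone_name(v) for v in info]
print(phones)
```

Requiring only four digits is enough here: a longer identifier such as 231400 still ends in four digits followed by an underscore, while shorter runs like 009 or 45 are skipped.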

Regex to python regex

I have a lot of file names with the pattern SURENAME__notalwaysmiddlename_firstnames_1230123Abc123-16x_notalways.pdf, e.g.:
SMITH_John_001322Cde444-16v_HA.pdf
FLORRICK-DOILE_Debora_Alicia_321333Gef213-16p.pdf
ROBINSON-SMITH_Maria-Louise_321333Gef213-16p_GH.pdf
My old regex was ([\w]*)_([\w-\w]+)\.\w+ but after switching to Python and getting the first double-barrelled surnames (and even in the first names) I'm unable to get it running.
With the old regex I got two groups:
SMITH_John
001322Cde444-16v_HA
But now I have no clue how to achieve this with re and even include the occasional double-barrelled names in group 1 and the ID in group 2.
([A-Z-]+)(?:_([A-Za-z-]+))?_([A-Za-z-]+)_(\d.*)\.
This pattern will return the surname, potential middle name, first name, and final string.
([A-Z-]+) returns an upper-case word that can also contain -
(?:_([A-Za-z-]+))? returns 0 or 1 matches of a word preceded by an _. The (?: ... ) wrapper is non-capturing, so the _ is left out of the captured group
([A-Za-z-]+) returns a word that can also contain -
(\d.*) returns a string that starts with a digit
\. finds the escaped period right before the file type
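The pattern can be tried on the three file names from the question with `re.match`. The sketch writes the letter class as `[A-Za-z-]`; a range such as `[A-z]` would also admit `_` and a few punctuation characters that sit between Z and a in ASCII. `parse_name` is a hypothetical helper name.

```python
import re

PATTERN = re.compile(r'([A-Z-]+)(?:_([A-Za-z-]+))?_([A-Za-z-]+)_(\d.*)\.')

# File names taken from the question.
files = [
    "SMITH_John_001322Cde444-16v_HA.pdf",
    "FLORRICK-DOILE_Debora_Alicia_321333Gef213-16p.pdf",
    "ROBINSON-SMITH_Maria-Louise_321333Gef213-16p_GH.pdf",
]

def parse_name(filename):
    """Return the four groups as a tuple, or None if the name does not fit."""
    match = PATTERN.match(filename)
    return match.groups() if match else None

parsed = [parse_name(f) for f in files]
for f, groups in zip(files, parsed):
    print(f, "->", groups)
```

When only one name follows the surname, the optional second group is None and the name lands in the third group.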

Python regex matching only if digit

Given the regex and the word below, I want to match the part after the - (which can also be a _ or a space) only if the part after the delimiter is a digit and nothing comes after it (I basically want it to be a number and a number only). I am using group statements but it just doesn't seem to work right. It keeps matching the 3 at the beginning (or the 1 at the end if I modify it a bit). How do I achieve this (by using grouping)?
Target word: BR0227-3G1
Regex: ([A-Z]*\s?[0-9]*)[\s_-]*([1-9][1-9]*)
It should not match 3G1, G1 , 1G
It should match only pure numbers like 3,10, 2 etc.
Here is also a helper web site for evaluating the regex: http://www.pythonregex.com/
More examples:
It should match:
BR0227-3
BR0227 3
BR0227_3
into groups (BR0227) (3)
It should only match (BR0227) for
BR0227-3G1
BR0227-CS
BR0227
BR0227-
I would use
re.findall(r'^([A-Z]*\s?[0-9]*)[\s_-]*([1-9][1-9]*$)?', text)
Each string starts with the first group and ends with the last group, so the ^ and $ anchors can assist in capture. The $ at the end requires the digits to run to the end of the string, but the second group is optional, so the first group can still be captured on its own.
Since you want the start and (possible) end of the word in groups, then do this:
r'\b([A-Z0-9]+)(?:[ _-](\d+))?\b'
This will put the first part of the word in the first group, and optionally the remainder in the second group. The second group will be None if it didn't match.
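A minimal sketch running the `\b...\b` pattern over the examples from the question with `re.match`, which only looks for a match at the start of each string (`split_code` is a hypothetical helper name):

```python
import re

PATTERN = re.compile(r'\b([A-Z0-9]+)(?:[ _-](\d+))?\b')

def split_code(word):
    """Return (prefix, number-or-None) for a word, or None if nothing matches."""
    match = PATTERN.match(word)
    return match.groups() if match else None

# Examples taken from the question.
should_split = ["BR0227-3", "BR0227 3", "BR0227_3"]
prefix_only = ["BR0227-3G1", "BR0227-CS", "BR0227", "BR0227-"]

split_results = [split_code(w) for w in should_split]
prefix_results = [split_code(w) for w in prefix_only]
print(split_results)
print(prefix_results)
```

For BR0227-3G1 the optional group is first tried on -3, but the closing \b fails between 3 and G, so the engine falls back to matching the prefix alone.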
This should match anything followed by '-', ' ', or '_' with only digits after it.
(.*)[- _](\d+)
