Extract multiple Words using regex on excel data - python

I have a dataset in Excel:
column1
Bank A : 12
Bank B : 40
Bank C : 55
where it contains only a single row with Bank A, B and C information inside one cell.
How would I be able to use regex in Python to create 3 columns whereby my new dataset is:
Bank A Bank B Bank C
12 40 55
Thank You!

You could do it with the following regex:
(.*?)\s:\s(\d+)
Or using this regex to be more forgiving with the spaces before and after the :
(.*?)(?:\s+)?:(?:\s+)?(\d+)
Explanation:
(.*?) # Group 1: lazily match any characters
\s:\s # up to a whitespace + : + whitespace
(\d+) # Group 2: match one or more digits
Then with your python code you can access contents of Group 1 and 2 using the Match.group() method and build the columns as you need.
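As a minimal sketch of that last step, assuming the cell's text has already been read into a Python string (the names cell, pairs and row are illustrative and not from the question), re.findall returns the Group 1 and Group 2 values as tuples:

```python
import re

# Assumed cell content: the three entries separated by newlines.
cell = "Bank A : 12\nBank B : 40\nBank C : 55"

# The more forgiving pattern from above: whitespace around ':' is optional.
pattern = re.compile(r"(.*?)(?:\s+)?:(?:\s+)?(\d+)")

# findall returns one (Group 1, Group 2) tuple per match.
pairs = pattern.findall(cell)

# Labels become column names, digits become the values of a single row.
row = {label.strip(): int(digits) for label, digits in pairs}
print(row)
```

Passing [row] to pd.DataFrame would then give a one-row frame with the columns Bank A, Bank B and Bank C.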

Related

PYTHON - .replace function

I have a DF similar to the below:
Name     Text
Michael  66l additional text
John     55i additional text
Mary     88l additional text
Wherever "l" occurs in the first word of the "Text" column, I want to replace it with "P".
Current code
DF['Text'] = DF['Text'].replace({"l", "P", 1})
Desired Outcome
Name     Text
Michael  66P additional text
John     55i additional text
Mary     88P additional text
You can use pandas.Series.str.replace with regex to identify the first word of the string.
>>> import pandas as pd
>>> df = pd.DataFrame({"Text": ["66l additional text", "55i additional text", "88l additional text"]})
>>> df
                  Text
0  66l additional text
1  55i additional text
2  88l additional text
>>> df['Text'] = df['Text'].str.replace(r"^\w+\b", lambda x: x.group(0).replace("l", "P"), regex=True)
>>> df
                  Text
0  66P additional text
1  55i additional text
2  88P additional text
Assuming the l occurs only once in the first word (as in your sample dataframe), you can use
df['Text'].str.replace(r'^(\S*)l', r'\1P', regex=True)
# => 0 66P additional text
# 1 55i additional text
# 2 88P additional text
# Name: Text, dtype: object
See the regex demo. Details:
^ - start of string
(\S*) - Group 1: zero or more non-whitespace characters
l - an l char (letter).
The replacement is \1P, i.e. the Group 1 value + P letter.
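The two regex answers above behave the same on the sample rows but differ when the first word contains more than one "l". The following sketch uses plain re.sub on a list of strings instead of a DataFrame; the function names are made up for illustration:

```python
import re

texts = ["66l additional text", "55i additional text", "88l additional text"]

def replace_all_in_first_word(text):
    # Replace every "l" inside the leading run of word characters.
    return re.sub(r"^\w+\b", lambda m: m.group(0).replace("l", "P"), text)

def replace_last_in_first_token(text):
    # Greedy \S* backtracks to the last "l" of the first token and swaps only that one.
    return re.sub(r"^(\S*)l", r"\1P", text)

for t in texts:
    print(replace_all_in_first_word(t), "|", replace_last_in_first_token(t))
```

On a value such as "l1l2 x" the first function changes both letters and the second changes only the last one, so the choice depends on which behaviour the data needs.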
With the samples shown, this can also be done with positional slicing through the .str accessor. The slice replaces the third character regardless of what it is, so apply it only to rows where that character is "l":
import pandas as pd
# Create your df here...
mask = df['Text'].str[2] == 'l'
df.loc[mask, 'Text'] = df['Text'].str[:2] + 'P ' + df['Text'].str[4:]
Explanation:
df['Text'].str[:2]: takes the first two characters of each value in the Text column (positions 0 and 1).
+ 'P ' +: concatenates "P" and a space, as the question requires.
df['Text'].str[4:]: takes everything from the fifth character (position 4) to the end of the value. The concatenated result is written back to the Text column for the rows selected by mask.

Replacing abbreviations with complete words in dataframe referring to a list

I have a dataframe (df) containing names with abbreviations like below:
Name
ABC CO
XYZ CO LTD
S.A.L.P, S.P.A.
XXX L.P
NUR YER SAN.TIC.LTD
BAAB TERMINALS LTD.
I have to replace the abbreviations with their complete words by referring to a list. Below was my approach:
import pandas as pd
repl = {'CO' : 'COMPANY','LTD' : 'LIMITED','L.P' : 'LIMITED PARTNERSHIP','LTD.' : 'LIMITED','.LTD' : 'LIMITED'}
repl = {rf'\b{k}\b': v for k, v in repl.items()}
df2 = df['Name'].replace(repl, regex=True)
df2
Below is the output;
0 ABC COMPANY
1 XYZ COMPANY LIMITED
2 S.A.LIMITED PARTNERSHIP, S.P.A.
3 XXX LIMITED PARTNERSHIP
4 NUR YER SAN.TIC.LTD
5 BAAB TERMINALS LIMITED.
Name: Name, dtype: object
Here, S.A.L.P must not be replaced by the L.P rule.
Expected output :
0 ABC COMPANY
1 XYZ COMPANY LIMITED
2 S.A.L.P, S.P.A.
3 XXX LIMITED PARTNERSHIP
4 NUR YER SAN.TIC.LIMITED
5 BAAB TERMINALS LIMITED.
Name: Name, dtype: object
The code should replace L.P with LIMITED PARTNERSHIP only when it appears as a separate token, not when it is part of a longer string. Can anyone help me with this issue, please? Thanks.
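You may be able to use this regex with lookarounds, which make sure there is no non-whitespace character immediately before or after the key (re.escape requires import re):
repl = {rf'(?<!\S){re.escape(k)}(?!\S)': v for k, v in repl.items()}
Here:
(?<!\S): asserts that the previous character is not a non-whitespace character, i.e. it is whitespace or the start of the string
(?!\S): asserts that the next character is not a non-whitespace character, i.e. it is whitespace or the end of the string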
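A small sketch of how these whitespace boundaries behave, using plain re on individual strings rather than a DataFrame. The helper name expand and the reduced mapping are assumptions made for illustration:

```python
import re

# Subset of the mapping from the question.
repl = {"CO": "COMPANY", "LTD": "LIMITED", "L.P": "LIMITED PARTNERSHIP", "LTD.": "LIMITED"}

# Each key must be surrounded by whitespace or the string boundaries.
patterns = {re.compile(rf"(?<!\S){re.escape(k)}(?!\S)"): v for k, v in repl.items()}

def expand(name):
    # Apply every replacement in turn; only whole whitespace-delimited tokens match.
    for pattern, full in patterns.items():
        name = pattern.sub(full, name)
    return name

for name in ["ABC CO", "XYZ CO LTD", "S.A.L.P, S.P.A.", "XXX L.P"]:
    print(expand(name))
```

Because only whole tokens match, a value such as NUR YER SAN.TIC.LTD is left unchanged by this approach, and LTD. is replaced together with its trailing dot.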
Put spaces before and after the words, e.g. for L.P:
repl = {'CO' : 'COMPANY','LTD' : 'LIMITED',' L.P ' : ' LIMITED PARTNERSHIP '}

Python Regular expression of group to match text before amount

I am trying to write a python regular expression which captures multiple values from a few columns in dataframe. Below regular expression attempts to do the same. There are 4 parts of the string.
group 1: Date - month and day
group 2: Date - month and day
group 3: description text before amount i.e. group 4
group 4: amount - this group is optional
Some peculiar conditions for group 3 (the text):
(1) The text itself might contain characters like "-" and "$", so we cannot use - and $ as the boundary of the text.
(2) The text (group 3) may not always be followed by an amount.
(3) The whitespace between groups 3 and 4 is optional.
Below is the Python function, which takes a dataframe with 4 columns c1, c2, c3, c4 and adds the columns dt, txt and amt to the dataframe after processing.
def parse_values(args):
    re_1='(([JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC]{3}\s{0,}[\d]{1,2})\s{0,}){2}(.*[\s]|.*[^\$]|.*[^-]){1}([-+]?\$[\d|,]+(?:\.\d+)?)?'
    srch=re.search(re_1, args[0])
    if srch is None:
        return args
    m = re.match(re_1, args[0])
    args['dt']=m.group(1)
    args['txt']=m.group(3)
    args['amt']=m.group(4)
    if m.group(4) is None:
        if pd.isnull(args['c3']):
            args['amt']=args.c2
        else:
            args['amt']=args.c3
    return args
To test the results, I have the 6 rows below, which need to return a properly formatted amt column.
tt=[{'c1':'OCT 7 OCT 8 HURRY CURRY THORNHILL ','c2':'$16.84'},
{'c1':'OCT 7 OCT 8 HURRY CURRY THORNHILL','c2':'$16.84'},
{'c1':'MAR 15 MAR 16 LOBLAWS FOODS INC - EAST YORK -$80,00,7770.70'},
{'c1':'MAR 15 MAR 16 LOBLAWS FOODS INC - EAST YORK-$2070.70'},
{'c1':'MAR 15 MAR 16 LOBLAWS FOODS INC - EAST YORK$2070.70'},
{'c1':'MAR 15 MAR 16 LOBLAWS FOODS INC - EAST YORK $80,00,7770.70'}
]
t=pd.DataFrame(tt,columns=['c1','c2','c3','c4'])
t=t.apply(parse_values,1)
t
However, due to the error in my regular expression re_1, the amt and txt columns are not parsed properly: they return NaN or miss some words (as depicted in some rows of the output image below).
How about this:
(((?:JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)\s*[\d]{1,2})\s*){2}(.*?)\s*(?=[\-$])([-+]?\$[\d|,]+(?:\.\d+)?)
As seen at regex101.com
Explanation:
First off, I've shortened the regex by changing a few minor details, such as using \s* instead of \s{0,}, which means exactly the same thing.
The [JAN|...|DEC] part was using a character class, i.e. [], which matches only a single character from the set. A non-capturing group with alternation, (?:JAN|...|DEC), is the correct way to choose between multi-letter alternatives, which in your case are the months.
The meat of the regex: LOOKAHEADS
(?=[\-$]) tells the regex that the text before it, captured lazily by (.*?), should extend only until it reaches a position followed by a dash or a dollar sign where the amount pattern also matches. Lookaheads don't consume what they look for; they only assert that the lookahead's argument follows that position.
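To see how the suggested pattern splits the sample rows, here is a minimal sketch that runs it outside pandas. The pattern is the one from this answer, only split across lines; the helper name split_row is made up for illustration:

```python
import re

MONTHS = "JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC"
PATTERN = re.compile(
    r"(((?:" + MONTHS + r")\s*[\d]{1,2})\s*){2}"  # two "MON dd" dates
    r"(.*?)\s*"                                    # group 3: description (lazy)
    r"(?=[\-$])"                                   # next character must be - or $
    r"([-+]?\$[\d|,]+(?:\.\d+)?)"                  # group 4: amount
)

def split_row(c1):
    """Return (description, amount), or None when the pattern does not match."""
    m = PATTERN.search(c1)
    if m is None:
        return None
    return m.group(3), m.group(4)

print(split_row("MAR 15 MAR 16 LOBLAWS FOODS INC - EAST YORK -$80,00,7770.70"))
print(split_row("OCT 7 OCT 8 HURRY CURRY THORNHILL "))
```

In this version the amount is required, so rows whose amount sits in another column (the first two sample rows) do not match and still need the fallback to c2 or c3 from the question's function.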

Match words only if preceded by specific pattern

I have a string from a NWS bulletin:
LTUS41 KCAR 141558 AAD TMLB Forecast for the National Parks
KHNX 141001 RECHNX Weather Service San Joaquin Valley
My aim is to extract a couple fields with regular expressions. In the first string I want "AAD" and from the second string I want "RECHNX". I have tried:
( )\w{3} #for the first string
and
\w{6} #for the 2nd string
But these find all 3 and 6 character strings leading up to the string I want.
Assuming the fields you want to extract are always in capital letters and preceded by 6 digits and a space, this regular expression would do the trick:
(?<=\d{6}\s)[A-Z]+
Demo: https://regex101.com/r/dsDHTs/1
Edit: if you want to match up to two alpha-numeric uppercase words preceded by 6 digits, you can use:
(?<=\d{6}\s)([A-Z0-9]+\b)\s(?:([A-Z0-9]+\b))*
Demo: https://regex101.com/r/dsDHTs/5
If you have a specific list of valid fields, you could also simply use:
(AAD|TMLB|RECHNX|RR4HNX)
https://regex101.com/r/dsDHTs/3
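As a quick check of the first pattern in Python, the sketch below applies it to the two sample lines. The helper name extract_field is hypothetical:

```python
import re

lines = [
    "LTUS41 KCAR 141558 AAD TMLB Forecast for the National Parks",
    "KHNX 141001 RECHNX Weather Service San Joaquin Valley",
]

# Lookbehind: the match must be preceded by six digits and one whitespace character.
field = re.compile(r"(?<=\d{6}\s)[A-Z]+")

def extract_field(line):
    # Return the first uppercase word that follows the six-digit group, if any.
    m = field.search(line)
    return m.group(0) if m else None

codes = [extract_field(line) for line in lines]
print(codes)
```

The lookbehind works with Python's re module because it has a fixed width (six digits plus one whitespace character).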
Since the substring you want to extract is a word that follows a number, separated by a space, you can use re.search with the following regex (given your input stored in s):
re.search(r'\b\d+ (\w+)', s).group(1)
To read the first groups of word characters from each line, you can use a pattern like
(\w+) (\w+) (\w+) (\w+)
Then read group 4 from the first line and group 3 from the second line.
The following program prints the four groups from each source line:
import re
txt = """LTUS41 KCAR 141558 AAD TMLB Forecast for the National Parks
KHNX 141001 RECHNX Weather Service San Joaquin Valley"""
n = 0
pat = re.compile(r'(\w+) (\w+) (\w+) (\w+)')
for line in txt.splitlines():
    n += 1
    print(f'{n:2}: {line}')
    mtch = pat.search(line)
    if mtch:
        gr = [ mtch.group(i) for i in range(1, 5) ]
        print(f' {gr}')
The result is:
1: LTUS41 KCAR 141558 AAD TMLB Forecast for the National Parks
['LTUS41', 'KCAR', '141558', 'AAD']
2: KHNX 141001 RECHNX Weather Service San Joaquin Valley
['KHNX', '141001', 'RECHNX', 'Weather']

Insert space after the second or third capital letter python

I have a pandas dataframe containing addresses. Some are formatted correctly like 481 Rogers Rd York ON. Others have a space missing between the city quadrant and the city name, for example: 101 9 Ave SWCalgary AB or even possibly: 101 9 Ave SCalgary AB, where SW refers to south west and S to south.
I'm trying to find a regex that will add a space between second and third capital letters if they are followed by lowercase letters, or if there are only 2 capitals followed by lower case, add a space between the first and second.
So far, I've found that ([A-Z]{2,3}[a-z]) will match the situation correctly, but I can't figure out how to look back into it and sub at position 2 or 3. Ideally, I'd like to use an index to split the match at [-2:] but I can't figure out how to do this.
I found that re.findall('(?<=[A-Z][A-Z])[A-Z][a-z].+', '101 9 Ave SWCalgary AB')
will return the last part of the string and I could use a look forward regex to find the start and then join them but this seems very inefficient.
Thanks
You may use
df['Test'] = df['Test'].str.replace(r'\b([A-Z]{1,2})([A-Z][a-z])', r'\1 \2', regex=True)
See this regex demo
Details
\b - a word boundary
([A-Z]{1,2}) - Capturing group 1 (later referred with \1 from the replacement pattern): one or two uppercase letters
([A-Z][a-z]) - Capturing group 2 (later referred with \2 from the replacement pattern): an uppercase letter + a lowercase one.
If you want to specifically match city quadrants, you may use a bit more specific regex:
df['Test'] = df['Test'].str.replace(r'\b([NS][EW]|[NESW])([A-Z][a-z])', r'\1 \2', regex=True)
See this regex demo. Here, [NS][EW]|[NESW] matches N or S that are followed with E or W, or a single N, E, S or W.
Pandas demo:
>>> import pandas as pd
>>> df = pd.DataFrame({'Test':['481 Rogers Rd York ON',
...                            '101 9 Ave SWCalgary AB',
...                            '101 9 Ave SCalgary AB']})
>>> df['Test'].str.replace(r'\b([A-Z]{1,2})([A-Z][a-z])', r'\1 \2', regex=True)
0      481 Rogers Rd York ON
1    101 9 Ave SW Calgary AB
2     101 9 Ave S Calgary AB
Name: Test, dtype: object
You can use
([A-Z]{1,2})(?=[A-Z][a-z])
to capture the first (or first and second) capital letters, using a lookahead for a capital letter followed by a lowercase letter. Then replace the match with the first group and a space (s is the input string; naming it str would shadow the built-in):
re.sub(r'([A-Z]{1,2})(?=[A-Z][a-z])', r'\1 ', s)
https://regex101.com/r/TcB4Ph/1
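For comparison outside pandas, this sketch applies the general pattern and the quadrant-specific pattern from the answers above with re.sub. The function name add_space and the extra test address are made up for illustration:

```python
import re

# One or two capitals followed by a capitalised word.
GENERAL = re.compile(r"\b([A-Z]{1,2})([A-Z][a-z])")
# Only compass quadrants (N, E, S, W, NE, NW, SE, SW) followed by a capitalised word.
QUADRANT = re.compile(r"\b([NS][EW]|[NESW])([A-Z][a-z])")

def add_space(address, pattern=QUADRANT):
    # Re-emit both groups with a single space between them.
    return pattern.sub(r"\1 \2", address)

for address in ["481 Rogers Rd York ON", "101 9 Ave SWCalgary AB", "101 9 Ave SCalgary AB"]:
    print(add_space(address))
```

The two patterns agree on the sample addresses. They differ on prefixes that are not compass directions: the general pattern would also split a hypothetical XYork into X York, while the quadrant pattern leaves it alone.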
