Using str.extract with regex on pandas df column - python

I have some address info stored in a pandas df column, like below:
df['Addr']
LT 75 CEDAR WOOD 3RD PL
LTS 22,25 & 26 MULLINS CORNER
LTS 7 & 8
PT LT 22-23 JEFFERSON HIGHLANDS EXTENSION
I want to extract the lot information and create a new column, so for the example above my expected results are as below:
df['Lot']
75
22,25 & 26
7 & 8
22-23
This is my code:
df['Lot'] = df['Addr'].str.extract(r'\b(?:LOT|LT|LTS?) (\w+(?:-\d+)*)')
The results I'm getting are:
75
22
7
22-23
How can I modify my regex to get the expected results, if at all possible? Please advise.

You could use
\b(?:LOT|LTS?) (\d+(?:(?:[-,]| & )\d+)*)
Explanation
\b A word boundary
(?:LOT|LTS?) Match LOT or LT or LTS
( Capture group 1
\d+ Match 1+ digits
(?:(?:[-,]| & )\d+)* Optionally repeat either - or , or & followed by 1+ digits
) Close group 1
import pandas as pd

data = [
    "LT 75 CEDAR WOOD 3RD PL",
    "LTS 22,25 & 26 MULLINS CORNER",
    "LTS 7 & 8",
    "PT LT 22-23 JEFFERSON HIGHLANDS EXTENSION"
]
df = pd.DataFrame(data, columns=['Addr'])
df['Lot'] = df['Addr'].str.extract(r'\b(?:LOT|LTS?) (\d+(?:(?:[-,]| & )\d+)*)')
print(df)
Output
Addr Lot
0 LT 75 CEDAR WOOD 3RD PL 75
1 LTS 22,25 & 26 MULLINS CORNER 22,25 & 26
2 LTS 7 & 8 7 & 8
3 PT LT 22-23 JEFFERSON HIGHLANDS EXTENSION 22-23
If the -, , and & can all be surrounded by optional whitespace characters, you might shorten the pattern to:
\b(?:LOT|LTS?) (\d+(?:\s*[-,&]\s*\d+)*)\b
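A quick check of that variant (a minimal sketch; the whitespace-padded sample strings are made up for the demo):

import pandas as pd

data = ["LTS 22 , 25 & 26 MULLINS CORNER",
        "PT LT 22 - 23 JEFFERSON HIGHLANDS EXTENSION"]
df = pd.DataFrame(data, columns=['Addr'])
# \s*[-,&]\s* allows optional whitespace around -, , and &
df['Lot'] = df['Addr'].str.extract(r'\b(?:LOT|LTS?) (\d+(?:\s*[-,&]\s*\d+)*)\b')
print(df)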


str.findall returns all NA's

I have this df1 with a lot of different news articles. An example of a news article is this:
'Today is Monday Aug. 17 the 230th day of 2020 . There are 136 days left in the year . On August 17 2017 a van plowed through pedestrians along a packed promenade in the Spanish city of Barcelona killing 13 people and injuring 120 . A 14th victim died later from injuries . Another man was stabbed to death in a carjacking that night as the van driver made his getaway and a woman died early the next day in a vehicle-and-knife attack in a nearby coastal town . Six by police two more died when a bomb workshop exploded . In 1915 a mob in Cobb County Georgia lynched Jewish businessman Leo Frank 31 whose death sentence for the murder of 13-year-old Mary Phagan had been commuted to life imprisonment . Frank who d maintained his innocence was pardoned by the state of Georgia in 1986 . In 1960 the newly renamed Beatles formerly the Silver Beetles began their first gig in Hamburg West Germany Teamsters union president Jimmy Hoffa was sentenced in Chicago to five years in federal prison for defrauding his union s pension fund . Hoffa was released in 1971 after President Richard Nixon commuted his sentence for this conviction and jury tampering . In 1969 Hurricane Camille slammed into the Mississippi coast as a Category 5 storm that was blamed for 256 U.S. deaths three in Cuba . In 1978 the first successful trans-Atlantic balloon flight ended as Maxie Anderson Ben Abruzzo and Larry Newman landed In 1982 the first commercially produced compact discs a recording of ABBA s The Visitors were pressed at a Philips factory near Hanover West Germany .'
And I have this df2 with all the words from the news articles in the column "Word" with their corresponding LIWC category in the second column.
Data example:
data = {'Word': ['killing','even','guilty','brain'], 'Category': ['Affect', 'Adverb', 'Anx','Body']}
What I'm trying to do is calculate, for each article in df1, how many words of each category in df2 occur. So I want to create a column for each category mentioned in df2["Category"].
And it should look like this in the end:
Content | Achieve | Affiliation | Affect
article text here | 6 | 2 | 2
article text here | 2 | 43 | 2
article text here | 6 | 8 | 8
article text here | 2 | 13 | 7
Since it's all strings, I tried str.findall, but this returns all NA's for everything. This is what I tried:
from collections import Counter
liwc = df1['articles'].str.findall(fr"'({'|'.join(df2)})'") \
.apply(lambda x: pd.Series(Counter(x), index=df2["category"].unique())) \
.fillna(0).astype(int)
Either a pandas or an R solution would be equally great.
First flatten the df2 values into a dictionary, add word boundaries \b...\b, and pass the pattern to Series.str.extractall. Then map the matched words to categories with Series.map, create a DataFrame via reset_index, pass it to crosstab, and finally append the result to the original with DataFrame.join:
import pandas as pd

df1 = pd.DataFrame({'articles': ['Today is killing Aug. 17 the 230th day of 2020',
                                 'Today is brain Aug. 17 the guilty day of 2020 ']})
print(df1)
articles
0 Today is killing Aug. 17 the 230th day of 2020
1 Today is brain Aug. 17 the guilty day of 2020
If the Word column contains lists of values:
data = {'Word': [['killing'], ['even'], ['guilty'], ['brain']],
        'Category': ['Affect', 'Adverb', 'Anx', 'Body']}
df2 = pd.DataFrame(data)
print(df2)
Word Category
0 [killing] Affect
1 [even] Adverb
2 [guilty] Anx
3 [brain] Body
d = {x: b for a, b in zip(df2['Word'], df2['Category']) for x in a}
print(d)
{'killing': 'Affect', 'even': 'Adverb', 'guilty': 'Anx', 'brain': 'Body'}
If df2 instead contains plain strings:
data = {'Word': ['killing', 'even', 'guilty', 'brain'],
        'Category': ['Affect', 'Adverb', 'Anx', 'Body']}
df2 = pd.DataFrame(data)
print(df2)
      Word Category
0  killing   Affect
1     even   Adverb
2   guilty      Anx
3    brain     Body
d = dict(zip(df2['Word'], df2['Category']))
print(d)
{'killing': 'Affect', 'even': 'Adverb', 'guilty': 'Anx', 'brain': 'Body'}
import re
# thanks to Wiktor Stribiżew for improving this solution
pat = r"\b(?:{})\b".format("|".join(re.escape(x) for x in d))
df = df1['articles'].str.extractall(rf'({pat})')[0].map(d).reset_index(name='Category')
df = df1.join(pd.crosstab(df['level_0'], df['Category']))
print(df)
articles Affect Anx Body
0 Today is killing Aug. 17 the 230th day of 2020 1 0 0
1 Today is brain Aug. 17 the guilty day of 2020 0 1 1
You can craft a custom regex with named capturing groups and use str.extractall.
With your dictionary the custom regex would be '(?P<Affect>\\bkilling\\b)|(?P<Adverb>\\beven\\b)|(?P<Anx>\\bguilty\\b)|(?P<Body>\\bbrain\\b)'
Then groupby+max the notna results, convert to int and join to the original dataframe:
regex = '|'.join(fr'(?P<{k}>\b{v}\b)' for v,k in zip(*data.values()))
(df1.join(df1['articles'].str.extractall(regex, flags=2) # re.IGNORECASE
.notna().groupby(level=0).max()
.astype(int)
)
)
Output:
articles Affect Adverb Anx Body
0 Today is killing Aug. 17 the 230th day of 2020 1 0 0 0
1 Today is brain Aug. 17 the guilty day of 2020 0 0 1 1
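The question asks how many words of each category occur, and groupby+max only flags presence (0/1). If you want counts instead, a small variation on the snippet above (same regex and df1) is to aggregate with sum:

(df1.join(df1['articles'].str.extractall(regex, flags=2)  # re.IGNORECASE
          .notna().groupby(level=0).sum()  # sum counts every match per article
          .astype(int)
     )
)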

How to split 2 columns with patterns?

I have a dataset (df) with 2 columns of arbitrary length, and I need to split it up based on the value.
BUS                                                                             CODE
150 H.S.London-lon3 11£150 H.S.London-lon3 16£150 H.S.London-lon3 120           GERI
400 Airport Luton-ptr5 12£400 Airport Luton-ptr5 15£400 Airport Luton-ptr5 17   24£JTR
005 Plaza-cata-md6 08£005 Plaza-cata-md6 012£005 Plaza-cata-md6 18              78£TDE
I've been trying to split it to look like this:
bus  directions     zone  time  code  name
150  H.S.London     lon3  11    NaN   GERI
400  Airport Luton  ptr5  12    24    JTR
005  Plaza-cata     md6   08    78    TDE
So far, I have tried to split by patterns, but it isn't working, and I'm out of ideas for how to split it another way.
bus = r'(?P<bus>[\d]+) (?P<direction>[\w\W]+)-(?P<zone>[\w]+)'
code = r'(?P<code>[\S]+)£(?P<name>\d+)'
df.BUS.str.extract(bus).join(df.CODE.str.extract(code))
I was wondering if anyone had a good solution to this.
You can use .str.extract with regex pattern containing named capturing groups:
code = r'^(?P<code>\d+)?.*?(?P<name>[A-Za-z]+)'
bus = r'^(?P<bus>\d+)\s(?P<directions>.*?)-(?P<zone>[^\-]+)\s(?P<time>\d+)'
df['BUS'].str.extract(bus).join(df['CODE'].str.extract(code))
bus directions zone time code name
0 150 H.S.London lon3 11 NaN GERI
1 400 Airport Luton ptr5 12 24 JTR
2 005 Plaza-cata md6 08 78 TDE
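For a self-contained check, a sketch reconstructing a frame like the one in the question (the BUS strings are abbreviated):

import pandas as pd

df = pd.DataFrame({
    'BUS': ['150 H.S.London-lon3 11£150 H.S.London-lon3 16',
            '400 Airport Luton-ptr5 12£400 Airport Luton-ptr5 15',
            '005 Plaza-cata-md6 08£005 Plaza-cata-md6 012'],
    'CODE': ['GERI', '24£JTR', '78£TDE'],
})
code = r'^(?P<code>\d+)?.*?(?P<name>[A-Za-z]+)'
bus = r'^(?P<bus>\d+)\s(?P<directions>.*?)-(?P<zone>[^\-]+)\s(?P<time>\d+)'
print(df['BUS'].str.extract(bus).join(df['CODE'].str.extract(code)))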
You could use split:
For your code column:
new_cols = ['code', 'name']
df[new_cols] = df.CODE.str.split(pat='£', expand=True)
I'm sure you can find a way to do this for your first column; if you have duplicates, remove them after splitting. One catch is handled in the sketch below.
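For example, a hedged sketch of the CODE split; the shifting of rows without a £ separator is my own illustration, not part of the original answer:

import pandas as pd

df = pd.DataFrame({'CODE': ['GERI', '24£JTR', '78£TDE']})
parts = df['CODE'].str.split(pat='£', expand=True)
# rows without a separator land in the first column; move them to 'name'
no_sep = parts[1].isna()
parts.loc[no_sep, 1] = parts.loc[no_sep, 0]
parts.loc[no_sep, 0] = None
df[['code', 'name']] = parts
print(df)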

I can remove everything except this one thing in regex: "co-". Not sure how to deal with the dash

I have a column of counties that are prefixed by variations of Co, Co., Co-, county, etc., and I want to get rid of these and just leave the county name (in this case, Cork).
x = pd.DataFrame({'area':['Co Cork','Co. Cork','Co-Cork','Co- Cork','co. Cork','county cork']})
print(x)
area
0 Co Cork
1 Co. Cork
2 Co-Cork
3 Co- Cork
4 co. Cork
5 county cork
This is what I have tried; it removes most of them, except in the third row, where the dash is joined to the word Cork.
x['fixed'] = (x['area'].str.replace('co\s+', '', case=False)
              .str.replace('co.\s+', '', case=False)
              .str.replace('county\s+', '', case=False)
              .str.replace('co\-\s+ | co-$', '', case=False)
              .str.replace('co\.\s+', '', case=False)
              .str.replace('\W', '')
              .str.title())
print(x)
area fixed
0 Co Cork Cork
1 Co. Cork Cork
2 Co-Cork Cocork
3 Co- Cork Cork
4 co. Cork Cork
5 county cork Cork
I used the dollar sign $ here: .str.replace('co\-\s+ | co-$', '', case=False) to get strings that end in co-. I thought that would remove it. But I guess it's not working because it's a substring?
So, what extremely obvious thing am I doing wrong?
I suggest using
x['area'].str.replace(r'(?i)\bco(?:\b[-.]?|unty)\s*', '', regex=True).str.title()
Output:
>>> x['area'].str.replace(r'(?i)\bco(?:\b[-.]?|unty)\s*', '', regex=True).str.title()
0 Cork
1 Cork
2 Cork
3 Cork
4 Cork
5 Cork
Name: area, dtype: object
The (?i)\bco(?:\b[-.]?|unty)\s* pattern matches:
(?i) - case insensitive modifier
\b - word boundary
co - a literal substring
(?:\b[-.]?|unty) - a non-capturing group matching either
\b[-.]? - a word boundary and then an optional - or .
| - or
unty - the literal string unty
\s* - zero or more whitespace characters.
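A runnable check against the frame from the question (a sketch; note that pandas 2.0+ requires regex=True for pattern-based str.replace):

import pandas as pd

x = pd.DataFrame({'area': ['Co Cork', 'Co. Cork', 'Co-Cork',
                           'Co- Cork', 'co. Cork', 'county cork']})
# regex=True: treat the pattern as a regular expression (pandas >= 2.0)
x['fixed'] = x['area'].str.replace(r'(?i)\bco(?:\b[-.]?|unty)\s*', '', regex=True).str.title()
print(x)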

How to condense text cleaning steps into single Python function?

Newer programmer here, deeply appreciate any help this knowledgeable community is willing to provide.
I have a column of 140,000 text strings (company names) in a pandas dataframe on which I want to strip all whitespace everywhere in/around the strings, remove all punctuation, substitute specific substrings, and uniformly transform to lowercase. I then want to take the first 10 characters of each string and store them in a new dataframe column.
Here is a reproducible example.
import string
import pandas as pd
data = ["West Georgia Co",
        "W.B. Carell Clockmakers",
        "Spine & Orthopedic LLC",
        "LRHS Saint Jose's Grocery",
        "Optitech#NYCityScape"]
df = pd.DataFrame(data, columns=['co_name'])

def remove_punctuations(text):
    for punctuation in string.punctuation:
        text = text.replace(punctuation, '')
    return text

# applying the remove_punctuations function
df['co_name_transform'] = df['co_name'].apply(remove_punctuations)
# this next step replaces 'Saint' with 'st' to standardize,
# and I may want to make other substitutions but this is a common one.
df['co_name_transform'] = df.co_name_transform.str.replace('Saint', 'st')
# replace whitespace
df['co_name_transform'] = df.co_name_transform.str.replace(' ', '')
# make lowercase
df['co_name_transform'] = df.co_name_transform.str.lower()
# select first 0:10 of strings
df['co_name_transform'] = df.co_name_transform.str[0:10]
print(df)
co_name co_name_transform
0 West Georgia Co westgeorgi
1 W.B. Carell Clockmakers wbcarellcl
2 Spine & Orthopedic LLC spineortho
3 LRHS Saint Jose's Grocery lrhsstjose
4 Optitech#NYCityScape optitechny
How can I put all these steps into a single function like this?
def clean_text(df[col]):
    for co in co_name:
        do_all_the_steps
    return df[new_col]
Thank you
You don't need a function to do this. Try the following one-liner.
df['co_name_transform'] = df['co_name'].str.replace('[^A-Za-z0-9-]+', '', regex=True).str.replace('Saint', 'st').str.lower().str[0:10]
The final output will be:
co_name co_name_transform
0 West Georgia Co westgeorgi
1 W.B. Carell Clockmakers wbcarellcl
2 Spine & Orthopedic LLC spineortho
3 LRHS Saint Jose's Grocery lrhsstjose
4 Optitech#NYCityScape optitechny
You can do all the steps in the function you pass to the apply method:
import re
df['co_name_transform'] = df['co_name'].apply(lambda s: re.sub(r'[\W_]+', '', s).replace('Saint', 'st').lower()[:10])
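If you do want the steps wrapped in a single reusable function, as the question sketches, a minimal version (the function name and signature are just illustrative):

import re
import pandas as pd

def clean_text(df, col, new_col):
    # strip punctuation/whitespace, normalize 'Saint' -> 'st',
    # lowercase, then keep the first 10 characters
    df[new_col] = (df[col]
                   .apply(lambda s: re.sub(r'[\W_]+', '', s))
                   .str.replace('Saint', 'st')
                   .str.lower()
                   .str[:10])
    return df

# df is the frame built in the question
df = clean_text(df, 'co_name', 'co_name_transform')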
Another solution, similar to the previous one, but with the replacements collected in one dictionary, so you can add more items to replace. Note the difference: the previous solution keeps the first 10 characters of each string, while this one keeps the full strings and selects the first 10 rows.
data = (["West Georgia Co",
         "W.B. Carell Clockmakers",
         "Spine & Orthopedic LLC",
         "LRHS Saint Jose's Grocery"]
        + ["Optitech#NYCityScape"] * 9)
df = pd.DataFrame(data, columns=['co_name'])
to_replace = {'[^A-Za-z0-9-]+': '', 'Saint': 'st'}
for i in to_replace:
    df['co_name'] = df['co_name'].str.replace(i, to_replace[i], regex=True).str.lower()
df['co_name'][0:10]
Result:
0 westgeorgiaco
1 wbcarellclockmakers
2 spineorthopedicllc
3 lrhssaintjosesgrocery
4 optitechnycityscape
5 optitechnycityscape
6 optitechnycityscape
7 optitechnycityscape
8 optitechnycityscape
9 optitechnycityscape
Name: co_name, dtype: object
The previous solution (it keeps the first 10 characters of each string rather than the first 10 rows):
df['co_name_transform'] = df['co_name'].str.replace('[^A-Za-z0-9-]+', '', regex=True).str.replace('Saint', 'st').str.lower().str[0:10]
Result:
0 westgeorgi
1 wbcarellcl
2 spineortho
3 lrhssaintj
4 optitechny
5 optitechny
6 optitechny
7 optitechny
8 optitechny
9 optitechny
10 optitechny
11 optitechny
12 optitechny
Name: co_name_transform, dtype: object

Pandas remove characters between brackets [duplicate]

This question already has answers here:
How to remove parentheses and all data within using Pandas/Python?
(4 answers)
Closed 3 years ago.
I would like to remove the characters between [] and currently I am doing
df['Text'] = df['Text'].str.replace(r"\[.*\]", "")
But the output isn't desirable. Before, it is "[image] This document", and after, it is "******* This document", where each * is a whitespace character.
How do I get rid of this whitespace?
Edit 1
The Text column of df looks like below:
ID Text
0 REAL ESTATE LEASE THIS INDUSTRIAL REAL ESTAT...
5 Lease AureementMade and signed on the \ of Aug...
6 FIRST AMENDMENT OF LEASEDATE: August 31, 2001L...
8 [image: image0.jpg] Jack[image: image1.jb2] ...
9 [image: image0.jpg] ABC SALES Meeting 97...
14 FIRST AMENDMENT OF LEASETHIS FIRST AMENDMENT O...
17 [image: image0.tif] Deep ML LEASE SERVI...
22 [image: image0.jpg] F 15 083 EX [image: image1...
26 LEASE AGREEMENT—GROSS LEASEBASIC LEASE PROVISI...
28 [image: image0.jpg] 17. Medical VERIFICATION...
31 [image: image0.jpg] [image: image1.jb2] PLL 3...
32 SUBLEASETHIS SUBLEASE this “Sublease” made as ...
34 [image: image0.tif] Lease Agreement May 10, 20...
35 13057968.3 1 Initials: _____ _____ SECOND ...
42 [image: image0.jpg] Jack Dowson Buy Real MI...
46 Deep – Machine Learning LEASE B...
I would like to see
ID Text
0 REAL ESTATE LEASE THIS INDUSTRIAL REAL ESTAT...
5 Lease AureementMade and signed on the \ of Aug...
6 FIRST AMENDMENT OF LEASEDATE: August 31, 2001L...
8 Jack ...
9 ABC SALES Meeting 97...
14 FIRST AMENDMENT OF LEASETHIS FIRST AMENDMENT O...
17 Deep ML LEASE SERVI...
22 F 15 083 EX ...
26 LEASE AGREEMENT—GROSS LEASEBASIC LEASE PROVISI...
28 17. Medical VERIFICATION...
31 PLL 3...
32 SUBLEASETHIS SUBLEASE this “Sublease” made as ...
34 Lease Agreement May 10, 20...
35 13057968.3 1 Initials: _____ _____ SECOND ...
42 Jack Dowson Buy Real MI...
46 Deep – Machine Learning LEASE B...
Looks like you need .str.strip()
Ex:
df = pd.DataFrame({"ID": [1,2,3], "Text": ["[image: 123.jpg] This document", "[image: image.jpg] Readers of the article", "The agreement between [image: image.jpg] two parties"]})
df["Text"] = df["Text"].str.replace(r"(\s*\[.*?\]\s*)", " ").str.strip()
print(df)
Output:
0 This document
1 Readers of the article
2 The agreement between two parties
Name: Text, dtype: object
Add an optional space ( ?) to your regex, so the whole regex (the match part) should be:
r'\[.*\] ?'
Another hint: Your regex is enclosed in parentheses (a capturing group).
They are not needed. Remove them.
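Putting both hints together, a minimal sketch (the sample row is made up; the non-greedy .*? from the earlier answer avoids eating text between two bracket groups, and regex=True is needed on recent pandas):

import pandas as pd

df = pd.DataFrame({'Text': ['[image: image0.jpg] This document']})
# remove each bracketed chunk plus one optional trailing space
df['Text'] = df['Text'].str.replace(r'\[.*?\] ?', '', regex=True)
print(df)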
