split a string from a set of substrings in panda frames

I am working on a data set from various sources, and I have a column in the frame that captures, for example, city names. Certain city names have the name of the state in parentheses; others have the zip code. How do I get rid of that extra info and keep the city names only?
Example of data
In certain cases, there is no space between zip code and cityname.
Below is the desired frame:
Desired frame

You can use replace with regular expressions:
\s* matches zero or more whitespace characters
\d+ matches one or more digits
\(.*\) matches everything between an opening and a closing parenthesis, parentheses included
import pandas as pd

df = pd.DataFrame([{'City Name': 'Dallas'},
                   {'City Name': 'New York (NY)'},
                   {'City Name': 'West Orange 07052'},
                   {'City Name': 'Orlando (Florda)'},
                   {'City Name': 'Camdem (NJ)'},
                   {'City Name': 'Boston'},
                   {'City Name': 'Harrison07029'}])
# raw strings keep the regex backslashes from being read as string escapes
df['new'] = df['City Name'].replace([r'\s*\d+', r'\s*\(.*\)'], '', regex=True)
print(df)
           City Name          new
0             Dallas       Dallas
1      New York (NY)     New York
2  West Orange 07052  West Orange
3   Orlando (Florda)      Orlando
4        Camdem (NJ)       Camdem
5             Boston       Boston
6      Harrison07029     Harrison
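
The two patterns can also be folded into one. The following is a minimal sketch, not the answer's own code: it assumes the extra info always sits at the end of the string, and the name `suffix` is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'City Name': ['Dallas', 'New York (NY)', 'West Orange 07052',
                                 'Orlando (Florda)', 'Camdem (NJ)', 'Boston',
                                 'Harrison07029']})

# Optional whitespace, then either a run of digits or a parenthesised group,
# anchored at the end of the string (assumption: the extra info is a suffix).
suffix = r'\s*(?:\d+|\([^)]*\))\s*$'
df['new'] = df['City Name'].str.replace(suffix, '', regex=True)
print(df['new'].tolist())
```

Anchoring with `$` leaves digits or parentheses in the middle of a name untouched, which may or may not be what a given data set needs.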

Related

Find words and create new value in different column pandas dataframe with regex

suppose I have a dataframe which contains:
df = pd.DataFrame({'Name':['John', 'Alice', 'Peter', 'Sue'],
'Job': ['Dentist', 'Blogger', 'Cook', 'Cook'],
'Sector': ['Health', 'Entertainment', '', '']})
and I want to find all 'cooks', whether in capital letters or not and assign them to the column 'Sector' with a value called 'gastronomy', how do I do that? And without overwriting the other entries in the column 'Sector'? Thanks!

Here's one approach:
df.loc[df.Job.str.lower().eq('cook'), 'Sector'] = 'gastronomy'
print(df)
Name Job Sector
0 John Dentist Health
1 Alice Blogger Entertainment
2 Peter Cook gastronomy
3 Sue Cook gastronomy

Using Series.str.match with a regex and the inline flag (?i), which makes the match case-insensitive:
df.loc[df['Job'].str.match('(?i)cook'), 'Sector'] = 'gastronomy'
Output
Name Job Sector
0 John Dentist Health
1 Alice Blogger Entertainment
2 Peter Cook gastronomy
3 Sue Cook gastronomy
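
One detail worth checking: str.match only anchors the pattern at the start of the string, so longer job titles that begin with "cook" are matched too. The sketch below contrasts it with str.fullmatch; the extra row ('Ann', 'Cookbook editor') and the lowercase 'cook' are hypothetical data added for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Alice', 'Peter', 'Sue', 'Ann'],
                   'Job': ['Dentist', 'Blogger', 'Cook', 'cook', 'Cookbook editor'],
                   'Sector': ['Health', 'Entertainment', '', '', 'Publishing']})

# str.match anchors only at the start, so 'Cookbook editor' is flagged as well.
starts_with_cook = df['Job'].str.match('(?i)cook')

# str.fullmatch requires the whole string to match the pattern.
is_cook = df['Job'].str.fullmatch('cook', case=False)

# Only the rows in the mask are written; other Sector values are kept.
df.loc[is_cook, 'Sector'] = 'gastronomy'
print(df)
```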

Trying to use a list to populate a dataframe column

I have a dataframe (df) and I would like to create a new column called country. It is calculated by looking at the region column: where the region value is present in the EnglandRegions list, the country value is set to England; otherwise it is the value from the region column.
Please see below for my desired output:
    name  salary         region  B1salary  country
0  Jason   42000         London     42000  England
1  Molly   52000     South West            England
2   Tina   36000   East Midland            England
3   Jake   24000          Wales              Wales
4    Amy   73000  West Midlands            England
You can see that all the values in country are set to England except for the value assigned to Jake's record, which is set to Wales (as Wales is not in the EnglandRegions list). The code below produces the following error:
File "C:/Users/stacey/Documents/scripts/stacey.py", line 20
df['country'] = np.where((df.loc[df['region'].isin(EnglandRegions)),'England', df['region'])
^
SyntaxError: invalid syntax
The code is as follows:
import pandas as pd
import numpy as np
EnglandRegions = ["London", "South West", "East Midland", "West Midlands", "East Anglia"]
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
'salary': [42000, 52000, 36000, 24000, 73000],
'region': ['London', 'South West', 'East Midland', 'Wales', 'West Midlands']}
df = pd.DataFrame(data, columns = ['name', 'salary', 'region'])
df['B1salary'] = np.where((df['salary']>=40000) & (df['salary']<=50000) , df['salary'], '')
df['country'] = np.where((df.loc[df['region'].isin(EnglandRegions)),'England', df['region'])
print(df)

The specific issue the error is referencing is that you are missing a ] to close your .loc[. However, fixing only that won't work either: df.loc[...] returns the filtered rows, while np.where needs a boolean condition with one entry per row. Try:
df['country'] = np.where(df['region'].isin(EnglandRegions), 'England', df['region'])
This is essentially what you already had in the line above it for B1salary.
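
For reference, here is a self-contained sketch of that fix using the data from the question. The names `in_england` and `country_alt` are introduced here for illustration only; the second assignment shows a pandas-only way to get the same column.

```python
import numpy as np
import pandas as pd

EnglandRegions = ["London", "South West", "East Midland", "West Midlands", "East Anglia"]
df = pd.DataFrame({'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
                   'salary': [42000, 52000, 36000, 24000, 73000],
                   'region': ['London', 'South West', 'East Midland', 'Wales', 'West Midlands']})

# isin returns a boolean Series with one entry per row.
in_england = df['region'].isin(EnglandRegions)

# np.where picks 'England' where the mask is True, the region otherwise.
df['country'] = np.where(in_england, 'England', df['region'])

# Series.where keeps values where the condition is True and replaces the rest,
# so the mask is inverted here.
df['country_alt'] = df['region'].where(~in_england, 'England')
print(df)
```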

Update column values based on other columns

I have a weak grasp of Pandas and not a strong understanding of Python.
I am wanting to update a column (d.Alias) based on the value of existing columns (d.Company and d2.Alias). d.Alias should be equal to d2.Alias if d2.Alias is a substring of d.Company.
Example datasets:
import numpy as np

d = {'Company': ['The Cool Company Inc', 'Cool Company, Inc', 'The Cool Company',
                 'The Shoe Company', 'Muffler Store', 'Muffler Store'],
     'Position': ['Cool Job A', 'Cool Job B', 'Cool Job C', 'Salesman',
                  'Sales', 'Technician'],
     'City': ['Tacoma', 'Tacoma', 'Tacoma', 'Boulder', 'Chicago', 'Chicago'],
     'State': ['AZ', 'AZ', 'AZ', 'CO', 'IL', 'IL'],
     'Alias': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]}
d2 = {'Company': ['The Cool Company, Inc.', 'The Shoe Company', 'Muffler Store LLC'],
      'Alias': ['Cool Company', np.nan, 'Muffler'],
      'First Name': ['Carol', 'James', 'Frankie'],
      'Last Name': ['Fisher', 'Smith', 'Johnson']}
The np.nan for The Shoe Company is because for that instance an alias is not necessary.
I have tried using .loc, for loops, while loops, pandas.where, numpy.where, and several variations of each with no desirable outcomes. When using a for loop, the end of d2.Alias was copied to all rows in d.Alias. I have not been able to reproduce that, however.
Previous posts that I have looked at which I wasn't able to get to work, or I didn't understand them: Conditionally fill column with value from another DataFrame based on row match in Pandas
pandas create new column based on values from other columns
Any help is greatly appreciated!
EDIT:
Expected output
Update:
After a few days of tinkering I reached the desired outcome. With Wen's response I had to change a couple of things.
First, I created a list from df2.Alias called aliases:
aliases = df2.Alias.unique()
Then, I had to remove the trailing .map(df2.set_index('Company').Alias) call. The line that generated my desired results:
df1['Alias'] = df1.Company.apply(lambda x: [process.extract(x, aliases, limit=1)][0][0][0])

Solution from fuzzywuzzy
from fuzzywuzzy import process
df1['Alias']=df1.Company.apply(lambda x :[process.extract(x, df2.Company, limit=1)][0][0][0]).map(df2.set_index('Company').Alias)
df1
Out[31]:
          Alias     City               Company    Position  State
0  Cool Company   Tacoma  The Cool Company Inc  Cool Job A     AZ
1  Cool Company   Tacoma     Cool Company, Inc  Cool Job B     AZ
2  Cool Company   Tacoma      The Cool Company  Cool Job C     AZ
3           NaN  Boulder      The Shoe Company    Salesman     CO
4       Muffler  Chicago         Muffler Store       Sales     IL
5       Muffler  Chicago         Muffler Store  Technician     IL

One approach is to loop through d2, which is presumably the much smaller dataframe, check which rows of d.Company contain each alias as a substring, and set d.Alias to that alias for those rows.
import pandas as pd
d = pd.DataFrame(d)
d2 = pd.DataFrame(d2)
for row in d2[d2.Alias.notnull()].itertuples():
    d.loc[d.Company.str.contains(row.Alias), 'Alias'] = row.Alias
print(d)
#          Alias     City               Company    Position  State
#0  Cool Company   Tacoma  The Cool Company Inc  Cool Job A     AZ
#1  Cool Company   Tacoma     Cool Company, Inc  Cool Job B     AZ
#2  Cool Company   Tacoma      The Cool Company  Cool Job C     AZ
#3           NaN  Boulder      The Shoe Company    Salesman     CO
#4       Muffler  Chicago         Muffler Store       Sales     IL
#5       Muffler  Chicago         Muffler Store  Technician     IL
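
Below is a self-contained variant of the same loop, offered as a sketch rather than a drop-in replacement. It makes two assumptions that are not in the answer above: the alias is treated as literal text (regex=False), so characters such as "." or "(" in an alias are not read as regex syntax, and the Alias column is created with object dtype so that string values can be assigned without dtype complaints.

```python
import numpy as np
import pandas as pd

d = pd.DataFrame({'Company': ['The Cool Company Inc', 'Cool Company, Inc',
                              'The Cool Company', 'The Shoe Company',
                              'Muffler Store', 'Muffler Store']})
d2 = pd.DataFrame({'Company': ['The Cool Company, Inc.', 'The Shoe Company',
                               'Muffler Store LLC'],
                   'Alias': ['Cool Company', np.nan, 'Muffler']})

# Start with an empty object column (assumption: avoids float-to-string upcasting).
d['Alias'] = pd.Series([None] * len(d), index=d.index, dtype=object)

# dropna skips companies that have no alias.
for alias in d2['Alias'].dropna():
    # Literal substring test; rows that contain the alias get it assigned.
    mask = d['Company'].str.contains(alias, regex=False)
    d.loc[mask, 'Alias'] = alias

print(d)
```

If two aliases match the same company, the later one in d2 wins, because each pass overwrites the previous assignment.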

(Python, Pandas) - How do I get everything to the left of a certain character?

I have a column, market_area that I want to abbreviate by keeping only the part of the string to the left of the hyphen.
For example, my data is like this:
import pandas as pd
tmp = pd.DataFrame({'market_area': ['San Francisco-Oakland-San Jose',
None,
'Dallas-Fort Worth',
'Los Angeles-Riverside-Orange County'],
'val': [1,2,3,4]})
My desired output would be:
['San Francisco', None, 'Dallas', 'Los Angeles']
I am able to split based on the hyphen:
tmp['market_area'].str.split('-')
But how do I extract only the part to the left of the hyphen?

You can extract the first element of each split list using .str[0]:
tmp.market_area.str.split('-').str[0]
Out[3]:
0 San Francisco
1 None
2 Dallas
3 Los Angeles
Name: market_area, dtype: object
Or use str.extract method with regex ^([^-]*).*, which captures the pattern until the first -:
tmp.market_area.str.extract('^([^-]*).*', expand=False)
Out[5]:
0 San Francisco
1 NaN
2 Dallas
3 Los Angeles
Name: market_area, dtype: object
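
Two further variants, sketched here as alternatives rather than as part of the answer above: limiting the split to the first hyphen with n=1, and using str.partition, which returns the text before the separator, the separator, and the rest as three columns. The names `left` and `left_alt` are made up for illustration.

```python
import pandas as pd

tmp = pd.DataFrame({'market_area': ['San Francisco-Oakland-San Jose',
                                    None,
                                    'Dallas-Fort Worth',
                                    'Los Angeles-Riverside-Orange County'],
                    'val': [1, 2, 3, 4]})

# n=1 stops splitting after the first hyphen; .str[0] takes the left part.
left = tmp['market_area'].str.split('-', n=1).str[0]

# partition returns a three-column frame; column 0 is the text before the first '-'.
left_alt = tmp['market_area'].str.partition('-')[0]

print(left)
print(left_alt)
```

In both cases the missing value in row 1 stays missing.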

How to split one column into multiple columns in Pandas using regular expression?

For example, if I have a home address like this:
71 Pilgrim Avenue, Chevy Chase, MD
in a column named 'address'. I would like to split it into columns 'street', 'city', 'state', respectively.
What is the best way to achieve this using Pandas ?
I have tried df[['street', 'city', 'state']] = df['address'].findall(r"myregex").
But the error I got is Must have equal len keys and value when setting with an iterable.
Thank you for your help :)

You can use str.split with the regex ,\s+ (a comma followed by one or more whitespace characters):
#borrowing sample from `Allen`
df[['street', 'city', 'state']] = df['address'].str.split(r',\s+', expand=True)
print (df)
address id street city \
0 71 Pilgrim Avenue, Chevy Chase, MD a 71 Pilgrim Avenue Chevy Chase
1 72 Main St, Chevy Chase, MD b 72 Main St Chevy Chase
state
0 MD
1 MD
And if you need to remove the address column, add drop:
df[['street', 'city', 'state']] = df['address'].str.split(r',\s+', expand=True)
df = df.drop('address', axis=1)
print (df)
id street city state
0 a 71 Pilgrim Avenue Chevy Chase MD
1 b 72 Main St Chevy Chase MD

df = pd.DataFrame({'address': {0: '71 Pilgrim Avenue, Chevy Chase, MD',
1: '72 Main St, Chevy Chase, MD'},
'id': {0: 'a', 1: 'b'}})
#if your address format is consistent, you can simply use a split function.
df2 = df.join(pd.DataFrame(df.address.str.split(',').tolist(),columns=['street', 'city', 'state']))
df2 = df2.applymap(lambda x: x.strip())
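
Since the question asks about regular expressions specifically, here is a sketch using str.extract with named groups, which become the column names. The pattern is an assumption about the address format (street, city, two-letter state, separated by commas); rows that do not fit it would come back as NaN.

```python
import pandas as pd

df = pd.DataFrame({'address': ['71 Pilgrim Avenue, Chevy Chase, MD',
                               '72 Main St, Chevy Chase, MD'],
                   'id': ['a', 'b']})

# Each named group becomes a column in the extracted frame.
pattern = r'^(?P<street>[^,]+),\s*(?P<city>[^,]+),\s*(?P<state>[A-Z]{2})$'
parts = df['address'].str.extract(pattern)

# Attach the new columns to the original frame by index.
df = df.join(parts)
print(df)
```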
