I have two dataframes:
df1:
locality
0 Chicago, IL
1 San Francisco, CA
2 Chic, TN
df2:
City County
0 San Francisco San Francisco County
1 Chic Dyer County
2 Chicago Cook County
I want to find all values (or their corresponding indices) in locality that start with each value in City so that I can eventually merge the two dataframes. The answer to this post is close to what I want to achieve, but it is greedy and only extracts one value; I want all matches, even if some are only approximate.
For example, I would like to know that "Chic" in df2 matches both "Chicago, IL" and "Chic, TN" in df1 (and/or that "Chicago, IL" matches both "Chic" and "Chicago")
So far, I've accomplished this by using pandas apply and a custom function:
def getMatches(row):
    matches = df1[df1['locality'].str.startswith(row['City'])]
    return matches.index.tolist(), matches['locality']
df2.apply(getMatches, axis=1)
0 ([1], [San Francisco, CA])
1 ([0, 2], [Chicago, IL, Chic, TN])
2 ([0], [Chicago, IL])
This works fine until both df1 and df2 are large (100,000+ rows), where I run into time and memory issues (even using Dask's parallelized apply). Is there a better alternative?
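One direction worth trying (a sketch of my own, not from the linked post): prefix matches form a contiguous block in a sorted array, so sorting df1 once lets np.searchsorted replace the per-row scan with two binary searches per city:
import numpy as np
import pandas as pd

# Sort the localities once, remembering the original df1 positions.
order = np.argsort(df1['locality'].to_numpy())
loc_sorted = df1['locality'].to_numpy()[order]

# Everything that starts with a given city sorts between the city itself and
# city + a sentinel character, assuming no locality contains '\uffff'.
lo = np.searchsorted(loc_sorted, df2['City'].to_numpy())
hi = np.searchsorted(loc_sorted, (df2['City'] + '\uffff').to_numpy())

# Hypothetical result column: the df1 indices matching each city.
df2['match_idx'] = [order[a:b].tolist() for a, b in zip(lo, hi)]
This does O((n + m) log n) work overall instead of scanning all of df1 for every row of df2.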
I have some address data like:
Address
Buffalo, NY, 14201
Stackoverflow Street, New York, NY, 99999
I'd like to split these into columns like:
Street City State Zip
NaN Buffalo NY 14201
Stackoverflow Street New York NY 99999
Essentially, I'd like to shift my strings over by one in each column in the result.
With Pandas I know I can split columns like:
import pandas as pd
df = pd.DataFrame(
data={'Address': ['Buffalo, NY, 14201', 'Stackoverflow Street, New York, NY, 99999']}
)
df[['Street','City','State','Zip']] = (
df['Address']
.str.split(',', expand=True)
.applymap(lambda col: col.strip() if col else col)
)
but need to figure out how to conditionally shift columns when my result is only 3 columns.
First, create a function that splits each row and reverses the resulting tokens. With a normal split, the NaN from a short row lands in the last column, where Zip should go; reversing the tokens puts Zip first, so the NaN lands at the Street end instead.
Then, apply it to all rows.
Then, rename the columns, because they will be integers.
Finally, put them back in the right order.
fn = lambda x: pd.Series([i.strip() for i in reversed(x.split(','))])
pad = df['Address'].apply(fn)
pad looks like this right now:
0 1 2 3
0 14201 NY Buffalo NaN
1 99999 NY New York Stackoverflow Street
Just need to rename the columns and flip the order back.
pad.rename(columns={0:'Zip',1:'State',2:'City',3:'Street'},inplace=True)
df = pad[['Street','City','State','Zip']]
Output:
Street City State Zip
0 NaN Buffalo NY 14201
1 Stackoverflow Street New York NY 99999
Use a bit of numpy magic to reorder the columns with None on the left (a stable sort keeps the non-null columns in their original order):
import numpy as np

df2 = df['Address'].str.split(',', expand=True)
df[['Street','City','State','Zip']] = df2.to_numpy()[
    np.arange(len(df))[:, None], np.argsort(df2.notna(), kind='stable')
]
Output:
Address Street City State Zip
0 Buffalo, NY, 14201 None Buffalo NY 14201
1 Stackoverflow Street, New York, NY, 99999 Stackoverflow Street New York NY 99999
Another idea: prepend as many commas as needed so every row has n-1 separators (here 3) before splitting:
df[['Street','City','State','Zip']] = (
df['Address'].str.count(',')
.rsub(4-1).map(lambda x: ','*x)
.add(df['Address'])
.str.split(',', expand=True)
)
Output:
Address Street City State Zip
0 Buffalo, NY, 14201 Buffalo NY 14201
1 Stackoverflow Street, New York, NY, 99999 Stackoverflow Street New York NY 99999
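Note that both split-based answers keep the leading space after each comma; if that matters, a cleanup pass (my addition) can strip it:
cols = ['Street', 'City', 'State', 'Zip']
df[cols] = df[cols].apply(lambda s: s.str.strip())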
Well, I found a solution, but I'm not sure if there is something more performant out there. Open to other ideas.
def split_shift(s: str) -> list[str]:
    split_str: list[str] = s.split(',')
    # If the split has only 3 items, shift things over by inserting an NA in front
    if len(split_str) == 3:
        split_str.insert(0, pd.NA)
    return split_str

df[['Street','City','State','Zip']] = pd.DataFrame(df['Address'].apply(split_shift).tolist())
Let's suppose I have a Series (or DataFrame) s1, for example a list of all Universities and Colleges in the USA:
University
0 Searcy Harding University
1 Angwin Pacific Union College
2 Fairbanks University of Alaska Fairbanks
3 Ann Arbor University of Michigan
And another Series (or DataFrame) s2, for example a list of all cities in the USA:
City
0 Searcy
1 Angwin
2 New York
3 Ann Arbor
And my desired output (basically an intersection of s1 and s2; Fairbanks appears because the full nationwide city list would contain it, even though the s2 excerpt above does not):
Uni City
0 Searcy
1 Angwin
2 Fairbanks
3 Ann Arbor
The thing is: I'd like to create a Series that consists of cities, but only those that have a university/college. My very first thought was to remove the "University" or "College" parts from s1, but it turns out that is not enough, as in the case of Angwin Pacific Union College. Then I thought of keeping only the first word, but that excludes Ann Arbor.
Finally, I got a series of all the cities, s2, and now I'm trying to use it as a filter (something similar to .contains() or .isin()), so that if a string in s1 (a uni name) contains any of the elements of s2 (a city name), only the city name is returned.
My question is: how to do it in a neat way?
I would try to build a list comprehension of cities that are contained in at least one university name:
pd.Series([i for i in s2 if s1.str.contains(i).any()], name='Uni City')
With your example data it gives:
0 Searcy
1 Angwin
2 Ann Arbor
Name: Uni City, dtype: object
Data Used
s = pd.DataFrame({'University': ['Searcy Harding University', 'Angwin Pacific Union College', 'Fairbanks University of Alaska Fairbanks', 'Ann Arbor University of Michigan']})
s2 = pd.DataFrame({'City': ['Searcy', 'Angwin', 'Fairbanks', 'Ann Arbor']})
Convert s2.City to a set for fast membership tests:
st = set(s2.City.unique())
Calculate s['Uni City'] using next() to return the first city found in each university name, with np.nan as the default when none matches:
import numpy as np

s['Uni City'] = s['University'].apply(lambda x: next((i for i in st if i in x), np.nan))
Outcome:
                                 University   Uni City
0                 Searcy Harding University     Searcy
1              Angwin Pacific Union College     Angwin
2  Fairbanks University of Alaska Fairbanks  Fairbanks
3          Ann Arbor University of Michigan  Ann Arbor
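A single-pass alternative (my addition, not part of the answers above): build one alternation pattern from the city list and let str.extract do a vectorized scan; longer names go first so 'Ann Arbor' beats any shorter overlapping name:
import re

# Escape the names in case a city contains regex metacharacters.
cities = sorted(s2['City'], key=len, reverse=True)
pattern = '(' + '|'.join(map(re.escape, cities)) + ')'
s['Uni City'] = s['University'].str.extract(pattern, expand=False)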
I have two pandas dataframes: one is a list of states, cities, and a capital flag, with a MultiIndex of (state, city); the other is a non-indexed (or default-indexed, if that's more appropriate) list of states and their capitals. I need to perform an inner join on the two and then also find out which items in the cities df are NOT in the join.
Cities:
capital
state city
Ohio Akron N
Toledo N
Columbus N
Colorado Boulder N
Denver N
States:
state city
0 West Virginia Charleston
1 Ohio Columbus
Inner join to find the capital of Ohio:
pd.merge(cities, states, on=['state', 'city'], how='inner')
state city capital
0 Ohio Columbus N
Now I need to get a df that includes everything in the cities df EXCEPT Columbus, Ohio. I've been looking at variations of .isin(), both with and without reset_index(), but I can't get it to work.
Code to create the cities and states dfs. I have set_index() as a separate call because if I try to do it when I create the df I get ValueError: Shape of passed values is (3, 3), indices imply (2, 3), and I haven't figured out a way around it.
cities = pd.DataFrame({'state': ['Ohio', 'Ohio', 'Ohio', 'Colorado', 'Colorado'],
                       'city': ['Akron', 'Toledo', 'Columbus', 'Boulder', 'Denver'],
                       'capital': ['N', 'N', 'N', 'N', 'N']},
                      columns=['state', 'city', 'capital'])
cities = cities.set_index(['state', 'city'])
states = pd.DataFrame({'state':['West Virginia', 'Ohio'], 'city':['Charleston', 'Columbus']})
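As an aside, one way around the construction error mentioned above (a sketch, my addition) is to build the MultiIndex explicitly and pass only the data column:
idx = pd.MultiIndex.from_tuples(
    [('Ohio', 'Akron'), ('Ohio', 'Toledo'), ('Ohio', 'Columbus'),
     ('Colorado', 'Boulder'), ('Colorado', 'Denver')],
    names=['state', 'city'])
cities = pd.DataFrame({'capital': ['N'] * 5}, index=idx)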
IIUC, you could use merge with how='outer' and indicator='source', and then keep only those that are 'left_only':
merge = cities.merge(states, on=['state', 'city'], how='outer', indicator='source')
result = merge[merge.source.eq('left_only')].drop('source', axis=1)
print(result)
Output
state city capital
0 Ohio Akron N
1 Ohio Toledo N
3 Colorado Boulder N
4 Colorado Denver N
As an alternative you could use isin, in the following way:
mask = ~cities.reset_index().city.isin(states.city)
print(cities[pd.Series(data=mask.values, index=cities.index)])
Output
capital
state city
Ohio Akron N
Toledo N
Colorado Boulder N
Denver N
The idea of the second approach is to create a boolean mask with an index matching the one in cities. A variation on the second approach is the following:
# move the MultiIndex into regular columns
re_indexed = cities.reset_index()
# build the mask
mask = ~re_indexed.city.isin(states.city)
# restore the MultiIndex
result = re_indexed[mask].set_index(['state', 'city'])
print(result)
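One caveat with the isin variants above: they match on the city name alone, so a city that exists in two states would be dropped from both. A pair-safe sketch (my addition) tests the full (state, city) tuples against the MultiIndex:
pairs = pd.MultiIndex.from_frame(states[['state', 'city']])
result = cities[~cities.index.isin(pairs)]
print(result)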
I have a dataframe that looks like this:
Population2010
State County
AL Baldwin 90332
Douglas 92082
Rolling 52000
CA Orange 3879602
San Diego 4364594
Los Angeles 12123562
CO Boulder 161818
Denver 737728
Jefferson 222368
AZ Maricopa 2239378
Pinal 448888
Pima 1000564
I would like to put the data in descending order based on population, but also keep it grouped and ordered by state:
Population2010
State County
AL Douglas 92082
Baldwin 90332
Rolling 52000
CA Los Angeles 12123562
San Diego 4364594
Orange 3879602
CO Denver 737728
Jefferson 222368
Boulder 161818
AZ Maricopa 2239378
Pima 1000564
Pinal 448888
and then I would like to sum the first two entries of population data and give the two states with the highest sums.
'CA', 'AZ'
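For reference, a reconstruction of the sample frame (my addition) so the snippets below run end to end:
import pandas as pd

df = pd.DataFrame({
    'State': ['AL'] * 3 + ['CA'] * 3 + ['CO'] * 3 + ['AZ'] * 3,
    'County': ['Baldwin', 'Douglas', 'Rolling',
               'Orange', 'San Diego', 'Los Angeles',
               'Boulder', 'Denver', 'Jefferson',
               'Maricopa', 'Pinal', 'Pima'],
    'Population2010': [90332, 92082, 52000,
                       3879602, 4364594, 12123562,
                       161818, 737728, 222368,
                       2239378, 448888, 1000564],
}).set_index(['State', 'County'])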
Question 1:
df.sort_values(['Population2010'], ascending=False)\
.reindex(sorted(df.index.get_level_values(0).unique()), level=0)
or
df.sort_values('Population2010', ascending=False)\
    .sort_index(level=0, ascending=True, sort_remaining=False)
Output:
Population2010
State County
AL Douglas 92082
Baldwin 90332
Rolling 52000
AZ Maricopa 2239378
Pima 1000564
Pinal 448888
CA Los Angeles 12123562
San Diego 4364594
Orange 3879602
CO Denver 737728
Jefferson 222368
Boulder 161818
First, sort the entire dataframe by population descending; then take the unique values of index level 0, sort them, and use them to reindex on level=0, which reorders the rows into state groups while preserving the descending order within each group. (In the second version, sort_remaining=False is what keeps the counties in population order instead of re-sorting them alphabetically.)
Question 2, a somewhat unrelated calculation to the first:
df.groupby('State')['Population2010']\
.apply(lambda x: x.nlargest(2).sum())\
.nlargest(2).index.tolist()
Output:
['CA', 'AZ']
Use nlargest to find two largest values grouped by state and sum, then use nlargest again to find the two largest states for those sums.
I have a DataFrame with thousands of rows and two columns like so:
string state
0 the best new york cheesecake rochester ny ny
1 the best dallas bbq houston tx random str tx
2 la jolla fish shop of san diego san diego ca ca
3 nothing here dc
For each state, I have a regular expression of all city names (in lower case) structured like (city1|city2|city3|...) where the order of the cities is arbitrary (but can be changed if needed). For example, the regular expression for the state of New York contains both 'new york' and 'rochester' (and likewise 'dallas' and 'houston' for Texas, and 'san diego' and 'la jolla' for California).
I want to find out what the last appearing city in the string is (for observations 0, 1, 2, and 3, I'd want 'rochester', 'houston', 'san diego', and NaN (or whatever), respectively).
I started off with str.extract and was trying to think of things like reversing the string but have reached an impasse.
Thanks so much for any help!
You can use str.findall, but rows with no match return an empty list, so you need apply to substitute a placeholder; then select the last item with .str[-1]:
cities = r"new york|dallas|rochester|houston|san diego"
print(df['string'].str.findall(cities)
                  .apply(lambda x: x if len(x) >= 1 else ['no match val'])
                  .str[-1])
0 rochester
1 houston
2 san diego
3 no match val
Name: string, dtype: object
Another solution is a bit of a hack: prepend a 'no match' string to each row with radd, and add that string to the cities pattern too:
a = 'no match val'
cities = r"new york|dallas|rochester|houston|san diego" + '|' + a

print(df['string'].radd(a).str.findall(cities).str[-1])
0 rochester
1 houston
2 san diego
3 no match val
Name: string, dtype: object
cities = r"new york|dallas|..."
def last_match(s):
found = re.findall(cities, s)
return found[-1] if found else ""
df['string'].apply(last_match)
#0 rochester
#1 houston
#2 san diego
#3
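A str.extract variation (my addition, building on the OP's starting point; the last_city column name is mine): a greedy .* consumes as much of the string as it can, so the capture group lands on the last occurrence, and rows with no match get NaN:
cities = r"new york|dallas|rochester|houston|san diego"

# Greedy .* backtracks just enough for the group to match the final city.
df['last_city'] = df['string'].str.extract(rf'.*({cities})', expand=False)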