I have a weak grasp of pandas and not a strong understanding of Python.
I want to update a column (d.Alias) based on the values of existing columns (d.Company and d2.Alias): d.Alias should equal d2.Alias whenever d2.Alias is a substring of d.Company.
Example datasets:
import numpy as np

d = {'Company': ['The Cool Company Inc', 'Cool Company, Inc', 'The Cool Company',
                 'The Shoe Company', 'Muffler Store', 'Muffler Store'],
     'Position': ['Cool Job A', 'Cool Job B', 'Cool Job C', 'Salesman', 'Sales', 'Technician'],
     'City': ['Tacoma', 'Tacoma', 'Tacoma', 'Boulder', 'Chicago', 'Chicago'],
     'State': ['AZ', 'AZ', 'AZ', 'CO', 'IL', 'IL'],
     'Alias': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]}

d2 = {'Company': ['The Cool Company, Inc.', 'The Shoe Company', 'Muffler Store LLC'],
      'Alias': ['Cool Company', np.nan, 'Muffler'],
      'First Name': ['Carol', 'James', 'Frankie'],
      'Last Name': ['Fisher', 'Smith', 'Johnson']}
The np.nan for The Shoe Company is there because no alias is needed for that company.
I have tried .loc, for loops, while loops, pandas.where, numpy.where, and several variations of each, without the desired outcome. When using a for loop, the last value of d2.Alias was copied to all rows in d.Alias, though I have not been able to reproduce that since.
Previous posts I have looked at but couldn't get working, or didn't understand: Conditionally fill column with value from another DataFrame based on row match in Pandas
pandas create new column based on values from other columns
Any help is greatly appreciated!
EDIT:
Expected output (attached as an image in the original post).
Update:
After a few days of tinkering I reached the desired outcome. Starting from Wen's response, I had to change a couple of things.
First, I created a list of the unique values of df2.Alias, called aliases:
aliases = df2.Alias.unique()
Then I had to remove .map(df2.set_index('Company').Alias). The line that generated my desired results:
df1['Alias'] = df1.Company.apply(lambda x: [process.extract(x, aliases, limit=1)][0][0][0])
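Put together, a working version looks something like this (a sketch: process.extract returns a list of (match, score) tuples, so the chained indexing just pulls out the best-matching string; the NaN aliases are dropped here to be safe):

from fuzzywuzzy import process

# candidate aliases, skipping the NaN placeholder rows
aliases = df2.Alias.dropna().unique()

# for each company, keep the single best fuzzy match among the aliases
df1['Alias'] = df1.Company.apply(lambda x: process.extract(x, aliases, limit=1)[0][0])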
Solution using fuzzywuzzy
from fuzzywuzzy import process

df1['Alias'] = (df1.Company
                .apply(lambda x: process.extract(x, df2.Company, limit=1)[0][0])
                .map(df2.set_index('Company').Alias))
df1
Out[31]:
Alias City Company Position State
0 Cool Company Tacoma The Cool Company Inc Cool Job A AZ
1 Cool Company Tacoma Cool Company, Inc Cool Job B AZ
2 Cool Company Tacoma The Cool Company Cool Job C AZ
3 NaN Boulder The Shoe Company Salesman CO
4 Muffler Chicago Muffler Store Sales IL
5 Muffler Chicago Muffler Store Technician IL
One approach is to loop through your presumably much smaller dataframe, check whether each alias is a substring of d.Company, and fill in d.Alias where it is.
import pandas as pd

d = pd.DataFrame(d)
d2 = pd.DataFrame(d2)

# for each non-null alias, flag every company that contains it as a substring
for row in d2[d2.Alias.notnull()].itertuples():
    d.loc[d.Company.str.contains(row.Alias), 'Alias'] = row.Alias

print(d)
# Alias City Company Position State
#0 Cool Company Tacoma The Cool Company Inc Cool Job A AZ
#1 Cool Company Tacoma Cool Company, Inc Cool Job B AZ
#2 Cool Company Tacoma The Cool Company Cool Job C AZ
#3 NaN Boulder The Shoe Company Salesman CO
#4 Muffler Chicago Muffler Store Sales IL
#5 Muffler Chicago Muffler Store Technician IL
Related
I have a dataframe with the following headings:
payer
recipient_country
date of payment
Each row shows a transaction; for example, the row (Bob, UK, 1st January 2023) shows that the payer Bob sent a payment to the UK on 1st January 2023.
For each row in this table I need to find the number of times that the payer for that row has sent a payment to the country for that row in the past. So for the row above I would want to find the number of times that Bob has sent money to the UK prior to 1st January 2023.
This is for feature engineering purposes.
I have done this using a for loop in which I iterate through rows and do a pandas loc call for each row to find rows with an earlier date with the same payer and country, but this is far too slow for the number of rows I have to process.
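For reference, the per-row approach described above looks roughly like this (a sketch, using the name/country/date column names from the toy frame in the answer below); it rescans the whole frame for every row, hence the poor scaling:

# O(n^2) baseline: for each row, count strictly earlier rows
# with the same payer and country
df['trns_bf'] = [
    ((df['name'] == row['name'])
     & (df['country'] == row['country'])
     & (df['date'] < row['date'])).sum()
    for _, row in df.iterrows()
]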
Can anyone think of a way to speed up this process using some fast pandas functions?
Thanks!
Testing on this toy data frame:
import pandas as pd
from pandas import Timestamp

df = pd.DataFrame(
    [{'name': 'Bob', 'country': 'UK', 'date': Timestamp('2023-01-01 00:00:00')},
     {'name': 'Bob', 'country': 'UK', 'date': Timestamp('2023-01-02 00:00:00')},
     {'name': 'Bob', 'country': 'UK', 'date': Timestamp('2023-01-03 00:00:00')},
     {'name': 'Cob', 'country': 'UK', 'date': Timestamp('2023-01-04 00:00:00')},
     {'name': 'Cob', 'country': 'UK', 'date': Timestamp('2023-01-05 00:00:00')},
     {'name': 'Cob', 'country': 'UK', 'date': Timestamp('2023-01-06 00:00:00')},
     {'name': 'Cob', 'country': 'UK', 'date': Timestamp('2023-01-07 00:00:00')}]
)
Just group by and cumulatively count:
>>> df['trns_bf'] = df.sort_values(by='date').groupby(['name', 'country'])['name'].cumcount()
name country date trns_bf
0 Bob UK 2023-01-01 0
1 Bob UK 2023-01-02 1
2 Bob UK 2023-01-03 2
3 Cob UK 2023-01-04 0
4 Cob UK 2023-01-05 1
5 Cob UK 2023-01-06 2
6 Cob UK 2023-01-07 3
You need to sort first, to ensure that earlier transactions are not confused with later ones. I interpreted "prior" in your question literally: e.g. the count is 0 for Bob's transaction to the UK on 1 Jan 2023, because nothing precedes it.
Each row gets its own count of prior transactions with that name to that country. If there can be multiple transactions on one day, decide how you want to handle them. I would probably group again and take the maximum value for each day, df.groupby(['name', 'country', 'date'], as_index=False)['trns_bf'].max(), and then merge the result back (the indexing makes it difficult to attach directly as above).
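A sketch of that tie-handling step (assuming trns_bf has already been computed as above):

# collapse same-day duplicates to a single count per (name, country, date)
daily = df.groupby(['name', 'country', 'date'], as_index=False)['trns_bf'].max()

# merge the per-day count back onto the original rows
df = df.drop(columns='trns_bf').merge(daily, on=['name', 'country', 'date'])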
I have a dataframe (df1) that looks like this:
first_name   last_name   affiliation
jean         dulac       University of Texas
peter        huta        University of Maryland
I want to match this dataframe to another one that contains several potential matches. Each potential match has a first and last name and also a list of all the affiliations this person was associated with, and I want to use the information in this affiliation column to differentiate between my potential matches and keep only the most likely one.
The second dataframe has the following form:
first_name   last_name   affiliations_all
jean         dulac       [{'city_name': 'Kyoto', 'country_name': 'Japan', 'name': 'Kyoto University'}]
jean         dulac       [{'city_name': 'Texas', 'country_name': 'USA', 'name': 'University of Texas'}]
The column affiliations_all is apparently saved as a pandas.core.series.Series (and I can't change that since it comes from an API query).
I am thinking that one way to match the two dataframes would be to remove words like "university" and "of" from the affiliation column of the first dataframe (that's easy), do the same for the affiliations_all column of the second dataframe (I don't know how to do that), and then run some version of
test.apply(lambda x: str(x.affiliation) in str(x.affiliations_all), axis=1)
adapted to the fact that affiliations_all is a series.Series.
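For instance, a rough sketch of that adaptation (assuming each affiliations_all entry really is a list of dicts with a 'name' key, as in the example above) could first pull out just the institution names:

# extract the institution names from each list of affiliation dicts
df2['affiliation_names'] = df2['affiliations_all'].apply(
    lambda lst: [a.get('name', '') for a in lst])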
Any idea how to do that?
Thanks!
One possible solution would be to transform df2 (expand the columns) and then merge df1 with df2:
# transform df2: one row per affiliation dict
df2 = df2.explode("affiliations_all")
# expand each dict into its own columns (city_name, country_name, name)
df2 = pd.concat([df2, df2.pop("affiliations_all").apply(pd.Series)], axis=1)
df2 = df2.rename(columns={"name": "affiliation"})
print(df2)
This prints:
first_name last_name city_name country_name affiliation
0 jean dulac Kyoto Japan Kyoto University
1 jean dulac Texas USA University of Texas
The second step is to merge df1 with the transformed df2:
df_out = pd.merge(df1, df2, on=["first_name", "last_name", "affiliation"])
print(df_out)
Prints:
first_name last_name affiliation city_name country_name
0 jean dulac University of Texas Texas USA
I have a dataframe which looks like:
df =
   Name   Nationality           Family     etc...
0  John   Born in Spain.        Wife
1  nan    But live in England   son
2  nan    nan                   daughter
Some columns have a value in only one row while others have values spread over a few rows. How could I merge the rows together so it would look something like the below:
df =
   Name   Nationality                          Family              etc...
0  John   Born in Spain. But live in England   Wife Son Daughter
Perhaps this will do it for you:
import numpy as np
import pandas as pd

# your dataframe
df = pd.DataFrame(
    {'Name': ['John', np.nan, np.nan],
     'Nationality': ['Born in Spain.', 'But live in England', np.nan],
     'Family': ['Wife', 'son', 'daughter']})

def squeeze_df(df):
    # join each column's non-null strings into one space-separated value
    new_df = {}
    for col in df.columns:
        new_df[col] = [df[col].str.cat(sep=' ')]
    return pd.DataFrame(new_df)

squeeze_df(df)
# >> out:
# Name Nationality Family
# 0 John Born in Spain. But live in England Wife son daughter
I made the assumption that you only need to do this for one single person (i.e. squeezing/joining the rows of the dataframe into a single row). Also, what does "etc...." mean? For example, will you have integer or floating point values in the dataframe?
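If the frame can actually contain several people, one possible extension (a sketch, assuming a non-null Name marks the start of each person's block of rows) is to forward-fill Name and group:

# propagate each person's name down their block of rows
df['Name'] = df['Name'].ffill()

# squeeze each person's block into a single row
df.groupby('Name', as_index=False).agg(lambda s: s.str.cat(sep=' '))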
suppose I have a dataframe which contains:
df = pd.DataFrame({'Name': ['John', 'Alice', 'Peter', 'Sue'],
                   'Job': ['Dentist', 'Blogger', 'Cook', 'Cook'],
                   'Sector': ['Health', 'Entertainment', '', '']})
I want to find all 'cooks', whether capitalized or not, and set their 'Sector' value to 'gastronomy' without overwriting the other entries in the 'Sector' column. How do I do that? Thanks!
Here's one approach:
df.loc[df.Job.str.lower().eq('cook'), 'Sector'] = 'gastronomy'
print(df)
Name Job Sector
0 John Dentist Health
1 Alice Blogger Entertainment
2 Peter Cook gastronomy
3 Sue Cook gastronomy
Using Series.str.match with a regex and the inline case-insensitive flag (?i):
df.loc[df['Job'].str.match('(?i)cook'), 'Sector'] = 'gastronomy'
Output
Name Job Sector
0 John Dentist Health
1 Alice Blogger Entertainment
2 Peter Cook gastronomy
3 Sue Cook gastronomy
I am working on a data set from various sources, and one column in the frame captures, for example, city names. Certain city names have the name of the state in parentheses, others have the zip code. How do I get rid of that extra info and keep only the city names?
Example of data (attached as an image in the original post).
In certain cases, there is no space between the zip code and the city name.
The desired frame (also attached as an image) keeps only the city name.
You can use replace with regexes:
\s* matches zero or more whitespace characters
\d+ matches one or more digits
\(.*\) matches anything between parentheses, including the parentheses
df = pd.DataFrame([{'City Name': 'Dallas'},
                   {'City Name': 'New York (NY)'},
                   {'City Name': 'West Orange 07052'},
                   {'City Name': 'Orlando (Florda)'},
                   {'City Name': 'Camdem (NJ)'},
                   {'City Name': 'Boston'},
                   {'City Name': 'Harrison07029'}])

# strip trailing zip codes and parenthesized state names
df['new'] = df['City Name'].replace([r'\s*\d+', r'\s*\(.*\)'], '', regex=True)
print(df)
City Name new
0 Dallas Dallas
1 New York (NY) New York
2 West Orange 07052 West Orange
3 Orlando (Florda) Orlando
4 Camdem (NJ) Camdem
5 Boston Boston
6 Harrison07029 Harrison