Speeding up for-loops using pandas for feature engineering - python

I have a dataframe with the following headings:
payer
recipient_country
date of payment
Each row shows a transaction; for example, a row (Bob, UK, 1st January 2023) shows that the payer Bob sent a payment to the UK on 1st January 2023.
For each row in this table I need to find the number of times that the payer for that row has sent a payment to the country for that row in the past. So for the row above I would want to find the number of times that Bob has sent money to the UK prior to 1st January 2023.
This is for feature engineering purposes.
I have done this using a for loop: I iterate through the rows and make a pandas loc call for each row to find the rows with an earlier date, the same payer, and the same country. However, this is far too slow for the number of rows I have to process.
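For reference, here is a minimal sketch of the kind of row-wise loop I mean (illustrative only; the function name and the simplified columns payer, country, date are just for this example):
import pandas as pd

def count_prior_payments(df):
    counts = []
    for _, row in df.iterrows():
        # one full boolean scan of the frame per row, which is what makes this O(n^2)
        earlier = (
            (df['payer'] == row['payer'])
            & (df['country'] == row['country'])
            & (df['date'] < row['date'])
        )
        counts.append(earlier.sum())
    return pd.Series(counts, index=df.index)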
Can anyone think of a way to speed up this process using some fast pandas functions?
Thanks!

Testing on this toy data frame:
import pandas as pd
from pandas import Timestamp

df = pd.DataFrame(
    [{'name': 'Bob', 'country': 'UK', 'date': Timestamp('2023-01-01 00:00:00')},
     {'name': 'Bob', 'country': 'UK', 'date': Timestamp('2023-01-02 00:00:00')},
     {'name': 'Bob', 'country': 'UK', 'date': Timestamp('2023-01-03 00:00:00')},
     {'name': 'Cob', 'country': 'UK', 'date': Timestamp('2023-01-04 00:00:00')},
     {'name': 'Cob', 'country': 'UK', 'date': Timestamp('2023-01-05 00:00:00')},
     {'name': 'Cob', 'country': 'UK', 'date': Timestamp('2023-01-06 00:00:00')},
     {'name': 'Cob', 'country': 'UK', 'date': Timestamp('2023-01-07 00:00:00')}]
)
Just group by and cumulatively count:
>>> df['trns_bf'] = df.sort_values(by='date').groupby(['name', 'country'])['name'].cumcount()
name country date trns_bf
0 Bob UK 2023-01-01 0
1 Bob UK 2023-01-02 1
2 Bob UK 2023-01-03 2
3 Cob UK 2023-01-04 0
4 Cob UK 2023-01-05 1
5 Cob UK 2023-01-06 2
6 Cob UK 2023-01-07 3
You need to sort first, to ensure that earlier transactions are not confused with later ones. I interpreted "prior" in your question literally: e.g. there are no transactions before Bob's payment to the UK on 1 Jan 2023, so its count is 0.
Each row gets its own count of transactions with that name to that country before that date. If there can be multiple transactions on one day, decide how you want to handle that. I would probably group again and take the maximum value for each day: df.groupby(['name', 'country', 'date'], as_index=False)['trns_bf'].max(), and then merge the result back (the indexing makes it difficult to attach directly as above); see the sketch below.
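A minimal sketch of that merge-back step, assuming the df and trns_bf column from above (the helper name trns_bf_day is mine):
# collapse same-day duplicates to the maximum running count for that name/country/day
daily_max = (
    df.groupby(['name', 'country', 'date'], as_index=False)['trns_bf'].max()
      .rename(columns={'trns_bf': 'trns_bf_day'})
)
# attach the per-day figure back onto the original rows
df = df.merge(daily_max, on=['name', 'country', 'date'], how='left')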

Related

Drop groups whose variance is zero

Suppose the following df:
import numpy as np
import pandas as pd

d = {'month': ['01/01/2020', '01/02/2020', '01/03/2020', '01/01/2020', '01/02/2020', '01/03/2020'],
     'country': ['Japan', 'Japan', 'Japan', 'Poland', 'Poland', 'Poland'],
     'level': ['A01', 'A01', 'A01', 'A00', 'A00', 'A00'],
     'job title': ['Insights Manager', 'Insights Manager', 'Insights Manager', 'Sales Director', 'Sales Director', 'Sales Director'],
     'number': [0, 0.001, 0, 0, 0, np.nan],
     'age': [24, 22, 45, np.nan, 60, 32]}
df = pd.DataFrame(d)
The idea is to get the variance of a specific column by group (in this case grouping by country, level and job title), then select the segments whose variance is below a certain threshold and drop them from the original df.
However, when this is applied:
# define variance threshold
threshold = 0.0000000001
# get the variance by group for specific column
group_vars=df.groupby(['country', 'level', 'job title']).var()['number']
# select the rows to drop
rows_to_drop = df[group_vars<threshold].index
# drop the rows in place
#df.drop(rows_to_drop, axis=0, inplace=True)
The following error arises:
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long long'
The expected dataframe would drop Poland / A00 / Sales Director (variance 0.000000e+00) for all months, as it is a zero-variance segment.
Is it possible to reindex group_vars in order to drop those rows from the original df?
What am I missing?
You can achieve this with transform, which broadcasts each group's variance back onto the original row index, so the boolean mask aligns with df:
# define variance threshold
threshold = 0.0000000001
# get the variance by group for specific column
group_vars=df.groupby(['country', 'level', 'job title'])['number'].transform('var')
# select the rows to drop
rows_to_drop = df[group_vars<threshold].index
# drop the rows in place
df.drop(rows_to_drop, axis=0, inplace=True)
Which gives:
month country level job title number age
0 01/01/2020 Japan A01 Insights Manager 0.000 24.0
1 01/02/2020 Japan A01 Insights Manager 0.001 22.0
2 01/03/2020 Japan A01 Insights Manager 0.000 45.0
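As an alternative sketch (used instead of computing rows_to_drop, not in addition to it), groupby().filter keeps only the groups whose 'number' variance meets the threshold, dropping the zero-variance segments in one step:
# keep only the groups whose 'number' variance is at least the threshold
kept = df.groupby(['country', 'level', 'job title']).filter(
    lambda g: g['number'].var() >= threshold
)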

Sum columns by key values in another column

I have a pandas DataFrame like this:
city country city_population
0 New York USA 8300000
1 London UK 8900000
2 Paris France 2100000
3 Chicago USA 2700000
4 Manchester UK 510000
5 Marseille France 860000
I want to create a new column country_population containing, for each row, the sum of the city populations of that row's country. I have tried:
df['Country population'] = df['city_population'].sum().where(df['country'])
But this doesn't work. Could I have some advice on the problem?
Sounds like you're looking for groupby
import pandas as pd

data = {
    'city': ['New York', 'London', 'Paris', 'Chicago', 'Manchester', 'Marseille'],
    'country': ['USA', 'UK', 'France', 'USA', 'UK', 'France'],
    'city_population': [8_300_000, 8_900_000, 2_100_000, 2_700_000, 510_000, 860_000]
}
df = pd.DataFrame.from_dict(data)
# group by country, access 'city_population' column, sum
pop = df.groupby('country')['city_population'].sum()
print(pop)
output:
country
France 2960000
UK 9410000
USA 11000000
Name: city_population, dtype: int64
You can then append this Series to the DataFrame (arguably discouraged, though, since it stores the information redundantly and doesn't really fit the structure of the original DataFrame):
# add to existing df
pop.rename('country_population', inplace=True)
# how='left' to preserve original ordering of df
df = df.merge(pop, how='left', on='country')
print(df)
output:
city country city_population country_population
0 New York USA 8300000 11000000
1 London UK 8900000 9410000
2 Paris France 2100000 2960000
3 Chicago USA 2700000 11000000
4 Manchester UK 510000 9410000
5 Marseille France 860000 2960000
Based on @Vaishali's comment, a one-liner:
df['Country population'] = df.groupby('country')['city_population'].transform('sum')

How to extract row with mixed value

I have to extract the rows of a pandas dataframe whose values in the 'Date of birth' column occur in a list of dates.
import pandas as pd
df = pd.DataFrame({'Name': ['Jack', 'Mary', 'David', 'Bruce', 'Nick', 'Mark', 'Carl', 'Sofie'],
                   'Date of birth': ['1973', '1999', '1995', '1992/1991', '2000', '1969', '1994', '1989/1990']})
dates = ['1973', '1992', '1969', '1989']
new_df = df.loc[df['Date of birth'].isin(dates)]
print(new_df)
The full dataframe looks like this:
Name Date of birth
0 Jack 1973
1 Mary 1999
2 David 1995
3 Bruce 1992/1991
4 Nick 2000
5 Mark 1969
6 Carl 1994
7 Sofie 1989/1990
Eventually I get the table below. As you can see, Bruce's and Sofie's rows are absent, since their values are followed by / and another year. How should I split these values up so that those rows are matched as well?
Name Date of birth
0 Jack 1973
5 Mark 1969
You could use str.contains:
import pandas as pd
df = pd.DataFrame({'Name': ['Jack', 'Mary', 'David', 'Bruce', 'Nick', 'Mark', 'Carl', 'Sofie'],
                   'Date of birth': ['1973', '1999', '1995', '1992/1991', '2000', '1969', '1994', '1989/1990']})
dates = ['1973', '1992', '1969', '1989']
new_df = df.loc[df['Date of birth'].str.contains(rf"\b{'|'.join(dates)}\b")]
print(new_df)
Output
Name Date of birth
0 Jack 1973
3 Bruce 1992/1991
5 Mark 1969
7 Sofie 1989/1990
The string rf"\b{'|'.join(dates)}\b" builds a regex pattern that matches any string containing one of the dates as a whole word.
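With the dates list from the question, that f-string expands to \b1973|1992|1969|1989\b. If you want the word boundaries to apply to every year rather than only the first and last alternative, a non-capturing group makes that explicit (a minor variation, not from the original answer):
pattern = rf"\b(?:{'|'.join(dates)})\b"
new_df = df.loc[df['Date of birth'].str.contains(pattern)]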
I like @DaniMesejo's way better, but here is an approach that splits up the values and stacks them:
df[df['Date of birth'].str.split('/', expand=True).stack().isin(dates).groupby(level=0).max()]
Output:
Name Date of birth
0 Jack 1973
3 Bruce 1992/1991
5 Mark 1969
7 Sofie 1989/1990

Groupby one column and count another column with a condition?

I was wondering if it is possible to groupby one column while counting the values of another column that fulfill a condition. Because my dataset is a bit weird, I created a similar one:
import pandas as pd

raw_data = {'name': ['John', 'Paul', 'George', 'Emily', 'Jamie'],
            'nationality': ['USA', 'USA', 'France', 'France', 'UK'],
            'books': [0, 15, 0, 14, 40]}
df = pd.DataFrame(raw_data, columns=['name', 'nationality', 'books'])
Say I want to group by nationality and count the number of people from each country who don't have any books (books == 0).
I would therefore expect something like the following as output:
nationality
USA 1
France 1
UK 0
I have tried several variations of groupby with filter and agg, but can't seem to get anything that works.
Thanks in advance,
BBQuercus :)
IIUC:
df.books.eq(0).astype(int).groupby(df.nationality).sum()
nationality
France 1
UK 0
USA 1
Name: books, dtype: int64
Alternatively, with apply (note that .any() flags whether any zero-book person exists in the group rather than counting them; the results coincide here because each country has at most one such person):
df.groupby('nationality')['books'].apply(lambda x: x.eq(0).any().astype(int))
nationality
France 1
UK 0
USA 1
Name: books, dtype: int64

Update column values based on other columns

I have a weak grasp of Pandas and not a strong understanding of Python.
I want to update a column (d.Alias) based on the values of existing columns (d.Company and d2.Alias): d.Alias should be set to d2.Alias whenever d2.Alias is a substring of d.Company.
Example datasets:
import numpy as np

d = {'Company': ['The Cool Company Inc', 'Cool Company, Inc', 'The Cool Company',
                 'The Shoe Company', 'Muffler Store', 'Muffler Store'],
     'Position': ['Cool Job A', 'Cool Job B', 'Cool Job C', 'Salesman', 'Sales', 'Technician'],
     'City': ['Tacoma', 'Tacoma', 'Tacoma', 'Boulder', 'Chicago', 'Chicago'],
     'State': ['AZ', 'AZ', 'AZ', 'CO', 'IL', 'IL'],
     'Alias': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]}
d2 = {'Company': ['The Cool Company, Inc.', 'The Shoe Company', 'Muffler Store LLC'],
      'Alias': ['Cool Company', np.nan, 'Muffler'],
      'First Name': ['Carol', 'James', 'Frankie'],
      'Last Name': ['Fisher', 'Smith', 'Johnson']}
The np.nan for The Shoe Company is because for that instance an alias is not necessary.
I have tried using .loc, for loops, while loops, pandas.where, numpy.where, and several variations of each, with no desirable outcome. When using a for loop, the last entry of d2.Alias was copied to all rows in d.Alias; I have not been able to reproduce that, however.
Previous posts that I have looked at which I wasn't able to get to work, or I didn't understand them: Conditionally fill column with value from another DataFrame based on row match in Pandas
pandas create new column based on values from other columns
Any help is greatly appreciated!
EDIT:
Expected output
Update:
After a few days of tinkering I reached the desired outcome. With Wen's response I had to change a couple of things.
First, I created a list from df2.Alias called aliases:
aliases = df2.Alias.unique()
Then, I had to remove .map(df2.set_index('Company').Alias). The line that generated my desired results:
df1['Alias'] = df1.Company.apply(lambda x: [process.extract(x, aliases, limit=1)][0][0][0])
Solution from fuzzywuzzy
from fuzzywuzzy import process
df1['Alias'] = df1.Company.apply(lambda x: [process.extract(x, df2.Company, limit=1)][0][0][0]).map(df2.set_index('Company').Alias)
df1
Out[31]:
Alias City Company Position State
0 Cool Company Tacoma The Cool Company Inc Cool Job A AZ
1 Cool Company Tacoma Cool Company, Inc Cool Job B AZ
2 Cool Company Tacoma The Cool Company Cool Job C AZ
3 NaN Boulder The Shoe Company Salesman CO
4 Muffler Chicago Muffler Store Sales IL
5 Muffler Chicago Muffler Store Technician IL
One approach is to loop through your presumably much smaller dataframe, check whether each alias is a substring of d.Company, and fill in d.Alias with that alias where it is.
import pandas as pd
d = pd.DataFrame(d)
d2 = pd.DataFrame(d2)
for row in d2[d2.Alias.notnull()].itertuples():
    d.loc[d.Company.str.contains(row.Alias), 'Alias'] = row.Alias
print(d)
# Alias City Company Position State
#0 Cool Company Tacoma The Cool Company Inc Cool Job A AZ
#1 Cool Company Tacoma Cool Company, Inc Cool Job B AZ
#2 Cool Company Tacoma The Cool Company Cool Job C AZ
#3 NaN Boulder The Shoe Company Salesman CO
#4 Muffler Chicago Muffler Store Sales IL
#5 Muffler Chicago Muffler Store Technician IL
