Speeding up Pandas .apply function: counting rows and doing operations with them - python

I have a huge database of migratory movements, and I wrote some scripts to get useful info from it, but they are really slow. I'm not a professional coder, as you will see, and I was wondering how to make this data gathering more efficient.
To start, the initial CSV database is structured as follows:
1 row = 1 person
          Age  Sex  City_start  City_destination  ...
Person 1
Person 2
.....
The final database structure:
       Balance_2004  Balance_2005  ....
City1
City2
....
To calculate this balance per city and year, I created a function that filters the initial database to count how many rows have a specific city in City_destination (INs) and how many have it in City_start (OUTs), and then computes the balance as INs - OUTs:
# idb = initial database
# city = a city name (each city is a pre-existing row in the final database)
def get_balance(city, df):
    ins = df.City_destination[df.City_destination == city].count()
    outs = df.City_start[df.City_start == city].count()
    balance = ins - outs
    return balance
Then with this function I used pandas apply to populate the final database as:
# fdb = final database
fdb['Balance_2004'] = idb['City_start'].apply(get_balance, df=idb)
This works well and the end result is what I need, but I'm using 42 apply functions in total to get more specific data (balance per sex, per age group, ...). To give an idea of how slow this is: I started the script (with all 42 functions) 45 minutes ago and it's still running.
Is there any way to do this in a less time-consuming way?
Thanks in advance

It might make sense to do this calculation only once, by grouping by the cities:
def get_balance_all_cities(df):
    df_diff = pd.DataFrame([df.groupby(["City_start"])["Name"].count(),
                            df.groupby(["City_destination"])["Name"].count()]).T
    df_diff.columns = "start", "end"
    df_diff[df_diff.isna()] = 0
    return df_diff.start - df_diff.end
Here is an example of how it works:
>>> df = pd.DataFrame([("Person 1", "Chicago", "Chicago"),
...                    ("Person 2", "New York", "Chicago"),
...                    ("Person 3", "Houston", "New York")],
...                   columns=["Name", "City_start", "City_destination"])
>>> df
       Name City_start City_destination
0  Person 1    Chicago          Chicago
1  Person 2   New York          Chicago
2  Person 3    Houston         New York
>>> ins = df.groupby(["City_start"])["Name"].count()
>>> ins
City_start
Chicago     1
Houston     1
New York    1
Name: Name, dtype: int64
>>> outs = df.groupby(["City_destination"])["Name"].count()
>>> outs
City_destination
Chicago     2
New York    1
Name: Name, dtype: int64
>>> df_diff = pd.DataFrame([ins, outs]).T
>>> df_diff.columns = "start", "end"
>>> df_diff[df_diff.isna()] = 0
>>> balance = df_diff.start - df_diff.end
>>> balance
Chicago    -1.0
Houston     1.0
New York    0.0
dtype: float64
The NaN-filling work-around at the end deals with cities that appear only in City_start or only in City_destination, i.e. where no one lives at one of the two points in time.
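Calling the function on that example frame reproduces the balances in one shot:
>>> get_balance_all_cities(df)
Chicago    -1.0
Houston     1.0
New York    0.0
dtype: float64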

I believe you need to aggregate by city and year with DataFrameGroupBy.size, reshape with unstack, then subtract with sub and, if necessary, convert to integers:
idb = pd.DataFrame([("a", "Chicago", "Chicago", 2018),
                    ("b", "New York", "Chicago", 2018),
                    ("c", "New York", "Chicago", 2017),
                    ("d", "Houston", "LA", 2018)],
                   columns=["Name", "City_start", "City_destination", 'year'])
print (idb)

  Name City_start City_destination  year
0    a    Chicago          Chicago  2018
1    b   New York          Chicago  2018
2    c   New York          Chicago  2017
3    d    Houston               LA  2018
a1 = idb.groupby(["City_start", 'year']).size().unstack(fill_value=0)
a2 = idb.groupby(["City_destination", 'year']).size().unstack(fill_value=0)
fdb = a1.sub(a2, fill_value=0).astype(int).add_prefix('Balance_')
print (fdb)
year      Balance_2017  Balance_2018
Chicago             -1            -1
Houston              0             1
LA                   0            -1
New York             1             1
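The same pattern covers the more specific breakdowns from the question (per sex, per age group) without separate apply calls: just add the extra column to the groupby keys. A sketch, assuming idb also has a Sex column:
a1 = idb.groupby(["City_start", "Sex", 'year']).size().unstack(fill_value=0)
a2 = idb.groupby(["City_destination", "Sex", 'year']).size().unstack(fill_value=0)
# one row per (city, sex) pair, one Balance_<year> column per year
balance_by_sex = a1.sub(a2, fill_value=0).astype(int).add_prefix('Balance_')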

Related

LEFT ON Case When in Pandas

I wanted to ask: in SQL I can do a JOIN ON CASE WHEN; is there a way to do this in Pandas?
disease = [
    {"City":"CH","Case_Recorded":5300,"Recovered":2839,"Deaths":2461},
    {"City":"NY","Case_Recorded":1311,"Recovered":521,"Deaths":790},
    {"City":"TX","Case_Recorded":1991,"Recovered":1413,"Deaths":578},
    {"City":"AT","Case_Recorded":3381,"Recovered":3112,"Deaths":269},
    {"City":"TX","Case_Recorded":3991,"Recovered":2810,"Deaths":1311},
    {"City":"LA","Case_Recorded":2647,"Recovered":2344,"Deaths":303},
    {"City":"LA","Case_Recorded":4410,"Recovered":3344,"Deaths":1066}
]
region = {"North": ["AT"], "West":["TX","LA"]}
So what I have is two dummy dicts, which I have already converted to dataframes. The first is the cities with their cases, and I'm trying to figure out which region each city belongs to. The expected result is:
Region|City
North|AT
West|TX
West|LA
None|NY
None|CH
What I thought of in SQL was a LEFT JOIN ON CASE WHEN: if the result is null when joining with the North region, then join with the West region. But if there are 15 or 30 regions in some country, that would be a problem, I think.
Use:
#get City without duplicates
df1 = pd.DataFrame(disease)[['City']].drop_duplicates()
#create DataFrame from region dictionary
region = {"North": ["AT"], "West":["TX","LA"]}
df2 = pd.DataFrame([(k, x) for k, v in region.items() for x in v],
                   columns=['Region','City'])
#append not matched cities to df2
out = pd.concat([df2, df1[~df1['City'].isin(df2['City'])]])
print (out)
  Region City
0  North   AT
1   West   TX
2   West   LA
0    NaN   CH
1    NaN   NY
If order is not important:
out = df2.merge(df1, how = 'right')
print (out)
  Region City
0    NaN   CH
1    NaN   NY
2   West   TX
3  North   AT
4   West   LA
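If you only need to attach a region to each city, another sketch is to invert the region dict into a city-to-region lookup and map it, avoiding the join entirely:
#invert {region: [cities]} into {city: region}
city_to_region = {c: reg for reg, cities in region.items() for c in cities}
out = df1.assign(Region=df1['City'].map(city_to_region)) #NaN where no region matches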
I'm sorry, I'm not exactly sure what your expected result is; can you explain more? If your expected result is just getting each city's region, there is no need for conditional joining. For example, you can transform the city-region table into one row per city and region and directly join it with the main df:
disease = [
    {"City":"CH","Case_Recorded":5300,"Recovered":2839,"Deaths":2461},
    {"City":"NY","Case_Recorded":1311,"Recovered":521,"Deaths":790},
    {"City":"TX","Case_Recorded":1991,"Recovered":1413,"Deaths":578},
    {"City":"AT","Case_Recorded":3381,"Recovered":3112,"Deaths":269},
    {"City":"TX","Case_Recorded":3991,"Recovered":2810,"Deaths":1311},
    {"City":"LA","Case_Recorded":2647,"Recovered":2344,"Deaths":303},
    {"City":"LA","Case_Recorded":4410,"Recovered":3344,"Deaths":1066}
]
region = [
    {'City':'AT','Region':"North"},
    {'City':'TX','Region':"West"},
    {'City':'LA','Region':"West"}
]
df = pd.DataFrame(disease)
df_reg = pd.DataFrame(region)
df.merge(df_reg, on='City', how='left')
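For reference, that left merge keeps every disease row and should yield something like:
  City  Case_Recorded  Recovered  Deaths Region
0   CH           5300       2839    2461    NaN
1   NY           1311        521     790    NaN
2   TX           1991       1413     578   West
3   AT           3381       3112     269  North
4   TX           3991       2810    1311   West
5   LA           2647       2344     303   West
6   LA           4410       3344    1066   West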

Update Pandas DataFrame for each string in a list

I am analyzing a dataset containing NFL game results over the past 20 years and am trying to create a column denoting for each team whether the game was a home game (1) or an away game (0).
The code I have so far is:
import numpy as np

home_list = list(df.home_team.unique())

def home_or_away(team_name, dataf):
    dataf['home_or_away'] = np.where(dataf['home_team'] == team_name, 1, 0)
    return dataf

for i in home_list:
    home_update_all = home_or_away(i, df)
    df.update(home_update_all)
This doesn't seem to yield the correct results, as the column is simply overwritten for each team in the iteration. Any ideas on how to solve this?
Thanks!
I'm not really sure what your expected output is. Do you mean you want one column per team? You currently keep creating columns, but with the same name, so only the one from the last iteration is kept and the rest are overwritten. Or do you want multiple DataFrames?
If you want multiple columns, one per team:
import pandas as pd

df = pd.DataFrame({'game': [1, 2, 3, 4], 'home_team': ['a', 'b', 'c', 'a']})

>    game home_team
> 0     1         a
> 1     2         b
> 2     3         c
> 3     4         a
First collect unique teams as you did:
home_list = list(df.home_team.unique())
Create a column for each team:
for team in home_list:
    df[f'home_or_away_{team}'] = [int(ht == team) for ht in df['home_team']]
Which results in:
>    game home_team  home_or_away_a  home_or_away_b  home_or_away_c
> 0     1         a               1               0               0
> 1     2         b               0               1               0
> 2     3         c               0               0               1
> 3     4         a               1               0               0
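Instead of the loop, an equivalent variant using pandas' built-in dummy encoding would be:
df = df.join(pd.get_dummies(df['home_team'], prefix='home_or_away').astype(int))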
You're overcomplicating it. You don't need to iterate; just use np.where() on the two columns directly (no separate function needed). It basically says: where home_team equals team_name, put a 1, else put a 0.
import pandas as pd
import numpy as np

df = pd.DataFrame([['Chicago Bears', 'Chicago Bears', 'Green Bay Packers'],
                   ['Chicago Bears', 'Green Bay Packers', 'Chicago Bears'],
                   ['Detroit Lions', 'Detroit Lions', 'Los Angeles Chargers'],
                   ['New England Patriots', 'New York Jets', 'New England Patriots'],
                   ['Houston Texans', 'Los Angeles Rams', 'Houston Texans']],
                  columns=['team_name', 'home_team', 'away_team'])
df['home_or_away'] = np.where(df['home_team'] == df['team_name'], 1, 0)
Output:
print(df)
              team_name          home_team             away_team  home_or_away
0         Chicago Bears      Chicago Bears     Green Bay Packers             1
1         Chicago Bears  Green Bay Packers         Chicago Bears             0
2         Detroit Lions      Detroit Lions  Los Angeles Chargers             1
3  New England Patriots      New York Jets  New England Patriots             0
4        Houston Texans   Los Angeles Rams        Houston Texans             0
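Since the comparison already yields booleans, an equivalent variant is to cast the result directly instead of going through np.where():
df['home_or_away'] = (df['home_team'] == df['team_name']).astype(int)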

How to choose random item from a dictionary to df and exclude one item?

I have a dictionary and a dataframe, for example:
data={'Name': ['Tom', 'Joseph', 'Krish', 'John']}
df=pd.DataFrame(data)
print(df)
city={"New York": "123",
"LA":"456",
"Miami":"789"}
Output:
     Name
0     Tom
1  Joseph
2   Krish
3    John
I've created a column called CITY by using the following:
df["CITY"]=np.random.choice(list(city), len(df))
df
Name CITY
0 Tom New York
1 Joseph LA
2 Krish Miami
3 John New Yor
Now I would like to generate a new column, CITY2, with a random item from the city dictionary, but CITY2 must always differ from CITY, so when generating CITY2 I need to exclude the row's CITY item.
It's worth mentioning that my real df is quite large, so I need this to be as efficient as possible.
Thanks in advance.
This continues with the approach you have used: pd.Series() is used as a convenience to remove the value that has already been taken, wrapped in apply() to work on the value of each row:
import numpy as np
import pandas as pd

data = {'Name': ['Tom', 'Joseph', 'Krish', 'John']}
df = pd.DataFrame(data)
city = {"New York": "123",
        "LA": "456",
        "Miami": "789"}

df["CITY"] = np.random.choice(list(city), len(df))
df["CITY2"] = df["CITY"].apply(lambda x: np.random.choice(pd.Series(city).drop(x).index))
     Name      CITY     CITY2
0     Tom     Miami  New York
1  Joseph        LA     Miami
2   Krish  New York     Miami
3    John  New York        LA
You could also first group by "CITY", remove the current city of each group from the city dict, and then create the new random list of cities. This may be faster because you drop one city per group instead of per row.
import numpy as np
import pandas as pd

parts = []
for key, group in df.groupby('CITY'):
    cities_subset = np.delete(np.array(list(city)), list(city).index(key))
    parts.append(pd.Series(np.random.choice(cities_subset, len(group)), index=group.index))
df["CITY2"] = pd.concat(parts)
This gives for example:
     Name      CITY     CITY2
0     Tom  New York        LA
1  Joseph  New York     Miami
2   Krish        LA  New York
3    John  New York        LA
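If the real df is large, the per-row (or per-group) drop can be avoided entirely by vectorizing the choice: map each CITY to its position in the city list, add a random non-zero offset modulo the number of cities, and index back. A sketch (the helper names are illustrative):
import numpy as np
import pandas as pd

cities = np.array(list(city))
n = len(cities)
# position of each row's CITY in the cities array
pos = df['CITY'].map({c: i for i, c in enumerate(cities)}).to_numpy()
# offsets in [1, n-1] guarantee CITY2 != CITY, uniform over the other cities
offset = np.random.randint(1, n, size=len(df))
df['CITY2'] = cities[(pos + offset) % n]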

How do you use a function to aggregate dataframe columns and sort them into quarters, based on the European dates?

Hi I’m new to pandas and struggling with a challenging problem.
I have 2 dataframes:
Df1
Superhero ID  Superhero   City
212121        Spiderman   New York
364331        Ironman     New York
678523        Batman      Gotham
432432        Dr Strange  New York
665544        Thor        Asgard
123456        Superman    Metropolis
555555        Nightwing   Gotham
666666        Loki        Asgard
And
Df2
SID     Mission End date
665544  10/10/2020
665544  03/03/2021
212121  02/02/2021
665544  05/12/2020
212121  15/07/2021
123456  03/06/2021
666666  12/10/2021
I need to create a new df that summarizes how many heroes are in each city and in which quarter their missions will be complete. Also note that the dates are written in the European format (day/month/year).
I am able to summarize how many heroes are in each city with the line:
df_Count = pd.DataFrame(df1.City.value_counts().reset_index())
Which gives me:
City        Count
New York    3
Gotham      2
Asgard      2
Metropolis  1
I need to add another column that lists whether the heroes will be free from missions in certain quarters.
Quarter 1 – Apr, May, Jun
Quarter 2 – Jul, Aug, Sept
Quarter 3 – Oct, Nov, Dec
Quarter 4 – Jan, Feb, Mar
If a hero's ID from Df1 has no mission end date in Df2, the count should increase by one. If they do have an end date, it is sorted into the quarter in which that date falls.
So in the end it should look like this:
City        Total Count  No. of heroes free in Q3  No. of heroes free in Q4  Free in Q1 2021+
New York    3            2                         0                         1
Gotham      2            2                         2                         0
Asgard      2            1                         2                         0
Metropolis  1            0                         0                         1
I think I need to use the Python datetime library to get the current date, then create a custom function which I can apply to each row using a lambda. Something similar to the code below:
from datetime import date

today = date.today()
q1 = '05/04/2021'
q3 = '05/10/2020'
q4 = '05/01/2021'
count = 0

def QuarterCount(Eid, AssignmentEnd):
    global count
    if df1['Superhero ID'] == df2['SID']:
        if df2['Mission End date'] < q3:
            count += 1
            return count
        elif q3 < df2['Mission End date'] < q4:
            count += 1
            return count
        elif df2['Mission End date'] > q1:
            count += 1
            return count

df1['No. of heroes free in Q3'] = df1.apply(lambda x: QuarterCount(x['Superhero ID'], x), axis=1)
Please help me correct my syntax or logic or let me know if there is a better way to do this. Learning pandas is challenging but oddly fun. I'd appreciate any help you can provide :)
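A minimal sketch of one way to approach this, using standard calendar quarters via pandas periods (the custom April-to-March quarters above would need an extra remapping step; column names are taken from the question):
import pandas as pd

# Parse the European day-first dates and bucket them into calendar quarters
df2['end'] = pd.to_datetime(df2['Mission End date'], dayfirst=True)
df2['quarter'] = df2['end'].dt.to_period('Q').astype(str)  # e.g. '2021Q1'

# A hero is free once their latest mission ends
latest = df2.groupby('SID', as_index=False)['quarter'].max()

# Attach each hero's city; heroes with no missions at all are free already
merged = df1.merge(latest, left_on='Superhero ID', right_on='SID', how='left')
merged['quarter'] = merged['quarter'].fillna('free now')

# City x quarter counts, plus the total heroes per city
result = pd.crosstab(merged['City'], merged['quarter'])
result['Total Count'] = df1['City'].value_counts()
print(result)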

multiple conditions for lookup in pandas

I have 2 dataframes. One has the city, dates and sales:
sales = [['20101113','Miami',35], ['20101114','New York',70], ['20101114','Los Angeles',4],
         ['20101115','Chicago',36], ['20101114','Miami',12]]
df2 = pd.DataFrame(sales, columns=['Date','City','Sales'])
print (df2)

       Date         City  Sales
0  20101113        Miami     35
1  20101114     New York     70
2  20101114  Los Angeles      4
3  20101115      Chicago     36
4  20101114        Miami     12
The second has some dates and cities.
date = [['20101114','New York'], ['20101114','Los Angeles'], ['20101114','Chicago']]
df = pd.DataFrame(date, columns=['Date','City'])
print (df)

       Date         City
0  20101114     New York
1  20101114  Los Angeles
2  20101114      Chicago
I want to extract the sales from the first dataframe that match the city and date in the second dataframe, and add those sales to the second dataframe. If the date is missing in the first table, then the sales for the next highest date should be retrieved.
The new dataframe should look like this:
       Date         City  Sales
0  20101114     New York     70
1  20101114  Los Angeles      4
2  20101114      Chicago     36
I am having trouble extracting and merging tables. Any suggestions?
This is pd.merge_asof, which allows you to join on a combination of exact matches and then a "close" match for some column.
import pandas as pd

df['Date'] = pd.to_datetime(df.Date)
df2['Date'] = pd.to_datetime(df2.Date)

pd.merge_asof(df.sort_values('Date'),
              df2.sort_values('Date'),
              by='City', on='Date',
              direction='forward')
Output:
        Date         City  Sales
0 2010-11-14     New York     70
1 2010-11-14  Los Angeles      4
2 2010-11-14      Chicago     36
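Note that direction='forward' is what implements the "next highest date" requirement: for a row with no exact date match in the same city, merge_asof takes the first row whose date falls on or after it. The default, direction='backward', would take the closest earlier date instead.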
