Optimizing apply and lambda function with pandas - python

I am trying to optimize a function that returns the value of a variable (wage) under a condition (the district with the largest enrollment within an MSA) for every year. I thought combining apply and lambda would be efficient, but my actual dataset is large (shape 321681x272), which makes the computation extremely slow. Is there a faster way of going about this? I think vectorizing the operations instead of iterating through df could be a solution, but I am unsure what structure it would take as an alternative to df.apply and lambda.
import pandas as pd

df = pd.DataFrame({'year': [2000, 2000, 2001, 2001],
                   'msa': ['NYC-Newark', 'NYC-Newark', 'NYC-Newark', 'NYC-Newark'],
                   'leaname': ['NYC School District', 'Newark School District', 'NYC School District', 'Newark School District'],
                   'enroll': [100000, 50000, 110000, 60000],
                   'wage': [5, 2, 7, 3]})

def function1(x, y, var):
    '''
    Returns the selected variable's value for school district with largest enrollment in a given year
    '''
    # Rows for the given MSA (x) and year (y)
    t = df[(df['msa'] == x) & (df['year'] == y)]
    e = pd.DataFrame(t.groupby(['msa', var]).mean()['enroll'])
    return e.loc[e.groupby(level=[0])['enroll'].idxmax()].reset_index()[var]

df['main_city_wage'] = df.apply(lambda x: function1(x['msa'], x['year'], 'wage'), axis=1)
Sample Output
   year         msa                 leaname  enroll  wage  main_wage
0  2000  NYC-Newark     NYC School District  100000     5          5
1  2000  NYC-Newark  Newark School District   50000     2          5
2  2001  NYC-Newark     NYC School District  110000     7          7
3  2001  NYC-Newark  Newark School District   60000     3          7

Something like
df['main_wage'] = df.set_index('wage').groupby(['year', 'msa'])['enroll'].transform('idxmax').values
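A minimal sketch of another vectorized route, assuming the goal is the wage of the largest-enrollment district within each (year, msa) group: locate the winning row per group with idxmax, then merge it back onto every row. The names winners and result are illustrative and not from the question.

```python
import pandas as pd

df = pd.DataFrame({
    'year': [2000, 2000, 2001, 2001],
    'msa': ['NYC-Newark'] * 4,
    'leaname': ['NYC School District', 'Newark School District',
                'NYC School District', 'Newark School District'],
    'enroll': [100000, 50000, 110000, 60000],
    'wage': [5, 2, 7, 3],
})

# Row label of the largest-enrollment district in each (year, msa) group
top_idx = df.groupby(['year', 'msa'])['enroll'].idxmax()

# One row per group holding that district's wage
winners = (df.loc[top_idx, ['year', 'msa', 'wage']]
             .rename(columns={'wage': 'main_wage'}))

# Broadcast the group-level wage back to every row of the group
result = df.merge(winners, on=['year', 'msa'], how='left')
print(result)
```

This runs one groupby and one merge instead of filtering the whole frame once per row, which is where the apply version spends its time.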

Related

how to calculate percentage variation between two values?

I have this dataframe with the total population number by year.
import pandas as pd
cases_df = pd.DataFrame(data=cases_list, columns=['Year', 'Population', 'Nation'])
cases_df.head(7)
Year Population Nation
0 2019 328239523 United States
1 2018 327167439 United States
2 2017 325719178 United States
3 2016 323127515 United States
4 2015 321418821 United States
5 2014 318857056 United States
6 2013 316128839 United States
I want to calculate how much the population increased from 2013 to 2019, as the percentage change between the two values:
[(328239523 - 316128839) / 316128839] x 100
How can I do this? Thank you very much!
P.S. Any advice on how to remove the index (0 1 2 3 4 5 6)?
I tried this:
df1 = df.groupby(level='Population').pct_change()
print(df1)
but I get an error saying that "Population" is not the name of an index level.
I would do it the following way:
import pandas as pd
df = pd.DataFrame({"year":[2015,2014,2013],"population":[321418821,318857056,316128839],"nation":["United States","United States","United States"]})
df = df.set_index("year")
df["percentage"] = df["population"] * 100 / df["population"][2013]
print(df)
output
      population         nation  percentage
year
2015   321418821  United States  101.673363
2014   318857056  United States  100.863008
2013   316128839  United States  100.000000
Note that I used a subset of the data for brevity. Using year as the index allows easy access to the 2013 population; percentage is computed as (population) * 100 / (population for 2013), so 100 means no change relative to 2013.
How to remove the numbered index:
df.set_index('Year',inplace=True)
Year will now replace the numbered index.
You can then use cases_df.describe()
or cases_df.attribute_name.describe()
This is more of a math question rather than a programming question.
Let's call this a percentage difference between two values, since population can vary both ways (increase or decrease over time).
Now, let's say that in 2013 we had 316128839 people and in 2019 we had 328239523 people:
a = 316128839
b = 328239523
Before we calculate the percentage, we need to find the difference between b and a:
diff = b - a
Now that we have that, we need to see what percentage diff is of a:
perc = (diff / a) * 100
And there is your percentage variation between a and b
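The same arithmetic can be done directly on the dataframe. A small sketch, assuming the Year and Population columns from the question; pop and pct_2013_2019 are illustrative names.

```python
import pandas as pd

cases_df = pd.DataFrame({
    'Year': [2019, 2018, 2017, 2016, 2015, 2014, 2013],
    'Population': [328239523, 327167439, 325719178, 323127515,
                   321418821, 318857056, 316128839],
    'Nation': 'United States',
})

# Index the population by year so single years can be looked up by label
pop = cases_df.set_index('Year')['Population']

# Percentage change from 2013 to 2019: (new - old) / old * 100
pct_2013_2019 = (pop.loc[2019] - pop.loc[2013]) / pop.loc[2013] * 100
print(round(pct_2013_2019, 2))  # roughly 3.83
```

Setting Year as the index also takes care of the unwanted 0..6 index in the printed frame.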

Getting maximum counts of a column in grouped dataframe

My dataframe df is:
    Election Year  Votes Party Region
0            2000     50     A      a
1            2000    100     B      a
2            2000     26     A      b
3            2000    180     B      b
4            2000    300     A      c
5            2000     46     C      c
6            2005    149     A      a
7            2005     46     B      a
8            2005    312     A      b
9            2005     23     B      b
10           2005     16     A      c
11           2005     35     C      c
I want to get the party that wins the most regions in each election year. So the desired output is:
Election Year Party
2000 B
2005 A
I tried this code to get the above output, but it raises an error:
winner = df.groupby(['Election Year'])['Votes'].max().reset_index()
winner = winner.groupby('Election Year').first().reset_index()
winner = winner[['Election Year', 'Party']].to_string(index=False)
winner
How can I get the desired output?
Here is one approach with a nested groupby. We first sum the votes per party in each year-region pair and take the winner, then use mode to find the party winning the most regions. The mode need not be unique (if two or more parties win the same number of regions).
df.groupby(["Election Year", "Region"])\
    .apply(lambda gp: gp.groupby("Party").Votes.sum().idxmax())\
    .unstack().mode(1).rename(columns={0: "Party"})
              Party
Election Year
2000              B
2005              A
To address the comment, you can replace idxmax above with nlargest and diff to find regions where the winning margin is below a given number (125 here).
margin = df.groupby(["Election Year", "Region"])\
    .apply(lambda gp: gp.groupby("Party").Votes.sum().nlargest(2).diff()) > -125
print(margin[margin].reset_index()[["Election Year", "Region"]])
#    Election Year Region
# 0           2000      a
# 1           2005      a
# 2           2005      c
You can use GroupBy.idxmax() to get the index of the maximum Votes for each Election Year group, then use .loc to locate those rows and select the required columns, as follows:
df.loc[df.groupby('Election Year')['Votes'].idxmax()][['Election Year', 'Party']]
Result:
Election Year Party
4 2000 A
8 2005 A
Edit
If we want the Party winning the most Regions, we can use the following code (without the slow .apply() with a lambda function):
(df.loc[
    df.groupby(['Election Year', 'Region'])['Votes'].idxmax()]
    [['Election Year', 'Party', 'Region']]
    .pivot(index='Election Year', columns='Region')
    .mode(axis=1)
).rename({0: 'Party'}, axis=1).reset_index()
Result:
Election Year Party
0 2000 B
1 2005 A
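For reference, a self-contained sketch of the same idea that counts region wins explicitly instead of pivoting. It assumes the question's data; region_winners, wins and winner are illustrative names, and ties are resolved by idxmax picking the first party it encounters.

```python
import pandas as pd

df = pd.DataFrame({
    'Election Year': [2000] * 6 + [2005] * 6,
    'Votes': [50, 100, 26, 180, 300, 46, 149, 46, 312, 23, 16, 35],
    'Party': list('ABABAC') * 2,
    'Region': list('aabbcc') * 2,
})

# Winner of each (year, region): the row with the most votes in that group
region_winners = df.loc[df.groupby(['Election Year', 'Region'])['Votes'].idxmax()]

# Number of regions won by each party in each year
wins = region_winners.groupby(['Election Year', 'Party']).size()

# Per year, keep the (year, party) label with the most wins and extract the party
winner = (wins.groupby(level='Election Year').idxmax()
              .map(lambda label: label[1])
              .rename('Party')
              .reset_index())
print(winner)
```

Keeping the wins series around also makes it easy to inspect how close each election was.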
Try this
winner = df.groupby(['Election Year','Party'])['Votes'].max().reset_index()
winner.drop('Votes', axis = 1, inplace = True)
winner
Another method (close to @hilberts_drinking_problem's answer, in fact):
>>> df.groupby(["Election Year", "Region"]) \
...     .apply(lambda x: x.loc[x["Votes"].idxmax(), "Party"]) \
...     .unstack().mode(axis="columns") \
...     .rename(columns={0: "Party"}).reset_index()
   Election Year Party
0           2000     B
1           2005     A
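A one-liner such as df.groupby(["Election Year"]).max().reset_index()[['Election Year', 'Party']] runs, but note that max() is applied column by column, so it returns the alphabetically last Party in each year (C for both years here) rather than the party winning the most regions.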

How to run a groupby based on result of other/previous groupby?

Let's assume you are selling a product globally and you want to set up a sales office somewhere in a major city. Your decision will be based purely on sales numbers.
This will be your (simplified) sales data:
import pandas as pd

df = {
    'Product': 'Chair',
    'Country': ['USA', 'USA', 'China', 'China', 'China', 'China', 'India',
                'India', 'India', 'India', 'India', 'India', 'India'],
    'Region': ['USA_West', 'USA_East', 'China_West', 'China_East', 'China_South', 'China_South', 'India_North', 'India_North', 'India_North', 'India_West', 'India_West', 'India_East', 'India_South'],
    'City': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M'],
    'Sales': [1000, 1000, 1200, 200, 200, 200, 500, 350, 350, 100, 700, 50, 50]
}
dff = pd.DataFrame.from_dict(df)
dff
Based on the data you should go for City "G".
The logic should go like this:
1) Find country with Max(sales)
2) in that country, find region with Max(sales)
3) in that region, find city with Max(sales)
I tried groupby(['Product', 'City']).apply(lambda x: x.nlargest(1)), but this doesn't work, because it would propose city "C". This is the city with the highest sales globally, but China is not the country with the highest sales.
I probably have to go through several rounds of groupby: based on each result, filter the original dataframe and run a groupby again on the next level.
To add to the complexity, you sell other products too (not just 'Chairs', but also other furniture). You would have to store the results of each iteration (like the country with Max(sales) per product) somewhere and then use it in the next iteration of the groupby.
Do you have any ideas on how I could implement this in pandas/Python?
The idea is to aggregate the sum at each level and use Series.idxmax to get the top value, which is then used to filter the next level by boolean indexing:
max_country = dff.groupby('Country')['Sales'].sum().idxmax()
max_region = dff[dff['Country'] == max_country].groupby('Region')['Sales'].sum().idxmax()
max_city = dff[dff['Region'] == max_region].groupby('City')['Sales'].sum().idxmax()
print (max_city)
G
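The question also mentions several products. A minimal sketch of how the drill-down above could be repeated per product, assuming the same column names; the best_city helper and the 'Table' rows are hypothetical and only there to illustrate a second product.

```python
import pandas as pd

def best_city(sales: pd.DataFrame) -> str:
    """Drill down Country -> Region -> City, keeping the top seller at each level."""
    subset = sales
    for level in ['Country', 'Region', 'City']:
        top = subset.groupby(level)['Sales'].sum().idxmax()
        subset = subset[subset[level] == top]  # only keep rows of the current winner
    return top

chairs = pd.DataFrame({
    'Product': 'Chair',
    'Country': ['USA', 'USA', 'China', 'China', 'China', 'China', 'India',
                'India', 'India', 'India', 'India', 'India', 'India'],
    'Region': ['USA_West', 'USA_East', 'China_West', 'China_East', 'China_South',
               'China_South', 'India_North', 'India_North', 'India_North',
               'India_West', 'India_West', 'India_East', 'India_South'],
    'City': list('ABCDEFGHIJKLM'),
    'Sales': [1000, 1000, 1200, 200, 200, 200, 500, 350, 350, 100, 700, 50, 50],
})
# Hypothetical second product, invented for illustration only
tables = pd.DataFrame({
    'Product': 'Table',
    'Country': ['USA', 'USA', 'China'],
    'Region': ['USA_West', 'USA_East', 'China_West'],
    'City': ['A', 'B', 'C'],
    'Sales': [300, 100, 350],
})
dff = pd.concat([chairs, tables], ignore_index=True)

# One drill-down per product
result = {product: best_city(group) for product, group in dff.groupby('Product')}
print(result)
```

Because each step filters the already-filtered subset, region or city names that repeat across countries do not leak into the next level.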
One way is to add groupwise totals, then sort your dataframe. This goes beyond your requirement by ordering all your data using your preference logic:
df = pd.DataFrame.from_dict(df)
factors = ['Country', 'Region', 'City']
for factor in factors:
    df[f'{factor}_Total'] = df.groupby(factor)['Sales'].transform('sum')
res = df.sort_values([f'{x}_Total' for x in factors], ascending=False)
print(res.head(5))
   City Country Product       Region  Sales  Country_Total  Region_Total  \
6     G   India   Chair  India_North    500           2100          1200
7     H   India   Chair  India_North    350           2100          1200
8     I   India   Chair  India_North    350           2100          1200
10    K   India   Chair   India_West    700           2100           800
9     J   India   Chair   India_West    100           2100           800

    City_Total
6          500
7          350
8          350
10         700
9          100
So for the most desirable you can use res.iloc[0], for the second res.iloc[1], etc.

How to stop a Dataframe subset from "remembering" old values

Sorry for the weird phrasing, but I didn't know how to better describe it. I will be translating my problem to US terms to ease understanding. My problem is, I have a national database with States and Districts and I need to work only with Districts from Florida, so I do this:
df_fl=df.loc[df.state=='florida'].copy()
After some transformations I want to take mean values of every district from Florida, so I do this:
df_final=df_fl.groupby(['district']).mean()
But this returns a dataframe with every district in the database. All rows for districts that are not in Florida are filled with NaNs. I suppose there's an easy solution to this, but I haven't been able to find it. It's kinda counterintuitive that it works like that, too.
So, can you help me fix this?
Thanks in advance,
Edit:
my data looked like this:
District state Salary
1 Florida 1000
1 Florida 2000
2 Florida 2000
2 Florida 3000
3 California 3000
df_fl, then, looks like this:
District state Salary
1 Florida 1000
1 Florida 2000
2 Florida 2000
2 Florida 3000
And after applying
df_final=df_fl.groupby(['district']).mean()
I expected to get this:
District Salary
1 1500
2 2500
But I'm getting this:
District Salary
1 1500
2 2500
3 nan
Obviously a very simplified version, but the core remains.
It is because your 'District' column is a categorical type.
MCVE
df = pd.DataFrame(dict(
    State=list('CCCCFFFF'),
    District=list('WXWXYYZZ'),
    Value=range(1, 9)
))
Without categorical
df.query('State == "F"').groupby('District').Value.mean()
District
Y 5.5
Z 7.5
Name: Value, dtype: float64
With categorical
df.assign(
    District=pd.Categorical(df.District)
).query('State == "F"').groupby('District').Value.mean()
District
W NaN
X NaN
Y 5.5
Z 7.5
Name: Value, dtype: float64
Solution
There are many ways to do this. One that preserves the categorical dtype is the remove_unused_categories method, applied to the categorical column after filtering:
df = df.assign(District=df.District.cat.remove_unused_categories())
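A runnable sketch of that fix end to end, assuming District is already categorical before filtering (as in the question). df_f and means are illustrative names; observed=False is passed explicitly to reproduce the behaviour from the question, since the default of this argument differs between pandas versions.

```python
import pandas as pd

df = pd.DataFrame({
    'State': list('CCCCFFFF'),
    'District': pd.Categorical(list('WXWXYYZZ')),
    'Value': range(1, 9),
})

# Filter first; the subset still remembers categories W and X
df_f = df[df['State'] == 'F'].copy()

# Drop the categories that no longer occur in the subset
df_f['District'] = df_f['District'].cat.remove_unused_categories()

# Even with observed=False, only Y and Z remain as groups
means = df_f.groupby('District', observed=False)['Value'].mean()
print(means)
```

The column stays categorical, so memory savings and category ordering are kept for later steps.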
As piRSquared already explained, this only happens with categorical data. Starting with pandas 0.23.0, groupby has an "observed" argument that toggles this behavior. MCVE taken from piRSquared:
>>> df = pd.DataFrame(dict(
...     State=list('CCCCFFFF'),
...     District=list('WXWXYYZZ'),
...     Value=range(1, 9)
... ))
>>> df.assign(
...     District=pd.Categorical(df.District)
... ).query('State == "F"').groupby('District').Value.mean()
District
W    NaN
X    NaN
Y    5.5
Z    7.5
Name: Value, dtype: float64
>>> df.assign(
...     District=pd.Categorical(df.District)
... ).query('State == "F"').groupby('District', observed=True).Value.mean()
District
Y    5.5
Z    7.5
Name: Value, dtype: float64

Pandas: Delete rows of a DataFrame if total count of a particular column occurs only 1 time

I'm looking to delete the rows of a DataFrame whose value in a particular column occurs only once.
Example of raw table (values are arbitrary for illustrative purposes):
print df
Country Series Value
0 Bolivia Population 123
1 Kenya Population 1234
2 Ukraine Population 12345
3 US Population 123456
5 Bolivia GDP 23456
6 Kenya GDP 234567
7 Ukraine GDP 2345678
8 US GDP 23456789
9 Bolivia #McDonalds 3456
10 Kenya #Schools 3455
11 Ukraine #Cars 3456
12 US #Tshirts 3456789
Intended outcome:
print df
Country Series Value
0 Bolivia Population 123
1 Kenya Population 1234
2 Ukraine Population 12345
3 US Population 123456
5 Bolivia GDP 23456
6 Kenya GDP 234567
7 Ukraine GDP 2345678
8 US GDP 23456789
I know that df.Series.value_counts()>1 will identify which df.Series occur more than 1 time; and that the code returned will look something like the following:
Population     True
GDP            True
#McDonalds    False
#Schools      False
#Cars         False
#Tshirts      False
I want to write something like the following, so that my new DataFrame drops the rows whose df.Series value occurs only once, but this doesn't work:
df.drop(df.Series.value_counts()==1,axis=1,inplace=True)
You can do this by creating a boolean list/array by either list comprehensions or using DataFrame's string manipulation methods.
The list comprehension approach is:
vc = df['Series'].value_counts()
u = [i not in set(vc[vc==1].index) for i in df['Series']]
df = df[u]
The other approach is to use the str.contains method to check whether the values of the Series column contain a given string or match a given regular expression (used in this case as we are using multiple strings):
vc = df['Series'].value_counts()
pat = r'|'.join(vc[vc==1].index) #Regular expression
df = df[~df['Series'].str.contains(pat)] #Tilde is to negate boolean
Using this regular expressions approach is a bit more hackish and may require some extra processing (character escaping, etc) on pat in case you have regex metacharacters in the strings you want to filter out (which requires some basic regex knowledge). However, it's worth noting this approach is about 4x faster than using the list comprehension approach (tested on the data provided in the question).
As a side note, I recommend avoiding using the word Series as a column name as that's the name of a pandas object.
This is an old question, but the current answer doesn't scale to even moderately large dataframes. A much faster and more "dataframe" way is to add a value-count column and filter on it.
Create the dataset:
df = pd.DataFrame({'Country': 'Bolivia Kenya Ukraine US Bolivia Kenya Ukraine US Bolivia Kenya Ukraine US'.split(),
'Series': 'Pop Pop Pop Pop GDP GDP GDP GDP McDonalds Schools Cars Tshirts'.split()})
Drop rows whose value occurs only once (count == 1) in the column ('Series' in this case):
# Group by 'Series' and add a 'cnt' column holding each group's row count
df['cnt'] = df.groupby(['Series'])['Country'].transform('count')
# Drop the rows where the count is 1, then leave out the helper 'cnt' column
df.drop(df[df.cnt==1].index)[['Country','Series']]
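A sketch of two shorter variants that avoid the helper column, assuming the same sample data: map each value to its count with value_counts, or use duplicated with keep=False, which flags every row whose 'Series' value appears more than once. The names counts, kept and also_kept are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Country': 'Bolivia Kenya Ukraine US Bolivia Kenya Ukraine US Bolivia Kenya Ukraine US'.split(),
    'Series': 'Pop Pop Pop Pop GDP GDP GDP GDP McDonalds Schools Cars Tshirts'.split(),
})

# Variant 1: look up how often each row's 'Series' value occurs
counts = df['Series'].map(df['Series'].value_counts())
kept = df[counts > 1]

# Variant 2: keep=False marks all members of a duplicated group as True
also_kept = df[df.duplicated('Series', keep=False)]

print(kept)
```

Both are boolean masks over the original frame, so the original index labels are preserved.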
