How can I perform math on a pandas pivottable? - python

I have the following df, which I have filtered from a CSV of financial data for stocks.
ticker comp_name per_fisc_year per_fisc_qtr tot_revnu
47 A AGILENT TECH 2006 4 4973.0
48 A AGILENT TECH 2007 4 5420.0
58 A AGILENT TECH 2006 1 1167.0
59 A AGILENT TECH 2006 2 1239.0
60 A AGILENT TECH 2006 3 1239.0
61 A AGILENT TECH 2006 4 1328.0
62 A AGILENT TECH 2007 1 1280.0
63 A AGILENT TECH 2007 2 1320.0
64 A AGILENT TECH 2007 3 1374.0
65 A AGILENT TECH 2007 4 1446.0
I then need to ADD up all the quarterly data to get annual figures with a pivot table.
mean_rev_table = pd.pivot_table(zacks_df_filter_1, values=['tot_revnu'],
                                index=['comp_name'], columns=['per_fisc_year'],
                                aggfunc=np.mean)
mean_rev_table[:5]
which gives me a nicely formatted table
tot_revnu
per_fisc_year 2006 2007
comp_name
1800FLOWERS.COM 390.962667 290.26000
21ST CENTURY IN 550.114800 349.28200
24/7 KID DOC 0.857600 1.09520
24/7 REAL MEDIA 80.097200 57.66300
3COM CORP 409.215333 506.99238
Now I want to calculate annual growth, or just the delta between 2006 and 2007, but I don't know how to reference the annual totals in the table (2006 and 2007).
I tried:
mean_rev_table['rev_growth'] = mean_rev_df['2007'] - mean_rev_df['2006']
but I get a KeyError, because I think it only recognizes tot_revnu as the column. I probably need to recreate the pivot table, but I am not sure how. Thanks

You need to remove the [] to avoid a MultiIndex in the columns:
mean_rev_table = zacks_df_filter_1.pivot_table(
    values='tot_revnu',      # [] here would create a MultiIndex
    index='comp_name',
    columns='per_fisc_year',
    aggfunc=np.mean)
Another solution is droplevel:
mean_rev_table.columns = mean_rev_table.columns.droplevel(0)
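With single-level columns you can then reference the year columns directly. A minimal sketch (note that the column labels come from per_fisc_year as integers, so use 2007 rather than '2007'):
# subtract the 2006 column from the 2007 column; the years are int column labels
mean_rev_table['rev_growth'] = mean_rev_table[2007] - mean_rev_table[2006]
mean_rev_table[:5]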

You can also use groupby() + unstack():
mean_rev_table = (zacks_df_filter_1.groupby(['comp_name', 'per_fisc_year'])['tot_revnu']
                  .sum()
                  .unstack('per_fisc_year')
                  .rename_axis(None, axis=1))
Result:
In [46]: mean_rev_table
Out[46]:
2006 2007
comp_name
AGILENT TECH 9946.0 10840.0

Related

Pandas where function

I'm using Pandas' where function, trying to find the percentage within each state:
filter1 = df['state']=='California'
filter2 = df['state']=='Texas'
filter3 = df['state']=='Florida'
df['percentage']= df['total'].where(filter1)/df['total'].where(filter1).sum()
The output is
Year state total percentage
2014 California 914198.0 0.134925
2014 Florida 766441.0 NaN
2014 Texas 1045274.0 NaN
2015 California 874642.0 0.129087
2015 Florida 878760.0 NaN
How do I apply the other two filters there as well?
Don't use where; use groupby.transform:
df['percentage'] = df['total'].div(df.groupby('state')['total'].transform('sum'))
Output:
Year state total percentage
0 2014 California 914198.0 0.511056
1 2014 Florida 766441.0 0.465865
2 2014 Texas 1045274.0 1.000000
3 2015 California 874642.0 0.488944
4 2015 Florida 878760.0 0.534135
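For reference, a minimal self-contained sketch of the groupby.transform approach on the sample data above (the DataFrame construction is assumed from the question's output):
import pandas as pd

df = pd.DataFrame({
    'Year': [2014, 2014, 2014, 2015, 2015],
    'state': ['California', 'Florida', 'Texas', 'California', 'Florida'],
    'total': [914198.0, 766441.0, 1045274.0, 874642.0, 878760.0],
})

# each row's total divided by the sum of totals for its own state
df['percentage'] = df['total'].div(df.groupby('state')['total'].transform('sum'))
print(df)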
You can also try df.loc[filter1 | filter2 | filter3] in pandas to apply multiple filters together (use |, not &, since the three state filters are mutually exclusive and & would select nothing).

Extract values from dataset to perform functions- multiple countries within dataset

My dataset looks as follows:
Country  Year  Value
Ireland  2010      9
Ireland  2011     11
Ireland  2012     14
Ireland  2013     17
Ireland  2014     20
France   2011     15
France   2012     19
France   2013     21
France   2014     28
Germany  2008     17
Germany  2009     20
Germany  2010     19
Germany  2011     24
Germany  2012     27
Germany  2013     32
My goal is to create a new dataset which tells me the % increase from the first year of available data for a given country, compared to the most recent, which would look roughly as follows:
Country  % increase
Ireland         122
France           87
Germany          88
In essence, for each country in my dataset, I need my code to locate the smallest and largest value for Year, take the corresponding values from the Value column, and calculate the % increase.
I can do this manually; however, I have a lot of countries in my dataset and am looking for a more elegant way to do it. I am trying to troubleshoot my code for this, but I am not having much luck so far.
My code looks as follows at present:
df_1["Min_value"] = df.loc[df["Year"].min(),"Value"].iloc[0]
df_1["Max_value"] = df.loc[df["Year"].max(),"Value"].iloc[0]
df_1["% increase"] = ((df_1["Max_value"]-df_1["Min_value"])/df_1["Min_value"])*100
This returns an error:
AttributeError: 'numpy.float64' object has no attribute 'iloc'
In addition to this it also has the issue that I cannot figure out a way to have the code to run individually for each country within my dataset, so this is another challenge which I am not entirely sure how to address.
Could I potentially go down the route of defining a particular function which could then be applied to each country?
You can group by Country and aggregate min/max for both Year and Value, then calculate percentage change between min and max of the Value.
pct_df = df.groupby(['Country']).agg(['min', 'max'])['Value'] \
           .apply(lambda x: x.pct_change().round(2) * 100, axis=1) \
           .drop('min', axis=1).rename(columns={'max': '% increase'}).reset_index()
print(pct_df)
print(pct_df)
The output:
Country % increase
0 France 87.0
1 Germany 88.0
2 Ireland 122.0
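Note that agg(['min', 'max']) compares the smallest and largest Value per country, which happens to coincide with the earliest and latest year in this data. If you specifically want first-versus-most-recent year, a hedged alternative sketch is to sort by Year and take the first and last Value per country:
g = df.sort_values('Year').groupby('Country')['Value']
pct_df = ((g.last() - g.first()) / g.first() * 100).round(2).rename('% increase').reset_index()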

Replace NA in DataFrame for multiple columns with mean per country

I want to replace NA values with the mean of the same column over the other years for that country.
Note: To replace NA values in the Canada data, I want to use only the mean of Canada, not the mean of the whole dataset, of course.
Here's a sample dataframe filled with random numbers, and some NAs as they appear in my dataframe:
Country  Inhabitants  Year  Area  Cats  Dogs
Canada    38 000 000  2021     4    32    21
Canada    37 000 000  2020     4    NA    21
Canada    36 000 000  2019     3    32    21
Canada            NA  2018     2    32    21
Canada    34 000 000  2017    NA    32    21
Canada    35 000 000  2016     3    32    NA
Brazil   212 000 000  2021     5    32    21
Brazil   211 000 000  2020     4    NA    21
Brazil   210 000 000  2019    NA    32    21
Brazil   209 000 000  2018     4    32    21
Brazil            NA  2017     2    32    21
Brazil   207 000 000  2016     4    32    NA
What's the easiest way with pandas to replace those NAs with the mean values of the other years? And is it possible to write code that goes through every NA and replaces them (Inhabitants, Area, Cats and Dogs) at once?
Note: the example is based on your additional data source from the comments.
To replace the NA values for multiple columns with mean(), you can combine the following three methods:
fillna() (iterating per column; axis should be 0, which is the default value of fillna())
groupby()
transform()
Create data frame from your example:
df = pd.read_excel('https://happiness-report.s3.amazonaws.com/2021/DataPanelWHR2021C2.xls')
Country name  year  Life Ladder  Log GDP per capita  Social support  Healthy life expectancy at birth  Freedom to make life choices  Generosity  Perceptions of corruption  Positive affect  Negative affect
Canada        2005  7.41805      10.6518             0.961552        71.3                              0.957306                      0.25623     0.502681                   0.838544         0.233278
Canada        2007  7.48175      10.7392             nan             71.66                             0.930341                      0.249479    0.405608                   0.871604         0.25681
Canada        2008  7.4856       10.7384             0.938707        71.84                             0.926315                      0.261585    0.369588                   0.89022          0.202175
Canada        2009  7.48782      10.6972             0.942845        72.02                             0.915058                      0.246217    0.412622                   0.867433         0.247633
Canada        2010  7.65035      10.7165             0.953765        72.2                              0.933949                      0.230451    0.41266                    0.878868         0.233113
Call fillna() and iterate over all columns, grouped by country name:
df = df.fillna(df.groupby('Country name').transform('mean'))
Check your result for Canada:
df[df['Country name'] == 'Canada']
Country name  year  Life Ladder  Log GDP per capita  Social support  Healthy life expectancy at birth  Freedom to make life choices  Generosity  Perceptions of corruption  Positive affect  Negative affect
Canada        2005  7.41805      10.6518             0.961552        71.3                              0.957306                      0.25623     0.502681                   0.838544         0.233278
Canada        2007  7.48175      10.7392             0.93547         71.66                             0.930341                      0.249479    0.405608                   0.871604         0.25681
Canada        2008  7.4856       10.7384             0.938707        71.84                             0.926315                      0.261585    0.369588                   0.89022          0.202175
Canada        2009  7.48782      10.6972             0.942845        72.02                             0.915058                      0.246217    0.412622                   0.867433         0.247633
Canada        2010  7.65035      10.7165             0.953765        72.2                              0.933949                      0.230451    0.41266                    0.878868         0.233113
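The same pattern also works directly on the Canada/Brazil sample from the question; a minimal, self-contained sketch (the DataFrame below is reconstructed from the table above):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Country': ['Canada'] * 6 + ['Brazil'] * 6,
    'Inhabitants': [38_000_000, 37_000_000, 36_000_000, np.nan, 34_000_000, 35_000_000,
                    212_000_000, 211_000_000, 210_000_000, 209_000_000, np.nan, 207_000_000],
    'Year': [2021, 2020, 2019, 2018, 2017, 2016] * 2,
    'Area': [4, 4, 3, 2, np.nan, 3, 5, 4, np.nan, 4, 2, 4],
    'Cats': [32, np.nan, 32, 32, 32, 32, 32, np.nan, 32, 32, 32, 32],
    'Dogs': [21, 21, 21, 21, 21, np.nan, 21, 21, 21, 21, 21, np.nan],
})

# fill every NA with the mean of that column within the same country
df = df.fillna(df.groupby('Country').transform('mean'))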
This also works:
In [2]:
df = pd.read_excel('DataPanelWHR2021C2.xls')
In [3]:
# Check for number of null values in df
df.isnull().sum()
Out [3]:
Country name 0
year 0
Life Ladder 0
Log GDP per capita 36
Social support 13
Healthy life expectancy at birth 55
Freedom to make life choices 32
Generosity 89
Perceptions of corruption 110
Positive affect 22
Negative affect 16
dtype: int64
SOLUTION
In [4]:
# Adds mean of column to any NULL values
df.fillna(df.mean(), inplace=True)
In [5]:
# 2nd check for number of null values
df.isnull().sum()
Out [5]: No more NULL values
Country name 0
year 0
Life Ladder 0
Log GDP per capita 0
Social support 0
Healthy life expectancy at birth 0
Freedom to make life choices 0
Generosity 0
Perceptions of corruption 0
Positive affect 0
Negative affect 0
dtype: int64
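Note that df.mean() fills each gap with the column-wide mean, not the per-country mean the question asks for; the groupby + transform approach above handles the per-country case. Also, on newer pandas versions DataFrame.mean() no longer silently skips non-numeric columns, so a hedged variant is:
df.fillna(df.mean(numeric_only=True), inplace=True)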

Merge pandas groupBy objects

I have a huge dataset of 292 million rows (6 GB) in CSV format. Pandas' read_csv function does not work for such a big file, so I am reading the data in small chunks (10 million rows) iteratively using this code:
for chunk in pd.read_csv('hugeData.csv', chunksize=10**7):
    # something ...
In the #something I am grouping rows according to some columns. So in each iteration, I get new groupBy objects. I am not able to merge these groupBy objects.
A smaller dummy example is as follows.
Here dummy.csv is a 28-row CSV file containing trade reports between some countries in some years; sitc is a product code and export is the export amount in USD billions. (Please note that the data is fictional.)
year,origin,dest,sitc,export
2000,ind,chn,2146,2
2000,ind,chn,4132,7
2001,ind,chn,2146,3
2001,ind,chn,4132,10
2002,ind,chn,2227,7
2002,ind,chn,4132,7
2000,ind,aus,7777,19
2001,ind,aus,2146,30
2001,ind,aus,4132,12
2002,ind,aus,4133,30
2000,aus,ind,4132,6
2001,aus,ind,2146,8
2001,chn,aus,1777,9
2001,chn,aus,1977,31
2001,chn,aus,1754,12
2002,chn,aus,8987,7
2001,chn,aus,4879,3
2002,aus,chn,3489,7
2002,chn,aus,2092,30
2002,chn,aus,4133,13
2002,aus,ind,0193,6
2002,aus,ind,0289,8
2003,chn,aus,0839,9
2003,chn,aus,9867,31
2003,aus,chn,3442,3
2004,aus,chn,3344,17
2005,aus,chn,3489,11
2001,aus,ind,0893,17
I split it into two 14-row chunks and grouped them according to year, origin, dest.
for chunk in pd.read_csv('dummy.csv', chunksize=14):
    xd = chunk.groupby(['origin', 'dest', 'year'])['export'].sum()
    print(xd)
Results :
origin dest year
aus ind 2000 6
2001 8
chn aus 2001 40
ind aus 2000 19
2001 42
2002 30
chn 2000 9
2001 13
2002 14
Name: export, dtype: int64
origin dest year
aus chn 2002 7
2003 3
2004 17
2005 11
ind 2001 17
2002 14
chn aus 2001 15
2002 50
2003 40
Name: export, dtype: int64
How can I merge the two GroupBy results?
Will merging them create memory issues again with the big data? Judging by the nature of the data, if properly merged the number of rows should reduce by at least 10-15 times.
The basic aim is :
Given origin country and dest country,
I need to plot total exports between them yearwise.
Querying this every time over the whole data takes a lot of time.
xd = chunk.loc[(chunk.origin == country1) & (chunk.dest == country2)]
Hence I was thinking I could save time by arranging the data once in grouped form.
Any suggestion is greatly appreciated.
You can use pd.concat to join groupby results and then apply sum:
>>> pd.concat([xd0,xd1],axis=1)
export export
origin dest year
aus ind 2000 6 6
2001 8 8
chn aus 2001 40 40
ind aus 2000 19 19
2001 42 42
2002 30 30
chn 2000 9 9
2001 13 13
2002 14 14
>>> pd.concat([xd0,xd1],axis=1).sum(axis=1)
origin dest year
aus ind 2000 12
2001 16
chn aus 2001 80
ind aus 2000 38
2001 84
2002 60
chn 2000 18
2001 26
2002 28
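To scale this to all chunks of the big file, a hedged sketch (assuming the same column names as above) is to collect the per-chunk sums and then group the concatenated partial results once more:
partials = []
for chunk in pd.read_csv('hugeData.csv', chunksize=10**7):
    # partial sum per (origin, dest, year) for this chunk only
    partials.append(chunk.groupby(['origin', 'dest', 'year'])['export'].sum())

# combine the partial sums into the overall yearwise totals
total_exports = pd.concat(partials).groupby(level=['origin', 'dest', 'year']).sum()
This keeps only one chunk plus the partial sums in memory at a time, and total_exports can then be sliced, e.g. total_exports.loc[(country1, country2)], to plot yearwise exports between two countries.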

Pandas: Excel subheading

I'm trying to read in an Excel file that has a sub-header. So far, I'm doing the following:
link = 'http://www.bea.gov/industry/xls/io-annual/GDPbyInd_GO_NAICS_1997-2013.xlsx'
xd = pd.read_excel(link, sheetname='07NAICS_GO_A_Gross Output', skiprows=3)
Unfortunately, the data has a second sub header in row 4 (0-indexed) that only gives the unit of measurement, as follows. Can I somehow cleanly ignore that row?
Table IO Code Description 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013
Current-dollar gross output (Millions of dollars)
A 1111A0 Oilseed farming 19973 17241 13259 13646 13721 14258 15672 21290 17910 18325 21425 31559 33027 34592 38524 43203 44948
skiprows can be a list of rows to ignore, so this does what you want:
xd = pd.read_excel(link, sheetname='07NAICS_GO_A_Gross Output', skiprows=[0, 1, 2, 4])
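In newer pandas versions the keyword was renamed, so an equivalent call (with sheet_name instead of the removed sheetname) would be:
xd = pd.read_excel(link, sheet_name='07NAICS_GO_A_Gross Output', skiprows=[0, 1, 2, 4])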
