Pandas : How to calculate PCT Change for all columns dynamically?

I built the following pandas DataFrame with the command below. How can I compute the percent change for all columns dynamically (AAL, AAN, and 100 more) instead of one column at a time?
price['AABA_PCT_CHG'] = price.AABA.pct_change()
AABA AAL AAN AABA_PCT_CHG
0 16.120001 9.635592 18.836105 NaN
1 16.400000 8.363149 23.105881 0.017370
2 16.680000 8.460282 24.892321 0.017073
3 17.700001 8.829385 28.275263 0.061151
4 16.549999 8.839100 27.705627 -0.064972
5 15.040000 8.654548 27.754738 -0.091239

Call pct_change() on the whole DataFrame; it is applied to every column:
In [424]: price.pct_change().add_suffix('_PCT_CHG')
Out[424]:
AABA_PCT_CHG AAL_PCT_CHG AAN_PCT_CHG
0 NaN NaN NaN
1 0.017370 -0.132057 0.226680
2 0.017073 0.011614 0.077315
3 0.061151 0.043628 0.135903
4 -0.064972 0.001100 -0.020146
5 -0.091239 -0.020879 0.001773
In [425]: price.join(price.pct_change().add_suffix('_PCT_CHG'))
Out[425]:
AABA AAL AAN AABA_PCT_CHG AAL_PCT_CHG AAN_PCT_CHG
0 16.120001 9.635592 18.836105 NaN NaN NaN
1 16.400000 8.363149 23.105881 0.017370 -0.132057 0.226680
2 16.680000 8.460282 24.892321 0.017073 0.011614 0.077315
3 17.700001 8.829385 28.275263 0.061151 0.043628 0.135903
4 16.549999 8.839100 27.705627 -0.064972 0.001100 -0.020146
5 15.040000 8.654548 27.754738 -0.091239 -0.020879 0.001773
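
The same pattern can be tried end to end on a small frame. The sketch below reuses the first three rows of the question's data; the variable names pct and result are only illustrative.

```python
import pandas as pd

# Toy prices copied from the first three rows of the question's frame.
price = pd.DataFrame({
    "AABA": [16.120001, 16.400000, 16.680000],
    "AAL": [9.635592, 8.363149, 8.460282],
    "AAN": [18.836105, 23.105881, 24.892321],
})

# pct_change() works column by column: (current - previous) / previous.
pct = price.pct_change().add_suffix("_PCT_CHG")

# join() aligns on the row index, so each change sits next to its price.
result = price.join(pct)
print(result.round(6))
```

If the real frame also holds non-numeric columns, restricting the call to price.select_dtypes("number") first is probably the simplest way to avoid type errors.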

Related

How to multiply different columns in different dataframes using Pandas

I have 2 dataframes that I want to multiply. I want to multiply multiple columns from dataframe 1 with one column in dataframe 2
raw_material_LCI = dataframe1[["climate change","ozone depletion",
"ionising radiation, hh","photochemical ozone formation, hh",
"particulate matter","human toxicity, non-cancer",
"human toxicity, cancer","acidification",
"eutrophication, freshwater","eutrophication, marine",
"eutrophication, terrestrial","ecotoxicity, freshwater",
"land use", "resource use, fossils","resource use, minerals and metals",
"water scarcity"]] * dataframe2["mass_frac"]
The above code returns a dataframe where all the values are NaN. All of the selected columns hold numeric values.
I then tried multiplying dataframe1 by a single value to see whether that worked, as in the example below:
raw_material_LCI = dataframe1[["climate change","ozone depletion",
"ionising radiation, hh","photochemical ozone formation, hh",
"particulate matter","human toxicity, non-cancer",
"human toxicity, cancer","acidification",
"eutrophication, freshwater","eutrophication, marine",
"eutrophication, terrestrial","ecotoxicity, freshwater",
"land use", "resource use, fossils","resource use, minerals and metals",
"water scarcity"]] * 0.7
The example with the single value returns a dataframe with numbers, so it works. Does anyone know why the multiplication in the first instance does not work? I have looked at multiple articles on multiplying columns in different dataframes in Python and cannot find a solution.

You have to align both row and column indexes when you multiply two dataframes and align row index when you multiply a DataFrame by a Series:
>>> df
A B C D E
0 0.787081 0.350508 0.058542 0.492340 0.489379
1 0.512436 0.501375 0.108115 0.960808 0.841969
2 0.055247 0.305830 0.976043 0.016188 0.006424
3 0.303570 0.914876 0.157100 0.767454 0.340381
4 0.446077 0.595001 0.307799 0.115410 0.568281
5 0.226516 0.636902 0.086790 0.079260 0.402414
6 0.451920 0.526025 0.012470 0.931610 0.267155
7 0.472778 0.137005 0.227569 0.941355 0.584782
8 0.944396 0.769115 0.497214 0.531419 0.570797
9 0.788023 0.310288 0.336480 0.585466 0.432246
>>> sr
0 0.920878
1 0.445332
2 0.894407
3 0.613317
4 0.242270
5 0.299121
6 0.843052
7 0.279014
8 0.526778
9 0.249538
dtype: float64
So this produces NaN values, because * matches the Series index (0 to 9) against the DataFrame's column labels (A to E) and finds no label in common:
>>> df * sr # every cell of the result is NaN
Using mul along the index axis works as expected:
>>> df.mul(sr, axis=0) # but not df.mul(sr) (same as df*sr)
A B C D E
0 0.724805 0.322775 0.053910 0.453385 0.450658
1 0.228204 0.223279 0.048147 0.427878 0.374956
2 0.049413 0.273536 0.872980 0.014479 0.005745
3 0.186185 0.561109 0.096352 0.470693 0.208762
4 0.108071 0.144151 0.074571 0.027961 0.137678
5 0.067756 0.190511 0.025961 0.023708 0.120371
6 0.380992 0.443466 0.010513 0.785396 0.225226
7 0.131912 0.038226 0.063495 0.262651 0.163162
8 0.497487 0.405153 0.261921 0.279940 0.300683
9 0.196642 0.077429 0.083965 0.146096 0.107862
If the Series and the DataFrame do not have the same length, you get a partial result:
>>> df.mul(sr.iloc[:5], axis=0)
A B C D E
0 0.724805 0.322775 0.053910 0.453385 0.450658
1 0.228204 0.223279 0.048147 0.427878 0.374956
2 0.049413 0.273536 0.872980 0.014479 0.005745
3 0.186185 0.561109 0.096352 0.470693 0.208762
4 0.108071 0.144151 0.074571 0.027961 0.137678
5 NaN NaN NaN NaN NaN
6 NaN NaN NaN NaN NaN
7 NaN NaN NaN NaN NaN
8 NaN NaN NaN NaN NaN
9 NaN NaN NaN NaN NaN
>>> df.mul(sr.iloc[5:], axis=0)
A B C D E
0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN
5 0.067756 0.190511 0.025961 0.023708 0.120371
6 0.380992 0.443466 0.010513 0.785396 0.225226
7 0.131912 0.038226 0.063495 0.262651 0.163162
8 0.497487 0.405153 0.261921 0.279940 0.300683
9 0.196642 0.077429 0.083965 0.146096 0.107862
Make sure the Series and the DataFrame share the same row index.
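
Applied to the question, the fix would look roughly like the sketch below. The two frames are hypothetical stand-ins for dataframe1 and dataframe2 with made-up numbers and only two of the impact columns; it also assumes both frames share the same row index.

```python
import pandas as pd

# Hypothetical stand-ins for dataframe1 and dataframe2 from the question.
dataframe1 = pd.DataFrame({
    "climate change": [1.0, 2.0, 3.0],
    "ozone depletion": [10.0, 20.0, 30.0],
})
dataframe2 = pd.DataFrame({"mass_frac": [0.5, 0.25, 0.125]})

cols = ["climate change", "ozone depletion"]

# The * operator matches the Series index (0, 1, 2) against the column
# labels, finds no match, and fills the result with NaN.
broken = dataframe1[cols] * dataframe2["mass_frac"]

# mul(..., axis=0) matches the Series index against the row index instead.
raw_material_LCI = dataframe1[cols].mul(dataframe2["mass_frac"], axis=0)
print(raw_material_LCI)
```

If the row labels of the two frames differ but the rows are known to be in the same order, passing dataframe2["mass_frac"].to_numpy() to mul sidesteps label alignment altogether.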

Pandas dataframe merge row by addition

I want to create a dataframe from census data and calculate the number of people who filed a tax return in each earnings group.
So far, I have written this:
census_df = pd.read_csv('../zip code data/19zpallagi.csv')
sub_census_df = census_df[['zipcode', 'agi_stub', 'N02650', 'A02650', 'ELDERLY', 'A07180']].copy()
num_of_returns = ['Number_of_returns_1_25000', 'Number_of_returns_25000_50000', 'Number_of_returns_50000_75000',
'Number_of_returns_75000_100000', 'Number_of_returns_100000_200000', 'Number_of_returns_200000_more']
for i, column_name in zip(range(1, 7), num_of_returns):
    sub_census_df[column_name] = sub_census_df[sub_census_df['agi_stub'] == i]['N02650']
I have 6 groups attached to a specific zip code. I want to get one row, with the number of returns for a specific zip code appearing just once as a column. I already tried to change NaNs to 0 and to use groupby('zipcode').sum(), but I get 50 million rows summed for zip code 0, where it seems that only around 800k should exist.
Here is the dataframe that I currently get:
zipcode agi_stub N02650 A02650 ELDERLY A07180 Number_of_returns_1_25000 Number_of_returns_25000_50000 Number_of_returns_50000_75000 Number_of_returns_75000_100000 Number_of_returns_100000_200000 Number_of_returns_200000_more Amount_1_25000 Amount_25000_50000 Amount_50000_75000 Amount_75000_100000 Amount_100000_200000 Amount_200000_more
0 0 1 778140.0 10311099.0 144610.0 2076.0 778140.0 NaN NaN NaN NaN NaN 10311099.0 NaN NaN NaN NaN NaN
1 0 2 525940.0 19145621.0 113810.0 17784.0 NaN 525940.0 NaN NaN NaN NaN NaN 19145621.0 NaN NaN NaN NaN
2 0 3 285700.0 17690402.0 82410.0 9521.0 NaN NaN 285700.0 NaN NaN NaN NaN NaN 17690402.0 NaN NaN NaN
3 0 4 179070.0 15670456.0 57970.0 8072.0 NaN NaN NaN 179070.0 NaN NaN NaN NaN NaN 15670456.0 NaN NaN
4 0 5 257010.0 35286228.0 85030.0 14872.0 NaN NaN NaN NaN 257010.0 NaN NaN NaN NaN NaN 35286228.0 NaN
And here is what I want to get:
zipcode Number_of_returns_1_25000 Number_of_returns_25000_50000 Number_of_returns_50000_75000 Number_of_returns_75000_100000 Number_of_returns_100000_200000 Number_of_returns_200000_more
0 0 778140.0 525940.0 285700.0 179070.0 257010.0 850.0

Here is one way to do it: group by zipcode and sum the desired columns.
num_of_returns = ['Number_of_returns_1_25000', 'Number_of_returns_25000_50000', 'Number_of_returns_50000_75000',
'Number_of_returns_75000_100000', 'Number_of_returns_100000_200000', 'Number_of_returns_200000_more']
df.groupby('zipcode', as_index=False)[num_of_returns].sum()
zipcode Number_of_returns_1_25000 Number_of_returns_25000_50000 Number_of_returns_50000_75000 Number_of_returns_75000_100000 Number_of_returns_100000_200000 Number_of_returns_200000_more
0 0 778140.0 525940.0 285700.0 179070.0 257010.0 0.0

This question needs more information before it can be answered properly. For example, you leave out what certain columns in your data frame mean:
- `N1: Number of returns`
- `agi_stub: Size of adjusted gross income`
According to IRS this has the following levels.
Size of adjusted gross income:
0 = No AGI Stub
1 = 'Under $1'
2 = '$1 under $10,000'
3 = '$10,000 under $25,000'
4 = '$25,000 under $50,000'
5 = '$50,000 under $75,000'
6 = '$75,000 under $100,000'
7 = '$100,000 under $200,000'
8 = '$200,000 under $500,000'
9 = '$500,000 under $1,000,000'
10 = '$1,000,000 or more'
I got the above from https://www.irs.gov/pub/irs-soi/16incmdocguide.doc
With this information, I think what you want to find is the number of people who filed a tax return for each of the income levels of agi_stub. If that is what you mean, it can be achieved by:
import pandas as pd
data = pd.read_csv("./data/19zpallagi.csv")
## select only the desired columns
data = data[['zipcode', 'agi_stub', 'N1']]
## solution to your problem?
df = data.pivot_table(
index='zipcode',
values='N1',
columns='agi_stub',
aggfunc=['sum']
)
## bit of cleaning up.
PREFIX = 'agi_stub_level_'
df.columns = [PREFIX + level for level in df.columns.get_level_values(1).astype(str)]
Here's the output.
In [77]: df
Out[77]:
agi_stub_level_1 agi_stub_level_2 ... agi_stub_level_5 agi_stub_level_6
zipcode ...
0 50061850.0 37566510.0 ... 21938920.0 8859370.0
1001 2550.0 2230.0 ... 1420.0 230.0
1002 2850.0 1830.0 ... 1840.0 990.0
1005 650.0 570.0 ... 450.0 60.0
1007 1980.0 1530.0 ... 1830.0 460.0
... ... ... ... ... ...
99827 470.0 360.0 ... 170.0 40.0
99833 550.0 380.0 ... 290.0 80.0
99835 1250.0 1130.0 ... 730.0 190.0
99901 1960.0 1520.0 ... 1030.0 290.0
99999 868450.0 644160.0 ... 319880.0 142960.0
[27595 rows x 6 columns]
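
The pivot above can be tried on a small synthetic frame, which also shows one way to get the descriptive column names the question asked for. The counts and the labels dictionary below are made up for illustration and cover only three agi_stub levels.

```python
import pandas as pd

# Synthetic rows in the layout the question describes: one row per
# (zipcode, agi_stub) pair. The counts are invented.
data = pd.DataFrame({
    "zipcode": [1001, 1001, 1001, 1002, 1002, 1002],
    "agi_stub": [1, 2, 3, 1, 2, 3],
    "N1": [2550.0, 2230.0, 1420.0, 2850.0, 1830.0, 1840.0],
})

# With aggfunc given as a single string, the result has flat columns
# named after the agi_stub values.
wide = data.pivot_table(index="zipcode", columns="agi_stub",
                        values="N1", aggfunc="sum")

# Hypothetical labels; adjust them to the real meaning of each level.
labels = {
    1: "Number_of_returns_1_25000",
    2: "Number_of_returns_25000_50000",
    3: "Number_of_returns_50000_75000",
}
wide = wide.rename(columns=labels).reset_index()
wide.columns.name = None  # drop the leftover "agi_stub" axis label
print(wide)
```

Passing aggfunc as a string rather than a one-element list keeps the columns single-level, so no get_level_values() cleanup is needed afterwards.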

How to Group by the mean of specific columns in Python

In the dataframe below:
import pandas as pd
import numpy as np
df= {
'Gen':['M','M','M','M','F','F','F','F','M','M','M','M','F','F','F','F'],
'Site':['FRX','FX','FRX','FRX','FRX','FX','FRX','FX','FX','FX','FX','FRX','FRX','FRX','FRX','FRX'],
'Type':['L','L','L','L','L','L','L','L','R','R','R','R','R','R','R','R'],
'AIC':['<1','<1','<1','<1',1,1,1,1,2,2,2,2,'>2','>2','>2','>2'],
'AIC_TRX':[1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4],
'diff':[-1,-1,-1,-1,0,0,0,0,1,1,1,1,3,3,3,3],
'series':[1,2,4,8,1,2,4,8,1,2,4,8,1,2,4,8],
'Grwth_Time1':[150.78,162.34,188.53,197.69,208.07,217.76,229.48,139.51,146.87,182.54,189.57,199.97,229.28,244.73,269.91,249.19],
'Grwth_Time2':[250.78,262.34,288.53,297.69,308.07,317.7,329.81,339.15,346.87,382.54,369.59,399.97,329.28,347.73,369.91,349.12],
'Grwth_Time3':[240.18,232.14,258.53,276.69,338.07,307.74,359.16,339.25,365.87,392.48,399.97,410.75,429.08,448.39,465.15,469.33],
'Grwth_Time4':[270.84,282.14,298.53,306.69,318.73,327.47,369.63,389.59,398.75,432.18,449.78,473.55,494.85,509.39,515.52,539.23],
'Grwth_Time5':[25.78,22.34,28.53,27.69,30.07,17.7,29.81,33.15,34.87,32.54,36.59,39.97,29.28,34.73,36.91,34.12],
'Grwth_Time6':[240.18,232.14,258.53,276.69,338.07,307.74,359.16,339.25,365.87,392.48,399.97,410.75,429.08,448.39,465.15,469.33],
'Grwth_Time7':[27.84,28.14,29.53,30.69,18.73,27.47,36.63,38.59,38.75,24.18,24.78,21.55,13.85,9.39,15.52,39.23],
}
df = pd.DataFrame(df,columns = ['Gen','Site','Type','AIC','AIC_TRX','diff','series','Grwth_Time1','Grwth_Time2','Grwth_Time3','Grwth_Time4','Grwth_Time5','Grwth_Time6','Grwth_Time7'])
df.info()
I want to do the following:
Find the average of each unique series per AIC_TRX for each Grwth_Time (Grwth_Time1, Grwth_Time2,....,Grwth_Time7)
Export all the outputs as one xlsx file (refer to the figure below)
The desired outputs look like the figure below (note: the numbers in this output are not the actual average values, they were randomly generated)
My attempt:
# Select the columns -> AIC_TRX, series, Grwth_Time1,Grwth_Time2,....,Grwth_Time7
df1 = df[['AIC_TRX', 'diff', 'series',
'Grwth_Time1', 'Grwth_Time2', 'Grwth_Time3', 'Grwth_Time4',
'Grwth_Time5', 'Grwth_Time6', 'Grwth_Time7']]
#Below is where I need help, I want to groupby the 'series' and 'AIC_TRX' for all the 'Grwth_Time1_to_7'
df1.groupby('series').Grwth_Time1.agg(['mean'])
Thanks in advance

You have to group by two columns, ['series', 'AIC_TRX'], and take the mean of each Grwth_Time column.
df.groupby(['series', 'AIC_TRX'])[['Grwth_Time1', 'Grwth_Time2', 'Grwth_Time3',
'Grwth_Time4', 'Grwth_Time5', 'Grwth_Time6', 'Grwth_Time7']].mean().unstack().to_excel("output.xlsx")
Output:
AIC_TRX 1 2 3 4
series
1 150.78 208.07 146.87 229.28
2 162.34 217.76 182.54 244.73
4 188.53 229.48 189.57 269.91
8 197.69 139.51 199.97 249.19
AIC_TRX 1 2 3 4
series
1 250.78 308.07 346.87 329.28
2 262.34 317.70 382.54 347.73
4 288.53 329.81 369.59 369.91
8 297.69 339.15 399.97 349.12
AIC_TRX 1 2 3 4
series
1 240.18 338.07 365.87 429.08
2 232.14 307.74 392.48 448.39
4 258.53 359.16 399.97 465.15
8 276.69 339.25 410.75 469.33
AIC_TRX 1 2 3 4
series
1 270.84 318.73 398.75 494.85
2 282.14 327.47 432.18 509.39
4 298.53 369.63 449.78 515.52
8 306.69 389.59 473.55 539.23
AIC_TRX 1 2 3 4
series
1 25.78 30.07 34.87 29.28
2 22.34 17.70 32.54 34.73
4 28.53 29.81 36.59 36.91
8 27.69 33.15 39.97 34.12
AIC_TRX 1 2 3 4
series
1 240.18 338.07 365.87 429.08
2 232.14 307.74 392.48 448.39
4 258.53 359.16 399.97 465.15
8 276.69 339.25 410.75 469.33
AIC_TRX 1 2 3 4
series
1 27.84 18.73 38.75 13.85
2 28.14 27.47 24.18 9.39
4 29.53 36.63 24.78 15.52
8 30.69 38.59 21.55 39.23
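
In the question's data every (series, AIC_TRX) pair occurs exactly once, so the means above equal the raw values. The sketch below uses a reduced, made-up frame with two rows per pair so the averaging is visible, and shows how the unstacked result is organised; it stops short of writing the Excel file.

```python
import pandas as pd

# Reduced, invented version of the question's frame: two series values,
# two AIC_TRX groups, two measurement columns, two rows per group.
df = pd.DataFrame({
    "series": [1, 1, 2, 2, 1, 1, 2, 2],
    "AIC_TRX": [1, 1, 1, 1, 2, 2, 2, 2],
    "Grwth_Time1": [10.0, 20.0, 30.0, 50.0, 1.0, 3.0, 5.0, 7.0],
    "Grwth_Time2": [100.0, 300.0, 200.0, 400.0, 2.0, 4.0, 6.0, 8.0],
})

value_cols = ["Grwth_Time1", "Grwth_Time2"]
means = df.groupby(["series", "AIC_TRX"])[value_cols].mean()

# unstack() moves the innermost index level (AIC_TRX) into the columns,
# giving a two-level column index: (measurement, AIC_TRX).
wide = means.unstack()

# Selecting one measurement yields a series-by-AIC_TRX table.
print(wide["Grwth_Time1"])
```

Calling wide.to_excel("output.xlsx") on this frame would write all measurements side by side on one sheet under a two-row header.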

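Another option is groupby(...).apply(np.mean, axis=1). Note that axis=1 averages across the columns of each row, key columns included, so the result holds one value per original row rather than one mean per Grwth_Time column.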
result = df1.groupby(['series', 'AIC_TRX']).apply(np.mean, axis=1)
Result:
series AIC_TRX
1 1 0 120.738
2 4 156.281
3 8 170.285
4 12 196.270
2 1 1 122.358
2 5 152.758
3 9 184.494
4 13 205.175
4 1 2 135.471
2 6 171.968
3 10 187.825
4 14 214.907
8 1 3 142.183
2 7 162.849
3 11 196.851
4 15 216.455
dtype: float64

how to merge two excel columns in python using colab

I'm working on a data-oriented project, we have some cancer measurements, and want to classify with K-means algorithms.
I have two basic example datasets with two columns each, but the K-means algorithm needs a single two-column input, so I decided to stack the rows of one dataset under the other. How can I do that?
For example, the first dataset looks like this:
0 2713.9 566.42
1 2718.9 566.42
2 2723.3 566.25
3 2729.5 565.99
4 2735.9 565.83
The second one looks like this:
0 6571.5 959.12
1 6571.6 959.13
2 6571.7 959.12
3 6571.7 959.16
4 6571.7 959.15
And I want something like this (without the row number of course):
0 2713.9 566.42
1 2718.9 566.42
2 2723.3 566.25
3 2729.5 565.99
4 2735.9 565.83
0 6571.5 959.12
1 6571.6 959.13
2 6571.7 959.12
3 6571.7 959.16
4 6571.7 959.15
I tried with this:
X = ds1[ds1.columns[2:4]].append(ds2[ds2.columns[2:4]])
X
and got this:
0 2713.9 566.42 NaN NaN
1 2718.9 566.42 NaN NaN
2 2723.3 566.25 NaN NaN
3 2729.5 565.99 NaN NaN
4 2735.9 565.83 NaN NaN
... ... ... ... ...
44 NaN NaN 6571.8 959.01
45 NaN NaN 6571.7 959.00
46 NaN NaN 6571.7 958.98
47 NaN NaN 6571.5 959.00
48 NaN NaN 6571.4 959.01
I got the same result with this code:
X = pd.concat([ds1[ds1.columns[2:4]], ds2[ds2.columns[2:4]]], axis=0, join='outer', ignore_index=False)
How can I do this? Is there a method for it, or do I have to transform the data in Excel?

Try np.vstack() (this needs import numpy as np):
out=pd.DataFrame(np.vstack((ds1[ds1.columns[2:4]].values,ds2[ds2.columns[2:4]].values)))
OR
via np.concatenate():
out=pd.DataFrame(np.concatenate((ds1[ds1.columns[2:4]].values,ds2[ds2.columns[2:4]].values)))
OR
out=ds1[ds1.columns[2:4]].append(ds2[ds2.columns[2:4]]).T.agg(sorted,key=pd.isnull).dropna().T
OR
You can also rename the columns of one dataset so that both subsets have the same column names, then combine them with concat() or append(). Note that DataFrame.append was removed in pandas 2.0, so use pd.concat there.
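
The rename-then-concat option has no code above, so here is a minimal sketch of it. The frames part1 and part2 and the column names x and y are hypothetical stand-ins for the two column pairs selected in the question.

```python
import pandas as pd

# Hypothetical stand-ins for the two selected column pairs; the differing
# column names are what caused the NaN-padded result in the question.
part1 = pd.DataFrame({"a": [2713.9, 2718.9], "b": [566.42, 566.42]})
part2 = pd.DataFrame({"c": [6571.5, 6571.6], "d": [959.12, 959.13]})

# Give both parts the same (made-up) column names so concat stacks them.
names = ["x", "y"]
part1.columns = names
part2.columns = names

# ignore_index=True renumbers the rows 0..n-1 instead of repeating 0, 1.
X = pd.concat([part1, part2], ignore_index=True)
print(X)
```

Unlike the NumPy variants, this keeps a labelled DataFrame throughout.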

Dataframe calculation

I want to do the following calculation and store the outcome in a new column, "calculation trap":
test["calculation trap"] = (( 0.000164 + 0.000415)/2)
So the outcome for the first row should be 0.0002895.
I tried the following code to do this calculation for the whole column, but I got the result shown in the last column below.
test["calculation trap"] = ((test["calculation"][0:]+test["calculation"][1:])/2).reset_index(drop=True)
Temp calculation. calculation trap.
0 90.01 0.000164 NaN
1 91.03 0.000415 0.000415
2 95.06 0.001315 0.001315
3 100.07 0.002896 0.002896
4 103.50 NaN NaN

Use Series.shift with -1:
test["calculation trap"] = ((test["calculation"].shift(-1)+test["calculation"])/2)
print (test)
Temp calculation calculation trap
0 90.01 0.000164 0.000290
1 91.03 0.000415 0.000865
2 95.06 0.001315 0.002106
3 100.07 0.002896 NaN
4 103.50 NaN NaN
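
The reason the original attempt returned the column unchanged is index alignment, which the following sketch illustrates on the question's four non-missing values. The names aligned and pairwise are only illustrative.

```python
import pandas as pd

s = pd.Series([0.000164, 0.000415, 0.001315, 0.002896])

# Slices keep their original labels, so adding them pairs label 1 with
# label 1, label 2 with label 2, and so on. Label 0 has no partner and
# becomes NaN; every other row is just (x + x) / 2 = x.
aligned = (s.iloc[0:] + s.iloc[1:]) / 2

# shift(-1) moves each value up one row, so row i is paired with row i + 1.
pairwise = (s + s.shift(-1)) / 2
print(pairwise)
```

The last row of pairwise is NaN because there is no following value to average with.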
