How do I aggregate a dataframe and apply a lambda function? - python

I have a data frame with the following columns:
id
name
product
count
price
discount
and I want to create a summary data frame showing how much each client has spent in total, with and without the discount applied.
I tried the following:
summary = df.groupby('client_name')['price', 'count', 'discount'].agg([
    ('Total pre discount', df['price'] * df['count']),
    ('Discount applied', df['price'] * df['count'] * df['discount'])
])
and I'm getting this error:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
Is it even possible to do this in one step?
What am I doing wrong?

Please note that the agg() function cannot be used for calculations involving multiple columns; you have to use the apply() function instead. Refer to this post for details.
if you had multiple columns that needed to interact together then you
cannot use agg, which implicitly passes a Series to the aggregating
function. When using apply the entire group as a DataFrame gets passed
into the function.
For your case, you have to define a customized function as follows:
def f(x):
    data = {}
    data['Total pre discount'] = (x['price'] * x['count']).sum()
    data['Discount applied'] = (x['price'] * x['count'] * x['discount']).sum()
    return pd.Series(data)
Then perform your desired task by:
df.groupby('client_name').apply(f)
or, if you want to use a lambda function instead of the customized function:
df.groupby('client_name').apply(lambda x: pd.Series({
    'Total pre discount': (x['price'] * x['count']).sum(),
    'Discount applied': (x['price'] * x['count'] * x['discount']).sum()
}))
Run Demonstration
Test Data Creation
import pandas as pd

data = {'id': ['0100', '0200', '0100', '0200', '0300'],
        'client_name': ['Ann', 'Bob', 'Ann', 'Bob', 'Charles'],
        'product': ['pen', 'paper', 'folder', 'pencil', 'tray'],
        'count': [12, 300, 5, 12, 10],
        'price': [2.00, 5.00, 3.50, 2.30, 8.20],
        'discount': [0.0, 0.1, 0.15, 0.1, 0.12]}
df = pd.DataFrame(data)
print(df)
Output:
     id client_name product  count  price  discount
0  0100         Ann     pen     12    2.0      0.00
1  0200         Bob   paper    300    5.0      0.10
2  0100         Ann  folder      5    3.5      0.15
3  0200         Bob  pencil     12    2.3      0.10
4  0300     Charles    tray     10    8.2      0.12
Run New Code
# Use either one of the following two lines of code:
summary = df.groupby('client_name').apply(f) # Using the customized function f() defined above
# or using lambda function
summary = df.groupby('client_name').apply(lambda x: pd.Series({
    'Total pre discount': (x['price'] * x['count']).sum(),
    'Discount applied': (x['price'] * x['count'] * x['discount']).sum()
}))
print(summary)
Output:
             Total pre discount  Discount applied
client_name
Ann                        41.5             2.625
Bob                      1527.6           152.760
Charles                    82.0             9.840
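As an aside, if you would rather stay with agg, one workaround is to precompute the row-level products first so that each aggregation only touches a single column. The sketch below assumes the same df as above; pre and disc are hypothetical helper column names, and named aggregation requires pandas 0.25+:
# Sketch: precompute the row-level products, so each aggregation only
# needs a single column; 'pre' and 'disc' are hypothetical helper names.
tmp = df.assign(pre=df['price'] * df['count'],
                disc=df['price'] * df['count'] * df['discount'])
summary = tmp.groupby('client_name').agg(
    **{'Total pre discount': ('pre', 'sum'),
       'Discount applied': ('disc', 'sum')})
This should give the same numbers as the apply version above while keeping the fast built-in sum path.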

df.groupby(['client', 'discount']).agg({'price' : 'sum'}).reset_index()

Related

Creating sub columns in Pandas Dataframes for Summary Statistics

I am working with water quality data for both surface water locations and groundwater well locations. I would like to create a summary statistics table for all three of my parameters (pH, Temp, Salinity), grouped by the location the samples were taken from (Surface Water vs. Groundwater), as shown below:
        |      'Surface Water'      |       'Groundwater'
        | min | max | mean | std    | min | max | mean | std
'pH'    |     |     |      |        |     |     |      |
The way I set up my Excel sheet for data collection includes the following columns: Date, Monitoring ID (either Surface Water or Groundwater), pH, Temp, and Salinity.
How can I tell Python to do this? I am familiar with the groupby and describe() functions, but I don't know how to organize the output the way I want. Any help would be appreciated!
I have tried using the groupby function for each descriptive stat, for example:
mean = df.\
    groupby('Monitoring ID')\
    [['pH', 'SAL (ppt)', 'Temperature (°C)', 'DO (mg/L)']].mean()
min = df.\
    groupby('Monitoring ID')\
    [['pH', 'SAL (ppt)', 'Temperature (°C)', 'DO (mg/L)']].min()
etc.... but I don't know how to incorporate it all into one nice table.
You can use groupby + describe as you suggest, then stack and transpose:
metrics = ['count', 'mean', 'std', 'min', 'max']
out = df.groupby('Monitoring ID').describe().stack().T.loc[:, (slice(None), metrics)]
>>> out
Monitoring ID    Groundwater                                 Surface Water
                       count       mean       std   min    max        count       mean       std   min    max
pH                     159.0   6.979182  0.587316  6.00   7.98        141.0   6.991135  0.564097  6.00   7.99
SAL (ppt)              159.0   1.976226  0.577557  1.02   2.99        141.0   1.917589  0.576650  1.01   2.99
Temperature (°C)       159.0  13.466101  4.805317  4.13  21.78        141.0  13.099645  4.989240  4.03  21.61
DO (mg/L)              159.0   1.984277  0.609071  1.00   2.99        141.0   1.939433  0.577651  1.00   2.96
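To unpack that one-liner, here is a sketch of the intermediate shapes, assuming the same df and metrics as above:
desc = df.groupby('Monitoring ID').describe()  # columns: (parameter, statistic)
tall = desc.stack()                            # rows: (Monitoring ID, statistic)
out = tall.T                                   # rows: parameters; columns: (Monitoring ID, statistic)
out = out.loc[:, (slice(None), metrics)]       # keep and order only the statistics of interest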
You can use agg along with groupby:
import pandas as pd
import numpy as np
# Sample data
data = {'Date': ['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-01', '2022-01-02', '2022-01-03'],
        'Monitoring ID': ['Surface Water', 'Surface Water', 'Surface Water', 'Groundwater', 'Groundwater', 'Groundwater'],
        'pH': [7.1, 7.2, 7.5, 7.8, 7.6, 7.4],
        'Temp': [10, 12, 9, 15, 13, 14],
        'Salinity': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]}
df = pd.DataFrame(data)

# Group by 'Monitoring ID' and calculate summary statistics
summary_stats = df.groupby('Monitoring ID').agg({'pH': ['min', 'max', 'mean', 'std'],
                                                 'Temp': ['min', 'max', 'mean', 'std'],
                                                 'Salinity': ['min', 'max', 'mean', 'std']})

# Flatten the MultiIndex columns into single 'column_stat' names
summary_stats.columns = ['_'.join(col).strip() for col in summary_stats.columns.values]

# Summary table
print(summary_stats)
Pardon me, I'm still trying to figure out how to demonstrate the output of the code here, but I hope this helps.
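For reference, running the snippet above on the six sample rows should print something close to the following (float formatting may differ by pandas version):
               pH_min  pH_max   pH_mean    pH_std  Temp_min  Temp_max  Temp_mean  Temp_std  Salinity_min  Salinity_max  Salinity_mean  Salinity_std
Monitoring ID
Groundwater       7.4     7.8  7.600000  0.200000        13        15  14.000000  1.000000           0.4           0.6            0.5           0.1
Surface Water     7.1     7.5  7.266667  0.208167         9        12  10.333333  1.527525           0.1           0.3            0.2           0.1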

How to do math operations on a dataframe with an undefined number of columns?

I have a data frame in which there is an indefinite number of columns, to be defined later.
Like this:
index   GDP    2004   2005   ...
brasil  1000   0.10   0.10   ...
china   1000   0.15   0.10   ...
india   1000   0.05   0.10   ...
df = pd.DataFrame({'index': ['brasil', 'china', 'india'],
                   'GDP': [1000, 1000, 1000],
                   '2004': [0.10, 0.15, 0.05],
                   '2005': [0.10, 0.10, 0.10]})
The GDP column holds the initial GDP, and the columns from 2004 onwards hold floats representing the percentage GDP growth in each year.
Using the percentages to get the absolute GDP in each year, based on the initial GDP, I need a dataframe like this:
index   GDP    2004   2005
brasil  1000   1100   1210
china   1000   1150   1265
india   1000   1050   1155
I tried to use itertuples, df.columns and for loops, but I'm probably missing something.
Remember that there is an indefinite number of columns.
Thank you very much in advance!
My answer is a combination of Wardy and user19*.
Starting with...
df = pd.DataFrame(data={'GDP': [1000, 1000, 1000],
                        '2004': [0.10, 0.15, 0.5],
                        '2005': [0.10, 0.10, 0.10],
                        'index': ['brasil', 'china', 'india']})
Find the percentage columns and make sure they are in the right order.
columns_of_interest = sorted(c for c in df.columns if c not in ['GDP', 'index'])
Now we calculate...
running_GDP = df['GDP'].copy()  # starting value; copy so the GDP column itself is not modified
for column in columns_of_interest:
    running_GDP *= 1.0 + df[column]
    df[column] = running_GDP
This results in
    GDP    2004    2005   index
0  1000  1100.0  1210.0  brasil
1  1000  1150.0  1265.0   china
2  1000  1500.0  1650.0   india
A simple way is to count the columns and loop over:
num = df.shape[1]
start = 2
for idx in range(start, num):
    df.iloc[:, idx] = df.iloc[:, idx-1] * (1 + df.iloc[:, idx])
print(df)
which gives
    index   GDP    2004    2005
0  brasil  1000  1100.0  1210.0
1   china  1000  1150.0  1265.0
2   india  1000  1050.0  1155.0
You can use df.columns to access a list of the dataframe's columns.
Then you can loop over all of these column names. Here is an example on your data frame where I multiplied every value by 2. If you want to apply different operations to different columns, you can add conditions inside the loop.
df = pd.DataFrame({'index': ['brasil', 'china', 'india'],
                   'GDP': [1000, 1000, 1000],
                   '2004': [0.10, 0.15, 0.5],
                   '2005': [0.10, 0.10, 0.10]})
for colName in df.columns:
    df[colName] *= 2
print(df)
this returns...
          index   GDP  2004  2005
0  brasilbrasil  2000   0.2   0.2
1    chinachina  2000   0.3   0.2
2    indiaindia  2000   1.0   0.2
Hope this helps!
Add one to the percentages; calculate the cumulative product;
q = (df.iloc[:,2:] + 1).cumprod(axis=1)
multiply by the beginning GDP.
q = q.mul(df['GDP'],axis='index')
If you are trying to change the original DataFrame, assign the result back.
df.iloc[:,2:] = q
If you want to make a new DataFrame, concatenate the result with the first columns of the original.
new = pd.concat([df.iloc[:,:2],q],axis=1)
You can put those first two lines together if you want.
q = (df.iloc[:,2:] + 1).cumprod(axis=1).mul(df.GDP,axis='index')
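For reference, applied to the question's sample frame (with india at 0.05 in 2004), new should come out looking something like:
    index   GDP    2004    2005
0  brasil  1000  1100.0  1210.0
1   china  1000  1150.0  1265.0
2   india  1000  1050.0  1155.0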

Trying to get a value in a column of a dataframe using another column's data with a function?

I currently have a data frame "final" with the columns underlying_price, strike, rate, days_to_exp, price, Opt_type and IV,
which looks like this:
import pandas as pd
import mibian

stocksdf = {'underlying_price': [82600, 38900, 28775, 28900, 28275],
            'strike': [30400, 19050, 34000, 36500, 34500],
            'rate': [0, 0, 0, 0, 0],
            'days_to_exp': [3, 3, 3, 3, 3],
            'price': [12, 3, 4, 8, 3.5],
            'Opt_type': ['CE', 'PE', 'PE', 'PE', 'PE']}
final = pd.DataFrame(stocksdf)
final['IV'] = ""
print(final)
output-
   underlying_price  strike  rate  days_to_exp  price Opt_type IV
0             82600   30400     0            3   12.0       CE
1             38900   19050     0            3    3.0       PE
2             28775   34000     0            3    4.0       PE
3             28900   36500     0            3    8.0       PE
4             28275   34500     0            3    3.5       PE
and I have functions to calculate the "IV" column of the "final" data frame that look like this:
def impliedVol_Call(underlying_price, strike, rate, days_to_exp, price):
    c = mibian.BS([underlying_price, strike, rate, days_to_exp],
                  callPrice=price)
    Call_IV = c.impliedVolatility
    return Call_IV

def impliedVol_Put(underlying_price, strike, rate, days_to_exp, price):
    p = mibian.BS([underlying_price, strike, rate, days_to_exp],
                  putPrice=price)
    Put_IV = p.impliedVolatility
    return Put_IV
So, I tried to calculate the "IV" column like this-
for i in range(len(final)):
    if pd.isna(final["Opt_type"].iloc[i]=='CE'):
        final['IV'].iloc[i] = impliedVol_Call(final['Underlying_price'][i], final['strike'][i], final['rate'][i], final['time_toEx'][i], final['Premium_price'][i])
    else:
        final['IV'].iloc[i] = impliedVol_Put(final['Underlying_price'][i], final['strike'][i], final['rate'][i], final['time_toEx'][i], final['Premium_price'][i])
Please help me to get the IV column.
Well, what you are doing is iterative. You can explore lambda functions and the apply method over the dataframe instead.
Below is sample code, which you can alter as needed. Note two fixes relative to your loop: Opt_type should be compared directly (pd.isna was misused), and the column names must match the frame ('underlying_price', 'days_to_exp', 'price'):
final['IV'] = final.apply(lambda x: impliedVol_Call(x['underlying_price'], x['strike'], x['rate'], x['days_to_exp'], x['price'])
                          if x['Opt_type'] == 'CE'
                          else impliedVol_Put(x['underlying_price'], x['strike'], x['rate'], x['days_to_exp'], x['price']),
                          axis=1)
It may also be possible to call impliedVol_Call inside a lambda with the columns as arguments; axis=1 is required so each row is passed:
finaldf['IV'] = finaldf.apply(lambda x: impliedVol_Call(x['underlying_price'], x['strike'], x['rate'], x['days_to_exp'], x['price']), axis=1)
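Another option (a sketch, assuming the column names from the sample frame above) is to split the frame by option type and apply the matching function to each subset, which avoids the branch inside the lambda:
# Split rows by option type, apply the matching mibian-based function
# to each subset, and write the results back into 'IV'.
is_call = final['Opt_type'] == 'CE'
final.loc[is_call, 'IV'] = final[is_call].apply(
    lambda x: impliedVol_Call(x['underlying_price'], x['strike'],
                              x['rate'], x['days_to_exp'], x['price']),
    axis=1)
final.loc[~is_call, 'IV'] = final[~is_call].apply(
    lambda x: impliedVol_Put(x['underlying_price'], x['strike'],
                             x['rate'], x['days_to_exp'], x['price']),
    axis=1)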

Divide every column in one df to every column in another df in Python

Good day,
Problem: I have two data frames - performance per firm (output) and input per firm:
firms = ['1', '2', '3']
df = pd.DataFrame(firms)
output = {'firms': ['1', '2', '3'],
          'Sales': [150, 200, 50],
          'Profit': [200, 210, 90]}
df1 = pd.DataFrame.from_dict(output)
inputs = {'firms': ['1', '2', '3'],
          'Salary': [10000, 20000, 500],
          'employees': [2, 4, 5]}
df2 = pd.DataFrame.from_dict(inputs)
What I need is to divide every column of the output table by every column of the input table. As of now I am doing it in a very ugly manner: dividing the entire output table by each individual column of the input table and then merging the results together. It's all good when I have two columns, but I wonder if there is a better way, as I might have 100 columns in one table and 50 in another. Ah, it's also important that the sizes might differ, e.g. 50 columns in the input and 100 in the output table.
frst = df1.iloc[:,0:2].divide(df2.Salary, axis = 0)
frst.columns = ['y1-x1', 'y2-x1']
sec = df1.iloc[:,0:2].divide(df2.employees, axis = 0)
sec.columns = ['y1-x2', 'y2-x2']
complete = pd.DataFrame(df).join(frst).join(sec)
Output:
| Firm | y1-x1 | y2-x1 | y1-x2 | y2-x2 |
| 1 | 0.0200 | 0.015 | 100.0 | 75.0 |
| 2 | 0.0105 | 0.010 | 52.5 | 50.0 |
| 3 | 0.1800 | 0.100 | 18.0 | 10.0 |
I also tried loops, but if I remember correctly it did not work out because, in my actual example, the tables are of different sizes. I will be very grateful for your suggestions!
I don't see why you can't just use a simple loop. It seems like you want to align everything on firms, so setting that to the index will resolve any joins or divisions on unequal lengths.
df1 = df1.set_index('firms')
df2 = df2.set_index('firms')
l = []
for col in df2.columns:
    l.append(df1.div(df2[col], axis=0).add_suffix(f'_by_{col}'))
pd.concat(l, axis=1)
Output:
       Sales_by_Salary  Profit_by_Salary  Sales_by_employees  Profit_by_employees
firms
1                0.015            0.0200                75.0                100.0
2                0.010            0.0105                50.0                 52.5
3                0.100            0.1800                10.0                 18.0
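If the loop ever becomes a bottleneck, here is a vectorized sketch using NumPy broadcasting (starting again from the question's original df1 and df2, before any set_index; a, b, arr, cols and ratios are hypothetical names):
import numpy as np
import pandas as pd

a = df1.set_index('firms')  # output columns: Sales, Profit
b = df2.set_index('firms')  # input columns: Salary, employees

# Broadcast (n_firms, n_out, 1) against (n_firms, 1, n_in) so that
# arr[f, i, j] = output i of firm f divided by input j of firm f.
arr = a.to_numpy()[:, :, None] / b.to_numpy()[:, None, :]
cols = pd.MultiIndex.from_product([a.columns, b.columns])
ratios = pd.DataFrame(arr.reshape(len(a), -1), index=a.index, columns=cols)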
So I think the issue is that you are treating your data as essentially three-dimensional, with dimensions (firms, components of costs, components of income), and you want a ratio for every pairing of the last two dimensions within each firm.
There are certainly ways to accomplish what you'd like to do in a DataFrame, but they're messy.
Pandas used to have a 3-D object called a Panel, but it has been deprecated (and since removed) in favor of a more complete solution for indexed higher-dimensional data structures called xarray. Think of it as pandas for NDArrays.
We can convert your data into an xarray DataArray by labeling and stacking your indices:
In [2]: income = df1.set_index('firms').rename_axis(['income'], axis=1).stack('income').to_xarray()
In [3]: income
Out[3]:
<xarray.DataArray (firms: 3, income: 2)>
array([[150, 200],
[200, 210],
[ 50, 90]])
Coordinates:
* firms (firms) object '1' '2' '3'
* income (income) object 'Sales' 'Profit'
In [4]: costs = df2.set_index('firms').rename_axis(['costs'], axis=1).stack('costs').to_xarray()
In [5]: costs
Out[5]:
<xarray.DataArray (firms: 3, costs: 2)>
array([[10000, 2],
[20000, 4],
[ 500, 5]])
Coordinates:
* firms (firms) object '1' '2' '3'
* costs (costs) object 'Salary' 'employees'
You now have two DataArrays, each with two dimensions, but the dimensions do not match. Both are indexed by firms, but income is indexed by income and costs is indexed by costs.
These are broadcast against each other automatically when operations are performed against both of them:
In [6]: income / costs
Out[6]:
<xarray.DataArray (firms: 3, income: 2, costs: 2)>
array([[[1.50e-02, 7.50e+01],
[2.00e-02, 1.00e+02]],
[[1.00e-02, 5.00e+01],
[1.05e-02, 5.25e+01]],
[[1.00e-01, 1.00e+01],
[1.80e-01, 1.80e+01]]])
Coordinates:
* firms (firms) object '1' '2' '3'
* income (income) object 'Sales' 'Profit'
* costs (costs) object 'Salary' 'employees'
This data now has the structure you're trying to achieve, and this division was done using optimized cython operations rather than loops.
You can convert the data back to a dataframe using the built-in DataArray.to_series method:
In [7]: (income / costs).to_series().to_frame(name='income to cost ratio')
Out[7]:
income to cost ratio
firms income costs
1 Sales Salary 0.0150
employees 75.0000
Profit Salary 0.0200
employees 100.0000
2 Sales Salary 0.0100
employees 50.0000
Profit Salary 0.0105
employees 52.5000
3 Sales Salary 0.1000
employees 10.0000
Profit Salary 0.1800
employees 18.0000
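If you'd rather end with the wide, one-row-per-firm layout of the earlier answer, a possible last step (a sketch; wide is a hypothetical name) is to unstack the non-firm levels back into columns:
# Move the income and costs levels into a two-level column index.
wide = (income / costs).to_series().unstack(['income', 'costs'])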

Assigning result of pandas groupby

I have the following dataframe:
date, industry, symbol, roc
25-02-2015, Health, abc, 200
25-02-2015, Health, xyz, 150
25-02-2015, Mining, tyr, 45
25-02-2015, Mining, ujk, 70
26-02-2015, Health, abc, 60
26-02-2015, Health, xyz, 310
26-02-2015, Mining, tyr, 65
26-02-2015, Mining, ujk, 23
I need to determine the average 'roc', max 'roc', min 'roc' as well as how many symbols exist for each date+industry. In other words I need to groupby date and industry, and then determine various averages, max/min etc.
So far I am doing the following, which is working but seems to be very slow and inefficient:
sector_df = primary_df.groupby(['date', 'industry'], sort=True).mean()
tmp_max_df = primary_df.groupby(['date', 'industry'], sort=True).max()
tmp_min_df = primary_df.groupby(['date', 'industry'], sort=True).min()
tmp_count_df = primary_df.groupby(['date', 'industry'], sort=True).count()
sector_df['max_roc'] = tmp_max_df['roc']
sector_df['min_roc'] = tmp_min_df['roc']
sector_df['count'] = tmp_count_df['roc']
sector_df.reset_index(inplace=True)
sector_df.set_index(['date', 'industry'], inplace=True)
The above code works, resulting in a dataframe indexed by date+industry, showing me the min/max 'roc' for each date+industry as well as how many symbols existed for each.
I am basically doing the complete groupby multiple times (to determine the mean, max, min and count of 'roc'), which is very slow because it does the same work over and over.
Is there a way to do the groupby just once, then perform the mean, max, etc. on that object and assign the results to sector_df?
You want to perform an aggregate using agg:
In [72]:
df.groupby(['date', 'industry'])[['roc']].agg(['mean', 'max', 'min', 'count'])
Out[72]:
                       roc
                      mean  max  min count
date       industry
2015-02-25 Health    175.0  200  150     2
           Mining     57.5   70   45     2
2015-02-26 Health    185.0  310   60     2
           Mining     44.0   65   23     2
This allows you to pass an iterable (a list in this case) of functions, or their string names, to perform. Selecting [['roc']] first keeps the non-numeric symbol column out of the aggregation.
EDIT
To access individual results you need to pass a tuple for each axis:
In [78]:
gp.loc[('2015-02-25','Health'),('roc','mean')]
Out[78]:
175.0
Where gp = df.groupby(['date', 'industry'])[['roc']].agg(['mean', 'max', 'min', 'count'])
You can just save the groupby part to a variable as shown below:
primary_df = pd.DataFrame([['25-02-2015', 'Health', 'abc', 200],
                           ['25-02-2015', 'Health', 'xyz', 150],
                           ['25-02-2015', 'Mining', 'tyr', 45],
                           ['25-02-2015', 'Mining', 'ujk', 70],
                           ['26-02-2015', 'Health', 'abc', 60],
                           ['26-02-2015', 'Health', 'xyz', 310],
                           ['26-02-2015', 'Mining', 'tyr', 65],
                           ['26-02-2015', 'Mining', 'ujk', 23]],
                          columns='date industry symbol roc'.split())
grouped = primary_df.groupby(['date', 'industry'], sort=True)
sector_df = grouped.mean(numeric_only=True)  # numeric_only keeps the non-numeric 'symbol' column out of the mean
tmp_max_df = grouped.max()
tmp_min_df = grouped.min()
tmp_count_df = grouped.count()
sector_df['max_roc'] = tmp_max_df['roc']
sector_df['min_roc'] = tmp_min_df['roc']
sector_df['count'] = tmp_count_df['roc']
sector_df.reset_index(inplace=True)
sector_df.set_index(['date', 'industry'], inplace=True)
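As a footnote, on pandas 0.25+ the same idea collapses into a single named-aggregation call; a sketch reproducing sector_df's columns:
# One pass over the groups, with flat, descriptive output column names.
sector_df = primary_df.groupby(['date', 'industry'], sort=True).agg(
    roc=('roc', 'mean'),
    max_roc=('roc', 'max'),
    min_roc=('roc', 'min'),
    count=('roc', 'count'),
)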
