I'm relatively new to Pandas dataframes and I have to do a simple calculation, but so far I haven't found a good way to go about it.
Basically what I have is:
type group amount
1 A real 55
2 A fake 12
3 B real 610
4 B fake 23
5 B real 45
Now, I have to add a new column that would show the percentage of fakes in type total. So the simple formula for this table would be for A 12 / (55 + 12) * 100 and for B 23 / (610 + 23 + 45) * 100 and the table should look something like this:
type group amount percentage
1 A real 55
2 A fake 12 17.91
3 B real 610
4 B fake 23
5 B real 45 3.39
I know about groupby statements and basically all the components I need for this (I guess...), but can't figure out how to combine to get this result.
df['percentage'] = df.amount \
    / df.groupby(['type']) \
        .amount.transform('sum')
If there can be multiple fake rows per type, we can be a bit more careful. I'll set the index to preserve the type and group columns while I transform.
c = ['type', 'group']
d1 = df.set_index(c, append=True)
d1.amount /= d1.groupby(level=['type']).amount.transform('sum')
From here, you can choose to leave that alone or consolidate the group column.
Try this out:
percentage = {}
for type in df.type.unique():
    numerator = df[(df.type == type) & (df.group == 'fake')].amount.sum()
    denominator = df[(df.type == type)].amount.sum()
    percentage[type] = numerator / denominator * 100
df['percentage'] = list(df.type.map(percentage))
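The same per-type percentage can also be computed without an explicit loop. The following is a minimal sketch, assuming the sample data from the question; the names fake_total and type_total are illustrative and do not come from the original posts.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'type': ['A', 'A', 'B', 'B', 'B'],
    'group': ['real', 'fake', 'real', 'fake', 'real'],
    'amount': [55, 12, 610, 23, 45],
})

# Keep only the fake amounts (others become 0), then sum them per type
fake_total = (df['amount']
              .where(df['group'].eq('fake'), 0)
              .groupby(df['type'])
              .transform('sum'))

# Sum of all amounts per type, broadcast back to every row
type_total = df.groupby('type')['amount'].transform('sum')

# Percentage of fakes in each type, repeated on every row of that type
df['percentage'] = (fake_total / type_total * 100).round(2)
print(df)
```

Because both totals come from transform, they line up row by row with df, so several fake rows in one type are summed together before dividing.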
If you want to make sure you account for multiple fake groups per type, you can do the following:
type_group_total = df.groupby(['type', 'group'])['amount'].transform('sum')
type_total = df.groupby('type')['amount'].transform('sum')
df['percentage'] = type_group_total / type_total
type group amount percentage
0 A real 55 0.820896
1 A fake 12 0.179104
2 B real 610 0.899705
3 B fake 23 0.100295
4 B fake 45 0.100295
I have a dataframe like the one shown below:
import numpy as np
import pandas as pd
from numpy.random import default_rng
rng = default_rng(100)
df = pd.DataFrame({'Id':[1,2,3,4,5],
                   'grade': rng.choice(list('ACD'),size=(5)),
                   'dash': rng.choice(list('PQRS'),size=(5)),
                   'dumeel': rng.choice(list('QWER'),size=(5)),
                   'dumma': rng.choice((1234),size=(5)),
                   'target': rng.choice([0,1],size=(5))})
My objective is to compute the drill down info for each column
Let me explain by an example.
If we filter the dataframe by df[df['grade']=='A'], we get 2 records as a result. Let's call the filtered column grade the parent_variable. Out of those 2 records, how much do the values of the dumeel column (child_variable) and the dash column (child_variable) account for the target column values (which are 0 and 1)? All categorical/object columns other than the parent variable are called child variables.
We have to repeat the above example procedure for all the categorical/object variables in our dataset.
As a first step, I made use of the below from a SO post
funcs = {
    'cnt of records': 'count',
    'target met': lambda x: sum(x),
    'target met %': lambda x: f"{round(100 * sum(x) / len(x), 2):.2f}%"
}
out = df.select_dtypes('object').melt(ignore_index=False).join(df['target']) \
.groupby(['variable', 'value'])['target'].agg(**funcs).reset_index()
out.rename(columns={'variable': 'parent_variable','value': 'parent_value'}, inplace=True)
But the above gets me only the % and count of target for each parent variable. I would also like to get the breakdown by child variables (for each parent variable).
%_contrib is obtained by computing the % of that record relative to the target value. For example, for dash=P we have one grade value, A (for target = 1), so it has to be 100%. Hope this helps.
I expect my output to look like the one shown below. I have shown a sample for only a couple of columns under parent_variable, but my real data has more than 20 categorical variables, so any efficient approach is welcome and useful.
As you are using a random function to generate the DataFrame it is hard for me to reproduce your example, but I think you are looking for value_counts -
This is the DataFrame I generated with your code -
grade dash dumeel dumma target
0 D P W 50 1
1 D S R 595 0
2 C P E 495 1
3 A Q Q 690 0
4 B P W 653 1
5 D R E 554 0
6 C P Q 392 1
7 D Q Q 186 0
8 B Q E 1228 1
9 C P E 14 0
When I do a value_counts() on the two columns -
df[(df['dash']=='P') & (df['target'] == 1)]['dumeel'].value_counts(normalize=True)
W 0.50
Q 0.25
E 0.25
Name: dumeel, dtype: float64
df[(df['dash']=='P') & (df['target'] == 1)]['grade'].value_counts(normalize=True)
C 0.50
D 0.25
B 0.25
Name: grade, dtype: float64
If you want to loop over all the child_columns - you can do
excl_cols = ['dash', 'target']
child_cols = [col for col in df.columns if col not in excl_cols]
for col in child_cols:
    print(df[(df['dash']=='P') & (df['target'] == 1)][col].value_counts(normalize=True))
If you want to loop over all the columns - then you can use:
loop_columns = set(df.columns) - {'target'}
for parent_col in loop_columns:
    print(f'Parent column is {parent_col}\n')
    parent_vals = df[parent_col].unique()
    child_cols = loop_columns - {parent_col}
    for parent_val in parent_vals:
        for child_col in child_cols:
            print(df[(df[parent_col]==parent_val) & (df['target'] == 1)][child_col].value_counts(normalize=True))
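With more than 20 categorical columns, it may be easier to collect the shares into one long table instead of printing them. The following is a rough sketch under stated assumptions: the function name drill_down, its cat_cols parameter and the output column names are hypothetical, and the sample rows are the categorical columns of the 10-row frame shown above.

```python
import pandas as pd

def drill_down(df, cat_cols, target="target"):
    """For rows where target == 1, return the share of each child value
    within each parent value, for every parent/child column pair."""
    hits = df[df[target] == 1]
    frames = []
    for parent in cat_cols:
        for child in cat_cols:
            if child == parent:
                continue
            share = (hits.groupby(parent)[child]
                         .value_counts(normalize=True)
                         .rename("share")
                         .reset_index())
            share.columns = ["parent_value", "child_value", "share"]
            share.insert(0, "parent_variable", parent)
            share.insert(2, "child_variable", child)
            frames.append(share)
    return pd.concat(frames, ignore_index=True)

# Categorical columns and target from the 10-row frame above
df = pd.DataFrame({
    "grade": list("DDCABDCDBC"),
    "dash": list("PSPQPRPQQP"),
    "dumeel": list("WREQWEQQEE"),
    "target": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
})

out = drill_down(df, cat_cols=["grade", "dash", "dumeel"])
print(out[(out["parent_variable"] == "dash") & (out["parent_value"] == "P")])
```

Each row of out holds one (parent value, child value) pair, so the result can be filtered or pivoted afterwards rather than read from console output.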
I want to create a new column in my table by applying an equation, but there are 2 possible equations for the new column.
(1) frequency = (total x 100) / hour
(2) frequency = (total x 1000000) / km_length
the table is similar to this:
type hour km_length total
A 1 - 1
B - 2 1
The calculation for the "frequency" column depends on which of the columns hour and km_length has a value.
then, I expect the table will be like this:
type hour km_length total frequency
A 1 - 1 100
B - 2 1 500000
I have tried using np.nan_to_num, but it did not produce the table I expected.
Is there any way I can do this in Python? Looking forward to any help.
We can use np.where for assigning values based on a condition:
df[["hour", "km_length"]] = df[["hour", "km_length"]].apply(pd.to_numeric, errors="coerce")
df["frequency"] = np.where(
    df["hour"].notna(),  # use equation (1) when hour has a value
    df["total"] * 100 / df["hour"],
    df["total"] * 1_000_000 / df["km_length"]
)
type hour km_length total frequency
0 A 1.0 NaN 1 100.0
1 B NaN 2.0 1 500000.0
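For reference, here is a self-contained sketch of the np.where approach. It assumes the missing entries are stored as the string "-" as in the question, and that the condition is whether hour has a value; both are assumptions about the data rather than something the posts state explicitly.

```python
import numpy as np
import pandas as pd

# Sample table from the question; "-" marks a missing value (assumed)
df = pd.DataFrame({
    "type": ["A", "B"],
    "hour": [1, "-"],
    "km_length": ["-", 2],
    "total": [1, 1],
})

# Turn the "-" placeholders into NaN so arithmetic works
cols = ["hour", "km_length"]
df[cols] = df[cols].apply(pd.to_numeric, errors="coerce")

# Equation (1) where hour is present, otherwise equation (2)
df["frequency"] = np.where(
    df["hour"].notna(),
    df["total"] * 100 / df["hour"],
    df["total"] * 1_000_000 / df["km_length"],
)
print(df)
```

If both hour and km_length are missing in a row, the result for that row is NaN.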
Make your values numeric, then compute. Because a missing value indicates which method to use, and because division by NaN results in NaN, do both calculations and use .fillna to pick the correct resulting value.
df[['hour', 'km_length']] = df[['hour', 'km_length']].apply(pd.to_numeric, errors='coerce')
s1 = df['total'].divide(df['hour']).multiply(100)
s2 = df['total'].divide(df['km_length']).multiply(10**6)
df['frequency'] = s1.fillna(s2)
type hour km_length total frequency
0 A 1.0 NaN 1 100.0
1 B NaN 2.0 1 500000.0
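You can store the data in a numpy array.
import numpy as np
# columns: hour, km_length, total, frequency (frequency starts at 0, NaN marks a missing value)
table = np.array([[1, np.nan, 1, 0], [np.nan, 2, 1, 0]], dtype=float)
for row in table:
    if not np.isnan(row[0]):
        row[3] = (row[2] * 100) / row[0]
    else:
        row[3] = (row[2] * 1000000) / row[1]
print(table)
This should print the desired table.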
I am working with a database that looks like the below. For each fruit (just apple and pears below, for conciseness), we have:
1. yearly sales,
2. current sales,
3. monthly sales and
4. the standard deviation of sales.
Their ordering may vary, but it's always 4 values per fruit.
dataset = {'apple_yearly_avg': [57],
           'apple_sales': [100],
           'apple_monthly_avg': [80],
           'apple_st_dev': [12],
           'pears_monthly_avg': [33],
           'pears_yearly_avg': [35],
           'pears_sales': [40],
           'pears_st_dev': [8]}
df = pd.DataFrame(dataset).T  # transpose
df = df.reset_index()  # clear index
df.columns = ['Description', 'Value']  # name 2 columns
I would like to perform two sets of operations.
For the first set of operations, we isolate a fruit price, say 'pears', and subtract each average sales from current sales.
df_pear = df[df.loc[:, 'Description'].str.contains('pear')]
df_pear['temp'] = df_pear['Value'].where(df_pear.Description.str.contains('sales')).bfill()
df_pear ['some_op'] = df_pear['Value'] - df_pear['temp']
The above works, by creating a temporary column holding pear_sales of 40, backfill it and then use it to subtract values.
Question 1: is there a cleaner way to perform this operation without a temporary column? Also, I get the common warning saying I should use '.loc[row_indexer, col_indexer]', even though the output still works.
For the second set of operations, I need to add '5' rows equal to 'new_purchases' to the bottom of the dataframe, and then fill df_pear['some_op'] with sales * (1 + std_dev * some_multiplier).
df_pear['temp2'] = df_pear['Value'].where(df_pear['Description'].str.contains('st_dev')).bfill()
new_purchases = 5
for i in range(new_purchases):
    df_pear = df_pear.append(df_pear.iloc[-1])  # appends 5 copies of the last row
counter = 1
for i in range(len(df_pear)-1, len(df_pear)-new_purchases, -1):  # backward loop from the bottom
    df_pear.some_op.iloc[i] = df_pear['temp'].iloc[0] * (1 + df_pear['temp2'].iloc[i] * counter)
    counter += 1
This 'backwards' loop achieves it, but again, I'm worried about readability since there's another temporary column created, and then the indexing is rather ugly?
Thank you.
I think there is a cleaner way to perform both of your tasks, for each fruit in one go:
Add 2 columns, Fruit and Descr, by splitting Description at the first "_":
df[['Fruit', 'Descr']] = df['Description'].str.split('_', n=1, expand=True)
To see the result you may print df now.
Define the following function to "reformat" the current group:
def reformat(grp):
    wrk = grp.set_index('Descr')
    sal = wrk.at['sales', 'Value']
    dev = wrk.at['st_dev', 'Value']
    avg = wrk.at['yearly_avg', 'Value']
    # Subtract (yearly) average
    wrk['some_op'] = wrk.Value - avg
    # New rows
    wrk2 = pd.DataFrame([wrk.loc['st_dev']] * 5).assign(
        some_op=[ sal * (1 + dev * i) for i in range(5, 0, -1) ])
    return pd.concat([wrk, wrk2])  # Old and new rows
Apply this function to each group, grouped by Fruit, drop the Fruit column and save the result back in df:
df = df.groupby('Fruit').apply(reformat)\
    .reset_index(drop=True).drop(columns='Fruit')
Now, when you print(df), the result is:
Description Value some_op
0 apple_yearly_avg 57 0
1 apple_sales 100 43
2 apple_monthly_avg 80 23
3 apple_st_dev 12 -45
4 apple_st_dev 12 6100
5 apple_st_dev 12 4900
6 apple_st_dev 12 3700
7 apple_st_dev 12 2500
8 apple_st_dev 12 1300
9 pears_monthly_avg 33 -2
10 pears_sales 40 5
11 pears_yearly_avg 35 0
12 pears_st_dev 8 -27
13 pears_st_dev 8 1640
14 pears_st_dev 8 1320
15 pears_st_dev 8 1000
16 pears_st_dev 8 680
17 pears_st_dev 8 360
I'm not sure whether Description should also be replicated to the new rows from the "st_dev" row. If you want some other content there, set it in the reformat function, after wrk2 is created.
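For Question 1 on its own, a smaller sketch is possible: look up the sales value once as a scalar and subtract it, which avoids the temporary column. This assumes exactly one row per fruit ends in "_sales"; the names is_sales and sales are illustrative. Taking an explicit .copy() of the filtered frame is one common way to avoid the '.loc[row_indexer, col_indexer]' warning.

```python
import pandas as pd

dataset = {'apple_yearly_avg': [57], 'apple_sales': [100],
           'apple_monthly_avg': [80], 'apple_st_dev': [12],
           'pears_monthly_avg': [33], 'pears_yearly_avg': [35],
           'pears_sales': [40], 'pears_st_dev': [8]}
df = pd.DataFrame(dataset).T.reset_index()
df.columns = ['Description', 'Value']

# Work on an independent copy so later assignments do not warn
df_pear = df[df['Description'].str.contains('pear')].copy()

# Scalar lookup of current sales (assumes exactly one '_sales' row)
is_sales = df_pear['Description'].str.endswith('_sales')
sales = df_pear.loc[is_sales, 'Value'].iloc[0]

# Same sign convention as the question: Value minus current sales
df_pear['some_op'] = df_pear['Value'] - sales
print(df_pear)
```

Unlike the bfill version in the question, this also fills the st_dev row (the bfill left it as NaN), so mask that row afterwards if it should stay empty.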
What I have right now looks like this:
0 0.00000787
1 0.00000785
2 0.00000749
3 0.00000788
4 0.00000786
5 0.00000538
6 0.00000472
7 0.00000759
And I would like to add a new column next to it: if the value of spread is between (for example) 0 and 0.00005 then it is part of bin A, if it is between (for example) 0.00005 and 0.0006 then bin B (there are three bins in total). What I have tried so far:
minspread = df['spread'].min()
maxspread = df['spread'].max()
born = (float(maxspread)-float(minspread))/3
born1 = born + float(minspread)
born2 = float(maxspread) - born
df['Bin'] = df['spread'].apply(lambda x: 'A' if x < born1 else ( 'B' if born1 < x <= born2 else 'C'))
But when I do so, everything ends up in bin A:
spread Bin
0 0.00000787 A
1 0.00000785 A
2 0.00000749 A
3 0.00000788 A
4 0.00000786 A
Does anyone have an idea on how to divide the column 'spread' in three bins (A-B-C) with the same number of observations in it? Thanks!
If you get the error:
unsupported operand type(s) for +: 'decimal.Decimal' and 'float'
It means the column type is Decimal, which works poorly with pandas, and should be converted to numeric.
One possible solution is to multiply the column by some big number, e.g. 10**15, and convert to integer, which avoids the precision loss of converting to floats, and then use qcut:
#sample data
#from decimal import Decimal
#df['spread'] = [Decimal(x) for x in df['spread']]
df['spread1'] = (df['spread'] * 10**15).astype(np.int64)
df['bins'] = pd.qcut(df['spread1'], 3, labels=list('ABC'))
print (df)
spread spread1 bins
0 0.00000787 7870000000 C
1 0.00000785 7850000000 B
2 0.00000749 7490000000 A
3 0.00000788 7880000000 C
4 0.00000786 7860000000 C
5 0.00000538 5380000000 A
6 0.00000472 4720000000 A
7 0.00000759 7590000000 B
Solution with no new column:
s = (df['spread'] * 10**15).astype(np.int64)
df['bins'] = pd.qcut(s, 3, labels=list('ABC'))
print (df)
spread bins
0 0.00000787 C
1 0.00000785 B
2 0.00000749 A
3 0.00000788 C
4 0.00000786 C
5 0.00000538 A
6 0.00000472 A
7 0.00000759 B
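If float precision is acceptable for the data at hand, the Decimal column can also be cast to float before binning. This is a minimal sketch assuming the eight sample values from the question; whether the cast is safe depends on how many significant digits the real data carries.

```python
from decimal import Decimal

import pandas as pd

# Sample values from the question, stored as Decimal objects
raw = ['0.00000787', '0.00000785', '0.00000749', '0.00000788',
       '0.00000786', '0.00000538', '0.00000472', '0.00000759']
df = pd.DataFrame({'spread': [Decimal(x) for x in raw]})

# Cast Decimal -> float, then cut into three equal-frequency bins
s = df['spread'].astype(float)
df['bins'] = pd.qcut(s, 3, labels=list('ABC'))
print(df)
```

pd.qcut splits on quantiles, so the bins hold roughly the same number of rows, whereas the equal-width limits born1 and born2 from the question depend only on the minimum and maximum.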
df = pd.DataFrame({'A':[11,11,22,22],'mask':[0,0,0,1],'values':np.arange(10,30,5)})
A mask values
0 11 0 10
1 11 0 15
2 22 0 20
3 22 1 25
Now how can I group by A, keep the column names intact, and still put the result of a custom function into Z:
def calculate_df_stats(dfs):
    mask_ = list(dfs['mask'])
    mean = np.ma.array(list(dfs['values']), mask=mask_).mean()
    return mean
df['Z'] = df.groupby('A').agg(calculate_df_stats) # does not work
and generate:
A mask values Z
0 11 0 10 12.5
1 22 0 20 25
Whatever I do it only replaces values column with the masked mean.
and can your solution be applied for a function on two columns and return in a new column?
To clarify more: let's say I have such a table in Mysql:
SELECT * FROM `Reader_datapoint` WHERE `wavelength` = '560'
LIMIT 200;
which gives me such result:
If I run now this:
SELECT *, avg(action_value) FROM `Reader_datapoint` WHERE `wavelength` = '560'
group by `reader_plate_ID`;
I get:
datapoint_ID plate_ID coordinate_x coordinate_y res_value wavelength ignore avg(action_value)
193 1 0 0 2.1783 560 NULL 2.090027083333334
481 2 0 0 1.7544 560 NULL 1.4695583333333333
769 3 0 0 2.0161 560 NULL 1.6637885416666673
How can I replicate this behaviour in Pandas? Note that all the column names stay the same, the first value is taken, and the new column is added.
If you want the original columns in your result, you can first calculate the grouped and aggregated dataframe. You will have to aggregate your original columns in some way; I took the first occurring value as an example:
>>> df = pd.DataFrame({'A':[11,11,22,22],'mask':[0,0,0,1],'values':np.arange(10,30,5)})
>>> grouped = df.groupby("A")
>>> result = grouped.agg('first')
>>> result
mask values
11 0 10
22 0 20
and then add a column 'Z' to that result by applying your function on the groupby result 'grouped':
>>> def calculate_df_stats(dfs):
...     mask_ = list(dfs['mask'])
...     mean = np.ma.array(list(dfs['values']), mask=mask_).mean()
...     return mean
>>> result['Z'] = grouped.apply(calculate_df_stats)
>>> result
mask values Z
11 0 10 12.5
22 0 20 20.0
In your function definition you can always use more columns (just by their name) to return the result.
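As a script rather than an interactive session, the two steps above can be sketched as follows. The helper name masked_mean is hypothetical, and it uses a plain boolean filter in place of np.ma, which should give the same mean for a 0/1 mask column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [11, 11, 22, 22],
                   'mask': [0, 0, 0, 1],
                   'values': np.arange(10, 30, 5)})

def masked_mean(g):
    # Mean of 'values' over rows that are not masked (mask == 0)
    return g.loc[g['mask'] == 0, 'values'].mean()

grouped = df.groupby('A')

# First value of every original column per group, like the SQL example
result = grouped.first()

# Selecting the columns explicitly keeps the group key out of the function input
result['Z'] = grouped[['mask', 'values']].apply(masked_mean)

# Turn the group key back into a regular column
result = result.reset_index()
print(result)
```

The function receives a sub-frame per group, so it can read any number of columns by name and return a single value for the new column.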