I have already referred to the posts here and here, so please don't mark this as a duplicate.
I have a dataframe like as below
id,status,country,amount,qty
1,pass,USA,123,4500
1,pass,USA,156,3210
1,fail,UK,687,2137
1,fail,UK,456,1236
2,pass,AUS,216,324
2,pass,AUS,678,241
2,nan,ANZ,637,213
2,pass,ANZ,213,543
sf = pd.read_clipboard(sep=',')
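(For reproducing this without the clipboard, the same frame can be built from the literal CSV text; a minimal sketch using io.StringIO — note that read_csv parses the literal "nan" as NaN, consistent with the fillna calls below:)

import io
import pandas as pd

csv_text = """id,status,country,amount,qty
1,pass,USA,123,4500
1,pass,USA,156,3210
1,fail,UK,687,2137
1,fail,UK,456,1236
2,pass,AUS,216,324
2,pass,AUS,678,241
2,nan,ANZ,637,213
2,pass,ANZ,213,543"""

sf = pd.read_csv(io.StringIO(csv_text))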
I would like to get the percentage of values from each column as a separate column.
So, with the help of this post, I tried the below.
Approach 1 - doesn't give the expected output shape:
(pd.crosstab(sf['id'], [sf['status'].fillna('nan'), sf['country'].fillna('nan')], normalize=0)
 .drop('nan', axis=1)
 .mul(100)).reset_index()
Approach 2 - doesn't give the expected output:
sf_inv = sf.melt()
pd.crosstab(sf_inv.value, sf_inv.variable)
I expect my output to look like the below.
You can use crosstab with normalize='index' on your different columns and concat the results:
pd.concat([pd.crosstab(sf['id'], sf[c], normalize='index')
for c in ['status', 'country']], axis=1).mul(100).add_suffix('_pct')
output:
fail_pct pass_pct ANZ_pct AUS_pct UK_pct USA_pct
id
1 50.0 50.0 0.0 0.0 50.0 50.0
2 0.0 100.0 50.0 50.0 0.0 0.0
Handling NaNs (the NaN row still counts in the denominator after the drop, which is why pass shows 75% for id 2):
pd.concat([pd.crosstab(sf['id'], sf[c].fillna('NA'), normalize='index')
.drop(columns='NA', errors='ignore')
for c in ['status', 'country']], axis=1).mul(100).add_suffix('_pct')
output:
fail_pct pass_pct ANZ_pct AUS_pct UK_pct USA_pct
id
1 50.0 50.0 0.0 0.0 50.0 50.0
2 0.0 75.0 50.0 50.0 0.0 0.0
My goal, for a dataset similar to the example below, is to group by [s_num, ip, f_num, direction], then filter the score columns using separate thresholds and count how many values are above each threshold.
id s_num ip f_num direction algo_1_x algo_2_x algo_1_score algo_2_score
0 0.0 0.0 0.0 0.0 X -4.63 -4.45 0.624356 0.664009
15 19.0 0.0 2.0 0.0 X -5.44 -5.02 0.411217 0.515843
16 20.0 0.0 2.0 0.0 X -12.36 -5.09 0.397237 0.541112
20 24.0 0.0 2.0 1.0 X -4.94 -5.15 0.401744 0.526032
21 25.0 0.0 2.0 1.0 X -4.78 -4.98 0.386410 0.564934
22 26.0 0.0 2.0 1.0 X -4.89 -5.03 0.394326 0.513896
24 28.0 0.0 2.0 2.0 X -4.78 -5.00 0.420078 0.521993
25 29.0 0.0 2.0 2.0 X -4.91 -5.14 0.407355 0.485878
26 30.0 0.0 2.0 2.0 X 11.83 -4.97 0.392242 0.659122
27 31.0 0.0 2.0 2.0 X -4.73 -5.07 0.377011 0.524774
The result should look something like the below, where each entry in an algo_i column is the number of values in the group larger than the corresponding threshold.
So far I tried grouping first and applying a custom aggregation, like so:
def count_success(x, thresh):
    return ((x > thresh) * 1).sum()
thresholds=[0.1,0.2]
df.groupby(attr_cols).agg({f'algo_{i+1}_score':count_success(thresh) for i, thresh in enumerate(thresholds)})
but this results in an error :
count_success() missing 1 required positional argument: 'thresh'
So, how can I pass another argument to a function using .agg( )? or is there an easier way to do it using some pandas function?
Named aggregation does not allow extra parameters to be passed to your function. You can use numpy broadcasting:
attr_cols = ["s_num", "ip", "f_num", "direction"]
score_cols = df.columns[df.columns.str.match(r"algo_\d+_score")]
# Convert everything to numpy to prepare for broadcasting
score = df[score_cols].to_numpy()
threshold = np.array([0.1, 0.5])
# Raise `threshold` up 2 dimensions so that every value in `score` is
# broadcast against every value in `threshold`
mask = score > threshold[:, None, None]
# Assemble the result: np.hstack lays the per-threshold (n_rows, n_scores)
# masks side by side, matching the (threshold, algo) column order below
row_index = pd.MultiIndex.from_frame(df[attr_cols])
col_index = pd.MultiIndex.from_product([threshold, score_cols], names=["threshold", "algo"])
result = (
pd.DataFrame(np.hstack(mask), index=row_index, columns=col_index)
.groupby(attr_cols)
.sum()
)
Result:
threshold 0.1 0.5
algo algo_1_score algo_2_score algo_1_score algo_2_score
s_num ip f_num direction
0.0 0.0 0.0 X 1 1 1 1
2.0 0.0 X 2 2 0 2
1.0 X 3 3 0 3
2.0 X 4 4 0 3
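As an aside on the original question ("how can I pass another argument to a function using .agg()?"): a plain dict of per-column callables can bind each threshold via functools.partial, though this applies one threshold per column rather than the full threshold × column grid above. A sketch, reusing the question's count_success and thresholds:

import functools

thresholds = [0.1, 0.2]
agg_spec = {
    f"algo_{i+1}_score": functools.partial(count_success, thresh=t)
    for i, t in enumerate(thresholds)
}
result = df.groupby(attr_cols).agg(agg_spec)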
I am trying to figure out how to sum on the outermost level of my multi-index. So I want to sum the COUNTS column based on the individual operators, across all the shops listed for each.
df=pd.DataFrame(data.groupby('OPERATOR').SHOP.value_counts())
df=df.rename(columns={'SHOP':'COUNTS'})
df['COUNTS'] = df['COUNTS'].astype(float)
df['percentage']=df.groupby(['OPERATOR'])['COUNTS'].sum()
df['percentage']=df.sum(axis=0, level=['OPERATOR', 'SHOP'])
df.head()
COUNTS percentage
OPERATOR SHOP
AVIANCA CC9 3.0 3.0
FF9 1.0 1.0
IHI 1.0 1.0
Aegean HA9 33.0 33.0
IN9 24.0 24.0
When I use the df.sum call, it lets me call it on both levels, but when I change it to df.sum(axis=0, level=['OPERATOR']), the percentage column becomes NaN. I originally had the count column as int, so I thought maybe that was the issue and converted it to float, but this didn't resolve it. This is the desired output:
COUNTS percentage
OPERATOR SHOP
AVIANCA CC9 3.0 5.0
FF9 1.0 5.0
IHI 1.0 5.0
Aegean HA9 33.0 57.0
IN9 24.0 57.0
(This is just a stepping stone on the way to calculating the percentage for each shop relative to the operator, i.e. the FINAL final output would be):
COUNTS percentage
OPERATOR SHOP
AVIANCA CC9 3.0 .6
FF9 1.0 .2
IHI 1.0 .2
Aegean HA9 33.0 .58
IN9 24.0 .42
So bonus points if you include the last step of that as well!! Please help me!!!
Group by OPERATOR and normalize your data:
df['percentage'] = df.groupby('OPERATOR')['COUNTS'] \
.transform(lambda x: x / x.sum()) \
.round(2)
>>> df
COUNTS percentage
OPERATOR SHOP
AVIANCA CC9 3.0 0.60
FF9 1.0 0.20
IHI 1.0 0.20
Aegean HA9 33.0 0.58
IN9 24.0 0.42
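For the intermediate output shown first (a percentage column holding the per-operator total, 5.0 for AVIANCA and 57.0 for Aegean), transform('sum') broadcasts the group sums back to each row:

df['percentage'] = df.groupby('OPERATOR')['COUNTS'].transform('sum')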
I have a dataframe like this.
status new allocation
asset csh fi eq csh fi eq
person act_type
p1 inv 0.0 0.0 100000.0 0.0 0.0 1.0
rsp 0.0 30000.0 20000.0 0.0 0.6 0.4
tfsa 10000.0 40000.0 0.0 0.2 0.8 0.0
The right three columns are percent of total for each act_type. The following does calculate the columns correctly:
# set the percent allocations
df.loc[idx[:,:],idx["allocation",'csh']] = df.loc[idx[:,:],idx["new",'csh']] / df.loc[idx[:,:],idx["new",:]].sum(axis=1)
df.loc[idx[:,:],idx["allocation",'fi']] = df.loc[idx[:,:],idx["new",'fi']] / df.loc[idx[:,:],idx["new",:]].sum(axis=1)
df.loc[idx[:,:],idx["allocation",'eq']] = df.loc[idx[:,:],idx["new",'eq']] / df.loc[idx[:,:],idx["new",:]].sum(axis=1)
I have tried to do these calculations on one line combining 'csh', 'fi', 'eq' as follows:
df.loc[idx[:,:],idx["new", ('csh', 'fi', 'eq')]] / df.loc[idx[:,:],idx["new",:]].sum(axis=1)
But this results in ValueError: cannot join with no level specified and no overlapping names
Any suggestions on how I can reduce these three lines to one line of code, so that I'm dividing ('csh', 'fi', 'eq') by the account total and getting percentages in the adjacent columns?
First, idx[:,:] can be simplified to :. Then use DataFrame.div with axis=0, and for the new columns use rename with DataFrame.join:
# assuming idx = pd.IndexSlice, as in the question's setup
df1 = df.loc[:, idx["new", ('csh', 'fi', 'eq')]].div(df.loc[:, idx["new", :]].sum(axis=1), axis=0)
df = df.join(df1.rename(columns={'new': 'allocation'}, level=0))
print(df)
status new allocation
asset csh fi eq csh fi eq
person act_type
p1 inv 0.0 0.0 100000.0 0.0 0.0 1.0
rsp 0.0 30000.0 20000.0 0.0 0.6 0.4
tfsa 10000.0 40000.0 0.0 0.2 0.8 0.0
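Since the question's frame already has the allocation columns, they could also be filled in place (a sketch, assuming the csh/fi/eq sub-columns appear in the same order under both top-level labels):

idx = pd.IndexSlice
new = df.loc[:, idx["new", :]]
# .to_numpy() drops the "new" column level so alignment doesn't interfere
df.loc[:, idx["allocation", :]] = new.div(new.sum(axis=1), axis=0).to_numpy()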
This is my first question here; I'm quite new to Python/pandas/matplotlib.
I have this line of code that creates a DataFrame:
repartition = sorted2017.groupby(by=sorted2017["Traitement"]).sum()
It works as I expected, except that the column title "Traitement" seems to appear on its own row:
Prix Coût net Manuvie CCQ SSQ
Traitement
masso (Véro) 213.86 0.0 144.0 69.86 0.0
ostéo (Véro) 80.00 0.0 64.0 16.00 0.0
physio (Danny) 415.00 0.0 265.0 150.00 0.0
physio (Véro) 269.00 0.0 204.8 64.20 0.0
psy (Simone) 500.00 0.0 150.0 350.00 0.0
psy (Véro) 300.00 0.0 240.0 60.00 0.0
I wanted to use the "Traitement" column as labels for my matplotlib pie chart, so I tried :
plt.pie(repartition["Prix"], labels=repartition["Traitement"])
plt.show()
But I get a KeyError. I've also tried with iloc for the labels, but then I get
ValueError: 'label' must be of length 'x'
How can I fix this?
After the groupby, the "Traitement" column has become the index, so use repartition.index for the labels:
plt.pie(x=repartition["Prix"], labels=repartition.index)
plt.show()
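Alternatively, reset_index() moves "Traitement" back into a regular column, after which the original call works unchanged:

repartition = repartition.reset_index()
plt.pie(x=repartition["Prix"], labels=repartition["Traitement"])
plt.show()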
This is pine-script code which I am trying to rewrite in Python. What would an optimized equivalent Python version of it be?
Here kama[1] is the previous kama value; for the first calculation this previous value does not exist, so what should be done about kama[1] the first time through?
kama=nz(kama[1], close[1])+smooth*(close[1]-nz(kama[1], close[1]))
pine-script info:
nz
Replaces NaN values with zeros (or given value) in a series.
nz(x, y) → integer
nz(sma(close, 100))
RETURNS
Two args version: returns x if it's a valid (not NaN) number, otherwise y
One arg version: returns x if it's a valid (not NaN) number, otherwise 0
ARGUMENTS
x (series) Series of values to process.
y (float) Value that will be inserted instead of all NaN values in x series.
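(In pandas terms, the two-argument nz(x, y) corresponds to filling NaNs in x from y, e.g.:)

# nz(x, y): return x where it is a valid (not NaN) number, otherwise y
result = x.fillna(y)              # pandas Series equivalent
result = np.where(x.notna(), x, y)  # numpy equivalent (import numpy as np)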
Edit 1:
Something I tried, as below, that is not working:
stockdata['kama'] = stockdata['kama'](-1) if stockdata['kama'](-1) !=0 \
else stockdata['close'] + stockdata['smooth']*(stockdata['close'] - \
stockdata['kama'](-1) if stockdata['kama'](-1) !=0 else stockdata['close'])
Edit 2:
The alternative I tried just to make sure at least one part works, but it is also failing (this is for nz(kama[1], close)):
stockdata['kama'] = np.where(stockdata['kama'][-1] != 0, stockdata['kama'][-1], stockdata['close'])
I am completely stuck now. If this line of pine-script code
kama=nz(kama[1], close)+smooth*(close-nz(kama[1], close))
is not converted to Python, my whole logic will go for a toss. Any working solutions are greatly appreciated.
Edit 3:
The dataframe input of the series:
open high low close adjusted_close \
date
2002-07-01 5.2397 5.5409 5.2397 5.4127 0.0634
2002-07-02 5.5234 5.5370 5.4214 5.4438 0.0638
2002-07-03 5.5060 5.5458 5.3281 5.4661 0.0640
2002-07-04 5.5011 5.5720 5.4175 5.5283 0.0647
2002-07-05 5.5633 5.6566 5.4749 5.5905 0.0655
2002-07-08 5.5011 5.7187 5.5011 5.6255 0.0659
2002-07-09 5.5905 5.7586 5.5681 5.6167 0.0658
2002-07-10 5.4885 5.4885 5.1465 5.2222 0.0612
2002-07-11 4.9784 5.2135 4.9784 5.1863 0.0607
2002-07-12 5.5011 5.5011 5.2446 5.3194 0.0623
2002-07-15 5.3243 5.4797 5.1912 5.3330 0.0625
2002-07-16 5.1999 5.4389 5.1999 5.3155 0.0623
2002-07-17 4.7024 5.1377 4.6189 5.0445 0.0591
2002-07-18 4.8803 5.1465 4.8356 5.0804 0.0595
2002-07-19 5.0270 5.2038 5.0221 5.1513 0.0603
2002-07-22 5.0804 5.1465 4.9687 4.9735 0.0582
2002-07-23 4.8181 5.0843 4.8181 5.0619 0.0593
2002-07-24 5.0580 5.1290 4.9376 5.0619 0.0593
2002-07-25 5.0580 5.0580 4.7918 4.8492 0.0568
volume dividend_amount split_coefficient Om \
date
2002-07-01 21923 0.0 1.0 NaN
2002-07-02 61045 0.0 1.0 NaN
2002-07-03 34161 0.0 1.0 NaN
2002-07-04 27893 0.0 1.0 NaN
2002-07-05 58976 0.0 1.0 NaN
2002-07-08 48910 0.0 1.0 5.472433
2002-07-09 321846 0.0 1.0 5.530900
2002-07-10 138434 0.0 1.0 5.525083
2002-07-11 15027 0.0 1.0 5.437150
2002-07-12 24187 0.0 1.0 5.437150
2002-07-15 50330 0.0 1.0 5.397317
2002-07-16 24928 0.0 1.0 5.347117
2002-07-17 21357 0.0 1.0 5.199100
2002-07-18 27532 0.0 1.0 5.097733
2002-07-19 13380 0.0 1.0 5.105833
2002-07-22 21666 0.0 1.0 5.035717
2002-07-23 40161 0.0 1.0 4.951350
2002-07-24 34480 0.0 1.0 4.927700
2002-07-25 38185 0.0 1.0 4.986967
Hm Lm Cm vClose diff \
date
2002-07-01 NaN NaN NaN NaN 1669.8373
2002-07-02 NaN NaN NaN NaN 1669.8062
2002-07-03 NaN NaN NaN NaN 1669.7839
2002-07-04 NaN NaN NaN NaN 1669.7217
2002-07-05 NaN NaN NaN NaN 1669.6595
2002-07-08 5.595167 5.397117 5.511150 5.493967 1669.6245
2002-07-09 5.631450 5.451850 5.545150 5.539837 1669.6333
2002-07-10 5.623367 5.406033 5.508217 5.515675 1670.0278
2002-07-11 5.567983 5.347750 5.461583 5.453617 1670.0637
2002-07-12 5.556167 5.318933 5.426767 5.434754 1669.9306
2002-07-15 5.526683 5.271650 5.383850 5.394875 1669.9170
2002-07-16 5.480050 5.221450 5.332183 5.345200 1669.9345
2002-07-17 5.376567 5.063250 5.236817 5.218933 1670.2055
2002-07-18 5.319567 5.011433 5.213183 5.160479 1670.1696
2002-07-19 5.317950 5.018717 5.207350 5.162463 1670.0987
2002-07-22 5.258850 4.972733 5.149700 5.104250 1670.2765
2002-07-23 5.192950 4.910550 5.104517 5.039842 1670.1881
2002-07-24 5.141300 4.866833 5.062250 4.999521 1670.1881
2002-07-25 5.128017 4.895650 5.029700 5.010083 1670.4008
signal noise efratio smooth
date
2002-07-01 5.4127 1670.3373 0.003240 0.416113
2002-07-02 5.4438 1670.3062 0.003259 0.416113
2002-07-03 5.4661 1670.2839 0.003273 0.416114
2002-07-04 5.5283 1670.2217 0.003310 0.416115
2002-07-05 5.5905 1670.1595 0.003347 0.416116
2002-07-08 5.6255 1670.1245 0.003368 0.416116
2002-07-09 5.6167 1670.1333 0.003363 0.416116
2002-07-10 5.2222 1670.5278 0.003126 0.416110
2002-07-11 5.1863 1670.5637 0.003105 0.416109
2002-07-12 5.3194 1670.4306 0.003184 0.416111
2002-07-15 5.3330 1670.4170 0.003193 0.416111
2002-07-16 5.3155 1670.4345 0.003182 0.416111
2002-07-17 5.0445 1670.7055 0.003019 0.416107
2002-07-18 5.0804 1670.6696 0.003041 0.416107
2002-07-19 5.1513 1670.5987 0.003084 0.416109
2002-07-22 4.9735 1670.7765 0.002977 0.416106
2002-07-23 5.0619 1670.6881 0.003030 0.416107
2002-07-24 5.0619 1670.6881 0.003030 0.416107
2002-07-25 4.8492 1670.9008 0.002902 0.416104
What is expected for kama=nz(kama[1], close)+smooth*(close-nz(kama[1], close))? Something like:
stockdata['kama'] = nz(stockdata['kama'][-1], stockdata['close']) + stockdata['smooth']*(stockdata['close'] - nz(stockdata['kama'][-1], stockdata['close']))
In this case, for the first iteration there will not be any previous kama value, which has to be taken care of. All the inputs are given in the dataframe format above.
You need to create the column kama first with the values of close:
import numpy as np

# Seed kama with close so that shifting yields a usable previous value
stockdata['kama'] = stockdata['close']
previous_kama = stockdata['kama'].shift()
previous_close = stockdata['close'].shift()

# nz(kama[1], close[1]): previous kama where valid, otherwise previous close
value = np.where(previous_kama.notnull(), previous_kama, previous_close)
stockdata['kama'] = value + stockdata['smooth'] * (previous_close - value)
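Note that this vectorized version applies the formula once per row against the previous close. If kama must feed back into itself bar by bar, as the pine-script recursion implies, an explicit loop is one way to express it (a sketch of the edit-2 variant kama = nz(kama[1], close) + smooth*(close - nz(kama[1], close)), assuming the smooth column is already computed):

import numpy as np

close = stockdata['close'].to_numpy()
smooth = stockdata['smooth'].to_numpy()
kama = np.empty_like(close)

# First bar: there is no previous kama, so nz() falls back to close
kama[0] = close[0]
for i in range(1, len(kama)):
    prev = kama[i - 1]  # previous kama; never NaN after seeding the first bar
    kama[i] = prev + smooth[i] * (close[i] - prev)

stockdata['kama'] = kama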