How to calculate the ratio per column in Python?

I'm trying to calculate the ratio per column in Python.
import pandas as pd
import numpy as np
data = {
    'category': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I'],
    'value 1': [1, 1, 2, 5, 3, 4, 4, 8, 7],
    'value 2': [4, 2, 8, 5, 7, 9, 3, 4, 2]
}
data = pd.DataFrame(data)
data.set_index('category')
#           value 1  value 2
# category
# A               1        4
# B               1        2
# C               2        8
# D               5        5
# E               3        7
# F               4        9
# G               4        3
# H               8        4
# I               7        2
The expected result is as below:
#The sum of value 1: 35, value 2: 44
#The values in the first column were divided by 35, and those in the second column were divided by 44
#           value 1  value 2
# category
# A           0.028    0.090
# B           0.028    0.045
# C           0.057    0.181
# D           0.142    0.113
# E           0.085    0.159
# F           0.114    0.204
# G           0.114    0.068
# H           0.228    0.090
# I           0.200    0.045
I tried to run the below code, but it returned NaN values:
data = data.apply(lambda x: x / data.sum())
data
I think there are simpler methods for this job, but I cannot find the right keywords to search for.
How can I calculate the ratio in each column?

The issue is that you did not make the set_index call permanent: set_index returns a new DataFrame, and you never assigned it back, so 'category' is still an ordinary column. On top of that, inside apply each column x is divided by data.sum(), a Series indexed by the column names, so the row index and the column-name index don't align and every result is NaN.
What I usually do to make sure each step actually takes effect is to chain the operations in a pipeline:
data = pd.DataFrame(data)
dataf = (
    data
    .set_index('category')
    .transform(lambda d: d / d.sum())
)
print(dataf)
By chaining the commands, you get what you want. Note: I used transform instead of apply for speed.
Pipelines are easy to read and less prone to mistakes. Using inplace=True is discouraged in pandas, as the effects can be unpredictable.
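If you only want the column-wise ratios, note that dividing a DataFrame by a Series broadcasts over the columns, so a plain division does the whole job; a minimal sketch of that simpler route:
import pandas as pd

data = pd.DataFrame({
    'category': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I'],
    'value 1': [1, 1, 2, 5, 3, 4, 4, 8, 7],
    'value 2': [4, 2, 8, 5, 7, 9, 3, 4, 2]
}).set_index('category')

# data.sum() is a Series indexed by column name, so the division
# broadcasts column by column: each column is divided by its own total.
ratios = data / data.sum()
print(ratios)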

Related

Creating flexible, iterative field name in Python function or loop

I am creating a DataFrame with the code below:
import pandas as pd
import statistics  # needed for the statistics.mean calls below

df1 = pd.DataFrame({'segment': ['abc', 'abc', 'abc', 'abc', 'abc', 'xyz', 'xyz', 'xyz', 'xyz', 'xyz', 'xyz', 'xyz'],
                    'prod_a_clients': [5, 0, 12, 25, 0, 2, 5, 24, 0, 1, 21, 7],
                    'prod_b_clients': [15, 6, 0, 12, 8, 0, 17, 0, 2, 23, 15, 0]})
abc_seg = df1[(df1['segment'] == 'abc')]
xyz_seg = df1[(df1['segment'] == 'xyz')]
seg_prod = df1[(df1['segment'] == 'abc') & (df1['prod_a_clients'] > 0)]
abc_seg['prod_a_mean'] = statistics.mean(seg_prod['prod_a_clients'])
seg_prod = df1[(df1['segment'] == 'abc') & (df1['prod_b_clients'] > 0)]
abc_seg['prod_b_mean'] = statistics.mean(seg_prod['prod_b_clients'])
seg_prod = df1[(df1['segment'] == 'xyz') & (df1['prod_a_clients'] > 0)]
xyz_seg['prod_a_mean'] = statistics.mean(seg_prod['prod_a_clients'])
seg_prod = df1[(df1['segment'] == 'xyz') & (df1['prod_b_clients'] > 0)]
xyz_seg['prod_b_mean'] = statistics.mean(seg_prod['prod_b_clients'])
segs_combined = [abc_seg, xyz_seg]
df2 = pd.concat(segs_combined, ignore_index=True)
print(df2)
As you can see from the result I need to calculate a mean for every product and segment combination I have. I'm going to be doing this for 100s of products and segments. I have tried many different ways of doing this with a loop or a function and have gotten close with something like the following:
def prod_seg(sg, prd):
    seg_prod = df1[(df1['segment'] == sg) & (df1[prd+'_clients'] > 0)]
    prod_name = prd + '_clients'
    col_name = prd + '_average'
    df_name = sg + '_seg'
    df_name+"['"+prd+'_average'+"']" = statistics.mean(seg_prod[prod_name])  # invalid: assigns to a string expression
    return
The issue is that I need to create a unique column for every iteration and the way I am doing it above is obviously not working.
Is there any way I can recreate what I did above in a loop or function?
You could use groupby to calculate the mean per group. Also, replace the 0s with NaN so they are skipped by the mean calculation. The script then looks like:
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'segment': ['abc', 'abc', 'abc', 'abc', 'abc', 'xyz', 'xyz', 'xyz', 'xyz', 'xyz', 'xyz', 'xyz'],
                    'prod_a_clients': [5, 0, 12, 25, 0, 2, 5, 24, 0, 1, 21, 7],
                    'prod_b_clients': [15, 6, 0, 12, 8, 0, 17, 0, 2, 23, 15, 0]})
df1.set_index("segment", inplace=True, drop=True)
df1[df1 == 0] = np.nan

mean_values = dict()
for seg_key, seg_df in df1.groupby(level=0):
    mean_value = seg_df.mean(numeric_only=True)
    mean_values[seg_key] = mean_value

results = pd.DataFrame.from_dict(mean_values)
print(results)
The result is:
                  abc    xyz
prod_a_clients  14.00  10.00
prod_b_clients  10.25  14.25
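The explicit loop over the groups is not strictly needed: once the segments are on the index and the zeros are NaN, a direct groupby aggregation produces the same table. A minimal sketch, assuming the same df1 as above (indexed by segment, zeros already replaced by NaN):
# Mean per segment; NaNs (former zeros) are skipped automatically.
# Transpose so segments become the columns, matching the table above.
results = df1.groupby(level=0).mean().T
print(results)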
Instead of using a loop, you can derive the same result by first using where on the 0s in the clients columns (which replaces the 0s with NaN), then grouping by the "segment" column and transforming with the "mean" method.
The point of where is that the mean method skips NaN values by default, so by converting 0s to NaN we make sure the 0s are not considered in the mean.
transform('mean') broadcasts the mean (an aggregate value) back to align with the original DataFrame, so every row gets its group's mean.
clients = ['prod_a_clients', 'prod_b_clients']

out = df1.join(df1[['segment']]
               .join(df1[clients].where(df1[clients] > 0))
               .groupby('segment').transform('mean')
               .add_suffix('_mean'))
Output:
   segment  prod_a_clients  prod_b_clients  prod_a_clients_mean  prod_b_clients_mean
0      abc               5              15                 14.0                10.25
1      abc               0               6                 14.0                10.25
2      abc              12               0                 14.0                10.25
3      abc              25              12                 14.0                10.25
4      abc               0               8                 14.0                10.25
5      xyz               2               0                 10.0                14.25
6      xyz               5              17                 10.0                14.25
7      xyz              24               0                 10.0                14.25
8      xyz               0               2                 10.0                14.25
9      xyz               1              23                 10.0                14.25
10     xyz              21              15                 10.0                14.25
11     xyz               7               0                 10.0                14.25

How to multiply two columns together with a condition applied to one of the columns in pandas python?

Here is some example data:
import pandas as pd

data = {'Company': ['A', 'B', 'C', 'D', 'E', 'F'],
        'Value': [18700, 26000, 44500, 32250, 15200, 36000],
        'Change': [0.012, -0.025, -0.055, 0.06, 0.035, -0.034]}
df = pd.DataFrame(data, columns=['Company', 'Value', 'Change'])
df
  Company  Value  Change
0       A  18700   0.012
1       B  26000  -0.025
2       C  44500  -0.055
3       D  32250   0.060
4       E  15200   0.035
5       F  36000  -0.034
I would like to create a new column called 'New Value'. The logic for this column is something along the lines of the following for each row:
if Change > 0, then Value + (Value * Change)
if Change < 0, then Value - (Value * (abs(Change)) )
I attempted to create a list with the following loop and add it to df as a new column, but far more values than expected were returned, when I expected only one per row of df.
lst = []
for x in df['Change']:
    for y in df['Value']:
        if x > 0:
            lst.append(y + (y*x))
        elif x < 0:
            lst.append(y - (y*(abs(x))))
print(lst)
It would be great if someone could point out where I've gone wrong, or suggest an alternate method :)
Your two conditions are actually identical (for negative Change, Value - Value*abs(Change) equals Value + Value*Change), so this is all you need to do:
df['New Value'] = df['Value'] + df['Value'] * df['Change']
Output:
>>> df
  Company  Value  Change  New Value
0       A  18700   0.012    18924.4
1       B  26000  -0.025    25350.0
2       C  44500  -0.055    42052.5
3       D  32250   0.060    34185.0
4       E  15200   0.035    15732.0
5       F  36000  -0.034    34776.0
Or, slightly more concisely:
df['New Value'] = df['Value'] * df['Change'].add(1)
Or
df['New Value'] = df['Value'].mul(df['Change'].add(1))
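As for where the original loop went wrong: the nested for loops pair every Change with every Value, producing len(df)**2 values instead of one per row. If you did want an explicit loop, a single pass with zip fixes it; a minimal sketch:
# Walk the two columns in lockstep, one (value, change) pair per row.
lst = []
for y, x in zip(df['Value'], df['Change']):
    lst.append(y + y * x)  # the same formula covers positive and negative x
df['New Value'] = lst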

How to apply rolling function when all variables in window from multiple columns are required

I'm trying to calculate a rolling statistic that requires all variables in a window from two input columns.
My only solution involves a for loop. Is there a more efficient way, perhaps using Pandas' rolling and apply functions?
import pandas as pd
from statsmodels.tsa.stattools import coint

def f(x):
    return coint(x['a'], x['b'])[1]

df = pd.DataFrame(data={'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8]})
df2 = df.rolling(2).apply(lambda x: f(x), raw=False)  # KeyError: 'a'
I get KeyError: 'a' because df gets passed to f() one series (column) at a time. Specifying axis=1 sends one row and all columns to f(), but neither approach provides the required set of observations.
You could try rolling, mean and sum:
df['result'] = df.rolling(2).mean().sum(axis=1)
   a  b  result
0  1  5     0.0
1  2  6     7.0
2  3  7     9.0
3  4  8    11.0
EDIT
Adding a different answer based upon new information in the question by OP.
Set up the function.
import pandas as pd
from statsmodels.tsa.stattools import coint

def f(x):
    return coint(x['a'], x['b'])
Create the data and dataframe:
a_data = [1, 2, 3, 4]
b_data = [5, 6, 7, 8]
df = pd.DataFrame(data={'a': a_data, 'b': b_data})

   a  b
0  1  5
1  2  6
2  3  7
3  4  8
I gather, after researching coint, that you are trying to pass two rolling arrays in as x['a'] and x['b']. The following will create the arrays and dataframe.
n = 2
arr_a = [df['a'].shift(x).values[::-1][:n] for x in range(len(df['a']))[::-1]]
arr_b = [df['b'].shift(x).values[::-1][:n] for x in range(len(df['b']))[::-1]]
df1 = pd.DataFrame(data={'a': arr_a, 'b': arr_b})
n is the size of the rolling window.
df1

            a           b
0  [1.0, nan]  [5.0, nan]
1  [2.0, 1.0]  [6.0, 5.0]
2  [3.0, 2.0]  [7.0, 6.0]
3      [4, 3]      [8, 7]
Then you can use apply(f, axis=1) to send in the rows of arrays.
df1.iloc[(n-1):,].apply(f, axis=1)
Your output is as follows:
1 (-inf, 0.0, [-48.37534, -16.26923, -10.00565])
2 (-inf, 0.0, [-48.37534, -16.26923, -10.00565])
3 (-inf, 0.0, [-48.37534, -16.26923, -10.00565])
dtype: object
When I run this I do get an error for perfectly collinear data, but I suspect that will disappear with real data.
Also, I know a purely vectorized solution might have been faster. I wonder what the performance will be like on your data, if this is what you are looking for?
Hats off to @Zero, who really had the solution for this problem here.
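Another way to build the per-window arrays, without the shift gymnastics, is NumPy's sliding_window_view (available from NumPy 1.20); a minimal sketch, reusing the same coint-based idea with a window of n rows:
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import coint

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8]})
n = 2

# One (len(df) - n + 1, n) array of windows per column, built without copying.
wins_a = np.lib.stride_tricks.sliding_window_view(df['a'].to_numpy(), n)
wins_b = np.lib.stride_tricks.sliding_window_view(df['b'].to_numpy(), n)

# p-value of the cointegration test for each aligned pair of windows.
pvals = [coint(a, b)[1] for a, b in zip(wins_a, wins_b)]
As noted above, coint will likely complain on this tiny, perfectly collinear toy data; real series of reasonable length should behave better.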
I tried placing the sum before the rolling:
import pandas as pd
import time

df = pd.DataFrame(data={'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8]})
df2 = df.copy()

s = time.time()
df2.loc[:, 'mean1'] = df.sum(axis=1).rolling(2).mean()
print(time.time() - s)

s = time.time()
df2.loc[:, 'mean2'] = df.rolling(2).mean().sum(axis=1)
print(time.time() - s)

df2
0.003737926483154297
0.005460023880004883
   a  b  mean1  mean2
0  1  5    NaN    0.0
1  2  6    7.0    7.0
2  3  7    9.0    9.0
3  4  8   11.0   11.0
It is slightly faster than the previous answer and works essentially the same (apart from the first window, where summing before rolling yields NaN rather than 0.0), and on large datasets the difference might be significant.
You can modify it to select the columns of interest only:
s = time.time()
print(df[['a', 'b']].sum(axis=1).rolling(2).mean())
print(time.time() - s)
0 NaN
1 7.0
2 9.0
3 11.0
dtype: float64
0.0033559799194335938
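A side note on the timings: for operations this small, single time.time() measurements are dominated by noise; the timeit module, which repeats the statement many times, gives more stable numbers. A minimal sketch (the repeat count of 1000 is arbitrary):
import timeit
import pandas as pd

df = pd.DataFrame(data={'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8]})

# Run each expression 1000 times and report the total elapsed time.
t1 = timeit.timeit(lambda: df.sum(axis=1).rolling(2).mean(), number=1000)
t2 = timeit.timeit(lambda: df.rolling(2).mean().sum(axis=1), number=1000)
print(t1, t2)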

Fast conversion to multiindexed pandas dataframe using bincounts

I have data from users who have left star ratings (1, 2 or 3 stars) on items in various categories, where each item may belong to multiple categories. In my current dataframe, each row represents a rating and the categories are one-hot encoded, like so:
import numpy as np
import pandas as pd

df_old = pd.DataFrame({
    'user': [1, 1, 2, 2, 2],
    'rate': [3, 2, 1, 1, 2],
    'cat1': [1, 0, 1, 1, 1],
    'cat2': [0, 1, 0, 0, 1]
})
#    user  rate  cat1  cat2
# 0     1     3     1     0
# 1     1     2     0     1
# 2     2     1     1     0
# 3     2     1     1     0
# 4     2     2     1     1
I want to convert this to a new dataframe, multiindexed by user and rate, which shows the per-category bincounts for each star rating. I'm currently doing this with loops:
multi_idx = pd.MultiIndex.from_product(
    [df_old.user.unique(), range(1, 4)],
    names=['user', 'rate']
)
df_new = pd.DataFrame(  # preallocate in an attempt to speed up the code
    {'cat1': np.nan, 'cat2': np.nan},
    index=multi_idx
)
df_new.sort_index(inplace=True)

idx = pd.IndexSlice
for uid in df_old.user.unique():
    for cat in ['cat1', 'cat2']:
        df_new.loc[idx[uid, :], cat] = np.bincount(
            df_old.loc[(df_old.user == uid) & (df_old[cat] == 1),
                       'rate'].values, minlength=4)[1:]

#            cat1  cat2
# user rate
# 1    1      0.0   0.0
#      2      0.0   1.0
#      3      1.0   0.0
# 2    1      2.0   0.0
#      2      1.0   1.0
#      3      0.0   0.0
Unfortunately the above code is hopelessly slow on my real dataframe, which is long and contains many categories. How can I eliminate the loops please?
With your multi-index, you can aggregate your old data frame, and reindex it:
df_old.groupby(['user', 'rate']).sum().reindex(multi_idx).fillna(0)
Or, as @piRSquared commented, do the reindex and fill the missing values in one step:
df_old.groupby(['user', 'rate']).sum().reindex(multi_idx, fill_value=0)
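For reference, running that one-liner on the example df_old reproduces the table from the question, as integer counts (fill_value=0 avoids introducing NaN, so the columns stay integer); a quick check:
out = df_old.groupby(['user', 'rate']).sum().reindex(multi_idx, fill_value=0)
print(out)
#            cat1  cat2
# user rate
# 1    1        0     0
#      2        0     1
#      3        1     0
# 2    1        2     0
#      2        1     1
#      3        0     0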

Manipulate pandas.DataFrame with multiple criteria

For example I have a dataframe:
import pandas as pd

df = pd.DataFrame({'Value_Bucket': [5, 5, 5, 10, 10, 10],
                   'DayofWeek': [1, 1, 3, 2, 4, 2],
                   'Hour_Bucket': [1, 5, 7, 4, 3, 12],
                   'Values': [1, 1.5, 2, 3, 5, 3]})
The actual data set is rather large (5000+ rows). I'm looking to perform functions on 'Values' where "Value_Bucket" equals 5, for each possible combination of "DayofWeek" and "Hour_Bucket".
Essentially the data will be grouped into a table of 24 rows (Hour_Bucket) and 7 columns (DayofWeek), and each cell is filled with the result of a function (say, the average). I can use groupby for one criterion; can someone explain how I can group by two criteria and tabulate the result in a table?
Query to subset, groupby, then unstack:
df.query('Value_Bucket == 5').groupby(
    ['Hour_Bucket', 'DayofWeek']).Values.mean().unstack()
DayofWeek      1    3
Hour_Bucket
1            1.0  NaN
5            1.5  NaN
7            NaN  2.0
If you want zeros instead of NaN:
df.query('Value_Bucket == 5').groupby(
    ['Hour_Bucket', 'DayofWeek']).Values.mean().unstack(fill_value=0)

DayofWeek      1    3
Hour_Bucket
1            1.0  0.0
5            1.5  0.0
7            0.0  2.0
Pivot tables seem more natural to me than groupby paired with unstack, though they do exactly the same thing.
pd.pivot_table(data=df.query('Value_Bucket == 5'),
               index='Hour_Bucket',
               columns='DayofWeek',
               values='Values',
               aggfunc='mean',
               fill_value=0)

Output:
DayofWeek      1  3
Hour_Bucket
1            1.0  0
5            1.5  0
7            0.0  2
