How to round percentages with Pandas? - python

I've got to print percentages, and the trick is that I have to round the values to 4 decimals.
They live in a DataFrame where each column represents the percentages for one allocation.
Sometimes the sum of the percentages is not 1 but 0.9999 or 1.0001 (which makes sense). But how do you make sure it sums to exactly 1?
You have to arbitrarily pick a row and put the delta in it.
I've come up with this solution, but I have to iterate through each column and do the modification on the Series.
Code
import numpy as np
import pandas as pd

df = abs(pd.DataFrame(np.random.randn(4, 4), columns=range(4)))
# Make sure the sum of each allocation is 1.
df = df / df.sum()
# Round the allocations.
df = df.round(4)
print("-- before --")
print(df)
print(df.sum())
# After rounding, the sum may no longer be exactly 1 (imagine rounding 1/3 three times...),
# so check the sum of each column and put the delta in the fund with the lowest value.
for p in df:
    if df[p].sum() != 1:
        # Get the id of the fund with the lowest percentage (but not 0).
        low_id = df[p][df[p] != 0].idxmin()
        # .loc avoids the chained-assignment pitfall when writing the delta back.
        df.loc[low_id, p] += 1 - df[p].sum()
print("-- after --")
print(df)
print(df.sum())
Output
-- before --
        0       1       2       3
0  0.0116  0.1256  0.4980  0.3738
1  0.2562  0.5458  0.3086  0.1221
2  0.4853  0.0009  0.0588  0.0078
3  0.2470  0.3277  0.1346  0.4962
0    1.0001
1    1.0000
2    1.0000
3    0.9999
dtype: float64
-- after --
        0       1       2       3
0  0.0115  0.1256  0.4980  0.3738
1  0.2562  0.5458  0.3086  0.1221
2  0.4853  0.0009  0.0588  0.0079
3  0.2470  0.3277  0.1346  0.4962
0    1.0
1    1.0
2    1.0
3    1.0
dtype: float64
Is there any faster solution?
Thanks a lot,
Regards,
Julien

It is always better to avoid loops.
import numpy as np
import pandas as pd

df = abs(pd.DataFrame(np.random.randn(4, 4)))
df = df / df.sum()
df = df.round(4)
columns = ['Sum', 'Min', 'submin']
dftemp = pd.DataFrame(columns=columns)
dftemp['Sum'] = df.sum(axis=0)                          # sum of each column
dftemp['Min'] = df[df != 0].min(axis=0)                 # non-zero minimum of each column
dftemp['submin'] = dftemp['Min'] + (1 - dftemp['Sum'])  # minimum value + (1 - column sum)
dftemp['FinalValue'] = np.where(dftemp['Sum'] != 1, dftemp.submin, dftemp.Min)  # decide whether to keep the existing minimum value or add the delta
print('\n\nBefore \n\n ', df, '\n\n ', df.sum())
df = df.mask(df.eq(df.min(0), axis=1), df.eq(df.min(0), axis=1).mul(dftemp['FinalValue'].tolist()))  # replace each column's minimum value with the adjusted value
print('After \n\n ', df, '\n\n ', df.sum())
Output
Before
        0       1       2       3
0  0.1686  0.0029  0.1055  0.1739
1  0.5721  0.5576  0.2904  0.2205
2  0.0715  0.2749  0.4404  0.5014
3  0.1878  0.1647  0.1637  0.1042
0    1.0000
1    1.0001
2    1.0000
3    1.0000
dtype: float64
After
        0       1       2       3
0  0.1686  0.0028  0.1055  0.1739
1  0.5721  0.5576  0.2904  0.2205
2  0.0715  0.2749  0.4404  0.5014
3  0.1878  0.1647  0.1637  0.1042
0    1.0
1    1.0
2    1.0
3    1.0
dtype: float64
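For reference, the same rule can also be written without any per-column loop; a sketch in plain NumPy (variable names here are illustrative, not from either snippet above):

import numpy as np
import pandas as pd

df = abs(pd.DataFrame(np.random.randn(4, 4)))
df = (df / df.sum()).round(4)
arr = df.to_numpy().copy()
masked = np.where(arr == 0, np.inf, arr)       # ignore exact zeros when locating minimums
rows = masked.argmin(axis=0)                   # positional row of each column's smallest non-zero value
arr[rows, np.arange(arr.shape[1])] += 1 - arr.sum(axis=0)   # add each column's delta in one shot
df = pd.DataFrame(arr, index=df.index, columns=df.columns)
print(df.sum())                                # every column sums to 1 again (up to float precision)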

Related

window based weighted average in pandas

I am trying to do a window-based weighted average of two columns.
For example, if I have my value column "a" and my weighting column "b":
a b
1: 1 2
2: 2 3
3: 3 4
With a trailing window of 2 (although I'd like to work with a variable window length), my third column should be the weighted average "c", where rows that do not have enough previous data for a full window are NaN:
c
1: nan
2: (1 * 2 + 2 * 3) / (2 + 3) = 1.6
3: (2 * 3 + 3 * 4) / (3 + 4) = 2.57
For your particular case of a window of 2, you may use prod and shift:
s = df.prod(1)
(s + s.shift()) / (df.b + df.b.shift())
Out[189]:
1 NaN
2 1.600000
3 2.571429
dtype: float64
On sample df2:
a b
0 73.78 51.46
1 73.79 27.84
2 73.79 34.35
s = df2.prod(1)
(s + s.shift()) / (df2.b + df2.b.shift())
Out[193]:
0 NaN
1 73.783511
2 73.790000
dtype: float64
This method extends to a variable window length; you just need an additional list comprehension and sum.
Try it on the sample df2 above:
s = df2.prod(1)
m = 2 #window length 2
sum([s.shift(x) for x in range(m)]) / sum([df2.b.shift(x) for x in range(m)])
Out[214]:
0 NaN
1 73.783511
2 73.790000
dtype: float64
With window length 3:
m = 3 #window length 3
sum([s.shift(x) for x in range(m)]) / sum([df2.b.shift(x) for x in range(m)])
Out[215]:
0 NaN
1 NaN
2 73.785472
dtype: float64
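The same numbers also drop out of rolling sums, which handle any window length directly; a sketch on the first sample frame (assuming df has the value column a and weight column b, as in the question):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 3, 4]}, index=[1, 2, 3])
m = 2                                                        # window length
c = df.a.mul(df.b).rolling(m).sum() / df.b.rolling(m).sum()  # weighted average over each window
print(c)   # 1: NaN, 2: 1.6, 3: 2.571429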

How to create a column using a function based on previous values of the column in Python

My Problem
I have a loop that creates a value for x in time period t based on x in time period t-1. The loop is really slow, so I wanted to try to turn it into a vectorized function. I tried to use np.where with shift() but I had no joy. Any idea how I might be able to get around this problem?
Thanks!
My Code
import numpy as np
import pandas as pd

df = pd.read_csv('y_list.csv', delimiter=',')  # read_csv already returns a DataFrame
df.loc[df.index[0], 'var'] = 0
for x in range(1, len(df.index)):
    if df['LAST'].iloc[x] > 0:
        df.loc[df.index[x], 'var'] = (df['var'].iloc[x - 1] * 2 + df['LAST'].iloc[x]) / 3
    else:
        df.loc[df.index[x], 'var'] = df['var'].iloc[x - 1] * 2 / 3
df
Input Data
Dates,LAST
03/09/2018,-7
04/09/2018,5
05/09/2018,-4
06/09/2018,5
07/09/2018,-6
10/09/2018,6
11/09/2018,-7
12/09/2018,7
13/09/2018,-9
Output
Dates,LAST,var
03/09/2018,-7,0.000000
04/09/2018,5,1.666667
05/09/2018,-4,1.111111
06/09/2018,5,2.407407
07/09/2018,-6,1.604938
10/09/2018,6,3.069959
11/09/2018,-7,2.046639
12/09/2018,7,3.697759
13/09/2018,-9,2.465173
You are looking at ewm:
arg = df.LAST.clip(lower=0)
arg.iloc[0] = 0
arg.ewm(alpha=1/3, adjust=False).mean()
Output:
0 0.000000
1 1.666667
2 1.111111
3 2.407407
4 1.604938
5 3.069959
6 2.046639
7 3.697759
8 2.465173
Name: LAST, dtype: float64
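To see why ewm matches the loop: with adjust=False, ewm computes y[t] = (1 - alpha) * y[t-1] + alpha * x[t], and with alpha = 1/3 that is exactly (2 * y[t-1] + x[t]) / 3, the recursion in the question. A quick self-contained check (the series is hard-coded from the question's input data):

import pandas as pd

last = pd.Series([-7, 5, -4, 5, -6, 6, -7, 7, -9])
arg = last.clip(lower=0)                  # negative LAST values contribute 0
arg.iloc[0] = 0                           # the loop seeds var with 0
print(arg.ewm(alpha=1/3, adjust=False).mean())   # reproduces the var column above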
You can use df.shift to shift the dataframe by a default of 1 row, and convert the if-else block into a vectorized np.where:
In [36]: df
Out[36]:
Dates LAST var
0 03/09/2018 -7 0.0
1 04/09/2018 5 1.7
2 05/09/2018 -4 1.1
3 06/09/2018 5 2.4
4 07/09/2018 -6 1.6
5 10/09/2018 6 3.1
6 11/09/2018 -7 2.0
7 12/09/2018 7 3.7
8 13/09/2018 -9 2.5
In [37]: (df.shift(1)['var']*2 + np.where(df['LAST']>0, df['LAST'], 0)) / 3
Out[37]:
0 NaN
1 1.666667
2 1.133333
3 2.400000
4 1.600000
5 3.066667
6 2.066667
7 3.666667
8 2.466667
Name: var, dtype: float64

List of NaN values while calculating p-value and z-score in SciPy

I am calculating the z-score and p-value for different sub-segments within a data frame.
The data frame has two columns; here are the top 5 values in my data frame:
df[["Engagement_score", "Performance"]].head()
Engagement_score Performance
0 6 0.0
1 5 0.0
2 7 66.3
3 3 0.0
4 11 0.0
(The question included histograms showing the distributions of Engagement_score and Performance.)
I am grouping my data frame by engagement score and then calculating three statistics for each group:
1) the average performance score (sub_average) and the number of values within that group (sub_bookings)
2) the average performance score of the rest of the groups (rest_average) and the number of values in the rest of the groups (rest_bookings)
3) the overall performance score (overall_average) and overall bookings (overall_bookings), calculated on the whole data frame
Here's my code to do that.
import numpy as np
import pandas as pd
from scipy import stats

def stats_comparison(i):
    df.groupby(i)['Performance'].agg({
        'average': 'mean',
        'bookings': 'count'
    }).reset_index()
    cat = df.groupby(i)['Performance']\
        .agg({
            'sub_average': 'mean',
            'sub_bookings': 'count'
        }).reset_index()
    cat['overall_average'] = df['Performance'].mean()
    cat['overall_bookings'] = df['Performance'].count()
    cat['rest_bookings'] = cat['overall_bookings'] - cat['sub_bookings']
    cat['rest_average'] = (cat['overall_bookings'] * cat['overall_average']
                           - cat['sub_bookings'] * cat['sub_average']) / cat['rest_bookings']
    cat['z_score'] = (cat['sub_average'] - cat['rest_average']) / \
        np.sqrt(cat['overall_average'] * (1 - cat['overall_average'])
                * (1 / cat['sub_bookings'] + 1 / cat['rest_bookings']))
    cat['prob'] = np.around(stats.norm.cdf(cat.z_score), decimals=10)  # this is the p-value
    cat['significant'] = [1 if x > 0.9 else -1 if x < 0.1 else 0 for x in cat['prob']]
    # If the p-value is less than 0.1 then I can confidently say that the 2 samples are different.
    print(cat)

stats_comparison('Engagement_score')
I get the following output when I execute my code:
Engagement_score sub_average sub_bookings overall_average \
0 3 57.281118 1234 34.405373
1 4 56.165374 722 34.405373
2 5 52.896404 890 34.405373
3 6 50.275880 966 34.405373
4 7 43.475344 1018 34.405373
5 8 37.693290 1222 34.405373
6 9 30.418053 1695 34.405373
7 10 16.458142 2874 34.405373
8 11 25.604145 1375 34.405373
9 12 10.910013 789 34.405373
overall_bookings rest_bookings rest_average z_score prob significant
0 12785 11551 31.961544 NaN NaN 0
1 12785 12063 33.102984 NaN NaN 0
2 12785 11895 33.021850 NaN NaN 0
3 12785 11819 33.108233 NaN NaN 0
4 12785 11767 33.620702 NaN NaN 0
5 12785 11563 34.057900 NaN NaN 0
6 12785 11090 35.014797 NaN NaN 0
7 12785 9911 39.609727 NaN NaN 0
8 12785 11410 35.465995 NaN NaN 0
9 12785 11996 35.950709 NaN NaN 0
I don't know why I am getting a list of NaN values in the z_score and prob columns. There are no negative values in my data set.
I also get the following warning when I run the code in Jupyter Notebook:
C:\Users\User\Anaconda3\lib\site-packages\ipykernel_launcher.py:4: FutureWarning: using a dict on a Series for aggregation
is deprecated and will be removed in a future version
after removing the cwd from sys.path.
C:\Users\User\Anaconda3\lib\site-packages\ipykernel_launcher.py:8: FutureWarning: using a dict on a Series for aggregation
is deprecated and will be removed in a future version
C:\Users\User\Anaconda3\lib\site-packages\ipykernel_launcher.py:15: RuntimeWarning: invalid value encountered in sqrt
from ipykernel import kernelapp as app
C:\Users\User\Anaconda3\lib\site-packages\scipy\stats\_distn_infrastructure.py:879: RuntimeWarning: invalid value encountered in greater
return (self.a < x) & (x < self.b)
C:\Users\User\Anaconda3\lib\site-packages\scipy\stats\_distn_infrastructure.py:879: RuntimeWarning: invalid value encountered in less
return (self.a < x) & (x < self.b)
C:\Users\User\Anaconda3\lib\site-packages\scipy\stats\_distn_infrastructure.py:1738: RuntimeWarning: invalid value encountered in greater_equal
cond2 = (x >= self.b) & cond0
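The "invalid value encountered in sqrt" warning points at a likely cause: overall_average * (1 - overall_average) is the variance formula for a proportion, which is only non-negative when the mean lies between 0 and 1. Here overall_average is about 34.4, so the argument of np.sqrt is negative, np.sqrt returns NaN, and the NaN propagates into z_score and prob. A minimal sketch of that failure mode:

import numpy as np

p = 34.405373                  # overall_average from the output above
print(np.sqrt(p * (1 - p)))    # negative argument -> nan, with the same RuntimeWarning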

Find Euclidean distance from a point to rows in pandas dataframe

I have a dataframe:
id lat long
1 12.654 15.50
2 14.364 25.51
3 17.636 32.53
5 12.334 25.84
9 32.224 15.74
I want to find the Euclidean distance of these coordinates from a particular location saved in a list L1:
L1 = [11.344,7.234]
I want to create a new column in df where I have the distances:
id lat long distance
1 12.654 15.50
2 14.364 25.51
3 17.636 32.53
5 12.334 25.84
9 32.224 15.74
I know how to find the Euclidean distance between two points using math.hypot():
dist = math.hypot(x2 - x1, y2 - y1)
How do I write a function using apply, or iterate over rows, to give me the distances?
Use a vectorized approach:
In [5463]: (df[['lat', 'long']] - np.array(L1)).pow(2).sum(1).pow(0.5)
Out[5463]:
0 8.369161
1 18.523838
2 26.066777
3 18.632320
4 22.546096
dtype: float64
Which can also be
In [5468]: df['distance'] = df[['lat', 'long']].sub(np.array(L1)).pow(2).sum(1).pow(0.5)
In [5469]: df
Out[5469]:
id lat long distance
0 1 12.654 15.50 8.369161
1 2 14.364 25.51 18.523838
2 3 17.636 32.53 26.066777
3 5 12.334 25.84 18.632320
4 9 32.224 15.74 22.546096
Option 2: Use NumPy's built-in np.linalg.norm vector norm.
In [5473]: np.linalg.norm(df[['lat', 'long']].sub(np.array(L1)), axis=1)
Out[5473]: array([ 8.36916101, 18.52383805, 26.06677732, 18.63231966, 22.5460958 ])
In [5485]: df['distance'] = np.linalg.norm(df[['lat', 'long']].sub(np.array(L1)), axis=1)
Translating ((x2 - x1)^2 + (y2 - y1)^2)^(1/2) into pandas vectorised operations, you have:
df['distance'] = (df.lat.sub(11.344).pow(2).add(df.long.sub(7.234).pow(2))).pow(.5)
df
lat long distance
id
1 12.654 15.50 8.369161
2 14.364 25.51 18.523838
3 17.636 32.53 26.066777
5 12.334 25.84 18.632320
9 32.224 15.74 22.546096
Alternatively, using arithmetic operators:
(((df.lat - 11.344) ** 2) + (df.long - 7.234) ** 2) ** .5
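And if you specifically want the apply/math.hypot version asked about in the question, a sketch (row-wise, so slower than the vectorised options above):

import math
import pandas as pd

df = pd.DataFrame({'lat': [12.654, 14.364, 17.636, 12.334, 32.224],
                   'long': [15.50, 25.51, 32.53, 25.84, 15.74]},
                  index=[1, 2, 3, 5, 9])
L1 = [11.344, 7.234]
# Apply math.hypot to each row; axis=1 passes rows rather than columns.
df['distance'] = df.apply(lambda row: math.hypot(row['lat'] - L1[0], row['long'] - L1[1]), axis=1)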

Pandas sequentially apply function using output of previous value

I want to compute the "carryover" of a series. This computes a value for each row and then adds it to the previously computed value (for the previous row).
How do I do this in pandas?
import numpy as np
import pandas as pd

decay = 0.5
test = pd.DataFrame(np.random.randint(1, 10, 12), columns=['val'])
test
val
0 4
1 5
2 7
3 9
4 1
5 1
6 8
7 7
8 3
9 9
10 7
11 2
decayed = []
for i, v in test.iterrows():
    if i == 0:
        decayed.append(v.val)
        continue
    d = decayed[i - 1] + v.val * decay
    decayed.append(d)
test['loop_decay'] = decayed
test.head()
val loop_decay
0 4 4.0
1 5 6.5
2 7 10.0
3 9 14.5
4 1 15.0
Consider a vectorized version with cumsum() where you cumulatively sum (val * decay) with the very first val.
However, you then need to subtract the very first (val * decay) since cumsum() includes it:
test['loop_decay'] = test.loc[0, 'val'] + (test['val'] * decay).cumsum() - test.loc[0, 'val'] * decay
You can utilize pd.Series.shift() to create a dataframe with val[i] and val[i-1] and then apply your function across a single axis (1 in this case):
# Create a series that shifts the rows by 1
test['val2'] = test.val.shift()
# Set the first row of the shifted series to 0
test.loc[0, 'val2'] = 0
# Apply the decay formula:
test['loop_decay'] = test.apply(lambda x: x['val'] + x['val2'] * 0.5, axis=1)
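For reference, the cumsum identity from the first answer can be folded into a single closed form, decayed[i] = val[0] * (1 - decay) + decay * cumsum(val)[i]; a sketch using the sample values printed above:

import pandas as pd

decay = 0.5
test = pd.DataFrame({'val': [4, 5, 7, 9, 1, 1, 8, 7, 3, 9, 7, 2]})
# val[0]*(1 - decay) + decay*cumsum reproduces the loop's running carryover.
test['loop_decay'] = test['val'].iloc[0] * (1 - decay) + decay * test['val'].cumsum()
print(test.head())   # 4.0, 6.5, 10.0, 14.5, 15.0 as in the question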
