Cumulative result with specific number in pandas - python

This is my DataFrame:
index, value
10, 109
11, 110
12, 111
13, 110
14, 108
15, 106
16, 100
I want to build another column by cumulatively adding 0.05, starting from the first value.
index, value, result
10, 109, 109
11, 110, 109 + (0.05 * 1) = 109.05
12, 111, 109 + (0.05 * 2) = 109.10
13, 110, 109 + (0.05 * 3) = 109.15
14, 108, 109 + (0.05 * 4) = 109.20
15, 106, 109 + (0.05 * 5) = 109.25
16, 100, 109 + (0.05 * 6) = 109.30
I tried to experiment with shift and cumsum, but nothing works. Can you give me advice on how to do it?
Now I do something like:
counter = 1
result = {}
speed = 0.05
for item in range(index + 1, last_row_index + 1):
    result[item] = result[first_index] + speed * counter
    counter += 1
P.S. While you were answering, I edited the result column. Please don't blame me; I'm still learning, but I'm trying to grow.
Thank you all for your answers!

Use numpy:
import numpy as np

df['result'] = df['value'].iloc[0] * 1.05**np.arange(len(df))
Output:
index value result
0 10 109 109.000000
1 11 110 114.450000
2 12 111 120.172500
3 13 110 126.181125
4 14 108 132.490181
5 15 106 139.114690
6 16 100 146.070425
After you edited the question:
df['result'] = df['value'].iloc[0] + 0.05 * np.arange(len(df))
Output:
index value result
0 10 109 109.00
1 11 110 109.05
2 12 111 109.10
3 13 110 109.15
4 14 108 109.20
5 15 106 109.25
6 16 100 109.30
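Since the question mentions experimenting with cumsum, a sketch of the same arithmetic written with it: cumulate a constant step Series whose first entry is zeroed out.
import pandas as pd

step = pd.Series(0.05, index=df.index)  # constant increment per row
step.iloc[0] = 0                        # the first row keeps the starting value
df['result'] = df['value'].iloc[0] + step.cumsum()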

If the indices are consecutive:
df['result'] = (df['index'] - df['index'][0]) * 0.05 + df['value'][0]
or, if they are not:
df['result'] = df.value.reset_index().index * 0.05 + df.value[0]

import numpy as np

df['result'] = np.arange(len(df)) * 0.05
df['result'] = df['value'].add(df['result'])
print(df)
Output:
value result
0 109 109.00
1 110 110.05
2 111 111.10
3 110 110.15
4 108 108.20
5 106 106.25
6 100 100.30

Related

Weighted Mean as a Column in Pandas

I am trying to add a column with the weighted average of 4 columns, using 4 columns of weights:
import pandas as pd

df = pd.DataFrame.from_dict(dict([('A', [2000, 1000, 2509, 2145]),
                                  ('A_Weight', [37, 47, 33, 16]),
                                  ('B', [2100, 1500, 2000, 1600]),
                                  ('B_weights', [17, 21, 6, 2]),
                                  ('C', [2500, 1400, 0, 2300]),
                                  ('C_weights', [5, 35, 0, 40]),
                                  ('D', [0, 1600, 2100, 2000]),
                                  ('D_weights', [0, 32, 10, 5])]))
I want the weighted average to be in a new column named "WA", but every time I try it displays NaN.
The desired DataFrame would have a new column with the following values:
Formula I used: ((A * A_Weight) + (B * B_weights) + (C * C_weights) + (D * D_weights)) / sum(all weights)
df['WA'] = [2071.19, 1323.70, 2363.20, 2214.60]
Thank you
A straightforward and simple way to do it is as follows:
(Since the weight columns are not consistently named, e.g. some with 's' and some without, some with capital 'W' and some with lowercase 'w', it is not convenient to group the columns, e.g. by .filter().)
df['WA'] = (
    (df['A'] * df['A_Weight']) + (df['B'] * df['B_weights'])
    + (df['C'] * df['C_weights']) + (df['D'] * df['D_weights'])
) / (df['A_Weight'] + df['B_weights'] + df['C_weights'] + df['D_weights'])
Result:
print(df)
A A_Weight B B_weights C C_weights D D_weights WA
0 2000 37 2100 17 2500 5 0 0 2071.186441
1 1000 47 1500 21 1400 35 1600 32 1323.703704
2 2509 33 2000 6 0 0 2100 10 2363.204082
3 2145 16 1600 2 2300 40 2000 5 2214.603175
The not so straight-forward way:
Group columns by prefix via str.split
get the column-wise product via groupby prod
get the row-wise sum of the products with sum on axis 1.
filter + sum on axis 1 to get sum of "weights" columns
Divide the group product sums by the weight sums.
df['WA'] = (
    df.groupby(df.columns.str.split('_').str[0], axis=1).prod().sum(axis=1)
    / df.filter(regex='_[wW]eight(s)?$').sum(axis=1)
)
A A_Weight B B_weights C C_weights D D_weights WA
0 2000 37 2100 17 2500 5 0 0 2071.186441
1 1000 47 1500 21 1400 35 1600 32 1323.703704
2 2509 33 2000 6 0 0 2100 10 2363.204082
3 2145 16 1600 2 2300 40 2000 5 2214.603175
Another option to an old question:
Split data into numerator and denominator:
numerator = df.filter(regex=r"[A-Z]$")
denominator = df.filter(like='_')
Convert the denominator's columns into a MultiIndex; this comes in handy when computing with the numerator:
denominator.columns = denominator.columns.str.split('_', expand=True)
Multiply the numerator by the denominator, and divide the row-wise sum of the outcome by the row-wise sum of the denominator:
outcome = numerator.mul(denominator, level=0, axis=1).sum(1)
outcome = outcome.div(denominator.sum(1))
df.assign(WA = outcome)
A A_Weight B B_weights C C_weights D D_weights WA
0 2000 37 2100 17 2500 5 0 0 2071.186441
1 1000 47 1500 21 1400 35 1600 32 1323.703704
2 2509 33 2000 6 0 0 2100 10 2363.204082
3 2145 16 1600 2 2300 40 2000 5 2214.603175
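A further sketch, assuming the question's df and assuming the weight column names can be normalized first: stripping the inconsistent suffixes makes the value/weight pairing mechanical.
# pick the weight columns and rename them to their bare prefix: 'A_Weight' -> 'A', ...
weights = df.filter(like='_').copy()
weights.columns = weights.columns.str.split('_').str[0]
values = df[weights.columns]  # value columns, in the same order as the weights
df['WA'] = (values * weights).sum(axis=1) / weights.sum(axis=1)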

How to sort a MultiIndex by one level without changing the order of the other levels

I am struggling to sort a pivot table according to one level of a MultiIndex.
My target is to sort the values in that level according to a list of values, which basically works.
But I also want to preserve the original order of the other levels.
import pandas as pd
import numpy as np
import random
group_size = 3
n = 10
df = pd.DataFrame({
    'i_a': list(np.arange(0, group_size))*n,
    'i_b': random.choices(list("ARBMC"), k=n*group_size),
    'value': np.random.randint(0, 100, size=n*group_size),
})
pt = pd.pivot_table(
    df,
    index=['i_a', 'i_b'],
    values=['value'],
    aggfunc='sum'
)
# The pivot table looks like this
value
i_a i_b
0 A 48
B 55
C 161
M 41
R 126
1 A 60
B 236
C 99
M 30
R 202
2 A 22
B 144
C 30
M 146
R 168
# defined order for i_b
ORDER = {
    "A": 0,
    "R": 1,
    "B": 2,
    "M": 3,
    "C": 4,
}
def order_by_list(value, ascending=True):
    try:
        idx = ORDER[value]
    except KeyError:
        # place items which are not available at the last position
        idx = len(ORDER)
    if not ascending:
        # reverse the order
        idx = -idx
    return idx

def sort_by_ib(df):
    return df.sort_index(level=["i_b"],
                         key=lambda index: index.map(order_by_list),
                         sort_remaining=False)
pt_sorted = pt.pipe(sort_by_ib)
# the i_a index of pt_sorted is rearranged, which I don't want
value
i_a i_b
0 A 48
1 A 60
2 A 22
0 R 126
1 R 202
2 R 168
0 B 55
1 B 236
2 B 144
0 M 41
1 M 30
2 M 146
0 C 161
1 C 99
2 C 30
# Instead, the sorted pivot table should look like this
value
i_a i_b
0 A 48
R 126
B 55
M 41
C 161
1 A 60
R 202
B 236
M 30
C 99
2 A 22
R 168
B 144
M 146
C 30
What is the preferred/recommended way to do this?
If you want to change the order, you can create a helper column for the mapping, add it to the index parameter in pivot_table, and finally remove it with droplevel. Because it is added before i_b, the sort is by i_a and then by the new level:
df['new'] = df['i_b'].map(ORDER)
pt = pd.pivot_table(
    df,
    index=['i_a', 'new', 'i_b'],
    values=['value'],
    aggfunc='sum'
).droplevel(1)
print(pt)
value
i_a i_b
0 A 217
R 135
M 150
C 43
1 A 44
R 266
B 44
M 13
C 128
2 A 167
R 3
B 85
M 159
C 81
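An alternative sketch: make i_b an ordered categorical before pivoting, so the default index sort already respects the custom order within each i_a group (observed=True keeps only the category combinations that actually occur).
df['i_b'] = pd.Categorical(df['i_b'], categories=list(ORDER), ordered=True)
pt = pd.pivot_table(
    df,
    index=['i_a', 'i_b'],
    values=['value'],
    aggfunc='sum',
    observed=True,
)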

My data cleaning script is slow, any ideas on how to improve it?

I have data (CSV format) where the first column is an epoch timestamp (strictly increasing) and the other columns are cumulative values (increasing or staying equal).
Sample is as below:
df = pandas.DataFrame([[1515288240, 100, 50, 90, 70],
                       [1515288241, 101, 60, 95, 75],
                       [1515288242, 110, 70, 100, 80],
                       [1515288239, 110, 70, 110, 85],
                       [1515288241, 110, 75, 110, 85],
                       [1515288243, 110, 70, 110, 85]],
                      columns=['UNIX_TS', 'A', 'B', 'C', 'D'])
df =
id UNIX_TS A B C D
0 1515288240 100 50 90 70
1 1515288241 101 60 95 75
2 1515288242 110 70 100 80
3 1515288239 110 70 110 85
4 1515288241 110 75 110 85
5 1515288243 110 70 110 85
import pandas as pd
def clean(df, column_name, equl):
    i = 0
    while df.shape[0] - 2 >= i:
        if df[column_name].iloc[i] > df[column_name].iloc[i+1]:
            df.drop(df[column_name].iloc[[i+1]].index, inplace=True)
            continue
        elif df[column_name].iloc[i] == df[column_name].iloc[i+1] and equl == 1:
            df.drop(df[column_name].iloc[[i+1]].index, inplace=True)
            continue
        i += 1

clean(df, 'UNIX_TS', 1)
for col in df.columns[1:]:
    clean(df, col, 0)
df =
id UNIX_TS A B C D
0 1515288240 100 50 90 70
1 1515288241 101 60 95 75
2 1515288242 110 70 100 80
My script works as intended, but it's too slow; does anybody have ideas on how to improve its speed?
I wrote a script to remove all the invalid data based on 2 rules:
UNIX_TS must be strictly increasing (because it is a time, it cannot flow back or pause),
the other columns are increasing and can be constant, e.g. if one row is 100, the next row can be >= 100 but not less.
Based on these rules, indexes 3 & 4 are invalid because their UNIX_TS values (1515288239 and 1515288241) are less than the value at index 2.
Index 5 is wrong because the value of B decreased.
IIUC, you can use:
cols = ['A', 'B', 'C', 'D']
mask_1 = df['UNIX_TS'] > df['UNIX_TS'].cummax().shift().fillna(0)
mask_2 = (df[cols] >= df[cols].cummax().shift().fillna(0)).all(1)
df[mask_1 & mask_2]
Outputs
UNIX_TS A B C D
0 1515288240 100 50 90 70
1 1515288241 101 60 95 75
2 1515288242 110 70 100 80
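Why this works: cummax().shift() carries the largest value seen in any earlier row, and a row that fails the test can never raise that running maximum, so comparing against it is equivalent to comparing against the last surviving row. A quick look at the intermediate on the sample data:
# running maximum of all earlier timestamps (NaN for the first row)
prev_max = df['UNIX_TS'].cummax().shift()
print(prev_max.tolist())
# [nan, 1515288240.0, 1515288241.0, 1515288242.0, 1515288242.0, 1515288242.0]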

Subtracting many columns in a df by one column in another df

I'm trying to subtract a df "stock_returns" (144 rows x 517 cols) by a df "p_df" (144 rows x 1 col).
I have tried:
stock_returns - p_df
stock_returns.rsub(p_df, axis=1)
stock_returns.subtract(p_df)
But none of them work and all return NaN values.
I'm passing it through this function, and using the for loop to get the args:
def disp_calc(returns, p, wi):  # apply(disp_calc, rows=...)
    wi = wi/np.sum(wi)
    rp = (col_len(returns)*(returns-p)**2).sum()  # returns - p causing problems
    return np.sqrt(rp)

for i in sectors:
    stock_returns = returns_rolling[sectordict[i]]  # .apply(np.mean, axis=1)
    portfolio_return = returns_rolling[i]; p_df = portfolio_return.to_frame()
    disp_df[i] = stock_returns.apply(disp_calc, args=(portfolio_return, wi))
My expected output is to subtract all 517 columns in the first df by the one column in p_df, so the final result would still have 517 columns. Thanks
You're almost there, just need to set axis=0 to subtract along the indexes:
>>> stock_returns = pd.DataFrame([[10, 100, 200],
...                               [15, 115, 215],
...                               [20, 120, 220],
...                               [25, 125, 225],
...                               [30, 130, 230]], columns=['A', 'B', 'C'])
>>> stock_returns
A B C
0 10 100 200
1 15 115 215
2 20 120 220
3 25 125 225
4 30 130 230
>>> p_df = pd.DataFrame([1,2,3,4,5], columns=['P'])
>>> p_df
P
0 1
1 2
2 3
3 4
4 5
>>> stock_returns.sub(p_df['P'], axis=0)
A B C
0 9 99 199
1 13 113 213
2 17 117 217
3 21 121 221
4 25 125 225
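Applied to the question's loop, the same idea would look like this (a sketch; portfolio_return is already a Series, so there is no need for the to_frame() conversion):
# subtract the portfolio column from every stock column at once
excess = stock_returns.sub(portfolio_return, axis=0)  # still 144 rows x 517 cols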
data['new_col3'] = data['col1'] - data['col2']

new python pandas dataframe column based on value of variable, using function

I have a variable, 'ImageName', which ranges from 0-1600. I want to create a new variable, 'LocationCode', based on the value of 'ImageName'.
If 'ImageName' is less than or equal to 70, I want 'LocationCode' to be 1. If 'ImageName' is between 71 and 90, I want 'LocationCode' to be 2. I have 13 different codes in all. I'm not sure how to write this in Python pandas. Here's what I tried:
def spatLoc(ImageName):
    if ImageName <= 70:
        LocationCode = 1
    elif ImageName > 70 and ImageName <= 90:
        LocationCode = 2
    return LocationCode
df['test'] = df.apply(spatLoc(df['ImageName'])
but it returned an error. I'm clearly not defining things the right way but I can't figure out how to.
You can just use 2 boolean masks:
df.loc[df['ImageName'] <= 70, 'Test'] = 1
df.loc[(df['ImageName'] > 70) & (df['ImageName'] <= 90), 'Test'] = 2
By using the masks you only set the value where the boolean condition is met. For the second mask you need to use the & operator to combine the conditions, and to enclose each condition in parentheses due to operator precedence.
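With 13 codes in all, chaining that many .loc masks gets verbose; a sketch using numpy.select (only the first two ranges are given in the question, so the remaining conditions are placeholders):
import numpy as np

conditions = [
    df['ImageName'] <= 70,
    (df['ImageName'] > 70) & (df['ImageName'] <= 90),
    # ... one condition per remaining range, 13 in total
]
codes = [1, 2]  # ... the matching LocationCode for each condition
df['LocationCode'] = np.select(conditions, codes, default=-1)  # -1 flags unmatched rows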
Actually I think it would be better to define your bin values and call cut, example:
In [20]:
df = pd.DataFrame({'ImageName': np.random.randint(0, 100, 20)})
df
Out[20]:
ImageName
0 48
1 78
2 5
3 4
4 9
5 81
6 49
7 11
8 57
9 17
10 92
11 30
12 74
13 62
14 83
15 21
16 97
17 11
18 34
19 78
In [22]:
df['group'] = pd.cut(df['ImageName'], range(0, 105, 10), right=False)
df
Out[22]:
ImageName group
0 48 [40, 50)
1 78 [70, 80)
2 5 [0, 10)
3 4 [0, 10)
4 9 [0, 10)
5 81 [80, 90)
6 49 [40, 50)
7 11 [10, 20)
8 57 [50, 60)
9 17 [10, 20)
10 92 [90, 100)
11 30 [30, 40)
12 74 [70, 80)
13 62 [60, 70)
14 83 [80, 90)
15 21 [20, 30)
16 97 [90, 100)
17 11 [10, 20)
18 34 [30, 40)
19 78 [70, 80)
Here the bin values were generated using range, but you could pass your own list of bin values; once you have the bin values you can define a lookup dict:
In [32]:
d = dict(zip(df['group'].unique(), range(len(df['group'].unique()))))
d
Out[32]:
{'[0, 10)': 2,
'[10, 20)': 4,
'[20, 30)': 9,
'[30, 40)': 7,
'[40, 50)': 0,
'[50, 60)': 5,
'[60, 70)': 8,
'[70, 80)': 1,
'[80, 90)': 3,
'[90, 100)': 6}
You can now call map and add your new column:
In [33]:
df['test'] = df['group'].map(d)
df
Out[33]:
ImageName group test
0 48 [40, 50) 0
1 78 [70, 80) 1
2 5 [0, 10) 2
3 4 [0, 10) 2
4 9 [0, 10) 2
5 81 [80, 90) 3
6 49 [40, 50) 0
7 11 [10, 20) 4
8 57 [50, 60) 5
9 17 [10, 20) 4
10 92 [90, 100) 6
11 30 [30, 40) 7
12 74 [70, 80) 1
13 62 [60, 70) 8
14 83 [80, 90) 3
15 21 [20, 30) 9
16 97 [90, 100) 6
17 11 [10, 20) 4
18 34 [30, 40) 7
19 78 [70, 80) 1
The above can be modified to suit your needs but it's just to demonstrate an approach which should be fast and without the need to iterate over your df.
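Note that pd.cut can also assign the codes directly through its labels argument, skipping the lookup dict (a sketch with illustrative bin edges; labels needs exactly one entry fewer than bins):
bins = [0, 70, 90, 1600]  # illustrative edges; the real scheme has 13 ranges
df['LocationCode'] = pd.cut(df['ImageName'], bins=bins, labels=[1, 2, 3])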
In Python, you use dictionary-style lookup to find a field within a row. The field name is ImageName. In the spatLoc() function below, the parameter row holds the entire row, and you access an individual column by using the field name as the key.
def spatLoc(row):
    if row['ImageName'] <= 70:
        LocationCode = 1
    elif row['ImageName'] > 70 and row['ImageName'] <= 90:
        LocationCode = 2
    return LocationCode

df['test'] = df.apply(spatLoc, axis=1)
