I originally posted a simpler version because I thought it would be easier to understand, but judging by your comments I was wrong, so I have edited the question.
Here is the code. I want to do this without a loop; can it be done in pandas?
import pandas as pd
myval = [0.0,1.1, 2.2, 3.3, 4.4, 5.5,6.6, 7.7, 8.8,9.9]
s1 = [0,0,1,1,0,0,1,1,0,1]
s2 = [0,0,1,0,1,0,1,0,1,1]
posin = [10,0,0,0,0,0,0,0,0,0]
posout = [0,0,0,0,0,0,0,0,0,0]
sig = ['-']
d = {'myval' : myval, 's1' : s1, 's2' : s2}
d = pd.DataFrame(d)
'''
Normally the DataFrame should hold all 6 columns,
but I can't get the part below to work inside the df (THAT is the problem!).
The real df has 5000+ rows, and this has to be done for 100+ sets of values,
so this approach is not viable: it is too slow.
'''
for i in range(1, len(myval)):
    if s1[i] == 1 and s2[i] == 1 and posin[i-1] != 0:
        posin[i] = 0
        posout[i] = posin[i-1] / myval[i]
        sig.append('a')
    elif s1[i] == 0 and s2[i] == 1 and posin[i-1] == 0:
        posin[i] = posout[i-1] * myval[i]
        posout[i] = 0
        sig.append('v')
    else:
        posin[i] = posin[i-1]
        posout[i] = posout[i-1]
        sig.append('-')
d2 = pd.DataFrame({'posin' : posin , 'posout' : posout , 'sig' : sig })
d = d.join(d2)
# the result wanted:
print(d)
myval s1 s2 posin posout sig
0 0.0 0 0 10.000000 0.000000 -
1 1.1 0 0 10.000000 0.000000 -
2 2.2 1 1 0.000000 4.545455 a
3 3.3 1 0 0.000000 4.545455 -
4 4.4 0 1 20.000000 0.000000 v
5 5.5 0 0 20.000000 0.000000 -
6 6.6 1 1 0.000000 3.030303 a
7 7.7 1 0 0.000000 3.030303 -
8 8.8 0 1 26.666667 0.000000 v
9 9.9 1 1 0.000000 2.693603 a
Any help?
Thanks!
I was hoping that something like the following might work (as suggested in the comments); however (surprisingly?), this use of np.where raises "ValueError: shape mismatch: objects cannot be broadcast to a single shape", because a 1D condition is used to select from 2D operands:
np.where(df.s1 & df.s2,
pd.DataFrame({"bin": 0, "bout": df.bin.diff() / df.myval}),
np.where(df.s1,
pd.DataFrame({"bin": df.bout.diff() * df.myval, "bout": 0}),
pd.DataFrame({"bin": df.bin.diff(), "bout": df.bout.diff()})))
As an alternative to using where, I would construct this in stages:
res = pd.DataFrame({"bin": 0, "bout": df.bin.diff() / df.myval})
res.update(pd.DataFrame({"bin": df.bout.diff() * df.myval,
"bout": 0}).loc[(df.s1 == 1) & (df.s2 == 0)])
res.update(pd.DataFrame({"bin": df.bin.diff(),
"bout": df.bout.diff()}).loc[(df.s1 == 0) & (df.s2 == 0)])
Then you can assign this to the two columns in df:
df[["bin", "bout"]] = res
Code based on Andy Hayden's answer:
import pandas as pd
myval = [0.0,1.1, 2.2, 3.3, 4.4, 5.5,6.6, 7.7, 8.8,9.9]
s1 = [0,0,1,1,0,0,1,1,0,1]
s2 = [0,0,1,0,1,0,1,0,1,1]
posin = [10,0,0,0,0,0,0,0,0,0]
posout = [0,0,0,0,0,0,0,0,0,0]
sig = ['-']
d = {'myval' : myval, 's1' : s1, 's2' : s2,'posin' : posin , 'posout' : posout }
d = pd.DataFrame(d)
res = pd.DataFrame({"posin": 10, 'sig' : '-', "posout": d.posin.diff() / d.myval})
res.update(pd.DataFrame({"posin": 0,
'sig' : 'a',
"posout":d.posin.diff() / d.myval }
).loc[(d.s1 == 1) & (d.s2 == 1) & (d.posin.diff() != 0) ])
res.update(pd.DataFrame({"posin": d.posout.diff() * d.myval,
'sig' : 'v',
"posout": 0}
).loc[(d.s1 == 0) & (d.s2 == 1) & (d.posin.diff()) == 0])
d[["posin", "posout", 'sig']] = res
print(d)
myval posin posout s1 s2 sig
0 0.0 10 0 0 0 v
1 1.1 0 0 0 0 v
2 2.2 0 0 1 1 v
3 3.3 0 0 1 0 v
4 4.4 0 0 0 1 v
5 5.5 0 0 0 0 v
6 6.6 0 0 1 1 v
7 7.7 0 0 1 0 v
8 8.8 0 0 0 1 v
9 9.9 0 0 1 1 v
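Each row here depends on the previous row's already-updated posin and posout, so diff() on the initial columns cannot reproduce the loop. One possible workaround, sketched below, keeps the recurrence but runs it over plain NumPy arrays and writes the columns back once per set of values. The helper name run_positions and its start_posin parameter are made up for illustration; this is a sketch under the assumption that the rules are exactly those of the loop in the question, not a tuned implementation.

```python
import numpy as np
import pandas as pd

def run_positions(myval, s1, s2, start_posin):
    """Hypothetical helper: replay the question's loop over NumPy arrays."""
    n = len(myval)
    posin = np.zeros(n)
    posout = np.zeros(n)
    sig = ['-'] * n
    posin[0] = start_posin
    for i in range(1, n):
        if s1[i] == 1 and s2[i] == 1 and posin[i - 1] != 0:
            # leave the "in" position: posin[i] stays 0
            posout[i] = posin[i - 1] / myval[i]
            sig[i] = 'a'
        elif s1[i] == 0 and s2[i] == 1 and posin[i - 1] == 0:
            # re-enter: posout[i] stays 0
            posin[i] = posout[i - 1] * myval[i]
            sig[i] = 'v'
        else:
            # nothing happens: carry both values forward
            posin[i] = posin[i - 1]
            posout[i] = posout[i - 1]
    return posin, posout, sig

d = pd.DataFrame({
    'myval': [0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9],
    's1': [0, 0, 1, 1, 0, 0, 1, 1, 0, 1],
    's2': [0, 0, 1, 0, 1, 0, 1, 0, 1, 1],
})
d['posin'], d['posout'], d['sig'] = run_positions(
    d['myval'].to_numpy(), d['s1'].to_numpy(), d['s2'].to_numpy(), 10)
print(d)
```

The loop is still there, but it touches only arrays and a list, and the DataFrame is written to once at the end.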
Say I have the following dataframe:
import pandas as pd
# named `data` rather than `dict` to avoid shadowing the built-in
data = {'val': [3.2, 2.4, -2.3, -4.9, 3.2, 2.4, -2.3, -4.9, 2.4, -2.3, -4.9],
        'label': [0, 2, 1, -1, 1, 2, -1, -1, 1, 1, -1]}
df = pd.DataFrame(data)
df
val label
0 3.2 0
1 2.4 2
2 -2.3 1
3 -4.9 -1
4 3.2 1
5 2.4 2
6 -2.3 -1
7 -4.9 -1
8 2.4 1
9 -2.3 1
10 -4.9 -1
I want to keep the n (for example 2) rows before each -1 value in the label column. In the given df the first -1 appears at index 3, so we keep the 2 rows before it and drop index 3; the next -1 appears at index 6, so we again keep the 2 rows before it, and so on. The desired output is as follows:
val label
1 2.4 2
2 -2.3 1
4 3.2 1
5 2.4 2
6 -2.3 -1
8 2.4 1
9 -2.3 1
Thanks for any ideas!
You can get the index values and then get the previous two row index values:
idx = df[df.label == -1].index
filtered_idx = (idx-1).union(idx-2)
filtered_idx = filtered_idx[filtered_idx >= 0]  # drop positions before the first row, but keep row 0
df_new = df.iloc[filtered_idx]
output:
val label
1 2.4 2
2 -2.3 1
4 3.2 1
5 2.4 2
6 -2.3 -1
8 2.4 1
9 -2.3 1
Speed comparison with a for-loop solution:
# create large df:
import numpy as np
df = pd.DataFrame(np.random.random((20000000,2)), columns=["val","label"])
df.loc[df.sample(frac=0.01).index, "label"] = - 1
def vectorized_filter(df):
    idx = df[df.label == -1].index
    filtered_idx = (idx - 1).union(idx - 2)
    # drop negative positions so iloc does not wrap around to the end
    filtered_idx = filtered_idx[filtered_idx >= 0]
    df_new = df.iloc[filtered_idx]
    return df_new
def loop_filter(df):
    filter = df.loc[df['label'] == -1].index
    req_idx = []
    for idx in filter:
        if idx == 0:
            continue
        elif idx == 1:
            req_idx.append(idx-1)
        else:
            req_idx.append(idx-2)
            req_idx.append(idx-1)
    req_idx = list(set(req_idx))
    df2 = df.loc[df.index.isin(req_idx)]
    return df2
%timeit vectorized_filter(df)
%timeit loop_filter(df)
The vectorized version runs ~20x faster on my machine.
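The question asks for a general n, while the code above hard-codes the two offsets. A minimal sketch of one way to generalize it with NumPy broadcasting follows; the function name rows_before_marker and its parameters are invented for illustration, and it works on row positions, so it does not rely on the index being the default 0..N-1.

```python
import numpy as np
import pandas as pd

def rows_before_marker(df, n=2, marker=-1, column='label'):
    """Hypothetical helper: keep the n rows preceding every `marker` row."""
    # positions (not labels) of the marker rows
    pos = np.flatnonzero(df[column].to_numpy() == marker)
    # subtract offsets 1..n from every marker position at once
    offsets = np.arange(1, n + 1)
    candidates = (pos[:, None] - offsets).ravel()
    # drop positions before the first row, then de-duplicate and sort
    keep = np.unique(candidates[candidates >= 0])
    return df.iloc[keep]

df = pd.DataFrame({
    'val': [3.2, 2.4, -2.3, -4.9, 3.2, 2.4, -2.3, -4.9, 2.4, -2.3, -4.9],
    'label': [0, 2, 1, -1, 1, 2, -1, -1, 1, 1, -1],
})
result = rows_before_marker(df, n=2)
print(result)
```

As in the answers above, a marker row is itself kept when it falls within n rows before another marker (index 6 in the example).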
Here's a solution:
markers = df[df.label.eq(-1)].index
# DataFrame.append was removed in pandas 2.0, so collect the slices and concatenate once
pieces = [df[max(marker - 2, 0):marker] for marker in markers]
new_df = pd.concat(pieces)
new_df.reset_index().drop_duplicates().set_index("index")
Result:
val label
index
1 2.4 2
2 -2.3 1
4 3.2 1
5 2.4 2
6 -2.3 -1
8 2.4 1
9 -2.3 1
filter = df.loc[df['label'] == -1].index
req_idx = []
for idx in filter:
    if idx == 0:
        continue
    elif idx == 1:
        req_idx.append(idx-1)
    else:
        req_idx.append(idx-2)
        req_idx.append(idx-1)
req_idx = list(set(req_idx))
df2 = df.loc[df.index.isin(req_idx)]
print(df2)
Output:
val label
1 2.4 2
2 -2.3 1
4 3.2 1
5 2.4 2
6 -2.3 -1
8 2.4 1
9 -2.3 1
This should also work if you have the label as -1 in the first two rows
I would like to give each employee a pro rata share after a sale has been made. To do that, I first need to count the contacts per customer that led to a sale and then split the reward among the employees involved in the process.
import pandas as pd
df = pd.DataFrame({"Cust_ID":[1,1,1,2,3,3], "Employee": ["A","B","B","C","B","A"], "Purchase":[0,0,1,1,0,1]})
df
Cust_ID Employee Purchase
0 1 A 0
1 1 B 0
2 1 B 1
3 2 C 1
4 3 B 0
5 3 A 1
When it takes 3 (or more) steps to reach the final sale (Cust_ID = 1), the reward shall be distributed as 50%, 30% and 20%, counting back from the sale (any earlier steps get 0%).
For 2 steps it is 70% and 30%; for one step it is 100%.
The result should look like this:
Cust_ID Employee Purchase Reward
0 1 A 0 0.2
1 1 B 0 0.3
2 1 B 1 0.5
3 2 C 1 1.0
4 3 B 0 0.3
5 3 A 1 0.7
I tried using df["Reward"] = df.groupby("Cust_ID").Purchase.transform("xxx"), but this did not produce the distributed reward.
Thanks in advance!
First let's augment the DataFrame:
df['Touch'] = df.groupby('Cust_ID').cumcount()
df['Touches'] = df.groupby('Cust_ID').Employee.count()[df.Cust_ID].values
df['Reward'] = 0.0
Now we have the basic setup:
Cust_ID Employee Purchase Touch Touches Reward
0 1 A 0 0 3 0.0
1 1 B 0 1 3 0.0
2 1 B 1 2 3 0.0
3 2 C 1 0 1 0.0
4 3 B 0 0 2 0.0
5 3 A 1 1 2 0.0
Finally, apply the reward rules:
df.loc[df.Touches == 1, 'Reward'] = 1.0
df.loc[(df.Touches == 2) & (df.Touch == 0), 'Reward'] = 0.3
df.loc[(df.Touches == 2) & (df.Touch == 1), 'Reward'] = 0.7
df.loc[(df.Touches == 3) & (df.Touch == 0), 'Reward'] = 0.2
df.loc[(df.Touches == 3) & (df.Touch == 1), 'Reward'] = 0.3
df.loc[(df.Touches == 3) & (df.Touch == 2), 'Reward'] = 0.5
This last part could be done more cleverly using np.select(). This is an exercise for the reader.
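For readers who want to see the np.select() variant, here is a minimal sketch. It is one possible way to write it, not the only one: the helper column steps_from_sale is a name introduced here, and the ">= 3" conditions are an assumption taken from the "3 (or more) steps" wording in the question, with earlier contacts falling through to the default of 0.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Cust_ID": [1, 1, 1, 2, 3, 3],
                   "Employee": ["A", "B", "B", "C", "B", "A"],
                   "Purchase": [0, 0, 1, 1, 0, 1]})

df['Touch'] = df.groupby('Cust_ID').cumcount()
df['Touches'] = df.groupby('Cust_ID')['Employee'].transform('count')

# 0 for the contact that closed the sale, 1 for the one before it, and so on
steps_from_sale = df['Touches'] - 1 - df['Touch']

conditions = [
    df['Touches'] == 1,
    (df['Touches'] == 2) & (steps_from_sale == 0),
    (df['Touches'] == 2) & (steps_from_sale == 1),
    (df['Touches'] >= 3) & (steps_from_sale == 0),
    (df['Touches'] >= 3) & (steps_from_sale == 1),
    (df['Touches'] >= 3) & (steps_from_sale == 2),
]
choices = [1.0, 0.7, 0.3, 0.5, 0.3, 0.2]

# rows matching none of the conditions (contacts further back) get 0.0
df['Reward'] = np.select(conditions, choices, default=0.0)
print(df)
```

np.select picks, for each row, the choice belonging to the first condition that is true, so all the rules live in one place instead of six separate .loc assignments.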
I'm working with a data frame like this, but bigger and with more zones. I am trying to sum the Value columns of each row according to the name of the matching zone: the sum over the R or C zones goes in the total column, while the sum over the M zones goes in total1.
Input (total and total1 are the desired output):
ID Zone1 CHC1 Value1 Zone2 CHC2 Value2 Zone3 CHC3 Value3 total total1
1 R5B 100 10 C2 0 20 R10A 2 5 35 0
1 C2 95 20 M2-6 5 6 R5B 7 3 23 6
3 C2 40 4 C4 60 6 0 6 0 10 0
3 C1 100 8 0 0 0 0 100 0 8 0
5 M1-5 10 6 M2-6 86 15 0 0 0 0 21
You can use filter to get separate DataFrames for the Zone and Value columns:
z = df.filter(like='Zone')
v = df.filter(like='Value')
Then create boolean DataFrames, using str.contains with apply if you want to check for substrings:
m1 = z.apply(lambda x: x.str.contains('R|C'))
m2 = z.apply(lambda x: x.str.contains('M'))
# to check for exact strings instead:
#m1 = z == 'R2'
#m2 = z.isin(['C1', 'C4'])
Finally, mask v with where and sum along each row:
df['t'] = v.where(m1.values).sum(axis=1).astype(int)
df['t1'] = v.where(m2.values).sum(axis=1).astype(int)
print (df)
ID Zone1 CHC1 Value1 Zone2 CHC2 Value2 Zone3 CHC3 Value3 t t1
0 1 R5B 100 10 C2 0 20 R10A 2 5 35 0
1 1 C2 95 20 M2-6 5 6 R5B 7 3 23 6
2 3 C2 40 4 C4 60 6 0 6 0 10 0
3 3 C1 100 8 0 0 0 0 100 0 8 0
4 5 M1-5 10 6 M2-6 86 15 0 0 0 0 21
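For anyone who wants to run the approach above end to end, here is a self-contained sketch that rebuilds the sample frame. It assumes the zones are stored as strings (with '0' for empty slots), which the question does not state; the astype(str) call is there only to guard against mixed types.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 3, 3, 5],
    'Zone1': ['R5B', 'C2', 'C2', 'C1', 'M1-5'],
    'CHC1': [100, 95, 40, 100, 10],
    'Value1': [10, 20, 4, 8, 6],
    'Zone2': ['C2', 'M2-6', 'C4', '0', 'M2-6'],
    'CHC2': [0, 5, 60, 0, 86],
    'Value2': [20, 6, 6, 0, 15],
    'Zone3': ['R10A', 'R5B', '0', '0', '0'],
    'CHC3': [2, 7, 6, 100, 0],
    'Value3': [5, 3, 0, 0, 0],
})

z = df.filter(like='Zone').astype(str)   # Zone1, Zone2, Zone3
v = df.filter(like='Value')              # Value1, Value2, Value3

# boolean masks with the same shape as v, one cell per (row, zone slot)
is_rc = z.apply(lambda col: col.str.contains('R|C'))
is_m = z.apply(lambda col: col.str.contains('M'))

# keep only the values whose zone matches, then sum across each row
df['total'] = v.where(is_rc.to_numpy()).sum(axis=1).astype(int)
df['total1'] = v.where(is_m.to_numpy()).sum(axis=1).astype(int)
print(df[['ID', 'total', 'total1']])
```

The masks are passed as plain arrays because their column names (Zone1, ...) differ from those of v (Value1, ...), so label alignment would not match them up.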
Solution1 (simpler code but slower and less flexible)
total = []
total1 = []
for i in range(df.shape[0]):
    temp = df.iloc[i].tolist()
    if "R2" in temp:
        total.append(temp[temp.index("R2")+1])
    else:
        total.append(0)
    if ("C1" in temp) and ("C4" in temp):
        total1.append(temp[temp.index("C1")+1] + temp[temp.index("C4")+1])
    else:
        total1.append(0)
df["Total"] = total
df["Total1"] = total1
Solution2 (faster than solution1 and easier to customize but possibly memory intensive)
# columns to use
cols = df.columns.tolist()
zones = [x for x in cols if x.startswith('Zone')]
vals = [x for x in cols if x.startswith('Value')]
# you can customize here
bucket1 = ['R2']
bucket2 = ['C1', 'C4']
thresh = 2 # "OR": 1, "AND": 2
original = df.copy()
# bucket1 check
for zone in zones:
    df.loc[~df[zone].isin(bucket1), cols[cols.index(zone)+1]] = 0
original['Total'] = df[vals].sum(axis=1)
df = original.copy()
# bucket2 check
for zone in zones:
    df.loc[~df[zone].isin(bucket2), cols[cols.index(zone)+1]] = 0
df['Check_Bucket'] = df[zones].stack().reset_index().groupby('level_0')[0].apply(list)
df['Check_Bucket'] = df['Check_Bucket'].apply(lambda x: len([y for y in x if y in bucket2]))
df['Total1'] = df[vals].sum(axis=1)
df.loc[df.Check_Bucket < thresh, 'Total1'] = 0
df.drop('Check_Bucket', axis=1, inplace=True)
When I expanded the original dataframe to 100k rows, solution 1 took 11.4 s ± 82.1 ms per loop, while solution 2 took 3.53 s ± 29.8 ms per loop. The difference arises because solution 2 does not loop over the rows.
I have a dataframe net that contains the distance d between two locations A and B.
net =
A B d
0 5 3 3.5
1 2 0 2.3
2 3 2 1.2
3 4 5 2.2
4 0 1 3.2
5 0 3 4.5
Then I have a symmetric matrix M that contains all the possible distances between two pairs, so:
M =
0 1 2 3 4 5
0 0 3.2 2.3 4.5 1.7 5.2
1 3.2 0 2.1 0.7 3.9 3.8
2 2.3 2.1 0 1.2 1.5 4.7
3 4.5 0.7 1.2 0 3.2 3.5
4 1.7 3.9 1.5 3.2 0 2.2
5 5.2 3.8 4.7 3.5 2.2 0
I want to generate a new dataframe df1 that contains, for each row of net, two random distinct locations A and B whose distance ds falls in the same interval as d, i.e. ds > np.floor(d) & ds < np.floor(d)+1.
This is what I am doing:
H = []
W = []
for i in net.index:
    tmp = net['d'][i]
    ds = np.where( (M > np.floor(tmp)) & (M < np.floor(tmp)+1) )
    size = len(ds[0])
    ind = randint(size) ## find two random locations with distance ds
    h = ds[0][ind]
    w = ds[1][ind]
    H.append(h)
    W.append(w)
df1 = pd.DataFrame()
df1['A'] = H
df1['B'] = W
Group the entries of M by floor division by 1, then use those groups to look up and sample a pair for each d:
g = M.stack().index.to_series().groupby(M.stack() // 1)
net.d.apply(lambda x: pd.Series(g.get_group(x // 1).sample(1).iloc[0], list('AB')))
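The same bucketing idea can also be written with plain NumPy, which may be easier to follow. This is only a sketch under a few assumptions: the diagonal is excluded so that A and B always differ, the seed 0 is arbitrary, and a distance that is exactly an integer lands in its own bucket (the question's strict ">" would exclude it). rng.choice raises an error if no pair falls in the requested bucket.

```python
import numpy as np
import pandas as pd

M = pd.DataFrame([
    [0.0, 3.2, 2.3, 4.5, 1.7, 5.2],
    [3.2, 0.0, 2.1, 0.7, 3.9, 3.8],
    [2.3, 2.1, 0.0, 1.2, 1.5, 4.7],
    [4.5, 0.7, 1.2, 0.0, 3.2, 3.5],
    [1.7, 3.9, 1.5, 3.2, 0.0, 2.2],
    [5.2, 3.8, 4.7, 3.5, 2.2, 0.0],
])
net = pd.DataFrame({'A': [5, 2, 3, 4, 0, 0],
                    'B': [3, 0, 2, 5, 1, 3],
                    'd': [3.5, 2.3, 1.2, 2.2, 3.2, 4.5]})

dist = M.to_numpy()
# every off-diagonal (row, col) pair, i.e. every pair of distinct locations
rows, cols = np.nonzero(~np.eye(len(dist), dtype=bool))
# integer part of each pair's distance, computed once up front
bucket = np.floor(dist[rows, cols])

rng = np.random.default_rng(0)
picks = []
for d in net['d']:
    # pairs whose distance has the same integer part as d
    candidates = np.flatnonzero(bucket == np.floor(d))
    k = rng.choice(candidates)
    picks.append((rows[k], cols[k]))

df1 = pd.DataFrame(picks, columns=['A', 'B'])
print(df1)
```

The expensive part (scanning all of M) happens once; each row of net then only filters a precomputed array.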
I want to know if there is any faster way to do the following loop? Maybe use apply or rolling apply function to realize this
Basically, I need to access previous row's value to determine current cell value.
# .ix was removed in pandas 1.0; positional .iloc is used instead
df.iloc[0] = (np.abs(df.iloc[0]) >= So) * np.sign(df.iloc[0])
for i in range(1, len(df)):
    for j in range(df.shape[1]):
        if (df.iloc[i, j] > 1.25) & (df.iloc[i-1, j] == 0):
            df.iloc[i, j] = 1
        elif (df.iloc[i, j] < -1.25) & (df.iloc[i-1, j] == 0):
            df.iloc[i, j] = -1
        elif ((df.iloc[i, j] <= -0.75) & (df.iloc[i-1, j] < 0)) | ((df.iloc[i, j] >= 0.5) & (df.iloc[i-1, j] > 0)):
            df.iloc[i, j] = df.iloc[i-1, j]
        else:
            df.iloc[i, j] = 0
As you can see, the loop updates the dataframe as it goes, and each step needs the already-updated previous row, so using shift will not work.
For example:
Input:
A B C
1.3 -1.5 0.7
1.1 -1.4 0.6
1.0 -1.3 0.5
0.4 1.4 0.4
Output:
A B C
1 -1 0
1 -1 0
1 -1 0
0 1 0
You can use the .shift() method to access previous or next values:
previous value for col column:
df['col'].shift()
next value for col column:
df['col'].shift(-1)
Example:
In [38]: df
Out[38]:
a b c
0 1 0 5
1 9 9 2
2 2 2 8
3 6 3 0
4 6 1 7
In [39]: df['prev_a'] = df['a'].shift()
In [40]: df
Out[40]:
a b c prev_a
0 1 0 5 NaN
1 9 9 2 1.0
2 2 2 8 9.0
3 6 3 0 2.0
4 6 1 7 6.0
In [43]: df['next_a'] = df['a'].shift(-1)
In [44]: df
Out[44]:
a b c prev_a next_a
0 1 0 5 NaN 9.0
1 9 9 2 1.0 2.0
2 2 2 8 9.0 6.0
3 6 3 0 2.0 6.0
4 6 1 7 6.0 NaN
I am surprised there isn't a native pandas solution to this as well, because shift and rolling do not get it done. I have devised a way to do this using the standard pandas syntax but I am not sure if it performs any better than your loop... My purposes just required this for consistency (not speed).
import pandas as pd
df = pd.DataFrame({'a':[0,1,2], 'b':[0,10,20]})
new_col = 'c'
def apply_func_decorator(func):
    prev_row = {}  # remembers the previous row, including the computed column
    def wrapper(curr_row, **kwargs):
        val = func(curr_row, prev_row)
        prev_row.update(curr_row)
        prev_row[new_col] = val
        return val
    return wrapper
@apply_func_decorator
def running_total(curr_row, prev_row):
    return curr_row['a'] + curr_row['b'] + prev_row.get('c', 0)
df[new_col] = df.apply(running_total, axis=1)
print(df)
# Output will be:
# a b c
# 0 0 0 0
# 1 1 10 11
# 2 2 20 33
Disclaimer: I used pandas 0.16 but with only slight modification this will work for the latest versions too.
Others had similar questions and I posted this solution on those as well:
Reference previous row when iterating through dataframe
Reference values in the previous row with map or apply
@maxU has it right with shift. I think you can even compare the DataFrames directly, something like this:
df_prev = df.shift()  # shift() moves values down one row, so this holds the previous row
df_out = pd.DataFrame(index=df.index,columns=df.columns)
df_out[(df > 1.25) & (df_prev == 0)] = 1
df_out[(df < -1.25) & (df_prev == 0)] = -1
df_out[(df <= -.75) & (df_prev < 0)] = df_prev
df_out[(df >= .5) & (df_prev > 0)] = df_prev
The syntax may be off, but if you provide some test data I think this could work.
Saves you having to loop at all.
EDIT - Update based on comment below
I would try my absolute best not to loop through the DF itself. You're better off going column by column, sending to a list and doing the updating, then just importing back again. Something like this:
# .ix was removed in pandas 1.0; positional .iloc is used instead
df.iloc[0] = (np.abs(df.iloc[0]) >= 1.25) * np.sign(df.iloc[0])
for col in df.columns.tolist():
    currData = df[col].tolist()
    for currRow in range(1, len(currData)):
        if currData[currRow] > 1.25 and currData[currRow-1] == 0:
            currData[currRow] = 1
        elif currData[currRow] < -1.25 and currData[currRow-1] == 0:
            currData[currRow] = -1
        elif currData[currRow] <= -.75 and currData[currRow-1] < 0:
            currData[currRow] = currData[currRow-1]
        elif currData[currRow] >= .5 and currData[currRow-1] > 0:
            currData[currRow] = currData[currRow-1]
        else:
            currData[currRow] = 0
    df[col] = currData
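A variation on the same idea, offered only as a sketch: since the columns are independent of each other, the row loop can stay while every column is handled at once on a NumPy array. The function name apply_thresholds and its parameter names are invented here; the rules are copied from the loop in the question, with 1.25 assumed for the first-row threshold as in the code above.

```python
import numpy as np
import pandas as pd

def apply_thresholds(df, enter=1.25, hold_pos=0.5, hold_neg=-0.75):
    """Hypothetical helper: loop over rows once, all columns at a time."""
    x = df.to_numpy(dtype=float)
    out = np.zeros_like(x)
    # first row: +1 / -1 if the magnitude reaches the entry threshold, else 0
    out[0] = (np.abs(x[0]) >= enter) * np.sign(x[0])
    for i in range(1, len(x)):
        prev = out[i - 1]          # already-updated previous row
        cur = x[i]
        go_long = (cur > enter) & (prev == 0)
        go_short = (cur < -enter) & (prev == 0)
        hold = ((cur <= hold_neg) & (prev < 0)) | ((cur >= hold_pos) & (prev > 0))
        # nested where keeps the same priority order as the if/elif chain
        out[i] = np.where(go_long, 1.0,
                 np.where(go_short, -1.0,
                 np.where(hold, prev, 0.0)))
    return pd.DataFrame(out, index=df.index, columns=df.columns)

df = pd.DataFrame({'A': [1.3, 1.1, 1.0, 0.4],
                   'B': [-1.5, -1.4, -1.3, 1.4],
                   'C': [0.7, 0.6, 0.5, 0.4]})
res = apply_thresholds(df)
print(res)
```

This still performs one Python-level iteration per row, but the inner loop over columns (and the per-cell DataFrame access) is gone, which is where most of the time tends to go with many columns.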