Set value of column based on other column value - python

I have a df:
Side  ref_price  price  price_diff
0     100        110
1     110        100
I want to set price_diff based on the Side value:
if side == 0:
    df['price_diff'] = df['ref_price'] * df['price']
elif side == 1:
    df['price_diff'] = df['ref_price'] * df['price'] * -1
I tried:
df.loc[df.Side == 0, 'price_diff'] = df['price'] * df['ref_price']
but it is not working and throws errors.

You could use the "Side" column as a condition in numpy.where:
df['price_diff'] = np.where(df['Side'].astype(bool), df['ref_price']*df['price']*-1, df['ref_price']*df['price'])
or, in this specific case, use the "Side" column values as the power of -1:
df['price_diff'] = df['ref_price']*df['price']*(-1)**df['Side']
Output:
   Side  ref_price  price  price_diff
0     0        100    110       11000
1     1        110    100      -11000
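For reference, a minimal runnable version of the power-of-minus-one variant, assuming the DataFrame from the question:
import pandas as pd

df = pd.DataFrame({'Side': [0, 1], 'ref_price': [100, 110], 'price': [110, 100]})
# (-1)**0 == 1 and (-1)**1 == -1, so the exponent flips the sign where Side == 1
df['price_diff'] = df['ref_price'] * df['price'] * (-1) ** df['Side']
print(df)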

You can use np.where:
df['price_diff'] = np.where(df['Side'] == 0,
                            df['ref_price'] * df['price'],
                            df['ref_price'] * df['price'] * -1)
print(df)
# Output
   Side  ref_price  price  price_diff
0     0        100    110       11000
1     1        110    100      -11000
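As a side note (not part of the original answers), if there were more than two Side values, np.select would generalize the same pattern:
import numpy as np

conditions = [df['Side'] == 0, df['Side'] == 1]
choices = [df['ref_price'] * df['price'],
           df['ref_price'] * df['price'] * -1]
# default covers any Side value not matched by the conditions
df['price_diff'] = np.select(conditions, choices, default=np.nan)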

Related

How to set values in a dataframe based on rows?

I have a dataframe like this:
id value
111 0.222
222 2.253
333 0.444
444 21.256
....
I want to add a new column flag: for the first half of the rows, set flag to 0; for the rest, set it to 1:
id value flag
111 0.222 0
222 2.253 0
333 0.444 0
444 21.256 0
...
8888 1212.500 1
9999 0.025 1
What's the best way to do this? I tried the following:
df['flag'][:int(len(df) / 2)] = 0
df['flag'][int(len(df) / 2):] = 1
But this gave me KeyError: 'flag'. Do I need to create an empty column named flag first? Can someone help please? Thanks.
Even if you create an empty column, you would get some warning/error due to chained indexing. Try assigning it all at once:
df['flag'] = (np.arange(len(df)) >= (len(df)//2)).astype(int)
Or
l = len(df) // 2
df['flag'] = [0] * l + [1] * (len(df) - l)
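A quick check of the first approach on a hypothetical six-row frame (the ids beyond the question's sample are made up):
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [111, 222, 333, 444, 555, 666],
                   'value': [0.222, 2.253, 0.444, 21.256, 1212.5, 0.025]})
# rows with position >= len(df)//2 get flag 1, the first half gets 0
df['flag'] = (np.arange(len(df)) >= len(df) // 2).astype(int)
print(df)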

Calculating DataFrame columns based on other columns

Having DataFrame like so:
# comments are the equations that have to be done to calculate the given column
df = pd.DataFrame({
'item_tolerance': [230, 115, 155],
'item_intake': [250,100,100],
'open_items_previous_day': 0, # df.item_intake.shift() + df.open_items_previous_day.shift() - df.items_shipped.shift() + df.items_over_under_sla.shift()
'total_items_to_process': 0, # df.item_intake + df.open_items_previous_day
'sla_relevant': 0, # df.item_tolerance if df.open_items_previous_day + df.item_intake > df.item_tolerance else df.open_items_previous_day + df.item_intake
'items_shipped': [230, 115, 50],
'items_over_under_sla': 0 # df.items_shipped - df.sla_relevant
})
   item_tolerance  item_intake  open_items_previous_day  total_items_to_process  sla_relevant  items_shipped  items_over_under_sla
0             230          250                        0                       0             0            230                     0
1             115          100                        0                       0             0            115                     0
2             155          100                        0                       0             0             50                     0
I'd like to calculate all the columns that have comments next to them. I've tried using df.apply(some_method, axis=1) to perform row-wise calculations, but the problem is that I don't have access to the previous row inside some_method(row).
To give a little more explanation, what I'm trying to achieve is, for example: df.items_over_under_sla = df.items_shipped - df.sla_relevant, but df.sla_relevant is based on an equation which needs df.open_items_previous_day, which in turn needs the previous row to be calculated. This is the problem: I need to calculate each row based on values from both that row and the previous one.
What is the correct approach to such problem?
If you are calculating each column with a different operation, I suggest obtaining them individually:
df['open_items_previous_day'] = df['item_intake'].shift(fill_value=0) + df['open_items_previous_day'].shift(fill_value=0) - df['items_shipped'].shift(fill_value=0) + df['items_over_under_sla'].shift(fill_value=0)
df['total_items_to_process'] = df['item_intake'] + df['open_items_previous_day']
df = df.assign(sla_relevant=np.where(df['open_items_previous_day'] + df['item_intake'] > df['item_tolerance'], df['item_tolerance'], df['open_items_previous_day'] + df['item_intake']))
df['items_over_under_sla'] = df['items_shipped'] - df['sla_relevant']
df
Out[1]:
item_tolerance item_intake open_items_previous_day total_items_to_process sla_relevant items_shipped items_over_under_sla
0 230 250 0 250 230 230 0
1 115 100 20 120 115 115 0
2 155 100 -15 85 85 50 -35
The problem that you are facing is not about having to use the previous row (you are working around that just fine using the shift function). The real problem here is that all columns that you are trying to get (except for total_items_to_process) depend on each other, therefore you can't get the rest of the columns without having one of them first (or assuming it is zero initially).
That's why you are going to get different results depending on which column you've calculated first.
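Not part of the original answer: if the commented equations really must be applied sequentially, each row seeing the previous row's computed values, one hedged sketch is an explicit loop over the rows. Its numbers differ from the table above, for exactly the ordering reason just described:
import pandas as pd

df = pd.DataFrame({'item_tolerance': [230, 115, 155],
                   'item_intake': [250, 100, 100],
                   'items_shipped': [230, 115, 50]})

records = []
prev = {'intake': 0, 'open': 0, 'shipped': 0, 'over_under': 0}
for row in df.itertuples(index=False):
    # open_items_previous_day per the commented shift() equation
    open_prev = prev['intake'] + prev['open'] - prev['shipped'] + prev['over_under']
    total = row.item_intake + open_prev
    sla = min(total, row.item_tolerance)   # tolerance caps the workload
    over_under = row.items_shipped - sla
    records.append((open_prev, total, sla, over_under))
    prev = {'intake': row.item_intake, 'open': open_prev,
            'shipped': row.items_shipped, 'over_under': over_under}

out = pd.DataFrame(records, columns=['open_items_previous_day',
                                     'total_items_to_process',
                                     'sla_relevant',
                                     'items_over_under_sla'])
df = pd.concat([df, out], axis=1)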

How to find the first row with min value of a column in dataframe

I have a data frame where I take the absolute difference of two columns to make a third column, and I am trying to get the row with the first or second minimum value of that column, but I get an error. Is there a better method to get the row with the minimum value of the new column?
df2 = df[[2,3]]
df2[4] = np.absolute(df[2] - df[3])
#lowest = df.iloc[df[6].min()]
2 3 4
0 -111 -104 7
1 -130 110 240
2 -105 -112 7
3 -118 -100 18
4 -147 123 270
5 225 -278 503
6 102 -122 224
Desired result:
2 -105 -112 7
Get the difference as a Series, take Series.abs, and then compare against the minimal value using boolean indexing:
s = (df[2] - df[3]).abs()
df = df[s == s.min()]
If you want a new column for the difference:
df['diff'] = (df[2] - df[3]).abs()
df = df[df['diff'] == df['diff'].min()]
Another idea is to get the index of the minimal value with Series.idxmin and then select with DataFrame.loc; for a one-row DataFrame, [[]] is necessary:
s = (df[2] - df[3]).abs()
df = df.loc[[s.idxmin()]]
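If "first or second min value" means the two smallest differences, one possibility (an assumption beyond the original answer) is Series.nsmallest, which keeps the earliest occurrences by default:
s = (df[2] - df[3]).abs()
# rows with the two smallest absolute differences, earliest first
df.loc[s.nsmallest(2).index]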
EDIT:
For more dynamic code that converts to integers where possible, use:
def int_if_possible(x):
    try:
        return x.astype(int)
    except Exception:
        return x

df = df.apply(int_if_possible)

How to perform rolling for loop in a Pandas Dataframe?

I have a pandas df with 3 columns:
Close Top_Barrier Bottom_Barrier
0 441.86 441.964112 426.369888
1 448.95 444.162225 425.227108
2 449.99 446.222271 424.285063
3 449.74 447.947051 423.678282
4 451.97 449.879254 423.029413
...
996 436.97 446.468790 426.600543
997 438.16 446.461401 426.599265
998 437.00 446.093899 426.641434
999 437.52 446.024365 426.631635
1000 437.75 446.114093 426.715907
Objective:
For every row, I need to test whether any of the next 30 rows' Close prices touch the top or bottom barrier taken from the starting row. E.g., starting from row index 0, test whether Close (441.86) is greater than Top_Barrier (441.96) or lower than Bottom_Barrier (426.36); if it is greater than Top_Barrier, return 1, and if it is lower than Bottom_Barrier, return -1. Otherwise, move to the next row: at index 1, Close is 448.95, but it is still tested against the barriers from index 0, i.e., Top_Barrier of 441.96 and Bottom_Barrier of 426.36. This loop continues until index 29; if the Close price never touches the barriers, return 0. The next rolling loop then starts from index 1 and runs to index 30, and so on.
Attempts:
I tried using .rolling().apply() with the following function, but I just could not resolve the errors. Happy to explore any other methods as long as they achieve my objective stated above. Thanks!
def tbl_rolling(x):
    start_i = x.index[0]
    for i in range(len(x)):
        # the barriers freeze at index 0
        if x.loc[i, 'Close'] > x.loc[start_i, 'Top_Barrier']:
            return 1
        elif x.loc[i, 'Close'] < x.loc[start_i, 'Bottom_Barrier']:
            return -1
    return 0
The following then throws IndexingError: Too many indexers
test = df.rolling(30).apply(tbl_rolling, raw=False)
You can try something like this if your dataset isn't very big:
df = df.reset_index().assign(key=1)

def f(x):
    # Close_x is constant within each group; compare it against the extreme
    # barrier values of all rows up to this index
    cond1 = x['Close_x'] > x['Top_Barrier_y'].max()
    cond2 = x['Close_x'] < x['Bottom_Barrier_y'].min()
    # every element of the result is identical within a group, take the first
    return np.select([cond1, cond2], [1, -1], default=0)[0]

df.merge(df, on='key').query('index_y <= index_x').groupby('index_x').apply(f)
Output:
index_x
0 0
1 1
2 1
3 1
4 1
996 0
997 0
998 0
999 0
1000 0
dtype: int64
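An alternative, plain-loop sketch (not from the answer above) that follows the stated objective literally: barriers frozen at row i while the next 30 Close prices are scanned forward, whereas the merge above looks at preceding rows. The window size and early exit are taken from the question text:
import numpy as np

def forward_touch(df, window=30):
    close = df['Close'].to_numpy()
    top = df['Top_Barrier'].to_numpy()
    bottom = df['Bottom_Barrier'].to_numpy()
    out = np.zeros(len(df), dtype=int)
    for i in range(len(df)):
        # barriers stay fixed at row i while Close rolls forward
        for j in range(i, min(i + window, len(df))):
            if close[j] > top[i]:
                out[i] = 1
                break
            if close[j] < bottom[i]:
                out[i] = -1
                break
    return out

df['label'] = forward_touch(df)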

Python pandas resampling

I have the following dataframe:
Timestamp S_time1 S_time2 End_Time_1 End_time_2 Sign_1 Sign_2
0 2413044 0 0 0 0 x x
1 2422476 0 0 0 0 x x
2 2431908 0 0 0 0 x x
3 2441341 0 0 0 0 x x
4 2541232 2526631 2528631 2520631 2530631 10 80
5 2560273 2544946 2546496 2546496 2548496 40 80
6 2577224 2564010 2566010 2566010 2568010 null null
7 2592905 2580959 2582959 2582959 2584959 null null
The table goes on like that. The first column is a timestamp in milliseconds. S_time1 and End_time_1 bound the duration during which a particular sign (number) appears. For example, if we take the 5th row, S_time1 is 2526631, End_time_1 is 2520631, and the corresponding Sign_1 is 10, which means the sign 10 will be displayed between those two times. The same goes for S_time2 and End_time_2: the corresponding value in Sign_2 will appear during the duration from S_time2 to End_time_2.
I want to resample the index column (Timestamp) into 100-millisecond bins and check which bins the signs fall into. For instance, between each start time and end time there is a 2000-millisecond difference, so the corresponding sign number will appear repeatedly in around 20 consecutive bins, because each bin is 100 milliseconds. So I need only two columns: one with the bin times and the second with the signs, like the following table (I am just making up the bin times for this example):
Bin_time signs
...100 0
...200 0
...300 10
...400 10
...500 10
...600 10
The sign 10 will apply for the duration of the corresponding S_time1 to End_time_1; then the next sign, 80, continues for the duration of S_time2 to End_time_2. I am not sure whether this can be done in pandas or not, but I really need help, either with pandas or other methods.
Thanks for your help and suggestions in advance.
Input:
print(df)
Timestamp S_time1 S_time2 End_Time_1 End_time_2 Sign_1 Sign_2
0 2413044 0 0 0 0 x x
1 2422476 0 0 0 0 x x
2 2431908 0 0 0 0 x x
3 2441341 0 0 0 0 x x
4 2541232 2526631 2528631 2520631 2530631 10 80
5 2560273 2544946 2546496 2546496 2548496 40 80
6 2577224 2564010 2566010 2566010 2568010 null null
7 2592905 2580959 2582959 2582959 2584959 null null
2 approaches:
In [231]: %timeit s(df)
1 loops, best of 3: 2.78 s per loop
In [232]: %timeit m(df)
1 loops, best of 3: 690 ms per loop
def m(df):
    # resample column Timestamp by 100 ms, convert back to integers
    df['Timestamp'] = df['Timestamp'].astype('timedelta64[ms]')
    df['i'] = 1
    df = df.set_index('Timestamp')
    df1 = df[[]].resample('100ms').first().reset_index()
    df1['Timestamp'] = (df1['Timestamp'] / np.timedelta64(1, 'ms')).astype(int)
    # helper column i for merging
    df1['i'] = 1
    # print(df1)
    out = df1.merge(df, on='i', how='left')
    out1 = out[['Timestamp', 'Sign_1']][(out.Timestamp >= out.S_time1) & (out.Timestamp <= out.End_Time_1)]
    out2 = out[['Timestamp', 'Sign_2']][(out.Timestamp >= out.S_time2) & (out.Timestamp <= out.End_time_2)]
    out1 = out1.rename(columns={'Sign_1': 'Bin_time'})
    out2 = out2.rename(columns={'Sign_2': 'Bin_time'})
    df = pd.concat([out1, out2], ignore_index=True).drop_duplicates(subset='Timestamp')
    df1 = df1.set_index('Timestamp')
    df = df.set_index('Timestamp')
    df = df.reindex(df1.index).reset_index()
    # print(df.head(10))
    return df
def s(df):
    # resample column Timestamp by 100 ms, convert back to integers
    df['Timestamp'] = df['Timestamp'].astype('timedelta64[ms]')
    df = df.set_index('Timestamp')
    out = df[[]].resample('100ms').first()
    out = out.reset_index()
    out['Timestamp'] = (out['Timestamp'] / np.timedelta64(1, 'ms')).astype(int)
    # print(out.head(10))

    # search the start/end intervals for each resampled timestamp
    def search(x):
        mask1 = (df.S_time1 <= x['Timestamp']) & (df.End_Time_1 >= x['Timestamp'])
        # if at least one True, return the first matching value of the series
        if mask1.any():
            return df.loc[mask1, 'Sign_1'].iloc[0]
        # check the second start and end time
        else:
            mask2 = (df.S_time2 <= x['Timestamp']) & (df.End_time_2 >= x['Timestamp'])
            if mask2.any():
                # if at least one True, return the first value
                return df.loc[mask2, 'Sign_2'].iloc[0]
            else:
                # if all False, return NaN
                return np.nan

    out['Bin_time'] = out.apply(search, axis=1)
    # print(out.head(10))
    return out
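A more direct sketch of the same lookup, assuming the question's original integer Timestamp column (the bin grid and the final column names here follow the question's description, not the answer's code):
import numpy as np
import pandas as pd

# 100 ms grid spanning the observed timestamps
bins = np.arange(df['Timestamp'].min() // 100 * 100,
                 df['Timestamp'].max() + 100, 100)
out = pd.DataFrame({'Bin_time': bins})

def sign_for(t):
    # the first interval containing the bin time wins; Sign_1 is checked first
    m1 = (df['S_time1'] <= t) & (t <= df['End_Time_1'])
    if m1.any():
        return df.loc[m1, 'Sign_1'].iloc[0]
    m2 = (df['S_time2'] <= t) & (t <= df['End_time_2'])
    if m2.any():
        return df.loc[m2, 'Sign_2'].iloc[0]
    return np.nan

out['signs'] = out['Bin_time'].map(sign_for)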
