I have a df
Side ref_price price price_diff
0 100 110
1 110 100
I want to keep price_diff values based on side values.
if side==0:
df['price_diff']=df['ref_price']*df['price']
else if side==1:
df['price_diff']=df['ref_price']*df['price']*-1
Tried with
df.loc[df.Side == 0, 'price_diff'] = (df['price']*df['ref_price'])
Not working, throwing errors.
You could use "Side" column as a condition in numpy.where:
df['price_diff'] = np.where(df['Side'].astype(bool), df['ref_price']*df['price']*-1, df['ref_price']*df['price'])
or in this specific case, use "Side" column values as power of -1:
df['price_diff'] = df['ref_price']*df['price']*(-1)**df['Side']
Output:
Side ref_price price price_diff
0 0 100 110 11000
1 1 110 100 -11000
You can use np.where:
df['price_diff'] = np.where(df['side'] == 0,
df['ref_price'] * df['price'],
df['ref_price'] * df['price'] * -1)
print(df)
# Output
side ref_price price price_diff
0 0 100 110 11000
1 1 110 100 -11000
I have a column "Employees" that contains the following data:
122.12 (Mark/Jen)
32.11 (John/Albert)
29.1 (Jo/Lian)
I need to count how many values match a specific condition (like x>31).
base = list()
count = 0
count2 = 0
for element in data['Employees']:
base.append(element.split(' ')[0])
if base > 31:
count= count +1
else
count2 = count2 +1
print(count)
print(count2)
The output should tell me that count value is 2, and count2 value is 1. The problem is that I cannot compare float to list. How can I make that if work ?
You have a df with a Employees column that you need to split into number and text, keep the number and convert it into a float, then filter it based on a value:
import pandas as pd
df = pd.DataFrame({'Employees': ["122.12 (Mark/Jen)", "32.11(John/Albert)",
"29.1(Jo/Lian)"]})
print(df)
# split at (
df["value"] = df["Employees"].str.split("(")
# convert to float
df["value"] = pd.to_numeric(df["value"].str[0])
print(df)
# filter it into 2 series
smaller = df["value"] < 31
remainder = df["value"] > 30
print(smaller)
print(remainder)
# counts
smaller31 = sum(smaller) # True == 1 -> sum([True,False,False]) == 1
bigger30 = sum(remainder)
print(f"Smaller: {smaller31} bigger30: {bigger30}")
Output:
# df
Employees
0 122.12 (Mark/Jen)
1 32.11(John/Albert)
2 29.1(Jo/Lian)
# after split/to_numeric
Employees value
0 122.12 (Mark/Jen) 122.12
1 32.11(John/Albert) 32.11
2 29.1(Jo/Lian) 29.10
# smaller
0 False
1 False
2 True
Name: value, dtype: bool
# remainder
0 True
1 True
2 False
Name: value, dtype: bool
# counted
Smaller: 1 bigger30: 2
I am trying to compare a dataframe's different columns with each other row by row like
for (i= startday to endday)
if(df[i]<df[i+1])
counter=counter+1
else
i=endday+1
the goal is find increasing (or decreasing) trends(need to be consecutive)
And my data looks like this
df= 1 2 3 0 1 1 1
1 1 1 1 0 1 2
1 2 1 0 1 1 2
0 0 0 0 1 0 1
(In this example startday to endday is 7 but actually these two are unstable)
As a result i expect to find this {2,0,1,0} and i need it to work fast because my data is quite big(1.2 million). Because of the time limit I tried not to use loops (for, if etc.)
I tried the code below but couldn't find how to stop counter if condition is false
import math
import numpy as np
import pandas as pd
df1=df.copy()
df2=df.copy()
bool1 = (np.less_equal.outer(startday.startday, range(1,13))
& np.greater_equal.outer(endday.endday, range(1,13))
)
bool1= np.c_[np.zeros(len(startday)),bool1].astype('bool')
bool2 = (np.less_equal.outer(startday2.startday2, range(1,13))
& np.greater_equal.outer(endday2.endday2, range(1,13))
)
bool2= np.c_[bool2, np.zeros(len(startday))].astype('bool')
df1.insert(0, 'c_False',math.pi)
df2.insert(12, 'c_False',math.pi)
#df2.head()
arr_bool = (bool1&bool2&(df1.values<df2.values))
df_new = pd.DataFrame(np.sum(arr_bool , axis=1),
index=data_idx, columns=['coll'])
df_new.coll= np.select( condlist = [startday.startday > endday.endday],
choicelist = [-999],
default = df_new.coll)
Add zeros at the end, then use np.diff, then get the first "non positive" using argmin:
(np.diff(np.hstack((df.values, np.zeros((df.values.shape[0], 1)))), axis=1) > 0).argmin(axis=1)
>> array([2, 0, 1, 0], dtype=int64)
I have a pandas df with 3 columns:
Close Top_Barrier Bottom_Barrier
0 441.86 441.964112 426.369888
1 448.95 444.162225 425.227108
2 449.99 446.222271 424.285063
3 449.74 447.947051 423.678282
4 451.97 449.879254 423.029413
...
996 436.97 446.468790 426.600543
997 438.16 446.461401 426.599265
998 437.00 446.093899 426.641434
999 437.52 446.024365 426.631635
1000 437.75 446.114093 426.715907
Objective:
For every row, I need to test if any of the next 30 rows Close price touches the top or bottom barrier (from row 0), eg, start from row index 0, test if Close price (441.86) is greater than Top_Barrier (441.96) or lower than Bottom_Barrier (426.36), if it is greater than Top_Barrier, return 1, if it is lower than Bottom_Barrier, return -1. Else, loop to the next row, eg, at index 1, Close price is 448.95, but it is still being tested against barrier price from index 0, ie, Top_Barrier of 441.96, Bottom_Barrier of 426.36. This loop continue until index 29 if Close price never touches the barriers - return 0 if that's the case. Next rolling loop start from index 1 until 30, etc.
Attempts:
I tried using .rolling.apply with the following function but I just could not resolve the errors. Happy to explore any other methods as long as it achieve my objective stated above. Thanks!
def tbl_rolling(x):
start_i = x.index[0]
for i in range(len(x)):
# the barrier freeze at index 0
if x.loc[i, 'Close'] > x.loc[start_i, 'Top_Barrier']:
return 1
elif x.loc[i, 'Close'] < x.loc[start_i, 'Bottom_Barrier']:
return -1
return 0
The following then throws IndexingError: Too many indexers
test = df.rolling(30).apply(tbl_rolling, raw=False)
You can try something like this if your dataset isn't very big:
df = df.reset_index().assign(key=1)
def f(x):
cond1 = x['Close_x'] > x['Top_Barrier_y'].max()
cond2 = x['Close_x'] < x['Bottom_Barrier_y'].min()
return np.select([cond1,cond2],[1,-1], default=0)[0]
df.merge(df, on='key').query('index_y <= index_x').groupby('index_x').apply(f)
Output:
index_x
0 0
1 1
2 1
3 1
4 1
996 0
997 0
998 0
999 0
1000 0
dtype: int64
This is a weird one: I have 3 dataframes, "prov_data" with contains a provider id and counts on regions and categories (ie. how many times that provider interacted with those regions and categories).
prov_data = DataFrame({'aprov_id':[1122,3344,5566,7788],'prov_region_1':[0,0,4,0],'prov_region_2':[2,0,0,0],
'prov_region_3':[0,1,0,1],'prov_cat_1':[0,2,0,0],'prov_cat_2':[1,0,3,0],'prov_cat_3':[0,0,0,4],
'prov_cat_4':[0,3,0,0]})
"tender_data" which contains the same but for tenders.
tender_data = DataFrame({'atender_id':['AA12','BB33','CC45'],
'ten_region_1':[0,0,1,],'ten_region_2':[0,1,0],
'ten_region_3':[1,1,0],'ten_cat_1':[1,0,0],
'ten_cat_2':[0,1,0],'ten_cat_3':[0,1,0],
'ten_cat_4':[0,0,1]})
And finally a "no_match" DF wich contains forbidden matches between provider and tender.
no_match = DataFrame({ 'prov_id':[1122,3344,5566],
'tender_id':['AA12','BB33','CC45']})
I need to do the following: create a new df that will append the rows of the prov_data & tender_data DataFrames if they (1) match one or more categories (ie the same category is > 0) AND (2) match one or more regions AND (3) are not on the no_match list.
So that would give me this DF:
df = DataFrame({'aprov_id':[1122,3344,7788],'prov_region_1':[0,0,0],'prov_region_2':[2,0,0],
'prov_region_3':[0,1,1],'prov_cat_1':[0,2,0],'prov_cat_2':[1,0,0],'prov_cat_3':[0,0,4],
'prov_cat_4':[0,3,0], 'atender_id':['BB33','AA12','BB33'],
'ten_region_1':[0,0,0],'ten_region_2':[1,0,1],
'ten_region_3':[1,1,1],'ten_cat_1':[0,1,0],
'ten_cat_2':[1,0,1],'ten_cat_3':[1,0,1],
'ten_cat_4':[0,0,0]})
code
# the first columns of each dataframe are the ids
# i'm going to use them several times
tid = tender_data.values[:, 0]
pid = prov_data.values[:, 0]
# first columns [1, 2, 3, 4] are cat columns
# we could have used filter, but this is good
# for this example
pc = prov_data.values[:, 1:5]
tc = tender_data.values[:, 1:5]
# columns [5, 6, 7] are rgn columns
pr = prov_data.values[:, 5:]
tr = tender_data.values[:, 5:]
# I want to mave this an m x n array, where
# m = number of rows in prov df and n = rows in tender
nm = no_match.groupby(['prov_id', 'tender_id']).size().unstack()
nm = nm.reindex_axis(tid, 1).reindex_axis(pid, 0)
nm = ~nm.fillna(0).astype(bool).values * 1
# the dot products of the cat arrays gets a handy
# array where there are > 1 co-positive values
# this combined with the a no_match construct
a = pd.DataFrame(pc.dot(tc.T) * pr.dot(tr.T) * nm > 0, pid, tid)
a = a.mask(~a).stack().index
fp = a.get_level_values(0)
ft = a.get_level_values(1)
pd.concat([
prov_data.set_index('aprov_id').loc[fp].reset_index(),
tender_data.set_index('atender_id').loc[ft].reset_index()
], axis=1)
index prov_cat_1 prov_cat_2 prov_cat_3 prov_cat_4 prov_region_1 \
0 1122 0 1 0 0 0
1 3344 2 0 0 3 0
2 7788 0 0 4 0 0
prov_region_2 prov_region_3 atender_id ten_cat_1 ten_cat_2 ten_cat_3 \
0 2 0 BB33 0 1 1
1 0 1 AA12 1 0 0
2 0 1 BB33 0 1 1
ten_cat_4 ten_region_1 ten_region_2 ten_region_3
0 0 0 1 1
1 0 0 0 1
2 0 0 1 1
explanation
use dot products to determine matches
many other things I'll try to explain more later
Straightforward solution that uses only "standard" pandas techniques.
prov_data['tkey'] = 1
tender_data['tkey'] = 1
df1 = pd.merge(prov_data,tender_data,how='outer',on='tkey')
df1 = pd.merge(df1,no_match,how='outer',left_on = 'aprov_id', right_on = 'prov_id')
df1['dropData'] = df1.apply(lambda x: True if x['tender_id'] == x['atender_id'] else False, axis=1)
df1['dropData'] = df1.apply(lambda x: (x['dropData'] == True) or not(
((x['prov_cat_1'] > 0 and x['ten_cat_1'] > 0) or
(x['prov_cat_2'] > 0 and x['ten_cat_2'] > 0) or
(x['prov_cat_3'] > 0 and x['ten_cat_3'] > 0) or
(x['prov_cat_4'] > 0 and x['ten_cat_4'] > 0)) and(
(x['prov_region_1'] > 0 and x['ten_region_1'] > 0) or
(x['prov_region_2'] > 0 and x['ten_region_2'] > 0) or
(x['prov_region_3'] > 0 and x['ten_region_3'] > 0))),axis=1)
df1 = df1[~df1.dropData]
df1 = df1[[u'aprov_id', u'atender_id', u'prov_cat_1', u'prov_cat_2', u'prov_cat_3',
u'prov_cat_4', u'prov_region_1', u'prov_region_2', u'prov_region_3',
u'ten_cat_1', u'ten_cat_2', u'ten_cat_3', u'ten_cat_4', u'ten_region_1',
u'ten_region_2', u'ten_region_3']].reset_index(drop=True)
print df1.equals(df)
First we do a full cross product of both dataframes and merge that with the no_match dataframe, then add a boolean column to mark all rows to be dropped.
The boolean column is assigned by two boolean lambda functions with all the necessary conditions, then we just take all rows where that column is False.
This solution isn't very ressource-friendly due to the merge operations, so if your data is very large it may be disadvantageous.