I have a data frame like this
ntil ureach_x ureach_y awgt
0 1 1 34 2204.25
1 2 35 42 1700.25
2 3 43 48 898.75
3 4 49 53 160.25
and an array of values like this
ulist = [41,57]
For each value in the list [41,57] I am trying to find if the values fall in between ureach_x and ureach_y and return the awgt value.
awt=[]
for u in ulist:
for index,rows in df.iterrows():
if (u >= rows['ureach_x'] and u <= rows['ureach_y']):
awt.append(rows['awgt'])
The above code works for within the value ranges of ureach_x and ureach_y. How do I check if the value in the list is greater than the last row of ureach_y. My data frame has dynamic shape with varying number of rows.
For example, The desired output for value 57 in the list is 160.25
I tried the following:
for u in ulist:
for index,rows in df.iterrows():
if (u >= rows['ureach_x'] and u <= rows['ureach_y']):
awt.append(rows['awgt'])
elif (u >= rows['ureach_x'] and u > rows['ureach_y']):
awt.append(rows['awgt'])
However, this returns multiple values for 41 in the list. How do I refer only the last value in the column of reach_y in a iterrows loop.
The expected output is as follows:
for values in list:
[41,57]
the corresponding values from df has to be returned.
[1700.25 ,160.25]
If I've understood correctly, you can perform a merge_asof:
s = pd.Series([41,57], name='index')
(pd.merge_asof(s, df, left_on='index', right_on='ureach_x')
.set_index('index')['awgt']
)
Output:
index
41 1700.25
57 160.25
Name: awgt, dtype: float64
If you have 0 in the data and you want to have 2204.25 returned, you can add two lines to #mozway's code and perform merge_asof twice, once going backwards and once going forwards; then combine the two.
ulist = [0, 41, 57]
srs = pd.Series(ulist, name='num')
backward = pd.merge_asof(srs, df, left_on='num', right_on='ureach_x')
forward = pd.merge_asof(srs, df, left_on='num', right_on='ureach_x', direction='forward')
out = backward.combine_first(forward)['awgt']
Output:
0 2204.25
1 1700.25
2 160.25
Name: awgt, dtype: float64
Another option (an explicit loop over ulist):
out = []
for num in ulist:
if ((df['ureach_x'] <= num) & (num <= df['ureach_y'])).any():
x = df.loc[(df['ureach_x'] <= num) & (num <= df['ureach_y']), 'awgt'].iloc[-1]
elif (df['ureach_x'] > num).any():
x = df.loc[df['ureach_x'] > num, 'awgt'].iloc[0]
else:
x = df.loc[df['ureach_y'] < num, 'awgt'].iloc[-1]
out.append(x)
Output:
[2204.25, 1700.25, 160.25]
I have a huge number of points stored with x and y coordinates and an additional value ('value_P') in a pandas.dataframe so the dataframe looks like:
x-coordinate
y-coordinate
value_P
0
0
3
1
1
40
58
1
2
5
4
2
3
76
98
2
4
15
35
3
5
5
4
3
but with around 250000 entries, so i look for a efficient solution. I am trying to add a column that holds the row index of the closest other point. But only the distance between points with value_P!=1 to points with value_P==1 should be considered. Also i am only interested in the index for points where value_P!=1. Its difficult to explain but the desired output should be:
x-coordinate
y-coordinate
value_P
index
0
0
3
1
NaN
1
40
58
1
NaN
2
5
4
2
0
3
76
98
2
1
4
15
35
3
1
5
5
4
3
0
For row 1 the index is NaN because i am not interested in it, since value_P==1. For row 2 its 0, because the point from row 0 is the closest point with a value_P of 1.
I hope its understandable.
I found a solution that involves 2 DataFrame.apply(lambda x:...) functions but it takes a long time. Even if you dont have a concrete solution but an idea how to improve the performance it would be highly appreciated.
My current code is: (P_sort is the data and 'zuord' is the added column)
def index2(x_1,y_1,x_2,y_2,last_1):
h = math.sqrt((x_1 - x_2) ** 2 + (y_1 - y_2) ** 2)
return h
def index(x_1,y_1,x_v,y_v,last_1):
df2 = pnd.DataFrame()
df3 = pnd.DataFrame()
df2['x-coordinate'] = x_v
df2['y-coordinate'] = y_v
df3['distances'] = df2.apply(
lambda x: index2(x['x-coordinate'], x['y-coordinate'], x_1, y_1, last_1), axis=1)
k=df3.idxmin()
print(k)
return k
last_1 = np.count_nonzero(P_sort[:, 2] == 1) - 1
df = pnd.DataFrame(P_sort,
columns=['x-coordinate', 'y-coordinate', 'value_P'])
number_columnx = df.loc[:, 'x-coordinate']
number_columny = df.loc[:, 'y-coordinate']
x_v = number_columnx.values
y_v = number_columny.values
x_v = x_v[0:last_1]
y_v = y_v[0:last_1]
df['zuord'] = df.apply(lambda x: index(x['x-coordinate'],x['y-coordinate'],x_v,y_v,last_1),axis=1)
I am new to programming so the code is kind of ugly
I benchmarked four solutions, and the fastest approach is a KD Tree.
Test Dataset
I randomly generated dataframes of various sizes to test the performance of each method.
def generate_spots(n, p=0.005):
x_pos = np.random.uniform(0, 100, n)
y_pos = np.random.uniform(0, 100, n)
value_P = np.random.binomial(size=n, n=1, p=(1 - p)) + 1
df = pd.DataFrame({
'x-coordinate': x_pos,
'y-coordinate': y_pos,
'value_P': value_P
})
df = df.sort_values('value_P').reset_index(drop=True)
return df
This generates a dataframe with n rows, with a probability p that each row is class 1. I also sorted it, because the original method seems to assume that the dataframe is sorted by P.
Method 1: Original
I made some small changes to your code to get it to work for me:
def method1(df):
df = df.copy()
last_1 = np.count_nonzero(df.loc[:, 'value_P'] == 1)
number_columnx = df.loc[:, 'x-coordinate']
number_columny = df.loc[:, 'y-coordinate']
x_v = number_columnx.values
y_v = number_columny.values
x_v = x_v[0:last_1]
y_v = y_v[0:last_1]
df['index'] = df.apply(lambda x: index(x['x-coordinate'],x['y-coordinate'],x_v,y_v,last_1),axis=1)
df.loc[0:last_1 - 1, 'index'] = -1
return df
index() and index2() are defined the same way as your question. I also use -1 as a placeholder instead of NaN. No deep reason for this, just personal preference.
Method 2: cdist
Scipy has a function called cdist() which takes the distance between each point among two arrays of points.
import scipy.spatial.distance
def method2(df):
df = df.copy()
first_P_class = df['value_P'] == 1
target_df = df.loc[first_P_class][['x-coordinate', 'y-coordinate']]
source_df = df.loc[~first_P_class][['x-coordinate', 'y-coordinate']]
nearest_point = scipy.spatial.distance.cdist(source_df, target_df).argmin(axis=1)
df['index'] = -1
df.loc[source_df.index, 'index'] = nearest_point
return df
The cdist function is pretty much the same as what you're doing - it's just implemented in C rather than Python.
Method 3: KD Tree
A KD Tree is a data structure designed to efficiently search for nearby points. You can use SciKit Learn to implement this.
import sklearn.neighbors
def method3(df):
df = df.copy()
first_P_class = df['value_P'] == 1
target_df = df.loc[first_P_class][['x-coordinate', 'y-coordinate']]
source_df = df.loc[~first_P_class][['x-coordinate', 'y-coordinate']]
tree = sklearn.neighbors.KDTree(target_df)
nearest_point = tree.query(source_df, k=1, return_distance=False)
df['index'] = -1
df.loc[source_df.index, 'index'] = nearest_point.flatten()
return df
Method 4: fastdist
The Python package fastdist bills itself as a faster alternative to scipy's distance calculation methods. Ironically, I found this solution to be slower than cdist at all problem sizes.
from fastdist import fastdist
def method4(df):
df = df.copy()
first_P_class = df['value_P'] == 1
target_df = df.loc[first_P_class][['x-coordinate', 'y-coordinate']]
target_array = target_df.to_numpy()
source_df = df.loc[~first_P_class][['x-coordinate', 'y-coordinate']]
source_array = source_df.to_numpy()
nearest_point = fastdist.matrix_to_matrix_distance(source_array, target_array, fastdist.euclidean, "euclidean").argmin(axis=1)
df['index'] = -1
df.loc[source_df.index, 'index'] = nearest_point
return df
Benchmarks
Each method was run ten times, with various sizes of dataframe, in random order. Here are the results of the benchmark. Note that both the X and Y axes are log-scale.
I didn't benchmark fastdist or the original method for more than 30,000 points, because it took too long.
The fastest methods, in this benchmark, are the cdist method, for fewer than 1000 points, and KD Tree method, for more than 1000 points. At 250K points, the fastest solution is the KD Tree, taking only 0.2 seconds.
As per the following code, using panda, I am doing some analysis on one of the columns (HR):
aa = New_Data['index'].tolist()
aa = [0] + aa
avg = []
for i in range(1,len(aa)):
** val = raw_data.loc[(raw_data['index'] >= aa[i-1]) & (raw_data['index'] <= aa[i])['HR'].diff().mean()
avg.append(val)
New_Data['slope'] = avg
AT the end of the day, it will add a new column to the data ('Slope')
That is fine and is working. The problem is that I want to redo the line (which is specified by **) for every other columns (not just HR) as well. in Other words,:
** val = raw_data.loc[(raw_data['index'] >= aa[i-1]) & (raw_data['index'] <= aa[i])['**another column**'].diff().mean()
avg.append(val)
New_Data['slope'] = avg
Is there any way to do it automatically? I have around 100 columns so doing manually is not enticing. Thanks for your help
Not sure on the pure pandas way but you could just write in a external loop -
aa = New_Data['index'].tolist()
aa = [0] + aa
avg = []
for col in df.columns:
for i in range(1,len(aa)):
** val = raw_data.loc[(raw_data['index'] >= aa[i-1]) & (raw_data['index'] <= aa[i])[col].diff().mean()
avg.append(val)
New_Data['slope'] = avg
In the line
for col in df.columns
you can modify to only use columns you need.
I have 2 timeseries of binary "signals", let's call them "entry" and "stay".
Entry==1 means add 1 to current state (for some maximum amount of time) and stay==0 means set current state to 0.
entry:
0
1
1
0
1
0
stay:
1
1
1
1
0
1
My code now calculates a combined current state:
state:
0
1
2
2
0
1
Currently I use the following code, unfortunately it's (depending on the max-time) quite slow (state/stay/entry are Pandas time series):
state=copy.deepcopy(entry)
state[stay==0]=0
#first iteration
state[(entry.shift(1)==1) & (stay==1)]+=1
#2nd iteration to max time
for lag in range(2,max_time+1):
state[(entry.shift(lag)==1) & (pd.rolling_mean(stay,lag)==1)]+=1
Any idea how to vectorize this code for better performance? Many thanks!
Finally found a solution now, using some NumPy functions:
def calc_state_series(entry,stay, max_time=5):
reduce=(copy.deepcopy(entry)*0).fillna(0) #just for initalization
reduce[(entry.shift(max_time)==1) & (pd.rolling_mean(stay,max_time)==1)]-=1
entry=(entry+stay.shift(1)).fillna(0) #reduce state after max_time
x=entry.values
x = np.concatenate(([0], x))
y=stay.values
y=np.concatenate(([0], y))
nans = y==0
x = np.array(x)
x[nans] = 0
reset_idx = np.zeros(len(x), dtype=int)
reset_idx[nans] = np.arange(len(x))[nans]
reset_idx = np.maximum.accumulate(reset_idx)
cumsum = np.cumsum(x)
cumsum = cumsum - cumsum[reset_idx]
return pd.Series(cumsum[1:], index=entry.index)
I manage to avoid the loop and this solution is (depending on max_time) up to 100x faster for me - but there is probably still potential for further optimization.
In numpy I have a dataset like this. The first two columns are indices. I can divide my dataset into blocks via the indices, i.e. first block is 0 0 second block is 0 1 third block 0 2 then 1 0, 1 1, 1 2 and so on and so forth. Each block has at least two elements. The numbers in the indices columns can vary
I need to split the dataset along these blocks 80%-20% randomly such that after the split each block in both datasets has at least 1 element. How could I do that?
indices | real data
|
0 0 | 43.25 665.32 ... } 1st block
0 0 | 11.234 }
0 1 ... } 2nd block
0 1 }
0 2 } 3rd block
0 2 }
1 0 } 4th block
1 0 }
1 0 }
1 1 ...
1 1
1 2
1 2
2 0
2 0
2 1
2 1
2 1
...
See how do you like this. To introduce randomness, I am shuffling the entire dataset. It is the only way I have figured how to do the splitting vectorized. Maybe you could simply shuffle an indexing array, but that was one indirection too many for my brain today. I have also used a structured array, for ease in extracting the blocks. First, lets create a sample dataset:
from __future__ import division
import numpy as np
# Create a sample data set
c1, c2 = 10, 5
idx1, idx2 = np.arange(c1), np.arange(c2)
idx1, idx2 = np.repeat(idx1, c2), np.tile(idx2, c1)
items = 1000
i = np.random.randint(c1*c2, size=(items - 2*c1*c2,))
d = np.random.rand(items+5)
dataset = np.empty((items+5,), [('idx1', np.int), ('idx2', np.int),
('data', np.float)])
dataset['idx1'][:2*c1*c2] = np.tile(idx1, 2)
dataset['idx1'][2*c1*c2:-5] = idx1[i]
dataset['idx2'][:2*c1*c2] = np.tile(idx2, 2)
dataset['idx2'][2*c1*c2:-5] = idx2[i]
dataset['data'] = d
# Add blocks with only 2 and only 3 elements to test corner case
dataset['idx1'][-5:] = -1
dataset['idx2'][-5:] = [0] * 2 + [1]*3
And now the stratified sampling:
# For randomness, shuffle the entire array
np.random.shuffle(dataset)
blocks, _ = np.unique(dataset[['idx1', 'idx2']], return_inverse=True)
block_count = np.bincount(_)
where = np.argsort(_)
block_start = np.concatenate(([0], np.cumsum(block_count)[:-1]))
# If we have n elements in a block, and we assign 1 to each array, we
# are left with only n-2. If we randomly assign a fraction x of these
# to the first array, the expected ratio of items will be
# (x*(n-2) + 1) : ((1-x)*(n-2) + 1)
# Setting the ratio equal to 4 (80/20) and solving for x, we get
# x = 4/5 + 3/5/(n-2)
x = 4/5 + 3/5/(block_count - 2)
x = np.clip(x, 0, 1) # if n in (2, 3), the ratio is larger than 1
threshold = np.repeat(x, block_count)
threshold[block_start] = 1 # first item goes to A
threshold[block_start + 1] = 0 # seconf item goes to B
a_idx = threshold > np.random.rand(len(dataset))
A = dataset[where[a_idx]]
B = dataset[where[~a_idx]]
After running it, the split is roughly 80/20, and all blocks are represented in both arrays:
>>> len(A)
815
>>> len(B)
190
>>> np.all(np.unique(A[['idx1', 'idx2']]) == np.unique(B[['idx1', 'idx2']]))
True
Here's an alternative solution. I'm open for a code review if it is possible to implement this in a more numpy way (without for loops). #Jamie 's answer is really good, it's just that sometimes it produces skewed ratios within blocks of data.
ratio = 0.8
IDX1 = 0
IDX2 = 1
idx1s = np.arange(len(np.unique(self.data[:,IDX1])))
idx2s = np.arange(len(np.unique(self.data[:,IDX2])))
valid = None
train = None
for i1 in idx1s:
for i2 in idx2:
mask = np.nonzero((data[:,IDX1] == i1) & (data[:,IDX2] == i2))
curr_data = data[mask,:]
np.random.shuffle(curr_data)
start = np.min(mask)
end = np.max(mask)
thres = start + np.around((end - start) * ratio).astype(np.int)
selected = mask < thres
train_idx = mask[0][selected[0]]
valid_idx = mask[0][~selected[0]]
if train != None:
train = np.vstack((train,data[train_idx]))
valid = np.vstack((valid,data[valid_idx]))
else:
train = data[train_idx]
valid = data[valid_idx]
I'm assuming that each block has at least two entries and also that if it has more than two you want them assigned as closely as possible to 80/20. The easiest way to do this seems to be to assign a random number to all rows, and then choose based on percentiles within each stratified sample. Say this is the data in file strat_sample.csv:
Index_1,Index_2,Data_1,Data_2
0,0,0.614583182,0.677644482
0,0,0.321384981,0.598450854
0,0,0.303029607,0.300593782
0,0,0.646010758,0.612006715
0,0,0.484572883,0.30052535
0,1,0.010625416,0.118671475
0,1,0.428967984,0.23795173
0,1,0.523440618,0.457275922
0,1,0.379612652,0.337640868
0,1,0.338180659,0.206399031
1,0,0.079386,0.890939911
1,0,0.572864624,0.725615079
1,0,0.045891404,0.300128917
1,0,0.578792198,0.100698871
1,0,0.776485138,0.475135948
1,0,0.401850419,0.784835723
1,1,0.087660923,0.497299605
1,1,0.8460978,0.825774802
1,1,0.526015021,0.581905971
1,1,0.23324672,0.299475291
Then this code (using Pandas data structures) works as desired
import numpy as np
import random as rnd
import pandas as pd
#sample data strat_sample.csv, contents to follow
def TreatmentOneCount(n , *args):
#assign a minimum one to each group but as close as possible to fraction OptimalRatio in group 1.
OptimalRatio = args[0]
if n < 2:
print("N too small, assignment not defined.")
a = NaN
elif n == 2:
a = 1
else:
"""
There are one of two numbers that are close to the target ratio, one above, the other below
If the number above is N and it is closest to optimal, then you need to set things to N-1 to ensure both groups have at least one member (recall n>2)
If the number below is 0 and it is closest to optimal, then you need to set things to 1 to ensure both groups have at least one member (recall n>2)
"""
targetassigment = OptimalRatio * n
if targetassigment - floor(targetassigment) > 0.5:
a = min(ceil(targetassigment),n-1)
else:
a = max(floor(targetassigment),1)
return a
df = pd.read_csv('strat_sample.csv', sep=',' , header=0)
#assign a random number to each entry
df['RandScore'] = np.random.uniform(0,1,df.shape[0])
df.sort(columns= ['Index_1' ,'Index_2','RandScore'], inplace = True)
#Within each block assign a rank based on random number.
df['RandRank'] = df.groupby(['Index_1','Index_2'])['RandScore'].rank()
#make a group index
df['MasterIdx'] = df['Index_1'].apply(str) + df['Index_2'].apply(str)
#Store the counts for members of each block
seriestest = df.groupby('MasterIdx')['RandRank'].count()
seriestest.name = "Counts"
dftest = pd.DataFrame(seriestest)
#Add the block counts to the data
df = df.merge(dftest, how='left', left_on = 'MasterIdx', right_index= True)
#Make the actual assignments to the two groups
df['Assignment'] = (df['RandRank'] <= df['Counts'].apply(TreatmentOneCount, args = (0.8,))) * -1 + 2
df.drop(['MasterIdx', 'Counts', 'RandRank', 'RandScore'], axis=1)
from sklearn import cross_validation
X_train, X_test, Y_train, Y_test = cross_validation.train_test_split(X, y, test_size=0.2, random_state=0)