stratified sampling in numpy


I have a dataset in numpy like the one below. The first two columns are indices, and I can divide the dataset into blocks by them: the first block is 0 0, the second 0 1, the third 0 2, then 1 0, 1 1, 1 2, and so on. Each block has at least two elements. The numbers in the index columns can vary.
I need to split the dataset randomly 80%-20% along these blocks, such that after the split each block has at least one element in both resulting datasets. How could I do that?
indices | real data
|
0 0 | 43.25 665.32 ... } 1st block
0 0 | 11.234 }
0 1 ... } 2nd block
0 1 }
0 2 } 3rd block
0 2 }
1 0 } 4th block
1 0 }
1 0 }
1 1 ...
1 1
1 2
1 2
2 0
2 0
2 1
2 1
2 1
...

See how you like this. To introduce randomness, I shuffle the entire dataset; it is the only way I have figured out to vectorize the splitting. Maybe you could simply shuffle an indexing array, but that was one indirection too many for my brain today. I have also used a structured array, which makes extracting the blocks easier. First, let's create a sample dataset:
from __future__ import division
import numpy as np
# Create a sample data set
c1, c2 = 10, 5
idx1, idx2 = np.arange(c1), np.arange(c2)
idx1, idx2 = np.repeat(idx1, c2), np.tile(idx2, c1)
items = 1000
i = np.random.randint(c1*c2, size=(items - 2*c1*c2,))
d = np.random.rand(items+5)
# np.int and np.float were removed in NumPy 1.24; the builtins are equivalent
dataset = np.empty((items+5,), [('idx1', int), ('idx2', int),
                                ('data', float)])
dataset['idx1'][:2*c1*c2] = np.tile(idx1, 2)
dataset['idx1'][2*c1*c2:-5] = idx1[i]
dataset['idx2'][:2*c1*c2] = np.tile(idx2, 2)
dataset['idx2'][2*c1*c2:-5] = idx2[i]
dataset['data'] = d
# Add blocks with only 2 and only 3 elements to test corner case
dataset['idx1'][-5:] = -1
dataset['idx2'][-5:] = [0] * 2 + [1]*3
And now the stratified sampling:
# For randomness, shuffle the entire array
np.random.shuffle(dataset)
blocks, _ = np.unique(dataset[['idx1', 'idx2']], return_inverse=True)
block_count = np.bincount(_)
where = np.argsort(_)
block_start = np.concatenate(([0], np.cumsum(block_count)[:-1]))
# If we have n elements in a block, and we assign 1 to each array, we
# are left with only n-2. If we randomly assign a fraction x of these
# to the first array, the expected ratio of items will be
# (x*(n-2) + 1) : ((1-x)*(n-2) + 1)
# Setting the ratio equal to 4 (80/20) and solving for x, we get
# x = 4/5 + 3/5/(n-2)
x = 4/5 + 3/5/(block_count - 2)
x = np.clip(x, 0, 1) # if n in (2, 3), the ratio is larger than 1
threshold = np.repeat(x, block_count)
threshold[block_start] = 1 # first item goes to A
threshold[block_start + 1] = 0 # second item goes to B
a_idx = threshold > np.random.rand(len(dataset))
A = dataset[where[a_idx]]
B = dataset[where[~a_idx]]
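The algebra behind the comment on `x` can be spelled out. Writing m = n - 2 for the items left in a block after one has been forced into each array, the target ratio of 4 gives:

```latex
\frac{x m + 1}{(1 - x) m + 1} = 4
\;\Longrightarrow\; x m + 1 = 4 m - 4 x m + 4
\;\Longrightarrow\; 5 x m = 4 m + 3
\;\Longrightarrow\; x = \frac{4}{5} + \frac{3}{5 m}, \qquad m = n - 2 .
```

For n = 3 this gives x = 7/5, which is why the value is clipped to 1. For n = 2 the division by zero yields infinity, which is clipped as well, and both thresholds of such a block are overwritten by the forced assignments anyway.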
After running it, the split is roughly 80/20, and all blocks are represented in both arrays:
>>> len(A)
815
>>> len(B)
190
>>> np.all(np.unique(A[['idx1', 'idx2']]) == np.unique(B[['idx1', 'idx2']]))
True
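The "shuffle an indexing array" idea mentioned above can be sketched as follows. This is not the code from the answer but a variant under stated assumptions: it gives every row a random rank inside its block and sends a fixed number of rows per block to the first split, so the per-block counts are exact rather than 80/20 in expectation. The function name `stratified_split_mask` and its parameters are made up for illustration, and every block is assumed to have at least two rows.

```python
import numpy as np

def stratified_split_mask(keys, frac=0.8, seed=None):
    """Return a boolean mask over rows: True means "first split".

    keys is a 2-D integer array holding the index columns (one row per sample).
    Hypothetical helper; assumes each block contains at least two rows.
    """
    rng = np.random.default_rng(seed)
    _, inverse, counts = np.unique(keys, axis=0, return_inverse=True,
                                   return_counts=True)
    inverse = inverse.ravel()  # the shape of `inverse` differs across NumPy versions
    # Sort rows by block id, breaking ties with a random key: a shuffle within blocks
    order = np.lexsort((rng.random(len(inverse)), inverse))
    block_start = np.concatenate(([0], np.cumsum(counts)[:-1]))
    rank = np.arange(len(inverse)) - np.repeat(block_start, counts)
    # Rows per block for the first split, forced to leave at least one row on each side
    n_first = np.clip(np.rint(frac * counts).astype(int), 1, counts - 1)
    mask = np.empty(len(inverse), dtype=bool)
    mask[order] = rank < np.repeat(n_first, counts)
    return mask

# Blocks of size 2, 10 and 3
keys = np.array([[0, 0]] * 2 + [[0, 1]] * 10 + [[1, 0]] * 3)
mask = stratified_split_mask(keys, frac=0.8, seed=0)
A_rows, B_rows = np.flatnonzero(mask), np.flatnonzero(~mask)
```

With a structured array like `dataset`, the same mask could be applied as `dataset[mask]` and `dataset[~mask]` after stacking the two index fields into a 2-D array.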

Here's an alternative solution. I'm open to a code review if there is a more numpy-like way to implement this (without for loops). @Jamie's answer is really good; it's just that it sometimes produces skewed ratios within blocks of data.
import numpy as np
ratio = 0.8
IDX1 = 0
IDX2 = 1
# Assumes `data` is a 2-D array whose rows are sorted by block and whose
# index columns take the values 0..k-1
idx1s = np.arange(len(np.unique(data[:, IDX1])))
idx2s = np.arange(len(np.unique(data[:, IDX2])))
valid = None
train = None
for i1 in idx1s:
    for i2 in idx2s:
        # Row positions of the current block
        rows = np.nonzero((data[:, IDX1] == i1) & (data[:, IDX2] == i2))[0]
        curr_data = data[rows, :]
        # Note: fancy indexing returns a copy, so this does not reorder `data`
        np.random.shuffle(curr_data)
        start = np.min(rows)
        end = np.max(rows)
        thres = start + np.around((end - start) * ratio).astype(int)
        selected = rows < thres
        train_idx = rows[selected]
        valid_idx = rows[~selected]
        if train is not None:
            train = np.vstack((train, data[train_idx]))
            valid = np.vstack((valid, data[valid_idx]))
        else:
            train = data[train_idx]
            valid = data[valid_idx]

I'm assuming that each block has at least two entries and also that if it has more than two you want them assigned as closely as possible to 80/20. The easiest way to do this seems to be to assign a random number to all rows, and then choose based on percentiles within each stratified sample. Say this is the data in file strat_sample.csv:
Index_1,Index_2,Data_1,Data_2
0,0,0.614583182,0.677644482
0,0,0.321384981,0.598450854
0,0,0.303029607,0.300593782
0,0,0.646010758,0.612006715
0,0,0.484572883,0.30052535
0,1,0.010625416,0.118671475
0,1,0.428967984,0.23795173
0,1,0.523440618,0.457275922
0,1,0.379612652,0.337640868
0,1,0.338180659,0.206399031
1,0,0.079386,0.890939911
1,0,0.572864624,0.725615079
1,0,0.045891404,0.300128917
1,0,0.578792198,0.100698871
1,0,0.776485138,0.475135948
1,0,0.401850419,0.784835723
1,1,0.087660923,0.497299605
1,1,0.8460978,0.825774802
1,1,0.526015021,0.581905971
1,1,0.23324672,0.299475291
Then this code (using Pandas data structures) works as desired
import math
import numpy as np
import pandas as pd
# sample data: strat_sample.csv (contents shown above)
def TreatmentOneCount(n, *args):
    # Assign at least one member to each group, keeping group 1 as close as
    # possible to the fraction OptimalRatio.
    OptimalRatio = args[0]
    if n < 2:
        print("N too small, assignment not defined.")
        a = np.nan
    elif n == 2:
        a = 1
    else:
        # Two integers bracket the target, one above and one below.
        # If the one above is n and it is closest to optimal, use n-1 so that
        # both groups have at least one member (recall n > 2).
        # If the one below is 0 and it is closest to optimal, use 1 so that
        # both groups have at least one member (recall n > 2).
        targetassigment = OptimalRatio * n
        if targetassigment - math.floor(targetassigment) > 0.5:
            a = min(math.ceil(targetassigment), n - 1)
        else:
            a = max(math.floor(targetassigment), 1)
    return a
df = pd.read_csv('strat_sample.csv', sep=',' , header=0)
#assign a random number to each entry
df['RandScore'] = np.random.uniform(0,1,df.shape[0])
# DataFrame.sort was removed from pandas; sort_values is its replacement
df.sort_values(by=['Index_1', 'Index_2', 'RandScore'], inplace=True)
#Within each block assign a rank based on random number.
df['RandRank'] = df.groupby(['Index_1','Index_2'])['RandScore'].rank()
#make a group index
df['MasterIdx'] = df['Index_1'].apply(str) + df['Index_2'].apply(str)
#Store the counts for members of each block
seriestest = df.groupby('MasterIdx')['RandRank'].count()
seriestest.name = "Counts"
dftest = pd.DataFrame(seriestest)
#Add the block counts to the data
df = df.merge(dftest, how='left', left_on = 'MasterIdx', right_index= True)
#Make the actual assignments to the two groups
df['Assignment'] = (df['RandRank'] <= df['Counts'].apply(TreatmentOneCount, args = (0.8,))) * -1 + 2
# drop returns a new DataFrame, so keep the result
df = df.drop(['MasterIdx', 'Counts', 'RandRank', 'RandScore'], axis=1)
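The same rank-within-block idea can be written more compactly with `groupby(...).cumcount()`. The sketch below is an illustration rather than the answer's code: the name `assign_two_groups` and its parameters are hypothetical, the rounding rule mirrors `TreatmentOneCount` (round up only when the fractional part exceeds 0.5), and every block is assumed to have at least two rows.

```python
import numpy as np
import pandas as pd

def assign_two_groups(df, keys, ratio=0.8, seed=0):
    """Label each row 1 or 2 so that every block gets at least one row of each label."""
    shuffled = df.sample(frac=1, random_state=seed)   # rows in random order
    grouped = shuffled.groupby(keys)
    rank = grouped.cumcount().to_numpy()              # 0-based random rank within block
    gid = grouped.ngroup().to_numpy()                 # integer id of each row's block
    size = np.bincount(gid)[gid]                      # block size, repeated per row
    target = ratio * size
    round_up = (target - np.floor(target)) > 0.5
    n_first = np.where(round_up, np.ceil(target), np.floor(target))
    n_first = np.clip(n_first, 1, size - 1)           # keep both groups non-empty
    labelled = shuffled.assign(Assignment=np.where(rank < n_first, 1, 2))
    return labelled.sort_index()                      # restore the original row order

# Block sizes 5, 5, 6 and 4, as in strat_sample.csv
keys = ['Index_1', 'Index_2']
demo = pd.DataFrame({
    'Index_1': [0] * 10 + [1] * 10,
    'Index_2': [0] * 5 + [1] * 5 + [0] * 6 + [1] * 4,
})
demo['Data_1'] = np.random.default_rng(0).random(len(demo))
out = assign_two_groups(demo, keys)
group1_counts = out.groupby(keys)['Assignment'].apply(lambda s: int((s == 1).sum()))
```

Shuffling the rows first and ranking with `cumcount` avoids the helper columns (`RandScore`, `RandRank`, `MasterIdx`, `Counts`) at the cost of one extra sort to restore the row order.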

# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2, random_state=0)
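As written, the call above splits at random without looking at the blocks. A minimal sketch of a block-aware variant uses the `stratify` argument of `train_test_split`, with the two index columns combined into one label per row. The toy arrays and the names `idx`, `data` and `block_id` are assumptions made for the example. Allocation is proportional, so for very small blocks it is worth checking afterwards that both splits really contain every block.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy data: four blocks of ten rows each, two index columns plus two data columns
block_pairs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
idx = np.repeat(block_pairs, 10, axis=0)
data = rng.random((len(idx), 2))

# Combine the two index columns into a single block label per row
n_idx2 = idx[:, 1].max() + 1
block_id = idx[:, 0] * n_idx2 + idx[:, 1]

train, test, id_train, id_test = train_test_split(
    data, block_id, test_size=0.2, random_state=0, stratify=block_id)

print(np.bincount(id_train), np.bincount(id_test))
```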

Related

delete consecutive elements in a pandas dataFrame given a certain rule?

I have a variable of zeros and ones. Each run of ones represents a "phase" I want to observe, and each run of zeros represents the gap between phases.
A phase may carry a sort of "impulse response", for example the echo of a voice: in that case the output is 1,1,1,1,0,0,1,1,1,0,0,0, where the first run of ones is the shout we made and the second is just the echo caused by the shout.
So I wrote a function that ignores the echo/response of the main shout/action by converting the echo's run of ones into zeros.
(1) If the run of zeros is greater than or equal to the input threshold nearby_thr, the function treats the following run of ones as an independent phase and does not delete or change anything.
(2) If the run of zeros (between two runs of ones) is shorter than the input threshold nearby_thr, the function treats the following run of ones as an "impulse response/echo" and converts its ones into zeros.
I wrote a naive function that produces this result, but I was wondering whether pandas already has a function like that, or whether it can be done in a few lines without writing a "C-like" function.
Here's my code:
import pandas as pd
import matplotlib.pyplot as plt
# import utili_funzioni.util00 as ut0
x1 = pd.DataFrame([0,0,0,0,0,0,0,1,1,1,1,1,0,0,1,1,1,0,0,0,0,0,0,0,0,1,1,1,1,0,0,1,1,1])
x2 = pd.DataFrame([0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,1,1,0])
# rule = x1==1 ## counting number of consecutive ones
# cumsum_ones = rule.cumsum() - rule.cumsum().where(~rule).ffill().fillna(0).astype(int)
def detect_nearby_el_2(df, nearby_thr):
global el2del
# df = consecut_zeros
# i = 0
print("")
print("")
j = 0
enterOnce_if = 1
reset_count_0s = 0
start2detect = False
count0s = 0 # init
start2_getidxs = False # if this is not true, it won't store idxs to delete
el2del = [] # store idxs to delete elements
for i in range(df.shape[0]):
print("")
print("i: ", i)
x_i = df.iloc[i, 0]
if x_i == 1 and j==0: # first phase (ones) has been detected
start2detect = True # first phase (ones) has been detected
# j += 1
print("count0s:",count0s)
if start2detect == True: # first phase, seen/detected, --> (wait) has ended..
if x_i == 0: # 1st phase detected and ended with "a zero"
if reset_count_0s == 1:
count0s = 0
reset_count_0s = 0
count0s += 1
if enterOnce_if == 1:
start2_getidxs=True # avoiding to delete first phase
enterOnce_0 = 0
if start2_getidxs==True: # avoiding to delete first phase
if x_i == 1 and count0s < nearby_thr:
print("this is NOT a new phase!")
el2del = [*el2del, i] # idxs to delete
reset_count_0s = 1 # reset counter
if x_i == 1 and count0s >= nearby_thr:
print("this is a new phase!") # nothing to delete
reset_count_0s = 1 # reset counter
return el2del
def convert_nearby_el_into_zeros(df,idx):
df0 = df + 0 # error original dataframe is modified!
if len(idx) > 0:
# df.drop(df.index[idx]) # to delete completely
df0.iloc[idx] = 0
else:
print("no elements nearby to delete!!")
return df0
######
print("")
x1_2del = detect_nearby_el_2(df=x1,nearby_thr=3)
x2_2del = detect_nearby_el_2(df=x2,nearby_thr=3)
## deleting nearby elements
x1_a = convert_nearby_el_into_zeros(df=x1,idx=x1_2del)
x2_a = convert_nearby_el_into_zeros(df=x2,idx=x2_2del)
## PLOTTING
# ut0.grayplt()
fig1 = plt.figure()
fig1.suptitle("x1",fontsize=20)
ax1 = fig1.add_subplot(1,2,1)
ax2 = fig1.add_subplot(1,2,2,sharey=ax1)
ax1.title.set_text("PRE-detect")
ax2.title.set_text("POST-detect")
line1, = ax1.plot(x1)
line2, = ax2.plot(x1_a)
fig2 = plt.figure()
fig2.suptitle("x2",fontsize=20)
ax1 = fig2.add_subplot(1,2,1)
ax2 = fig2.add_subplot(1,2,2,sharey=ax1)
ax1.title.set_text("PRE-detect")
ax2.title.set_text("POST-detect")
line1, = ax1.plot(x2)
line2, = ax2.plot(x2_a)
You can see that x1 has two "responses/echoes" that I want to ignore, while x2 has none; in fact nothing changed in x2.
My question is: how can this be accomplished in a few lines using pandas?
Thank you

Interesting problem, and I'm sure there's a more elegant solution out there, but here is my attempt - it's at least fairly performant:
x1 = pd.Series([0,0,0,0,0,0,0,1,1,1,1,1,0,0,1,1,1,0,0,0,0,0,0,0,0,1,1,1,1,0,0,1,1,1])
x2 = pd.Series([0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,1,1,0])
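def remove_echos(series, threshold):
    series = series.copy()  # avoid modifying the caller's Series
    starting_points = (series == 1) & (series.shift() == 0)
    # Parentheses are required here: & binds more tightly than ==
    echo_starting_points = starting_points & (series.shift(threshold) == 1)
    echo_starting_points = series[echo_starting_points].index
    # Assumes a default integer index; the sentinel is one past the last row
    change_points = series[starting_points].index.to_list() + [series.index[-1] + 1]
    for (start, end) in zip(change_points, change_points[1:]):
        if start in echo_starting_points:
            # .loc slices include the end label, so stop one row before `end`
            series.loc[start:end - 1] = 0
    return series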
x1 = remove_echos(x1, 3)
x2 = remove_echos(x2, 3)
(I changed x1 and x2 to be Series instead of DataFrame, it's easy to adapt this code to work with a df if you need to.)
Explanation: we define the "starting point" of each section as a 1 preceded by a 0. Of those, a starting point is an "echo" starting point if the value threshold places before it is a 1. (The assumption is that no phase is shorter than threshold.) For each echo starting point, we zero everything from it up to the next starting point or the end of the Series.
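For comparison, the rule from the question can also be expressed on whole runs without an explicit Python loop. The sketch below is an alternative formulation, not the code above: it labels consecutive runs with a cumulative sum and marks a run of ones as an echo when the run of zeros before it is shorter than the threshold and is itself preceded by ones. The function name `zero_out_echo_runs` is hypothetical.

```python
import pandas as pd

def zero_out_echo_runs(series, threshold):
    """Return a copy of series in which echo runs of ones are replaced by zeros."""
    s = series.reset_index(drop=True)
    run_id = (s != s.shift()).cumsum()               # label of each consecutive run
    runs = s.groupby(run_id).agg(value="first", length="count")
    gap_before = runs["length"].shift()              # length of the preceding run
    value_two_back = runs["value"].shift(2)          # value of the run before that
    is_echo = (runs["value"] == 1) & (gap_before < threshold) & (value_two_back == 1)
    echo_rows = run_id.map(is_echo)                  # broadcast the run flag to rows
    out = s.where(~echo_rows, 0)
    out.index = series.index
    return out

x1 = pd.Series([0,0,0,0,0,0,0,1,1,1,1,1,0,0,1,1,1,0,0,0,0,0,0,0,0,1,1,1,1,0,0,1,1,1])
x2 = pd.Series([0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,1,1,0])
x1_clean = zero_out_echo_runs(x1, 3)
x2_clean = zero_out_echo_runs(x2, 3)
```

Because it compares run lengths instead of looking a fixed number of places back, this version does not need the assumption that every phase is at least threshold long.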

Generate binary outcome dummy data based on probability of items and its feature

I want to generate synthetic binary-outcome (0/1) sequence data from scratch. My data has the following properties.
For the sake of an example, let's say there are only 3 items in the sequence, namely A, B and C.
So the data is:
It's sequence-based data, so items A, B, C happen in an order.
Items A, B, C have features S, T, U, V, X, Y, Z, etc. (these features need to have some effect on generating outcome 1; think of them as feature importance).
The probability of conversion when A, B or C is encountered in the data is user defined (I want control so that if A occurs anywhere in the sequence, the overall probability of conversion to outcome 1 is, say, 2%; more below).
Items can repeat in a sequence, so a sequence can be like C->C->A, etc.
Given the probability of conversion for each item when it occurs in data (like when ever A is encountered in the sequence, probability of outcome 1 is about 2%, when B occurs, its 2.6% and so on, just an example), I want to generate data randomly. So generated data should look something like this -
ID Sequence Feature Outcome
1 A->B X 0
2 C->C->B Y 1
3 A->B X 1
4 A Z 0
5 A->B->A Z 0
6 C->C Y 1
and so on
When generating this data, I want to have control over -
Conversion probability of A,B and C essentially defining when A occurs probability of conversion is let say 2%, for B is 4% and for C is 3.6%.
Number of converted sequence for each sequence length (for example there can be max 3 sequence so for 3 sequence I want at-least 100000 data points having outcome 1)
Control over how many Items I can include (so A,B,C and D, 4 sequence length instead of 3)
Total number of data points if possible?
Is there any simple way through which I generate this data with keeping in mind all these parameters?

import pandas as pd
import itertools
import numpy as np
import random
alphabets=['A','B','C']
combinations=[]
for i in range(1,len(alphabets)+1):
combinations.append(['->'.join(i) for i in itertools.product(alphabets, repeat = i)])
combinations=(sum(combinations, []))
weights=np.random.normal(100,30,len(combinations))
weights/=sum(weights)
weights=weights.tolist()
#weights=np.random.dirichlet(np.ones(len(combinations))*1000.,size=1)
'''n = len(combinations)
weights = [random.random() for _ in range(n)]
sum_weights = sum(weights)
weights = [w/sum_weights for w in weights]'''
df=pd.DataFrame(random.choices(
population=combinations,weights=weights,
k=1000000),columns=['sequence'])
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
plt.hist(weights, bins = 20)
plt.show()
distribution=df.groupby('sequence').agg({'sequence':'count'}).rename(columns={'sequence':'Total_Numbers'}).reset_index()
plt.hist(distribution.Total_Numbers)
plt.show()
from tqdm import tqdm
A=0.2
B=0.8
C=0.1
count_AAA=count_AA=count_A=0
count_BBB=count_BB=count_B=0
count_CCC=count_CC=count_C=0
for i in tqdm(range(0,len(df))):
if(df.sequence[i]=='A->A->A'):
count_AAA+=1
if('A->A' in df.sequence[i]):
count_AA+=1
if('A' in df.sequence[i]):
count_A+=1
if(df.sequence[i]=='B->B->B'):
count_BBB+=1
if('B->B' in df.sequence[i]):
count_BB+=1
if('B' in df.sequence[i]):
count_B+=1
if(df.sequence[i]=='C->C->C'):
count_CCC+=1
if('C->C' in df.sequence[i]):
count_CC+=1
if('C' in df.sequence[i]):
count_C+=1
bi_AAA = np.random.binomial(1, A*0.9, count_AAA)
bi_AA = np.random.binomial(1, A*0.5, count_AA)
bi_A = np.random.binomial(1, A*0.1, count_A)
bi_BBB = np.random.binomial(1, B*0.9, count_BBB)
bi_BB = np.random.binomial(1, B*0.5, count_BB)
bi_B = np.random.binomial(1, B*0.1, count_B)
bi_CCC = np.random.binomial(1, C*0.9, count_CCC)
bi_CC = np.random.binomial(1, C*0.5, count_CC)
bi_C = np.random.binomial(1, C*0.15, count_C)
bi_BBB.sum()/count_BBB
AAA=AA=A=BBB=BB=B=CCC=CC=C=0
for i in tqdm(range(0,len(df))):
if(df.sequence[i]=='A->A->A'):
df.at[i, 'Outcome_AAA'] = bi_AAA[AAA]
AAA+=1
if('A->A' in df.sequence[i]):
df.at[i, 'Outcome_AA'] = bi_AA[AA]
AA+=1
if('A' in df.sequence[i]):
df.at[i, 'Outcome_A'] = bi_A[A]
A+=1
if(df.sequence[i]=='B->B->B'):
df.at[i, 'Outcome_BBB'] = bi_BBB[BBB]
BBB+=1
if('B->B' in df.sequence[i]):
df.at[i, 'Outcome_BB'] = bi_BB[BB]
BB+=1
if('B' in df.sequence[i]):
df.at[i, 'Outcome_B'] = bi_B[B]
B+=1
if(df.sequence[i]=='C->C->C'):
df.at[i, 'Outcome_CCC'] = bi_CCC[CCC]
CCC+=1
if('C->C' in df.sequence[i]):
df.at[i, 'Outcome_CC'] = bi_CC[CC]
CC+=1
if('C' in df.sequence[i]):
df.at[i, 'Outcome_C'] = bi_C[C]
C+=1
df=df.fillna(0)
df['Outcome']=df.apply(lambda x: 1 if x.Outcome_AAA+x.Outcome_BBB+x.Outcome_CCC+\
x.Outcome_AA+x.Outcome_BB+x.Outcome_CC+\
x.Outcome_A+x.Outcome_B+x.Outcome_C>0 else 0,1)
dataset=df[['sequence','Outcome']]

Although it may not be the most elegant method, you can achieve this with a for loop. For each row, split that row's Sequence into a list of events using .split(). You can find the count of each element using .count(), the length using len(), and the average/total outcome using np.sum() and np.mean(). Try using this code as a starting point:
df['Outcome'] = 0.0
for i, row in df.iterrows():
    list_of_events = row['Sequence'].split('->')
    # do your calculations on list_of_events here
    print(len(list_of_events))
    print(list_of_events.count("A"))
    my_calculation_for_outcome = list_of_events.count("B") * 0.02
    # .loc is indexed with square brackets, not called like a function
    df.loc[i, 'Outcome'] = my_calculation_for_outcome
To ensure the Outcome column has a given number of True values, you may want to look at: A fast way to find the largest N elements in an numpy array
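One way to turn per-item conversion probabilities into an outcome per sequence is sketched below. It rests on a modelling assumption that is not in the question: each item in a sequence is treated as an independent chance to convert, so the probability for the whole sequence is one minus the product of the per-item "no conversion" probabilities. The rates come from the question's example, while `ITEM_PROB`, `conversion_probability` and `simulate_outcomes` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Per-item conversion rates taken from the question's example
ITEM_PROB = {"A": 0.02, "B": 0.04, "C": 0.036}

def conversion_probability(sequence, item_prob=ITEM_PROB):
    """P(outcome = 1) for one sequence, assuming items act independently."""
    no_conversion = 1.0
    for item in sequence.split("->"):
        no_conversion *= 1.0 - item_prob[item]
    return 1.0 - no_conversion

def simulate_outcomes(df, seed=0):
    """Add a Probability column and a Bernoulli-sampled Outcome column."""
    rng = np.random.default_rng(seed)
    p = df["Sequence"].map(conversion_probability).to_numpy()
    return df.assign(Probability=p, Outcome=rng.binomial(1, p))

demo = pd.DataFrame({"Sequence": ["A->B", "C->C->B", "A", "A->B->A"]})
result = simulate_outcomes(demo)
```

Feature effects and a guaranteed number of converted rows per sequence length are not handled here; they would need extra terms in the probability or a resampling step.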

add a column to a pandas.dataframe that holds the index of the closest point with a certain condition

I have a huge number of points stored with x and y coordinates and an additional value ('value_P') in a pandas.dataframe so the dataframe looks like:
   x-coordinate  y-coordinate  value_P
0             0             3        1
1            40            58        1
2             5             4        2
3            76            98        2
4            15            35        3
5             5             4        3
but with around 250,000 entries, so I am looking for an efficient solution. I am trying to add a column that holds the row index of the closest other point, but only distances from points with value_P != 1 to points with value_P == 1 should be considered, and I am only interested in the index for points where value_P != 1. It's difficult to explain, but the desired output is:
   x-coordinate  y-coordinate  value_P  index
0             0             3        1    NaN
1            40            58        1    NaN
2             5             4        2      0
3            76            98        2      1
4            15            35        3      1
5             5             4        3      0
For row 1 the index is NaN because I am not interested in it, since value_P == 1. For row 2 it's 0, because the point in row 0 is the closest point with a value_P of 1.
I hope that's understandable.
I found a solution that involves two DataFrame.apply(lambda x: ...) calls, but it takes a long time. Even if you don't have a concrete solution, an idea for improving the performance would be highly appreciated.
My current code is below (P_sort is the data and 'zuord' is the added column):
import math
import numpy as np
import pandas as pnd
def index2(x_1, y_1, x_2, y_2, last_1):
    # Euclidean distance between (x_1, y_1) and (x_2, y_2)
    h = math.sqrt((x_1 - x_2) ** 2 + (y_1 - y_2) ** 2)
    return h
def index(x_1, y_1, x_v, y_v, last_1):
    df2 = pnd.DataFrame()
    df3 = pnd.DataFrame()
    df2['x-coordinate'] = x_v
    df2['y-coordinate'] = y_v
    df3['distances'] = df2.apply(
        lambda x: index2(x['x-coordinate'], x['y-coordinate'], x_1, y_1, last_1), axis=1)
    k = df3.idxmin()
    print(k)
    return k
last_1 = np.count_nonzero(P_sort[:, 2] == 1) - 1
df = pnd.DataFrame(P_sort,
columns=['x-coordinate', 'y-coordinate', 'value_P'])
number_columnx = df.loc[:, 'x-coordinate']
number_columny = df.loc[:, 'y-coordinate']
x_v = number_columnx.values
y_v = number_columny.values
x_v = x_v[0:last_1]
y_v = y_v[0:last_1]
df['zuord'] = df.apply(lambda x: index(x['x-coordinate'],x['y-coordinate'],x_v,y_v,last_1),axis=1)
I am new to programming so the code is kind of ugly

I benchmarked four solutions, and the fastest approach is a KD Tree.
Test Dataset
I randomly generated dataframes of various sizes to test the performance of each method.
import numpy as np
import pandas as pd
def generate_spots(n, p=0.005):
    x_pos = np.random.uniform(0, 100, n)
    y_pos = np.random.uniform(0, 100, n)
    # value_P is 1 with probability p and 2 otherwise
    value_P = np.random.binomial(size=n, n=1, p=(1 - p)) + 1
    df = pd.DataFrame({
        'x-coordinate': x_pos,
        'y-coordinate': y_pos,
        'value_P': value_P
    })
    df = df.sort_values('value_P').reset_index(drop=True)
    return df
This generates a dataframe with n rows, with a probability p that each row is class 1. I also sorted it, because the original method seems to assume that the dataframe is sorted by P.
Method 1: Original
I made some small changes to your code to get it to work for me:
def method1(df):
    df = df.copy()
    last_1 = np.count_nonzero(df.loc[:, 'value_P'] == 1)
    number_columnx = df.loc[:, 'x-coordinate']
    number_columny = df.loc[:, 'y-coordinate']
    x_v = number_columnx.values
    y_v = number_columny.values
    x_v = x_v[0:last_1]
    y_v = y_v[0:last_1]
    df['index'] = df.apply(lambda x: index(x['x-coordinate'],x['y-coordinate'],x_v,y_v,last_1),axis=1)
    df.loc[0:last_1 - 1, 'index'] = -1
    return df
index() and index2() are defined the same way as in your question. I also use -1 as a placeholder instead of NaN. No deep reason for this, just personal preference.
Method 2: cdist
SciPy has a function called cdist() which computes the distance between every pair of points drawn from two arrays of points.
import scipy.spatial.distance
def method2(df):
    df = df.copy()
    first_P_class = df['value_P'] == 1
    target_df = df.loc[first_P_class][['x-coordinate', 'y-coordinate']]
    source_df = df.loc[~first_P_class][['x-coordinate', 'y-coordinate']]
    nearest_point = scipy.spatial.distance.cdist(source_df, target_df).argmin(axis=1)
    df['index'] = -1
    df.loc[source_df.index, 'index'] = nearest_point
    return df
The cdist function is pretty much the same as what you're doing - it's just implemented in C rather than Python.
Method 3: KD Tree
A KD Tree is a data structure designed to efficiently search for nearby points. scikit-learn provides an implementation.
import sklearn.neighbors
def method3(df):
    df = df.copy()
    first_P_class = df['value_P'] == 1
    target_df = df.loc[first_P_class][['x-coordinate', 'y-coordinate']]
    source_df = df.loc[~first_P_class][['x-coordinate', 'y-coordinate']]
    tree = sklearn.neighbors.KDTree(target_df)
    nearest_point = tree.query(source_df, k=1, return_distance=False)
    df['index'] = -1
    df.loc[source_df.index, 'index'] = nearest_point.flatten()
    return df
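The tree query returns positions within the array of class-1 points, which coincide with row labels only because the test dataframe is sorted by value_P and re-indexed. A minimal sketch that maps the positions back to the original row labels, and so does not depend on the sort order, is shown below. It swaps in SciPy's `cKDTree` instead of scikit-learn's `KDTree`, and the function name `nearest_class1_label` is hypothetical.

```python
import math
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

def nearest_class1_label(df):
    """For rows with value_P != 1, return the row label of the nearest value_P == 1 point."""
    coords = ['x-coordinate', 'y-coordinate']
    is_target = df['value_P'] == 1
    targets = df.loc[is_target, coords]
    sources = df.loc[~is_target, coords]
    tree = cKDTree(targets.to_numpy())
    _, pos = tree.query(sources.to_numpy(), k=1)    # positions within `targets`
    out = pd.Series(np.nan, index=df.index, name='index')
    out.loc[sources.index] = targets.index.to_numpy()[pos]   # positions -> row labels
    return out

# The six example points from the question
points = pd.DataFrame({
    'x-coordinate': [0, 40, 5, 76, 15, 5],
    'y-coordinate': [3, 58, 4, 98, 35, 4],
    'value_P': [1, 1, 2, 2, 3, 3],
})
nearest = nearest_class1_label(points)
```

Rows with value_P == 1 keep NaN, matching the desired output in the question, at the cost of the column being stored as floats.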
Method 4: fastdist
The Python package fastdist bills itself as a faster alternative to scipy's distance calculation methods. Ironically, I found this solution to be slower than cdist at all problem sizes.
from fastdist import fastdist
def method4(df):
    df = df.copy()
    first_P_class = df['value_P'] == 1
    target_df = df.loc[first_P_class][['x-coordinate', 'y-coordinate']]
    target_array = target_df.to_numpy()
    source_df = df.loc[~first_P_class][['x-coordinate', 'y-coordinate']]
    source_array = source_df.to_numpy()
    nearest_point = fastdist.matrix_to_matrix_distance(source_array, target_array, fastdist.euclidean, "euclidean").argmin(axis=1)
    df['index'] = -1
    df.loc[source_df.index, 'index'] = nearest_point
    return df
Benchmarks
Each method was run ten times, with various sizes of dataframe, in random order. Here are the results of the benchmark. Note that both the X and Y axes are log-scale.
I didn't benchmark fastdist or the original method for more than 30,000 points, because it took too long.
The fastest methods, in this benchmark, are the cdist method, for fewer than 1000 points, and KD Tree method, for more than 1000 points. At 250K points, the fastest solution is the KD Tree, taking only 0.2 seconds.

Chunk a variable into parts and sum the total in each part

My dataset has 2 million observations. I want to split it into 200 categories based on the value of a variable, 'rv'. For example, imagine I had the categories 0-1000, 1000-2000, 2000-3000, 3000-4000, 4000-5000 I would want to split an observation with value 4500 like this: 1000 in each of the 1st 4 categories, and 500 in the final category. I have the following code, which works but is very slow:
# create random data set
import pandas as pd
import numpy as np
data = np.random.randint(0, 5000, size=2000)
df = pd.DataFrame({'rv': data})
#%% slice
sizes = [0, 1000, 2000, 3000, 4000, 5000]
size_names = ['{:.0f} to {:.0f}'.format(lower, upper) for lower, upper in zip(sizes[0:-1], sizes[1:])]
for lower, upper, name in zip(sizes[0:-1], sizes[1:], size_names):
    df[name] = df['rv'].apply(lambda x: max(0, (min(x, upper) - lower)))
# summary table
df_slice = df[size_names].sum()
Are there better ways of doing this, where better means faster principally? With 2 million observations and 200 categories this takes quite a long time (not sure how long as I stopped the code before it had finished).

I wrote an algorithm that sorts the data beforehand, which turns an O(n*m) loop (over the data and the categories) into an O(n) loop (just over the data, although sorting costs O(n log n)). Once the data is sorted, you always know which bin you are in: you accumulate the sum for that bin, and when you move on you add it to that bin and credit every lower bin with its full width once per observation. It takes about 1.2 seconds on 2 million data points over 200 categories. Hope it helps:
from time import time
from random import randint
data = [randint(0, 4999) for i in range(2000000)]
sizes = range(0, 5001, 25)
bound_pairs = [[sizes[i], sizes[i + 1]] for i in range(len(sizes) - 1)]
results = [0 for i in range(len(sizes) - 1)]
data.sort()
curr_bin = 0
curr_bin_count = 0
curr_bin_sum = 0
for d in data:
    if d >= bound_pairs[curr_bin][1]:
        results[curr_bin] += curr_bin_sum
        for i in range(curr_bin):
            results[i] += curr_bin_count * (bound_pairs[i][1] - bound_pairs[i][0])
        curr_bin_count = 0
        curr_bin_sum = 0
        while d >= bound_pairs[curr_bin][1]:
            curr_bin += 1
    curr_bin_count += 1
    curr_bin_sum += d - bound_pairs[curr_bin][0]
results[curr_bin] += curr_bin_sum
for i in range(curr_bin):
    results[i] += curr_bin_count * (bound_pairs[i][1] - bound_pairs[i][0])
EDIT: There may be some issues here depending on whether you want the upper bound or lower bound to be inclusive or exclusive. I leave the particulars to you.
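A vectorized alternative is also possible. The per-row expression in the question, max(0, min(x, upper) - lower), equals clipping x - lower to the range [0, upper - lower], so each category total can be computed with one NumPy pass over the data. The sketch below assumes integer values and sorted category edges; the function name `chunk_sums` is made up for illustration.

```python
import numpy as np

def chunk_sums(values, edges):
    """Total amount of `values` falling into each [edges[k], edges[k+1]) slice."""
    values = np.asarray(values, dtype=np.int64)
    edges = np.asarray(edges, dtype=np.int64)
    totals = np.empty(len(edges) - 1, dtype=np.int64)
    for k, (lower, upper) in enumerate(zip(edges[:-1], edges[1:])):
        # Same as max(0, min(x, upper) - lower) applied to every x at once
        totals[k] = np.clip(values - lower, 0, upper - lower).sum()
    return totals

edges = [0, 1000, 2000, 3000, 4000, 5000]
single = chunk_sums([4500], edges)

rng = np.random.default_rng(0)
rv = rng.integers(0, 5000, size=2000)
totals = chunk_sums(rv, edges)
```

The loop runs once per category rather than once per observation, so the Python-level work is 200 iterations for 200 categories, and the category totals add up to the sum of the data whenever the edges cover its full range.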

Storing all values when creating a Pandas Pivot Table

Basically, I'm aggregating prices over three indices to determine: mean, std, as well as an upper/lower limit. So far so good. However, now I want to also find the lowest identified price which is still >= the computed lower limit.
My first idea was to use np.min to find the lowest price -> this obviously disregards the lower-limit and is not useful. Now I'm trying to store all the values the pivot table identified to find the price which still is >= lower-limit. Any ideas?
pivot = pd.pivot_table(temp, index=['A','B','C'],values=['price'], aggfunc=[np.mean,np.std],fill_value=0)
pivot['lower_limit'] = pivot['mean'] - 2 * pivot['std']
pivot['upper_limit'] = pivot['mean'] + 2 * pivot['std']

First, merge pivoted['lower_limit'] back into temp, so that each price in temp has a corresponding lower_limit value.
temp = pd.merge(temp, pivoted['lower_limit'].reset_index(), on=ABC)
Then you can restrict your attention to those rows in temp for which the price is >= lower_limit:
temp.loc[temp['price'] >= temp['lower_limit']]
The desired result can be found by computing a groupby/min:
result = temp.loc[temp['price'] >= temp['lower_limit']].groupby(ABC)['price'].min()
For example,
import numpy as np
import pandas as pd
np.random.seed(2017)
N = 1000
ABC = list('ABC')
temp = pd.DataFrame(np.random.randint(2, size=(N,3)), columns=ABC)
temp['price'] = np.random.random(N)
pivoted = pd.pivot_table(temp, index=['A','B','C'],values=['price'],
aggfunc=[np.mean,np.std],fill_value=0)
pivoted['lower_limit'] = pivoted['mean'] - 2 * pivoted['std']
pivoted['upper_limit'] = pivoted['mean'] + 2 * pivoted['std']
temp = pd.merge(temp, pivoted['lower_limit'].reset_index(), on=ABC)
result = temp.loc[temp['price'] >= temp['lower_limit']].groupby(ABC)['price'].min()
print(result)
yields
A  B  C
0  0  0    0.003628
      1    0.000132
   1  0    0.005833
      1    0.000159
1  0  0    0.006203
      1    0.000536
   1  0    0.001745
      1    0.025713
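The merge step can also be avoided by broadcasting the group statistics back onto the rows with `groupby(...).transform`. The following is a sketch of that alternative, not the code above; the function name `min_price_above_lower_limit` and the parameter `n_std` are hypothetical, and `transform('std')` uses pandas' default sample standard deviation.

```python
import pandas as pd

def min_price_above_lower_limit(temp, keys, n_std=2):
    """Lowest price per group that is still >= mean - n_std * std of that group."""
    grouped = temp.groupby(keys)['price']
    # transform returns one value per row, aligned with temp's index
    lower_limit = grouped.transform('mean') - n_std * grouped.transform('std')
    kept = temp[temp['price'] >= lower_limit]
    return kept.groupby(keys)['price'].min()

# Group 0 has one outlier (1.0) below its lower limit; group 1 has none
demo = pd.DataFrame({
    'A': [0] * 10 + [1] * 3,
    'price': [1.0] + [10.0] * 9 + [5.0, 6.0, 7.0],
})
result = min_price_above_lower_limit(demo, 'A')
```

Passing `['A', 'B', 'C']` as `keys` would give a result indexed like the pivot table in the example.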
