I have a variable with zeros and ones. Each sequence of ones represents "a phase" I want to observe; each sequence of zeros represents the space/distance that occurs between these phases.
It may happen that a phase carries a sort of "impulse response", for example the echo of a voice: in this case we will have 1,1,1,1,0,0,1,1,1,0,0,0 as an output, where the first sequence of ones is the shout we made, while the second one is just the echo caused by the shout.
So I made a function that does not take into account the echoes/responses of the main shout/action and converts the sequence of ones of the echo/response into zeros.
(1) If the sequence of zeros is greater than or equal to the input threshold nearby_thr, the function will recognize that the following sequence of ones is an independent phase and it won't delete or change anything.
(2) If the sequence of zeros (between two sequences of ones) is smaller than the input threshold nearby_thr, the function will recognize that we have "an impulse response/echo" and we do not take it into account. In fact, it will convert those ones into zeros.
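For example, with nearby_thr = 3 the sequence above, 1,1,1,1,0,0,1,1,1,0,0,0, should become 1,1,1,1,0,0,0,0,0,0,0,0: only two zeros separate the second run of ones from the first, so it is treated as an echo and zeroed out.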
I wrote a naive function that accomplishes this, but I was wondering if pandas already has a function like that, or if it can be done in a few lines, without writing a "C-like" function.
Here's my code:
import pandas as pd
import matplotlib.pyplot as plt
# import utili_funzioni.util00 as ut0
x1 = pd.DataFrame([0,0,0,0,0,0,0,1,1,1,1,1,0,0,1,1,1,0,0,0,0,0,0,0,0,1,1,1,1,0,0,1,1,1])
x2 = pd.DataFrame([0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,1,1,0])
# rule = x1==1 ## counting number of consecutive ones
# cumsum_ones = rule.cumsum() - rule.cumsum().where(~rule).ffill().fillna(0).astype(int)
def detect_nearby_el_2(df, nearby_thr):
    global el2del
    # df = consecut_zeros
    # i = 0
    print("")
    print("")
    j = 0
    enterOnce_if = 1
    reset_count_0s = 0
    start2detect = False
    count0s = 0  # init
    start2_getidxs = False  # if this is not true, it won't store idxs to delete
    el2del = []  # store idxs of elements to delete
    for i in range(df.shape[0]):
        print("")
        print("i: ", i)
        x_i = df.iloc[i, 0]
        if x_i == 1 and j == 0:  # first phase (ones) has been detected
            start2detect = True  # first phase (ones) has been detected
            # j += 1
        print("count0s:", count0s)
        if start2detect == True:  # first phase seen/detected --> (wait) has ended..
            if x_i == 0:  # 1st phase detected and ended with "a zero"
                if reset_count_0s == 1:
                    count0s = 0
                    reset_count_0s = 0
                count0s += 1
                if enterOnce_if == 1:
                    start2_getidxs = True  # avoid deleting the first phase
                    enterOnce_if = 0
        if start2_getidxs == True:  # avoid deleting the first phase
            if x_i == 1 and count0s < nearby_thr:
                print("this is NOT a new phase!")
                el2del = [*el2del, i]  # idxs to delete
                reset_count_0s = 1  # reset counter
            if x_i == 1 and count0s >= nearby_thr:
                print("this is a new phase!")  # nothing to delete
                reset_count_0s = 1  # reset counter
    return el2del
def convert_nearby_el_into_zeros(df, idx):
    df0 = df + 0  # copy, otherwise the original dataframe is modified!
    if len(idx) > 0:
        # df.drop(df.index[idx])  # to delete completely
        df0.iloc[idx] = 0
    else:
        print("no elements nearby to delete!!")
    return df0
######
print("")
x1_2del = detect_nearby_el_2(df=x1,nearby_thr=3)
x2_2del = detect_nearby_el_2(df=x2,nearby_thr=3)
## deleting nearby elements
x1_a = convert_nearby_el_into_zeros(df=x1,idx=x1_2del)
x2_a = convert_nearby_el_into_zeros(df=x2,idx=x2_2del)
## PLOTTING
# ut0.grayplt()
fig1 = plt.figure()
fig1.suptitle("x1",fontsize=20)
ax1 = fig1.add_subplot(1,2,1)
ax2 = fig1.add_subplot(1,2,2,sharey=ax1)
ax1.title.set_text("PRE-detect")
ax2.title.set_text("POST-detect")
line1, = ax1.plot(x1)
line2, = ax2.plot(x1_a)
fig2 = plt.figure()
fig2.suptitle("x2",fontsize=20)
ax1 = fig2.add_subplot(1,2,1)
ax2 = fig2.add_subplot(1,2,2,sharey=ax1)
ax1.title.set_text("PRE-detect")
ax2.title.set_text("POST-detect")
line1, = ax1.plot(x2)
line2, = ax2.plot(x2_a)
You can see that x1 has two "responses/echoes" that I do not want to take into account, while x2 has none; in fact nothing changed in x2.
My question is: how can this be accomplished in a few lines using pandas?
Thank you
Interesting problem, and I'm sure there's a more elegant solution out there, but here is my attempt - it's at least fairly performant:
x1 = pd.Series([0,0,0,0,0,0,0,1,1,1,1,1,0,0,1,1,1,0,0,0,0,0,0,0,0,1,1,1,1,0,0,1,1,1])
x2 = pd.Series([0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,1,1,0])
def remove_echos(series, threshold):
    starting_points = (series==1) & (series.shift()==0)
    echo_starting_points = starting_points & (series.shift(threshold)==1)
    echo_starting_points = series[echo_starting_points].index
    change_points = series[starting_points].index.to_list() + [series.index[-1]]
    for (start, end) in zip(change_points, change_points[1:]):
        if start in echo_starting_points:
            series.loc[start:end] = 0
    return series
x1 = remove_echos(x1, 3)
x2 = remove_echos(x2, 3)
(I changed x1 and x2 to be Series instead of DataFrame; it's easy to adapt this code to work with a df if you need to.)
Explanation: we define the "starting point" of each section as a 1 preceded by a 0. Of those, we define an "echo" starting point as one where the point threshold places before is a 1. (The assumption is that we don't have a phase shorter than threshold.) For each echo starting point, we zero from it to the next starting point or the end of the Series.
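If you want to keep the data in a single-column DataFrame like the original x1/x2, a minimal sketch of the adaptation (not tested against all edge cases) is to run the function on the column and assign it back:
x1_df = pd.DataFrame([0,0,0,0,0,0,0,1,1,1,1,1,0,0,1,1,1,0,0,0,0,0,0,0,0,1,1,1,1,0,0,1,1,1])
x1_df[0] = remove_echos(x1_df[0], 3)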
Related
Short description
I want to walk along a numpy 2D array starting from different points in specified directions (either 1 or -1) until a column changes (see below)
Current code
First let's generate a dataset:
import numpy as np
import time
# Generate big random dataset
# first column is an id, the second a number (the metadata), the third a direction (1 or -1)
np.random.seed(123)
c1 = np.random.randint(0,100,size = 1000000)
c2 = np.random.randint(0,20,size = 1000000)
c3 = np.random.choice([1,-1],1000000 )
m = np.vstack((c1, c2, c3)).T
m = m[m[:,0].argsort()]
Then I wrote the following code that starts at specific rows in the matrix (start_points) then keeps extending in the specified direction (direction_array) until the metadata changes:
def walk(mat, start_array):
    start_mat = mat[start_array]
    metadata = start_mat[:,1]
    direction_array = start_mat[:,2]
    walk_array = start_array
    while True:
        walk_array = np.add(walk_array, direction_array)
        try:
            walk_mat = mat[walk_array]
            walk_metadata = walk_mat[:,1]
            if sorted(metadata) != sorted(walk_metadata):
                raise IndexError
        except IndexError:
            return start_mat, mat[walk_array + (direction_array * -1)]

s = time.time()
for i in range(100000):
    start_points = np.random.randint(0, 1000000, size=3)
    res = walk(m, start_points)
Question
While the above code works fine, I think there must be an easier/more elegant way to walk along a numpy 2D array from different start points until the value of another column changes. For example, this requires me to slice the input array for every step in the while loop, which seems quite inefficient (especially when I have to run walk millions of times).
You don't have to slice the whole input array in the while loop. You can use just the column whose values you want to check.
I also refactored your code a bit so that there is no while True statement and no if that raises an error for no particular reason.
Code:
def walk(mat, start_array):
    start_mat = mat[start_array]
    metadata = sorted(start_mat[:,1])
    direction_array = start_mat[:,2]
    data = mat[:,1]
    walk_array = np.add(start_array, direction_array)
    try:
        while metadata == sorted(data[walk_array]):
            walk_array = np.add(walk_array, direction_array)
    except IndexError:
        pass
    return start_mat, mat[walk_array - direction_array]
In this particular case, if len(start_array) is a big number (thousands of elements) you could use collections.Counter instead of sorted, as it will be much faster.
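A sketch of that variant (same structure as the function above, with the sorted() comparisons swapped for Counter; the function name here is just illustrative):
from collections import Counter

def walk_with_counter(mat, start_array):
    start_mat = mat[start_array]
    metadata = Counter(start_mat[:,1])
    direction_array = start_mat[:,2]
    data = mat[:,1]
    walk_array = np.add(start_array, direction_array)
    try:
        # Counter comparison replaces the sorted-list comparison
        while metadata == Counter(data[walk_array]):
            walk_array = np.add(walk_array, direction_array)
    except IndexError:
        pass
    return start_mat, mat[walk_array - direction_array]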
I was also thinking of another approach, in which there would be an array with the desired slices in the correct direction.
But this approach seems very dirty. Anyway, I will post it; maybe you will find it useful.
Code:
def walk(mat, start_array):
    start_mat = mat[start_array]
    metadata = sorted(start_mat[:,1])
    direction_array = start_mat[:,2]
    data = mat[:,1]
    walk_slices = zip(*[
        data[start_array[i]+direction_array[i]::direction_array[i]]
        for i in range(len(start_array))
    ])
    for step, walk_metadata in enumerate(walk_slices):
        if metadata != sorted(walk_metadata):
            break
    return start_mat, mat[start_array + (direction_array * step)]
To perform the operation starting from a single row, define the following class:
class Walker:
    def __init__(self, tbl, row):
        self.tbl = tbl
        self.row = row
        self.dir = self.tbl[self.row, 2]

    # How many rows can I move from "row" in the indicated direction
    # while the metadata doesn't change?
    def numEq(self):
        # Metadata from "row" in the required direction
        md = self.tbl[self.row::self.dir, 1]
        return ((md != md[0]).cumsum() == 0).sum() - 1

    # Get the row "n" positions from "row" in the indicated direction
    def getRow(self, n):
        return self.tbl[self.row + n * self.dir]
Then, to get the result, run:
def walk_2(m, start_points):
    # Create walkers for each starting point
    wlk = [Walker(m, n) for n in start_points]
    # How many rows can I move?
    dist = min([w.numEq() for w in wlk])
    # Return rows from the changed positions
    return np.vstack([w.getRow(dist) for w in wlk])
The execution time of my code is roughly the same as yours,
but in my opinion my code is more readable and concise.
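For reference, a hypothetical usage sketch (reusing m from the question and timing walk_2 the same way the original walk was timed):
import time

s = time.time()
for i in range(100000):
    start_points = np.random.randint(0, len(m), size=3)
    res = walk_2(m, start_points)
print(time.time() - s)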
I have a pandas df containing weights. Rows are dates and columns are asset names. Every row sums to 1.
I want to run
df_with_stocks_weight.apply(rescale_w, weight_min=0.01, weight_max=0.30)
in order to change the weights so that they still sum to 1 but have a minimum value of 1% and a maximum value of 30%. I tried using the function below, but I get problems with the index: the calculated values are correct, but the output refers to the wrong asset!
def rescale_w(row_input, weight_min, weight_max):
    '''
    :param row_input: a row from a pandas df
    :param weight_min: the floor. type float.
    :param weight_max: the cap. type float.
    :return: a pandas row where weights are adjusted to the specified min and max.

    step 1:
    while any asset has a weight above weight_max,
    set that asset's weight to == weight_max
    and distribute the leftovers to all other assets (whose weights are > 0)
    in accordance with their weight.

    step 2:
    if there is a positive weight below weight_min,
    force it to == weight_min
    by stealing from every other asset
    (except those whose weight == weight_max).

    note that the function produces strange output with few assets.
    for example with 3 assets and max 30% the sum is 0.90,
    and if A=50%, B=20% and one other asset is 1% then
    these are not practical problems as we will analyze data with many assets.
    '''
    # rename
    w1 = row_input

    # na
    # the script returned many errors regarding na,
    # so I do a fillna(0) here.
    # if this is the final solution, some cleaning up can be done,
    # e.g. remove _null objects and remove some assertions.
    w1 = w1.fillna(0)

    # remove zeroes to get a faster script
    w1nz = w1[w1 > 0]
    w1z = w1[w1 == 0]
    assert len(w1) == len(w1nz) + len(w1z)
    assert set(w1nz.index).intersection(set(w1z.index)) == set()

    # input must sum to 1
    assert abs(w1nz.sum() - 1) < 0.001

    # only execute if there is at least one notnull value
    # below will work with nz
    if len(w1nz) > 0:
        # step 1: make sure the upper threshold is satisfied
        while max(w1nz) > weight_max:
            # clip at 30%
            w2 = w1nz.clip(upper=weight_max)
            # calc leftovers from this upper clip
            leftover_upper = 1 - w2.sum()
            # add leftovers to the untouched, in accordance with weight
            w2_touched = w2[w2 == weight_max]
            w2_unt = w2[(weight_max > w2) & (w2 > 0)]
            w2_unt_added = w2_unt + leftover_upper * w2_unt / w2_unt.sum()
            # concat all back
            w3 = pd.concat([w2_touched, w2_unt_added], axis=0)
            # same index for output and input
            #w3 = w3.reindex(w1nz.index)  # todo: trying to remove .reindex everywhere; checking whether pandas resolves it automatically by itself
            # rename w3 so that it works in a while loop
            w1nz = w3

        usestep2 = False
        if usestep2:
            # step 2: make sure the lower threshold is satisfied
            if min(w1nz) < weight_min:
                # three parts: lower, middle, upper.
                # those in "lower" will receive from those in "middle"
                upper = w1nz[w1nz >= weight_max]
                middle = w1nz[(w1nz > weight_min) & (w1nz < weight_max)]
                lower = w1nz[w1nz <= weight_min]
                # assert len
                assert (len(upper) + len(middle) + len(lower) == len(w1nz))
                # change lower to == weight_min
                lower_modified = lower.clip(lower=weight_min)
                # the weight given to "lower" is stolen from "middle"
                stolen_weights = lower_modified.sum() - lower.sum()
                middle_modified = middle - stolen_weights * middle / middle.sum()
                # concat
                w4 = pd.concat([lower_modified,
                                middle_modified,
                                upper], axis=0)
                # reindex
                #w4 = w4.reindex(w1nz.index)
                # rename
                w1nz = w4

    # lastly, concat adjusted nonzero with zero.
    w1adj = pd.concat([w1nz, w1z], axis=0)
    w1adj = w1adj.reindex(w1.index)  # works?
    assert (w1adj.index == w1.index).all()
    assert abs(w1adj.sum() - 1) < 0.001
    return w1adj
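For reproducing the issue, here is a minimal toy setup (hypothetical asset names and weights, with six assets to stay away from the few-asset edge case mentioned in the docstring) that calls the function row by row. Note that DataFrame.apply applies the function column-wise by default, so axis=1 is needed if each call should receive one date's row; whether that alone explains the misaligned output is not certain:
import pandas as pd

df_with_stocks_weight = pd.DataFrame(
    [[0.50, 0.20, 0.10, 0.10, 0.05, 0.05],
     [0.25, 0.25, 0.20, 0.10, 0.10, 0.10]],
    index=pd.to_datetime(["2020-01-01", "2020-01-02"]),  # hypothetical dates
    columns=["asset_A", "asset_B", "asset_C", "asset_D", "asset_E", "asset_F"],
)
result = df_with_stocks_weight.apply(rescale_w, axis=1, weight_min=0.01, weight_max=0.30)
print(result.sum(axis=1))  # each row should still sum to 1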
I'm writing a method for calculating the covariance of 2 to 8 time-series variables. I'm intending for the variables to be contained in list objects when they are passed to this method. The method should return 1 number, not a covariance matrix.
The method works fine the first time it's called. Any time it's called after that, it returns 0. An example is attached at the bottom, below my code. Any advice/feedback regarding the variable scope issues here would be greatly appreciated. Thanks!
p = [3,4,4,654]
o = [4,67,4,1]

class Toolkit():
    def CovarianceScalar(self, column1, column2 = [], column3 = [], column4 = [], column5 = [], column6 = [], column7 = [], column8 = []):
        """Assumes all columns have length equal to Len(column1)"""
        # If only the first column is passed, this will act as a variance function
        import numpy as npObject
        # This is a binary-style number that is assigned a value of 1 if one of the input vectors/lists has
        # zero length. This way, the CovarianceResult variable can be computed, and the relevant terms can
        # have a 1 added to them if they would otherwise go to 0, preventing the CovarianceResult value from
        # incorrectly going to 0.
        binUnityFlag2 = 1 if (len(column2) == 0) else 0
        binUnityFlag3 = 1 if (len(column3) == 0) else 0
        binUnityFlag4 = 1 if (len(column4) == 0) else 0
        binUnityFlag5 = 1 if (len(column5) == 0) else 0
        binUnityFlag6 = 1 if (len(column6) == 0) else 0
        binUnityFlag7 = 1 if (len(column7) == 0) else 0
        binUnityFlag8 = 1 if (len(column8) == 0) else 0
        # Some initial housekeeping: ensure that all input column lengths match that of the first column.
        # (Will later advise the user if they do not.)
        lngExpectedColumnLength = len(column1)
        inputList = [column2, column3, column4, column5, column6, column7, column8]
        inputListNames = ["column2", "column3", "column4", "column5", "column6", "column7", "column8"]
        for i in range(0, len(inputList)):
            while len(inputList[i]) < lngExpectedColumnLength:  # Empty inputs now become vectors of 1's.
                inputList[i].append(1)
        # Now start calculating the covariance of the inputs:
        avgColumn1 = sum(column1)/len(column1)  # <-- Each column's average
        avgColumn2 = sum(column2)/len(column2)
        avgColumn3 = sum(column3)/len(column3)
        avgColumn4 = sum(column4)/len(column4)
        avgColumn5 = sum(column5)/len(column5)
        avgColumn6 = sum(column6)/len(column6)
        avgColumn7 = sum(column7)/len(column7)
        avgColumn8 = sum(column8)/len(column8)
        avgList = [avgColumn1, avgColumn2, avgColumn3, avgColumn4, avgColumn5, avgColumn6, avgColumn7, avgColumn8]
        # start building the scalar-valued result:
        CovarianceResult = float(0)
        for i in range(0, lngExpectedColumnLength):
            CovarianceResult += ((column1[i] - avgColumn1) * ((column2[i] - avgColumn2) + binUnityFlag2) * ((column3[i] - avgColumn3) + binUnityFlag3) * ((column4[i] - avgColumn4) + binUnityFlag4) * ((column5[i] - avgColumn5) + binUnityFlag5) * ((column6[i] - avgColumn6) + binUnityFlag6) * ((column7[i] - avgColumn7) + binUnityFlag7) * ((column8[i] - avgColumn8) + binUnityFlag8))
        # Finally, divide the sum of the multiplied deviations by the sample size:
        CovarianceResult = float(CovarianceResult)/float(lngExpectedColumnLength)  # Coerce both terms to a float type to prevent return of array-type objects.
        return CovarianceResult
Example:
myInst = Toolkit() #Create a class instance.
First execution of the function:
myInst.CovarianceScalar(o,p)
#Returns -2921.25, the covariance of the numbers in lists o and p.
Second time around:
myInst.CovarianceScalar(o,p)
#Returns: 0.0
I believe that the problem you are facing is due to mutable default arguments. Basically, when you first execute myInst.CovarianceScalar(o,p), all columns other than the first two are []. During this execution, you change the arguments. Thus, when you execute the same call again, myInst.CovarianceScalar(o,p), the other columns in the arguments are not [] anymore: they keep whatever values they ended up with as a result of the first execution.
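A minimal illustration of the pitfall and the usual fix (generic names, not the original method):
def appender(values=[]):
    values.append(1)               # mutates the single default list shared across calls
    return len(values)

print(appender(), appender())      # 1 2  -- the default list keeps growing between calls

def appender_fixed(values=None):
    if values is None:             # create a fresh list on every call instead
        values = []
    values.append(1)
    return len(values)

print(appender_fixed(), appender_fixed())  # 1 1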
This probably points to scipy/numpy, but right now I'm happy with any functionality, as I couldn't find anything in those packages. I have a matrix that contains data for a multivariate distribution (let's say 2 variables, for the fun of it). Is there any function to compute (higher) moments of that? All I could find was numpy.mean() and numpy.cov() :o
Thanks :)
/edit:
So, some more detail: I have multivariate data, that is, a matrix where rows are variables and columns are observations. Now I would like to have a simple way of computing the joint moments of that data, as defined in http://en.wikipedia.org/wiki/Central_moment#Multivariate_moments .
I'm pretty new to python/scipy, so I'm not sure I'd be the best person to code this one up, especially for the n-variable case (note that the Wikipedia definition is for n=2), and I kind of expected there to be some out-of-the-box thing to use, as I thought this would be a standard problem.
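For what it's worth, a small sketch (just an illustration, not the code below) of a sample joint central moment E[(X - E[X])**p * (Y - E[Y])**q] for two variables, with variables in rows and observations in columns as described above:
import numpy as np

def joint_central_moment(data, p, q):
    # subtract each variable's sample mean, then average the product of powers
    x = data[0] - data[0].mean()
    y = data[1] - data[1].mean()
    return np.mean(x**p * y**q)

data = np.random.rand(2, 1000)
print(joint_central_moment(data, 1, 1))  # equals the biased sample covariance, np.cov(data, bias=True)[0, 1]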
/edit2:
Just for the future, in case someone wants to do something similar, the following code (which is still under review) should give the sample equivalent of the raw moments E(X^2), E(Y^2), etc. It only works for two variables right now, but it should be extendable if one feels the need. If you see some mistakes or unclean/un-Pythonic code, feel free to comment.
from numpy import *

# this function should return something like
# moments[0] = 1
# moments[1] = mean(X), mean(Y)
# moments[2] = 1/n*X'X, 1/n*X'Y, 1/n*Y'Y
# moments[3] = mean(X'X'X), mean(X'X'Y), mean(X'Y'Y),
#              mean(Y'Y'Y)
# etc
def getRawMoments(data, moment, axis=0):
    a = moment
    if (axis == 0):
        n = data.shape[1]
        X = matrix(data[0,:]).reshape((n,1))
        Y = matrix(data[1,:]).reshape((n,1))
    else:
        n = data.shape[0]
        X = matrix(data[:,0]).reshape((n,1))
        Y = matrix(data[:,1]).reshape((n,1))

    result = 1
    Z = hstack((X,Y))
    iota = ones((1,n))
    moments = {}
    moments[0] = 1

    # first, generate a huge matrix containing all x-y combinations
    # for every power combination k,l such that k+l = i
    # for all 0 <= i <= a
    for i in arange(1,a):
        if i == 2:
            moments[i] = moments[i-1]*Z
        # if odd, postmultiply with Z.T
        elif i % 2 == 1:
            moments[i] = kron(moments[i-1], Z.T)
        # else (even), postmultiply with Z
        elif i % 2 == 0:
            temp = moments[i-1]
            temp2 = temp[:,0:n]*Z
            temp3 = temp[:,n:2*n]*Z
            moments[i] = hstack((temp2, temp3))

    # since we now have many duplicate moments,
    # such as x**2*y and x*y*x, filter non-distinct elements
    momentsDistinct = {}
    momentsDistinct[0] = 1
    for i in arange(1,a):
        if i % 2 == 0:
            data = 1/n*moments[i]
        elif i == 1:
            temp = moments[i]
            temp2 = temp[:,0:n]*iota.T
            data = 1/n*hstack((temp2))
        else:
            temp = moments[i]
            temp2 = temp[:,0:n]*iota.T
            temp3 = temp[:,n:2*n]*iota.T
            data = 1/n*hstack((temp2, temp3))
        momentsDistinct[i] = unique(data.flat)
    return momentsDistinct
In numpy I have a dataset like this. The first two columns are indices. I can divide my dataset into blocks via the indices, i.e. the first block is 0 0, the second block is 0 1, the third block 0 2, then 1 0, 1 1, 1 2 and so on and so forth. Each block has at least two elements. The numbers in the indices columns can vary.
I need to split the dataset along these blocks 80%-20% randomly, such that after the split each block in both datasets has at least 1 element. How could I do that?
indices | real data
|
0 0 | 43.25 665.32 ... } 1st block
0 0 | 11.234 }
0 1 ... } 2nd block
0 1 }
0 2 } 3rd block
0 2 }
1 0 } 4th block
1 0 }
1 0 }
1 1 ...
1 1
1 2
1 2
2 0
2 0
2 1
2 1
2 1
...
See how you like this. To introduce randomness, I am shuffling the entire dataset. It is the only way I have figured out how to do the splitting vectorized. Maybe you could simply shuffle an indexing array, but that was one indirection too many for my brain today. I have also used a structured array, for ease in extracting the blocks. First, let's create a sample dataset:
from __future__ import division
import numpy as np

# Create a sample data set
c1, c2 = 10, 5
idx1, idx2 = np.arange(c1), np.arange(c2)
idx1, idx2 = np.repeat(idx1, c2), np.tile(idx2, c1)
items = 1000
i = np.random.randint(c1*c2, size=(items - 2*c1*c2,))
d = np.random.rand(items+5)
dataset = np.empty((items+5,), [('idx1', int), ('idx2', int),
                                ('data', float)])
dataset['idx1'][:2*c1*c2] = np.tile(idx1, 2)
dataset['idx1'][2*c1*c2:-5] = idx1[i]
dataset['idx2'][:2*c1*c2] = np.tile(idx2, 2)
dataset['idx2'][2*c1*c2:-5] = idx2[i]
dataset['data'] = d
# Add blocks with only 2 and only 3 elements to test the corner case
dataset['idx1'][-5:] = -1
dataset['idx2'][-5:] = [0] * 2 + [1] * 3
And now the stratified sampling:
# For randomness, shuffle the entire array
np.random.shuffle(dataset)
blocks, _ = np.unique(dataset[['idx1', 'idx2']], return_inverse=True)
block_count = np.bincount(_)
where = np.argsort(_)
block_start = np.concatenate(([0], np.cumsum(block_count)[:-1]))
# If we have n elements in a block, and we assign 1 to each array, we
# are left with only n-2. If we randomly assign a fraction x of these
# to the first array, the expected ratio of items will be
# (x*(n-2) + 1) : ((1-x)*(n-2) + 1)
# Setting the ratio equal to 4 (80/20) and solving for x, we get
# x = 4/5 + 3/5/(n-2)
x = 4/5 + 3/5/(block_count - 2)
x = np.clip(x, 0, 1) # if n in (2, 3), the ratio is larger than 1
threshold = np.repeat(x, block_count)
threshold[block_start] = 1 # first item goes to A
threshold[block_start + 1] = 0 # second item goes to B
a_idx = threshold > np.random.rand(len(dataset))
A = dataset[where[a_idx]]
B = dataset[where[~a_idx]]
After running it, the split is roughly 80/20, and all blocks are represented in both arrays:
>>> len(A)
815
>>> len(B)
190
>>> np.all(np.unique(A[['idx1', 'idx2']]) == np.unique(B[['idx1', 'idx2']]))
True
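As an optional extra check, a small sketch that looks at the per-block train fraction actually achieved, using the A and B arrays from above:
from collections import Counter

cnt_A = Counter(zip(A['idx1'], A['idx2']))
cnt_B = Counter(zip(B['idx1'], B['idx2']))
# fraction of each block that ended up in A; ideally close to 0.8 for large blocks
fractions = {blk: cnt_A[blk] / (cnt_A[blk] + cnt_B[blk]) for blk in set(cnt_A) | set(cnt_B)}
print(min(fractions.values()), max(fractions.values()))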
Here's an alternative solution. I'm open to a code review on whether it is possible to implement this in a more numpy-like way (without for loops). @Jamie's answer is really good; it's just that sometimes it produces skewed ratios within blocks of data.
ratio = 0.8
IDX1 = 0
IDX2 = 1

idx1s = np.arange(len(np.unique(data[:,IDX1])))
idx2s = np.arange(len(np.unique(data[:,IDX2])))

valid = None
train = None

for i1 in idx1s:
    for i2 in idx2s:
        mask = np.nonzero((data[:,IDX1] == i1) & (data[:,IDX2] == i2))
        curr_data = data[mask,:]
        np.random.shuffle(curr_data)
        start = np.min(mask)
        end = np.max(mask)
        thres = start + np.around((end - start) * ratio).astype(int)
        selected = mask < thres
        train_idx = mask[0][selected[0]]
        valid_idx = mask[0][~selected[0]]
        if train is not None:
            train = np.vstack((train, data[train_idx]))
            valid = np.vstack((valid, data[valid_idx]))
        else:
            train = data[train_idx]
            valid = data[valid_idx]
I'm assuming that each block has at least two entries and also that if it has more than two you want them assigned as closely as possible to 80/20. The easiest way to do this seems to be to assign a random number to all rows, and then choose based on percentiles within each stratified sample. Say this is the data in file strat_sample.csv:
Index_1,Index_2,Data_1,Data_2
0,0,0.614583182,0.677644482
0,0,0.321384981,0.598450854
0,0,0.303029607,0.300593782
0,0,0.646010758,0.612006715
0,0,0.484572883,0.30052535
0,1,0.010625416,0.118671475
0,1,0.428967984,0.23795173
0,1,0.523440618,0.457275922
0,1,0.379612652,0.337640868
0,1,0.338180659,0.206399031
1,0,0.079386,0.890939911
1,0,0.572864624,0.725615079
1,0,0.045891404,0.300128917
1,0,0.578792198,0.100698871
1,0,0.776485138,0.475135948
1,0,0.401850419,0.784835723
1,1,0.087660923,0.497299605
1,1,0.8460978,0.825774802
1,1,0.526015021,0.581905971
1,1,0.23324672,0.299475291
Then this code (using pandas data structures) works as desired:
import numpy as np
import random as rnd
import pandas as pd

# sample data in strat_sample.csv, contents shown above

def TreatmentOneCount(n, *args):
    # assign a minimum of one to each group, but as close as possible to the fraction OptimalRatio in group 1.
    OptimalRatio = args[0]
    if n < 2:
        print("N too small, assignment not defined.")
        a = np.nan
    elif n == 2:
        a = 1
    else:
        """
        There are two candidate numbers close to the target ratio, one above, the other below.
        If the number above is N and it is closest to optimal, then you need to set things to N-1
        to ensure both groups have at least one member (recall n > 2).
        If the number below is 0 and it is closest to optimal, then you need to set things to 1
        to ensure both groups have at least one member (recall n > 2).
        """
        targetassignment = OptimalRatio * n
        if targetassignment - np.floor(targetassignment) > 0.5:
            a = min(np.ceil(targetassignment), n - 1)
        else:
            a = max(np.floor(targetassignment), 1)
    return a

df = pd.read_csv('strat_sample.csv', sep=',', header=0)

# assign a random number to each entry
df['RandScore'] = np.random.uniform(0, 1, df.shape[0])
df.sort_values(by=['Index_1', 'Index_2', 'RandScore'], inplace=True)

# Within each block assign a rank based on the random number.
df['RandRank'] = df.groupby(['Index_1', 'Index_2'])['RandScore'].rank()

# make a group index
df['MasterIdx'] = df['Index_1'].apply(str) + df['Index_2'].apply(str)

# Store the counts for the members of each block
seriestest = df.groupby('MasterIdx')['RandRank'].count()
seriestest.name = "Counts"
dftest = pd.DataFrame(seriestest)

# Add the block counts to the data
df = df.merge(dftest, how='left', left_on='MasterIdx', right_index=True)

# Make the actual assignments to the two groups
df['Assignment'] = (df['RandRank'] <= df['Counts'].apply(TreatmentOneCount, args=(0.8,))) * -1 + 2
df.drop(['MasterIdx', 'Counts', 'RandRank', 'RandScore'], axis=1)
from sklearn import cross_validation
X_train, X_test, Y_train, Y_test = cross_validation.train_test_split(X, y, test_size=0.2, random_state=0)
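Note that in current scikit-learn this function lives in sklearn.model_selection. If you also want each block represented proportionally in both sets, a hedged sketch using the stratify argument (block_labels here is hypothetical: a 1-D array with one block label per row of X; this relies on every block having at least two rows, and for very small blocks it still does not strictly guarantee one element in the smaller set):
from sklearn.model_selection import train_test_split

# block_labels: one label per row identifying its (idx1, idx2) block, e.g. built from the two index columns (hypothetical)
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2, random_state=0, stratify=block_labels)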