I'm writing a method for calculating the covariance of 2 to 8 time-series variables. I'm intending for the variables to be contained in list objects when they are passed to this method. The method should return 1 number, not a covariance matrix.
The method works fine the first time it's called. Any time it's called after that, it returns 0. An example is attached at the bottom, below my code. Any advice/feedback regarding the variable scope issues here would be greatly appreciated. Thanks!
p = [3,4,4,654]
o = [4,67,4,1]
class Toolkit():
def CovarianceScalar(self, column1, column2 = [], column3 = [], column4 = [],column5 = [],column6 = [],column7 = [],column8 = []):
"""Assumes all columns have length equal to Len(column1)"""
#If only the first column is passed, this will act as a variance function
import numpy as npObject
#This is a binary-style number that is assigned a value of 1 if one of the input vectors/lists has zero length. This way, the CovarianceResult variable can be computed, and the relevant
# terms can have a 1 added to them if they would otherwise go to 0, preventing the CovarianceResult value from incorrectly going to 0.
binUnityFlag2 = 1 if (len(column2) == 0) else 0
binUnityFlag3 = 1 if (len(column3) == 0) else 0
binUnityFlag4 = 1 if (len(column4) == 0) else 0
binUnityFlag5 = 1 if (len(column5) == 0) else 0
binUnityFlag6 = 1 if (len(column6) == 0) else 0
binUnityFlag7 = 1 if (len(column7) == 0) else 0
binUnityFlag8 = 1 if (len(column8) == 0) else 0
# Some initial housekeeping: ensure that all input column lengths match that of the first column. (Will later advise the user if they do not.)
lngExpectedColumnLength = len(column1)
inputList = [column2, column3, column4, column5, column6, column7, column8]
inputListNames = ["column2","column3","column4","column5","column6","column7","column8"]
for i in range(0,len(inputList)):
while len(inputList[i]) < lngExpectedColumnLength: #Empty inputs now become vectors of 1's.
inputList[i].append(1)
#Now start calculating the covariance of the inputs:
avgColumn1 = sum(column1)/len(column1) #<-- Each column's average
avgColumn2 = sum(column2)/len(column2)
avgColumn3 = sum(column3)/len(column3)
avgColumn4 = sum(column4)/len(column4)
avgColumn5 = sum(column5)/len(column5)
avgColumn6 = sum(column6)/len(column6)
avgColumn7 = sum(column7)/len(column7)
avgColumn8 = sum(column8)/len(column8)
avgList = [avgColumn1,avgColumn2,avgColumn3,avgColumn4,avgColumn5, avgColumn6, avgColumn7,avgColumn8]
#start building the scalar-valued result:
CovarianceResult = float(0)
for i in range(0,lngExpectedColumnLength):
CovarianceResult +=((column1[i] - avgColumn1) * ((column2[i] - avgColumn2) + binUnityFlag2) * ((column3[i] - avgColumn3) + binUnityFlag3 ) * ((column4[i] - avgColumn4) + binUnityFlag4 ) *((column5[i] - avgColumn5) + binUnityFlag5) * ((column6[i] - avgColumn6) + binUnityFlag6 ) * ((column7[i] - avgColumn7) + binUnityFlag7)* ((column8[i] - avgColumn8) + binUnityFlag8))
#Finally, divide the sum of the multiplied deviations by the sample size:
CovarianceResult = float(CovarianceResult)/float(lngExpectedColumnLength) #Coerce both terms to a float-type to prevent return of array-type objects.
return CovarianceResult
Example:
myInst = Toolkit() #Create a class instance.
First execution of the function:
myInst.CovarianceScalar(o,p)
#Returns -2921.25, the covariance of the numbers in lists o and p.
Second time around:
myInst.CovarianceScalar(o,p)
#Returns: 0.0
I believe the problem you are facing is due to mutable default arguments. Basically, when you first execute myInst.CovarianceScalar(o,p), all columns other than the first two are []. During this execution, the function modifies those default lists in place. Thus, when you execute the same call again, the other columns are no longer []; they keep whatever values they had at the end of the first execution.
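A minimal sketch of the effect and the usual workaround, assuming nothing beyond plain Python (the function names append_one and append_one_safe are made up for illustration). In the question's method, the second call sees column3 to column8 already filled with 1's, so every binUnityFlag is 0 and each (1 - 1) factor turns the product into 0.

```python
def append_one(values=[]):
    # The default list is created once, when the function is defined,
    # and the same object is reused on every call that omits `values`.
    values.append(1)
    return values


def append_one_safe(values=None):
    # Using None as the default and building the list inside the function
    # gives each call its own fresh list.
    if values is None:
        values = []
    values.append(1)
    return values
```

Calling append_one() twice returns a list that keeps growing, while append_one_safe() returns a one-element list every time. Applied to CovarianceScalar, this would mean using column2=None and so on, and replacing None with a new list inside the method.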
I have a variable with zeros and ones. Each sequence of ones represents "a phase" I want to observe; each sequence of zeros represents the gap between these phases.
It may happen that a phase carries a sort of "impulse response", for example the echo of a voice: in this case we will have 1,1,1,1,0,0,1,1,1,0,0,0 as an output, where the first sequence of ones is the shout we made, while the second one is just the echo caused by the shout.
So I made a function that ignores the echoes/responses of the main shout/action and converts the ones of the echo/response into zeros.
(1) If the sequence of zeros is greater than or equal to the input threshold nearby_thr, the function will recognize that the following sequence of ones is an independent phase and it won't delete or change anything.
(2) If the sequence of zeros (between two sequences of ones) is smaller than the input threshold nearby_thr, the function will recognize that we have "an impulse response/echo" and will not take it into account. In fact, it will convert the ones into zeros.
I made a naive function that can accomplish this result, but I was wondering if pandas already has a function like that, or if it can be accomplished in a few lines, without writing a "C-like" function.
Here's my code:
import pandas as pd
import matplotlib.pyplot as plt
# import utili_funzioni.util00 as ut0
x1 = pd.DataFrame([0,0,0,0,0,0,0,1,1,1,1,1,0,0,1,1,1,0,0,0,0,0,0,0,0,1,1,1,1,0,0,1,1,1])
x2 = pd.DataFrame([0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,1,1,0])
# rule = x1==1 ## counting number of consecutive ones
# cumsum_ones = rule.cumsum() - rule.cumsum().where(~rule).ffill().fillna(0).astype(int)
def detect_nearby_el_2(df, nearby_thr):
global el2del
# df = consecut_zeros
# i = 0
print("")
print("")
j = 0
enterOnce_if = 1
reset_count_0s = 0
start2detect = False
count0s = 0 # init
start2_getidxs = False # if this is not true, it won't store idxs to delete
el2del = [] # store idxs to delete elements
for i in range(df.shape[0]):
print("")
print("i: ", i)
x_i = df.iloc[i, 0]
if x_i == 1 and j==0: # first phase (ones) has been detected
start2detect = True # first phase (ones) has been detected
# j += 1
print("count0s:",count0s)
if start2detect == True: # first phase, seen/detected, --> (wait) has ended..
if x_i == 0: # 1st phase detected and ended with "a zero"
if reset_count_0s == 1:
count0s = 0
reset_count_0s = 0
count0s += 1
if enterOnce_if == 1:
start2_getidxs=True # avoiding to delete first phase
enterOnce_0 = 0
if start2_getidxs==True: # avoiding to delete first phase
if x_i == 1 and count0s < nearby_thr:
print("this is NOT a new phase!")
el2del = [*el2del, i] # idxs to delete
reset_count_0s = 1 # reset counter
if x_i == 1 and count0s >= nearby_thr:
print("this is a new phase!") # nothing to delete
reset_count_0s = 1 # reset counter
return el2del
def convert_nearby_el_into_zeros(df,idx):
df0 = df + 0 # error original dataframe is modified!
if len(idx) > 0:
# df.drop(df.index[idx]) # to delete completely
df0.iloc[idx] = 0
else:
print("no elements nearby to delete!!")
return df0
######
print("")
x1_2del = detect_nearby_el_2(df=x1,nearby_thr=3)
x2_2del = detect_nearby_el_2(df=x2,nearby_thr=3)
## deleting nearby elements
x1_a = convert_nearby_el_into_zeros(df=x1,idx=x1_2del)
x2_a = convert_nearby_el_into_zeros(df=x2,idx=x2_2del)
## PLOTTING
# ut0.grayplt()
fig1 = plt.figure()
fig1.suptitle("x1",fontsize=20)
ax1 = fig1.add_subplot(1,2,1)
ax2 = fig1.add_subplot(1,2,2,sharey=ax1)
ax1.title.set_text("PRE-detect")
ax2.title.set_text("POST-detect")
line1, = ax1.plot(x1)
line2, = ax2.plot(x1_a)
fig2 = plt.figure()
fig2.suptitle("x2",fontsize=20)
ax1 = fig2.add_subplot(1,2,1)
ax2 = fig2.add_subplot(1,2,2,sharey=ax1)
ax1.title.set_text("PRE-detect")
ax2.title.set_text("POST-detect")
line1, = ax1.plot(x2)
line2, = ax2.plot(x2_a)
You can see that x1 has two "responses/echoes" that I do not want to take into account, while x2 has none; in fact, nothing changed in x2.
My question is: how can this be accomplished in a few lines using pandas?
Thank You
Interesting problem, and I'm sure there's a more elegant solution out there, but here is my attempt - it's at least fairly performant:
x1 = pd.Series([0,0,0,0,0,0,0,1,1,1,1,1,0,0,1,1,1,0,0,0,0,0,0,0,0,1,1,1,1,0,0,1,1,1])
x2 = pd.Series([0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,1,1,0])
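def remove_echos(series, threshold):
    # assumes the default integer index (0, 1, 2, ...)
    series = series.copy()  # avoid modifying the caller's Series
    starting_points = (series==1) & (series.shift()==0)
    # parentheses are needed here: & binds more tightly than ==
    echo_starting_points = starting_points & (series.shift(threshold)==1)
    echo_starting_points = series[echo_starting_points].index
    change_points = series[starting_points].index.to_list() + [series.index[-1] + 1]
    for (start, end) in zip(change_points, change_points[1:]):
        if start in echo_starting_points:
            # .loc slices include the end label, so stop one before the next start
            series.loc[start:end - 1] = 0
    return series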
x1 = remove_echos(x1, 3)
x2 = remove_echos(x2, 3)
(I changed x1 and x2 to be Series instead of DataFrame, it's easy to adapt this code to work with a df if you need to.)
Explanation: we define the "starting point" of each section as a 1 preceded by a 0. Of those, a starting point is an "echo" starting point if the value threshold places before it is a 1. (The assumption is that we don't have a phase which is shorter than threshold.) For each echo starting point, we zero from it to the next starting point or the end of the Series.
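For comparison, here is a loop-free sketch of the same idea that works on run lengths instead of shifted values, so it does not rely on phases being at least threshold long. It is only a sketch under stated assumptions: the function name remove_echoes_by_runs is made up, and "echo" is taken to mean a run of ones (other than the first one) whose preceding run of zeros is shorter than nearby_thr, as described in the question.

```python
import pandas as pd


def remove_echoes_by_runs(series, nearby_thr):
    # Label each run of identical consecutive values: 1, 2, 3, ...
    run_id = (series != series.shift()).cumsum()
    run_value = series.groupby(run_id).first()   # 0 or 1 for each run
    run_length = series.groupby(run_id).size()   # number of samples in each run
    previous_length = run_length.shift()         # length of the run just before

    one_runs = run_value.index[run_value == 1]
    if len(one_runs) == 0:
        return series.copy()

    # An echo is a run of ones, after the first run of ones,
    # whose preceding gap of zeros is shorter than the threshold.
    is_echo = (
        (run_value == 1)
        & (previous_length < nearby_thr)
        & (run_value.index > one_runs[0])
    )
    echo_ids = run_value.index[is_echo]
    # Keep every sample except those belonging to an echo run.
    return series.where(~run_id.isin(echo_ids), 0)


x1 = pd.Series([0,0,0,0,0,0,0,1,1,1,1,1,0,0,1,1,1,0,0,0,0,0,0,0,0,1,1,1,1,0,0,1,1,1])
x2 = pd.Series([0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,1,1,0])
x1_clean = remove_echoes_by_runs(x1, 3)
x2_clean = remove_echoes_by_runs(x2, 3)
```

With these inputs, the two short-gap runs in x1 are zeroed and x2 comes back unchanged. The input Series is not modified.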
Let's say I wanted to call a function to do some calculation, but I also wanted to use that calculated value in a later function. When I return the value of the first function can I not just send it to my next function? Here is an example of what I am talking about:
def add(x,y):
addition = x + y
return addition
def square(a):
result = a * a
return result
sum = add(1,4)
product = square(addition)
If I call the add function, it'll return 5 as the addition result. But I want to use that number 5 in the next function, can I just send it to the next function as shown? In the main program I am working on it does not work like this.
Edit: This is a sample of the code I am actually working on which will give a better idea of what the problem is. The problem is when I send the mean to the calculateStdDev function.
#import libraries to be used
import time
import StatisticsCalculations
#global variables
mean = 0
stdDev = 0
#get file from user
fileChoice = input("Enter the .csv file name: ")
inputFile = open(fileChoice)
headers = inputFile.readline().strip('\n').split(',') #create headers for columns and strips unnecessary characters
#create a list with header-number of lists in it
dataColumns = []
for i in headers:
dataColumns.append([]) #fills initial list with as many empty lists as there are columns
#counts how many rows there are and adds a column of data into each empty list
rowCount = 0
for row in inputFile:
rowCount = rowCount + 1
comps = row.strip().split(',') #components of data
for j in range(len(comps)):
dataColumns[j].append(float(comps[j])) #appends the jth entry into the jth column, separating data into categories
k = 0
for entry in dataColumns:
print("{:>11}".format(headers[k]),"|", "{:>10.2f}".format(StatisticsCalculations.findMax(dataColumns[k])),"|",
"{:>10.2f}".format(StatisticsCalculations.findMin(dataColumns[k])),"|","{:>10.2f}".format(StatisticsCalculations.calculateMean(dataColumns[k], rowCount)),"|","{:>10.2f}".format()) #format each data entry to be right aligned and be correctly spaced in its column
#printing break line for each row
k = k + 1 #counting until dataColumns is exhausted
inputFile.close()
And the StatisticsCalculations module:
import math
def calculateMean(data, rowCount):
sumForMean = 0
for entry in data:
sumForMean = sumForMean + entry
mean = sumForMean/rowCount
return mean
def calculateStdDev(data, mean, rowCount, entry):
stdDevSum = 0
for x in data:
stdDevSum = float(stdDevSum) + ((float(entry[x]) - mean)** 2) #getting sum of squared difference to be used in std dev formula
stdDev = math.sqrt(stdDevSum / rowCount) #using the stdDevSum for the remaining parts of std dev formula
return stdDev
def findMin(data):
lowestNum = 1000
for component in data:
if component < lowestNum:
lowestNum = component
return lowestNum
def findMax(data):
highestNum = -1
for number in data:
if number > highestNum:
highestNum = number
return highestNum
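First of all, sum is the name of a built-in function (it is not a reserved word), so you shouldn't reuse it as a variable name: doing so hides the built-in. Second, addition only exists inside add; outside the function, the returned value lives in whatever variable you assigned it to.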
You can do it this way:
def add(x,y):
addition = x + y
return addition
def square(a):
result = a * a
return result
s = add(1, 4)
product = square(s)
Or directly:
product = square(add(1, 4))
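The same pattern applies to the statistics code from the edit: keep the returned mean in a variable and pass it on. Below is a minimal sketch, assuming each column is a plain list of floats. The names calculate_mean and calculate_std_dev are illustrative stand-ins, not the functions from the question's module, and the standard deviation divides by the number of entries, as the original does with rowCount.

```python
import math


def calculate_mean(data):
    # Average of all entries in the column
    return sum(data) / len(data)


def calculate_std_dev(data, mean):
    # Sum of squared differences from the mean, then the square root
    # of their average (population standard deviation)
    squared_diffs = sum((x - mean) ** 2 for x in data)
    return math.sqrt(squared_diffs / len(data))


column = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean = calculate_mean(column)              # the returned value is stored...
std_dev = calculate_std_dev(column, mean)  # ...and passed to the next function
```

Note that the loop variable is the value itself (x), not an index, so there is no need for something like entry[x].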
Below, I'm trying to code a Crank-Nicolson numerical solution to the Navier-Stokes equation for momentum (simplified with placeholders for the time being), but am having issues with solving for umat[timecount,:], and keep getting the error "ValueError: setting an array element with a sequence". I'm extremely new to Python; does anyone know what I could do differently to avoid this problem?
Thanks!!
def step(timesteps,dn,dt,Numvpts,Cd,g,alpha,Sl,gamma,theta_L,umat):
for timecount in range(0, timesteps+1):
if timecount == 0:
umat[timecount,:] = 0
else:
Km = 1 #placeholder for eddy viscosity
thetaM = 278.15 #placeholder for theta_m for time being
A = Km*dt/(2*(dn**2))
B = (-g*dt/theta_L)*thetaM*np.sin(alpha)
C = -dt*(1/(2*Sl) + Cd)
W_arr = np.zeros(Numvpts+1)
D = np.zeros(Numvpts+1)
for x in (0,Numvpts): #creating the vertical velocity term
if x==0:
W_arr[x] = 0
D[x] = 0
else:
W_arr[x] = W_arr[x-1] - (dn/Sl)*umat[timecount-1,x-1]
D = W_arr/(4*dn)
coef_mat_u = Neumann_mat(Numvpts,D-A,(1+2*A),-(A+D))
b_arr_u = np.zeros(Numvpts+1) #the array of known quantities
umat_forward = umat[timecount-1,2:Numvpts]
umat_center = umat[timecount-1,1:Numvpts-1]
umat_backward = umat[timecount-1,0:Numvpts-2]
b_arr_u = np.zeros(Numvpts+1)
for j in (0,Numvpts):
if j==0:
b_arr_u[j] = 0
elif j==Numvpts:
b_arr_u[j] = 0
else:
b_arr_u[j] = (A+D[j])*umat_backward[j]*(1-2*A)*umat_center[j] + (A-D[j])*umat_forward[j] - C*(umat_center[j]*umat_center[j]) - B
umat[timecount,:] = np.linalg.solve(coef_mat_u,b_arr_u)
return(umat)
Please note that,
for i in (0, 20):
print(i),
will give result 0 20 not 0 1 2 3 4 ... 20
So you have to use the range() function
for i in range(0, 20 + 1):
print(i),
to get 0 1 2 3 4 ... 20
I have not gone through your code rigorously, but I think the problem is in your two inner for loops:
for x in (0,Numvpts): #creating the vertical velocity term
which sets values only at index 0 and index Numvpts (the two elements of the tuple). I think you must use
for x in range(0,Numvpts):
Similar is the case in (range() must be used):
for j in (0,Numvpts):
Also, once you switch to range(0,Numvpts), j never becomes == Numvpts, yet you are checking that condition. I guess it must be == Numvpts-1
And also the else condition is called for every index other than 0? So in your code the right hand side vector has same numbers from index 1 onwards!
I think the fundamental problem is that you are not using range(). Also it is a good idea to solve the NS eqns for a small grid and manually check the A and b matrix to see whether they are being set correctly.
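A small check along those lines, as a sketch only: n stands in for Numvpts, and the arrays are assumed to have n + 1 entries like W_arr and b_arr_u in the question, which is why the second loop uses range(0, n + 1).

```python
import numpy as np

n = 5  # stand-in for Numvpts

# Looping over the tuple (0, n) visits exactly two indices.
partial = np.zeros(n + 1)
for x in (0, n):
    partial[x] = 1

# Looping over range(0, n + 1) visits every index of the array.
full = np.zeros(n + 1)
for x in range(0, n + 1):
    full[x] = 1

print(partial)
print(full)
```

Printing both arrays for a small n makes it easy to see which entries a loop actually touched before moving on to the full grid.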
this probably leads to scipy/numpy, but right now I'm happy with any functionality as I couldn't find anything in those packages. I have a matrix that contains data for a multi-variate distribution (let's say, 2, for the fun of it). Is there any function to compute (higher) moments of that? All I could find was numpy.mean() and numpy.cov() :o
Thanks :)
/edit:
So some more detail: I have multivariate data, that is, a matrix where rows display variables and columns observations. Now I would like to have a simple way of computing the joint moments of that data, as defined in http://en.wikipedia.org/wiki/Central_moment#Multivariate_moments .
I'm pretty new to python/scipy so I'm not sure I'd be the best person to code this one up, especially for the n-variables case (note that the wikipedia definition is for n=2), and I kind of expected there to be some out-of-the-box thing to use as I thought this would be a standard problem.
/edit2:
Just for the future, in case someone wants to do something similar, the following code (which is still under review) should give the sample equivalent of the raw moments E(X^2), E(Y^2), etc. It only works for two variables right now, but it should be extendable if one feels the need. If you see some mistakes or unclean/unpython-nish code, feel free to comment.
from numpy import *
# this function should return something as
# moments[0] = 1
# moments[1] = mean(X), mean(Y)
# moments[2] = 1/n*X'X, 1/n*X'Y, 1/n*Y'Y
# moments[3] = mean(X'X'X), mean(X'X'Y), mean(X'Y'Y),
# mean(Y'Y'Y)
# etc
def getRawMoments(data, moment, axis=0):
a = moment
if (axis==0):
n = float(data.shape[1])
X = matrix(data[0,:]).reshape((n,1))
Y = matrix(data[1,:]).reshape((n,1))
else:
n = float(data.shape[0])
X = matrix(data[:,0]).reshape((n,1))
Y = matrix(data[:,1]).reshape((n,1))
result = 1
Z = hstack((X,Y))
iota = ones((1,n))
moments = {}
moments[0] = 1
#first, generate huge-ass matrix containing all x-y combinations
# for every power-combination k,l such that k+l = i
# for all 0 <= i <= a
for i in arange(1,a):
if i==2:
moments[i] = moments[i-1]*Z
# if even, postmultiply with X.
elif i%2 == 1:
moments[i] = kron(moments[i-1], Z.T)
# Else, postmultiply with X.T
elif i%2==0:
temp = moments[i-1]
temp2 = temp[:,0:n]*Z
temp3 = temp[:,n:2*n]*Z
moments[i] = hstack((temp2, temp3))
# since now we have many multiple moments
# such as x**2*y and x*y*x, filter non-distinct elements
momentsDistinct = {}
momentsDistinct[0] = 1
for i in arange(1,a):
if i%2 == 0:
data = 1/n*moments[i]
elif i == 1:
temp = moments[i]
temp2 = temp[:,0:n]*iota.T
data = 1/n*hstack((temp2))
else:
temp = moments[i]
temp2 = temp[:,0:n]*iota.T
temp3 = temp[:,n:2*n]*iota.T
data = 1/n*hstack((temp2, temp3))
momentsDistinct[i] = unique(data.flat)
return momentsDistinct
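For a single joint moment, a more direct sketch is to raise each variable to its exponent, multiply across variables, and average over observations. This assumes the layout described above (rows are variables, columns are observations). The function name joint_moment and its parameters are made up for illustration and are not part of NumPy or SciPy.

```python
import numpy as np


def joint_moment(data, powers, central=False):
    """Sample joint moment: mean over observations of prod_i X_i**powers[i].

    data: 2-D array, one row per variable, one column per observation.
    powers: one non-negative integer exponent per variable.
    central: if True, subtract each variable's mean first.
    """
    data = np.asarray(data, dtype=float)
    powers = np.asarray(powers).reshape(-1, 1)  # column, broadcasts over observations
    if central:
        data = data - data.mean(axis=1, keepdims=True)
    return np.mean(np.prod(data ** powers, axis=0))


sample = np.array([[1.0, 2.0, 3.0],
                   [2.0, 4.0, 6.0]])
raw_xy = joint_moment(sample, (1, 1))                    # sample E[X*Y]
central_xy = joint_moment(sample, (1, 1), central=True)  # covariance with 1/n
central_xx = joint_moment(sample, (2, 0), central=True)  # variance of X with 1/n
```

This works for any number of variables, since powers just needs one exponent per row. All moments here use a 1/n normalisation, unlike the default of numpy.cov.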
In numpy I have a dataset like this. The first two columns are indices. I can divide my dataset into blocks via the indices, i.e. first block is 0 0 second block is 0 1 third block 0 2 then 1 0, 1 1, 1 2 and so on and so forth. Each block has at least two elements. The numbers in the indices columns can vary
I need to split the dataset along these blocks 80%-20% randomly such that after the split each block in both datasets has at least 1 element. How could I do that?
indices | real data
|
0 0 | 43.25 665.32 ... } 1st block
0 0 | 11.234 }
0 1 ... } 2nd block
0 1 }
0 2 } 3rd block
0 2 }
1 0 } 4th block
1 0 }
1 0 }
1 1 ...
1 1
1 2
1 2
2 0
2 0
2 1
2 1
2 1
...
See how you like this. To introduce randomness, I am shuffling the entire dataset. It is the only way I have figured out to do the splitting in a vectorized way. Maybe you could simply shuffle an indexing array, but that was one indirection too many for my brain today. I have also used a structured array, for ease in extracting the blocks. First, let's create a sample dataset:
from __future__ import division
import numpy as np
# Create a sample data set
c1, c2 = 10, 5
idx1, idx2 = np.arange(c1), np.arange(c2)
idx1, idx2 = np.repeat(idx1, c2), np.tile(idx2, c1)
items = 1000
i = np.random.randint(c1*c2, size=(items - 2*c1*c2,))
d = np.random.rand(items+5)
dataset = np.empty((items+5,), [('idx1', int), ('idx2', int),
('data', float)])
dataset['idx1'][:2*c1*c2] = np.tile(idx1, 2)
dataset['idx1'][2*c1*c2:-5] = idx1[i]
dataset['idx2'][:2*c1*c2] = np.tile(idx2, 2)
dataset['idx2'][2*c1*c2:-5] = idx2[i]
dataset['data'] = d
# Add blocks with only 2 and only 3 elements to test corner case
dataset['idx1'][-5:] = -1
dataset['idx2'][-5:] = [0] * 2 + [1]*3
And now the stratified sampling:
# For randomness, shuffle the entire array
np.random.shuffle(dataset)
blocks, _ = np.unique(dataset[['idx1', 'idx2']], return_inverse=True)
block_count = np.bincount(_)
where = np.argsort(_)
block_start = np.concatenate(([0], np.cumsum(block_count)[:-1]))
# If we have n elements in a block, and we assign 1 to each array, we
# are left with only n-2. If we randomly assign a fraction x of these
# to the first array, the expected ratio of items will be
# (x*(n-2) + 1) : ((1-x)*(n-2) + 1)
# Setting the ratio equal to 4 (80/20) and solving for x, we get
# x = 4/5 + 3/5/(n-2)
x = 4/5 + 3/5/(block_count - 2)
x = np.clip(x, 0, 1) # if n in (2, 3), the ratio is larger than 1
threshold = np.repeat(x, block_count)
threshold[block_start] = 1 # first item goes to A
threshold[block_start + 1] = 0 # second item goes to B
a_idx = threshold > np.random.rand(len(dataset))
A = dataset[where[a_idx]]
B = dataset[where[~a_idx]]
After running it, the split is roughly 80/20, and all blocks are represented in both arrays:
>>> len(A)
815
>>> len(B)
190
>>> np.all(np.unique(A[['idx1', 'idx2']]) == np.unique(B[['idx1', 'idx2']]))
True
Here's an alternative solution. I'm open to a code review if it is possible to implement this in a more numpy way (without for loops). @Jamie's answer is really good; it's just that sometimes it produces skewed ratios within blocks of data.
ratio = 0.8
IDX1 = 0
IDX2 = 1
idx1s = np.arange(len(np.unique(self.data[:,IDX1])))
idx2s = np.arange(len(np.unique(self.data[:,IDX2])))
valid = None
train = None
for i1 in idx1s:
for i2 in idx2s:
mask = np.nonzero((data[:,IDX1] == i1) & (data[:,IDX2] == i2))
curr_data = data[mask,:]
np.random.shuffle(curr_data)
start = np.min(mask)
end = np.max(mask)
thres = start + np.around((end - start) * ratio).astype(int)
selected = mask < thres
train_idx = mask[0][selected[0]]
valid_idx = mask[0][~selected[0]]
if train is not None:
train = np.vstack((train,data[train_idx]))
valid = np.vstack((valid,data[valid_idx]))
else:
train = data[train_idx]
valid = data[valid_idx]
I'm assuming that each block has at least two entries and also that if it has more than two you want them assigned as closely as possible to 80/20. The easiest way to do this seems to be to assign a random number to all rows, and then choose based on percentiles within each stratified sample. Say this is the data in file strat_sample.csv:
Index_1,Index_2,Data_1,Data_2
0,0,0.614583182,0.677644482
0,0,0.321384981,0.598450854
0,0,0.303029607,0.300593782
0,0,0.646010758,0.612006715
0,0,0.484572883,0.30052535
0,1,0.010625416,0.118671475
0,1,0.428967984,0.23795173
0,1,0.523440618,0.457275922
0,1,0.379612652,0.337640868
0,1,0.338180659,0.206399031
1,0,0.079386,0.890939911
1,0,0.572864624,0.725615079
1,0,0.045891404,0.300128917
1,0,0.578792198,0.100698871
1,0,0.776485138,0.475135948
1,0,0.401850419,0.784835723
1,1,0.087660923,0.497299605
1,1,0.8460978,0.825774802
1,1,0.526015021,0.581905971
1,1,0.23324672,0.299475291
Then this code (using Pandas data structures) works as desired
import numpy as np
import random as rnd
import pandas as pd
#sample data strat_sample.csv, contents to follow
def TreatmentOneCount(n, *args):
    # Assign a minimum of one to each group, but stay as close as possible to the fraction OptimalRatio in group 1.
    OptimalRatio = args[0]
    if n < 2:
        print("N too small, assignment not defined.")
        a = np.nan
    elif n == 2:
        a = 1
    else:
        """
        There are two integers close to the target ratio, one above and one below.
        If the number above is N and it is closest to optimal, then you need to set things to N-1 to ensure both groups have at least one member (recall n>2)
        If the number below is 0 and it is closest to optimal, then you need to set things to 1 to ensure both groups have at least one member (recall n>2)
        """
        targetassigment = OptimalRatio * n
        if targetassigment - np.floor(targetassigment) > 0.5:
            a = min(np.ceil(targetassigment), n-1)
        else:
            a = max(np.floor(targetassigment), 1)
    return a
df = pd.read_csv('strat_sample.csv', sep=',' , header=0)
#assign a random number to each entry
df['RandScore'] = np.random.uniform(0,1,df.shape[0])
df.sort_values(by=['Index_1', 'Index_2', 'RandScore'], inplace=True)
#Within each block assign a rank based on random number.
df['RandRank'] = df.groupby(['Index_1','Index_2'])['RandScore'].rank()
#make a group index
df['MasterIdx'] = df['Index_1'].apply(str) + df['Index_2'].apply(str)
#Store the counts for members of each block
seriestest = df.groupby('MasterIdx')['RandRank'].count()
seriestest.name = "Counts"
dftest = pd.DataFrame(seriestest)
#Add the block counts to the data
df = df.merge(dftest, how='left', left_on = 'MasterIdx', right_index= True)
#Make the actual assignments to the two groups
df['Assignment'] = (df['RandRank'] <= df['Counts'].apply(TreatmentOneCount, args = (0.8,))) * -1 + 2
df = df.drop(['MasterIdx', 'Counts', 'RandRank', 'RandScore'], axis=1)
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2, random_state=0)
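Called as above, train_test_split splits the rows at random and does not by itself promise that every block ends up in both parts. A minimal NumPy sketch that enforces this is shown below. It assumes every block has at least two rows; the function name split_blocks and its parameters are hypothetical.

```python
import numpy as np


def split_blocks(keys, ratio=0.8, seed=0):
    """Return a boolean mask over rows: True means the row goes to the larger part.

    keys: 2-D array with one row of block indices per data row.
    Each block contributes at least one row to each part (needs >= 2 rows per block).
    """
    rng = np.random.default_rng(seed)
    keys = np.asarray(keys)
    # inverse[i] is the block number of row i
    _, inverse = np.unique(keys, axis=0, return_inverse=True)
    inverse = inverse.ravel()
    in_first = np.zeros(len(keys), dtype=bool)
    for block in range(inverse.max() + 1):
        rows = rng.permutation(np.flatnonzero(inverse == block))
        n = len(rows)
        # Closest count to ratio * n, but never 0 and never the whole block
        k = int(np.clip(round(ratio * n), 1, n - 1))
        in_first[rows[:k]] = True
    return in_first


keys = np.array([[0, 0]] * 5 + [[0, 1]] * 2 + [[1, 0]] * 10)
mask = split_blocks(keys)
```

The rows for the two parts are then data[mask] and data[~mask]. The loop runs once per block rather than once per row, which is usually acceptable when the number of blocks is moderate.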