I am writing a program to discretize a set of attributes via entropy discretization. The goal is to parse the dataset
A,Class
5,1
12.5,1
11.5,2
8.6,2
7,1
6,1
5.9,2
1.5,2
9,2
7.8,1
2.1,1
13.5,2
12.45,2
Into
A,Class
1,1
3,1
3,2
2,2
2,1
2,1
1,2
1,2
3,2
2,1
1,1
3,2
3,2
The specific problem that I am facing with my program is determining the frequencies of 1 and 2 in the class column.
df = s['Class']
df['freq'] = df.groupby('Class')['Class'].transform('count')
print("*****************")
print(df['freq'])
I would like to use a pandas method to return the frequency of 1 and 2 so that I can calculate probabilities p1 and p2.
import pandas as pd
import numpy as np
import entropy_based_binning as ebb
from math import log2

def main():
    df = pd.read_csv('S1.csv')
    s = df
    s = entropy_discretization(s)

# This method discretizes s A1
# If the information gain is 0, i.e the number of
# distinct class is 1 or
# If min f/ max f < 0.5 and the number of distinct values is floor(n/2)
# Then that partition stops splitting.
def entropy_discretization(s):
    informationGain = {}
    # while(uniqueValue(s)):
    # Step 1: pick a threshold
    threshold = 6
    # Step 2: Partition the data set into two partitions
    s1 = s[s['A'] < threshold]
    print("s1 after splitting")
    print(s1)
    print("******************")
    s2 = s[s['A'] >= threshold]
    print("s2 after splitting")
    print(s2)
    print("******************")
    # Step 3: calculate the information gain.
    informationGain = information_gain(s1,s2,s)
    print(informationGain)
    # # Step 5: calculate the max information gain
    # minInformationGain = min(informationGain)
    # # Step 6: keep the partitions of S based on the value of threshold_i
    # s = bestPartition(minInformationGain, s)

def uniqueValue(s):
    # if all records in s share the same value of A, return False
    if s.nunique()['A'] == 1:
        return False
    # otherwise return True
    else:
        return True

def bestPartition(maxInformationGain):
    # determine the best threshold_i
    threshold_i = 6
    return

def information_gain(s1, s2, s):
    # calculate cardinality for s1
    cardinalityS1 = len(pd.Index(s1['A']).value_counts())
    print(f'The Cardinality of s1 is: {cardinalityS1}')
    # calculate cardinality for s2
    cardinalityS2 = len(pd.Index(s2['A']).value_counts())
    print(f'The Cardinality of s2 is: {cardinalityS2}')
    # calculate cardinality of s
    cardinalityS = len(pd.Index(s['A']).value_counts())
    print(f'The Cardinality of s is: {cardinalityS}')
    # calculate informationGain
    informationGain = (cardinalityS1/cardinalityS) * entropy(s1) + (cardinalityS2/cardinalityS) * entropy(s2)
    print(f'The total informationGain is: {informationGain}')
    return informationGain

def entropy(s):
    # calculate the number of classes in s
    numberOfClasses = s['Class'].nunique()
    print(f'Number of classes: {numberOfClasses}')
    # TODO calculate pi for each class.
    # calculate the frequency of class_i in S1
    value_counts = s['Class'].value_counts()
    print(f'value_counts : {value_counts}')
    df = s['Class']
    df['freq'] = df.groupby('Class')['Class'].transform('count')
    print("*****************")
    print(df['freq'])
    # p1 = s.groupby('Class').count()
    # p2 = s.groupby('Class').count()
    # print(f'p1: {p1}')
    # print(f'p2: {p2}')
    p1 = 2/4
    p2 = 3/4
    ent = -(p1*log2(p2)) - (p2*log2(p2))
    return ent
Ideally, I'd like to print Number of classes: 2. This way I can loop over the classes and calculate the frequencies for the attribute Class from the dataset. I've reviewed the pandas documentation, but I got stuck trying to count the frequency of 1 and 2 from the class section.
Use value_counts:
>>> df.value_counts('Class')
Class
2 7
1 6
dtype: int64
Update:
How do I get the individual frequencies returned from the value_counts method?
counts = df.value_counts('Class')
print(counts[1]) # Freq of 1
6
print(counts[2]) # Freq of 2
7
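With the counts in hand, the probabilities p1 and p2 from the question follow from dividing by the number of rows. The following is a minimal sketch, assuming the column is named Class as in the question; class_entropy is a hypothetical helper name, not part of pandas, and the four-row frame is just an excerpt of the sample data.

```python
from math import log2

import pandas as pd


def class_entropy(s):
    """Shannon entropy (in bits) of the 'Class' column of DataFrame s."""
    # normalize=True returns relative frequencies instead of raw counts
    probabilities = s['Class'].value_counts(normalize=True)
    return -sum(p * log2(p) for p in probabilities)


# First four rows of the sample dataset from the question
df = pd.DataFrame({'A': [5, 12.5, 11.5, 8.6], 'Class': [1, 1, 2, 2]})
probabilities = df['Class'].value_counts(normalize=True)
p1, p2 = probabilities[1], probabilities[2]  # looked up by class label
print(p1, p2)             # 0.5 0.5
print(class_entropy(df))  # 1.0
```

Because value_counts only lists classes that actually occur, no probability is zero and log2 is always defined.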
Related
I have a pandas dataframe similar to the one below:
Output var1 var2 var3
1 0.487981 0.297929 0.214090
1 0.945660 0.031666 0.022674
2 0.119845 0.828661 0.051495
2 0.095186 0.852232 0.052582
3 0.059520 0.053307 0.887173
3 0.091049 0.342226 0.566725
3 0.119295 0.414376 0.466329
... ... ... ... ...
Basically, I have 3 columns (propensity score values) and one output (treatment). I want to calculate the within-trio distance to find trios of outputs with the smallest within-trio distance.
The experiment is taken from the paper "Matching by Propensity Score in Cohort Studies with Three Treatment Groups" by Rassen et al. From their explanation, it looks like calculating the perimeter of a triangle, but I am not sure.
I think that at this GitHub link: https://github.com/bwh-dope/pharmacoepi_toolbox/blob/master/src/org/drugepi/match/MatchDistanceCalculator.java there is Java code doing this stuff more or less, but I am not sure on how to use it. I use Python, so I have two options: try to adapt this previous code or write something else.
My idea is that var1, var2 and var3 can be considered like spatial x,y,z coordinates, and the output is like a point in the space.
I found a function that calculates the distance between 2 points:
#found here https://stackoverflow.com/questions/68938033/min-distance-between-point-cloud-xyz-points-in-python
import itertools
import numpy as np

distance = lambda p1, p2: np.sqrt(np.sum((p1 - p2) ** 2, axis=0))

def min_distance(cloud):
    pairs = itertools.combinations(cloud, 2)
    # np.min cannot consume a lazy map object in Python 3, so use the built-in min
    return min(distance(*pair) for pair in pairs)

def get_points(filename):
    with open(filename, 'r') as file:
        rows = np.genfromtxt(file, delimiter=',', skip_header=True)
    return rows

filename = 'cloud.csv'
cloud = get_points(filename)
min_dist = min_distance(cloud)
However, I want to calculate the distance between 3 points, so I think that I need to iterate all the possible combinations of 3 points like XY, XZ and YZ, but I am not sure of this procedure.
Finally, I tried my own solution, which I think is correct but may be too computationally expensive.
I created my 3 dataset, according to the Output value: dataset1 = dataset[dataset["Output"]==1] and the same for Output=2 and Output=3.
This is my distance function:
def Euclidean_Dist(df1, df2):
    return np.linalg.norm(df1 - df2)
My variables:
tripletta_for = []
tripletta_tot_wr = []
p_inf = float('inf')
counter = 1
These are the steps used to compute the within-trio distance. I hope they are correct.
'''
i[0] = index
i[1] = treatment prop1
i[1][0] = treatment
i[1][1] = prop
'''
# I want to compute the distance between i[1][1], j[1][1] and k[1][1]
for i in dataset1.iterrows():
    minimum_distance = p_inf
    print(counter)
    counter = counter + 1
    for j in dataset2.iterrows():
        dist12 = Euclidean_Dist(i[1][1], j[1][1])
        for k in dataset3.iterrows():
            dist13 = Euclidean_Dist(i[1][1], k[1][1])
            dist23 = Euclidean_Dist(j[1][1], k[1][1])
            somma = dist12 + dist13 + dist23
            if somma < minimum_distance:
                minimum_distance = somma
                tripletta_for = i[0], j[0], k[0]
                #print(tripletta_for)
    dataset2.drop(index=tripletta_for[1], inplace=True)
    dataset3.drop(tripletta_for[2], inplace=True)
    #print(len(dataset3))
    tripletta_tot_wr.append(tripletta_for)
    #print(tripletta_tot_wr)
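The triangle-perimeter idea can also be written without the three nested Python loops. The sketch below is one possible way to do it with NumPy broadcasting, assuming each group is an array of shape (n, 3) holding var1, var2 and var3; pairwise_dist and best_trio are hypothetical names. It only finds the single trio with the smallest perimeter, not the full matching without replacement, and the (n1, n2, n3) array it builds can become large.

```python
import numpy as np


def pairwise_dist(a, b):
    """Euclidean distances between every row of a and every row of b."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)


def best_trio(g1, g2, g3):
    """Return the positions (i, j, k) minimising d12 + d13 + d23, and that sum."""
    d12 = pairwise_dist(g1, g2)  # shape (n1, n2)
    d13 = pairwise_dist(g1, g3)  # shape (n1, n3)
    d23 = pairwise_dist(g2, g3)  # shape (n2, n3)
    # Broadcast the three matrices into an (n1, n2, n3) cube of perimeters
    perimeter = d12[:, :, None] + d13[:, None, :] + d23[None, :, :]
    i, j, k = np.unravel_index(perimeter.argmin(), perimeter.shape)
    return (int(i), int(j), int(k)), float(perimeter[i, j, k])


# Made-up points, one array per treatment group
g1 = np.array([[0.0, 0.0, 0.0], [5.0, 5.0, 5.0]])
g2 = np.array([[9.0, 9.0, 9.0], [0.0, 1.0, 0.0]])
g3 = np.array([[0.0, 0.0, 1.0], [7.0, 7.0, 7.0]])
trio, dist = best_trio(g1, g2, g3)
print(trio, dist)
```

To reproduce the greedy matching from the loop above, the selected rows would have to be removed from g2 and g3 before searching again for the next row of g1.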
I have a huge number of points stored with x and y coordinates and an additional value ('value_P') in a pandas.dataframe so the dataframe looks like:
   x-coordinate  y-coordinate  value_P
0             0             3        1
1            40            58        1
2             5             4        2
3            76            98        2
4            15            35        3
5             5             4        3
but with around 250,000 entries, so I am looking for an efficient solution. I am trying to add a column that holds the row index of the closest other point. Only the distances from points with value_P!=1 to points with value_P==1 should be considered, and I am only interested in the index for points where value_P!=1. It is difficult to explain, but the desired output should be:
   x-coordinate  y-coordinate  value_P  index
0             0             3        1    NaN
1            40            58        1    NaN
2             5             4        2      0
3            76            98        2      1
4            15            35        3      1
5             5             4        3      0
For row 1 the index is NaN because I am not interested in it, since value_P==1. For row 2 it is 0, because the point in row 0 is the closest point with a value_P of 1.
I hope it's understandable.
I found a solution that involves two DataFrame.apply(lambda x:...) calls, but it takes a long time. Even if you don't have a concrete solution, an idea for how to improve the performance would be highly appreciated.
My current code is: (P_sort is the data and 'zuord' is the added column)
import math
import numpy as np
import pandas as pnd

def index2(x_1,y_1,x_2,y_2,last_1):
    h = math.sqrt((x_1 - x_2) ** 2 + (y_1 - y_2) ** 2)
    return h

def index(x_1,y_1,x_v,y_v,last_1):
    df2 = pnd.DataFrame()
    df3 = pnd.DataFrame()
    df2['x-coordinate'] = x_v
    df2['y-coordinate'] = y_v
    df3['distances'] = df2.apply(
        lambda x: index2(x['x-coordinate'], x['y-coordinate'], x_1, y_1, last_1), axis=1)
    k=df3.idxmin()
    print(k)
    return k

last_1 = np.count_nonzero(P_sort[:, 2] == 1) - 1
df = pnd.DataFrame(P_sort,
                   columns=['x-coordinate', 'y-coordinate', 'value_P'])
number_columnx = df.loc[:, 'x-coordinate']
number_columny = df.loc[:, 'y-coordinate']
x_v = number_columnx.values
y_v = number_columny.values
x_v = x_v[0:last_1]
y_v = y_v[0:last_1]
df['zuord'] = df.apply(lambda x: index(x['x-coordinate'],x['y-coordinate'],x_v,y_v,last_1),axis=1)
I am new to programming so the code is kind of ugly
I benchmarked four solutions, and the fastest approach is a KD Tree.
Test Dataset
I randomly generated dataframes of various sizes to test the performance of each method.
def generate_spots(n, p=0.005):
    x_pos = np.random.uniform(0, 100, n)
    y_pos = np.random.uniform(0, 100, n)
    value_P = np.random.binomial(size=n, n=1, p=(1 - p)) + 1
    df = pd.DataFrame({
        'x-coordinate': x_pos,
        'y-coordinate': y_pos,
        'value_P': value_P
    })
    df = df.sort_values('value_P').reset_index(drop=True)
    return df
This generates a dataframe with n rows, with a probability p that each row is class 1. I also sorted it, because the original method seems to assume that the dataframe is sorted by P.
Method 1: Original
I made some small changes to your code to get it to work for me:
def method1(df):
    df = df.copy()
    last_1 = np.count_nonzero(df.loc[:, 'value_P'] == 1)
    number_columnx = df.loc[:, 'x-coordinate']
    number_columny = df.loc[:, 'y-coordinate']
    x_v = number_columnx.values
    y_v = number_columny.values
    x_v = x_v[0:last_1]
    y_v = y_v[0:last_1]
    df['index'] = df.apply(lambda x: index(x['x-coordinate'],x['y-coordinate'],x_v,y_v,last_1),axis=1)
    df.loc[0:last_1 - 1, 'index'] = -1
    return df
index() and index2() are defined the same way as in your question. I also use -1 as a placeholder instead of NaN. No deep reason for this, just personal preference.
Method 2: cdist
SciPy has a function called cdist() that computes the distance between every pair of points taken from two arrays of points.
import scipy.spatial.distance
def method2(df):
    df = df.copy()
    first_P_class = df['value_P'] == 1
    target_df = df.loc[first_P_class][['x-coordinate', 'y-coordinate']]
    source_df = df.loc[~first_P_class][['x-coordinate', 'y-coordinate']]
    nearest_point = scipy.spatial.distance.cdist(source_df, target_df).argmin(axis=1)
    df['index'] = -1
    df.loc[source_df.index, 'index'] = nearest_point
    return df
The cdist function is pretty much the same as what you're doing - it's just implemented in C rather than Python.
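To make the comparison concrete, here is a NumPy-only sketch of what cdist followed by argmin computes. nearest_class1_index is a hypothetical name. Unlike the benchmarked methods, it writes the DataFrame index labels of the nearest class-1 rows (argmin alone gives positions within the class-1 subset) and leaves NaN for the class-1 rows themselves. It builds the full distance matrix in memory, so it is meant as an illustration on small inputs, not as a solution for 250K points.

```python
import numpy as np
import pandas as pd


def nearest_class1_index(df):
    """For every row with value_P != 1, return the index label of the
    closest row with value_P == 1; rows with value_P == 1 get NaN."""
    is_target = df['value_P'] == 1
    targets = df.loc[is_target, ['x-coordinate', 'y-coordinate']]
    sources = df.loc[~is_target, ['x-coordinate', 'y-coordinate']]
    # Pairwise differences, shape (n_sources, n_targets, 2)
    diff = sources.to_numpy()[:, None, :] - targets.to_numpy()[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    nearest_pos = dist.argmin(axis=1)  # position within targets
    out = pd.Series(np.nan, index=df.index)
    # Translate positions back into index labels of the original frame
    out.loc[sources.index] = targets.index.to_numpy()[nearest_pos]
    return out


# Sample data from the question
df = pd.DataFrame({
    'x-coordinate': [0, 40, 5, 76, 15, 5],
    'y-coordinate': [3, 58, 4, 98, 35, 4],
    'value_P': [1, 1, 2, 2, 3, 3],
})
result = nearest_class1_index(df)
print(result.tolist())  # [nan, nan, 0.0, 1.0, 1.0, 0.0]
```

On a frame sorted by value_P with a fresh index, as produced by generate_spots, positions and labels coincide, which is why the benchmarked methods can store the argmin result directly.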
Method 3: KD Tree
A KD Tree is a data structure designed to efficiently search for nearby points. You can use scikit-learn to implement this.
import sklearn.neighbors
def method3(df):
    df = df.copy()
    first_P_class = df['value_P'] == 1
    target_df = df.loc[first_P_class][['x-coordinate', 'y-coordinate']]
    source_df = df.loc[~first_P_class][['x-coordinate', 'y-coordinate']]
    tree = sklearn.neighbors.KDTree(target_df)
    nearest_point = tree.query(source_df, k=1, return_distance=False)
    df['index'] = -1
    df.loc[source_df.index, 'index'] = nearest_point.flatten()
    return df
Method 4: fastdist
The Python package fastdist bills itself as a faster alternative to scipy's distance calculation methods. Ironically, I found this solution to be slower than cdist at all problem sizes.
from fastdist import fastdist
def method4(df):
    df = df.copy()
    first_P_class = df['value_P'] == 1
    target_df = df.loc[first_P_class][['x-coordinate', 'y-coordinate']]
    target_array = target_df.to_numpy()
    source_df = df.loc[~first_P_class][['x-coordinate', 'y-coordinate']]
    source_array = source_df.to_numpy()
    nearest_point = fastdist.matrix_to_matrix_distance(source_array, target_array, fastdist.euclidean, "euclidean").argmin(axis=1)
    df['index'] = -1
    df.loc[source_df.index, 'index'] = nearest_point
    return df
Benchmarks
Each method was run ten times, with various sizes of dataframe, in random order. Here are the results of the benchmark. Note that both the X and Y axes are log-scale.
I didn't benchmark fastdist or the original method for more than 30,000 points, because it took too long.
The fastest methods, in this benchmark, are the cdist method, for fewer than 1000 points, and KD Tree method, for more than 1000 points. At 250K points, the fastest solution is the KD Tree, taking only 0.2 seconds.
I am writing a program to discretize a set of attributes via entropy discretization. The goal is to parse the dataset
A,Class
5,1
12.5,1
11.5,2
8.6,2
7,1
6,1
5.9,2
1.5,2
9,2
7.8,1
2.1,1
13.5,2
12.45,2
Into
A,Class
1,1
3,1
3,2
2,2
2,1
2,1
1,2
1,2
3,2
2,1
1,1
3,2
3,2
The specific problem that I am facing with my program is determining the number of classes in my dataset. This takes place at numberOfClasses = s['Class'].value_counts(). I would like to use a pandas method to return the number of distinct classes. In this example there are only two. However I get back
Number of classes: 2 5
1 4
From the print statement.
import pandas as pd
import numpy as np
import entropy_based_binning as ebb
from math import log2

def main():
    df = pd.read_csv('S1.csv')
    s = df
    s = entropy_discretization(s)

# This method discretizes s A1
# If the information gain is 0, i.e the number of
# distinct class is 1 or
# If min f/ max f < 0.5 and the number of distinct values is floor(n/2)
# Then that partition stops splitting.
def entropy_discretization(s):
    informationGain = {}
    # while(uniqueValue(s)):
    # Step 1: pick a threshold
    threshold = 6
    # Step 2: Partition the data set into two partitions
    s1 = s[s['A'] < threshold]
    print("s1 after splitting")
    print(s1)
    print("******************")
    s2 = s[s['A'] >= threshold]
    print("s2 after splitting")
    print(s2)
    print("******************")
    # Step 3: calculate the information gain.
    informationGain = information_gain(s1,s2,s)
    print(informationGain)
    # # Step 5: calculate the max information gain
    # minInformationGain = min(informationGain)
    # # Step 6: keep the partitions of S based on the value of threshold_i
    # s = bestPartition(minInformationGain, s)

def uniqueValue(s):
    # if all records in s share the same value of A, return False
    if s.nunique()['A'] == 1:
        return False
    # otherwise return True
    else:
        return True

def bestPartition(maxInformationGain):
    # determine the best threshold_i
    threshold_i = 6
    return

def information_gain(s1, s2, s):
    # calculate cardinality for s1
    cardinalityS1 = len(pd.Index(s1['A']).value_counts())
    print(f'The Cardinality of s1 is: {cardinalityS1}')
    # calculate cardinality for s2
    cardinalityS2 = len(pd.Index(s2['A']).value_counts())
    print(f'The Cardinality of s2 is: {cardinalityS2}')
    # calculate cardinality of s
    cardinalityS = len(pd.Index(s['A']).value_counts())
    print(f'The Cardinality of s is: {cardinalityS}')
    # calculate informationGain
    informationGain = (cardinalityS1/cardinalityS) * entropy(s1) + (cardinalityS2/cardinalityS) * entropy(s2)
    print(f'The total informationGain is: {informationGain}')
    return informationGain

def entropy(s):
    # calculate the number of classes in s
    numberOfClasses = s['Class'].value_counts()
    print(f'Number of classes: {numberOfClasses}')
    # TODO calculate pi for each class.
    # calculate the frequency of class_i in S1
    p1 = 2/4
    p2 = 3/4
    ent = -(p1*log2(p2)) - (p2*log2(p2))
    return ent

main()
Ideally, I'd like to print Number of classes: 2. This way I can loop over the classes and calculate the frequencies for the attribute A from the dataset. I've reviewed the pandas documentation, but I got stuck at value_counts(). Any help would be greatly appreciated.
Maybe try:
number_of_classes = len(s['Class'].unique())
which will return the number of unique classes in the column Class.
Or even shorter:
s['Class'].nunique()
I have a pandas df containing weights. Rows contain dates and columns contain asset names. Every row sums to 1.
I want to run
df_with_stocks_weight.apply(rescale_w, weight_min=0.01, weight_max=0.30)
in order to change so that weights still sum to 1 but have min value 1% and max value 30%. I tried using the function below, but I get problems with the index: The calculated values are correct but the output refers to the wrong asset!
import pandas as pd

def rescale_w(row_input, weight_min, weight_max):
    '''
    :param row_input: a row from a pandas df
    :param weight_min: the floor. type float.
    :param weight_max: the cap. type float.
    :return: a pandas row where weights are adjusted to specify min max.
    step 1:
    while any asset has weight above weight_max,
    set that asset's weight to == weight_max
    and distribute the leftovers to all other assets (whose weight are >0)
    in accordance with their weight.
    step 2:
    if there is a positive weight below min_weight,
    force it to == min_weight
    by stealing from every other asset
    (except those whose weight == max_weight).
    note that the function produces strange output with few assets.
    for example with 3 assets and max 30% the sum is 0.90
    and if A=50% B=20% and one other asset is 1% then
    these are not practical problems as we will analyze on data with many assets.
    '''
    # rename
    w1 = row_input
    # na
    # script returned many errors regarding na
    # so i added a fillna(0) here.
    # if that will be the final solution, some cleaning up can be done
    # eg remove _null objects and remove some assertions.
    w1 = w1.fillna(0)
    # remove zeroes to get a faster script
    w1nz = w1[w1 > 0]
    w1z = w1[w1 == 0]
    assert len(w1) == len(w1nz) + len(w1z)
    assert set(w1nz.index).intersection(set(w1z.index)) == set()
    # input must sum to 1
    assert abs(w1nz.sum()-1) < 0.001
    # only execute if there is at least one notnull value
    # below will work with nz
    if len(w1nz) > 0:
        # step 1: make sure upper threshold is satisfied
        while max(w1nz) > weight_max:
            # clip at 30%
            w2 = w1nz.clip(upper=weight_max)
            # calc leftovers from this upper clip
            leftover_upper = 1 - w2.sum()
            # add leftovers to the untouched, in accordance with weight
            w2_touched = w2[w2 == weight_max]
            w2_unt = w2[(weight_max > w2) & (w2 > 0)]
            w2_unt_added = w2_unt + leftover_upper * w2_unt / w2_unt.sum()
            # concat all back
            w3 = pd.concat([w2_touched, w2_unt_added], axis=0)
            # same index for output and input
            #w3 = w3.reindex(w1nz.index) # todo: now trying to remove .reindex everywhere, to see whether pandas handles it automatically
            # rename w3 so that it works in a while loop
            w1nz = w3
        usestep2 = False
        if usestep2:
            # step 2: make sure lower threshold is satisfied
            if min(w1nz) < weight_min:
                # three parts: lower, middle, upper.
                # those in "lower" will receive from those in "middle"
                upper = w1nz[w1nz >= weight_max]
                middle = w1nz[(w1nz > weight_min) & (w1nz < weight_max)]
                lower = w1nz[w1nz <= weight_min]
                # assert len
                assert (len(upper) + len(middle) + len(lower) == len(w1nz))
                # change lower to == weight_min
                lower_modified = lower.clip(lower=weight_min)
                # the weights given to "lower" are stolen from "middle"
                stolen_weigths = lower_modified.sum() - lower.sum()
                middle_modified = middle - stolen_weigths * middle / middle.sum()
                # concat
                w4 = pd.concat([lower_modified,
                                middle_modified,
                                upper], axis=0)
                # reindex
                #w4 = w4.reindex(w1nz.index)
                # rename
                w1nz = w4
    # lastly, concat adjusted nonzero with zero.
    w1adj = pd.concat([w1nz, w1z], axis=0)
    w1adj = w1adj.reindex(w1.index) # works?
    assert (w1adj.index == w1.index).all()
    # the comparison belongs outside abs(); abs(x - 1 < 0.001) is always truthy or 0
    assert abs(w1adj.sum() - 1) < 0.001
    return (w1adj)
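One way to avoid index-ordering problems altogether is to never split and re-concatenate the row, and to update the weights in place through boolean masks instead. The sketch below covers only the cap (step 1) under that assumption; cap_weights is a hypothetical name and this is not a drop-in replacement for rescale_w. Note also that DataFrame.apply passes columns to the function by default, so with dates as rows the call needs axis=1 for each date's weights to arrive as one Series.

```python
import pandas as pd


def cap_weights(row, weight_max):
    """Cap weights at weight_max and hand the excess to the uncapped
    positive weights in proportion to their size. Index order is kept."""
    w = row.fillna(0.0).astype(float).copy()
    positive = w > 0
    if not positive.any():
        return w  # nothing to rescale
    if positive.sum() * weight_max < 1 - 1e-12:
        raise ValueError("too few assets to satisfy the cap and sum to 1")
    while (w > weight_max + 1e-12).any():
        capped = w >= weight_max        # includes weights capped earlier
        w.loc[capped] = weight_max
        free = positive & ~capped       # positive weights still below the cap
        leftover = 1.0 - w.sum()
        w.loc[free] += leftover * w.loc[free] / w.loc[free].sum()
    return w


# Made-up weights: rows are dates, columns are assets
weights = pd.DataFrame(
    [[0.5, 0.3, 0.1, 0.1], [0.25, 0.25, 0.25, 0.25]],
    index=['2020-01-01', '2020-01-02'],
    columns=['A', 'B', 'C', 'D'],
)
capped = weights.apply(cap_weights, axis=1, weight_max=0.30)
print(capped)
```

Because every update goes through a mask on the same Series, each value stays attached to its original asset label and no reindex is needed.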
In NumPy I have a dataset like the one below. The first two columns are indices, and I can divide the dataset into blocks via these indices: the first block is 0 0, the second is 0 1, the third is 0 2, then 1 0, 1 1, 1 2, and so on. Each block has at least two elements. The numbers in the index columns can vary.
I need to split the dataset along these blocks 80%-20% at random, such that after the split each block has at least one element in both datasets. How could I do that?
indices | real data
|
0 0 | 43.25 665.32 ... } 1st block
0 0 | 11.234 }
0 1 ... } 2nd block
0 1 }
0 2 } 3rd block
0 2 }
1 0 } 4th block
1 0 }
1 0 }
1 1 ...
1 1
1 2
1 2
2 0
2 0
2 1
2 1
2 1
...
See how you like this. To introduce randomness, I am shuffling the entire dataset. It is the only way I have figured out to do the splitting in a vectorized way. Maybe you could simply shuffle an indexing array, but that was one indirection too many for my brain today. I have also used a structured array, for ease in extracting the blocks. First, let's create a sample dataset:
from __future__ import division
import numpy as np
# Create a sample data set
c1, c2 = 10, 5
idx1, idx2 = np.arange(c1), np.arange(c2)
idx1, idx2 = np.repeat(idx1, c2), np.tile(idx2, c1)
items = 1000
i = np.random.randint(c1*c2, size=(items - 2*c1*c2,))
d = np.random.rand(items+5)
# np.int and np.float were removed in NumPy 1.24; the built-in types work everywhere
dataset = np.empty((items+5,), [('idx1', int), ('idx2', int),
                                ('data', float)])
dataset['idx1'][:2*c1*c2] = np.tile(idx1, 2)
dataset['idx1'][2*c1*c2:-5] = idx1[i]
dataset['idx2'][:2*c1*c2] = np.tile(idx2, 2)
dataset['idx2'][2*c1*c2:-5] = idx2[i]
dataset['data'] = d
# Add blocks with only 2 and only 3 elements to test corner case
dataset['idx1'][-5:] = -1
dataset['idx2'][-5:] = [0] * 2 + [1]*3
And now the stratified sampling:
# For randomness, shuffle the entire array
np.random.shuffle(dataset)
blocks, _ = np.unique(dataset[['idx1', 'idx2']], return_inverse=True)
block_count = np.bincount(_)
where = np.argsort(_)
block_start = np.concatenate(([0], np.cumsum(block_count)[:-1]))
# If we have n elements in a block, and we assign 1 to each array, we
# are left with only n-2. If we randomly assign a fraction x of these
# to the first array, the expected ratio of items will be
# (x*(n-2) + 1) : ((1-x)*(n-2) + 1)
# Setting the ratio equal to 4 (80/20) and solving for x, we get
# x = 4/5 + 3/5/(n-2)
x = 4/5 + 3/5/(block_count - 2)
x = np.clip(x, 0, 1) # if n in (2, 3), the ratio is larger than 1
threshold = np.repeat(x, block_count)
threshold[block_start] = 1 # first item goes to A
threshold[block_start + 1] = 0 # second item goes to B
a_idx = threshold > np.random.rand(len(dataset))
A = dataset[where[a_idx]]
B = dataset[where[~a_idx]]
After running it, the split is roughly 80/20, and all blocks are represented in both arrays:
>>> len(A)
815
>>> len(B)
190
>>> np.all(np.unique(A[['idx1', 'idx2']]) == np.unique(B[['idx1', 'idx2']]))
True
Here's an alternative solution. I'm open to a code review if it is possible to implement this in a more NumPy way (without for loops). @Jamie's answer is really good; it's just that it sometimes produces skewed ratios within blocks of data.
ratio = 0.8
IDX1 = 0
IDX2 = 1
idx1s = np.arange(len(np.unique(data[:,IDX1])))
idx2s = np.arange(len(np.unique(data[:,IDX2])))
valid = None
train = None
for i1 in idx1s:
    for i2 in idx2s:
        mask = np.nonzero((data[:,IDX1] == i1) & (data[:,IDX2] == i2))
        curr_data = data[mask,:]
        np.random.shuffle(curr_data)
        start = np.min(mask)
        end = np.max(mask)
        thres = start + np.around((end - start) * ratio).astype(int)
        selected = mask < thres
        train_idx = mask[0][selected[0]]
        valid_idx = mask[0][~selected[0]]
        if train is not None:
            train = np.vstack((train,data[train_idx]))
            valid = np.vstack((valid,data[valid_idx]))
        else:
            train = data[train_idx]
            valid = data[valid_idx]
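As a possible variation on the loop above, the block membership can be computed once with np.unique and the per-block rows shuffled by position. This is a sketch under stated assumptions: data is a plain 2-D array whose first two columns are the block indices, every block has at least two rows (as the question says), and stratified_split is a hypothetical name. It still loops over blocks, but it guarantees that each block contributes at least one row to both parts.

```python
import numpy as np


def stratified_split(data, ratio=0.8, seed=0):
    """Split rows of data per block (first two columns) into two arrays,
    keeping at least one row of every block in each part."""
    rng = np.random.default_rng(seed)
    # One integer block id per row; ravel() guards against NumPy versions
    # that return the inverse with an extra dimension when axis is given
    _, block_id = np.unique(data[:, :2], axis=0, return_inverse=True)
    block_id = block_id.ravel()
    train_mask = np.zeros(len(data), dtype=bool)
    for b in np.unique(block_id):
        rows = rng.permutation(np.flatnonzero(block_id == b))
        # Closest count to the target ratio, clamped to [1, n - 1]
        n_train = int(np.clip(round(ratio * len(rows)), 1, len(rows) - 1))
        train_mask[rows[:n_train]] = True
    return data[train_mask], data[~train_mask]


# Made-up data: blocks (0,0), (0,1), (1,0) with 2, 3 and 5 rows
data = np.array([
    [0, 0, 1.0], [0, 0, 2.0],
    [0, 1, 3.0], [0, 1, 4.0], [0, 1, 5.0],
    [1, 0, 6.0], [1, 0, 7.0], [1, 0, 8.0], [1, 0, 9.0], [1, 0, 10.0],
])
train, valid = stratified_split(data)
print(len(train), len(valid))  # 7 3
```

With only two rows in a block the split is necessarily 1/1, so small blocks pull the overall ratio away from 80/20.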
I'm assuming that each block has at least two entries and also that if it has more than two you want them assigned as closely as possible to 80/20. The easiest way to do this seems to be to assign a random number to all rows, and then choose based on percentiles within each stratified sample. Say this is the data in file strat_sample.csv:
Index_1,Index_2,Data_1,Data_2
0,0,0.614583182,0.677644482
0,0,0.321384981,0.598450854
0,0,0.303029607,0.300593782
0,0,0.646010758,0.612006715
0,0,0.484572883,0.30052535
0,1,0.010625416,0.118671475
0,1,0.428967984,0.23795173
0,1,0.523440618,0.457275922
0,1,0.379612652,0.337640868
0,1,0.338180659,0.206399031
1,0,0.079386,0.890939911
1,0,0.572864624,0.725615079
1,0,0.045891404,0.300128917
1,0,0.578792198,0.100698871
1,0,0.776485138,0.475135948
1,0,0.401850419,0.784835723
1,1,0.087660923,0.497299605
1,1,0.8460978,0.825774802
1,1,0.526015021,0.581905971
1,1,0.23324672,0.299475291
Then this code (using Pandas data structures) works as desired
import numpy as np
import random as rnd
import pandas as pd
from math import ceil, floor

#sample data strat_sample.csv, contents shown above
def TreatmentOneCount(n , *args):
    #assign a minimum one to each group but as close as possible to fraction OptimalRatio in group 1.
    OptimalRatio = args[0]
    if n < 2:
        print("N too small, assignment not defined.")
        a = np.nan
    elif n == 2:
        a = 1
    else:
        """
        There are one of two numbers that are close to the target ratio, one above, the other below
        If the number above is N and it is closest to optimal, then you need to set things to N-1 to ensure both groups have at least one member (recall n>2)
        If the number below is 0 and it is closest to optimal, then you need to set things to 1 to ensure both groups have at least one member (recall n>2)
        """
        targetassigment = OptimalRatio * n
        if targetassigment - floor(targetassigment) > 0.5:
            a = min(ceil(targetassigment),n-1)
        else:
            a = max(floor(targetassigment),1)
    return a

df = pd.read_csv('strat_sample.csv', sep=',' , header=0)
#assign a random number to each entry
df['RandScore'] = np.random.uniform(0,1,df.shape[0])
# DataFrame.sort was removed from pandas; sort_values is the replacement
df.sort_values(by=['Index_1' ,'Index_2','RandScore'], inplace = True)
#Within each block assign a rank based on random number.
df['RandRank'] = df.groupby(['Index_1','Index_2'])['RandScore'].rank()
#make a group index
df['MasterIdx'] = df['Index_1'].apply(str) + df['Index_2'].apply(str)
#Store the counts for members of each block
seriestest = df.groupby('MasterIdx')['RandRank'].count()
seriestest.name = "Counts"
dftest = pd.DataFrame(seriestest)
#Add the block counts to the data
df = df.merge(dftest, how='left', left_on = 'MasterIdx', right_index= True)
#Make the actual assignments to the two groups
df['Assignment'] = (df['RandRank'] <= df['Counts'].apply(TreatmentOneCount, args = (0.8,))) * -1 + 2
#drop the helper columns; drop returns a new frame, so keep the result
df = df.drop(['MasterIdx', 'Counts', 'RandRank', 'RandScore'], axis=1)
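# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2, random_state=0)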