I am writing a program to discretize a set of attributes via entropy discretization. The goal is to parse the dataset
A,Class
5,1
12.5,1
11.5,2
8.6,2
7,1
6,1
5.9,2
1.5,2
9,2
7.8,1
2.1,1
13.5,2
12.45,2
Into
A,Class
1,1
3,1
3,2
2,2
2,1
2,1
1,2
1,2
3,2
2,1
1,1
3,2
3,2
The specific problem I am facing is determining the number of classes in my dataset. This happens at numberOfClasses = s['Class'].value_counts(). I would like to use a pandas method that returns the number of distinct classes; in this example there are only two. However, from the print statement I get back:
Number of classes: 2    5
1    4
import pandas as pd
import numpy as np
import entropy_based_binning as ebb
from math import log2

def main():
    df = pd.read_csv('S1.csv')
    s = df
    s = entropy_discretization(s)

# This method discretizes s A1
# If the information gain is 0, i.e. the number of
# distinct classes is 1, or
# if min f / max f < 0.5 and the number of distinct values is floor(n/2),
# then that partition stops splitting.
def entropy_discretization(s):
    informationGain = {}
    # while(uniqueValue(s)):
    # Step 1: pick a threshold
    threshold = 6
    # Step 2: partition the data set into two partitions
    s1 = s[s['A'] < threshold]
    print("s1 after splitting")
    print(s1)
    print("******************")
    s2 = s[s['A'] >= threshold]
    print("s2 after splitting")
    print(s2)
    print("******************")
    # Step 3: calculate the information gain.
    informationGain = information_gain(s1, s2, s)
    print(informationGain)
    # # Step 5: calculate the max information gain
    # minInformationGain = min(informationGain)
    # # Step 6: keep the partitions of S based on the value of threshold_i
    # s = bestPartition(minInformationGain, s)

def uniqueValue(s):
    # if all records in s share a single value of A, return False
    if s.nunique()['A'] == 1:
        return False
    # otherwise True
    else:
        return True

def bestPartition(maxInformationGain):
    # determine the threshold_i
    threshold_i = 6
    return

def information_gain(s1, s2, s):
    # calculate cardinality for s1
    cardinalityS1 = len(pd.Index(s1['A']).value_counts())
    print(f'The Cardinality of s1 is: {cardinalityS1}')
    # calculate cardinality for s2
    cardinalityS2 = len(pd.Index(s2['A']).value_counts())
    print(f'The Cardinality of s2 is: {cardinalityS2}')
    # calculate cardinality of s
    cardinalityS = len(pd.Index(s['A']).value_counts())
    print(f'The Cardinality of s is: {cardinalityS}')
    # calculate informationGain
    informationGain = (cardinalityS1/cardinalityS) * entropy(s1) + (cardinalityS2/cardinalityS) * entropy(s2)
    print(f'The total informationGain is: {informationGain}')
    return informationGain

def entropy(s):
    # calculate the number of classes in s
    numberOfClasses = s['Class'].value_counts()
    print(f'Number of classes: {numberOfClasses}')
    # TODO calculate pi for each class.
    # calculate the frequency of class_i in S1
    p1 = 2/4
    p2 = 3/4
    ent = -(p1*log2(p1)) - (p2*log2(p2))
    return ent

main()
Ideally, I'd like to print Number of classes: 2. This way I can loop over the classes and calculate the frequencies for the attribute A from the dataset. I've reviewed the pandas documentation, but I got stuck at value_counts(). Any help would be greatly appreciated.
Maybe try:
number_of_classes = len(s['Class'].unique())
which will return the number of unique classes in the column Class.
Or even shorter:
s['Class'].nunique()
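If you then want to loop over the classes and get each class's relative frequency (your stated goal), one minimal sketch, assuming s is the DataFrame read from S1.csv:
# a sketch, assuming s holds the data above with a 'Class' column
n_classes = s['Class'].nunique()                  # 2 distinct classes
freqs = s['Class'].value_counts(normalize=True)   # relative frequency of each class
for cls, p in freqs.items():
    print(f'class {cls}: p = {p:.3f}')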
I have a pandas dataframe similar to the one below:
Output var1 var2 var3
1 0.487981 0.297929 0.214090
1 0.945660 0.031666 0.022674
2 0.119845 0.828661 0.051495
2 0.095186 0.852232 0.052582
3 0.059520 0.053307 0.887173
3 0.091049 0.342226 0.566725
3 0.119295 0.414376 0.466329
... ... ... ... ...
Basically, I have 3 columns (propensity score values) and one output (treatment). I want to calculate the within-trio distance to find trios of outputs with the smallest within-trio distance.
The experiment is taken from the paper: "Matching by Propensity Score in Cohort Studies with Three Treatment Groups", Rassen et al. From their explanation, it seems to be like calculating the perimeter of a triangle, but I am not sure.
I think that at this GitHub link: https://github.com/bwh-dope/pharmacoepi_toolbox/blob/master/src/org/drugepi/match/MatchDistanceCalculator.java there is Java code that more or less does this, but I am not sure how to use it. I use Python, so I have two options: try to adapt this previous code or write something else.
My idea is that var1, var2 and var3 can be considered spatial x, y, z coordinates, so each output row is a point in space.
I found a function that calculates the distance between 2 points:
# found here: https://stackoverflow.com/questions/68938033/min-distance-between-point-cloud-xyz-points-in-python
import numpy as np
import itertools

distance = lambda p1, p2: np.sqrt(np.sum((p1 - p2) ** 2, axis=0))

def min_distance(cloud):
    pairs = itertools.combinations(cloud, 2)
    # materialize the distances before taking the minimum (np.min on a map object fails in Python 3)
    return min(distance(*pair) for pair in pairs)

def get_points(filename):
    with open(filename, 'r') as file:
        rows = np.genfromtxt(file, delimiter=',', skip_header=True)
    return rows

filename = 'cloud.csv'
cloud = get_points(filename)
min_dist = min_distance(cloud)
However, I want to calculate the distance among 3 points, so I think I need to sum the three pairwise distances XY, XZ and YZ, but I am not sure of this procedure.
Finally, I tried my own solution, which I think is correct, but it may be too computationally expensive.
I created my 3 datasets, according to the Output value: dataset1 = dataset[dataset["Output"]==1] and the same for Output=2 and Output=3.
This is my distance function:
def Euclidean_Dist(df1, df2):
    return np.linalg.norm(df1 - df2)
My variables:
tripletta_for = []
tripletta_tot_wr = []
p_inf = float('inf')
counter = 1
These are the steps used to compute the within-trio distance. I hope they are correct.
'''
i[0] = index
i[1] = treatment prop1
i[1][0] = treatment
i[1][1] = prop
'''
# I want to compute the distance between i[1][1], j[1][1] and k[1][1]
for i in dataset1.iterrows():
    minimum_distance = p_inf
    print(counter)
    counter = counter + 1
    for j in dataset2.iterrows():
        dist12 = Euclidean_Dist(i[1][1], j[1][1])
        for k in dataset3.iterrows():
            dist13 = Euclidean_Dist(i[1][1], k[1][1])
            dist23 = Euclidean_Dist(j[1][1], k[1][1])
            somma = dist12 + dist13 + dist23
            if somma < minimum_distance:
                minimum_distance = somma
                tripletta_for = i[0], j[0], k[0]
                # print(tripletta_for)
    dataset2.drop(index=tripletta_for[1], inplace=True)
    dataset3.drop(tripletta_for[2], inplace=True)
    # print(len(dataset3))
    tripletta_tot_wr.append(tripletta_for)
    # print(tripletta_tot_wr)
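For reference, a vectorized sketch of the same perimeter search that avoids the triple iterrows loop. It assumes scipy is available and that each group's propensity scores are extracted into (n_i, 3) arrays (the helper name and input arrays are assumptions, e.g. P1 = dataset1[['var1', 'var2', 'var3']].to_numpy()); memory is O(n1*n2*n3), so it only fits moderate sizes:
import numpy as np
from scipy.spatial.distance import cdist

# a sketch, not the paper's exact matching algorithm: find the trio
# (one row per treatment group) with the smallest perimeter
def best_trio(P1, P2, P3):
    d12 = cdist(P1, P2)   # pairwise distances between group 1 and group 2
    d13 = cdist(P1, P3)
    d23 = cdist(P2, P3)
    # perimeter[i, j, k] = d12[i, j] + d13[i, k] + d23[j, k]
    perimeter = d12[:, :, None] + d13[:, None, :] + d23[None, :, :]
    i, j, k = np.unravel_index(np.argmin(perimeter), perimeter.shape)
    return (i, j, k), perimeter[i, j, k]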
I have objects whose stored values are dataframes. I have been able to compare whether values from two dataframes are within 10% of each other. However, I am having difficulty extending this to multiple dataframes. Moreover, I am wondering how I should approach this problem if the dataframes are not the same size.
def add_well_peak(self, *other):
    if len(self.Bell) == len(other.Bell):  # if dataframes ARE the same size
        for k in range(len(self.Bell)):
            for j in range(len(other.Bell)):
                if int(self.Size[k]) - int(self.Size[k])*(1/10) <= int(other.Size[j]) <= int(self.Size[k]) + int(self.Size[k])*(1/10):
                    # average all
For example, in the image below, there are objects that contain dataframes (i.e., self, other1, other2). The colors represent matches (i.e., values that are within 10% of each other). If a match exists, average the values. If a match does not exist, still include the unmatched number. I want to be able to generalize this for any number of objects greater than or equal to 2 (other1, other2, other3, ...). Any help would be appreciated. Please let me know if anything is unclear. This is my first time posting. Thanks again.
matching data
Results:
Using my solution on the dataframes of your image, I get the following:
Threshold outlier = 0.2:
0
0 1.000000
1 1493.500000
2 5191.333333
3 35785.333333
4 43586.500000
5 78486.000000
6 100000.000000
Threshold outlier = 0.5:
0 1
0 1.000000 NaN
1 1493.500000 NaN
2 5191.333333 NaN
3 43586.500000 35785.333333
4 78486.000000 100000.000000
Explanations:
The rows are averaged peaks, and the columns hold the different values obtained for those peaks. I assumed the average coming from the largest number of elements was the legitimate one, and the rest within THRESHOLD_OUTLIER of it were the outliers (the columns are sorted: the more probable a value is as the legitimate peak, the further left it sits, so column 0 is the most probable). For instance, on row 3 of the 0.5 outlier threshold results, 43586.500000 is an average coming from 3 dataframes, while 35785.333333 comes from only 2, so the first one is the more probable.
Issues:
The solution is quite complicated. I assume a big part of it could be removed, but I can't see how at the moment, and since it works, I'll leave the optimization to you.
Still, I tried to comment it as well as I could, and if you have any questions, do not hesitate!
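Before the files, a tiny illustration of the binary-combination idea (a sketch, not part of the solution files): each grouping of a list of values is encoded as a bit string, where the 1s mark the values grouped together.
# enumerate every non-empty grouping of a 3-element list as a bit string
values = [1, 2, 3]
for i in range(1, 2**len(values)):
    bits = format(i, f"0{len(values)}b")
    group = [v for b, v in zip(bits, values) if b == "1"]
    print(bits, group)   # e.g. "101" -> [1, 3]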
Files:
CombinationLib.py
from __future__ import annotations
from typing import Dict, List
from Errors import *

class Combination():
    """
    Support class, to make things easier.
    Contains a string `self.combination` which is a binary number stored as a string.
    This allows testing every combination of values (i.e. "101" on the list `[1, 2, 3]`
    would signify grouping `1` and `3` together).
    There are some methods:
    - `__add__` overrides the `+` operator
    - `compute_degree` gives how many `1`s are in the combination
    - `overlaps` allows to verify if combinations overlap (use the same value twice)
      (i.e. `100` and `011` don't overlap, while `101` and `001` do)
    """
    def __init__(self, combination: str) -> Combination:
        self.combination: str = combination
        self.degree: int = self.compute_degree()

    def __add__(self, other: Combination) -> Combination:
        if self.combination == None:
            return other.copy()
        if other.combination == None:
            return self.copy()
        if self.overlaps(other):
            raise CombinationsOverlapError()
        result = ""
        for c1, c2 in zip(self.combination, other.combination):
            result += "1" if (c1 == "1" or c2 == "1") else "0"
        return Combination(result)

    def __str__(self) -> str:
        return self.combination

    def compute_degree(self) -> int:
        if self.combination == None:
            return 0
        degree = 0
        for bit in self.combination:
            if bit == "1":
                degree += 1
        return degree

    def copy(self) -> Combination:
        return Combination(self.combination)

    def overlaps(self, other: Combination) -> bool:
        for c1, c2 in zip(self.combination, other.combination):
            if c1 == "1" and c1 == c2:
                return True
        return False

class CombinationNode():
    """
    The main class.
    The main idea was to build a tree of possible "combinations of combinations":
    100-011 => 111
    |---010-001 => 111
    |---001-010 => 111
    At each node, the combination applied to the current list of values has to be acceptable
    (all within THRESHOLD_AVERAGING).
    Also, the shorter a path, the better the solution, as it means it found a way to average
    a lot of the values, with the minimum amount of outliers possible, maybe by grouping
    the outliers together in a way that makes sense, ...
    - `populate` fills the tree automatically, with every solution possible
    - `path` is used mainly on leaves, to obtain the path taken to arrive there.
    """
    def __init__(self, combination: Combination) -> CombinationNode:
        self.combination: Combination = combination
        self.children: List[CombinationNode] = []
        self.parent: CombinationNode = None
        self.total_combination: Combination = combination

    def __str__(self) -> str:
        list_paths = self.recur_paths()
        list_paths = [",".join([combi.combination.combination for combi in path]) for path in list_paths]
        return "\n".join(list_paths)

    def add_child(self, child: CombinationNode) -> None:
        if child.combination.degree > self.combination.degree and not self.total_combination.overlaps(child.combination):
            raise ChildDegreeExceedParentDegreeError(f"{child.combination} > {self.combination}")
        self.children.append(child)
        child.parent = self
        child.total_combination += self.total_combination

    def path(self) -> List[CombinationNode]:
        path = []
        current = self
        while current.parent != None:
            path.append(current)
            current = current.parent
        path.append(current)
        return path[::-1]

    def populate(self, combination_dict: Dict[int, List[Combination]]) -> None:
        missing_degrees = len(self.combination.combination) - self.total_combination.degree
        if missing_degrees == 0:
            return
        for i in range(min(self.combination.degree, missing_degrees), 0, -1):
            for combination in combination_dict[i]:
                if not self.total_combination.overlaps(combination):
                    self.add_child(CombinationNode(combination))
        for child in self.children:
            child.populate(combination_dict)

    def recur_paths(self) -> List[List[CombinationNode]]:
        if len(self.children) == 0:
            return [self.path()]
        paths = []
        for child in self.children:
            for path in child.recur_paths():
                paths.append(path)
        return paths
Errors.py
class ChildDegreeExceedParentDegreeError(Exception):
    pass

class CombinationsOverlapError(Exception):
    pass

class ToImplementError(Exception):
    pass

class UncompletePathError(Exception):
    pass
main.py
from typing import Dict, List, Set, Tuple, Union
import pandas as pd
from CombinationLib import *

best_depth: int = -1
best_path: List[CombinationNode] = []
THRESHOLD_OUTLIER = 0.2
THRESHOLD_AVERAGING = 0.1

def verif_averaging_pct(combination: Combination, values: List[float]) -> bool:
    """
    For a given combination of values, we must have all the values within
    THRESHOLD_AVERAGING of the average of the combination
    """
    avg = 0
    for c, v in zip(combination.combination, values):
        if c == "1":
            avg += v
    avg /= combination.degree
    for c, v in zip(combination.combination, values):
        if c == "1" and (v > avg*(1+THRESHOLD_AVERAGING) or v < avg*(1-THRESHOLD_AVERAGING)):
            return False
    return True

def recursive_check(node: CombinationNode, depth: int, values: List[Union[float, int]]) -> None:
    """
    Here is where we preferentially ask for a small number of bigger groups
    """
    global best_depth
    global best_path
    # If there are more groups than the current best way to do it, stop
    if best_depth != -1 and depth > best_depth:
        return
    # If all the values of the combination are not within THRESHOLD_AVERAGING, stop
    if not verif_averaging_pct(node.combination, values):
        return
    # If we finished the list of combinations, and this way is the best, keep it, stop
    if len(node.children) == 0:
        if best_depth == -1 or depth < best_depth:
            best_depth = depth
            best_path = node.path()
        return
    # If we are still not finished (not every value has been used), continue
    for cnode in node.children:
        recursive_check(cnode, depth+1, values)

def groups_from_list(values: List[Union[float, int]]) -> List[List[Union[float, int]]]:
    """
    From a list of values, get the smallest list of groups of elements
    within THRESHOLD_AVERAGING of each other.
    It implies that we will try and recursively find the biggest group possible
    within the unused values (i.e. groups with combinations of size [3, 1] are preferred
    over [2, 2])
    """
    global best_depth
    global best_path
    groups: List[List[float]] = []
    # Generate all the combinations (I used binary for this)
    combination_dict: Dict[int, List[Combination]] = {}
    for i in range(1, 2**len(values)):
        combination = format(i, f"0{len(values)}b")  # Here is the binary conversion
        counter = 0
        for c in combination:
            if c == "1":
                counter += 1
        if counter not in combination_dict:
            combination_dict[counter] = []
        combination_dict[counter].append(Combination(combination))
    # Generate all the combinations of combinations that use all values (without using one twice)
    combination_trees: List[List[CombinationNode]] = []
    for key in combination_dict:
        for combination in combination_dict[key]:
            cn = CombinationNode(combination)
            cn.populate(combination_dict)
            combination_trees.append(cn)
    best_depth = -1
    best_path = None
    for root in combination_trees:
        recursive_check(root, 0, values)
    # print(",".join([combination.combination.combination for combination in best_path]))
    for combination in best_path:
        temp = []
        for c, v in zip(combination.combination.combination, values):
            if c == "1":
                temp.append(v)
        groups.append(temp)
    return groups

def averages_from_groups(gs: List[List[Union[float, int]]]) -> List[float]:
    """Computing the averages of each group"""
    avgs: List[float] = []
    for group in gs:
        avg = 0
        for elt in group:
            avg += elt
        avg /= len(group)
        avgs.append(avg)
    return avgs

def end_check(ds: List[pd.DataFrame], ids: List[int]) -> bool:
    """Check if we finished consuming all the dataframes"""
    for d, i in zip(ds, ids):
        if i < len(d[0]):
            return False
    return True

def search(group: List[Union[float, int]], values_list: List[Union[float, int]]) -> List[int]:
    """Obtain all the indices corresponding to a set of values"""
    # We will get all the indices in values_list of the values in group
    # If a value is present in group, all the occurrences of this value will be too,
    # so we can use a set and search every occurrence of each value.
    indices: List[int] = []
    group_set = set(group)
    for value in group_set:
        for i, v in enumerate(values_list):
            if value == v:
                indices.append(i)
    return indices

def threshold_grouper(total_list: List[Union[float, int]]) -> pd.DataFrame:
    """Building a 2D pd.DataFrame with the averages (x) and the outliers (y)"""
    result_list: List[List[Union[float, int]]] = [[total_list[0]]]
    result_index = 0
    total_index = 1
    while total_index < len(total_list):
        # Only checking if the bigger one is within THRESHOLD_OUTLIER of the little one.
        # If it is the case, the opposite is true too.
        # If yes, it is an outlier
        if result_list[result_index][0]*(1+THRESHOLD_OUTLIER) >= total_list[total_index]:
            result_list[result_index].append(total_list[total_index])
        # Else it is a new peak
        else:
            result_list.append([total_list[total_index]])
            result_index += 1
        total_index += 1
    result: pd.DataFrame = pd.DataFrame(result_list)
    return result

def dataframes_merger(dataframes: List[pd.DataFrame]) -> pd.DataFrame:
    """Merging the dataframes, with THRESHOLDS"""
    # Store the averages for the within-10% cells, in ascending order
    result = []
    # Keep tabs on where we are regarding each dataframe (needed for when we skip cells)
    curr_indices: List[int] = [0 for _ in range(len(dataframes))]
    # Repeat until all the cells in every dataframe have been seen once
    while not end_check(dataframes, curr_indices):
        # Get the values of the current indices in the dataframes
        curr_values = [dataframe[0][i] for dataframe, i in zip(dataframes, curr_indices)]
        # Get the largest 10% groups from the current list of values
        groups = groups_from_list(curr_values)
        # Compute the average of these groups
        avgs = averages_from_groups(groups)
        # Obtain the minimum average...
        avg_min = min(avgs)
        # ... and its index
        avg_min_index = avgs.index(avg_min)
        # Then get the group corresponding to the minimum average
        avg_min_group = groups[avg_min_index]
        # Get the indices of the values included in this group
        indices_to_increment = search(avg_min_group, curr_values)
        # Add the average to the result merged list
        result.append(avg_min)
        # For every element in the average we added, increment the corresponding index
        for index in indices_to_increment:
            curr_indices[index] += 1
    # Re-assemble the dataframe, taking the threshold% around average into account
    result = threshold_grouper(result)
    print(result)

df1 = pd.DataFrame([1, 1487, 5144, 35293, 78486, 100000])
df2 = pd.DataFrame([1, 1500, 5144, 36278, 45968, 100000])
df3 = pd.DataFrame([1, 5286, 35785, 41205, 100000])
dataframes_merger([df3, df2, df1])
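As a quick sanity check of the Combination helpers from CombinationLib.py above, a small sketch (assuming the file is importable from the working directory):
# assumes CombinationLib.py from above is on the path
from CombinationLib import Combination

a = Combination("100")
b = Combination("011")
print(a.overlaps(b))        # False: they use disjoint values
print((a + b).combination)  # "111": the merged grouping covers every value
print(b.degree)             # 2: the number of values grouped by "011"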
I am writing a program to discretize a set of attributes via entropy discretization, parsing the same dataset into the same binned form as in the first question above.
The specific problem that I am facing with my program is determining the frequencies of 1 and 2 in the class column.
df = s['Class']
df['freq'] = df.groupby('Class')['Class'].transform('count')
print("*****************")
print(df['freq'])
I would like to use a pandas method to return the frequency of 1 and 2 so that I can calculate probabilities p1 and p2.
(The program is unchanged from the first question above, except for the entropy function:)
def entropy(s):
    # calculate the number of classes in s
    numberOfClasses = s['Class'].nunique()
    print(f'Number of classes: {numberOfClasses}')
    # TODO calculate pi for each class.
    # calculate the frequency of class_i in S1
    value_counts = s['Class'].value_counts()
    print(f'value_counts : {value_counts}')
    df = s['Class']
    df['freq'] = df.groupby('Class')['Class'].transform('count')
    print("*****************")
    print(df['freq'])
    # p1 = s.groupby('Class').count()
    # p2 = s.groupby('Class').count()
    # print(f'p1: {p1}')
    # print(f'p2: {p2}')
    p1 = 2/4
    p2 = 3/4
    ent = -(p1*log2(p1)) - (p2*log2(p2))
    return ent
Ideally, I'd like to print Number of classes: 2. This way I can loop over the classes and calculate the frequencies for the attribute Class from the dataset. I've reviewed the pandas documentation, but I got stuck trying to count the frequency of 1 and 2 in the Class column.
Use value_counts:
>>> df.value_counts('Class')
Class
2 7
1 6
dtype: int64
Update:
How do you get the individual frequencies returned from the value_counts method?
counts = df.value_counts('Class')
print(counts[1]) # Freq of 1
6
print(counts[2]) # Freq of 2
7
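To go from those counts to the probabilities p1 and p2 (and the entropy), a minimal sketch, assuming df is the frame read from S1.csv:
from math import log2

probs = df['Class'].value_counts(normalize=True)  # relative frequencies p_i
ent = -sum(p * log2(p) for p in probs)
print(ent)  # ~0.9957 for the 7-vs-6 split above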
I have a pandas df containing weights. Rows contain dates and columns contain asset names. Every row sums to 1.
I want to run
df_with_stocks_weight.apply(rescale_w, weight_min=0.01, weight_max=0.30)
in order to change so that weights still sum to 1 but have min value 1% and max value 30%. I tried using the function below, but I get problems with the index: The calculated values are correct but the output refers to the wrong asset!
def rescale_w(row_input, weight_min, weight_max):
    '''
    :param row_input: a row from a pandas df
    :param weight_min: the floor. type float.
    :param weight_max: the cap. type float.
    :return: a pandas row where weights are adjusted to the specified min and max.

    step 1:
    while any asset has weight above weight_max,
    set that asset's weight to == weight_max
    and distribute the leftovers to all other assets (whose weights are > 0)
    in accordance with their weight.

    step 2:
    if there is a positive weight below min_weight,
    force it to == min_weight
    by stealing from every other asset
    (except those whose weight == max_weight).

    note that the function produces strange output with few assets:
    for example, with 3 assets and max 30% the sum is 0.90,
    and similarly if A=50%, B=20% and one other asset is 1%.
    these are not practical problems, as we will analyze data with many assets.
    '''
    # rename
    w1 = row_input
    # na
    # the script returned many errors regarding na,
    # so I do a fillna(0) here.
    # if this is the final solution, some cleaning up can be done,
    # e.g. remove _null objects and remove some assertions.
    w1 = w1.fillna(0)
    # remove zeroes to get a faster script
    w1nz = w1[w1 > 0]
    w1z = w1[w1 == 0]
    assert len(w1) == len(w1nz) + len(w1z)
    assert set(w1nz.index).intersection(set(w1z.index)) == set()
    # input must sum to 1
    assert abs(w1nz.sum() - 1) < 0.001
    # only execute if there is at least one notnull value
    # below will work with nz
    if len(w1nz) > 0:
        # step 1: make sure the upper threshold is satisfied
        while max(w1nz) > weight_max:
            # clip at 30%
            w2 = w1nz.clip(upper=weight_max)
            # calc leftovers from this upper clip
            leftover_upper = 1 - w2.sum()
            # add leftovers to the untouched, in accordance with weight
            w2_touched = w2[w2 == weight_max]
            w2_unt = w2[(weight_max > w2) & (w2 > 0)]
            w2_unt_added = w2_unt + leftover_upper * w2_unt / w2_unt.sum()
            # concat all back
            w3 = pd.concat([w2_touched, w2_unt_added], axis=0)
            # same index for output and input
            #w3 = w3.reindex(w1nz.index)  # TODO: currently trying to remove .reindex everywhere; checking whether pandas resolves the index automatically
            # rename w3 so that it works in a while loop
            w1nz = w3
        usestep2 = False
        if usestep2:
            # step 2: make sure the lower threshold is satisfied
            if min(w1nz) < weight_min:
                # three parts: lower, middle, upper.
                # those in "lower" will receive from those in "middle"
                upper = w1nz[w1nz >= weight_max]
                middle = w1nz[(w1nz > weight_min) & (w1nz < weight_max)]
                lower = w1nz[w1nz <= weight_min]
                # assert len
                assert (len(upper) + len(middle) + len(lower) == len(w1nz))
                # change lower to == weight_min
                lower_modified = lower.clip(lower=weight_min)
                # the weight given to "lower" is stolen from "middle"
                stolen_weights = lower_modified.sum() - lower.sum()
                middle_modified = middle - stolen_weights * middle / middle.sum()
                # concat
                w4 = pd.concat([lower_modified,
                                middle_modified,
                                upper], axis=0)
                # reindex
                #w4 = w4.reindex(w1nz.index)
                # rename
                w1nz = w4
    # lastly, concat adjusted nonzero with zero.
    w1adj = pd.concat([w1nz, w1z], axis=0)
    w1adj = w1adj.reindex(w1.index)  # works?
    assert (w1adj.index == w1.index).all()
    assert abs(w1adj.sum() - 1) < 0.001
    return w1adj
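For what it's worth, the misalignment most likely comes from pd.concat reordering the pieces while the result is later consumed positionally. One hedged sketch of an alternative (the function name is an assumption, and it only covers the cap in step 1) is to update the Series in place through boolean masks, so the index never moves:
import pandas as pd

def rescale_cap_sketch(row_input, weight_max=0.30):
    # a minimal sketch, cap step only: no concat, so labels never move
    w = row_input.fillna(0).copy()
    while w.max() > weight_max:
        w = w.clip(upper=weight_max)
        leftover = 1 - w.sum()
        free = (w > 0) & (w < weight_max)  # assets that can still absorb weight
        if not free.any():
            break
        w[free] += leftover * w[free] / w[free].sum()
    return w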
In numpy I have a dataset like this. The first two columns are indices. I can divide my dataset into blocks via the indices, i.e. the first block is 0 0, the second block is 0 1, the third block 0 2, then 1 0, 1 1, 1 2, and so on. Each block has at least two elements. The numbers in the indices columns can vary.
I need to split the dataset along these blocks 80%/20% randomly, such that after the split each block has at least 1 element in both datasets. How could I do that?
indices | real data
|
0 0 | 43.25 665.32 ... } 1st block
0 0 | 11.234 }
0 1 ... } 2nd block
0 1 }
0 2 } 3rd block
0 2 }
1 0 } 4th block
1 0 }
1 0 }
1 1 ...
1 1
1 2
1 2
2 0
2 0
2 1
2 1
2 1
...
See how you like this. To introduce randomness, I am shuffling the entire dataset. It is the only way I have figured out how to do the splitting vectorized. Maybe you could simply shuffle an indexing array, but that was one indirection too many for my brain today. I have also used a structured array, for ease of extracting the blocks. First, let's create a sample dataset:
from __future__ import division
import numpy as np

# Create a sample data set
c1, c2 = 10, 5
idx1, idx2 = np.arange(c1), np.arange(c2)
idx1, idx2 = np.repeat(idx1, c2), np.tile(idx2, c1)
items = 1000
i = np.random.randint(c1*c2, size=(items - 2*c1*c2,))
d = np.random.rand(items+5)
dataset = np.empty((items+5,), [('idx1', int), ('idx2', int),
                                ('data', float)])
dataset['idx1'][:2*c1*c2] = np.tile(idx1, 2)
dataset['idx1'][2*c1*c2:-5] = idx1[i]
dataset['idx2'][:2*c1*c2] = np.tile(idx2, 2)
dataset['idx2'][2*c1*c2:-5] = idx2[i]
dataset['data'] = d
# Add blocks with only 2 and only 3 elements to test the corner case
dataset['idx1'][-5:] = -1
dataset['idx2'][-5:] = [0]*2 + [1]*3
And now the stratified sampling:
# For randomness, shuffle the entire array
np.random.shuffle(dataset)
blocks, _ = np.unique(dataset[['idx1', 'idx2']], return_inverse=True)
block_count = np.bincount(_)
where = np.argsort(_)
block_start = np.concatenate(([0], np.cumsum(block_count)[:-1]))
# If we have n elements in a block, and we assign 1 to each array, we
# are left with only n-2. If we randomly assign a fraction x of these
# to the first array, the expected ratio of items will be
# (x*(n-2) + 1) : ((1-x)*(n-2) + 1)
# Setting the ratio equal to 4 (80/20) and solving for x, we get
# x = 4/5 + 3/5/(n-2)
x = 4/5 + 3/5/(block_count - 2)
x = np.clip(x, 0, 1) # if n in (2, 3), the ratio is larger than 1
threshold = np.repeat(x, block_count)
threshold[block_start] = 1 # first item goes to A
threshold[block_start + 1] = 0 # second item goes to B
a_idx = threshold > np.random.rand(len(dataset))
A = dataset[where[a_idx]]
B = dataset[where[~a_idx]]
After running it, the split is roughly 80/20, and all blocks are represented in both arrays:
>>> len(A)
815
>>> len(B)
190
>>> np.all(np.unique(A[['idx1', 'idx2']]) == np.unique(B[['idx1', 'idx2']]))
True
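As a quick check of the expected-ratio formula in the comments: for a block of n = 7, x = 4/5 + 3/5/(7-2) = 0.92, so the expected counts are (0.92*5 + 1) : (0.08*5 + 1) = 5.6 : 1.4, which is exactly the desired 4:1 (80/20) ratio.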
Here's an alternative solution. I'm open to a code review if it is possible to implement this in a more numpy way (without for loops). @Jamie's answer is really good; it's just that sometimes it produces skewed ratios within blocks of data.
ratio = 0.8
IDX1 = 0
IDX2 = 1
idx1s = np.arange(len(np.unique(data[:, IDX1])))
idx2s = np.arange(len(np.unique(data[:, IDX2])))
valid = None
train = None
for i1 in idx1s:
    for i2 in idx2s:
        # indices of the rows belonging to the current block
        mask = np.nonzero((data[:, IDX1] == i1) & (data[:, IDX2] == i2))[0]
        curr_data = data[mask, :]
        np.random.shuffle(curr_data)
        start = np.min(mask)
        end = np.max(mask)
        thres = start + np.around((end - start) * ratio).astype(int)
        selected = mask < thres
        train_idx = mask[selected]
        valid_idx = mask[~selected]
        if train is not None:
            train = np.vstack((train, data[train_idx]))
            valid = np.vstack((valid, data[valid_idx]))
        else:
            train = data[train_idx]
            valid = data[valid_idx]
I'm assuming that each block has at least two entries and also that if it has more than two you want them assigned as closely as possible to 80/20. The easiest way to do this seems to be to assign a random number to all rows, and then choose based on percentiles within each stratified sample. Say this is the data in file strat_sample.csv:
Index_1,Index_2,Data_1,Data_2
0,0,0.614583182,0.677644482
0,0,0.321384981,0.598450854
0,0,0.303029607,0.300593782
0,0,0.646010758,0.612006715
0,0,0.484572883,0.30052535
0,1,0.010625416,0.118671475
0,1,0.428967984,0.23795173
0,1,0.523440618,0.457275922
0,1,0.379612652,0.337640868
0,1,0.338180659,0.206399031
1,0,0.079386,0.890939911
1,0,0.572864624,0.725615079
1,0,0.045891404,0.300128917
1,0,0.578792198,0.100698871
1,0,0.776485138,0.475135948
1,0,0.401850419,0.784835723
1,1,0.087660923,0.497299605
1,1,0.8460978,0.825774802
1,1,0.526015021,0.581905971
1,1,0.23324672,0.299475291
Then this code (using Pandas data structures) works as desired
import numpy as np
import random as rnd
import pandas as pd
from math import floor, ceil

# sample data in strat_sample.csv, contents above

def TreatmentOneCount(n, *args):
    # assign a minimum of one to each group, but as close as possible to fraction OptimalRatio in group 1.
    OptimalRatio = args[0]
    if n < 2:
        print("N too small, assignment not defined.")
        a = float('nan')
    elif n == 2:
        a = 1
    else:
        """
        There is one of two numbers that is close to the target ratio, one above, the other below.
        If the number above is N and it is closest to optimal, then you need to set things to N-1 to ensure both groups have at least one member (recall n > 2).
        If the number below is 0 and it is closest to optimal, then you need to set things to 1 to ensure both groups have at least one member (recall n > 2).
        """
        targetassignment = OptimalRatio * n
        if targetassignment - floor(targetassignment) > 0.5:
            a = min(ceil(targetassignment), n-1)
        else:
            a = max(floor(targetassignment), 1)
    return a

df = pd.read_csv('strat_sample.csv', sep=',', header=0)
# assign a random number to each entry
df['RandScore'] = np.random.uniform(0, 1, df.shape[0])
df.sort_values(by=['Index_1', 'Index_2', 'RandScore'], inplace=True)
# Within each block assign a rank based on the random number.
df['RandRank'] = df.groupby(['Index_1', 'Index_2'])['RandScore'].rank()
# make a group index
df['MasterIdx'] = df['Index_1'].apply(str) + df['Index_2'].apply(str)
# Store the counts for members of each block
seriestest = df.groupby('MasterIdx')['RandRank'].count()
seriestest.name = "Counts"
dftest = pd.DataFrame(seriestest)
# Add the block counts to the data
df = df.merge(dftest, how='left', left_on='MasterIdx', right_index=True)
# Make the actual assignments to the two groups
df['Assignment'] = (df['RandRank'] <= df['Counts'].apply(TreatmentOneCount, args=(0.8,))) * -1 + 2
df.drop(['MasterIdx', 'Counts', 'RandRank', 'RandScore'], axis=1)
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2, random_state=0)
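Note that train_test_split also accepts a stratify argument, which keeps the block proportions across the split. A hedged sketch, assuming the data sits in a DataFrame df with block labels in columns 'idx1' and 'idx2' (names assumed) and each block has at least two rows; for very small blocks this keeps the proportions but does not strictly guarantee one element of every block in each part:
from sklearn.model_selection import train_test_split

# combine the two index columns into a single block label per row
block_label = df['idx1'].astype(str) + '_' + df['idx2'].astype(str)
train_df, test_df = train_test_split(df, test_size=0.2, random_state=0,
                                     stratify=block_label)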