I have a DataFrame like this:
   Interesting      genre_1  probabilities
1           no        Empty       0.251306
2          yes        Empty       0.042043
3           no  Alternative       5.871099
4          yes  Alternative       5.723896
5           no        Blues       0.027028
6          yes        Blues       0.120248
7           no   Children's       0.207213
8          yes   Children's       0.426679
9           no    Classical       0.306316
10         yes    Classical       1.044135
I would like to compute the Gini index for each category based on the Interesting column, and then add that value as a new pandas column.
This is the function to get the Gini index:
# Gini function
# a and b are the quantities of each class
def gini(a, b):
    a1 = (a/(a+b))**2
    b1 = (b/(a+b))**2
    return 1 - (a1 + b1)
EDIT: Sorry, I had an error in my desired final DataFrame. Whether a row is interesting or not matters when choosing prob(A) and prob(B), but the Gini score will be the same for both rows, because it measures how much impurity we get when classifying a song as interesting or not. So if the probabilities are around 50/50, the Gini score reaches its maximum (0.5), because it is equally likely to be mistaken when choosing interesting or not.
So for the first two rows, the Gini index will be:
a=no; b=Empty -> gini(0.251306, 0.042043)= 0.245559831601612
a=yes; b=Empty -> gini(0.042043, 0.251306)= 0.245559831601612
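Just to make the numbers concrete, here is a quick check with the gini function defined above:
print(gini(0.251306, 0.042043))   # 0.245559831601612
print(gini(0.042043, 0.251306))   # 0.245559831601612, the order does not change the result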
Then I would like to get something like:
   Interesting      genre_1  percentages         GINI INDEX
1           no        Empty     0.251306  0.245559831601612
2          yes        Empty     0.042043  0.245559831601612
3           no  Alternative     5.871099  0.4999194135183881
4          yes  Alternative     5.723896  0.4999194135183881
5           no        Blues     0.027028  ..
6          yes        Blues     0.120248  ..
7           no   Children's     0.207213  ..
8          yes   Children's     0.426679  ..
9           no    Classical     0.306316  ..
10         yes    Classical     1.044135  ..
OK, I think I know what you mean. The code below does not care whether the Interesting value is 'yes' or 'no'. What you want is to calculate the GINI coefficient in two different ways for each row, based on the Interesting value of that row: if Interesting == 'no', then the result is 0.5, because a == b, but if Interesting is 'yes', then you need to use a = probability[i] and b = probability[i+1]. So skip this section and see the updated code below.
import pandas as pd

df = pd.read_csv('df.txt', delim_whitespace=True)
probs = df['probabilities']

def ROLLING_GINI(probabilities):
    # first row: a == b, so this always yields 0.5
    a1 = (probabilities[0]/(probabilities[0]+probabilities[0]))**2
    b1 = (probabilities[0]/(probabilities[0]+probabilities[0]))**2
    res = 1 - (a1 + b1)
    yield res
    # remaining rows: pair each probability with the next one
    for i in range(len(probabilities)-1):
        a1 = (probabilities[i]/(probabilities[i]+probabilities[i+1]))**2
        b1 = (probabilities[i+1]/(probabilities[i]+probabilities[i+1]))**2
        res = 1 - (a1 + b1)
        yield res

df['GINI'] = [val for val in ROLLING_GINI(probs)]
print(df)
This is where the real trouble starts, because if I understand your idea correctly, then you cannot calculate the last GINI value, because your dataframe won't allow it. The important bit here is that the last Interesting value in your dataframe is 'yes'. This means I have to use a = probability[i] and b = probability[i+1]. But your dataframe doesn't have a row number 11. You have 10 rows and on row i == 10, you'd need a probability in row 11 to calculate a GINI coefficient. So in order for your idea to work, the last Interesting value MUST be 'no', otherwise you will always get an index error.
Here's the code anyways:
import pandas as pd

df = pd.read_csv('df.txt', delim_whitespace=True)

def ROLLING_GINI(dataframe):
    probabilities = dataframe['probabilities']
    how_to_calculate = dataframe['Interesting']
    for i in range(len(dataframe)-1):
        if how_to_calculate[i] == 'yes':
            # pair this probability with the next row's probability
            a1 = (probabilities[i]/(probabilities[i]+probabilities[i+1]))**2
            b1 = (probabilities[i+1]/(probabilities[i]+probabilities[i+1]))**2
            res = 1 - (a1 + b1)
            yield res
        elif how_to_calculate[i] == 'no':
            # a == b, so this always yields 0.5
            a1 = (probabilities[i]/(probabilities[i]+probabilities[i]))**2
            b1 = (probabilities[i]/(probabilities[i]+probabilities[i]))**2
            res = 1 - (a1 + b1)
            yield res

GINI = [val for val in ROLLING_GINI(df)]
print('All GINI coefficients: %s' % GINI)
print('Length of all calculable GINI coefficients: %s' % len(GINI))
print('Number of rows in the dataframe: %s' % len(df))
print('The last Interesting value is: %s' % df.iloc[-1, 0])
EDIT NUMBER THREE (Sorry for the late realization):
So it does work if I apply the indexing correctly. The problem was that I was using the next probability when I should have used the previous one, so it's a = probabilities[i-1] and b = probabilities[i].
import pandas as pd

df = pd.read_csv('df.txt', delim_whitespace=True)

def ROLLING_GINI(dataframe):
    probabilities = dataframe['probabilities']
    how_to_calculate = dataframe['Interesting']
    for i in range(len(dataframe)):
        if how_to_calculate[i] == 'yes':
            # pair this probability with the previous row's probability
            a1 = (probabilities[i-1]/(probabilities[i-1]+probabilities[i]))**2
            b1 = (probabilities[i]/(probabilities[i-1]+probabilities[i]))**2
            res = 1 - (a1 + b1)
            yield res
        elif how_to_calculate[i] == 'no':
            # a == b, so this always yields 0.5
            a1 = (probabilities[i]/(probabilities[i]+probabilities[i]))**2
            b1 = (probabilities[i]/(probabilities[i]+probabilities[i]))**2
            res = 1 - (a1 + b1)
            yield res

GINI = [val for val in ROLLING_GINI(df)]
print('All GINI coefficients: %s' % GINI)
print('Length of all calculable GINI coefficients: %s' % len(GINI))
print('Number of rows in the dataframe: %s' % len(df))
print('The last Interesting value is: %s' % df.iloc[-1, 0])
I am not sure how the Interesting column plays into all of this, but I highly recommend that you make the new column by using numpy.where(). The syntax would be something like:
import numpy as np
df['GINI INDEX'] = np.where(__condition__,__what to do if true__,__what to do if false__)
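For example, here is a minimal sketch of how np.where could fit here. The pairing rule (a 'no' row takes the next probability, a 'yes' row takes the previous one) is just my reading of your desired output, so adjust it as needed:
import numpy as np
import pandas as pd

# hypothetical miniature of the DataFrame from the question
df = pd.DataFrame({
    'Interesting': ['no', 'yes', 'no', 'yes'],
    'genre_1': ['Empty', 'Empty', 'Alternative', 'Alternative'],
    'probabilities': [0.251306, 0.042043, 5.871099, 5.723896],
})

# 'yes' rows are paired with the previous probability, 'no' rows with the next one
other = np.where(df['Interesting'] == 'yes',
                 df['probabilities'].shift(1),
                 df['probabilities'].shift(-1))

a, b = df['probabilities'], other
df['GINI INDEX'] = 1 - (a/(a+b))**2 - (b/(a+b))**2
print(df)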
Related
I have a pandas dataframe similar to the one below:
 Output      var1      var2      var3
      1  0.487981  0.297929  0.214090
      1  0.945660  0.031666  0.022674
      2  0.119845  0.828661  0.051495
      2  0.095186  0.852232  0.052582
      3  0.059520  0.053307  0.887173
      3  0.091049  0.342226  0.566725
      3  0.119295  0.414376  0.466329
    ...       ...       ...       ...
Basically, I have 3 columns (propensity score values) and one output (treatment). I want to calculate the within-trio distance in order to find the trios of outputs with the smallest within-trio distance.
The experiment is taken from the paper "Matching by Propensity Score in Cohort Studies with Three Treatment Groups", Rassen et al. From their explanation, it looks like calculating the perimeter of a triangle, but I am not sure.
I think that at this GitHub link: https://github.com/bwh-dope/pharmacoepi_toolbox/blob/master/src/org/drugepi/match/MatchDistanceCalculator.java there is Java code doing this more or less, but I am not sure how to use it. I use Python, so I have two options: try to adapt this code or write something else.
My idea is that var1, var2 and var3 can be treated as spatial x, y, z coordinates, so each observation is a point in space.
I found a function that calculates the distance between 2 points:
# found here https://stackoverflow.com/questions/68938033/min-distance-between-point-cloud-xyz-points-in-python
import itertools
import numpy as np

distance = lambda p1, p2: np.sqrt(np.sum((p1 - p2) ** 2, axis=0))

def min_distance(cloud):
    pairs = itertools.combinations(cloud, 2)
    return np.min(list(map(lambda pair: distance(*pair), pairs)))

def get_points(filename):
    with open(filename, 'r') as file:
        rows = np.genfromtxt(file, delimiter=',', skip_header=True)
    return rows

filename = 'cloud.csv'
cloud = get_points(filename)
min_dist = min_distance(cloud)
However, I want to calculate the distance between 3 points, so I think I need to iterate over all the pairwise combinations of the 3 points (XY, XZ and YZ), but I am not sure of this procedure.
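To make this concrete, something like the following sketch is what I have in mind for a single trio (the sum of the three pairwise distances, i.e. the "perimeter"); the sample coordinates are made up:
import numpy as np
from itertools import combinations

def trio_perimeter(p1, p2, p3):
    # sum of the three pairwise Euclidean distances (XY + XZ + YZ)
    points = [np.asarray(p1), np.asarray(p2), np.asarray(p3)]
    return sum(np.linalg.norm(a - b) for a, b in combinations(points, 2))

# made-up propensity-score vectors (var1, var2, var3), one point per treatment group
print(trio_perimeter([0.49, 0.30, 0.21], [0.12, 0.83, 0.05], [0.06, 0.05, 0.89]))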
Finally, I tried my own solution, which I think is correct but may be too computationally expensive.
I created my 3 datasets according to the Output value: dataset1 = dataset[dataset["Output"]==1], and the same for Output=2 and Output=3.
This is my distance function:
def Euclidean_Dist(df1, df2):
    return np.linalg.norm(df1 - df2)
My variables:
tripletta_for = []
tripletta_tot_wr = []
p_inf = float('inf')
counter = 1
These are the steps used to compute the within-trio distance. I hope they are correct.
'''
i[0]    = index
i[1]    = treatment, prop1
i[1][0] = treatment
i[1][1] = prop
'''
# I want to compute the distance between i[1][1], j[1][1] and k[1][1]
for i in dataset1.iterrows():
    minimum_distance = p_inf
    print(counter)
    counter = counter + 1
    for j in dataset2.iterrows():
        dist12 = Euclidean_Dist(i[1][1], j[1][1])
        for k in dataset3.iterrows():
            dist13 = Euclidean_Dist(i[1][1], k[1][1])
            dist23 = Euclidean_Dist(j[1][1], k[1][1])
            somma = dist12 + dist13 + dist23
            if somma < minimum_distance:
                minimum_distance = somma
                tripletta_for = i[0], j[0], k[0]
                #print(tripletta_for)
    # remove the matched rows so they cannot be picked again
    dataset2.drop(index=tripletta_for[1], inplace=True)
    dataset3.drop(tripletta_for[2], inplace=True)
    #print(len(dataset3))
    tripletta_tot_wr.append(tripletta_for)
    #print(tripletta_tot_wr)
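If this turns out to be too slow, one direction I am considering (just a sketch, and it assumes the distance should use all three propensity columns var1, var2, var3 rather than the single value used in Euclidean_Dist above) is to precompute the pairwise distance matrices with scipy and search over indices instead of rows:
import numpy as np
from scipy.spatial.distance import cdist

X1 = dataset1[['var1', 'var2', 'var3']].to_numpy()
X2 = dataset2[['var1', 'var2', 'var3']].to_numpy()
X3 = dataset3[['var1', 'var2', 'var3']].to_numpy()

d12 = cdist(X1, X2)   # all distances between dataset1 and dataset2
d13 = cdist(X1, X3)
d23 = cdist(X2, X3)

# perimeter of every possible trio (i, j, k); beware that this builds a
# len(X1) x len(X2) x len(X3) array, so it only fits in memory for moderate sizes
perimeters = d12[:, :, None] + d13[:, None, :] + d23[None, :, :]
i, j, k = np.unravel_index(perimeters.argmin(), perimeters.shape)
print('closest trio (positional indices):', i, j, k, 'perimeter:', perimeters[i, j, k])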
I have a dataframe with a column x.
I want to make a new column x_new, but I want the first row of this new column to be set to a specific number (let's say -2).
Then, from the 2nd row on, use the previous row of the new column to iterate through the cx function:
data = {'x': [1, 2, 3, 4, 5]}
df = pd.DataFrame(data)

def cx(x):
    if df.loc[1, 'x_new'] == 0:
        df.loc[1, 'x_new'] = -2
    else:
        x_new = -10*x + 2
        return x_new

df['x_new'] = (cx(df['x']))
The final dataframe
I am not sure how to do this.
Thank you for your help.
This is what I have so far:
data = {'depth': [1, 2, 3, 4, 5]}
df = pd.DataFrame(data)
df

# calculate equation
def depth_cal(d):
    z = -3*d + 1   # d must be previous row
    return z

depth_cal = (depth_cal(df['depth']))   # how to set d as previous row
print(depth_cal)

depth_new = []
for row in df['depth']:
    if row == 1:
        depth_new.append('-5.63')
    else:
        depth_new.append(depth_cal)   # Does not put list in a column

df['Depth_correct'] = depth_new
correct output:
There are still two problems with this:
1. It does not put the depth_cal list properly into a column.
2. In the depth_cal function, I want d to be the previous row.
Thank you
I would do this by just using a loop to generate your new data - it might not be ideal if the data is particularly huge, but it's a quick operation. Let me know how you get on with this:
import pandas as pd

data = {'depth': [1, 2, 3, 4, 5]}
df = pd.DataFrame(data)

res = data['depth']
res[0] = -5.63
for i in range(1, len(res)):
    res[i] = -3 * res[i-1] + 1   # each value is computed from the previous one

df['new_depth'] = res
print(df)
To get
   depth  new_depth
0      1      -5.63
1      2      17.89
2      3     -52.67
3      4     159.01
4      5    -476.03
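One small caveat with the snippet above: res is the same list object as data['depth'], so the loop also overwrites the values inside data. If that matters, building a fresh list does the same job:
import pandas as pd

data = {'depth': [1, 2, 3, 4, 5]}
df = pd.DataFrame(data)

res = [-5.63]                       # first value is fixed
for i in range(1, len(df)):
    res.append(-3 * res[i-1] + 1)   # every other value comes from the previous one

df['new_depth'] = res
print(df)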
Basically, I'm aggregating prices over three indices to determine: mean, std, as well as an upper/lower limit. So far so good. However, now I want to also find the lowest identified price which is still >= the computed lower limit.
My first idea was to use np.min to find the lowest price -> this obviously disregards the lower-limit and is not useful. Now I'm trying to store all the values the pivot table identified to find the price which still is >= lower-limit. Any ideas?
pivot = pd.pivot_table(temp, index=['A','B','C'],values=['price'], aggfunc=[np.mean,np.std],fill_value=0)
pivot['lower_limit'] = pivot['mean'] - 2 * pivot['std']
pivot['upper_limit'] = pivot['mean'] + 2 * pivot['std']
First, merge pivoted[lower_limit] back into temp. Thus, for each price in temp there is also a lower_limit value.
temp = pd.merge(temp, pivoted['lower_limit'].reset_index(), on=ABC)
Then you can restrict your attention to those rows in temp for which the price is >= lower_limit:
temp.loc[temp['price'] >= temp['lower_limit']]
The desired result can be found by computing a groupby/min:
result = temp.loc[temp['price'] >= temp['lower_limit']].groupby(ABC)['price'].min()
For example,
import numpy as np
import pandas as pd
np.random.seed(2017)
N = 1000
ABC = list('ABC')
temp = pd.DataFrame(np.random.randint(2, size=(N,3)), columns=ABC)
temp['price'] = np.random.random(N)
pivoted = pd.pivot_table(temp, index=['A', 'B', 'C'], values=['price'],
                         aggfunc=[np.mean, np.std], fill_value=0)
pivoted['lower_limit'] = pivoted['mean'] - 2 * pivoted['std']
pivoted['upper_limit'] = pivoted['mean'] + 2 * pivoted['std']
temp = pd.merge(temp, pivoted['lower_limit'].reset_index(), on=ABC)
result = temp.loc[temp['price'] >= temp['lower_limit']].groupby(ABC)['price'].min()
print(result)
yields
A  B  C
0  0  0    0.003628
      1    0.000132
   1  0    0.005833
      1    0.000159
1  0  0    0.006203
      1    0.000536
   1  0    0.001745
      1    0.025713
I want numpy to go through column 1 and find all numbers that are greater than 0. Then I want it to print out those positive numbers from column 1 together with whatever number is associated with them in column 2.
import numpy
N = 3
a1 = [-0.0119,0.0754,0.0272,0.0107,-0.0053,-0.0114,0.0148,0.0062,0.0043,0,0.022,-0.0153,0.0207,-0.0065,0.0069,-0.0018,0.0149,-0.0084,-0.0021,0.0072,0.0095,0.0004,0.0068,0.0016,-0.0048,0.0051,0.0025,0.0081,-0.0203,-0.0008,-0.0008,-0.0047,-0.0007,-0.0291,0.0071,0.0033,0.0179,-0.0016,0.0397,0.0075,0.0061,-0.0075,0.0026,-0.0055,-0.006,0.0026,-0.0046,0.0046,0.0201,0.023,0.0014,-0.0029,0.0115,0.0066,0.0071,0.0061,-0.0081,-0.0071,0.0005,-0.0076,0.0102,-0.0051,0.018,0.0017,0.0123,0.0021,-0.0032,0.0049,0.0004,0.0053,-0.0004,0.0138,-0.0215,0.0019,0.0023,-0.0059,-0.013,-0.0478,-0.0009,0.0089,0.0006,0.014,-0.0077,0.0006,0.0024,0.0113,0.0062,-0.0162,0.0198,0.0096,0.0167,-0.0018,0.0038,0.0088,0.0023,-0.0063,-0.0109,0.0127,-0.027,0,0.0089,-0.0003,0.023,-0.0009,0.02,-0.0059,0.0029,0.0219,-0.0003,0.0029,0.0072,-0.009,0.0025,0.0123,0.0106,-0.0024,-0.0267,0.0124,0.0012,0.0046,-0.0131,0.0133,-0.0075,0.009,0.0209,0.0106,0.0031,0.0019,-0.0122,0.002,-0.0261,-0.004,0.4034]
a1= a1[::-1]
a1 = numpy.array(a1)
numbers_mean = numpy.convolve(a1, numpy.ones((N,))/N)[(N-1):]
numbers_mean = numbers_mean[::-1]
numbers_mean = numbers_mean.reshape(-1,1)
a1 = a1.reshape(-1,1)
x = numpy.column_stack((a1,numbers_mean))
l = x[0<a1]
When I print l, all I get are the results from column 1. I also want the corresponding number from column 2 (column 2 is not filtered).
This is how to filter on column 1 and bring all the columns along:
xx = x[x[:,0]>0,:]
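A tiny made-up example of what that line does (rows with a positive first column are kept, and column 2 comes along unfiltered):
import numpy as np

x = np.array([[-0.01, 0.5],
              [ 0.02, 0.7],
              [ 0.03, 0.1],
              [-0.04, 0.9]])
print(x[x[:, 0] > 0, :])   # [[0.02 0.7]
                           #  [0.03 0.1]]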
In numpy I have a dataset like this. The first two columns are indices. I can divide my dataset into blocks via the indices, i.e. the first block is 0 0, the second block is 0 1, the third block is 0 2, then 1 0, 1 1, 1 2 and so on. Each block has at least two elements. The numbers in the index columns can vary.
I need to split the dataset along these blocks 80%-20% randomly such that after the split each block in both datasets has at least 1 element. How could I do that?
indices | real data
|
0 0 | 43.25 665.32 ... } 1st block
0 0 | 11.234 }
0 1 ... } 2nd block
0 1 }
0 2 } 3rd block
0 2 }
1 0 } 4th block
1 0 }
1 0 }
1 1 ...
1 1
1 2
1 2
2 0
2 0
2 1
2 1
2 1
...
See how you like this. To introduce randomness, I am shuffling the entire dataset. It is the only way I have figured out how to do the splitting vectorized. Maybe you could simply shuffle an indexing array, but that was one indirection too many for my brain today. I have also used a structured array, for ease of extracting the blocks. First, let's create a sample dataset:
from __future__ import division
import numpy as np

# Create a sample data set
c1, c2 = 10, 5
idx1, idx2 = np.arange(c1), np.arange(c2)
idx1, idx2 = np.repeat(idx1, c2), np.tile(idx2, c1)

items = 1000
i = np.random.randint(c1*c2, size=(items - 2*c1*c2,))
d = np.random.rand(items+5)
dataset = np.empty((items+5,), [('idx1', int), ('idx2', int),
                                ('data', float)])
dataset['idx1'][:2*c1*c2] = np.tile(idx1, 2)
dataset['idx1'][2*c1*c2:-5] = idx1[i]
dataset['idx2'][:2*c1*c2] = np.tile(idx2, 2)
dataset['idx2'][2*c1*c2:-5] = idx2[i]
dataset['data'] = d

# Add blocks with only 2 and only 3 elements to test corner case
dataset['idx1'][-5:] = -1
dataset['idx2'][-5:] = [0] * 2 + [1] * 3
And now the stratified sampling:
# For randomness, shuffle the entire array
np.random.shuffle(dataset)
blocks, _ = np.unique(dataset[['idx1', 'idx2']], return_inverse=True)
block_count = np.bincount(_)
where = np.argsort(_)
block_start = np.concatenate(([0], np.cumsum(block_count)[:-1]))
# If we have n elements in a block, and we assign 1 to each array, we
# are left with only n-2. If we randomly assign a fraction x of these
# to the first array, the expected ratio of items will be
# (x*(n-2) + 1) : ((1-x)*(n-2) + 1)
# Setting the ratio equal to 4 (80/20) and solving for x, we get
# x = 4/5 + 3/5/(n-2)
x = 4/5 + 3/5/(block_count - 2)
x = np.clip(x, 0, 1) # if n in (2, 3), the ratio is larger than 1
threshold = np.repeat(x, block_count)
threshold[block_start] = 1 # first item goes to A
threshold[block_start + 1] = 0 # second item goes to B
a_idx = threshold > np.random.rand(len(dataset))
A = dataset[where[a_idx]]
B = dataset[where[~a_idx]]
After running it, the split is roughly 80/20, and all blocks are represented in both arrays:
>>> len(A)
815
>>> len(B)
190
>>> np.all(np.unique(A[['idx1', 'idx2']]) == np.unique(B[['idx1', 'idx2']]))
True
Here's an alternative solution. I'm open to a code review if it is possible to implement this in a more numpy-like way (without for loops). @Jamie's answer is really good; it's just that sometimes it produces skewed ratios within blocks of data.
ratio = 0.8
IDX1 = 0
IDX2 = 1

idx1s = np.arange(len(np.unique(data[:, IDX1])))
idx2s = np.arange(len(np.unique(data[:, IDX2])))

valid = None
train = None
for i1 in idx1s:
    for i2 in idx2s:
        # row indices belonging to the current (i1, i2) block
        mask = np.nonzero((data[:, IDX1] == i1) & (data[:, IDX2] == i2))[0]
        curr_data = data[mask, :]
        np.random.shuffle(curr_data)
        start = np.min(mask)
        end = np.max(mask)
        thres = start + np.around((end - start) * ratio).astype(int)
        selected = mask < thres
        train_idx = mask[selected]
        valid_idx = mask[~selected]
        if train is not None:
            train = np.vstack((train, data[train_idx]))
            valid = np.vstack((valid, data[valid_idx]))
        else:
            train = data[train_idx]
            valid = data[valid_idx]
I'm assuming that each block has at least two entries and also that if it has more than two you want them assigned as closely as possible to 80/20. The easiest way to do this seems to be to assign a random number to all rows, and then choose based on percentiles within each stratified sample. Say this is the data in file strat_sample.csv:
Index_1,Index_2,Data_1,Data_2
0,0,0.614583182,0.677644482
0,0,0.321384981,0.598450854
0,0,0.303029607,0.300593782
0,0,0.646010758,0.612006715
0,0,0.484572883,0.30052535
0,1,0.010625416,0.118671475
0,1,0.428967984,0.23795173
0,1,0.523440618,0.457275922
0,1,0.379612652,0.337640868
0,1,0.338180659,0.206399031
1,0,0.079386,0.890939911
1,0,0.572864624,0.725615079
1,0,0.045891404,0.300128917
1,0,0.578792198,0.100698871
1,0,0.776485138,0.475135948
1,0,0.401850419,0.784835723
1,1,0.087660923,0.497299605
1,1,0.8460978,0.825774802
1,1,0.526015021,0.581905971
1,1,0.23324672,0.299475291
Then this code (using Pandas data structures) works as desired
import numpy as np
import random as rnd
import pandas as pd

# sample data strat_sample.csv, contents to follow

def TreatmentOneCount(n, *args):
    # assign a minimum of one to each group, but as close as possible to fraction OptimalRatio in group 1.
    OptimalRatio = args[0]
    if n < 2:
        print("N too small, assignment not defined.")
        a = float('nan')
    elif n == 2:
        a = 1
    else:
        """
        There are one of two numbers that are close to the target ratio, one above, the other below.
        If the number above is N and it is closest to optimal, then you need to set things to N-1
        to ensure both groups have at least one member (recall n > 2).
        If the number below is 0 and it is closest to optimal, then you need to set things to 1
        to ensure both groups have at least one member (recall n > 2).
        """
        targetassigment = OptimalRatio * n
        if targetassigment - np.floor(targetassigment) > 0.5:
            a = min(np.ceil(targetassigment), n-1)
        else:
            a = max(np.floor(targetassigment), 1)
    return a

df = pd.read_csv('strat_sample.csv', sep=',', header=0)

# assign a random number to each entry
df['RandScore'] = np.random.uniform(0, 1, df.shape[0])
df.sort_values(by=['Index_1', 'Index_2', 'RandScore'], inplace=True)

# Within each block assign a rank based on the random number.
df['RandRank'] = df.groupby(['Index_1', 'Index_2'])['RandScore'].rank()

# make a group index
df['MasterIdx'] = df['Index_1'].apply(str) + df['Index_2'].apply(str)

# Store the counts for members of each block
seriestest = df.groupby('MasterIdx')['RandRank'].count()
seriestest.name = "Counts"
dftest = pd.DataFrame(seriestest)

# Add the block counts to the data
df = df.merge(dftest, how='left', left_on='MasterIdx', right_index=True)

# Make the actual assignments to the two groups
df['Assignment'] = (df['RandRank'] <= df['Counts'].apply(TreatmentOneCount, args=(0.8,))) * -1 + 2
df.drop(['MasterIdx', 'Counts', 'RandRank', 'RandScore'], axis=1)
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2, random_state=0)
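Note that a plain train_test_split will not by itself guarantee that every block shows up in both parts. A possible workaround, sketched here under the assumption that the data sits in a pandas DataFrame with index columns named idx1 and idx2 and that every block has at least two rows, is to split each block separately and concatenate:
import pandas as pd
from sklearn.model_selection import train_test_split

def split_per_block(df, test_size=0.2, seed=0):
    # split every (idx1, idx2) block on its own, so each block contributes
    # at least one row to both parts
    train_parts, test_parts = [], []
    for _, block in df.groupby(['idx1', 'idx2']):
        n_test = max(1, int(round(test_size * len(block))))
        tr, te = train_test_split(block, test_size=n_test, random_state=seed)
        train_parts.append(tr)
        test_parts.append(te)
    return pd.concat(train_parts), pd.concat(test_parts)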