I have a dataset (NumPy arrays) with 50 classes and 9000 training examples.
x_train = (9000, 2048)
y_train = (9000,)  # classes are strings
classes = list(set(y_train))
I would like to build a sub-dataset such that each class has 5 examples,
which means I get 5*50 = 250 training examples. Hence my sub-dataset will take this form:
sub_train_data = (250, 2048)
sub_train_labels = (250,)
Remark: we randomly take 5 examples from each class (total number of classes = 50).
Thank you
Here is a solution for that problem:
from collections import Counter
import numpy as np
import matplotlib.pyplot as plt
def balanced_sample_maker(X, y, sample_size, random_seed=42):
    uniq_levels = np.unique(y)
    uniq_counts = {level: sum(y == level) for level in uniq_levels}

    if random_seed is not None:
        np.random.seed(random_seed)

    # find the observation indices of each class level
    groupby_levels = {}
    for ii, level in enumerate(uniq_levels):
        obs_idx = [idx for idx, val in enumerate(y) if val == level]
        groupby_levels[level] = obs_idx

    # oversample observations of each label
    # (replace=True allows duplicates when a class has fewer than sample_size examples)
    balanced_copy_idx = []
    for gb_level, gb_idx in groupby_levels.items():
        over_sample_idx = np.random.choice(gb_idx, size=sample_size, replace=True).tolist()
        balanced_copy_idx += over_sample_idx
    np.random.shuffle(balanced_copy_idx)

    data_train = X[balanced_copy_idx]
    labels_train = y[balanced_copy_idx]
    if len(data_train) == sample_size * len(uniq_levels):
        print('number of sampled examples:', sample_size * len(uniq_levels),
              'number of samples per class:', sample_size,
              '#classes:', len(uniq_levels))
    else:
        print('number of samples is wrong')

    labels, values = zip(*Counter(labels_train).items())
    print('number of classes:', len(set(labels_train)))
    check = all(x == values[0] for x in values)
    print(check)
    if check:
        print('Good, all classes have the same number of examples')
    else:
        print('Repeat your sampling, your classes are not balanced')

    # bar plot of the per-class sample counts
    indexes = np.arange(len(labels))
    width = 0.5
    plt.bar(indexes, values, width)
    plt.xticks(indexes + width * 0.5, labels)
    plt.show()
    return data_train, labels_train

X_train, y_train = balanced_sample_maker(X, y, 10)
inspired by Scikit-learn balanced subsampling
Pure NumPy solution:
def sample(X, y, samples):
    unique_ys = np.unique(y, axis=0)
    result = []
    for unique_y in unique_ys:
        val_indices = np.argwhere(y == unique_y).flatten()
        random_samples = np.random.choice(val_indices, samples, replace=False)
        result.append(X[random_samples])
    return np.concatenate(result)
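This returns only the sampled features. A quick sketch of a variant that also returns the matching labels (same assumptions as above; the name sample_xy is mine):
import numpy as np

def sample_xy(X, y, samples):
    idx = []
    for unique_y in np.unique(y):
        val_indices = np.argwhere(y == unique_y).flatten()
        # replace=False requires each class to have at least `samples` members
        idx.extend(np.random.choice(val_indices, samples, replace=False))
    idx = np.array(idx)
    return X[idx], y[idx]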
I usually use a trick from scikit-learn for this: the StratifiedShuffleSplit function. If I have to select a fraction 1/n of my train set, I divide the data into n folds and set the proportion of test data (test_size) to 1 - 1/n. Here is an example where I use only 1/10 of my data.
from sklearn.model_selection import StratifiedShuffleSplit

sp = StratifiedShuffleSplit(n_splits=1, test_size=0.9, random_state=seed)
for train_index, _ in sp.split(x_train, y_train):
    x_train, y_train = x_train[train_index], y_train[train_index]
You can use a dataframe as input (as in my case) and use the simple code below:
import pandas as pd

col = target
nsamples = min(t4m[col].value_counts().values)
res = pd.DataFrame()
for val in t4m[col].unique():
    t = t4m.loc[t4m[col] == val].sample(nsamples)
    res = pd.concat([res, t], ignore_index=True).sample(frac=1)
col is the name of your column with classes. The code finds the size of the minority class, takes a sample of that size from each class, and shuffles the resulting dataframe.
Then you can convert the result back to an np.array.
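For example, a minimal sketch of that conversion ('target' is a placeholder for your label column name):
# extract labels and features back into NumPy arrays
sub_train_labels = res['target'].to_numpy()               # shape (250,)
sub_train_data = res.drop(columns=['target']).to_numpy()  # shape (250, 2048)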
I'm trying to implement the hinge loss function in Python and ran into some confusion.
Some sources I have read (for example, "Regression Analysis in Python" by Luca Massoron) state that hinge is sometimes called the softmax function.
But to me that seems strange, because hinge is a loss, l(y) = max(0, 1 - t*y) for a true label t in {-1, +1} and score y,
while softmax is just an exponential mapping, sigma(z)_j = exp(z_j) / sum_k exp(z_k).
I implemented the function (for softmax) in Python this way:
import numpy as np

def softmax(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0)
I have two questions:
Can I use that softmax function as an equivalent to the hinge function?
If not, how can hinge be implemented in Python?
Thanks.
Can I use that softmax function as an equivalent to the hinge function?
No - they are not equivalent.
A hinge function is a loss function and does not provide well-calibrated probabilities, whereas softmax is a mapping function (one that maps a set of scores into a distribution, one that sums to one).
If not, how can hinge be implemented in Python?
The following snippet captures the essence of the hinge loss function:
import numpy as np
import matplotlib.pyplot as plt

xmin, xmax = -1, 2
xx = np.linspace(xmin, xmax, 100)
plt.plot(xx, np.where(xx < 1, 1 - xx, 0), label="Hinge loss")
plt.legend()
plt.show()
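If you need it as a loss on predictions rather than a plot, here is a minimal sketch (assuming labels in {-1, +1} and raw decision scores; the name hinge_loss is mine):
import numpy as np

def hinge_loss(y_true, scores):
    # y_true: array of -1/+1 labels; scores: raw decision values, e.g. w.x + b
    return np.mean(np.maximum(0.0, 1.0 - y_true * scores))

# e.g. hinge_loss(np.array([1, -1]), np.array([0.8, -2.0]))
# -> mean(max(0, 0.2), max(0, -1)) = 0.1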
You can also implement the softmax function in pure Python :)
import numpy as np
import math

def sofyMax(data):
    # pure python
    # math:: $rezult(powe,sumColumn) = \dfrac{powe(data)}{sumColumn(powe(data))}$
    def powe(data):
        # element-wise exponential, built column by column
        outp = [[] for _ in range(len(data))]
        for column in range(len(data[0])):
            r = 0
            for row in data:
                outp[r] += [math.exp(row[column])]
                r += 1
        return outp

    def sumColumn(data):
        # column sums
        outps = []
        for column in range(len(data[0])):
            total = 0
            for row in data:
                total += row[column]
            outps += [total]
        return outps

    def rezult(data, sumcolumn):
        # divide every element by its column sum
        outp = [[] for _ in range(len(data))]
        l = 0
        for row in data:
            for c, s in zip(row, sumcolumn):
                outp[l] += [c / s]
            l += 1
        return outp

    et1 = powe(data)
    et2 = sumColumn(et1)
    return rezult(et1, et2)

data = np.random.randn(10, 5)
# compare with the vectorized NumPy version; use allclose because of float rounding
np.allclose(np.exp(data) / np.sum(np.exp(data), axis=0), np.array(sofyMax(data)))
Can you please tell me how to properly calculate the distance between every point in my testData?
For now I am getting only one single value, whereas I should get the distance from each point in the data set and be able to assign it a class. I have to use NumPy for this.
Now the problem is that I am getting this error and don't know how to fix it.
KeyError: 0
I am trying to obtain accuracy of classified labels.
Any ideas, please?
import matplotlib.pyplot as plt
import random
import numpy as np
import operator
from sklearn.cross_validation import train_test_split

# In[1]
def readFile():
    f = open('iris.data', 'r')
    d = np.dtype([('features', np.float, (4,)), ('class', np.str_, 20)])
    data = np.genfromtxt(f, dtype=d, delimiter=",")
    dataPoints = data['features']
    labels = data['class']
    return dataPoints, labels

# In[2]
def normalizeData(dataPoints):
    # normalize the data so the values will be between 0 and 1
    dataPointsNorm = (dataPoints - dataPoints.min()) / (dataPoints.max() - dataPoints.min())
    return dataPointsNorm

def crossVal(dataPointsNorm):
    # splitting into train and test set for cross-validation
    trainData, testData = train_test_split(dataPointsNorm, test_size=0.20, random_state=25)
    return trainData, testData

def calculateDistance(trainData, testData):
    # Euclidean distance calculation on numpy arrays
    distance = np.sqrt(np.sum((trainData - testData)**2, axis=-1))
    # argsort sorts indices from closest to furthest neighbor, in ascending order
    sortDistance = distance.argsort()
    return distance, sortDistance

# In[4]
def classifyKnn(testData, trainData, labels, k):
    # calculating nearest neighbours and assigning the class by majority vote
    classCount = {}
    for i in range(k):
        distance, sortedDistIndices = calculateDistance(trainData, testData[i])
        voteLabel = labels[sortedDistIndices][i]
        #print voteLabel
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1
    print 'Class Count: ', classCount
    # sorting the dictionary to return the voted class
    sortedClassCount = sorted(classCount.iteritems(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0], classCount

def testAccuracy(testData, classCount):
    correct = 0
    for x in range(len(testData)):
        print 'HERE !!!!!!!!!!!!!!'
        if testData[x][-1] is classCount[x]:
            correct += 1
    return (correct / float(len(testData))) * 100.0

def main():
    dataPoints, labels = readFile()
    dataPointsNorm = normalizeData(dataPoints)
    trainData, testData = crossVal(dataPointsNorm)
    result, classCount = classifyKnn(testData, trainData, labels, 5)
    print result
    accuracy = testAccuracy(testData, classCount)
    print accuracy

main()
I have the data normalized and split into train and test sets; the distance calculation is wrong.
Thanks for any tips.
I am trying to write a script that calculates the similarity of a few documents. I want to do it using LSA. I found the following code and changed it a bit. It takes 3 documents as input and then outputs a 3x3 matrix with the similarities between them. I want to do the same similarity calculation but only with the sklearn library. Is that possible?
from numpy import zeros
from scipy.linalg import svd
from math import log
from numpy import asarray, sum
from nltk.corpus import stopwords
from sklearn.metrics.pairwise import cosine_similarity

titles = [doc1, doc2, doc3]
ignorechars = ''',:'!'''

class LSA(object):
    def __init__(self, stopwords, ignorechars):
        self.stopwords = stopwords.words('english')
        self.ignorechars = ignorechars
        self.wdict = {}
        self.dcount = 0

    def parse(self, doc):
        words = doc.split()
        for w in words:
            w = w.lower()
            if w in self.stopwords:
                continue
            elif w in self.wdict:
                self.wdict[w].append(self.dcount)
            else:
                self.wdict[w] = [self.dcount]
        self.dcount += 1

    def build(self):
        self.keys = [k for k in self.wdict.keys() if len(self.wdict[k]) > 1]
        self.keys.sort()
        self.A = zeros([len(self.keys), self.dcount])
        for i, k in enumerate(self.keys):
            for d in self.wdict[k]:
                self.A[i, d] += 1

    def calc(self):
        self.U, self.S, self.Vt = svd(self.A)
        return -1 * self.Vt

    def TFIDF(self):
        WordsPerDoc = sum(self.A, axis=0)
        DocsPerWord = sum(asarray(self.A > 0, 'i'), axis=1)
        rows, cols = self.A.shape
        for i in range(rows):
            for j in range(cols):
                self.A[i, j] = (self.A[i, j] / WordsPerDoc[j]) * log(float(cols) / DocsPerWord[i])

mylsa = LSA(stopwords, ignorechars)
for t in titles:
    mylsa.parse(t)
mylsa.build()
a = mylsa.calc()
cosine_similarity(a)
From @ogrisel's answer:
I ran the following code, but my mouth is still open :) When TFIDF gives at most 80% similarity on two documents with the same subject, this code gives me 99.99%. That's why I think something is wrong :P
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.metrics.pairwise import cosine_similarity

dataset = [doc1, doc2, doc3]
vectorizer = TfidfVectorizer(max_df=0.5, stop_words='english')
X = vectorizer.fit_transform(dataset)
lsa = TruncatedSVD()
X = lsa.fit_transform(X)
X = Normalizer(copy=False).fit_transform(X)
cosine_similarity(X)
You can use the TruncatedSVD transformer from sklearn 0.14+: call fit_transform on your database of documents, then call the transform method (of the same TruncatedSVD object) on the query document. You can then compute the cosine similarity of the transformed query documents with the transformed database using sklearn.metrics.pairwise.cosine_similarity, and numpy.argsort the result to find the index of the most similar document.
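A minimal sketch of that pipeline (the document contents and n_components here are placeholders):
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the cat sat on the mat", "a dog chased the cat", "quantum entanglement in photons"]
query = ["a cat and a dog"]

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(docs)                   # fit on the database

lsa = TruncatedSVD(n_components=2)
X_lsa = lsa.fit_transform(X)                         # transformed database
q_lsa = lsa.transform(vectorizer.transform(query))   # transformed query

sims = cosine_similarity(q_lsa, X_lsa).ravel()
most_similar = np.argsort(sims)[::-1]                # indices, most similar first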
Note that under the hood, scikit-learn also uses NumPy but in a more efficient way than the snippet you gave (by using the Randomized SVD trick by Halko, Martinsson and Tropp).
In my project I need to compute the entropy of 0-1 vectors many times. Here's my code:
import numpy as np

def entropy(labels):
    """ Computes entropy of 0-1 vector. """
    n_labels = len(labels)
    if n_labels <= 1:
        return 0
    counts = np.bincount(labels)
    probs = counts[np.nonzero(counts)] / n_labels
    n_classes = len(probs)
    if n_classes <= 1:
        return 0
    return -np.sum(probs * np.log(probs)) / np.log(n_classes)
Is there a faster way?
@Sanjeet Gupta's answer is good but could be condensed. This question specifically asks about the "fastest" way, but I only see timings in one answer, so I'll post a comparison of using scipy and numpy against the original poster's entropy2 answer, with slight alterations.
Four different approaches: (1) scipy/numpy, (2) numpy/math, (3) pandas/numpy, (4) numpy
import numpy as np
from scipy.stats import entropy
from math import log, e
import pandas as pd
import timeit

def entropy1(labels, base=None):
    value, counts = np.unique(labels, return_counts=True)
    return entropy(counts, base=base)

def entropy2(labels, base=None):
    """ Computes entropy of label distribution. """
    n_labels = len(labels)
    if n_labels <= 1:
        return 0
    value, counts = np.unique(labels, return_counts=True)
    probs = counts / n_labels
    n_classes = np.count_nonzero(probs)
    if n_classes <= 1:
        return 0
    ent = 0.
    # compute entropy
    base = e if base is None else base
    for i in probs:
        ent -= i * log(i, base)
    return ent

def entropy3(labels, base=None):
    vc = pd.Series(labels).value_counts(normalize=True, sort=False)
    base = e if base is None else base
    return -(vc * np.log(vc) / np.log(base)).sum()

def entropy4(labels, base=None):
    value, counts = np.unique(labels, return_counts=True)
    norm_counts = counts / counts.sum()
    base = e if base is None else base
    return -(norm_counts * np.log(norm_counts) / np.log(base)).sum()
Timeit operations:
repeat_number = 1000000

a = timeit.repeat(stmt='''entropy1(labels)''',
                  setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import entropy1''',
                  repeat=3, number=repeat_number)

b = timeit.repeat(stmt='''entropy2(labels)''',
                  setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import entropy2''',
                  repeat=3, number=repeat_number)

c = timeit.repeat(stmt='''entropy3(labels)''',
                  setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import entropy3''',
                  repeat=3, number=repeat_number)

d = timeit.repeat(stmt='''entropy4(labels)''',
                  setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import entropy4''',
                  repeat=3, number=repeat_number)
Timeit results:
# for loop to print out results of timeit
for approach, timeit_results in zip(['scipy/numpy', 'numpy/math', 'pandas/numpy', 'numpy'], [a, b, c, d]):
    print('Method: {}, Avg.: {:.6f}'.format(approach, np.array(timeit_results).mean()))
Method: scipy/numpy, Avg.: 63.315312
Method: numpy/math, Avg.: 49.256894
Method: pandas/numpy, Avg.: 884.644023
Method: numpy, Avg.: 60.026938
Winner: numpy/math (entropy2)
It's also worth noting that the entropy2 function above can handle numeric AND text data, e.g. entropy2(list('abcdefabacdebcab')). The original poster's answer is from 2013 and had a specific use case for binning ints, but it won't work for text.
With the data as a pd.Series and scipy.stats, calculating the entropy of a given quantity is pretty straightforward:
import pandas as pd
import scipy.stats

def ent(data):
    """Calculates entropy of the passed `pd.Series`"""
    p_data = data.value_counts()           # counts occurrence of each value
    entropy = scipy.stats.entropy(p_data)  # get entropy from counts
    return entropy
Note: scipy.stats will normalize the provided data, so this doesn't need to be done explicitly, i.e. passing an array of counts works fine.
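For example, a quick sketch of what that normalization means in practice (counts here are made up):
import scipy.stats

# raw counts; scipy normalizes them to [0.5, 0.3, 0.2] internally
scipy.stats.entropy([5, 3, 2])           # ~1.0296 (natural log)
scipy.stats.entropy([5, 3, 2], base=2)   # ~1.4855 (bits)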
An answer that doesn't rely on numpy, either:
import math
from collections import Counter

def eta(data, unit='natural'):
    base = {
        'shannon': 2.,
        'natural': math.exp(1),
        'hartley': 10.,
    }
    if len(data) <= 1:
        return 0

    counts = Counter()
    for d in data:
        counts[d] += 1

    ent = 0
    probs = [float(c) / len(data) for c in counts.values()]
    for p in probs:
        if p > 0.:
            ent -= p * math.log(p, base[unit])
    return ent
This will accept any datatype you could throw at it:
>>> eta(['mary', 'had', 'a', 'little', 'lamb'])
1.6094379124341005
>>> eta([c for c in "mary had a little lamb"])
2.311097886212714
The answer provided by @Jarad suggested timings as well. To that end:
repeat_number = 1000000

e = timeit.repeat(
    stmt='''eta(labels)''',
    setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import eta''',
    repeat=3,
    number=repeat_number)
Timeit results: (I believe this is ~4x faster than the best numpy approach)
print('Method: {}, Avg.: {:.6f}'.format("eta", np.array(e).mean()))
Method: eta, Avg.: 10.461799
Following the suggestion from unutbu, I created a pure Python implementation.
import numpy as np
from math import log

def entropy2(labels):
    """ Computes entropy of label distribution. """
    n_labels = len(labels)
    if n_labels <= 1:
        return 0
    counts = np.bincount(labels)
    probs = counts[counts > 0] / n_labels  # drop zero counts to avoid log(0)
    n_classes = len(probs)
    if n_classes <= 1:
        return 0
    ent = 0.
    # compute normalized entropy (math.log takes the base as its second positional argument)
    for i in probs:
        ent -= i * log(i, n_classes)
    return ent
The point I was missing was that labels is a large array while probs is only 3 or 4 elements long. Using pure Python, my application is now twice as fast.
Here is my approach:
labels = [0, 0, 1, 1]
from collections import Counter
from scipy import stats
stats.entropy(list(Counter(labels).values()), base=2)
My favorite function for entropy is the following:
import numpy as np

def entropy(labels):
    prob_dict = {x: labels.count(x) / len(labels) for x in labels}
    probs = np.array(list(prob_dict.values()))
    return -probs.dot(np.log2(probs))
I am still looking for a nicer way to avoid the dict -> values -> list -> np.array conversion. I'll comment again if I find one.
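For what it's worth, one possible way to avoid that round-trip is np.unique (a sketch, not the original author's code):
import numpy as np

def entropy_unique(labels):
    # counts come back as an array directly, so no dict/list conversion is needed
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / len(labels)
    return -probs.dot(np.log2(probs))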
Uniformly distributed data (high entropy):
s=range(0,256)
Shannon entropy calculation step by step:
import collections
import math
# calculate probability for each byte as number of occurrences / array length
probabilities = [n_x/len(s) for x,n_x in collections.Counter(s).items()]
# [0.00390625, 0.00390625, 0.00390625, ...]
# calculate per-character entropy fractions
e_x = [-p_x*math.log(p_x,2) for p_x in probabilities]
# [0.03125, 0.03125, 0.03125, ...]
# sum fractions to obtain Shannon entropy
entropy = sum(e_x)
>>> entropy
8.0
One-liner (assuming import collections and import math):
def H(s): return sum([-p_x*math.log(p_x,2) for p_x in [n_x/len(s) for x,n_x in collections.Counter(s).items()]])
A proper function:
import collections
import math

def H(s):
    probabilities = [n_x / len(s) for x, n_x in collections.Counter(s).items()]
    e_x = [-p_x * math.log(p_x, 2) for p_x in probabilities]
    return sum(e_x)
Test cases - English text taken from CyberChef entropy estimator:
>>> H(range(0,256))
8.0
>>> H(range(0,64))
6.0
>>> H(range(0,128))
7.0
>>> H([0,1])
1.0
>>> H('Standard English text usually falls somewhere between 3.5 and 5')
4.228788210509104
This method extends the other solutions by allowing for binning. For example, bins=None (default) won't bin x and will compute an empirical probability for each element of x, while bins=256 chunks x into 256 bins before computing the empirical probabilities.
import numpy as np

def entropy(x, bins=None):
    N = x.shape[0]
    if bins is None:
        counts = np.bincount(x)
    else:
        counts = np.histogram(x, bins=bins)[0]  # 0th idx is counts
    p = counts[np.nonzero(counts)] / N  # avoids log(0)
    H = -np.dot(p, np.log2(p))
    return H
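For example (hypothetical data, just to show the two modes):
x = np.random.randint(0, 256, size=10000)
print(entropy(x))           # one empirical probability per distinct value
print(entropy(x, bins=16))  # histogram into 16 bins first, then compute entropy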
This is the fastest Python implementation I've found so far:
import numpy as np

def entropy(labels):
    ps = np.bincount(labels) / len(labels)
    return -np.sum([p * np.log2(p) for p in ps if p > 0])
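Note the list comprehension still loops in Python; a fully vectorized variant of the same idea (a sketch, name entropy_vec is mine) filters the zero bins with a mask instead:
import numpy as np

def entropy_vec(labels):
    ps = np.bincount(labels) / len(labels)
    ps = ps[ps > 0]  # mask out empty bins instead of looping
    return -np.sum(ps * np.log2(ps))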
from collections import Counter
from scipy import stats

labels = [0.9, 0.09, 0.1]
stats.entropy(list(Counter(labels).values()), base=2)
BiEntropy won't be the fastest way of computing entropy, but it is rigorous and builds upon Shannon entropy in a well-defined way. It has been tested in various fields including image-related applications.
It is implemented in Python on GitHub.
A bit late for the party, but I stumbled on this and all answers seem to rely on Kullback-Leibler divergence, which has no upper bound and hence doesn't fit my needs.
Here I have an approximation (the TODO could be improved) of an entropy function that ranges over [0, 1].
It calculates the bias of a single column.
class Pandas_Dataframe_helper:
    # some other methods here...

    @staticmethod
    def column_biass(df_column):
        df_column_as_list = list(df_column)
        N = len(df_column_as_list)
        values, counts = np.unique(df_column_as_list, return_counts=True)
        # generate synthetic list (TODO! what if not an even number? least common multiple of (num_different_labels, [x for x in counts]))
        num_different_labels = len(values)
        num_items_per_label = N // num_different_labels
        synthetic_list = []
        for current_value in values:
            synthetic_list.extend([current_value] * num_items_per_label)
        # TODO! approximation
        if len(synthetic_list) != len(df_column_as_list):
            synthetic_list.extend([current_value] * (len(df_column_as_list) - len(synthetic_list)))
        # now, extrapolate differences between the sorted input list and synthetic_list
        df_column_as_list_sorted = sorted(df_column_as_list)
        counter_unmatches = 0
        for i in range(0, N):
            if df_column_as_list_sorted[i] != synthetic_list[i]:
                counter_unmatches += 1
        # upper_bound = g(N, num_different_labels)
        # ((K-1)*M)-1, K == num_different_labels, M == items per label in a theoretically perfect distribution
        upper_bound = ((num_different_labels - 1) * num_items_per_label) - 1
        return counter_unmatches / upper_bound
Complete code at https://github.com/glezo1/pcommonlibs/blob/master/com/glezo/pandas_dataframe_helper/Pandas_Dataframe_Helper.py
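A hypothetical usage sketch (the column values here are made up; assumes the class above is in scope):
import numpy as np
import pandas as pd

df = pd.DataFrame({'label': ['a', 'a', 'a', 'b']})
# sorted input is [a, a, a, b]; the synthetic balanced list is [a, a, b, b],
# so there is 1 mismatch against an upper bound of (2-1)*2 - 1 = 1, giving 1.0
print(Pandas_Dataframe_helper.column_biass(df['label']))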
The above answer is good, but if you need a version that can operate along different axes, here's a working implementation.
import numpy as np

def entropy(A, axis=None):
    """Computes the Shannon entropy of the elements of A. Assumes A is
    an array-like of nonnegative ints whose max value is approximately
    the number of unique values present.

    >>> a = [0, 1]
    >>> entropy(a)
    1.0
    >>> A = np.c_[a, a]
    >>> entropy(A)
    1.0
    >>> A                   # doctest: +NORMALIZE_WHITESPACE
    array([[0, 0], [1, 1]])
    >>> entropy(A, axis=0)  # doctest: +NORMALIZE_WHITESPACE
    array([ 1., 1.])
    >>> entropy(A, axis=1)  # doctest: +NORMALIZE_WHITESPACE
    array([[ 0.], [ 0.]])
    >>> entropy([0, 0, 0])
    0.0
    >>> entropy([])
    0.0
    >>> entropy([5])
    0.0
    """
    if A is None or len(A) < 2:
        return 0.

    A = np.asarray(A)

    if axis is None:
        A = A.flatten()
        counts = np.bincount(A)  # needs small, non-negative ints
        counts = counts[counts > 0]
        if len(counts) == 1:
            return 0.  # avoid returning -0.0 to prevent weird doctests
        probs = counts / float(A.size)
        return -np.sum(probs * np.log2(probs))

    elif axis == 0:
        # list comprehension instead of map() so NumPy gets a concrete list in Python 3
        entropies = [entropy(col) for col in A.T]
        return np.array(entropies)

    elif axis == 1:
        entropies = [entropy(row) for row in A]
        return np.array(entropies).reshape((-1, 1))

    else:
        raise ValueError("unsupported axis: {}".format(axis))
import math

def entropy(base, prob_a, prob_b):
    # binary entropy of a two-outcome distribution (prob_a + prob_b should be 1)
    return -(prob_a * math.log(prob_a, base) + prob_b * math.log(prob_b, base))
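For example, a fair coin should give exactly one bit:
print(entropy(2, 0.5, 0.5))  # 1.0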