How to qcut with non-unique bin edges? - python

My question is the same as this previous one:
Binning with zero values in pandas
However, I still want to include the 0 values in a fractile. Is there a way to do this? In other words, if I have 600 values, 50% of which are 0 and the rest are, say, between 1 and 100, how would I put all the 0 values in fractile 1 and the remaining non-zero values in fractiles 2 to 10 (assuming I want 10 fractiles)? Could I convert the 0s to NaN, qcut the remaining non-NaN data into 9 fractiles (1 to 9), then add 1 to each label (now 2 to 10) and label all the 0 values as fractile 1 manually? Even this is tricky, because in addition to the 600 values my data set also has another couple hundred that may already be NaN before I convert the 0s to NaN.
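For reference, a minimal sketch of that NaN-style workaround might look like this (the function name and the sample data are just illustrative):
import numpy as np
import pandas as pd

def zero_inflated_fractiles(ser, num_fractiles=10):
    # zeros go to fractile 1, everything else is qcut into fractiles 2..num_fractiles;
    # pre-existing NaNs simply stay NaN
    out = pd.Series(np.nan, index=ser.index)
    out[ser == 0] = 1
    nonzero = ser[ser > 0]
    out[nonzero.index] = np.asarray(pd.qcut(nonzero, num_fractiles - 1, labels=False)) + 2
    return out

# 300 zeros, 300 values between 1 and 100, plus some NaNs
s = pd.Series(np.concatenate([np.zeros(300),
                              np.random.uniform(1, 100, 300),
                              np.full(50, np.nan)]))
print(zero_inflated_fractiles(s).value_counts(dropna=False))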
Update 1/26/14:
I came up with the following interim solution. The problem with this code, though, is that if the high-frequency value is not at the edge of the distribution, it inserts an extra bin in the middle of the existing bins and throws everything a little (or a lot) off.
import bisect
import pandas as pd

def fractile_cut(ser, num_fractiles):
    num_valid = ser.dropna().shape[0]
    remain_fractiles = num_fractiles
    vcounts = ser.value_counts()
    high_freq = []
    i = 0
    # peel off any value that is too frequent to fit in a single fractile
    while vcounts.iloc[i] > num_valid / float(remain_fractiles):
        curr_val = vcounts.index[i]
        high_freq.append(curr_val)
        remain_fractiles -= 1
        num_valid -= vcounts.iloc[i]
        i += 1
    # qcut the remaining values, then splice the high-frequency values
    # back in as extra bin edges
    curr_ser = ser[~ser.isin(high_freq)]
    qcut_bins = pd.qcut(curr_ser, remain_fractiles, retbins=True)[1]
    all_bins = list(qcut_bins)
    for val in high_freq:
        bisect.insort(all_bins, val)
    cut = pd.cut(ser, bins=all_bins)
    ser_fractiles = pd.Series(cut.cat.codes + 1, index=ser.index)
    return ser_fractiles

The problem is that pandas.qcut chooses the bins/quantiles so that each one has the same number of records, but all records with the same value must stay in the same bin/quantile (this behaviour is in accordance with the statistical definition of quantile).
The solutions are:
1 - Use pandas >= 0.20.0, which has a fix for this. A duplicates='raise'|'drop' option was added to control whether to raise on duplicated edges or to drop them, which results in fewer bins than specified, some of them larger (with more elements) than others (see the sketch after this list).
2 - Decrease the number of quantiles. Fewer quantiles means more elements per quantile.
3 - Rank your data with DataFrame.rank(method='first'). The ranking assigns a unique value to each element in the dataframe (the rank) while keeping the order of the elements (except for identical values, which are ranked in the order they appear in the array; see method='first').
Example:
pd.qcut(df, nbins) <-- this generates "ValueError: Bin edges must be unique"
Then use this instead:
pd.qcut(df.rank(method='first'), nbins)
4 - Specify a custom quantiles range, e.g. [0, .50, .75, 1.], to get an unequal number of items per quantile.
5 - Use pandas.cut, which chooses bins that are evenly spaced according to the values themselves, whereas pandas.qcut chooses the bins so that each bin has the same number of records.
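For a concrete comparison of options 1 and 3 on zero-inflated data like yours (the sample data here is made up), something like this should work:
import numpy as np
import pandas as pd

# half zeros, half values between 1 and 100
s = pd.Series(np.concatenate([np.zeros(300), np.random.uniform(1, 100, 300)]))

# option 1: drop the duplicated edges (pandas >= 0.20.0); you get fewer than
# 10 bins and the zero bin is much larger than the others
by_value = pd.qcut(s, 10, labels=False, duplicates='drop')

# option 3: rank first so every element is unique; all 10 deciles then have
# the same size, with the tied zeros split arbitrarily across several deciles
by_rank = pd.qcut(s.rank(method='first'), 10, labels=False)

print(by_value.value_counts().sort_index())
print(by_rank.value_counts().sort_index())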

Another way to do this is to introduce a minimal amount of noise, which will artificially create unique bin edges. Here's an example:
import numpy as np
import pandas as pd

a = pd.Series(list(range(100)) + [0] * 20)

def jitter(a_series, noise_reduction=1000000):
    return (np.random.random(len(a_series)) * a_series.std() / noise_reduction) - (a_series.std() / (2 * noise_reduction))

# and now this works by adding a little noise
a_deciles = pd.qcut(a + jitter(a), 10, labels=False)
We can recreate the original error with something like this:
a_deciles = pd.qcut(a, 10, labels=False)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/site-packages/pandas/tools/tile.py", line 173, in qcut
precision=precision, include_lowest=True)
File "/usr/local/lib/python2.7/site-packages/pandas/tools/tile.py", line 192, in _bins_to_cuts
raise ValueError('Bin edges must be unique: %s' % repr(bins))
ValueError: Bin edges must be unique: array([ 0. , 0. , 0. , 3.8 ,
11.73333333, 19.66666667, 27.6 , 35.53333333,
43.46666667, 51.4 , 59.33333333, 67.26666667,
75.2 , 83.13333333, 91.06666667, 99. ])

You ask about binning with non-unique bin edges, for which I have a fairly simple answer. In the case of your example, your intent and the behavior of qcut diverge in the pandas.tools.tile.qcut function, where the bins are defined:
bins = algos.quantile(x, quantiles)
Because your data is 50% 0s, this returns bins with multiple edges at the value 0 for any number of quantiles greater than 2. I see two possible resolutions. In the first, the fractile space is divided evenly, and the first bin contains all the 0s but not only 0s. In the second, the fractile space is divided evenly over the values greater than 0, and the first bin contains all the 0s and only 0s.
import numpy as np
import pandas as pd
import pandas.core.algorithms as algos
from pandas import Series
In both cases, I'll create some random sample data fitting your description of 50% zeroes and the remaining values between 1 and 100:
zs = np.zeros(300)
rs = np.random.randint(1, 100, size=300)
arr=np.concatenate((zs, rs))
ser = Series(arr)
Solution 1: bin 1 contains both 0s and low values
bins = algos.quantile(np.unique(ser), np.linspace(0, 1, 11))
result = pd.tools.tile._bins_to_cuts(ser, bins, include_lowest=True)
The result is:
In[61]: result.value_counts()
Out[61]:
[0, 9.3] 323
(27.9, 38.2] 37
(9.3, 18.6] 37
(88.7, 99] 35
(57.8, 68.1] 32
(68.1, 78.4] 31
(78.4, 88.7] 30
(38.2, 48.5] 27
(48.5, 57.8] 26
(18.6, 27.9] 22
dtype: int64
Solution 2: bin 1 contains only 0s
mx = np.ma.masked_equal(arr, 0, copy=True)
bins = algos.quantile(arr[~mx.mask], np.linspace(0, 1, 11))
bins = np.insert(bins, 0, 0)
bins[1] = bins[1]-(bins[1]/2)
result = pd.tools.tile._bins_to_cuts(arr, bins, include_lowest=True)
The result is:
In[133]: result.value_counts()
Out[133]:
[0, 0.5] 300
(0.5, 11] 32
(11, 18.8] 28
(18.8, 29.7] 30
(29.7, 39] 35
(39, 50] 26
(50, 59] 31
(59, 71] 31
(71, 79.2] 27
(79.2, 90.2] 30
(90.2, 99] 30
dtype: int64
There is work that could be done to Solution 2 to make it a little prettier I think, but you can see that the masked array is a useful tool to approach your goals.
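Note that pandas.core.algorithms.quantile and pd.tools.tile._bins_to_cuts are internal functions that no longer exist in recent pandas. A rough public-API equivalent of Solution 2 (a sketch, not the original code) would be:
import numpy as np
import pandas as pd

zs = np.zeros(300)
rs = np.random.randint(1, 100, size=300)
ser = pd.Series(np.concatenate((zs, rs)))

# quantile edges computed from the non-zero values only, plus an extra edge
# just below the smallest non-zero value so the first bin holds exactly the 0s
nonzero = ser[ser > 0]
edges = nonzero.quantile(np.linspace(0, 1, 11)).to_numpy()
edges = np.insert(edges, 0, 0.0)
edges[1] = edges[1] / 2.0

result = pd.cut(ser, bins=edges, include_lowest=True)
print(result.value_counts().sort_index())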

If you want to enforce equal-size bins, even in the presence of duplicate values, you can use the following two-step process:
Rank your values, using method='first' to have pandas assign a unique rank to every record. If there is a duplicate value (i.e. a tie in the rank), this method ranks the tied records in the order they appear.
df['rank'] = df['value'].rank(method='first')
Use qcut on the rank to determine equal-sized quantiles. The example below creates deciles (10 bins).
df['decile'] = pd.qcut(df['rank'].values, 10).codes
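Putting the two steps together on some made-up, tie-heavy data:
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': np.concatenate([np.zeros(50),
                                             np.random.uniform(1, 100, 50)])})

df['rank'] = df['value'].rank(method='first')
df['decile'] = pd.qcut(df['rank'].values, 10).codes

# every decile contains exactly 10 rows, even though half the values are ties
print(df.groupby('decile').size())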

I've had a lot of problems with qcut as well, so I used the Series.rank function and then created my own bins from those results. My code is on Github:
https://gist.github.com/ashishsingal1/e1828ffd1a449513b8f8

I had this problem as well, so I wrote a small function that qcuts only the non-zero values and then inserts the resulting labels where the original values were not 0 (zeros keep label 0).
import numpy as np
import pandas as pd

def qcut2(x, n=10):
    x = np.array(x)
    # positions of the non-zero values
    x_index_not0 = [i for i in range(len(x)) if x[i] > 0]
    # qcut only the non-zero values into n-1 bins, labelled 1..n-1
    x_cut_not0 = pd.qcut(x[x > 0], n - 1, labels=False) + 1
    # zeros keep label 0
    y = np.zeros(len(x))
    y[x_index_not0] = x_cut_not0
    return y
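A quick usage sketch on data matching the question (50% zeros), reusing the imports above:
arr = np.concatenate([np.zeros(300), np.random.uniform(1, 100, 300)])
labels = qcut2(arr, n=10)
# label 0 holds the 300 zeros; labels 1 to 9 split the non-zero values evenly
print(pd.Series(labels).value_counts().sort_index())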

Related

Pandas Crosstab when used with Pandas cut: Row names of the output crosstab are mismatched

I have some data with binary (0 and 1) labels. I am using pd.cut to discretize one feature continuous_value, before doing a pd.crosstab on the new field.
The problem is that when I do crosstab, the output seems to mismatch the row names (which are the boundaries for each bin) with the corresponding counts.
Here is my code for a minimal example of the issue. continuous_value is non-negative. 118 data points have continuous_value = 0, so the smallest bin edge is -1 to include them. In this minimal example, the data is cut into two bins: (-1, 0], which only includes the rows with value 0, and (0, 5000000], which includes all other data points.
bins = [-1, 0, 5000000]
input_df['discrete_value'] = pd.cut(input_df['continuous_value'], bins=bins)
pd.crosstab(input_df.discrete_value, input_df.label)
In the resulting crosstab the bin names are mismatched: the count for the non-zero bin sums to 118, and the count for the zero bin sums to 10237. It should be the other way around.
EDIT to include dummy data and code snippet that reproduces the error:
values = [5100,5400,9400,10000,16000,10500,0,0,0,87500,14000,11250]
d = {'continuous': values,'label': [0]*6+[1]*6}
df = pd.DataFrame(data=d)
bins = [-1, 0, 5000000]
df['discrete'] = pd.cut(df['continuous'], bins=bins)
This seems to be an issue with Pandas version 0.23.0. In the latest version, 0.23.2, it correctly assigns the labels.
import pandas as pd
values = [5100,5400,9400,10000,16000,10500,0,0,0,87500,14000,11250]
d = {'continuous': values,'label': [0]*6+[1]*6}
df = pd.DataFrame(data=d)
bins = [-1, 0, 5000000]
df['discrete'] = pd.cut(df['continuous'], bins=bins)
print(pd.crosstab(df.discrete, df.label))
This gives the desired output:
label         0  1
discrete
(-1, 0]       0  3
(0, 5000000]  6  3

General way to quantize floating point numbers into arbitrary number of bins?

I want to quantize a series of numbers which have a maximum and minimum value of X and Y respectively into an arbitrary number of bins. For instance, if the maximum value of my array is 65535 and the minimum is 0 (do not assume these are all integers) and I want to quantize the values into 2 bins, all values more than floor(65535/2) would become 65535 and the rest would become 0. The same idea applies for any number of bins between 1 and 65535. I wonder, is there an efficient and easy way to do this? If not, how can I do this efficiently when the number of bins is a power of 2? Pseudocode would be fine, but Python + NumPy is preferred.
It's not the most elegant solution, but:
import numpy as np

MIN_VALUE = 0
MAX_VALUE = 65535
NO_BINS = 2

# Create random dataset from the [0, 65535] interval
numbers = np.random.randint(0, 65535 + 1, 100)

# Create bin edges
bins = np.arange(0, 65535, (MAX_VALUE - MIN_VALUE) / NO_BINS)

# Assign each number to a bin
digits = np.digitize(numbers, bins)

# Get bin values
_, bin_val = np.histogram(numbers, NO_BINS - 1, range=(MIN_VALUE, MAX_VALUE))

# Change the values to the bin value
for iter_bin in range(1, NO_BINS + 1):
    numbers[np.where(digits == iter_bin)] = bin_val[iter_bin - 1]
UPDATE
Does the same job:
import pandas as pd
import numpy as np
# or bin_labels = [i*((MAX_VALUE - MIN_VALUE) / (NO_BINS-1)) for i in range(NO_BINS)]
_, bin_labels = np.histogram(numbers, NO_BINS-1, range=(MIN_VALUE, MAX_VALUE))
pd.cut(numbers, NO_BINS, right=False, labels=bin_labels)
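A simpler deterministic sketch (not from the answer above) uses np.linspace and np.digitize, mapping each value to the representative level of its bin:
import numpy as np

def quantize(values, n_bins, lo=0.0, hi=65535.0):
    """Map each value to one of n_bins evenly spaced levels between lo and hi."""
    values = np.asarray(values, dtype=float)
    edges = np.linspace(lo, hi, n_bins + 1)      # n_bins equal-width bins
    levels = np.linspace(lo, hi, n_bins)         # one output level per bin
    idx = np.clip(np.digitize(values, edges) - 1, 0, n_bins - 1)
    return levels[idx]

x = np.array([0, 100, 30000, 40000, 65535])
print(quantize(x, 2))   # -> [    0.     0.     0. 65535. 65535.]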

What exactly does this random.uniform line in Python do?

I'm following a tutorial here from Andrew Cross on using random forests in Python. I got the code to run fine, and for the most part I understand the output. However, I am unsure on exactly what this line does:
df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75
I know that it "creates a (random) uniform distribution between 0 and 1 and assigns 3/4ths of the data to be in the training subset." However, the training subset is not always exactly 3/4 of the data. Sometimes it is smaller and sometimes it is larger. So is a randomly sized subset chosen that is approximately 75%? Why not make it always 75%?
np.random.uniform(0, 1, len(df)) creates an array of len(df) random numbers.
<= .75 then creates another array containing True where the numbers matched that condition, and False in other places.
The code then uses the data in indexes where True was found. Since the random distribution is... well, random, you won't get exactly 75% of the values.
It does not assign 3/4ths of the data to the training subset.
It sets the probability that any given row is in the training subset to 3/4:
Example:
>>> import numpy as np
>>> sum(np.random.uniform(0, 1, 10) < .75)
8
>>> sum(np.random.uniform(0, 1, 10) < .75)
10
>>> sum(np.random.uniform(0, 1, 10) < .75)
7
80% of the data is in the training subset in the 1st example
100% -- in the 2nd one
70% -- in the 3rd.
On average, it should be 75%.
If you want to be stricter and randomly select a training set that is always very close to 75%, you can use code like this:
d = np.random.uniform(0, 1, 1000)
p = np.percentile(d, 75)
print(np.sum(d <= p)) # 750
print(np.sum(d <= .75)) # 745
In your example:
d = np.random.uniform(0, 1, len(df))
p = np.percentile(d, 75)
df['is_train'] = d <= p
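If the split has to be exactly 75% of the rows (up to rounding) rather than merely close to it, a common alternative is to permute the row positions and mark the first three quarters; df here is the same DataFrame as in the question:
import numpy as np

n = len(df)
n_train = int(round(0.75 * n))

perm = np.random.permutation(n)          # random order of the row positions
is_train = np.zeros(n, dtype=bool)
is_train[perm[:n_train]] = True          # exactly n_train rows marked True
df['is_train'] = is_train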

2D numpy array where spacing between items is defined by a function

I need a list or 2-d array of integers between a minimum value and a maximum value, where the interval between the integers varies inversely with a distribution function. In other words, at the maximum of the distribution the density of integers should be highest. In my case, something like a Weibull probability density function with shape parameter k = 1.5 would be nice.
Output would look something like this:
>>> min = 1
>>> max = 500
>>> peak = 100
>>> n = 18
>>> myfunc(min, max, peak, n)
[1, 50, 75, 88, 94, 97, 98, 99, 100, 102, 106, 112, 135, 176, 230, 290, 360, 500]
I already tried one method using the np.random.weibull() function to populate a numpy array but this doesn't work out nicely enough; the randomization when producing a list of 20 items means that the spacing is not satisfactory. It would be much better to avoid generating random numbers from a distribution and instead do what I described above, controlling the spacing directly.
Thank you.
Edit: I mention a Weibull distribution because it is asymmetric, but of course any similar distribution function that gives similar results is also OK and may be more suitable.
Edit2: So I want a numpy non-linear space!
Edit3: As I answered in one comment, I want to avoid random number generation so that the function output is identical each time it is run with the same input parameters.
If I'm understanding your question right, this function should do what you're asking:
import numpy as np

def weibullspaced(min, max, k, arrsize):
    # cumulative sums of Weibull-distributed gaps, rescaled to [min, max]
    wb = np.random.weibull(k, arrsize - 1)
    spaced = np.zeros((arrsize,))
    spaced[1:] = np.cumsum(wb)
    diff = max - min
    spaced *= diff / spaced[-1]
    return min + np.rint(spaced)
You can of course substitute in any distribution you want, but you said you wanted Weibull. Is that the function you're looking for?
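A quick usage sketch (note the output changes on every run because the spacing is random, which is what the later edit to the question wants to avoid):
print(weibullspaced(1, 500, 1.5, 18))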
Here is a rather inelegant but simple solution to my own question. I simplified things by using a triangular distribution function, which makes it easy to specify a minimum and a maximum value. A function named spacing() returns a spacing amount for a given x value according to a specified mathematical function. After incrementing x through a while loop, I add the Max value to the list so that the full range is present, and then convert to integers while converting to a numpy array.
The downside of this method is that I must manually specify a minimum and maximum step size. I would rather specify the length of the returned array!
import numpy as np

Min = 1.0
Max = 500.0
peak = 100.0
minstep = 1.0
maxstep = 50.0

def spacing(x):
    # Triangle distribution: step size shrinks towards the peak and grows after it
    if x < peak:
        # Since we are calculating gradients I keep everything as floats for now.
        grad = (minstep - maxstep) / (peak - Min)
        return grad * x + maxstep
    elif x == peak:
        return minstep
    else:
        grad = (maxstep - minstep) / (Max - peak)
        return grad * x + minstep

def myfunc(Min, Max, peak, minstep, maxstep):
    x = 1.0
    chosen = []
    while x < Max:
        space = spacing(x)
        chosen.append(x)
        x += space
    chosen.append(Max)
    # I cheat with the integers by casting the list to ints right at the end:
    chosen = np.array(chosen, dtype='int')
    return chosen

print(myfunc(1.0, 500.0, 100.0, 1.0, 50.0))
Output:
[ 1 50 75 88 94 97 99 100 113 128 145 163 184 208 235 264 298 335 378 425 478 500]
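Another way to avoid random numbers entirely (not from the answers above, just a sketch) is to evaluate the inverse CDF of the desired density at evenly spaced probabilities, so the output is identical for identical inputs; here with a triangular density peaking at peak:
import numpy as np

def density_spaced(lo, hi, peak, n):
    """n integers between lo and hi, packed most densely around peak,
    spaced deterministically via the inverse CDF of a triangular density."""
    grid = np.linspace(lo, hi, 10001)
    # triangular density rising from lo to peak, then falling to hi
    dens = np.where(grid <= peak,
                    (grid - lo) / (peak - lo),
                    (hi - grid) / (hi - peak))
    cdf = np.cumsum(dens)
    cdf = cdf / cdf[-1]                               # normalise to [0, 1]
    # invert the CDF at n evenly spaced probabilities
    points = np.interp(np.linspace(0, 1, n), cdf, grid)
    return np.rint(points).astype(int)

print(density_spaced(1, 500, 100, 18))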

making binned boxplot in matplotlib with numpy and scipy in Python

I have a 2-d array containing pairs of values and I'd like to make a boxplot of the y-values by different bins of the x-values. I.e. if the array is:
my_array = array([[1, 40.5], [4.5, 60], ...]])
then I'd like to bin my_array[:, 0] and then for each of the bins, produce a boxplot of the corresponding my_array[:, 1] values that fall into each box. So in the end I want the plot to contain number of bins-many box plots.
I tried the following:
min_x = min(my_array[:, 0])
max_x = max(my_array[:, 1])
num_bins = 3
bins = linspace(min_x, max_x, num_bins)
elts_to_bins = digitize(my_array[:, 0], bins)
However, this gives me values in elts_to_bins that range from 1 to 3. I thought I should get 0-based indices for the bins, and I only wanted 3 bins. I'm assuming this is due to some trickiness with how bins are represented in linspace vs. digitize.
What is the easiest way to achieve this? I want num_bins-many equally spaced bins, with the first bin containing the lower half of the data and the upper bin containing the upper half... i.e., I want each data point to fall into some bin, so that I can make a boxplot.
thanks.
You're getting the 3rd bin for the maximum value in the array (I'm assuming you have a typo there, and max_x should be "max(my_array[:,0])" instead of "max(my_array[:,1])"). You can avoid this by adding 1 (or any positive number) to the last bin.
Also, if I'm understanding you correctly, you want to bin one variable by another, so my example below shows that. If you're using recarrays (which are much slower) there are also several functions in matplotlib.mlab (e.g. mlab.rec_groupby, etc) that do this sort of thing.
Anyway, in the end, you might have something like this (to bin x by the values in y, assuming x and y are the same length)
import numpy as np

def bin_by(x, y, nbins=30):
    """
    Bin x by y.
    Returns the binned "x" values and the left edges of the bins.
    """
    bins = np.linspace(y.min(), y.max(), nbins + 1)
    # To avoid an extra bin for the max value
    bins[-1] += 1
    indices = np.digitize(y, bins)
    output = []
    for i in range(1, len(bins)):
        output.append(x[indices == i])
    # Just return the left edges of the bins
    bins = bins[:-1]
    return output, bins
As a quick example:
In [3]: x = np.random.random((100, 2))
In [4]: binned_values, bins = bin_by(x[:,0], x[:,1], 2)
In [5]: binned_values
Out[5]:
[array([ 0.59649575, 0.07082605, 0.7191498 , 0.4026375 , 0.06611863,
0.01473529, 0.45487203, 0.39942696, 0.02342408, 0.04669615,
0.58294003, 0.59510434, 0.76255006, 0.76685052, 0.26108928,
0.7640156 , 0.01771553, 0.38212975, 0.74417014, 0.38217517,
0.73909022, 0.21068663, 0.9103707 , 0.83556636, 0.34277006,
0.38007865, 0.18697416, 0.64370535, 0.68292336, 0.26142583,
0.50457354, 0.63071319, 0.87525221, 0.86509534, 0.96382375,
0.57556343, 0.55860405, 0.36392931, 0.93638048, 0.66889756,
0.46140831, 0.01675165, 0.15401495, 0.10813141, 0.03876953,
0.65967335, 0.86803192, 0.94835281, 0.44950182]),
array([ 0.9249993 , 0.02682873, 0.89439141, 0.26415792, 0.42771144,
0.12292614, 0.44790357, 0.64692616, 0.14871052, 0.55611472,
0.72340179, 0.55335053, 0.07967047, 0.95725514, 0.49737279,
0.99213794, 0.7604765 , 0.56719713, 0.77828727, 0.77046566,
0.15060196, 0.39199123, 0.78904624, 0.59974575, 0.6965413 ,
0.52664095, 0.28629324, 0.21838664, 0.47305751, 0.3544522 ,
0.57704906, 0.1023201 , 0.76861237, 0.88862359, 0.29310836,
0.22079126, 0.84966201, 0.9376939 , 0.95449215, 0.10856864,
0.86655289, 0.57835533, 0.32831162, 0.1673871 , 0.55742108,
0.02436965, 0.45261232, 0.31552715, 0.56666458, 0.24757898,
0.8674747 ])]
Hope that helps a bit!
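To get from those binned values to the boxplot the question actually asks for, they can be passed straight to matplotlib (a small sketch that assumes the bin_by function above):
import matplotlib.pyplot as plt
import numpy as np

my_array = np.random.random((100, 2)) * [10, 100]    # made-up (x, y) pairs
# bin the y-values (column 1) by the x-values (column 0)
binned_values, bin_edges = bin_by(my_array[:, 1], my_array[:, 0], nbins=3)

plt.boxplot(binned_values)                            # one box per x-bin
plt.xticks(range(1, len(bin_edges) + 1),
           ['%.1f' % edge for edge in bin_edges])
plt.xlabel('x bin (left edge)')
plt.ylabel('y values')
plt.show()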
NumPy has a dedicated function for creating histograms the way you need:
histogram(a, bins=10, range=None, density=None, weights=None)
which you can use like:
(hist_data, bin_edges) = histogram(my_array[:,0], weights=my_array[:,1])
The key point here is to use the weights argument: each value a[i] will contribute weights[i] to the histogram. Example:
a = [0, 1]
weights = [10, 2]
describes 10 points at x = 0 and 2 points at x = 1.
You can set the number of bins, or the bin limits, with the bins argument (see the official documentation for more details).
The histogram can then be plotted with something like:
bar(bin_edges[:-1], hist_data)
If you only need to do a histogram plot, the similar hist() function can directly plot the histogram:
hist(my_array[:,0], weights=my_array[:,1])
