Selecting data on median value - python

I want to select one row of an array by the median value in one of the columns.
My method does not work the way I expect, and it may be related to the representation/precision of the value returned by the numpy.median() function.
Here is a minimal working example and a workaround that I found:
import numpy as np
# Create an array with random numbers
some_array = np.random.rand(100)
# Try to select
selection = (some_array == np.median(some_array))
print(len(some_array[selection]), len(some_array[~selection]))  # Gives: 0, 100 -> selection fails
# Work-around
abs_dist_from_median = np.abs(some_array-np.median(some_array))
selection = (abs_dist_from_median == np.min(abs_dist_from_median))
print(len(some_array[selection]), len(some_array[~selection]))  # Gives: 1, 99 -> selection succeeds
It seems that the np.median() function returns a different representation of the number, thereby leading to a mismatch in the selection.
I find this behaviour strange, since by definition the median value of an array should be contained in the array. Any help/clarification would be appreciated!

First, when the number of values is even, as in [1, 2, 3, 4], the median is (2+3)/2, not 2 or 3. If you change 100 to 101, it works properly. So your second approach is more appropriate for your purpose.
However, the best solution seems to use argsort as
some_array[some_array.argsort()[len(some_array) // 2]]
Also, do not use == when comparing two float values; use np.isclose instead.
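A minimal sketch combining both suggestions (an odd-length array, so the median is an actual element, plus np.isclose for the comparison):
import numpy as np

some_array = np.random.rand(101)   # odd length: np.median returns an element, not an average
# Element at the middle of the sorted order
median_element = some_array[some_array.argsort()[len(some_array) // 2]]

# Tolerant float comparison instead of ==
selection = np.isclose(some_array, np.median(some_array))
print(len(some_array[selection]), len(some_array[~selection]))   # Gives: 1, 100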

Change string values in numpy.array in certain indices

I need to create an array of strings (color values, actually), one for each value in another array. The logic is that positive values should get one color and negative values another.
I've tried this code snippet:
values = np.array([1, 2, -3, 4, 5])
color_values = np.array(['rgb(74,159,234)'] * len(values))
color_values[values < 0] = 'rgb(120,183,239)'
print(color_values)
But the problem is that the new string values are truncated to the length of the previous values in the array, so the result is:
['rgb(74,159,234)', 'rgb(74,159,234)', 'rgb(120,183,239', 'rgb(74,159,234)', 'rgb(74,159,234)']
The third value is changed, but without the last parenthesis. I can rewrite the code to achieve the result I need, but now I'm curious why this happens.
I'm using Python 3.6, numpy 1.14.2
According to this answer, str numpy arrays have a fixed element length. The suggestion there is to specify the data type when declaring the array.
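A quick sketch showing the inferred fixed width:
import numpy as np

values = np.array([1, 2, -3, 4, 5])
color_values = np.array(['rgb(74,159,234)'] * len(values))
print(color_values.dtype)   # <U15: a fixed width of 15 characters per element
# Assigning the 16-character 'rgb(120,183,239)' therefore silently truncates it to 15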
You could try to add the datatype when declaring your array; set it to 16 characters (or more). On Python 3, use a Unicode dtype ('U16') rather than a bytes dtype ('S16'), since plain strings are Unicode there.
color_values = np.array(['rgb(74,159,234)'] * len(values), dtype='U16')
The rest of the lines should not need modification.
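A minimal sketch of the fixed version end to end:
import numpy as np

values = np.array([1, 2, -3, 4, 5])
# Reserve 16 Unicode characters per element so the longer color string fits
color_values = np.array(['rgb(74,159,234)'] * len(values), dtype='U16')
color_values[values < 0] = 'rgb(120,183,239)'
print(color_values)
# ['rgb(74,159,234)' 'rgb(74,159,234)' 'rgb(120,183,239)' 'rgb(74,159,234)' 'rgb(74,159,234)']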

Effectively count the number of repetitions for each number in a two-dimensional array

I need to find the duplicate numbers in multiple one-dimensional arrays and count the repetitions of each. np.unique handles this well for a one-dimensional array, but it does not seem to apply to two-dimensional arrays. I have searched for similar answers, but I need a more detailed report (the number of occurrences of every number, plus position indices).
Can numpy bincount work with 2D arrays?
That answer does not quite match: I hope to get a map containing more information about the data, such as which number occurs most often. I also want to avoid looping; maybe that is not always appropriate, but I will try to find loop-free approaches because I have very strict speed requirements.
For example:
a = np.array([[1, 2, 2, 2, 3],
              [0, 1, 1, 1, 2],
              [0, 0, 0, 1, 0]])
# The number of occurrences of each number in the first row:
# value  count
#   0      0
#   1      1
#   2      3
#   3      1
# Needed output (row index = input row, column index = value, entry = count):
[[0 1 3 1]
 [1 3 1 0]
 [4 1 0 0]]
Because this runs inside a loop, I need an efficient vectorized way to compute the statistics for many rows at once, and I am trying to avoid looping again.
I've used group-by aggregation to count the results: the function constructs a key1 that distinguishes rows, uses the data itself as key2, and sums a two-dimensional array of all 1s. It produces the right output, but I consider it only a temporary measure and would like to know the right way.
import numpy as np
import numpy_indexed as npi

def unique2d(x):
    x = x.astype(int)
    mx = int(np.nanmax(x)) + 1
    # key1: the row index of every element
    ltbe = np.tile(np.arange(x.shape[0])[:, None], (1, x.shape[1]))
    # All ones, so that summing per group counts occurrences
    vtbe = np.ones(x.shape, dtype=int)
    groups = npi.group_by((ltbe.ravel(), x.ravel()))
    unique, counts = groups.sum(vtbe.ravel())
    ctbe = np.zeros(x.shape[0] * mx, dtype=int)
    ctbe[unique[0] * mx + unique[1]] = counts
    ctbe.shape = (x.shape[0], mx)
    return ctbe

unique2d(a)
array([[0, 1, 3, 1],
       [1, 3, 1, 0],
       [4, 1, 0, 0]])
Hope there are good suggestions and algorithms, thanks
The fewest lines of code I can come up with are as follows:
import numpy as np
import numpy_indexed as npi
a = np.array([[1, 2, 2, 2, 3],
              [0, 1, 1, 1, 2],
              [0, 0, 0, 1, 0]])
row_idx = np.indices(a.shape, dtype=np.int32)[0]
axes, table = npi.Table(row_idx.flatten(), a.flatten()).count()
I haven't profiled this, but it does not contain any hidden un-vectorized for-loops, and I doubt you could do it much faster in numpy by any means. I don't expect it to perform a whole lot faster than your current solution, though; using the smallest possible int types may help.
Note that this function does not assume that the elements of a form a contiguous set; the axis labels are returned in the axes tuple. That may or may not be the behavior you are looking for, but modifying the code in the Table class to conform to your current layout shouldn't be hard.
If speed is your foremost concern, your problem would probably map really well to numba.
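For what it's worth, here is a loop-free sketch in plain NumPy (assuming the values are small non-negative integers): each row's values are offset into their own block of bins, so a single np.bincount call counts everything at once.
import numpy as np

a = np.array([[1, 2, 2, 2, 3],
              [0, 1, 1, 1, 2],
              [0, 0, 0, 1, 0]])

n_rows = a.shape[0]
n_vals = a.max() + 1
# Shift row i's values by i * n_vals so each row occupies its own block of bins
offset = a + np.arange(n_rows)[:, None] * n_vals
counts = np.bincount(offset.ravel(), minlength=n_rows * n_vals).reshape(n_rows, n_vals)
print(counts)
# [[0 1 3 1]
#  [1 3 1 0]
#  [4 1 0 0]]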

What does numpy.percentile mean and how to use this for splitting array?

I am trying to understand percentiles in numpy.
import numpy as np
nd_array = np.array([3.6216, 4.5459, -3.5637, -2.5419])
step_intervals = range(100, 0, -5)
for percentile_interval in step_intervals:
    threshold_attr_value = np.percentile(nd_array, percentile_interval)
    print("percentile interval ={interval}, threshold_attr_value = {threshold_attr_value}, {arr}".format(
        interval=percentile_interval, threshold_attr_value=threshold_attr_value, arr=sorted(nd_array)))
I get a value of these as
percentile interval =100, threshold_attr_value = 4.5459, [-3.5636999999999999, -2.5419, 3.6215999999999999, 4.5458999999999996]
...
percentile interval =5, threshold_attr_value = -3.41043, [-3.5636999999999999, -2.5419, 3.6215999999999999, 4.5458999999999996]
What do these percentile values mean?
100% of the values in the array are < 4.5459?
5% of values in the array are < -3.41043?
Is that the correct way to read these?
I want to split the numpy array into small sub-arrays, based on the percentiles of the elements. How can I do this?
To be more precise, you should say that a = np.percentile(arr, q) indicates that nearly q% of the elements of arr are lower than a. Why do I emphasize nearly?
If q=100, it always returns the maximum of arr. So, you cannot say that q% of elements are "lower than" a.
If q=0, it always returns the minimum of arr. So, you cannot say that q% of elements are "lower than or equal to" a.
In addition, the returned value depends on the type of interpolation.
The following code shows the role of interpolation parameter:
>>> import numpy as np
>>> arr = np.array([1,2,3,4,5])
>>> np.percentile(arr, 90) # default interpolation='linear'
4.5999999999999996
>>> np.percentile(arr, 90, interpolation='lower')
4
>>> np.percentile(arr, 90, interpolation='higher')
5
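To see where the 4.6 comes from, a sketch of the arithmetic behind the default 'linear' method (assuming the rank formula (n - 1) * q / 100, which matches the output above):
import numpy as np

arr = np.array([1, 2, 3, 4, 5])
q = 90
rank = (len(arr) - 1) * q / 100.0     # 3.6: between arr[3] = 4 and arr[4] = 5
lo = int(rank)
frac = rank - lo                      # 0.6
value = arr[lo] + frac * (arr[lo + 1] - arr[lo])
print(value)                          # ~4.6, matching np.percentile(arr, 90) up to float rounding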
No, as you can see by inspection, only 75% of the values in your array are strictly less than 4.5459, and 25% of the values are strictly less than -3.41043. If you had written less than or equal to, you would have been giving one common definition of "percentile", which also happens not to be what is applied in your case.
Instead, what's happening is that numpy applies an interpolation scheme to ensure that the mapping taking a given number in [0, 100] to the corresponding percentile is continuous and piecewise linear, while still giving the "right" value at ranks corresponding to values in the given array. As it turns out, even this can be done in many different ways, all of which are reasonable, as described in the Wikipedia article on the subject. As you can see in the documentation of numpy.percentile, you have some control over the interpolation behaviour; by default it uses what the Wikipedia article calls the "second variant, C = 1".
Perhaps the easiest way to understand the implications of this is to simply plot the result of calculating np.percentile across the whole range [0, 100] for your fixed length-4 array:
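A minimal sketch of such a plot (assuming matplotlib is available):
import matplotlib.pyplot as plt
import numpy as np

nd_array = np.array([3.6216, 4.5459, -3.5637, -2.5419])
p = np.linspace(0, 100, 500)
plt.plot(p, [np.percentile(nd_array, q) for q in p])
plt.xlabel('q (percentile)')
plt.ylabel('np.percentile(nd_array, q)')
plt.show()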
Note how the kinks are spread evenly across [0, 100] and that the percentiles corresponding to the actual values in your array are given by evaluating lambda p: np.percentile(nd_array, p) at 0*100/(4-1), 1*100/(4-1), 2*100/(4-1), and 3*100/(4-1) respectively.
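As for splitting the array into sub-arrays by percentiles, a minimal sketch using np.percentile with np.digitize (here into quartile bins):
import numpy as np

nd_array = np.array([3.6216, 4.5459, -3.5637, -2.5419])
edges = np.percentile(nd_array, [25, 50, 75])    # inner quartile boundaries
bins = np.digitize(nd_array, edges)              # bin index 0..3 for each element
sub_arrays = [nd_array[bins == i] for i in range(len(edges) + 1)]
print(sub_arrays)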

Getting numbers within a range from a gaussian_kde_resample array

I have an array from gaussian_kde.resample. I don't know whether it is a numpy array, so I am not sure I can use numpy functions on it.
I had data in the range 0 < x <= 0.5 for 3000 variables, and I used
kde = scipy.stats.gaussian_kde(x) # can also mention bandwidth here (x,bandwidth)
sample = kde.resample(100000) # returns 100,000 values that follow the prob distribution of "x"
This gave me a sample of data that follows the probability distribution of "x". But the problem is that, no matter what bandwidth I select, I get a few negative values in my "sample". I only want values within the range 0 < sample <= 0.5.
I tried to do:
sample = np.array(sample) # to convert this to a numpy array
keep = 0<sample<=0.5
sample = sample[keep] # using the binary conditions
But this does not work! How can I remove the negative values in my array?
Firstly, you can check what type it is by using the type() call within Python:
x = kde.resample(10000)
type(x)
numpy.ndarray
Secondly, a chained comparison like 0 < sample <= 0.5 does not actually work on numpy arrays (it raises an ambiguity error), so write the binary conditions explicitly:
print(x)
array([[ 1.42935658, 4.79293343, 4.2725778 , ..., 2.35775067, 1.69647609]])
x.size
10000
y = x[(x>1.5) & (x<4)]
which, as you can see, applies the binary conditions correctly and keeps only the values >1.5 and <4:
print(y)
array([ 2.95451084, 2.62400183, 2.79426449, ..., 2.35775067, 1.69647609])
y.size
5676
I know I'm answering about 3 years late, but this may be useful for future reference.
The catch is that while kde.resample(100000) technically returns a NumPy array, that array is two-dimensional, with shape (1, 100000): the samples are wrapped in an extra dimension, and that gets in the way of all the attempts to use indexing to get subsets of the sample. To get the flat array that the resample() method arguably should have returned all along, do this instead:
sample = kde.resample(100000)[0]
The array variable sample should then have all 100000 samples, and indexing this array should work as expected.
Why SciPy does it this way: gaussian_kde supports d-dimensional datasets, and resample returns an array of shape (d, n), so for one-dimensional data you get the easy-to-miss shape (1, n).
First of all, the return value of kde.resample is already a numpy array, so you do not need to reconvert it.
The problem lies in the line
keep = 0 < sample <= 0.5
It does not do what you might think: Python expands the chained comparison to (0 < sample) and (sample <= 0.5), and the and of two boolean arrays raises an ambiguity error. Make the two comparisons explicit and combine them element-wise:
keep = (0 < sample) & (sample <= 0.5)
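Putting the two answers together, a minimal sketch (with synthetic stand-in data, since the original x isn't shown):
import numpy as np
from scipy.stats import gaussian_kde

x = np.random.uniform(0.01, 0.5, 3000)   # stand-in for the real 0 < x <= 0.5 data
kde = gaussian_kde(x)
sample = kde.resample(100000)[0]          # drop the extra (1, n) dimension
keep = (sample > 0) & (sample <= 0.5)
sample = sample[keep]
print(sample.min(), sample.max(), sample.size)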

NumPy masking issue -- What am I missing?

I'm plotting diet information using matplotlib, where the x-axis represents a range of dates, and the y-axis represents the number of calories consumed. Not too complicated that, but there is one snag: not all dates have calorie information, and it would make most sense to leave those out rather than do some sort of interpolation/smoothing.
I found several good examples of using numpy masks for such situations, but it seems I'm not getting something straight, as the code that I think should produce the results I want doesn't change anything.
Have a look:
calories_list_ma = np.ma.masked_where(calories_list == 0, calories_list)
plt.plot(datetimes_list, calories_list_ma, marker = 'x', color = 'r', ls = '-')
Which produces this:
I just want there to be an unplotted gap in the line for 9-23.
And actually, I know my use of masked_where must be incorrect, because when I print calories_list_ma.mask, the result is just False, not an array showing which values are masked/unmasked with True and False, as it should be.
Can someone set me straight?
Thanks so much!
I'm guessing from the name that your calories_list is a Python list. If it is a list, calories_list == 0 will return a single value, namely False, since the list as a whole does not equal the value 0. masked_where will then dutifully set the mask to False, resulting in an unmasked copy of your list.
You need to do calories_list = np.array(calories_list) first to make it into a numpy array. Unlike lists, numpy arrays have the "broadcasting" feature whereby calories_list == 0 compares each element individually to zero.
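For reference, a minimal sketch of that fix with made-up stand-in data (the real datetimes_list and calories_list aren't shown):
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical stand-in data: 9-22 has no calorie information, recorded as 0
datetimes_list = np.arange('2023-09-20', '2023-09-26', dtype='datetime64[D]')
calories_list = np.array([2100, 1950, 0, 2200, 2050, 1900])   # a numpy array, not a list

calories_list_ma = np.ma.masked_where(calories_list == 0, calories_list)
print(calories_list_ma.mask)   # [False False  True False False False]

plt.plot(datetimes_list, calories_list_ma, marker='x', color='r', ls='-')
plt.show()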
Alternatively, try using np.ma.masked_equal, which masks every entry equal to a given value:
calories_list_ma = np.ma.masked_equal(np.array(calories_list), 0)
