Using np.any() to capture outliers - python

I am trying to pull all upper and lower outliers from a data frame. I can do it in separate lines, assigning a variable to the uppers (ex. hi_outs = (sepal_outliers > 4.05)) and another variable to the lowers (ex. lo_outs = (sepal_outliers < 2.05)). But I wanted to try and grab both in one variable.
sepal_outliers = x[:,1]
outliers = np.any(sepal_outliers < 2.05, sepal_outliers > 4.05)
df[outliers]
I'm not sure why I've gotten the following error.
TypeError: only integer scalar arrays can be converted to a scalar index
any thoughts? How might I make this work? I'll keep reading up in the meantime...

You probably want to use NumPy's boolean indexing:
outliers = sepal_outliers[(sepal_outliers < 2.05) | (sepal_outliers > 4.05)]
That is, construct the boolean array of True or False for each element of the condition you set and index into the same array, sepal_outliers with it.
Note that for this, you need sepal_outliers to be a NumPy array rather than a common-or-garden Python list.

Related

Why boolean output turns to numerical output once I create a loop?

I am using skiimage library, I got it works correctly for different input files data.
Here the code that is working:
To explain it briefly, alpha_time is a level set function structured as [time,x,y,z], so alpha_time[0,:,:,:] is the level set function at time = 0.
gaz0 = alpha_0[0,:,:,:] >= 0
labels_gaz0 = measure.label(gaz0)
props_gaz0 = measure.regionprops_table(labels_gaz0,properties
['label','area'])
df0 = pandas.DataFrame(props_gaz0)
This code works correctly.
Now, rather than repeating it each time, I create a for loop to loop over time. I started with this line (let's say I have 10 files, let's say that the shape of gaz0 was (11,12,13):
gaz = numpy.zeros(time,11,12,13)
for counter in range(0,10):
gaz[counter,:,:,:] = alpha_time[counter,:,:,:] >=0
I did not have an error output, however when I do print(gaz[counter,:,:,:] I have a numerical matrix, ....
and when I do print(gaz0) I have a boolean output True when alpha_0[0,:,:,:] >= 0 and False elsewhere
I think that my output from the loop should be similar to the example before looping. I couldn't find from where this is coming from?
The problem is that gaz is defined as a numpy array of floats (default dtype for np.zeros function), while gaz0 is a boolean mask computed when you define it, so, it contains booleans on it.
If you want gaz0 to contain floats instead of booleans, you need to cast it as follows:
gaz0 = gaz0.astype(np.float64)
Note that you can cast it to whatever dtype you need. Consider that True values are casted to 1 and False to zero.
This is exactly the rule being applied implicitly in your assignment:
gaz[counter,:,:,:] = alpha_time[counter,:,:,:] >=0
In the second case, if you want to get booleans in your gaz numpy array, you just need to define with the proper dtype:
gaz = numpy.zeros((time,11,12,13), dtype=bool)

how to change a array of a single column to a row of values instead of arrays in python

I am trying to convert an array of arrays that each contain only one integer to a single array with just the integers.
This is my code below. k=1 after the first for loop and the next code deletes all the rows of except the first one and then transposes it.
handles.Background = np.zeros(((len(imgY) * len(imgX)),len(imgZ)))
WhereIsBackground = np.zeros((len(imgY), len(imgX)))
k = 0
for i in range(len(imgY)):
for j in range (len(imgX)):
if img[i,j,handles.PS_Index] < (handles.PS_Mean_Intensity / 8):
handles.Background[k,:] = img[i,j,:]
WhereIsBackground[i,j] = 1
k = k+1
handles.Background = np.delete(handles.Background,np.s_[k:(len(imgY)*len(imgX))+1],0).T
At this point, I can access data by using handles.Background[n] but this returns an array that contains a single integer. I was trying to convert the handles.Background so that when I do handles.Background[n], it just returns a single integer instead of an array containing that value.
So, I'm getting array([0.]) when I run handles.Background[0], but I want to get just 0 when I run handles.Background[0]
I've observed that int(handles.Background[i]) returns an integer and tried to reassign them using a for loop but the result didn't really change. What would be the best option for me?
for i in range (len(handles.Background)):
handles.Background[i] = int(handles.Background[i])
if handles.Background[n] returns an array, you can index into that, too, using the same [n] notation.
So you are looking for
handles.Background[n][0]
If you want to unpack the whole array at once, you can use this:
handles.Background = [bg[0] for bg in handles.Background]

Use of np.where()[0]

My code detects all the points under a threshold, then locates the start and end points.
below = np.where(self.data < self.threshold)[0]
startandend = np.diff(below)
startpoints = np.insert(startandend, 0, 2)
endpoints = np.insert(startandend, -1, 2)
startpoints = np.where(startpoints>1)[0]
endpoints = np.where(endpoints>1)[0]
startpoints = below[startpoints]
endpoints = below[endpoints]
I don't really get the use of [0] after np.where() function here
below = np.where(self.data < self.threshold)[0]
means:
take the first element from the tuple of ndarrays returned by np.where() and
assign it to below.
np.where is tricky. It returns an array of lists of indices where the conditions are met, even if the condition is never satisfied. In the case of np.where(my_numpy_array==some_value)[0] specifically, this means that you want the first value in the array, which is a list, and which contains the list of indices of condition-meeting cells.
Quite a mouthful. In simple terms, np.where(array==x)[0] returns a list of indices where the conditions have been met. I'm guessing this is a result of designing numpy for extensively broad applications.
Keep in mind that no matches still returns an empty list; errors like only size-1 arrays can be converted to python (some type) may be attributed to that.

Creating logical array from numpy array

I have a very large numpy array in Python full of meteorological data. In order to observe flawed data, I would like to look at every value and test it to see if it is less than -1. Eventually I would like to represent this with a logical array of 0's and 1's with 1 representing indices where the value is less than -1 and zeros representing all others. I have tried using the numpy.where funtion as follows
logarr = np.where(metdat < -1)
which returns the original array and the array of zeros for when this condition is true (around 200 times). I have tried using the numpy.where syntax laid out in Sci.Py.org where
logarr = np.where(metdat < -1 [1,0])
but my program dislikes the syntax. What am I doing wrong or would anyone recommend a better way of going about this?
Thanks,
jmatt
This works for your case, which directly converts the type from logical to int:
(matdat < -1).astype(int)
Or for np.where, the syntax needs to be:
np.where(matdat < -1, 1, 0)

Getting numbers within a range from a gaussian_kde_resample array

I have a gaussian_kde.resample array. I don't know if it is a numpy array so that I can use numpy functions.
I had the data 0<x<=0.5 of 3000 variables and I used
kde = scipy.stats.gaussian_kde(x) # can also mention bandwidth here (x,bandwidth)
sample = kde.resample(100000) # returns 100,000 values that follow the prob distribution of "x"
This gave me a sample of data that follows the probability distribution of "x". But the problem is, no matter what bandwidth I try to select, I get very few negative values in my "sample". I only want values within the range 0 < sample <= 0.5
I tried to do:
sample = np.array(sample) # to convert this to a numpy array
keep = 0<sample<=0.5
sample = sample[keep] # using the binary conditions
But this does not work! How can I remove the negative values in my array?
Firstly, you can check what type it is by using the 'type' call within python:
x = kde.resample(10000)
type(x)
numpy.ndarray
Secondly, it should be working in the way you wrote, but I would be more explicit in your binary condition:
print x
array([[ 1.42935658, 4.79293343, 4.2725778 , ..., 2.35775067, 1.69647609]])
x.size
10000
y = x[(x>1.5) & (x<4)]
which you can see, does the correct binary conditions and removes the values >1.5 and <4:
print y
array([ 2.95451084, 2.62400183, 2.79426449, ..., 2.35775067, 1.69647609])
y.size
5676
I know I'm answering about 3 years late, but this may be useful for future reference.
The catch is that while kde.resample(100000) technically returns a NumPy array, this array actually contains another array(!), and that gets in the way of all the attempts to use indexing to get subsets of the sample. To get the array that the resample() method probably should have returned all along, do this instead:
sample = kde.resample(100000)[0]
The array variable sample should then have all 100000 samples, and indexing this array should work as expected.
Why SciPy does it this way, I don't know. This misfeature doesn't even appear to be documented.
First of all, the return value of kde.resample is a numpy array, so you do not need to reconvert it.
The problem lies in the line (Edit: No, it doesn't. This should work!)
keep = 0 < sample <= 0.5
It does not do what you would think. Try:
keep = (0 < sample) * (sample <= 0.5)

Categories

Resources