Getting numbers within a range from a gaussian_kde_resample array - python

I have an array from gaussian_kde.resample. I don't know whether it is a numpy array, so I'm not sure I can use numpy functions on it.
I had data with 0 < x <= 0.5 for 3000 variables, and I used:
kde = scipy.stats.gaussian_kde(x) # a bandwidth can also be passed via the bw_method argument
sample = kde.resample(100000) # returns 100,000 values that follow the prob distribution of "x"
This gave me a sample of data that follows the probability distribution of "x". But no matter which bandwidth I select, I always get a few negative values in my "sample". I only want values in the range 0 < sample <= 0.5.
I tried to do:
sample = np.array(sample) # to convert this to a numpy array
keep = 0<sample<=0.5
sample = sample[keep] # using the binary conditions
But this does not work! How can I remove the negative values in my array?

Firstly, you can check what type it is with Python's built-in type():
x = kde.resample(10000)
type(x)
numpy.ndarray
Secondly, a chained comparison like 0 < sample <= 0.5 raises a ValueError on numpy arrays (the truth value of an array is ambiguous), so you need to be explicit and combine the two conditions with &:
print x
array([[ 1.42935658, 4.79293343, 4.2725778 , ..., 2.35775067, 1.69647609]])
x.size
10000
y = x[(x>1.5) & (x<4)]
which, as you can see, applies both conditions correctly and keeps only the values greater than 1.5 and less than 4:
print y
array([ 2.95451084, 2.62400183, 2.79426449, ..., 2.35775067, 1.69647609])
y.size
5676

I know I'm answering about 3 years late, but this may be useful for future reference.
The catch is that while kde.resample(100000) technically returns a NumPy array, for 1-D input that array has shape (1, 100000), i.e. it contains another array(!), and that gets in the way of all the attempts to use indexing to get subsets of the sample. To get the flat array you probably expected all along, index the first row:
sample = kde.resample(100000)[0]
The array variable sample should then have all 100000 samples, and indexing this array should work as expected.
SciPy does it this way because resample supports multivariate data: the returned array has shape (d, n), one row per dimension of the input. For 1-D data this is easy to trip over, and easy to miss in the documentation.
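Putting it together, a minimal sketch (the uniform data here is only a stand-in for the real x):
import numpy as np
import scipy.stats
x = np.random.uniform(0.01, 0.5, 3000) # stand-in for the real data
kde = scipy.stats.gaussian_kde(x)
sample = kde.resample(100000) # shape (1, 100000) for 1-D input
sample = sample[0] # flatten to shape (100000,)
keep = (0 < sample) & (sample <= 0.5) # elementwise conditions combined with &
sample = sample[keep]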

First of all, the return value of kde.resample is already a numpy array, so you do not need to convert it.
The problem lies in the line
keep = 0 < sample <= 0.5
A chained comparison like this is evaluated as (0 < sample) and (sample <= 0.5), and the "and" forces each array into a single truth value, which numpy refuses ("The truth value of an array with more than one element is ambiguous"). Try instead:
keep = (0 < sample) * (sample <= 0.5)
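A note on style: multiplying boolean arrays works because True and False behave as 1 and 0, but the more idiomatic numpy spelling is the elementwise & operator (with parentheses, since & binds more tightly than the comparisons):
keep = (0 < sample) & (sample <= 0.5)
sample = sample[keep]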

Why does boolean output turn to numerical output once I create a loop?

I am using the skimage library, and I got it to work correctly for different input data files.
Here is the code that works:
To explain it briefly, alpha_time is a level set function structured as [time,x,y,z], so alpha_time[0,:,:,:] is the level set function at time = 0.
gaz0 = alpha_0[0,:,:,:] >= 0
labels_gaz0 = measure.label(gaz0)
props_gaz0 = measure.regionprops_table(labels_gaz0, properties=['label','area'])
df0 = pandas.DataFrame(props_gaz0)
This code works correctly.
Now, rather than repeating this each time, I create a for loop over time. I started with these lines (say I have 10 files, and the shape of gaz0 is (11,12,13)):
gaz = numpy.zeros((time,11,12,13))
for counter in range(0,10):
    gaz[counter,:,:,:] = alpha_time[counter,:,:,:] >= 0
I did not get an error, but when I do print(gaz[counter,:,:,:]) I get a numerical matrix,
while print(gaz0) gives a boolean output: True where alpha_0[0,:,:,:] >= 0 and False elsewhere.
I think the output from the loop should be similar to the example before looping. Where is this difference coming from?
The problem is that gaz is defined as a numpy array of floats (the default dtype for the np.zeros function), while gaz0 is a boolean mask computed when you define it, so it contains booleans.
If you want gaz0 to contain floats instead of booleans, you need to cast it as follows:
gaz0 = gaz0.astype(np.float64)
Note that you can cast to whatever dtype you need. Consider that True values are cast to 1 and False to 0.
This is exactly the rule being applied implicitly in your assignment:
gaz[counter,:,:,:] = alpha_time[counter,:,:,:] >= 0
Conversely, if you want booleans in your gaz array, you just need to define it with the proper dtype:
gaz = numpy.zeros((time,11,12,13), dtype=bool)
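A quick demonstration of the casting rules at play (the arrays here are illustrative):
import numpy as np
mask = np.array([1.0, -2.0, 3.0]) >= 0 # boolean mask: [ True False  True]
floats = np.zeros(3) # default dtype is float64
floats[:] = mask # booleans are cast to 1.0 and 0.0 on assignment: [1. 0. 1.]
bools = np.zeros(3, dtype=bool)
bools[:] = mask # dtype is preserved: [ True False  True]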

Creating logical array from numpy array

I have a very large numpy array in Python full of meteorological data. In order to spot flawed data, I would like to test every value to see whether it is less than -1. Eventually I would like to represent this with a logical array of 0s and 1s, with 1 marking indices where the value is less than -1 and 0 everywhere else. I have tried using the numpy.where function as follows:
logarr = np.where(metdat < -1)
which returns the original array and an array of zeros for where the condition is true (around 200 times). I have also tried the numpy.where syntax laid out on SciPy.org, where
logarr = np.where(metdat < -1 [1,0])
but my program dislikes the syntax. What am I doing wrong or would anyone recommend a better way of going about this?
Thanks,
jmatt
This works for your case, directly converting the boolean result to int:
(metdat < -1).astype(int)
Or, for np.where, the syntax needs to be:
np.where(metdat < -1, 1, 0)
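A minimal, self-contained check with made-up readings (the metdat values here are illustrative) showing that both forms agree:
import numpy as np
metdat = np.array([0.5, -3.2, 2.0, -1.5]) # hypothetical readings
print((metdat < -1).astype(int)) # [0 1 0 1]
print(np.where(metdat < -1, 1, 0)) # [0 1 0 1]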

Float required in list output

I am trying to create a custom filter to run with the generic filter from the SciPy package:
scipy.ndimage.filters.generic_filter
The problem is that I don't know how to make the returned value a scalar, which the generic filter needs in order to work. I read through the threads listed at the bottom, but I can't find a way to make my function behave.
The code is this:
import numpy as np
import scipy.ndimage as sc
def minimum(window):
    list = []
    for i in range(window.shape[0]):
        window[i] -= min(window)
        list.append(window[i])
    return list
test = np.ones((10, 10)) * np.arange(10)
result = sc.generic_filter(test, minimum, size=3)
It gives the error:
cval, origins, extra_arguments, extra_keywords)
TypeError: a float is required
Scipy filter with multi-dimensional (or non-scalar) output
How to apply ndimage.generic_filter()
http://ilovesymposia.com/2014/06/24/a-clever-use-of-scipys-ndimage-generic_filter-for-n-dimensional-image-processing/
If I understand correctly, you want to subtract from each pixel the minimum of its size-3 neighbourhood. It's not good practice to do that with lists, because numpy exists for efficiency (roughly 100 times faster). The simplest way is just:
test - sc.generic_filter(test, np.min, size=3)
The subtraction is then vectorized over the whole array.
You can also do:
test - np.min([np.roll(test, 1), np.roll(test, -1), test], axis=0)
which is about 10 times faster again, if you accept the artefact at the border.
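If you do want generic_filter to do the subtraction in a single pass, the callback just has to return one scalar: the new value of the centre pixel. A minimal sketch (centre_minus_min is an illustrative name; the callback receives the neighbourhood as a flat 1-D array, so the centre of a size-3 window sits at index window.size // 2):
import numpy as np
import scipy.ndimage as sc
def centre_minus_min(window):
    # window is the flattened 3x3 neighbourhood of the current pixel
    return window[window.size // 2] - window.min()
test = np.ones((10, 10)) * np.arange(10)
result = sc.generic_filter(test, centre_minus_min, size=3)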
Using the example in Scipy filter with multi-dimensional (or non-scalar) output I converted your code to:
def minimum(window, out):
    list = []
    for i in range(window.shape[0]):
        window[i] -= min(window)
        list.append(window[i])
    out.append(list)
    return 0
test = np.ones((10, 10)) * np.arange(10)
result = []
sc.generic_filter(test, minimum, size=3, extra_arguments=(result,))
Now your minimum function writes its output to the parameter out, and its return value is no longer used. The final result list therefore contains all the per-window lists, not the output of generic_filter.
Edit 1: when generic_filter is used with a function that returns a scalar, it returns a matrix of the same dimensions as the input. Here, however, a list is appended on each call by the filter, which results in a larger structure (100 windows of 9 values each in this case).

Replacing missing values with random in a numpy array

I have a 2D numpy array with binary data, i.e. 0s and 1s (not observed or observed). For some instances, that information is missing (NaN). Since the missing values are random in the data set, I think the best way to replace them would be using random 0s and 1s.
Here is some example code:
import numpy as np
row, col = 10, 5
matrix = np.random.randint(2, size=(row,col))
matrix = matrix.astype(float)
matrix[1,2] = np.nan
matrix[5,3] = np.nan
matrix[8,0] = np.nan
matrix[np.isnan(matrix)] = np.random.randint(2)
The problem with this is that all NaNs are replaced with the same value, either 0 or 1, while I would like both. Is there a simpler solution than for example a for loop calling each NaN separately? The data set I'm working on is a lot bigger than this example.
Try
nan_mask = np.isnan(matrix)
matrix[nan_mask] = np.random.randint(0, 2, size=np.count_nonzero(nan_mask))
You can use a vectorized function:
random_replace = np.vectorize(lambda x: np.random.randint(2) if np.isnan(x) else x)
random_replace(matrix)
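Note that np.vectorize is a convenience wrapper around a Python-level loop, not true vectorization, so on large arrays the boolean-mask approach above will typically be much faster. Also, random_replace(matrix) returns a new array rather than modifying matrix in place.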
Since the missing values are random in the data set, I think the best way to replace them would be using random 0s and 1s.
I'd heartily contradict you here. Unless you have a stochastic model that justifies assuming each missing element is 0 or 1 with equal probability, filling in fair coin flips will bias your observations.
Now, I don't know where your data comes from, but a "2D array" sure sounds like an image signal or something similar. In many signal types most of the energy sits in the low frequencies; if that is the case for your data, you can probably get less distortion by replacing each missing value with the corresponding element of a low-pass filtered version of your 2D array.
Either way, since you need to call numpy.isnan from Python to check whether a value is NaN, I think the only way to solve this is to write an efficient loop, unless you want to senselessly generate a huge random 2D array just to fill in a few missing numbers.
EDIT: oh, I like the vectorized version; it's effectively what I'd call an efficient loop, since it does the looping without interpreting a Python loop iteration each time.
EDIT2: the mask method with counting nonzeros is even more effective, I guess :)
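For what it's worth, here is a rough sketch of the low-pass idea above, assuming SciPy is available; fill_nans_lowpass is a hypothetical helper, and the local mean is made NaN-aware by zeroing the NaNs and renormalizing by the filtered mask (this assumes every NaN has at least one observed neighbour in its window):
import numpy as np
from scipy import ndimage
def fill_nans_lowpass(matrix, size=3):
    filled = matrix.copy()
    nan_mask = np.isnan(filled)
    zeroed = np.where(nan_mask, 0.0, filled) # data with NaNs zeroed out
    weights = ndimage.uniform_filter((~nan_mask).astype(float), size=size)
    smoothed = ndimage.uniform_filter(zeroed, size=size)
    local_mean = smoothed / weights # NaN-aware local average
    filled[nan_mask] = np.round(local_mean[nan_mask]) # back to 0/1 for binary data
    return filled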

Can I do a fast set difference with floats using numpy if elements are equal up to some tolerance?

I have two lists of float numbers, and I want to calculate the set difference between them.
With numpy I originally wrote the following code:
aprows = allpoints.view([('',allpoints.dtype)]*allpoints.shape[1])
rprows = toberemovedpoints.view([('',toberemovedpoints.dtype)]*toberemovedpoints.shape[1])
diff = setdiff1d(aprows, rprows).view(allpoints.dtype).reshape(-1, 2)
This works well for things like integers. In the case of 2D points with float coordinates that result from geometrical calculations, finite precision and rounding errors cause the set difference to miss some equalities. For now I have resorted to the much, much slower:
diff = []
for a in allpoints:
    remove = False
    for p in toberemovedpoints:
        if norm(p - a) < 0.1:
            remove = True
    if not remove:
        diff.append(a)
return array(diff)
But is there a way to write this with numpy and gain back the speed?
Note that I want the remaining points to keep their full precision, so rounding the numbers first and then taking the set difference is probably not the way forward (or is it? :) )
Edited to add a solution based on scipy's KDTree that seems to work:
def remove_points_fast(allpoints, toberemovedpoints):
    diff = []
    removed = 0
    # prepare a KDTree
    from scipy.spatial import KDTree
    tree = KDTree(toberemovedpoints, leafsize=allpoints.shape[0]+1)
    for p in allpoints:
        distance, ndx = tree.query([p], k=1)
        if distance < 0.1:
            removed += 1
        else:
            diff.append(p)
    return array(diff), removed
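As a side note, the Python loop can be avoided entirely: SciPy's cKDTree.query accepts a whole array of points and returns all nearest-neighbour distances at once. A minimal sketch (remove_points_vectorized is an illustrative name; tol stands in for the hard-coded 0.1):
import numpy as np
from scipy.spatial import cKDTree
def remove_points_vectorized(allpoints, toberemovedpoints, tol=0.1):
    tree = cKDTree(toberemovedpoints)
    distances, _ = tree.query(allpoints, k=1) # nearest removed point per input point
    keep = distances >= tol
    return allpoints[keep], np.count_nonzero(~keep)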
If you want to do this in matrix form, memory consumption gets heavy with larger arrays. If that does not matter, you get the full difference array by broadcasting:
diff_array = allpoints[:,None] - toberemovedpoints[None,:]
The resulting array has as many rows as there are points in allpoints, as many columns as there are points in toberemovedpoints, and a final axis for the coordinates. Taking the absolute value and comparing against the tolerance gives a boolean array; collapse the coordinate axis with numpy.all and the columns with numpy.any to find which rows have any hit:
hits = numpy.any(numpy.all(numpy.abs(diff_array) < .1, axis=2), axis=1)
Now you have a boolean vector with the same number of items as there are rows in the difference array. You can use that vector to index allpoints (inverted with ~, because we want the non-matching points):
return allpoints[~hits]
This is a numpyish way of doing this. But, as I said above, it takes a lot of memory.
If you have larger data, then you are better off doing it point by point. Something like this:
return allpoints[~numpy.array([numpy.any(numpy.all(numpy.abs(a - toberemovedpoints) < .1, axis=1)) for a in allpoints])]
This should perform well in most cases, and the memory use is much lower than with the matrix solution. (For stylistic reasons you may want to invert the comparison and use numpy.all to get rid of the negation.)
(Beware, there may be typing mistakes in the code.)
