Replacing missing values with random values in a numpy array - python

I have a 2D numpy array with binary data, i.e. 0s and 1s (0 = not observed, 1 = observed). For some instances that information is missing (NaN). Since the missing values are random in the data set, I think the best way to replace them would be with random 0s and 1s.
Here is some example code:
import numpy as np
row, col = 10, 5
matrix = np.random.randint(2, size=(row,col))
matrix = matrix.astype(float)
matrix[1,2] = np.nan
matrix[5,3] = np.nan
matrix[8,0] = np.nan
matrix[np.isnan(matrix)] = np.random.randint(2)
The problem with this is that all NaNs are replaced with the same value, either 0 or 1, while I would like both. Is there a simpler solution than for example a for loop calling each NaN separately? The data set I'm working on is a lot bigger than this example.

Try
nan_mask = np.isnan(matrix)
matrix[nan_mask] = np.random.randint(0, 2, size=np.count_nonzero(nan_mask))

You can use a vectorized function:
random_replace = np.vectorize(lambda x: np.random.randint(2) if np.isnan(x) else x)
random_replace(matrix)

Since the missing values are random in the data set, I think the best way to replace them would be using random 0s and 1s.
I'd heartily contradict you here. Unless you have a stochastic model that justifies assuming each missing element is equally likely to be 0 or 1, filling in uniform random values will bias your observations.
Now, I don't know where your data comes from, but a "2D array" sure sounds like an image signal, or something of the like. In many signal types most of the energy sits in the low frequencies; if something of the like is the case for you, you can probably get less distortion by replacing the missing values with elements of a low-pass filtered version of your 2D array.
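For instance, one way to realize that idea (a rough sketch under my own assumptions, not a prescribed method; it uses scipy.ndimage and the matrix variable from the question) is a normalized low-pass filter that ignores the NaNs:
import numpy as np
from scipy.ndimage import uniform_filter

nan_mask = np.isnan(matrix)
filled = np.where(nan_mask, 0.0, matrix)                    # zero out the gaps
weight = uniform_filter((~nan_mask).astype(float), size=3)  # local fraction of observed cells
smoothed = uniform_filter(filled, size=3) / np.maximum(weight, 1e-12)
matrix[nan_mask] = np.round(smoothed[nan_mask])             # round to keep the data binary
The window size of 3 is an arbitrary choice for illustration.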
Either way, since you need to call numpy.isnan from Python to check whether a value is NaN, I think the only way to solve this is to write an efficient loop, unless you want to senselessly generate a huge random 2D array just to fill in a few missing numbers.
EDIT: oh, I like the vectorized version; it's effectively what I'd call an efficient loop, since it does the looping without interpreting a Python loop iteration each time.
EDIT2: the mask method with counting nonzeros is probably even more efficient, I guess :)

Related

Creating an empty multidimensional array

In Python, when using np.empty(), for example np.empty((3,1)), we get an array that is of size (3,1) but, in reality, it is not empty and it contains very small values (e.g., 1.7e-315). Is it possible to create an array that is really empty/has no values but has given dimensions/shape?
I'd suggest using np.full to choose the fill-value directly (np.full_like would take an existing array as its first argument rather than a shape)...
x = np.full((3, 1), None, dtype=object)
... of course the dtype you choose kind of defines what you mean by "empty"
I am guessing that by empty, you mean an array filled with zeros.
Use np.zeros() to create an array with zeros. np.empty() just allocates the array, so the numbers in there are garbage. It is provided as a way to avoid even the cost of setting the values to zero. But it is generally safer to use np.zeros().
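A minimal illustration of the difference (the contents of the empty array will vary from run to run):
import numpy as np

a = np.zeros((3, 1))   # explicitly initialized to 0.0
b = np.empty((3, 1))   # only allocated; contents are whatever happened to be in memory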
I suggest using np.nan, as shown below:
yourdata = np.empty((3,1)) * np.nan
Or you can use np.zeros((3,1)), but it will fill all the values with zero, which is not an intuitive way to mark "no value". I feel like using np.nan is best in practice.
It's all up to you and depends on your requirements.

Is there a way to perform this subsampling algorithm in numpy?

The algorithm just builds up a new list from an input data array. It only appends a new element from the input array once the element has crossed the visibleDelta threshold relative to the previously stored element:
def subsample(data, visibleDelta):
    subsampled = [data[0]]
    for point in data[1:]:
        if abs(point - subsampled[-1]) > visibleDelta:
            subsampled.append(point)
    return subsampled
Problem is I need this to run on very large datasets (~1B values), and I'd like to use numpy or some other numerical library to do this if possible.
I should probably mention that the 'real' function won't just deal with a 1D array of data. The input data will be a pandas dataframe, with the first column being x values, and the second being y values (I'll be comparing the y values).
Any way to do this efficiently?
If you want to track the data in this way, numpy is not the right tool; see Numba or Cython for efficiency.
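A rough sketch of the Numba route (assuming numba is installed; this mirrors the original loop, just with a preallocated output instead of a list):
import numpy as np
from numba import njit

@njit
def subsample_nb(data, visibleDelta):
    out = np.empty_like(data)
    out[0] = data[0]
    n = 1
    for i in range(1, data.size):
        if abs(data[i] - out[n - 1]) > visibleDelta:
            out[n] = data[i]
            n += 1
    return out[:n]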
A slightly different approach is to quantize the data into threshold-sized bins and look at where the data crosses from one bin to the next:
import numpy as np
import matplotlib.pyplot as plt

data = np.sin(np.arange(1e6) / 3e4)
visibledelta = 0.2
cat = np.floor(data / visibledelta)                               # bin index of each sample
subsample = np.arange(data.size - 1)[np.diff(cat).astype(bool)]   # indices where the bin changes
plt.plot(data)
plt.plot(subsample, data[subsample], 'o')
which gives a plot of the signal with the selected points marked. Some adjustment may be needed, but the data gets split into chunks.

Getting numbers within a range from a gaussian_kde_resample array

I have a gaussian_kde.resample array. I don't know if it is a numpy array so that I can use numpy functions.
I have data with 0 < x <= 0.5 (3000 values) and I used
kde = scipy.stats.gaussian_kde(x) # can also mention bandwidth here (x,bandwidth)
sample = kde.resample(100000) # returns 100,000 values that follow the prob distribution of "x"
This gave me a sample of data that follows the probability distribution of "x". But the problem is, no matter what bandwidth I try to select, I get very few negative values in my "sample". I only want values within the range 0 < sample <= 0.5
I tried to do:
sample = np.array(sample) # to convert this to a numpy array
keep = 0<sample<=0.5
sample = sample[keep] # using the binary conditions
But this does not work! How can I remove the negative values in my array?
Firstly, you can check what type it is by using the 'type' call within python:
x = kde.resample(10000)
type(x)
numpy.ndarray
Secondly, it should be working in the way you wrote, but I would be more explicit in your binary condition:
print x
array([[ 1.42935658, 4.79293343, 4.2725778 , ..., 2.35775067, 1.69647609]])
x.size
10000
y = x[(x>1.5) & (x<4)]
which, as you can see, applies both conditions and keeps only the values >1.5 and <4:
print y
array([ 2.95451084, 2.62400183, 2.79426449, ..., 2.35775067, 1.69647609])
y.size
5676
I know I'm answering about 3 years late, but this may be useful for future reference.
The catch is that while kde.resample(100000) technically returns a NumPy array, that array has an extra dimension (shape (1, 100000) for 1-D input), and that gets in the way of all the attempts to use indexing to get subsets of the sample. To get the flat array that the resample() method probably should have returned all along, do this instead:
sample = kde.resample(100000)[0]
The array variable sample should then have all 100000 samples, and indexing this array should work as expected.
Why SciPy does it this way, I don't know. This misfeature doesn't even appear to be documented.
First of all, the return value of kde.resample is a numpy array, so you do not need to reconvert it.
The problem lies in the line (Edit: No, it doesn't. This should work!)
keep = 0 < sample <= 0.5
It does not do what you would think. Try:
keep = (0 < sample) * (sample <= 0.5)
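Putting the answers above together, the whole flow might look like this (a sketch, assuming x holds the original 1-D data):
import numpy as np
from scipy.stats import gaussian_kde

kde = gaussian_kde(x)                    # x: the original data in (0, 0.5]
sample = kde.resample(100000)[0]         # resample() returns shape (1, n); take the row
keep = (sample > 0) & (sample <= 0.5)    # boolean mask for the desired range
sample = sample[keep]                    # slightly fewer than 100000 values remain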

Grid search: Get index that corresponds best to value (for a whole matrix)

I know that this question has partly been answered, but I am looking specifically at numpy and scipy. Say I have a grid
lGrid = linspace(0.1, 8, 50)
and I want to find the index that corresponds best to 2, I do
index = abs(lGrid-2).argmin()
lGrid[index]
2.034
However, what if I have a whole matrix of values instead of just 2 here? I guess iteration is pretty slow, and abs(lGrid - [2, 4]) will fail due to shape issues. I need a solution that is easily extendable to N-dimensional matrices. What is the best course of action in this environment?
You can use broadcasting:
from numpy import arange, linspace, argmin, newaxis
vals = arange(30).reshape(2, 5, 3)   # your N-dimensional input, like array([2, 4])
lGrid = linspace(0.1, 8, 50)
result = argmin(abs(lGrid - vals[..., newaxis]), axis=-1)
for example, with input vals = array([2,4]), you obtain result = array([12, 24]) and lGrid[result]=array([ 2.03469388, 3.96938776])
You "guess that iteration is pretty slow", but I guess it isn't. So I would just iterate over the whole matrix of values instead of 2. Perhaps:
for val in BigArray.flatten():
    index = abs(lGrid - val).argmin()
    yield lGrid[index]
If lGrid is fairly large, then the overhead of iterating in a Python for loop is probably not big in comparison to the vectorised operation happening inside it.
There might be a way to use broadcasting and reshaping to do the whole thing in one giant operation, but it would be complicated, and you might accidentally allocate such a huge array that your machine slows to a crawl.

Best way to create a NumPy array from a dictionary?

I'm just starting with NumPy so I may be missing some core concepts...
What's the best way to create a NumPy array from a dictionary whose values are lists?
Something like this:
d = { 1: [10,20,30] , 2: [50,60], 3: [100,200,300,400,500] }
Should turn into something like:
data = [
[10,20,30,?,?],
[50,60,?,?,?],
[100,200,300,400,500]
]
I'm going to do some basic statistics on each row, eg:
deviations = numpy.std(data, axis=1)
Questions:
What's the best / most efficient way to create the numpy.array from the dictionary? The dictionary is large; a couple of million keys, each with ~20 items.
The number of values for each 'row' is different. If I understand correctly, numpy wants uniform sizes, so what do I fill in for the missing items to make std() happy?
Update: One thing I forgot to mention - while the python techniques are reasonable (eg. looping over a few million items is fast), it's constrained to a single CPU. Numpy operations scale nicely to the hardware and hit all the CPUs, so they're attractive.
You don't need to create numpy arrays to call numpy.std().
You can call numpy.std() in a loop over all the values of your dictionary. The list will be converted to a numpy array on the fly to compute the standard deviation.
The downside of this method is that the main loop will be in Python and not in C. But I guess this should be fast enough: you will still compute std at C speed, and you will save a lot of memory as you won't have to store fill values where you have variable-sized arrays.
If you want to further optimize this, you can store your values into a list of numpy arrays, so that you do the python list -> numpy array conversion only once (see the sketch after the example below).
If you find that this is still too slow, try to use Psyco to optimize the python loop.
If this is still too slow, try using Cython together with the numpy module. This tutorial claims impressive speed improvements for image processing. Or simply program the whole std function in Cython (see this for benchmarks and examples with the sum function).
An alternative to Cython would be to use SWIG with numpy.i.
If you want to use only numpy and have everything computed at C level, try grouping all the records of the same size together in different arrays and call numpy.std() on each of them. It should look like the following example.
example with O(N) complexity:
import numpy
list_size_1 = []
list_size_2 = []
for row in data.itervalues():
    if len(row) == 1:
        list_size_1.append(row)
    elif len(row) == 2:
        list_size_2.append(row)
list_size_1 = numpy.array(list_size_1)
list_size_2 = numpy.array(list_size_2)
std_1 = numpy.std(list_size_1, axis=1)
std_2 = numpy.std(list_size_2, axis=1)
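The earlier "list of numpy arrays" suggestion could look roughly like this (a sketch; names are illustrative and data is the dictionary from the question):
rows = [numpy.array(row) for row in data.itervalues()]   # convert each row to an array once
stds = [row.std() for row in rows]                       # reuse the arrays for any statistic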
While there are already some pretty reasonable ideas present here, I believe the following is worth mentioning.
Filling missing data with any default value would spoil the statistical characteristics (std, etc.). Evidently that's why Mapad proposed the nice trick of grouping same-sized records.
The problem with it (assuming no a priori information about record lengths is at hand) is that it involves even more computation than the straightforward solution:
at least O(N*logN) 'len' calls and comparisons for sorting with an effective algorithm
O(N) checks on the second pass through the list to obtain groups (their beginning and end indexes on the 'vertical' axis)
Using Psyco is a good idea (it's strikingly easy to use, so be sure to give it a try).
It seems that the optimal way is to take the strategy described by Mapad in bullet #1, but with a modification: not to generate the whole list, but to iterate through the dictionary, converting each row into a numpy.array and performing the required computations. Like this:
for row in data.itervalues():
    np_row = numpy.array(row)
    this_row_std = numpy.std(np_row)
    # compute any other statistical descriptors needed and then save them to some list
In any case a few million loops in Python won't take as long as one might expect. Besides, this doesn't look like a routine computation, so who cares if it takes an extra second/minute if it is only run once in a while or even just once.
A generalized variant of what was suggested by Mapad:
from numpy import array, mean, std

def get_statistical_descriptors(a):
    ax = a.ndim - 1
    functions = [mean, std]
    return [f(a, axis=ax) for f in functions]

def process_long_list_stats(data):
    import numpy
    groups = {}
    for key, row in data.iteritems():
        size = len(row)
        try:
            groups[size].append(key)
        except KeyError:
            groups[size] = [key]
    results = []
    for gr_keys in groups.itervalues():
        gr_rows = numpy.array([data[k] for k in gr_keys])
        stats = get_statistical_descriptors(gr_rows)
        results.extend(zip(gr_keys, zip(*stats)))
    return dict(results)
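Usage would then be along these lines (a sketch; d is the example dictionary from the question, and each key maps to its row's (mean, std) pair):
stats = process_long_list_stats(d)
print stats[1]   # (mean, std) of the row [10, 20, 30]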
numpy dictionary
You can use a structured array to preserve the ability to address a numpy object by a key, like a dictionary.
import numpy as np
dd = {'a':1,'b':2,'c':3}
dtype = [(key, float) for key in dd.keys()]
values = [tuple(dd.values())]
numpy_dict = np.array(values, dtype=dtype)
numpy_dict['c']
will now output
array([ 3.])
