median-absolute-deviation (MAD) based outlier detection - python

I wanted to apply median-absolute-deviation (MAD) based outlier detection using the answer from Joe Kington to the question linked below:
Pythonic way of detecting outliers in one dimensional observation data
However, something goes wrong in my code and I cannot figure out how to assign the outliers as NaN values for my data:
import numpy as np
data = np.array([55,32,4,5,6,7,8,9,11,0,2,1,3,4,5,6,7,8,25,25,25,25,10,11,12,25,26,27,28],dtype=float)
median = np.median(data, axis=0)
diff = np.sum((data - median)**2, axis=-1)
diff = np.sqrt(diff)
med_abs_deviation = np.median(diff)
modified_z_score = 0.6745 * diff / med_abs_deviation
data_without_outliers = data[modified_z_score < 3.5]
?????
print data_without_outliers

What is the problem with using:
data[modified_z_score > 3.5] = np.nan
Note that this will only work if data is a floating point array (which it should be if you are calculating MAD).

The problem is likely this line:
diff = np.sum((data - median)**2, axis=-1)
For one-dimensional data, applying np.sum() collapses the result to a scalar. Remove the top-level sum and your code will work.
Another way around it is to ensure that data is at least a 2-D array; you can use numpy.atleast_2d() for that.
To assign NaNs, follow the answer at https://stackoverflow.com/a/22804327/4989451
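For reference, here is a minimal sketch of the 1-D variant with the outliers set to NaN, put together from the two answers above (a sketch, not the code from the linked post):

import numpy as np

data = np.array([55, 32, 4, 5, 6, 7, 8, 9, 11, 0, 2, 1, 3, 4, 5, 6, 7, 8,
                 25, 25, 25, 25, 10, 11, 12, 25, 26, 27, 28], dtype=float)

median = np.median(data)
diff = np.abs(data - median)                        # per-element deviation, no sum
med_abs_deviation = np.median(diff)                 # the MAD itself
modified_z_score = 0.6745 * diff / med_abs_deviation

data_with_nans = data.copy()
data_with_nans[modified_z_score > 3.5] = np.nan     # outliers become NaN
data_without_outliers = data[modified_z_score <= 3.5]
print(data_with_nans)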


Finite difference using xarray rolling

My goal is to compute a derivative over a moving window of a multidimensional dataset along a given dimension, where the dataset is stored as an xarray DataArray or Dataset.
In the simplest case, given a 2D array I would like to compute a moving difference across multiple entries in one dimension, e.g.:
data = np.kron(np.linspace(0,1,10), np.linspace(1,4,6)).reshape(10,6)
T = 3
reducedArray = np.zeros_like(data)
for i in range(data.shape[1]):
    if i < T:
        reducedArray[:,i] = data[:,i] - data[:,0]
    else:
        reducedArray[:,i] = data[:,i] - data[:,i-T]
where the if i < T condition ensures that input and output contain proper values (i.e., no NaNs) and are of identical shape.
Xarray's diff aims to perform a finite-difference approximation of a given derivative order using nearest-neighbours, so it is not suitable here, hence the question:
Is it possible to perform this operation using Xarray functions only?
The rolling weighted average example appears to be something similar, but still too distinct due to the usage of NumPy routines. I've been thinking that something along the lines of the following should work:
xr2DDataArray = xr.DataArray(
    data,
    dims=('x','y'),
    coords={'x': np.linspace(0,1,10), 'y': np.linspace(1,4,6)}
)
r = xr2DDataArray.rolling(x=T, min_periods=2)
r.reduce(redFn)
I am struggling with the definition of redFn here, though.
Caveat: The actual dataset to which the operation is to be applied will have a size of ~10 GiB, so a solution that does not blow up the memory requirements will be highly appreciated!
Update/Solution
Using Xarray rolling
After sleeping on it and a bit more fiddling, the post linked above actually contains a solution. To obtain a finite difference we just have to define the weights to be $\pm 1$ at the ends and $0$ elsewhere:
def fdMovingWindow(data, **kwargs):
    T = kwargs['T']
    del kwargs['T']
    weights = np.zeros(T)
    weights[0] = -1
    weights[-1] = 1
    axis = kwargs['axis']
    if data.shape[axis] == T:
        return np.sum(data * weights, **kwargs)
    else:
        return 0

r.reduce(fdMovingWindow, T=4)
Alternatively, using construct and a dot product:
weights = np.zeros(T)
weights[0] = -1
weights[-1] = 1
xrWeights = xr.DataArray(weights, dims=['window'])
xr2DDataArray.rolling(y=T,min_periods=1).construct('window').dot(xrWeights)
This carries a massive caveat: the procedure essentially creates a list of arrays representing the moving window. This is fine for a modest 2D/3D array, but for a 4D array that takes up ~10 GiB in memory this will lead to an OOM death!
Simplistic - memory efficient
A less memory-intensive way is to copy the array and work in a way similar to NumPy's arrays:
xrDiffArray = xr2DDataArray.copy()
dy = xr2DDataArray.y.values[1] - xr2DDataArray.y.values[0]  # equidistant sampling
for src in xr2DDataArray:
    if src.y.values < xr2DDataArray.y.values[0] + T*dy:
        xrDiffArray.loc[dict(y=src.y.values)] = src.values - xr2DDataArray.values[0]
    else:
        xrDiffArray.loc[dict(y=src.y.values)] = src.values - xr2DDataArray.sel(y=src.y.values - dy*T).values
This will produce the intended result without dimensional errors, but it requires a copy of the dataset.
I was hoping to utilise Xarray to prevent a copy and instead just chain operations that are then evaluated if and when values are actually requested.
A suggestion as to how to accomplish this will still be welcomed!
I have never used xarray, so maybe I am mistaken, but I think you can get the result you want while avoiding loops and conditionals. This is at least twice as fast as your example for NumPy arrays:
data = np.kron(np.linspace(0,1,10), np.linspace(1,4,6)).reshape(10,6)
reducedArray = np.empty_like(data)
reducedArray[:, T:] = data[:, T:] - data[:, :-T]
reducedArray[:, :T] = data[:, :T] - data[:, 0, np.newaxis]
I imagine the improvement will be even greater when using DataArrays.
It does not use xarray functions, but neither does it depend on NumPy routines beyond basic slicing. I am confident that translating this to xarray would be straightforward; I know it works if there are no coords, but once you include them you get an error because of the coords mismatch (the coords of data[:, T:] and of data[:, :-T] are different). Sadly, I can't do better right now.
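One possible way around the coords mismatch is to let xarray do the shifting itself. The following is only a sketch, under the assumption that shift() and fillna() behave as described in the xarray docs: shift() keeps the original coords, so the subtraction aligns without errors, and the whole chain should stay lazy with a dask-backed array:

import numpy as np
import xarray as xr

T = 3
data = np.kron(np.linspace(0, 1, 10), np.linspace(1, 4, 6)).reshape(10, 6)
xr2DDataArray = xr.DataArray(
    data,
    dims=('x', 'y'),
    coords={'x': np.linspace(0, 1, 10), 'y': np.linspace(1, 4, 6)}
)

shifted = xr2DDataArray.shift(y=T)                              # first T columns become NaN
shifted = shifted.fillna(xr2DDataArray.isel(y=0, drop=True))    # fall back to column 0
finite_diff = xr2DDataArray - shifted                           # data[:, i] - data[:, max(i - T, 0)]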

What is the purpose of keras utils normalize?

I'd like to normalize my training set before passing it to my NN, so instead of doing it manually (subtracting the mean and dividing by the std), I tried keras.utils.normalize() and I am amazed by the results I got.
Running this:
import numpy as np
from keras.utils import normalize

r = np.random.rand(3000) * 1000
nr = normalize(r)
print(np.mean(r))
print(np.mean(nr))
print(np.std(r))
print(np.std(nr))
print(np.min(r))
print(np.min(nr))
print(np.max(r))
print(np.max(nr))

Results in this:
495.60440066771866
0.015737914577213984
291.4440194021
0.009254802974329002
0.20755517410064872
6.590913227674956e-06
999.7631481267636
0.03174747238214018
Unfortunately, the docs don't explain what's happening under the hood. Can you please explain what it does and if I should use keras.utils.normalize instead of what I would have done manually?
It is not the kind of normalization you expect. Actually, it uses np.linalg.norm() under the hood to normalize the given data using Lp-norms:
def normalize(x, axis=-1, order=2):
    """Normalizes a Numpy array.

    # Arguments
        x: Numpy array to normalize.
        axis: axis along which to normalize.
        order: Normalization order (e.g. 2 for L2 norm).

    # Returns
        A normalized copy of the array.
    """
    l2 = np.atleast_1d(np.linalg.norm(x, order, axis))
    l2[l2 == 0] = 1
    return x / np.expand_dims(l2, axis)
For example, in the default case, it normalizes the data using the L2 norm (i.e. the sum of the squared elements along the given axis becomes equal to one).
You can either use this function, or, if you don't want to do mean and std normalization manually, you can use StandardScaler() from sklearn, or even MinMaxScaler().
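For comparison, here is a small sketch (assuming scikit-learn is installed) of the mean/std normalization the question describes, done both by hand and with StandardScaler:

import numpy as np
from sklearn.preprocessing import StandardScaler

r = np.random.rand(3000, 1) * 1000

manual = (r - r.mean()) / r.std()           # subtract mean, divide by std
scaled = StandardScaler().fit_transform(r)  # same thing, computed column-wise

print(np.allclose(manual, scaled))          # True
print(scaled.mean(), scaled.std())          # ~0 and ~1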

How best to implement a matrix mask operation in tensorflow?

I had a case where I needed to fill some holes (missing data) in an image-processing application in TensorFlow. The 'holes' are easy to locate because they are zeros and the good data is not zero. I wanted to fill the holes with random data. This is quite easy to do with NumPy, but doing it in TensorFlow requires some work. I came up with a solution and wanted to see if there is a better or more efficient way to do the same thing.
I understand that TensorFlow does not yet support the more advanced NumPy-style indexing, but there is a function tf.gather_nd() that seems promising for this. However, I could not tell from the documentation how to use it for what I wanted to do. I would appreciate answers that improve on what I did, especially if someone can show me how to do it using tf.gather_nd(). Also, tf.boolean_mask() does not work for what I am trying to do because it does not allow you to use the output as an index.
In Python, what I am trying to do:
a = np.ones((2,2))
a[0,0]=a[0,1] = 0
mask = a == 0
a[mask] = np.random.random_sample(a.shape)[mask]
print('new a = ', a)
What I ended up doing in TensorFlow to achieve the same thing (skipping the array-filling steps):
zeros = tf.zeros(tf.shape(a))
mask = tf.greater(a,zeros)
mask_n = tf.equal(a,zeros)
mask = tf.cast(mask,tf.float32)
mask_n = tf.cast(mask_n, tf.float32)
r = tf.random_uniform(tf.shape(a),minval = 0.0,maxval=1.0,dtype=tf.float32)
r_add = tf.multiply(mask_n,r)
targets = tf.add(tf.multiply(mask,a),r_add)
I think these three lines might do what you want. First, you make a mask. Then, you create the random data. Finally, fill in the masked values with the random data.
mask = tf.equal(a, 0.0)
r = tf.random_uniform(tf.shape(a), minval = 0.0,maxval=1.0,dtype=tf.float32)
targets = tf.where(mask, r, a)
You can use tf.where to achieve the same:
A = tf.Variable(a)
B = tf.where(tf.equal(A, 0.), tf.random_normal(A.get_shape()), tf.cast(A, tf.float32))
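Since the question specifically asks about tf.gather_nd(), here is a rough sketch of the index-based route (written with TF 2.x names, which is an assumption; the code in this thread uses TF 1.x): tf.where() with a single argument returns the indices of the True entries, which can then feed tf.gather_nd() and tf.tensor_scatter_nd_update():

import tensorflow as tf

a = tf.constant([[0., 0.], [1., 1.]])
hole_idx = tf.where(tf.equal(a, 0.0))                        # (N, 2) indices of the zeros
r = tf.random.uniform(tf.shape(a))                           # candidate random fill values
updates = tf.gather_nd(r, hole_idx)                          # random values at the hole positions
filled = tf.tensor_scatter_nd_update(a, hole_idx, updates)   # holes replaced, good data kept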

Numpy symmetric matrix becomes asymmetric when I applied min-max scaling

I have a symmetric matrix (1877 x 1877); here is the matrix file. I am trying to scale the values to the range 0-1. After I apply the method below, the matrix is no longer symmetric. Any help is appreciated.
print((dist.transpose() == dist).all())  # this prints 'True'

def sci_minmax(X):
    minmax_scale = preprocessing.MinMaxScaler()
    return minmax_scale.fit_transform(X)

sci_dist_scaled = sci_minmax(dist)
(sci_dist_scaled.transpose() == sci_dist_scaled).all()  # this prints 'False'
sci_dist_scaled.dtype, dist.dtype  # (dtype('float64'), dtype('float64'))
Looking at this description, MinMaxScaler works column by column, so, naturally, you can't expect it to preserve symmetry: each column is rescaled with its own minimum and maximum.
What's best to do in your case depends a bit on what you are trying to achieve, really. If having the values between 0 and 1 is all you require, you can rescale by hand:
mn, mx = dist.min(), dist.max()
dist01 = (dist - mn) / (mx - mn)
but depending on your ultimate problem this may be too simplistic...
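To make the difference concrete, here is a small sketch with a hypothetical 3x3 symmetric matrix, contrasting the column-wise scaling with the global rescaling above:

import numpy as np
from sklearn import preprocessing

dist = np.array([[0., 2., 8.],
                 [2., 0., 4.],
                 [8., 4., 0.]])          # symmetric

col_scaled = preprocessing.MinMaxScaler().fit_transform(dist)
print((col_scaled == col_scaled.T).all())    # False: each column has its own min/max

mn, mx = dist.min(), dist.max()
glob_scaled = (dist - mn) / (mx - mn)
print((glob_scaled == glob_scaled.T).all())  # True: one global min/max preserves symmetry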

Pandas: Check if row has similar values

I'm generating an overlay for a map using pandas and used:
if ((df['latitude'] == new_latitude) & (df['longitude'] == new_longitude)).any():
    continue
to make sure that I don't produce duplicate points. But I am starting to produce points that differ by only 0.001 (in longitude, latitude, or both) from ones already produced. How can I prevent this in a similar manner as above?
IIUC you can subtract from the entire series and then just filter the points:
thresh = 0.001
lat = df.loc[(df['latitude'] - new_latitude).abs() > thresh, 'latitude']
lon = df.loc[(df['longitude'] - new_longitude).abs() > thresh, 'longitude']
This uses abs to get the absolute difference, builds a boolean mask, and filters out the duplicate and near-duplicate values.
You could use the numpy.isclose function with atol set to your precision:
import numpy as np
prec = 0.001
np.isclose(df['latitude'], new_latitude, atol=prec)

if (np.isclose(df['latitude'], new_latitude, atol=prec) & np.isclose(df['longitude'], new_longitude, atol=prec)).any():
    continue
