I'm looking for a way to shift an np.array of length n a total of n-1 times, collecting the shifted vectors into a matrix.
So for example if this is my vector:
[1,4,7,8]
What I want to get is:
[[None, None, None],
[1 , None, None],
[4 , 1 , None],
[7 , 4 , 1 ]]
I can do it easily with a for loop and shift, but I was wondering whether there is a more efficient way with a built-in NumPy function.
Here's one with np.lib.stride_tricks.as_strided -
import numpy as np

def shifted_subarrays(a, fill=None):
    a = np.asarray(a)
    fillar = np.full(len(a)-1, fill)
    a_ext = np.concatenate((fillar, a))
    n = len(a)
    s = a_ext.strides[0]
    strided = np.lib.stride_tricks.as_strided
    return strided(a_ext[len(a)-2:], shape=(n, n-1), strides=(s, -s))
Sample run -
In [20]: a = [1,4,7,8]
In [21]: shifted_subarrays(a)
Out[21]:
array([[None, None, None],
       [1, None, None],
       [4, 1, None],
       [7, 4, 1]], dtype=object)
In [46]: shifted_subarrays(a, fill=np.nan)
Out[46]:
array([[nan, nan, nan],
       [ 1., nan, nan],
       [ 4.,  1., nan],
       [ 7.,  4.,  1.]])
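Note that as_strided is easy to misuse: wrong strides silently read out-of-bounds memory. On NumPy 1.20+ the same result can be built with the safer sliding_window_view; a minimal sketch of that variant (same padding idea as above, not part of the original answer):

import numpy as np

def shifted_subarrays_swv(a, fill=None):
    a = np.asarray(a)
    n = len(a)
    # prepend n-1 fill values, exactly as in shifted_subarrays
    a_ext = np.concatenate((np.full(n - 1, fill), a))
    # take the first n length-(n-1) windows and reverse each one;
    # the result is a read-only view, so copy() if you need to write to it
    w = np.lib.stride_tricks.sliding_window_view(a_ext, n - 1)
    return w[:n, ::-1]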
A simpler one with toeplitz -
from scipy.linalg import toeplitz
out = toeplitz(a,[None]*(len(a)))[:,1:]
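To see why the [:,1:] slice is there: toeplitz(c, r) uses c as the first column and r as the first row (r[0] is overridden by c[0]), so the call first builds the full shift matrix including the unshifted vector in column 0. A quick illustration for the sample vector:

from scipy.linalg import toeplitz

a = [1, 4, 7, 8]
full = toeplitz(a, [None] * len(a))
# full:
# [[1 None None None]
#  [4 1 None None]
#  [7 4 1 None]
#  [8 7 4 1]]
out = full[:, 1:]   # drop the unshifted first column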
Beginner level at Python here. I have a large matrix (MxN) that I want to process and an Mx1 matrix that contains some indices. What I want is to replace the entries of each row of the MxN matrix with NaN wherever the column index is less than the corresponding value in the Mx1 indices matrix.
Say for example I have:
A = [1 2 3 4]
[5 6 7 8]
[9 10 11 12]
and
B = [0]
[2]
[1]
the resultant matrix should be
C = [1 2 3 4]
[NaN NaN 7 8]
[NaN 10 11 12]
I am trying to avoid for loops because the matrix I'm dealing with is large and this function will be called repeatedly. Is there an elegant, Pythonic way to implement this?
Check out this code.
The logic of the first method is to build a boolean condition matrix for np.where, which is done as follows:
import numpy as np
A = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]], dtype=float)
B = np.array([[0], [2], [1]])
# turn each index into a row mask: False for the first B[i] columns, True after
B = np.array(list(map(lambda i: [False]*i[0] + [True]*(4 - i[0]), B)))
A = np.where(B, A, np.nan)
print(A)
Method 2: using basic Pythonic code:
import numpy as np
A = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]], dtype=float)
B = np.array([[0], [2], [1]])
for i, j in enumerate(A):  # j is a view of row i, so this edits A in place
    j[:B[i][0]] = np.nan
print(A)
Your arrays - note that A is float, so it can hold np.nan:
In [348]: A = np.arange(1,13).reshape(3,4).astype(float); B = np.array([[0],[2],[1]])
In [349]: A
Out[349]:
array([[ 1., 2., 3., 4.],
[ 5., 6., 7., 8.],
[ 9., 10., 11., 12.]])
In [350]: B
Out[350]:
array([[0],
[2],
[1]])
A boolean mask where we want to change values:
In [351]: np.arange(4)<B
Out[351]:
array([[False, False, False, False],
[ True, True, False, False],
[ True, False, False, False]])
apply it:
In [352]: A[np.arange(4)<B] = np.nan
In [353]: A
Out[353]:
array([[ 1., 2., 3., 4.],
[nan, nan, 7., 8.],
[nan, 10., 11., 12.]])
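The hardcoded 4 in np.arange(4) is just A.shape[1]; here is the same broadcasting idea wrapped as a reusable helper (a sketch, not part of the original answer):

import numpy as np

def mask_leading(A, B):
    # Set A[i, j] = NaN wherever j < B[i, 0], for any (M, N) array A
    A = A.astype(float, copy=True)           # NaN needs a float dtype
    A[np.arange(A.shape[1]) < B] = np.nan    # broadcast (N,) against (M, 1)
    return A

A = np.arange(1, 13, dtype=float).reshape(3, 4)
B = np.array([[0], [2], [1]])
print(mask_leading(A, B))
# [[ 1.  2.  3.  4.]
#  [nan nan  7.  8.]
#  [nan 10. 11. 12.]]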
Given a 2D array, I'm looking for a Pythonic way to get an array of the same shape, keeping only the maximum element of each row.
See max_row_filter function below
import numpy as np

def max_row_filter(mat2d):
    m = np.zeros(mat2d.shape)
    for r in range(mat2d.shape[0]):
        c = np.argmax(mat2d[r])
        m[r, c] = mat2d[r, c]
    return m

p = np.array([[1, 2, 3], [5, 4, 3], [9, 10, 3]])
max_row_filter(p)
Out: array([[ 0.,  0.,  3.],
            [ 5.,  0.,  0.],
            [ 0., 10.,  0.]])
I'm looking for an efficient way to do this, suitable to be done on big arrays.
Alternative answer (this will keep duplicates):
p * (p==p.max(axis=1, keepdims=True))
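For instance, with a tie in a row, both maxima survive (something the argmax approach below cannot do):

import numpy as np

p = np.array([[1, 2, 3], [5, 4, 3], [10, 9, 10]])  # tie in the last row
print(p * (p == p.max(axis=1, keepdims=True)))
# [[ 0  0  3]
#  [ 5  0  0]
#  [10  0 10]]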
If there are no duplicates, you could use numpy.argmax:
import numpy as np
p = np.array([[1, 2, 3],
              [5, 4, 3],
              [9, 10, 3]])
result = np.zeros_like(p)
rows, cols = zip(*enumerate(np.argmax(p, axis=1)))
result[rows, cols] = p[rows, cols]
print(result)
Output
[[ 0 0 3]
[ 5 0 0]
[ 0 10 0]]
Note that, for multiple occurrences, argmax returns the first occurrence.
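A quick check of that behaviour:

import numpy as np

row = np.array([7, 3, 7])
print(np.argmax(row))  # 0 -- the first 7 wins, the second is ignored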
My situation: I have a pandas DataFrame and, for each row, I have to compute the following.
1) Get the first value, NaN excluded (df.apply(lambda x: x.dropna().iloc[0]))
2) Get the last value, NaN excluded (df.apply(lambda x: x.dropna().iloc[-1]))
3) Count the non-NaN values (df.apply(lambda x: len(x.dropna())))
Sample case and expected output :
x = np.array([[1,2,np.nan], [4,5,6], [np.nan, 8,9]])
1) [1, 4, 8]
2) [2, 6, 9]
3) [2, 3, 2]
And I need to keep it optimized, so I turned to NumPy and looked for a way to apply y = x[~numpy.isnan(x)] on an NxK array as a first step. Then I would use what was shown here (Vectorized way of accessing row specific elements in a numpy array) for 1) and 2), but I am still empty-handed for 3).
Here's one way -
In [756]: x
Out[756]:
array([[ 1., 2., nan],
[ 4., 5., 6.],
[ nan, 8., 9.]])
In [768]: m = ~np.isnan(x)
In [769]: first_idx = m.argmax(1)
In [770]: last_idx = m.shape[1] - m[:,::-1].argmax(1) - 1
In [771]: x[np.arange(len(first_idx)), first_idx]
Out[771]: array([ 1., 4., 8.])
In [772]: x[np.arange(len(last_idx)), last_idx]
Out[772]: array([ 2., 6., 9.])
In [773]: m.sum(1)
Out[773]: array([2, 3, 2])
Alternatively, we could make use of cumulative-summation to get those indices, like so -
In [787]: c = m.cumsum(1)
In [788]: first_idx = (c==1).argmax(1)
In [789]: last_idx = c.argmax(1)
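Putting the three pieces together as one helper for the original DataFrame question (a sketch based on the masks above; rows that are all NaN would return NaN for both endpoints):

import numpy as np

def first_last_count(x):
    m = ~np.isnan(x)
    first_idx = m.argmax(1)                            # first non-NaN per row
    last_idx = m.shape[1] - m[:, ::-1].argmax(1) - 1   # last non-NaN per row
    r = np.arange(len(x))
    return x[r, first_idx], x[r, last_idx], m.sum(1)

x = np.array([[1, 2, np.nan], [4, 5, 6], [np.nan, 8, 9]])
print(first_last_count(x))
# (array([1., 4., 8.]), array([2., 6., 9.]), array([2, 3, 2]))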
I would like to replace the first x values in every row of my array a with ones and keep all the other values NaN. The number of leading values, however, changes from row to row and is determined by a list b.
Since I'm not very familiar with arrays, I thought I might do this with a for loop as shown below, but it doesn't work
(I got the inspiration for the basics of replacement in arrays from How to set first N elements of array to zero?).
In:
a = np.empty((3,4))
a.fill(np.nan)
b = [2,3,1]
for i in range(b):
    a[0:b[i]] = [1] * b[i]
    a[i:] = np.ones((b[i]))
    pass
Out:
line 7:
ValueError: could not broadcast input array from shape (2) into shape (2,4)
Result should be like:
Out:
[[1, 1, nan, nan],
[1, 1, 1, nan],
[1, nan, nan, nan]]
In the linked answer, How to set first N elements of array to zero?
the solution for arrays is
y = numpy.array(x)
y[0:n] = 0
In other words if we are filling a slice (range of indices) with the same number we can specify a scalar. It could be an array of the same size, e.g. np.ones(n). But it doesn't have to be.
So we just need to iterate over the rows of a (and elements of b) and perform this indexed assignment
In [368]: a = np.ones((3,4))*np.nan
In [369]: for i in range(3):
...: a[i,:b[i]] = 1
...:
In [370]: a
Out[370]:
array([[ 1., 1., nan, nan],
[ 1., 1., 1., nan],
[ 1., nan, nan, nan]])
There are various ways of 'filling' the original array with nan. np.full does an np.empty followed by a copyto.
A variation on the row iteration is with for i,n in enumerate(a):.
Another good way of iterating in a coordinated sense is with zip.
In [371]: for i,x in zip(b,a):
...: x[:i] = 1
This takes advantage of the fact that iteration on a 2d array iterates over its rows. So x is a 1d view of a and can be changed in place.
But with a bit of indexing trickery, we don't even have to loop.
In [373]: a = np.full((3,4),np.nan)
In [375]: mask = np.array(b)[:,None]>np.arange(4)
In [376]: mask
Out[376]:
array([[ True, True, False, False],
[ True, True, True, False],
[ True, False, False, False]], dtype=bool)
In [377]: a[mask] = 1
In [378]: a
Out[378]:
array([[ 1., 1., nan, nan],
[ 1., 1., 1., nan],
[ 1., nan, nan, nan]])
This is a favorite of one of the top numpy posters, @Divakar.
Numpy: Fix array with rows of different lengths by filling the empty elements with zeros
It can be used to pad a list of lists. Speaking of padding, itertools has a handy tool, zip_longest (py3 name)
In [380]: np.array(list(itertools.zip_longest(*[np.ones(x).tolist() for x in b],fillvalue=np.nan))).T
Out[380]:
array([[ 1., 1., nan],
[ 1., 1., 1.],
[ 1., nan, nan]])
Your question should have specified what was wrong; what kinds of errors you got:
for i in w2:
    a[0:b[i]] = [1] * b[i]
    a[i:] = np.ones((b[i]))
w2 is unspecified, but probably is range(3).
a[0:b[i]] is wrong because it indexes whole rows, whereas you are working on just one row at a time. a[i:] specifies a range of rows as well.
You can do this via a loop. Initialize an array of nan values then loop through the list of first n's and set values to 1 according to the n for each row.
a = np.full((3, 4), np.nan)
b = [2, 3, 1]
for i, x in enumerate(b):
    a[i, :x] = 1
You can initialise your matrix using a list comprehension:
>>> import numpy as np
>>> b = [2, 3, 1]
>>> max_len = 4
>>> gen_array = lambda i: [1] * i + [np.NAN] * (max_len - i)
>>> np.array([gen_array(i) for i in b])  # np.matrix also works here, but is deprecated
With detailed steps:
[1] * N will create a list of length N filled with 1:
>>> [1] * 3
[1, 1, 1]
You can concat array using +:
>>> [1, 2] + [3, 4]
[1, 2, 3, 4]
You just have to combine both: [1] * X + [np.NAN] * (N - X) creates a list of length N whose first X entries are 1 and the rest NaN.
Last one, the list comprehension:
[i for i in b]
is a "shortcut" (not really, but it is easier to understand) for:
a = []
for i in b:
    a.append(i)
import numpy as np

a = np.random.rand(3, 4)  # Create a matrix with random numbers (you could also use np.empty or whatever you want)
b = [1, 2, 3]             # Your 'b' list
for idr, row in enumerate(a):  # Loop through the matrix row by row
    a[idr, :b[idr]] = 1        # idr is the row index; columns 0 up to b[idr] (here 1, 2 and 3) become 1
    a[idr, b[idr]:] = np.nan   # the values after that become NaN
print(a)  # Outputs the matrix
# [[ 1. nan nan nan]
#  [ 1.  1. nan nan]
#  [ 1.  1.  1. nan]]
I have to use scikit-learn's KNeighborsClassifier to compare time series, using a user-defined metric in Python.
knn = KNeighborsClassifier(n_neighbors=1,weights='distance',metric='pyfunc',func=dtw_dist)
The problem is that KNeighborsClassifier doesn't seem to support my training data: the time series are lists of different lengths. KNeighborsClassifier gives me this error message when I try to use the fit method (knn.fit(X, Y)):
ValueError: data type not understood
It seems KNeighborsClassifier only supports training sets of uniform size (only time series of the same length would be accepted), but that is not my case. My teacher told me to use KNeighborsClassifier, so I don't know what to do...
Any ideas?
Two (or one...) options as far as I can tell:
Precompute the distances (not directly supported by KNeighborsClassifier, it seems; other algorithms do support it, e.g. Spectral Clustering).
Pad your data to a square array with NaNs, and handle them accordingly in your custom distance function.
'Square' your data using NaNs
So, option 2 it is.
Say we have the following data, where every row represents a time series:
import numpy as np
series = [
    [1, 2, 3, 4],
    [1, 2, 3],
    [1],
    [1, 2, 3, 4, 5, 6, 7, 8],
]
We simply make the data square by adding nans:
def make_square(jagged):
    # Careful: this mutates the jagged list of lists
    max_cols = max(map(len, jagged))
    for row in jagged:
        row.extend([None] * (max_cols - len(row)))
    return np.array(jagged, dtype=float)
make_square(series)
array([[ 1., 2., 3., 4., nan, nan, nan, nan],
[ 1., 2., 3., nan, nan, nan, nan, nan],
[ 1., nan, nan, nan, nan, nan, nan, nan],
[ 1., 2., 3., 4., 5., 6., 7., 8.]])
Now the data 'fits' into the algorithm. You just have to adapt your distance function to account for the NaNs.
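One way to 'adapt' the distance function is to strip the padding back off before calling the real metric. A sketch, assuming dtw_dist is the user-defined DTW function from the question:

import numpy as np

def dtw_nan_wrapper(row1, row2):
    # Drop the NaN padding so dtw_dist sees the original, jagged series
    return dtw_dist(row1[~np.isnan(row1)], row2[~np.isnan(row2)])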
Precompute the distances and use a lookup wrapper
Oh we can probably do option 1 too (assuming you have N time series):
Precompute the distances into a (N, N) distance matrix D
Create a (N, 1) matrix that is just a range between [0, N) (i.e., the index of the series in the distance matrix)
Create a distance function wrapper
Use this wrapper as the distance function.
The wrapper function:
def wrapper(row1, row2):
    # might have to fiddle a bit here, but I think this retrieves the indices
    i1, i2 = row1[0], row2[0]
    return D[i1, i2]
OK, I hope it's clear.
Complete example
#!/usr/bin/env python2.7
# encoding: utf-8
from mlpy import dtw_std  # I don't know if you are using this one: it doesn't matter.
from sklearn.neighbors import KNeighborsClassifier
import numpy as np
# Example data
series = [
    [1, 2, 3, 4],
    [1, 2, 3, 4],
    [1, 2, 3, 4],
    [1, 2, 3],
    [1],
    [1, 2, 3, 4, 5, 6, 7, 8],
    [1, 2, 5, 6, 7, 8],
    [1, 2, 4, 5, 6, 7, 8],
]
# I don't know... these seemed to make sense to me!
y = np.array([0, 0, 0, 0, 1, 2, 2, 2])
# Compute the distance matrix
N = len(series)
D = np.zeros((N, N))
for i in range(N):
    for j in range(i+1, N):
        D[i, j] = dtw_std(series[i], series[j])
        D[j, i] = D[i, j]
print D
# Create the fake data matrix: just the indices of the timeseries
X = np.arange(N).reshape((N, 1))
# Create the wrapper function that returns the correct distance
def wrapper(row1, row2):
    # cast to int to prevent warnings: sklearn converts our integer indices to floats
    i1, i2 = int(row1[0]), int(row2[0])
    return D[i1, i2]
# Only the ball_tree algorithm seems to accept a custom function
knn = KNeighborsClassifier(weights='distance', algorithm='ball_tree', metric='pyfunc', func=wrapper)
knn.fit(X, y)
print knn.kneighbors(X[0])
# (array([[ 0., 0., 0., 1., 6.]]), array([[1, 2, 0, 3, 4]]))
print knn.predict(X)
# [0 0 0 0 1 2 2 2]
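Note that metric='pyfunc' with a func= argument is the old scikit-learn interface; current versions take the callable directly as the metric. A sketch of the modern call, untested against any particular release:

knn = KNeighborsClassifier(n_neighbors=1, weights='distance',
                           algorithm='ball_tree', metric=wrapper)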