The implicit conversion of a Python sequence of variable-length lists into a NumPy array causes the array to be of dtype object.
>>> v = [[1], [1, 2]]
>>> np.array(v)
array([[1], [1, 2]], dtype=object)
Trying to force another type will cause an exception:
np.array(v, dtype=np.int32)
ValueError: setting an array element with a sequence.
What is the most efficient way to get a dense NumPy array of type int32, by filling the "missing" values with a given placeholder?
From my sample sequence v, I would like to get something like this, if 0 is the placeholder
array([[1, 0], [1, 2]], dtype=int32)
You can use itertools.zip_longest:
import itertools
np.array(list(itertools.zip_longest(*v, fillvalue=0))).T
Out:
array([[1, 0],
[1, 2]])
Note: For Python 2, it is itertools.izip_longest.
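To see why the transpose is needed: zip_longest pairs up elements position-wise across the sub-lists, so the raw result has shape (max_len, n_lists). A small sketch with the sample v from the question:
import itertools
import numpy as np
v = [[1], [1, 2]]
cols = list(itertools.zip_longest(*v, fillvalue=0))  # [(1, 1), (0, 2)] -- one tuple per position
np.array(cols).T                                     # transpose back to one row per original list
# array([[1, 0],
#        [1, 2]])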
Here's an almost* vectorized boolean-indexing based approach that I have used in several other posts -
def boolean_indexing(v):
    # lengths of the individual lists
    lens = np.array([len(item) for item in v])
    # boolean mask of the valid positions in the padded output
    mask = lens[:,None] > np.arange(lens.max())
    out = np.zeros(mask.shape, dtype=int)
    # scatter the flattened input into the valid positions
    out[mask] = np.concatenate(v)
    return out
Sample run
In [27]: v
Out[27]: [[1], [1, 2], [3, 6, 7, 8, 9], [4]]
In [28]: boolean_indexing(v)
Out[28]:
array([[1, 0, 0, 0, 0],
[1, 2, 0, 0, 0],
[3, 6, 7, 8, 9],
[4, 0, 0, 0, 0]])
*Please note that this is termed almost vectorized because the only looping performed here is at the start, where we get the lengths of the list elements. That part is not computationally demanding, so it should have minimal effect on the total runtime.
Runtime test
In this section I am timing the DataFrame-based solution by @Alberto Garcia-Raboso and the itertools-based solution by @ayhan, as they seem to scale well, together with the boolean-indexing-based one from this post, on a relatively large dataset with three levels of size variation across the list elements.
Case #1 : Larger size variation
In [44]: v = [[1], [1,2,4,8,4],[6,7,3,6,7,8,9,3,6,4,8,3,2,4,5,6,6,8,7,9,3,6,4]]
In [45]: v = v*1000
In [46]: %timeit pd.DataFrame(v).fillna(0).values.astype(np.int32)
100 loops, best of 3: 9.82 ms per loop
In [47]: %timeit np.array(list(itertools.izip_longest(*v, fillvalue=0))).T
100 loops, best of 3: 5.11 ms per loop
In [48]: %timeit boolean_indexing(v)
100 loops, best of 3: 6.88 ms per loop
Case #2 : Lesser size variation
In [49]: v = [[1], [1,2,4,8,4],[6,7,3,6,7,8]]
In [50]: v = v*1000
In [51]: %timeit pd.DataFrame(v).fillna(0).values.astype(np.int32)
100 loops, best of 3: 3.12 ms per loop
In [52]: %timeit np.array(list(itertools.izip_longest(*v, fillvalue=0))).T
1000 loops, best of 3: 1.55 ms per loop
In [53]: %timeit boolean_indexing(v)
100 loops, best of 3: 5 ms per loop
Case #3 : Larger number of elements (100 max) per list element
In [139]: # Setup inputs
...: N = 10000 # Number of elems in list
...: maxn = 100 # Max. size of a list element
...: lens = np.random.randint(0,maxn,(N))
...: v = [list(np.random.randint(0,9,(L))) for L in lens]
...:
In [140]: %timeit pd.DataFrame(v).fillna(0).values.astype(np.int32)
1 loops, best of 3: 292 ms per loop
In [141]: %timeit np.array(list(itertools.izip_longest(*v, fillvalue=0))).T
1 loops, best of 3: 264 ms per loop
In [142]: %timeit boolean_indexing(v)
10 loops, best of 3: 95.7 ms per loop
To me, it seems itertools.izip_longest is doing pretty well! There's no clear winner, so the choice would have to be made on a case-by-case basis.
Pandas and its DataFrames deal beautifully with missing data.
import numpy as np
import pandas as pd
v = [[1], [1, 2]]
print(pd.DataFrame(v).fillna(0).values.astype(np.int32))
# [[1 0]
#  [1 2]]
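If you are on a recent pandas, the same thing can be written with the to_numpy accessor; a minimal sketch (assuming pandas >= 0.24):
dense = pd.DataFrame(v).fillna(0).to_numpy(dtype=np.int32)
print(dense)
# [[1 0]
#  [1 2]]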
max_len = max(len(sub_list) for sub_list in v)
result = np.array([sub_list + [0] * (max_len - len(sub_list)) for sub_list in v])
>>> result
array([[1, 0],
[1, 2]])
>>> type(result)
numpy.ndarray
Here is a general way:
>>> v = [[1], [2, 3, 4], [5, 6], [7, 8, 9, 10], [11, 12]]
>>> max_len = max(len(i) for i in v)
>>> np.hstack(np.insert(v, range(1, len(v)+1),[[0]*(max_len-len(i)) for i in v])).astype('int32').reshape(len(v), max_len)
array([[ 1, 0, 0, 0],
[ 2, 3, 4, 0],
[ 5, 6, 0, 0],
[ 7, 8, 9, 10],
[11, 12, 0, 0]], dtype=int32)
You can convert to a pandas DataFrame first, and then convert it to a NumPy array:
ll = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
df = pd.DataFrame(ll)
print(df)
# 0 1 2 3
# 0 1 2 3.0 NaN
# 1 4 5 NaN NaN
# 2 6 7 8.0 9.0
npl = df.to_numpy()
print(npl)
# [[ 1. 2. 3. nan]
# [ 4. 5. nan nan]
# [ 6. 7. 8. 9.]]
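If you want the 0 placeholder and an integer dtype instead of NaN, a minimal variation of the above (my own addition) is to fill before converting:
dense = df.fillna(0).to_numpy(dtype=np.int32)
print(dense)
# [[1 2 3 0]
#  [4 5 0 0]
#  [6 7 8 9]]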
I was getting a NumPy broadcasting error with Alexander's answer, so I added a small variation using numpy.pad:
pad = len(max(X, key=len))
result = np.array([np.pad(i, (0, pad-len(i)), 'constant') for i in X])
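A small usage sketch with the question's sample data as X:
X = [[1], [1, 2]]
pad = len(max(X, key=len))                 # length of the longest sub-list
result = np.array([np.pad(i, (0, pad - len(i)), 'constant') for i in X])
# array([[1, 0],
#        [1, 2]])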
If you want to extend the same logic to deeper levels (lists of lists of lists, ...), you can use TensorFlow ragged tensors and convert them to tensors/arrays. For example:
import tensorflow as tf
v = [[1], [1, 2]]
padded_v = tf.ragged.constant(v).to_tensor(0)
This creates an array padded with 0.
or a deeper example:
w = [[[1]], [[2],[1, 2]]]
padded_w = tf.ragged.constant(w).to_tensor(0)
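For reference, the padded values look like this (a sketch; assumes TensorFlow 2.x with eager execution so .numpy() is available):
print(padded_v.numpy())
# [[1 0]
#  [1 2]]
print(padded_w.numpy())
# [[[1 0]
#   [0 0]]
#  [[2 0]
#   [1 2]]]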
Related
I want to count the number of unique elements in each row of a 2D array. For example, for
a = np.array([[1, 0, 0], [1, 0, 0], [2, 3, 4]])
I want to get
[2, 2, 3]
Is there a way to do this without for loops or using np.vectorize?
Edit: Actual data consists of 1000 rows of 100 elements each, with each element ranging from 1 to 365. The ultimate goal is to determine the percentage of rows that have duplicates. This was a homework problem which I already solved (with a for loop), but I was just wondering if there was a better way to do it with numpy.
Approach #1
One vectorized approach with sorting -
In [8]: b = np.sort(a,axis=1)
In [9]: (b[:,1:] != b[:,:-1]).sum(axis=1)+1
Out[9]: array([2, 2, 3])
Approach #2
Another method, for ints that aren't very large, would be to offset each row by an amount that separates its elements from those of every other row, then do a binned summation and count the number of non-zero bins per row -
n = a.max()+1
a_off = a+(np.arange(a.shape[0])[:,None])*n
M = a.shape[0]*n
out = (np.bincount(a_off.ravel(), minlength=M).reshape(-1,n)!=0).sum(1)
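To make the offsetting concrete, here is a worked sketch (my own illustration) on the sample a from the question:
a = np.array([[1, 0, 0], [1, 0, 0], [2, 3, 4]])
n = a.max() + 1                                  # 5 -> each row gets its own block of 5 bins
a_off = a + np.arange(a.shape[0])[:, None] * n   # [[ 1,  0,  0],
                                                 #  [ 6,  5,  5],
                                                 #  [12, 13, 14]]
counts = np.bincount(a_off.ravel(), minlength=a.shape[0] * n).reshape(-1, n)
(counts != 0).sum(1)                             # array([2, 2, 3])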
Runtime test
Approaches as funcs -
def sorting(a):
    b = np.sort(a,axis=1)
    return (b[:,1:] != b[:,:-1]).sum(axis=1)+1

def bincount(a):
    n = a.max()+1
    a_off = a+(np.arange(a.shape[0])[:,None])*n
    M = a.shape[0]*n
    return (np.bincount(a_off.ravel(), minlength=M).reshape(-1,n)!=0).sum(1)
# From @wim's post
def pandas(a):
    df = pd.DataFrame(a.T)
    return df.nunique()

# @jp_data_analysis's soln
def numpy_apply(a):
    return np.apply_along_axis(compose(len, np.unique), 1, a)
Case #1 : Square shaped one
In [164]: np.random.seed(0)
In [165]: a = np.random.randint(0,5,(10000,10000))
In [166]: %timeit numpy_apply(a)
...: %timeit sorting(a)
...: %timeit bincount(a)
...: %timeit pandas(a)
1 loop, best of 3: 1.82 s per loop
1 loop, best of 3: 1.93 s per loop
1 loop, best of 3: 354 ms per loop
1 loop, best of 3: 879 ms per loop
Case #2 : Large number of rows
In [167]: np.random.seed(0)
In [168]: a = np.random.randint(0,5,(1000000,10))
In [169]: %timeit numpy_apply(a)
...: %timeit sorting(a)
...: %timeit bincount(a)
...: %timeit pandas(a)
1 loop, best of 3: 8.42 s per loop
10 loops, best of 3: 153 ms per loop
10 loops, best of 3: 66.8 ms per loop
1 loop, best of 3: 53.6 s per loop
Extending to number of unique elements per column
To extend, we just need to do the slicing and ufunc operations along the other axis for the two proposed approaches, like so -
def nunique_percol_sort(a):
    b = np.sort(a,axis=0)
    return (b[1:] != b[:-1]).sum(axis=0)+1

def nunique_percol_bincount(a):
    n = a.max()+1
    a_off = a+(np.arange(a.shape[1]))*n
    M = a.shape[1]*n
    return (np.bincount(a_off.ravel(), minlength=M).reshape(-1,n)!=0).sum(1)
Generic ndarray with generic axis
Let's see how we can extend to ndarray of generic dimensions and get those number of unique counts along a generic axis. We will make use of np.diff with its axis param to get those consecutive differences and hence make it generic, like so -
def nunique(a, axis):
    return (np.diff(np.sort(a,axis=axis),axis=axis)!=0).sum(axis=axis)+1
Sample runs -
In [77]: a
Out[77]:
array([[1, 0, 2, 2, 0],
[1, 0, 1, 2, 0],
[0, 0, 0, 0, 2],
[1, 2, 1, 0, 1],
[2, 0, 1, 0, 0]])
In [78]: nunique(a, axis=0)
Out[78]: array([3, 2, 3, 2, 3])
In [79]: nunique(a, axis=1)
Out[79]: array([3, 3, 2, 3, 3])
If you are working with floating point numbers and want to base uniqueness on some tolerance value rather than an absolute match, we can use np.isclose. Two such options would be -
(~np.isclose(np.diff(np.sort(a,axis=axis),axis=axis),0)).sum(axis)+1
a.shape[axis]-np.isclose(np.diff(np.sort(a,axis=axis),axis=axis),0).sum(axis)
For a custom tolerance value, pass it to np.isclose (e.g. via its atol or rtol arguments).
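For example, with a custom absolute tolerance (a sketch with made-up values):
a = np.array([[1.0, 1.0000001, 2.0],
              [0.5, 0.5,       0.5]])
axis = 1
(~np.isclose(np.diff(np.sort(a, axis=axis), axis=axis), 0, atol=1e-6)).sum(axis) + 1
# array([2, 1]) -> 1.0 and 1.0000001 are treated as the same value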
This solution via np.apply_along_axis isn't vectorized and involves a Python-level loop, but it is relatively intuitive, using len + np.unique.
import numpy as np
from toolz import compose
a = np.array([[1, 0, 0], [1, 0, 0], [2, 3, 4]])
np.apply_along_axis(compose(len, np.unique), 1, a) # [2, 2, 3]
A one-liner using sort:
In [6]: np.count_nonzero(np.diff(np.sort(a)), axis=1)+1
Out[6]: array([2, 2, 3])
Are you open to considering pandas? DataFrames have a dedicated method for this:
>>> a = np.array([[1, 0, 0], [1, 0, 0], [2, 3, 4]])
>>> df = pd.DataFrame(a.T)
>>> print(*df.nunique())
2 2 3
I have a (N,3) array of numpy values:
>>> vals = numpy.array([[1,2,3],[4,5,6],[7,8,7],[0,4,5],[2,2,1],[0,0,0],[5,4,3]])
>>> vals
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 7],
[0, 4, 5],
[2, 2, 1],
[0, 0, 0],
[5, 4, 3]])
I'd like to remove rows from the array that have a duplicate value. For example, the result for the above array should be:
>>> duplicates_removed
array([[1, 2, 3],
[4, 5, 6],
[0, 4, 5],
[5, 4, 3]])
I'm not sure how to do this efficiently with numpy without looping (the array could be quite large). Anyone know how I could do this?
This is an option:
import numpy
vals = numpy.array([[1,2,3],[4,5,6],[7,8,7],[0,4,5],[2,2,1],[0,0,0],[5,4,3]])
a = (vals[:,0] == vals[:,1]) | (vals[:,1] == vals[:,2]) | (vals[:,0] == vals[:,2])
vals = numpy.delete(vals, numpy.where(a), axis=0)
Here's an approach that handles a generic number of columns and is still a vectorized method -
def rows_uniq_elems(a):
    a_sorted = np.sort(a,axis=-1)
    return a[(a_sorted[...,1:] != a_sorted[...,:-1]).all(-1)]
Steps:
Sort along each row.
Look for differences between consecutive elements in each row; any row with at least one zero difference contains a duplicate element. Use this to build a mask of valid rows, and as the final step simply select the valid rows from the input array with that mask (see the sketch below).
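A sketch of those intermediates on a tiny input (my own illustration):
a = np.array([[1, 2, 3], [7, 8, 7]])
a_sorted = np.sort(a, axis=-1)                             # [[1, 2, 3], [7, 7, 8]]
keep = (a_sorted[..., 1:] != a_sorted[..., :-1]).all(-1)   # [ True, False]
a[keep]                                                    # array([[1, 2, 3]])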
Sample run -
In [49]: a
Out[49]:
array([[1, 2, 3, 7],
[4, 5, 6, 7],
[7, 8, 7, 8],
[0, 4, 5, 6],
[2, 2, 1, 1],
[0, 0, 0, 3],
[5, 4, 3, 2]])
In [50]: rows_uniq_elems(a)
Out[50]:
array([[1, 2, 3, 7],
[4, 5, 6, 7],
[0, 4, 5, 6],
[5, 4, 3, 2]])
numpy.array([v for v in vals if len(set(v)) == len(v)])
Mind you, this still loops behind the scenes. You can't avoid that. But it should work fine even for millions of rows.
It's six years on, but this question helped me, so I ran a speed comparison of the answers given by Divakar, Benjamin, Marcelo Cantos and Curtis Patrick.
import numpy as np
vals = np.array([[1,2,3],[4,5,6],[7,8,7],[0,4,5],[2,2,1],[0,0,0],[5,4,3]])
def rows_uniq_elems1(a):
    idx = a.argsort(1)
    a_sorted = a[np.arange(idx.shape[0])[:,None], idx]
    return a[(a_sorted[:,1:] != a_sorted[:,:-1]).all(-1)]

def rows_uniq_elems2(a):
    b = (a[:,0] == a[:,1]) | (a[:,1] == a[:,2]) | (a[:,0] == a[:,2])
    return np.delete(a, np.where(b), axis=0)

def rows_uniq_elems3(a):
    return np.array([v for v in a if len(set(v)) == len(v)])

def rows_uniq_elems4(a):
    return np.array([v for v in a if len(np.unique(v)) == len(v)])
Results:
%timeit rows_uniq_elems1(vals)
10000 loops, best of 3: 67.9 µs per loop
%timeit rows_uniq_elems2(vals)
10000 loops, best of 3: 156 µs per loop
%timeit rows_uniq_elems3(vals)
1000 loops, best of 3: 59.5 µs per loop
%timeit rows_uniq_elems4(vals)
10000 loops, best of 3: 268 µs per loop
It seems that using set beats numpy.unique. In my case I needed to do this over a much larger array:
bigvals = np.random.randint(0,10,3000).reshape([3,1000])
%timeit rows_uniq_elems1(bigvals)
10000 loops, best of 3: 276 µs per loop
%timeit rows_uniq_elems2(bigvals)
10000 loops, best of 3: 192 µs per loop
%timeit rows_uniq_elems3(bigvals)
10000 loops, best of 3: 6.5 ms per loop
%timeit rows_uniq_elems4(bigvals)
10000 loops, best of 3: 35.7 ms per loop
The methods without list comprehensions are much faster. However, the number of columns is hard-coded and difficult to extend to more than three, so in my case at least the list comprehension with set is the best answer.
EDITED because I confused rows and columns in bigvals
Identical to Marcelo, but I think using numpy.unique() instead of set() may get across exactly what you are shooting for.
numpy.array([v for v in vals if len(numpy.unique(v)) == len(v)])
Given a numpy array
a = np.array([[0, -1, 0], [1, 0, 0], [1, 0, -1]])
what's the fastest way to delete all elements of value -1 to get an array of the form
np.array([[0, 0], [1, 0, 0], [1, 0]])
Another method you might consider:
def iterative_numpy(a):
    # keep everything that is not the value to remove
    mask = a != -1
    out = np.array([a[i, mask[i]] for i in range(a.shape[0])])
    return out
Divakar's method loop_compr_based calculates sums along the rows of mask and a cumulative sum of that result. This method avoids such summations but still has to iterate through the rows of a. It also returns an array of arrays, which has the annoyance that out has to be indexed with the syntax out[1][2] rather than out[1, 2]. Comparing the times with random integer matrices:
In [4]: a = np.random.random_integers(-1,1, size = (3,30))
In [5]: %timeit iterative_numpy(a)
100000 loops, best of 3: 11.1 us per loop
In [6]: %timeit loop_compr_based(a)
10000 loops, best of 3: 20.2 us per loop
In [7]: a = np.random.random_integers(-1,1, size = (30,3))
In [8]: %timeit iterative_numpy(a)
10000 loops, best of 3: 59.5 us per loop
In [9]: %timeit loop_compr_based(a)
10000 loops, best of 3: 30.8 us per loop
In [10]: a = np.random.random_integers(-1,1, size = (30,30))
In [11]: %timeit iterative_numpy(a)
10000 loops, best of 3: 64.6 us per loop
In [12]: %timeit loop_compr_based(a)
10000 loops, best of 3: 36 us per loop
When there are more columns than rows, iterative_numpy wins out. When there are more rows than columns, loop_compr_based wins, though transposing a first will improve the performance of both methods. When the dimensions are comparable, loop_compr_based is best.
Important Side Discussion
Outside of the implementation, it's important to note that any numpy array with a non-uniform shape is not an actual array, in the sense that the values do not occupy a contiguous section of memory; furthermore, the usual array operations will not work as expected.
As an example:
>>> a = np.array([[1,2,3],[1,2],[1]])
>>> a*2
array([[1, 2, 3, 1, 2, 3], [1, 2, 1, 2], [1, 1]], dtype=object)
Notice that numpy actually informs us that this is not the usual numpy array with the note dtype=object.
Thus it might be best to just make a list of numpy arrays and use them accordingly.
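A minimal sketch of that list-of-arrays alternative (my own illustration), using boolean indexing per row:
a = np.array([[0, -1, 0], [1, 0, 0], [1, 0, -1]])
rows = [row[row != -1] for row in a]
# [array([0, 0]), array([1, 0, 0]), array([1, 0])]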
Approach #1 : Using NumPy splitting of array -
def split_based(a, val):
    mask = a!=val
    p = np.split(a[mask],mask.sum(1)[:-1].cumsum())
    out = np.array(list(map(list,p)))
    return out
Approach #2 : Using loop comprehension, but minimal work within the loop -
def loop_compr_based(a, val):
    mask = a!=val
    stop = mask.sum(1).cumsum()
    start = np.append(0,stop[:-1])
    am = a[mask].tolist()
    out = np.array([am[start[i]:stop[i]] for i in range(len(start))])
    return out
Sample run -
In [391]: a
Out[391]:
array([[ 0, -1, 0],
[ 1, 0, 0],
[ 1, 0, -1],
[-1, -1, 8],
[ 3, 7, 2]])
In [392]: split_based(a, val=-1)
Out[392]: array([[0, 0], [1, 0, 0], [1, 0], [8], [3, 7, 2]], dtype=object)
In [393]: loop_compr_based(a, val=-1)
Out[393]: array([[0, 0], [1, 0, 0], [1, 0], [8], [3, 7, 2]], dtype=object)
Runtime test -
In [387]: a = np.random.randint(-2,10,(1000,1000))
In [388]: %timeit split_based(a, val=-1)
10 loops, best of 3: 161 ms per loop
In [389]: %timeit loop_compr_based(a, val=-1)
10 loops, best of 3: 29 ms per loop
Use indexes = np.where(a == -1) to get the indexes of the matching elements (see "Find indices of elements equal to zero from numpy array"), then delete those elements by index with np.delete(your_array, indexes) (see "How to remove specific elements in a numpy array").
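A sketch of that recipe; note that without an axis argument np.delete works on the flattened array, so the 2-D row structure is lost (np.flatnonzero, used here to get flat indices, is my own choice):
a = np.array([[0, -1, 0], [1, 0, 0], [1, 0, -1]])
idx = np.flatnonzero(a == -1)   # flat indices of the -1 entries
np.delete(a, idx)               # result is 1-D: array([0, 0, 1, 0, 0, 1, 0])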
How about this?
print([[y for y in x if y > -1] for x in a])
[[0, 0], [1, 0, 0], [1, 0]]
For almost everything you might want to do with such an array, you can use a masked array
a = np.array([[0, -1, 0], [1, 0, 0], [1, 0, -1]])
b=np.ma.masked_equal(a,-1)
b
Out[5]:
masked_array(data =
[[0 -- 0]
[1 0 0]
[1 0 --]],
mask =
[[False True False]
[False False False]
[False False True]],
fill_value = -1)
If you really want the ragged array, it can be .compressed() row by row:
c=np.array([b[i].compressed() for i in range(b.shape[0])])
c
Out[10]: array([array([0, 0]), array([1, 0, 0]), array([1, 0])], dtype=object)
I'm working with numpy in python to calculate a vector multiplication.
I have a vector x of dimensions n x 1 and I want to calculate x*x_transpose.
This gives me problems because x.T or x.transpose() doesn't affect a 1 dimensional vector (numpy represents vertical and horizontal vectors the same way).
But how do I calculate a (n x 1) x (1 x n) vector multiplication in numpy?
numpy.dot(x,x.T) gives a scalar, not a 2D matrix as I want.
You are essentially computing an Outer Product.
You can use np.outer.
In [15]: a=[1,2,3]
In [16]: np.outer(a,a)
Out[16]:
array([[1, 2, 3],
[2, 4, 6],
[3, 6, 9]])
While np.outer is the simplest way to do this, I thought I'd just mention how you might manipulate the (N,)-shaped array to do this:
In [17]: a = np.arange(4)
In [18]: np.dot(a[:,None], a[None,:])
Out[18]:
array([[0, 0, 0, 0],
[0, 1, 2, 3],
[0, 2, 4, 6],
[0, 3, 6, 9]])
In [19]: np.outer(a,a)
Out[19]:
array([[0, 0, 0, 0],
[0, 1, 2, 3],
[0, 2, 4, 6],
[0, 3, 6, 9]])
You could alternatively replace None with np.newaxis.
Another more exotic way to do this is with np.einsum:
In [20]: np.einsum('i,j', a, a)
Out[20]:
array([[0, 0, 0, 0],
[0, 1, 2, 3],
[0, 2, 4, 6],
[0, 3, 6, 9]])
And just for fun, some timings, which are likely going to vary based on hardware and numpy version/compilation:
Small-ish vector
In [36]: a = np.arange(5, dtype=np.float64)
In [37]: %timeit np.outer(a,a)
100000 loops, best of 3: 17.7 µs per loop
In [38]: %timeit np.dot(a[:,None],a[None,:])
100000 loops, best of 3: 11 µs per loop
In [39]: %timeit np.einsum('i,j', a, a)
1 loops, best of 3: 11.9 µs per loop
In [40]: %timeit a[:, None] * a
100000 loops, best of 3: 9.68 µs per loop
And something a little larger
In [42]: a = np.arange(500, dtype=np.float64)
In [43]: %timeit np.outer(a,a)
1000 loops, best of 3: 605 µs per loop
In [44]: %timeit np.dot(a[:,None],a[None,:])
1000 loops, best of 3: 1.29 ms per loop
In [45]: %timeit np.einsum('i,j', a, a)
1000 loops, best of 3: 359 µs per loop
In [46]: %timeit a[:, None] * a
1000 loops, best of 3: 597 µs per loop
If you want an inner product, use numpy.dot(x, x); for an outer product, use numpy.outer(x, x).
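A quick check of both (a sketch, assuming x = np.array([1, 2, 3])):
x = np.array([1, 2, 3])
np.dot(x, x)     # 14 -> inner product, a scalar
np.outer(x, x)
# array([[1, 2, 3],
#        [2, 4, 6],
#        [3, 6, 9]])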
Another alternative is to define the row/column vector with two dimensions, e.g.
a = np.array([1, 2, 3], ndmin=2)
np.dot(a.T, a)
array([[1, 2, 3],
[2, 4, 6],
[3, 6, 9]])
Another alternative is to use numpy.matrix:
>>> a = np.matrix([1,2,3])
>>> a
matrix([[1, 2, 3]])
>>> a.T * a
matrix([[1, 2, 3],
[2, 4, 6],
[3, 6, 9]])
Generally, use of numpy.array is preferred. However, using numpy.matrix can be more readable for long expressions, although note that the NumPy documentation now recommends regular arrays over numpy.matrix.