Python for loop syntax with two dimensional array - python

for f in train.columns:
    missings = train[train[f] == -1][f].count()
What does train[][] mean? How can this be a two-dimensional array if f refers to another column?

In vanilla Python this is certainly very odd and poorly written code, but it could be valid in a very limited number of situations. Below are a couple of examples in which it would work. I am sure there are more, but either way it is not easy to understand, and I do not recommend using it in your own code.
Note: for a list, the count() method requires one argument.
Example 1
f = 4
train = [[1, 2, 3, 4, [0, 0, 1, 0]], [1, 2, 3, 4, [1, 0, 1, 1]], 0, 1, -1]
missings = train[train[f] == -1][f].count(1)
print(missings) # output = 3
Example 2
f = 4
train = {True: [1, 2, 3, 4, [0, 0, 0, 1]], False: [1, 2, 3, 4, [1, 1, 1, 0]], 4: 1}
missing = train[train[f] == -1][f].count(1)
print(missing) # output = 3

It looks like you are already getting values from a 2D array, i.e. train[train[f] == -1][f].
You can create a 2D array like this:
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
or
arr = [[12, 13, 5, 4], [14, 8,11], [12, 10, 12, 6], [15,17,9,0]]
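For what it's worth, `train.columns` suggests that `train` is a pandas DataFrame rather than a nested list, although the question does not say so. Under that assumption, `train[f] == -1` builds a boolean mask, `train[mask]` keeps only the matching rows, and `[f].count()` counts the entries of column f in those rows. A minimal sketch with made-up data (the column names and the use of -1 as a "missing" marker are assumptions):

```python
import pandas as pd

# Hypothetical data: -1 is assumed to mark a missing value
train = pd.DataFrame({"a": [1, -1, 3, -1], "b": [5, 6, -1, 8]})

missing_counts = {}
for f in train.columns:
    mask = train[f] == -1      # boolean Series, one flag per row
    rows = train[mask]         # rows of the frame where column f equals -1
    missing_counts[f] = int(rows[f].count())  # how many such rows there are

print(missing_counts)  # {'a': 2, 'b': 1}
```

So the two pairs of brackets are not 2D indexing: the first selects rows by mask, the second selects a column by name.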

Related

np.add is giving me a headache. How do I use the where option in np.add?

I have two arrays that are loaded from images, and I need to add these two images. For example, if there's a circle of pixel value 1 in the center, I need to add it to a triangle of pixel value 2 in the top left corner. The rule I want is that where a 1 is already present at an index it stays, and pixel value 2 is only added to the pixels that are blank (pixel value 0).
How do I do that? I keep trying with np.add and the where option
mask_test = master_array == 0
master_array = np.add(master_array, new_pic, where = mask_test)
But it keeps going wrong, and master_array just ends up being new_pic instead of the sum. Online searches for how 'where' works have been fruitless because nobody gives an example; some even just go "oh it's not used much so I won't go over it".
This code:
master_array = np.add(master_array, new_pic, where = mask_test)
gives me this (the image is not preserved here):
But the problem is that when the pixels do overlap I get a pixel value of 3 instead of it retaining the value of 1 as it should.
As explained in the docs, the out array will retain its "original value" in cases where the condition in the where parameter is false. This implies that you need to specify an out array to which the function will output if you are going to set the where parameter. Otherwise the function tries to get the original values from an uninitialized array, which has odd results. If you're happy to overwrite master_array, you can do that like this:
np.add(master_array, new_pic, out=master_array, where=master_array == 0)
(You don't need to assign the returned value here - specifying the output array is sufficient.)
It is probably less of a headache to use + with np.where instead:
master_array += np.where(master_array == 0, new_pic, 0)
But since you are only adding in cases where pixel value is 0 in the master, there is no need to add in the first place. You could just use np.where without any addition.
master_array = np.where(master_array == 0, new_pic, master_array)
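To make the three variants concrete, here is a small sketch on toy 3x3 "images". The arrays are made up for illustration (a block of 1s standing in for the circle, a corner of 2s for the triangle); only the NumPy calls come from the answer above.

```python
import numpy as np

# Toy stand-ins for the two images
master_array = np.array([[0, 1, 1],
                         [0, 1, 1],
                         [0, 0, 0]])
new_pic = np.array([[2, 2, 0],
                    [2, 0, 0],
                    [0, 0, 0]])

# 1) ufunc with where= and an explicit out= array
via_ufunc = master_array.copy()
np.add(via_ufunc, new_pic, out=via_ufunc, where=via_ufunc == 0)

# 2) plain addition of a masked copy of new_pic
via_plus = master_array + np.where(master_array == 0, new_pic, 0)

# 3) no addition at all, just a selection
via_where = np.where(master_array == 0, new_pic, master_array)

print(via_ufunc)
# [[2 1 1]
#  [2 1 1]
#  [0 0 0]]
```

All three give the same result here: the overlapping pixel at row 0, column 1 keeps its value of 1 instead of becoming 3.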
Use of where in np.add (or any other ufunc) is not common, especially compared to the use of the np.where function. And, at least when I answer on SO, I stress that out needs to be included.
The docs talk about "uninitialized" values when out=None, the default. That may be unclear, but effectively it means an array such as that created by np.empty.
This may contain anything, such as:
In [263]: res = np.empty((5,5),int)
In [264]: res
Out[264]:
array([[ 50999536, 0, 140274438367024,
-6315338160082163841, 140273540789184],
[ 161, 55839504, 140274448227440,
140273575343728, 358094631352936090],
[ 140273564120384, 140273575343344, -7783537013977118542,
140273543024256, 140273575343200],
[-6522034781934541837, 140273620247296, 140273575343776,
1387433780369843801, 140273560270848],
[ 140273561761968, -3190833100527581043, 140273563628672,
140273561762640, 480]])
Define an initial array:
In [265]: x1 = np.random.randint(0,5,(5,5))
In [266]: x1
Out[266]:
array([[3, 2, 0, 1, 3],
[3, 2, 4, 0, 3],
[2, 3, 3, 4, 3],
[3, 2, 0, 2, 2],
[1, 2, 1, 1, 2]])
In [267]: x2=x1.copy()
Without out, we get values much like res above. Only the x1==0 elements are set to 10:
In [268]: np.add(x1, 10, where=x1==0)
Out[268]:
array([[51108864, 0, 10, 47780512, 51193856],
[51213024, 51245760, 51252528, 10, 51260336],
[51261168, 51261920, 51264176, 51298864, 51270656],
[51271040, 51274864, 10, 51276640, 51277024],
[51277808, 51278528, 51279104, 51284496, 51286448]])
Or we could set the out to np.zeros:
In [269]: np.add(x1, 10, where=x1==0, out=np.zeros((5,5),int))
Out[269]:
array([[ 0, 0, 10, 0, 0],
[ 0, 0, 0, 10, 0],
[ 0, 0, 0, 0, 0],
[ 0, 0, 10, 0, 0],
[ 0, 0, 0, 0, 0]])
But if we set it to x1, or a copy of x1 (which is probably what you want):
In [270]: np.add(x1, 10, where=x1==0, out=x2)
Out[270]:
array([[ 3, 2, 10, 1, 3],
[ 3, 2, 4, 10, 3],
[ 2, 3, 3, 4, 3],
[ 3, 2, 10, 2, 2],
[ 1, 2, 1, 1, 2]])
But we could do the same with masked addition:
In [271]: x1[x1==0] += 10
In [272]: x1
Out[272]:
array([[ 3, 2, 10, 1, 3],
[ 3, 2, 4, 10, 3],
[ 2, 3, 3, 4, 3],
[ 3, 2, 10, 2, 2],
[ 1, 2, 1, 1, 2]])
Or using the more commonly used np.where function:
In [273]: np.where(x1==10, 20, x1)
Out[273]:
array([[ 3, 2, 20, 1, 3],
[ 3, 2, 4, 20, 3],
[ 2, 3, 3, 4, 3],
[ 3, 2, 20, 2, 2],
[ 1, 2, 1, 1, 2]])
In my experience with SO, the where/out is most useful when evaluation at certain values can give rise to errors, for example division by 0, or log of negatives.
In np.where(A,B,C), the 3 arguments are evaluated in full, and result just selects from B and C based on A. With the np.add(x, y, where=A, out=C), the x+y addition is only done where the condition is true. The evaluation is selective. The distinction may be hard to grasp, and may not matter when using np.add.
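A sketch of that difference, using division by zero as the case where selective evaluation matters. The arrays are invented for illustration; `np.divide`, `np.where` and `np.errstate` are standard NumPy.

```python
import numpy as np

num = np.array([1.0, 2.0, 3.0, 4.0])
den = np.array([2.0, 0.0, 4.0, 0.0])

# Selective evaluation: the division is only carried out where den != 0,
# and the other slots keep the value already present in `out` (0.0 here).
selective = np.divide(num, den, out=np.zeros_like(num), where=den != 0)

# Eager evaluation: num / den is computed everywhere first (hence the
# need to silence the divide-by-zero warning), then np.where picks values.
with np.errstate(divide="ignore", invalid="ignore"):
    eager = np.where(den != 0, num / den, 0.0)

print(selective)  # [0.5  0.   0.75 0.  ]
```

Both give the same array, but only the first avoids ever evaluating the problematic divisions.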
You could simply use the normal addition:
mask_test = master_array == 0
master_array += new_pic * mask_test

how to replace all the items of array by indexes of specified lists?

I want to replace all the items of sequence with ids that tell which list of labeller they are in. Assume that all the values are distinct in both sequence and labeller, and that the union of the lists in labeller has the same items as sequence. lsizes holds the sizes of the lists in labeller; it is redundant for a pure-Python solution but might be needed for the solution to be fully vectorised.
sequence = [1, 2, 10, 5, 6, 4, 3, 8, 7, 9]
labeller = [[1, 2, 10], [3, 4, 5, 6, 7], [8, 9]]
lsizes = [3, 5, 2]
I know how to solve it in a simple way:
idx = {u:i for i, label in enumerate(labeller) for u in label}
tags = [idx[u] for u in sequence]
And the output is:
tags = [0, 0, 0, 1, 1, 1, 1, 2, 1, 2]
After that I put all my effort into doing it in a vectorised way. It's quite complicated for me. This is my attempt, made rather by guesswork, but unfortunately it doesn't pass all my tests. I hope I'm close:
sequence = np.array(sequence)
cl = np.concatenate(labeller)
_, cl_idx = np.unique(cl, return_index=True)
_, idx = np.unique(sequence[cl_idx], return_index=True)
tags = np.repeat(np.arange(len(lsizes)), lsizes)[idx]
#output: [0 0 1 1 0 1 1 1 2 2]
How can I finish it? I would also like to see a rigorous explanation of what it does and how to understand it better. Any sources are also welcome.
Approach #1
For this kind of trace-back problem, searchsorted seems to be the way to go and works here too, re-using your cl -
cl = np.concatenate(labeller)
sidx = cl.argsort()
idx = np.searchsorted(cl, sequence, sorter=sidx)
idx0 = sidx[idx]
l = list(map(len, labeller))
r = np.repeat(np.arange(len(l)), l)
out = r[idx0]
Using lsizes for l makes it fully vectorized. But, I suspect the concatenation step might be heavy. Whether this is worth it or not would depend a lot on the lengths of the subarrays.
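Since the question asked for an explanation, here is the same idea as a self-contained sketch with each step commented. The intermediate names (`group_of`, `pos_sorted`, `pos_in_cl`) are mine, chosen for readability, and are not part of the answer above.

```python
import numpy as np

sequence = np.array([1, 2, 10, 5, 6, 4, 3, 8, 7, 9])
labeller = [[1, 2, 10], [3, 4, 5, 6, 7], [8, 9]]
lsizes = [3, 5, 2]

cl = np.concatenate(labeller)                          # all labelled values, flattened
group_of = np.repeat(np.arange(len(lsizes)), lsizes)   # group id of each entry of cl

sidx = cl.argsort()                                    # cl[sidx] is sorted
# searchsorted needs sorted input; sorter=sidx lets it treat cl as if sorted
pos_sorted = np.searchsorted(cl, sequence, sorter=sidx)
pos_in_cl = sidx[pos_sorted]                           # map ranks back to positions in cl
tags = group_of[pos_in_cl]                             # look up the group of each position

print(tags.tolist())  # [0, 0, 0, 1, 1, 1, 1, 2, 1, 2]
```

This relies on the stated assumption that every value of sequence occurs in labeller; a value that is absent would silently get the tag of its insertion point instead.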
Approach #2
For positive numbers, here's one with array-indexing as a hashing mechanism -
N = max(map(max, labeller))+1
id_ar = np.zeros(N, dtype=int) # use np.empty for perf. boost
for i,l in enumerate(labeller):
    id_ar[l] = i
out = id_ar[sequence]
sequence = [1, 2, 10, 5, 6, 4, 3, 8, 7, 9]
labeller = [[1, 2, 10], [3, 4, 5, 6, 7], [8, 9]]
lsizes = [3, 5, 2]
sequence_array = np.array(sequence)
labeller_array = np.concatenate(labeller)  # flatten the ragged lists; np.array(labeller) fails on ragged input in recent NumPy
index_array = np.repeat(list(range(len(lsizes))), lsizes)
np.apply_along_axis(lambda num : index_array[np.where(labeller_array == num)[0]], 0, sequence_array[None, :])
# output: array([[0, 0, 0, 1, 1, 1, 1, 2, 1, 2]])
Alternative, using pandas (assumes import pandas as pd):
label_df = pd.DataFrame({'label':labeller_array, 'index':index_array})
seq_df = pd.DataFrame({'seq':sequence_array})
seq_df.merge(label_df, left_on = 'seq', right_on = 'label')['index'].tolist()
#output: [0, 0, 0, 1, 1, 1, 1, 2, 1, 2]

Customize iterating over numpy matrix

I'm using Python 3.x and I want to create an iterator that lets me iterate over a matrix from cell [N,0] to [0,N].
I don't want to use index magic, so I tried np.nditer, which is not enough for that.
a = np.matrix(np.random.randint(0,3,(3,3)))
>>>([[0, 0, 1],
[1, 1, 2],
[1, 2, 2]])
it = np.nditer(a, flags=['f_index'])
for i in range(a.size):
    print(it[0])
    it.iternext()
>>>0 0 1 1 1 2 1 2 2
I want to get the following :
1,2,2,1,1,2,0,0,1
Is it possible using iterators of some kind?
In [29]: arr = np.array([[0,0,1],[1,1,2],[1,2,2]])
In [30]: arr[::-1,:]
Out[30]:
array([[1, 2, 2],
[1, 1, 2],
[0, 0, 1]])
In [31]: arr[::-1,:].ravel()
Out[31]: array([1, 2, 2, 1, 1, 2, 0, 0, 1])
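If an actual iterator is wanted rather than a flattened array, the reversed view can be wrapped in a generator. This is only a sketch; `iter_bottom_up` is a hypothetical helper name, not a NumPy function.

```python
import numpy as np

def iter_bottom_up(a):
    """Yield the elements row by row, starting from the last row."""
    arr = np.asarray(a)           # also accepts np.matrix input
    yield from arr[::-1].flat     # reversed-row view, iterated in C order

arr = np.array([[0, 0, 1],
                [1, 1, 2],
                [1, 2, 2]])

print([int(v) for v in iter_bottom_up(arr)])  # [1, 2, 2, 1, 1, 2, 0, 0, 1]
```

`arr[::-1]` is a view, so no data is copied until the elements are consumed.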

What is a faster way to get the location of unique rows in numpy

I have a list of unique rows and another larger array of data (called test_rows in example). I was wondering if there was a faster way to get the location of each unique row in the data. The fastest way that I could come up with is...
import numpy
uniq_rows = numpy.array([[0, 1, 0],
[1, 1, 0],
[1, 1, 1],
[0, 1, 1]])
test_rows = numpy.array([[0, 1, 1],
[0, 1, 0],
[0, 0, 0],
[1, 1, 0],
[0, 1, 0],
[0, 1, 1],
[0, 1, 1],
[1, 1, 1],
[1, 1, 0],
[1, 1, 1],
[0, 1, 0],
[0, 0, 0],
[1, 1, 0]])
# this gives me the indexes of each group of unique rows
for row in uniq_rows.tolist():
    print(row, numpy.where((test_rows == row).all(axis=1))[0])
This prints...
[0, 1, 0] [ 1 4 10]
[1, 1, 0] [ 3 8 12]
[1, 1, 1] [7 9]
[0, 1, 1] [0 5 6]
Is there a better or more numpythonic (not sure if that word exists) way to do this? I was searching for a numpy group function but could not find it. Basically for any incoming dataset I need the fastest way to get the locations of each unique row in that data set. The incoming dataset will not always have every unique row or the same number.
EDIT:
This is just a simple example. In my application the numbers would not be just zeros and ones, they could be anywhere from 0 to 32000. The size of uniq rows could be between 4 to 128 rows and the size of test_rows could be in the hundreds of thousands.
Numpy
From version 1.13 of numpy you can use numpy.unique with the axis argument, e.g. np.unique(test_rows, return_counts=True, return_index=True, axis=0) for unique rows.
Pandas
df = pd.DataFrame(test_rows)
uniq = pd.DataFrame(uniq_rows)
uniq
0 1 2
0 0 1 0
1 1 1 0
2 1 1 1
3 0 1 1
Or you could generate the unique rows automatically from the incoming DataFrame
uniq_generated = df.drop_duplicates().reset_index(drop=True)
yields
0 1 2
0 0 1 1
1 0 1 0
2 0 0 0
3 1 1 0
4 1 1 1
and then look for it
d = dict()
for idx, row in uniq.iterrows():
    d[idx] = df.index[(df == row).all(axis=1)].values
This is about the same as your where method
d
{0: array([ 1, 4, 10], dtype=int64),
1: array([ 3, 8, 12], dtype=int64),
2: array([7, 9], dtype=int64),
3: array([0, 5, 6], dtype=int64)}
There are a lot of solutions here, but I'm adding one with vanilla numpy. In most cases numpy will be faster than list comprehensions and dictionaries, although the array broadcasting may cause memory to be an issue if large arrays are used.
np.where((uniq_rows[:, None, :] == test_rows).all(2))
Wonderfully simple, eh? This returns a tuple of two arrays: the indices into uniq_rows and the indices of the matching rows in test_rows.
(array([0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3]),
array([ 1, 4, 10, 3, 8, 12, 7, 9, 0, 5, 6]))
How it works:
(uniq_rows[:, None, :] == test_rows)
Uses array broadcasting to compare each row of test_rows with each row of uniq_rows, element by element. This results in a 4x13x3 boolean array. all(2) reduces over the last axis to determine which row pairs are equal (all comparisons returned true). Finally, where returns the indices of these pairs.
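As a sketch of how the two index arrays can be turned back into per-row groups like the output in the question, using the question's data. The `groups` dictionary is my own addition for illustration.

```python
import numpy as np

uniq_rows = np.array([[0, 1, 0], [1, 1, 0], [1, 1, 1], [0, 1, 1]])
test_rows = np.array([[0, 1, 1], [0, 1, 0], [0, 0, 0], [1, 1, 0], [0, 1, 0],
                      [0, 1, 1], [0, 1, 1], [1, 1, 1], [1, 1, 0], [1, 1, 1],
                      [0, 1, 0], [0, 0, 0], [1, 1, 0]])

# (4, 1, 3) == (13, 3) broadcasts to (4, 13, 3); all(axis=2) gives a (4, 13) match table
matches = (uniq_rows[:, None, :] == test_rows).all(axis=2)
u_idx, t_idx = np.where(matches)

# Collect, for each row of uniq_rows, the positions where it occurs in test_rows
groups = {i: t_idx[u_idx == i].tolist() for i in range(len(uniq_rows))}
print(groups)  # {0: [1, 4, 10], 1: [3, 8, 12], 2: [7, 9], 3: [0, 5, 6]}
```

The intermediate boolean array has len(uniq_rows) * len(test_rows) * n_columns entries, which is where the memory concern mentioned above comes from.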
With the np.unique from v1.13 (downloaded from the source link on the latest documentation, https://github.com/numpy/numpy/blob/master/numpy/lib/arraysetops.py#L112-L247)
In [157]: aset.unique(test_rows, axis=0,return_inverse=True,return_index=True)
Out[157]:
(array([[0, 0, 0],
[0, 1, 0],
[0, 1, 1],
[1, 1, 0],
[1, 1, 1]]),
array([2, 1, 0, 3, 7], dtype=int32),
array([2, 1, 0, 3, 1, 2, 2, 4, 3, 4, 1, 0, 3], dtype=int32))
In [158]: a,b,c=_
In [159]: c
Out[159]: array([2, 1, 0, 3, 1, 2, 2, 4, 3, 4, 1, 0, 3], dtype=int32)
In [164]: from collections import defaultdict
In [165]: dd = defaultdict(list)
In [166]: for i,v in enumerate(c):
...: dd[v].append(i)
...:
In [167]: dd
Out[167]:
defaultdict(list,
{0: [2, 11],
1: [1, 4, 10],
2: [0, 5, 6],
3: [3, 8, 12],
4: [7, 9]})
or indexing the dictionary with the unique rows (as hashable tuple):
In [170]: dd = defaultdict(list)
In [171]: for i,v in enumerate(c):
...: dd[tuple(a[v])].append(i)
...:
In [172]: dd
Out[172]:
defaultdict(list,
{(0, 0, 0): [2, 11],
(0, 1, 0): [1, 4, 10],
(0, 1, 1): [0, 5, 6],
(1, 1, 0): [3, 8, 12],
(1, 1, 1): [7, 9]})
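The Python loop over the inverse array can probably be avoided as well. The following is a sketch of one way to do it, assuming a NumPy version where np.unique accepts axis (1.13 or later); the stable argsort plus np.split is my suggestion, not part of the answer above.

```python
import numpy as np

test_rows = np.array([[0, 1, 1], [0, 1, 0], [0, 0, 0], [1, 1, 0], [0, 1, 0],
                      [0, 1, 1], [0, 1, 1], [1, 1, 1], [1, 1, 0], [1, 1, 1],
                      [0, 1, 0], [0, 0, 0], [1, 1, 0]])

uniq, inverse = np.unique(test_rows, axis=0, return_inverse=True)
inverse = inverse.ravel()                   # guard against versions returning a 2D inverse

order = np.argsort(inverse, kind="stable")  # row indices, grouped by unique-row id
counts = np.bincount(inverse)               # size of each group
index_groups = np.split(order, np.cumsum(counts)[:-1])

groups = {tuple(row.tolist()): idx.tolist() for row, idx in zip(uniq, index_groups)}
print(groups[(0, 1, 0)])  # [1, 4, 10]
```

The stable sort keeps the indices inside each group in ascending order.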
This will do the job:
import numpy as np
uniq_rows = np.array([[0, 1, 0],
[1, 1, 0],
[1, 1, 1],
[0, 1, 1]])
test_rows = np.array([[0, 1, 1],
[0, 1, 0],
[0, 0, 0],
[1, 1, 0],
[0, 1, 0],
[0, 1, 1],
[0, 1, 1],
[1, 1, 1],
[1, 1, 0],
[1, 1, 1],
[0, 1, 0],
[0, 0, 0],
[1, 1, 0]])
indices=np.where(np.sum(np.abs(np.repeat(uniq_rows,len(test_rows),axis=0)-np.tile(test_rows,(len(uniq_rows),1))),axis=1)==0)[0]
loc=indices//len(test_rows)
indices=indices-loc*len(test_rows)
res=[[] for i in range(len(uniq_rows))]
for i in range(len(indices)):
    res[loc[i]].append(indices[i])
print(res)
[[1, 4, 10], [3, 8, 12], [7, 9], [0, 5, 6]]
This will work for all the cases including the cases in which not all the rows in uniq_rows are present in test_rows. However, if somehow you know ahead that all of them are present, you could replace the part
res=[[] for i in range(len(uniq_rows))]
for i in range(len(indices)):
    res[loc[i]].append(indices[i])
with just the row:
res=np.split(indices,np.where(np.diff(loc)>0)[0]+1)
Thus avoiding loops entirely.
Not very 'numpythonic', but for a bit of an upfront cost, we can make a dict with the keys as a tuple of your row, and a list of indices:
test_rowsdict = {}
for i,j in enumerate(test_rows):
    test_rowsdict.setdefault(tuple(j),[]).append(i)
test_rowsdict
{(0, 0, 0): [2, 11],
(0, 1, 0): [1, 4, 10],
(0, 1, 1): [0, 5, 6],
(1, 1, 0): [3, 8, 12],
(1, 1, 1): [7, 9]}
Then you can filter based on your uniq_rows, with a fast dict lookup: test_rowsdict[tuple(row)]:
out = []
for i in uniq_rows:
    out.append((i, test_rowsdict.get(tuple(i),[])))
For your data, I get 16us for just the lookup, and 66us for building and looking up, versus 95us for your np.where solution.
Approach #1
Here's one approach, not sure about the level of "NumPythonic-ness" though to such a tricky problem -
def get1Ds(a, b): # Get 1D views of each row from the two inputs
    # check that casting to void will create equal size elements
    assert a.shape[1:] == b.shape[1:]
    assert a.dtype == b.dtype
    # compute dtypes
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    # convert to 1d void arrays
    a = np.ascontiguousarray(a)
    b = np.ascontiguousarray(b)
    a_void = a.reshape(a.shape[0], -1).view(void_dt).ravel()
    b_void = b.reshape(b.shape[0], -1).view(void_dt).ravel()
    return a_void, b_void

def matching_row_indices(uniq_rows, test_rows):
    A, B = get1Ds(uniq_rows, test_rows)
    validA_mask = np.in1d(A,B)
    sidx_A = A.argsort()
    validA_mask = validA_mask[sidx_A]
    sidx = B.argsort()
    sortedB = B[sidx]
    split_idx = np.flatnonzero(sortedB[1:] != sortedB[:-1])+1
    all_split_indx = np.split(sidx, split_idx)
    match_mask = np.in1d(B,A)[sidx]
    valid_mask = np.logical_or.reduceat(match_mask, np.r_[0, split_idx])
    locations = [e for i,e in enumerate(all_split_indx) if valid_mask[i]]
    return uniq_rows[sidx_A[validA_mask]], locations
Scope(s) of improvement (on performance) :
np.split could be replaced by a for-loop for splitting using slicing.
np.r_ could be replaced by np.concatenate.
Sample run -
In [331]: unq_rows, idx = matching_row_indices(uniq_rows, test_rows)
In [332]: unq_rows
Out[332]:
array([[0, 1, 0],
[0, 1, 1],
[1, 1, 0],
[1, 1, 1]])
In [333]: idx
Out[333]: [array([ 1, 4, 10]),array([0, 5, 6]),array([ 3, 8, 12]),array([7, 9])]
Approach #2
Another approach to beat the setup overhead from the previous one and making use of get1Ds from it, would be -
A, B = get1Ds(uniq_rows, test_rows)
idx_group = []
for row in A:
    idx_group.append(np.flatnonzero(B == row))
The numpy_indexed package (disclaimer: I am its author) was created to solve problems of this kind in an elegant and efficient manner:
import numpy_indexed as npi
indices = np.arange(len(test_rows))
unique_test_rows, index_groups = npi.group_by(test_rows, indices)
If you don't care about the indices of all rows, but only those present in test_rows, npi has a bunch of simple ways of tackling that problem too; for instance:
subset_indices = npi.indices(unique_test_rows, uniq_rows)
As a sidenote; it might be useful to take a look at the examples in the npi library; in my experience, most of the time people ask a question of this kind, these grouped indices are just a means to an end, and not the endgoal of the computation. Chances are that using the functionality in npi you can reach that end goal more efficiently, without ever explicitly computing those indices. Do you care to give some more background to your problem?
EDIT: if your arrays are indeed this big, and always consist of a small number of columns with binary values, wrapping them with the following encoding might boost efficiency a lot further still:
def encode(rows):
    return (rows * [[2**i for i in range(rows.shape[1])]]).sum(axis=1, dtype=np.uint8)

Getting the indexes to the duplicate columns of a numpy array [duplicate]

This question already has answers here:
Find unique columns and column membership
(3 answers)
Closed 8 years ago.
I have a numpy array with duplicate columns:
import numpy as np
A = np.array([[1, 1, 1, 0, 1, 1],
[1, 2, 2, 0, 1, 2],
[1, 3, 3, 0, 1, 3]])
I need to find the indexes to those duplicates or something like that:
[0, 4]
[1, 2, 5]
I have a hard time dealing with indexes in Python. I really don't know how to approach it.
Thanks
I tried identifying the unique columns first with this function:
def unique_columns(data):
    ind = np.lexsort(data)
    return data.T[ind[np.concatenate(([True], np.any(data.T[ind[1:]]!=data.T[ind[:-1]], axis=1)))]].T
But I can't figure out the indexes from there.
Unfortunately, there is no simple way to do this. Here is an answer based on np.unique. This method requires that the axis you want to make unique is contiguous in memory, and numpy's typical memory layout is C-contiguous, i.e. contiguous along rows, so the columns have to be made contiguous first. Fortunately numpy makes this conversion simple:
A = np.array([[1, 1, 1, 0, 1, 1],
[1, 2, 2, 0, 1, 2],
[1, 3, 3, 0, 1, 3]])
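def unique_columns2(data):
    # one void element spans a whole column (data.shape[0] items)
    dt = np.dtype((np.void, data.dtype.itemsize * data.shape[0]))
    # a C-contiguous copy of data.T stores each column as one contiguous run;
    # recent NumPy versions may refuse the size-changing view on an F-ordered array
    dataf = np.ascontiguousarray(data.T).view(dt).ravel()
    u,uind = np.unique(dataf, return_inverse=True)
    u = u.view(data.dtype).reshape(-1,data.shape[0]).T
    return (u,uind)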
Our result:
u,uind = unique_columns2(A)
u
array([[0, 1, 1],
[0, 1, 2],
[0, 1, 3]])
uind
array([1, 2, 2, 0, 1, 2])
I am not really sure what you want to do from here, for example you can do something like this:
>>> [np.where(uind==x)[0] for x in range(u.shape[0])]
[array([3]), array([0, 4]), array([1, 2, 5])]
Some timings:
tmp = np.random.randint(0,4,(30000,500))
#BiRico and OP's answer
%timeit unique_columns(tmp)
1 loops, best of 3: 2.91 s per loop
%timeit unique_columns2(tmp)
1 loops, best of 3: 208 ms per loop
Here is an outline of how to approach it. Use numpy.lexsort to sort the columns, that way all the duplicates will be grouped together. Once the duplicates are all together, you can easily tell which columns are duplicates and the indices that correspond with those columns.
Here's an implementation of the method described above.
import numpy as np
def duplicate_columns(data, minoccur=2):
    ind = np.lexsort(data)
    diff = np.any(data.T[ind[1:]] != data.T[ind[:-1]], axis=1)
    edges = np.where(diff)[0] + 1
    result = np.split(ind, edges)
    result = [group for group in result if len(group) >= minoccur]
    return result
A = np.array([[1, 1, 1, 0, 1, 1],
[1, 2, 2, 0, 1, 2],
[1, 3, 3, 0, 1, 3]])
print(duplicate_columns(A))
# [array([0, 4]), array([1, 2, 5])]
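On NumPy 1.13 or later, np.unique with axis=1 offers a shorter route to the same grouping. The following is a sketch under that version assumption, not one of the original answers; the name `dupes` is mine.

```python
import numpy as np

A = np.array([[1, 1, 1, 0, 1, 1],
              [1, 2, 2, 0, 1, 2],
              [1, 3, 3, 0, 1, 3]])

# inverse[j] is the id of the unique column that column j of A equals
uniq_cols, inverse, counts = np.unique(
    A, axis=1, return_inverse=True, return_counts=True)
inverse = inverse.ravel()   # guard against versions returning a 2D inverse

# keep only the ids that occur more than once and list their column indices
dupes = [np.flatnonzero(inverse == k) for k in np.flatnonzero(counts > 1)]
print([d.tolist() for d in dupes])  # [[0, 4], [1, 2, 5]]
```

Column 3 is dropped because it occurs only once, matching the minoccur=2 default above.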
