I have a huge .csv file (~2GB) that I import into my program with read_csv and then convert to a numpy matrix with as_matrix. The generated matrix has the same form as data_mat in the example below. My problem is that I need to extract the blocks that share the same uuid4 (the entry in the first column of the matrix). The submatrices are then processed by another function. My example below does not seem to be the best way of doing this. Faster methods are welcome.
import numpy as np
data_mat = np.array([['f9f1dc71-9457-4d17-b5d1-e63b5a766f84', 4, 3, 1],\
['f9f1dc71-9457-4d17-b5d1-e63b5a766f84', 3, 1, 1],\
['f9f1dc71-9457-4d17-b5d1-e63b5a766f84', 3, 3, 1],\
['f9f1dc71-9457-4d17-b5d1-e63b5a766f84', 6, 1, 1],\
['f35fb25b-dddc-458a-9f71-0a9c2c202719', 3, 4, 1],\
['f35fb25b-dddc-458a-9f71-0a9c2c202719', 3, 1, 1],\
['a4cf92fc-0624-4a00-97f6-0d21547e3183', 3, 2, 1],\
['a4cf92fc-0624-4a00-97f6-0d21547e3183', 3, 9, 0],\
['a4cf92fc-0624-4a00-97f6-0d21547e3183', 3, 1, 0],\
['a4cf92fc-0624-4a00-97f6-0d21547e3183', 5, 1, 1],\
['a4cf92fc-0624-4a00-97f6-0d21547e3183', 3, 1, 1],\
['d3a8a9d0-4380-42e3-b35f-733a9f9770da', 3, 6, 10]],dtype=object)
unique_ids, indices = np.unique(data_mat[:,0],return_index=True,axis=None)
length = len(data_mat)
i=0
for idd in unique_ids:
    index = indices[i]
    k=0
    while ((index+k)<length and idd == data_mat[index+k,0]):
        k+=1
    tmp_mat=data_mat[index:(index+k),:]
    # do something with tmp_mat ...
    print(tmp_mat)
    i+=1
To optimize, the idea is to minimize the computation done inside the loop. With that in mind, we rearrange the rows of the array so that they are sorted by the first column. Then we get the indices that define the group boundaries. Finally, we start our loop and simply slice out each group to get a submatrix at each iteration. Slicing is virtually free when working with arrays, because a slice is a view rather than a copy, so that should help us.
Thus, one implementation would be -
a0 = data_mat[:,0]
sidx = a0.argsort()
sd = data_mat[sidx] # sorted data_mat
idx = np.flatnonzero(np.concatenate(( [True], sd[1:,0] != sd[:-1,0], [True] )))
for i,j in zip(idx[:-1], idx[1:]):
    tmp_mat = sd[i:j]
    print(tmp_mat)
If you are looking to store each submatrix as an array to have a list of arrays as the final output, simply do -
[sd[i:j] for i,j in zip(idx[:-1], idx[1:])]
For sorted data_mat
For a case with data_mat already being sorted as shown in the sample, we could avoid sorting the entire array and directly use the first column, like so -
a0 = data_mat[:,0]
idx = np.flatnonzero(np.concatenate(( [True], a0[1:] != a0[:-1], [True] )))
for i,j in zip(idx[:-1], idx[1:]):
    tmp_mat = data_mat[i:j]
    print(tmp_mat)
Again, to get all those submatrices as a list of arrays, use -
[data_mat[i:j] for i,j in zip(idx[:-1], idx[1:])]
Note that the submatrices that we would get with this one would be in a different order than with the sorting done in the previous approach.
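If the data is not already grouped and the original row order inside each group matters, a stable sort keeps it. The following is only a minimal sketch under that assumption; the helper name group_rows and the toy array are made up for illustration and are not part of the answer above.

```python
import numpy as np

def group_rows(mat):
    """Split mat into sub-arrays that share the same value in column 0."""
    keys = mat[:, 0]
    # A stable sort keeps rows with equal keys in their original relative order.
    order = keys.argsort(kind='stable')
    sorted_mat = mat[order]
    sorted_keys = sorted_mat[:, 0]
    # True wherever a new key starts, plus sentinels at both ends.
    change = np.concatenate(([True], sorted_keys[1:] != sorted_keys[:-1], [True]))
    bounds = np.flatnonzero(change)
    return [sorted_mat[i:j] for i, j in zip(bounds[:-1], bounds[1:])]

# Hypothetical toy data: short string ids instead of uuid4 values.
toy = np.array([['b', 1], ['a', 2], ['b', 3], ['a', 4], ['c', 5]], dtype=object)
groups = group_rows(toy)
```

The groups come out ordered by key, as in the sorting approach, while rows within a group stay in input order.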
Benchmarking for sorted data_mat
Approaches -
# @Daniel F's soln-2
def split_app(data_mat):
    idx = np.flatnonzero(data_mat[1:, 0] != data_mat[:-1, 0]) + 1
    return np.split(data_mat, idx)
# Proposed in this post
def zip_app(data_mat):
    a0 = data_mat[:,0]
    idx = np.flatnonzero(np.concatenate(( [True], a0[1:] != a0[:-1], [True] )))
    return [data_mat[i:j] for i,j in zip(idx[:-1], idx[1:])]
Timings -
In the sample we had a submatrix of max length 6. So, let's extend to a bigger case keeping it with the same pattern -
In [442]: a = np.random.randint(0,100000,(6*100000,4)); a[:,0].sort()
In [443]: %timeit split_app(a)
10 loops, best of 3: 88.8 ms per loop
In [444]: %timeit zip_app(a)
10 loops, best of 3: 40.2 ms per loop
In [445]: a = np.random.randint(0,1000000,(6*1000000,4)); a[:,0].sort()
In [446]: %timeit split_app(a)
1 loop, best of 3: 917 ms per loop
In [447]: %timeit zip_app(a)
1 loop, best of 3: 414 ms per loop
You can do this with boolean indexing.
unique_ids = np.unique(data_mat[:, 0])
masks = np.equal.outer(unique_ids, data_mat[:, 0])
for mask in masks:
    tmp_mat = data_mat[mask]
    # do something with tmp_mat ...
    print(tmp_mat)
If the unique ids are already grouped, you could do this with np.split, similar to @Divakar:
idx = np.flatnonzero(data_mat[1:, 0] != data_mat[:-1, 0]) + 1
for tmp_mat in np.split(data_mat, idx):
    # do something with tmp_mat ...
    print(tmp_mat)
I want to create an array of numbers from 1 to n without the number x,
is there a "prettier" way to do it instead of [i for i in range(n) if i != x]?
thanks!
Using advanced indexing with np.r_.
np.arange(n)[np.r_[0:x, x+1:n]]
def myrange(n, exclude):
    return np.arange(n)[np.r_[0:exclude, exclude+1:n]]
>>> myrange(10, exclude=3)
array([0, 1, 2, 4, 5, 6, 7, 8, 9])
Timings
%timeit myrange(10000000, 7008)
1 loop, best of 3: 79.1 ms per loop
%timeit other(10000000, 7008)
1 loop, best of 3: 952 ms per loop
where
def myrange(n, exclude):
    return np.arange(n)[np.r_[0:exclude, exclude+1:n]]
def other(n, exclude):
    return [i for i in range(n) if i != exclude]
You can concatenate two ranges:
np.concatenate((np.arange(x), np.arange(x + 1, n)))
You can also delete an item:
np.delete(np.arange(n), x)
You can mask:
mask = np.ones(n, dtype=bool)
mask[x] = False
np.arange(n)[mask]
You can even use a masked array, depending on your application:
a = np.ma.array(np.arange(n), mask=np.zeros(n, dtype=bool))
a.mask[x] = True
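As a quick sanity check, the variants above can be compared side by side. This is a small sketch with made-up values for n and x; it marks the masked entry with np.ma.masked and calls compressed() to drop it, which is one way (not the only one) to get a plain array back from a masked array.

```python
import numpy as np

n, x = 10, 3  # illustrative values

via_concat = np.concatenate((np.arange(x), np.arange(x + 1, n)))
via_delete = np.delete(np.arange(n), x)

mask = np.ones(n, dtype=bool)
mask[x] = False
via_mask = np.arange(n)[mask]

masked = np.ma.array(np.arange(n), mask=np.zeros(n, dtype=bool))
masked[x] = np.ma.masked          # mask out position x
via_masked = masked.compressed()  # 1-D array of the unmasked entries
```

All four results hold the numbers 0..n-1 without x; the masked array additionally keeps the original length until compressed() is called.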
I would suggest using itertools.chain :
for a in itertools.chain(range(x), range(x+1, n)):
    print(a)
or
list(itertools.chain(range(x), range(x+1, n)))
I don't know why, but [itertools.chain(range(x), range(x+1, n))] doesn't work though.
EDIT: Thanks to @rafaelc for how to make it work with square brackets.
[*itertools.chain(range(x), range(x+1, n))]
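A likely explanation for the behaviour above, shown as a small sketch with made-up values for n and x: square brackets around an expression build a list containing that one object, so the chain iterator itself becomes the single element. list(...) and the * unpacking both consume the iterator instead.

```python
import itertools

n, x = 6, 2  # illustrative values

# A one-element list whose only item is the (unconsumed) chain object.
wrapped = [itertools.chain(range(x), range(x + 1, n))]

# Both of these iterate over the chain and collect its items.
unpacked = [*itertools.chain(range(x), range(x + 1, n))]
as_list = list(itertools.chain(range(x), range(x + 1, n)))
```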
Suppose I have 2 matrices M and N (both have > 1 columns). I also have an index matrix w with 2 columns: one for M and one for N. The indices for N are unique, but the indices for M may appear more than once. The operation I would like to perform is,
for i,j in w:
    M[i] += N[j]
Is there a more efficient way to do this other than a for loop?
For completeness, in numpy >= 1.8 you can also use np.add's at method:
In [8]: m, n = np.random.rand(2, 10)
In [9]: m_idx, n_idx = np.random.randint(10, size=(2, 20))
In [10]: m0 = m.copy()
In [11]: np.add.at(m, m_idx, n[n_idx])
In [13]: m0 += np.bincount(m_idx, weights=n[n_idx], minlength=len(m))
In [14]: np.allclose(m, m0)
Out[14]: True
In [15]: %timeit np.add.at(m, m_idx, n[n_idx])
100000 loops, best of 3: 9.49 us per loop
In [16]: %timeit np.bincount(m_idx, weights=n[n_idx], minlength=len(m))
1000000 loops, best of 3: 1.54 us per loop
Aside from the obvious performance disadvantage, it has a couple of advantages:
np.bincount converts its weights to double precision floats, while .at will operate with your array's native type. This makes it the simplest option for dealing with e.g. complex numbers.
np.bincount only adds weights together, whereas you have an at method for all ufuncs, so you can repeatedly multiply, or logical_and, or whatever you feel like.
But for your use case, np.bincount is probably the way to go.
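One more case where .at is handy: the question mentions matrices with more than one column, and np.bincount only accepts 1-D weights. The sketch below uses small made-up M, N and w to check np.add.at against the plain loop on 2-D data; it is an illustration, not a benchmark.

```python
import numpy as np

# Hypothetical 2-D data: M has 3 rows, N has 4 rows, both with 2 columns.
M = np.zeros((3, 2))
N = np.arange(8, dtype=float).reshape(4, 2)
w = np.array([[0, 0], [2, 1], [0, 3], [1, 2]])  # columns: row of M, row of N

# Reference result from the explicit loop.
expected = M.copy()
for i, j in w:
    expected[i] += N[j]

# Unbuffered in-place add: repeated rows of M accumulate correctly.
m_ind, n_ind = w.T
np.add.at(M, m_ind, N[n_ind])
```

Here row 0 of M receives both N[0] and N[3], which a plain M[m_ind] += N[n_ind] would not accumulate.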
Using also m_ind, n_ind = w.T, just do M += np.bincount(m_ind, weights=N[n_ind], minlength=len(M))
For clarity, let's define
>>> m_ind, n_ind = w.T
Then the for loop
for i, j in zip(m_ind, n_ind):
    M[i] += N[j]
updates the entries M[np.unique(m_ind)]. The values that get written to it are N[n_ind], which must be grouped by m_ind. (The fact that there's an n_ind in addition to m_ind is actually tangential to the question; you could just set N = N[n_ind].) There happens to be a SciPy class that does exactly this: scipy.sparse.csr_matrix.
Example data:
>>> m_ind, n_ind = np.array([[0, 0, 1, 1], [2, 3, 0, 1]])
>>> M = np.arange(2, 6)
>>> N = np.logspace(2, 5, 4)
The result of the for loop is that M becomes [110002 1103 4 5]. We get the same result with a csr_matrix as follows. As I said earlier, n_ind isn't relevant, so we get rid of that first.
>>> N = N[n_ind]
>>> from scipy.sparse import csr_matrix
>>> update = csr_matrix((N, m_ind, [0, len(N)])).toarray()
The CSR constructor builds a matrix with the required values at the required indices; the third part of its argument is the row pointer (indptr) of the compressed format, which here says that the single row holds the values N[0:len(N)] at the column indices m_ind[0:len(N)]. Duplicates are summed:
>>> update
array([[ 110000., 1100.]])
This has shape (1, len(np.unique(m_ind))) and can be added in directly:
>>> M[np.unique(m_ind)] += update.ravel()
>>> M
array([110002, 1103, 4, 5])
Given a list of numpy arrays, each with the same dimensions, how can I find which array contains the maximum value on an element-by-element basis?
e.g.
import numpy as np
def find_index_where_max_occurs(my_list):
    # d = ... something goes here ...
    return d
a=np.array([1,1,3,1])
b=np.array([3,1,1,1])
c=np.array([1,3,1,1])
my_list=[a,b,c]
array_of_indices_where_max_occurs = find_index_where_max_occurs(my_list)
# This is what I want:
# >>> print array_of_indices_where_max_occurs
# array([1,2,0,0])
# i.e. for the first element, the maximum value occurs in array b which is at index 1 in my_list.
Any help would be much appreciated... thanks!
Another option if you want an array:
>>> np.array((a, b, c)).argmax(axis=0)
array([1, 2, 0, 0])
So:
def f(my_list):
    return np.array(my_list).argmax(axis=0)
This works with multidimensional arrays, too.
For the fun of it, I realised that @Lev's original answer was faster than his generalized edit, so this is the generalized stacking version. It is much faster than the np.asarray version, but it is not very elegant.
np.concatenate((a[None,...], b[None,...], c[None,...]), axis=0).argmax(0)
That is:
def bystack(arrs):
    return np.concatenate([arr[None,...] for arr in arrs], axis=0).argmax(0)
Some explanation:
I've added a new axis to each array: arr[None,...] is equivalent to arr[np.newaxis,...], which for a 3-d array is the same as arr[np.newaxis,:,:,:]; the ... expands to the appropriate number of dimensions. The reason for this is that np.concatenate will then stack along the new dimension, which is 0 since the None is at the front.
So, for example:
In [286]: a
Out[286]:
array([[0, 1],
[2, 3]])
In [287]: b
Out[287]:
array([[10, 11],
[12, 13]])
In [288]: np.concatenate((a[None,...],b[None,...]),axis=0)
Out[288]:
array([[[ 0, 1],
[ 2, 3]],
[[10, 11],
[12, 13]]])
In case it helps to understand, this would work too:
np.concatenate((a[...,None], b[...,None], c[...,None]), axis=a.ndim).argmax(a.ndim)
where the new axis is now added at the end, so we must stack and maximize along that last axis, which will be a.ndim. For a, b, and c being 2d, we could do this:
np.concatenate((a[:,:,None], b[:,:,None], c[:,:,None]), axis=2).argmax(2)
Which is equivalent to the dstack I mentioned in my comment above (dstack adds a third axis to stack along if it doesn't exist in the arrays).
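To make the equivalence concrete, here is a small sketch with made-up 2-d arrays. It compares the front-axis concatenate, the dstack variant, and np.stack; np.stack is not used in the answer above and is included here only as an assumption that a reasonably recent numpy (1.10 or later) is available.

```python
import numpy as np

# Hypothetical 2-d inputs.
a = np.array([[0, 9], [2, 3]])
b = np.array([[10, 1], [1, 13]])
c = np.array([[5, 5], [12, 4]])

# New axis at the front, stack and maximize along axis 0.
front = np.concatenate((a[None, ...], b[None, ...], c[None, ...]), axis=0).argmax(0)

# New axis at the end (what dstack does for 2-d inputs), maximize along axis 2.
back = np.dstack((a, b, c)).argmax(2)

# np.stack adds the new axis itself.
stacked = np.stack((a, b, c), axis=0).argmax(axis=0)
```

All three give the same index array, since they differ only in where the stacking axis sits.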
To test:
N = 10
M = 2
a = np.random.random((N,)*M)
b = np.random.random((N,)*M)
c = np.random.random((N,)*M)
def bystack(arrs):
    return np.concatenate([arr[None,...] for arr in arrs], axis=0).argmax(0)
def byarray(arrs):
    return np.array(arrs).argmax(axis=0)
def byasarray(arrs):
    return np.asarray(arrs).argmax(axis=0)
def bylist(arrs):
    assert arrs[0].ndim == 1, "ndim must be 1"
    return [np.argmax(x) for x in zip(*arrs)]
In [240]: timeit bystack((a,b,c))
100000 loops, best of 3: 18.3 us per loop
In [241]: timeit byarray((a,b,c))
10000 loops, best of 3: 89.7 us per loop
In [242]: timeit byasarray((a,b,c))
10000 loops, best of 3: 90.0 us per loop
In [259]: timeit bylist((a,b,c))
1000 loops, best of 3: 267 us per loop
[np.argmax(x) for x in zip(*my_list)]
Well, this is just a list, but you know how to make it an array if you want. :)
To explain what this does: zip(*my_list) is equivalent to zip(a,b,c), which gives you a generator to loop over. Each step in the loop gives you a tuple like (a[i], b[i], c[i]), where i is the step in the loop. Then, np.argmax gives you the index of that tuple for the element with the largest value.
I have a 2D array containing integers (both positive or negative). Each row represents the values over time for a particular spatial site, whereas each column represents values for various spatial sites for a given time.
So if the array is like:
1 3 4 2 2 7
5 2 2 1 4 1
3 3 2 2 1 1
The result should be
1 3 2 2 2 1
Note that when there are multiple values for mode, any one (selected randomly) may be set as mode.
I can iterate over the columns finding mode one at a time but I was hoping numpy might have some in-built function to do that. Or if there is a trick to find that efficiently without looping.
Check scipy.stats.mode() (inspired by @tom10's comment):
import numpy as np
from scipy import stats
a = np.array([[1, 3, 4, 2, 2, 7],
[5, 2, 2, 1, 4, 1],
[3, 3, 2, 2, 1, 1]])
m = stats.mode(a)
print(m)
Output:
ModeResult(mode=array([[1, 3, 2, 2, 1, 1]]), count=array([[1, 2, 2, 2, 1, 2]]))
As you can see, it returns both the mode as well as the counts. You can select the modes directly via m[0]:
print(m[0])
Output:
[[1 3 2 2 1 1]]
Update
The scipy.stats.mode function has been significantly optimized since this post, and would be the recommended method.
Old answer
This is a tricky problem, since there is not much out there to calculate the mode along an axis. The solution is straightforward for 1-D arrays, where numpy.bincount is handy, along with numpy.unique with the return_counts argument set to True. The most common n-dimensional function I see is scipy.stats.mode, although it is prohibitively slow, especially for large arrays with many unique values. As a solution, I've developed this function, and use it heavily:
import numpy
def mode(ndarray, axis=0):
    # Check inputs
    ndarray = numpy.asarray(ndarray)
    ndim = ndarray.ndim
    if ndarray.size == 1:
        return (ndarray[0], 1)
    elif ndarray.size == 0:
        raise Exception('Cannot compute mode on empty array')
    try:
        axis = range(ndarray.ndim)[axis]
    except (IndexError, TypeError):
        raise Exception('Axis "{}" incompatible with the {}-dimension array'.format(axis, ndim))
    # If array is 1-D and numpy version is >= 1.9 numpy.unique will suffice
    # (compare the version as a tuple so that e.g. 2.0 is handled correctly)
    version = tuple(int(part) for part in numpy.__version__.split('.')[:2])
    if ndim == 1 and version >= (1, 9):
        modals, counts = numpy.unique(ndarray, return_counts=True)
        index = numpy.argmax(counts)
        return modals[index], counts[index]
    # Sort array
    sort = numpy.sort(ndarray, axis=axis)
    # Create array to transpose along the axis and get padding shape
    transpose = numpy.roll(numpy.arange(ndim)[::-1], axis)
    shape = list(sort.shape)
    shape[axis] = 1
    # Create a boolean array along strides of unique values
    strides = numpy.concatenate([numpy.zeros(shape=shape, dtype='bool'),
                                 numpy.diff(sort, axis=axis) == 0,
                                 numpy.zeros(shape=shape, dtype='bool')],
                                axis=axis).transpose(transpose).ravel()
    # Count the stride lengths
    counts = numpy.cumsum(strides)
    counts[~strides] = numpy.concatenate([[0], numpy.diff(counts[~strides])])
    counts[strides] = 0
    # Get shape of padded counts and slice to return to the original shape
    shape = numpy.array(sort.shape)
    shape[axis] += 1
    shape = shape[transpose]
    slices = [slice(None)] * ndim
    slices[axis] = slice(1, None)
    # Reshape and compute final counts (index with a tuple; newer numpy rejects a list of slices)
    counts = counts.reshape(shape).transpose(transpose)[tuple(slices)] + 1
    # Find maximum counts and return modals/counts
    slices = [slice(None, i) for i in sort.shape]
    del slices[axis]
    index = list(numpy.ogrid[tuple(slices)])
    index.insert(axis, numpy.argmax(counts, axis=axis))
    return sort[tuple(index)], counts[tuple(index)]
Result:
In [2]: a = numpy.array([[1, 3, 4, 2, 2, 7],
[5, 2, 2, 1, 4, 1],
[3, 3, 2, 2, 1, 1]])
In [3]: mode(a)
Out[3]: (array([1, 3, 2, 2, 1, 1]), array([1, 2, 2, 2, 1, 2]))
Some benchmarks:
In [4]: import scipy.stats
In [5]: a = numpy.random.randint(1,10,(1000,1000))
In [6]: %timeit scipy.stats.mode(a)
10 loops, best of 3: 41.6 ms per loop
In [7]: %timeit mode(a)
10 loops, best of 3: 46.7 ms per loop
In [8]: a = numpy.random.randint(1,500,(1000,1000))
In [9]: %timeit scipy.stats.mode(a)
1 loops, best of 3: 1.01 s per loop
In [10]: %timeit mode(a)
10 loops, best of 3: 80 ms per loop
In [11]: a = numpy.random.random((200,200))
In [12]: %timeit scipy.stats.mode(a)
1 loops, best of 3: 3.26 s per loop
In [13]: %timeit mode(a)
1000 loops, best of 3: 1.75 ms per loop
EDIT: Provided more of a background and modified the approach to be more memory-efficient
If you want to use numpy only:
x = [-1, 2, 1, 3, 3]
vals,counts = np.unique(x, return_counts=True)
gives
(array([-1, 1, 2, 3]), array([1, 1, 1, 2]))
And extract it:
index = np.argmax(counts)
mode = vals[index]
A neat solution that only uses numpy (not scipy nor the Counter class):
A = np.array([[1,3,4,2,2,7], [5,2,2,1,4,1], [3,3,2,2,1,1]])
np.apply_along_axis(lambda x: np.bincount(x).argmax(), axis=0, arr=A)
array([1, 3, 2, 2, 1, 1])
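One caveat: np.bincount only accepts non-negative integers, while the question mentions that values may be negative. A possible workaround, sketched below under that assumption, is to shift the data by its minimum before counting and shift the result back. The helper name column_mode is made up for illustration.

```python
import numpy as np

def column_mode(arr):
    """Mode of each column of a 2-D integer array; ties go to the smallest value."""
    arr = np.asarray(arr)
    offset = arr.min()
    shifted = arr - offset  # now every entry is >= 0, as np.bincount requires
    modes = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, shifted)
    return modes + offset

A = np.array([[1, 3, 4, 2, 2, 7], [5, 2, 2, 1, 4, 1], [3, 3, 2, 2, 1, 1]])
B = np.array([[-1, 3], [-1, 2], [0, 2]])  # hypothetical data with negatives
modes_A = column_mode(A)
modes_B = column_mode(B)
```

Note that the count array has one slot per integer between the minimum and maximum, so this is only sensible when the value range is moderate.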
Expanding on this method, applied to finding the mode of the data where you may need the index of the actual array to see how far away the value is from the center of the distribution.
(_, idx, counts) = np.unique(a, return_index=True, return_counts=True)
index = idx[np.argmax(counts)]
mode = a[index]
Remember to discard the mode when more than one value attains the maximum count, i.e. when np.sum(counts == counts.max()) > 1. Also, to validate whether it is actually representative of the center of your data's distribution, you may check whether it falls inside your standard deviation interval.
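The tie check can be folded into a small helper. This is a sketch only; the function name mode_with_tie_check and its return layout are made up for illustration.

```python
import numpy as np

def mode_with_tie_check(values):
    """Return (mode, index of its first occurrence, whether the mode is unique)."""
    values = np.asarray(values)
    vals, idx, counts = np.unique(values, return_index=True, return_counts=True)
    best = counts.max()
    # More than one value reaching the top count means the mode is ambiguous.
    is_unique = np.count_nonzero(counts == best) == 1
    first_index = idx[np.argmax(counts)]
    return values[first_index], first_index, is_unique

clear_mode, clear_index, clear_unique = mode_with_tie_check([-1, 2, 1, 3, 3])
tied_mode, tied_index, tied_unique = mode_with_tie_check([1, 2, 2, 1])
```

In the tied case np.argmax still returns the first candidate, so the flag is what tells you to discard the result.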
The simplest way in Python to get the mode of a list or array a:
import statistics
a=[7,4,4,4,4,25,25,6,7,4867,5,6,56,52,32,44,4,4,44,4,44,4]
print(f"{statistics.mode(a)} is the mode (most frequently occurring number)")
That's it
I think a very simple way would be to use the Counter class. You can then use the most_common() function of the Counter instance as mentioned here.
For 1-d arrays:
import numpy as np
from collections import Counter
nparr = np.arange(10)
nparr[2] = 6
nparr[3] = 6 #6 is now the mode
mode = Counter(nparr).most_common(1)
# mode will be [(6,3)] to give the count of the most occurring value, so ->
print(mode[0][0])
For multiple dimensional arrays (little difference):
import numpy as np
from collections import Counter
nparr = np.arange(10)
nparr[2] = 6
nparr[3] = 6
nparr = nparr.reshape((2,5)) #same thing but we add this to reshape into a 2-D array
mode = Counter(nparr.flatten()).most_common(1) # just use .flatten() method
# mode will be [(6,3)] to give the count of the most occurring value, so ->
print(mode[0][0])
This may or may not be an efficient implementation, but it is convenient.
from collections import Counter
n = int(input())
data = sorted([int(i) for i in input().split()])
mode = sorted(sorted(Counter(data).items()), key = lambda x: x[1], reverse = True)[0][0]
print(mode)
Counter(data) counts the frequencies and returns a Counter, which is a dict subclass. sorted(Counter(data).items()) sorts by the keys, not by the frequency. Finally, the result is sorted by frequency using another sorted with key = lambda x: x[1]. reverse = True tells Python to sort the frequencies from the largest to the smallest.
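The double sort matters when several values share the top frequency. The sketch below, using a made-up list, shows that sorting by key first and then by count (Python's sort is stable, also with reverse=True) picks the smallest of the tied values, whereas most_common follows the order in which values were first seen.

```python
from collections import Counter

data = [3, 1, 3, 1, 2]  # 3 and 1 both occur twice

by_key = sorted(Counter(data).items())                          # ordered by value
by_count = sorted(by_key, key=lambda kv: kv[1], reverse=True)   # ordered by frequency
mode_sorted = by_count[0][0]

mode_most_common = Counter(data).most_common(1)[0][0]
```

Which tie-breaking rule is preferable depends on the application; the point is only that the two approaches can disagree.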
If you want the mode as an int value, here is the easiest way.
I was trying to find the mode of an array using scipy.stats, but the problem is that the output of the code looks like:
ModeResult(mode=array(2), count=array([[1, 2, 2, 2, 1, 2]])). I only want the integer output, so if you want the same, just try this:
import numpy as np
from scipy import stats
numbers = list(map(int, input().split()))
print(int(stats.mode(numbers)[0]))
The last line is enough to print the mode value in Python: print(int(stats.mode(numbers)[0]))
If you wish to use only numpy and do it without using the index of the array, the following implementation, which combines a dictionary with numpy, can be used. Here mode ends up as a (value, count) pair.
val,count = np.unique(x,return_counts=True)
freq = {}
for v,c in zip(val,count):
    freq[v] = c
mode = sorted(freq.items(),key =lambda kv :kv[1])[-1]
Finding Mode using dict in python
def mode(x):
    d={}
    k=0
    v=0
    for i in x:
        d[i]=d.get(i,0)+1
        if d[i]>v:
            k=i
            v=d[i]
    print(d)
    return k
print(mode(x))