I have a numpy array with integer values, let's call this array x.
I want to create some sort of list where for each value, I have the indices of x that hold this value.
For example, for:
x = [1,2,2,4,7,1,1,7,16]
I want to get:
{1: [0,5,6], 2: [1,2], 4: [3], 7: [4,7], 16: [8]}
The brackets I used are arbitrary; I don't care which data structure I use, as long as I can output my result to a file as quickly as possible. At the end I want a .txt file that reads:
0,5,6
1,2
3
4,7
8
Since you mentioned you're not picky about the data structure of your values, to get something like the dictionary you posted in your question you could do a dictionary comprehension over the unique values in x, using np.where for the values (this assumes x is a numpy array so that x == i compares elementwise):
>>> {i:np.where(x == i)[0] for i in set(x)}
{1: array([0, 5, 6]),
2: array([1, 2]),
4: array([3]),
7: array([4, 7]),
16: array([8])}
Compared to a more vanilla loop through the list, this will be significantly faster for larger arrays:
def list_method(x):
    res = {i: [] for i in set(x)}
    for i, value in enumerate(x):
        res[value].append(i)
    return res
def np_method(x):
    return {i: np.where(x == i)[0] for i in set(x)}
x = np.random.randint(1, 50, 1000000)
In [5]: %timeit list_method(x)
259 ms ± 4.03 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [6]: %timeit np_method(x)
120 ms ± 4.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
A pure Python approach looks like this:
result = {}
for idx, val in enumerate(x):
    arr = result.get(val, [])
    arr.append(idx)
    result[val] = arr
x = [1,2,2,4,7,1,1,7,16]
numlist = []
numdict = {}
c = 0
for n in x:
    if n not in numlist:
        numlist.append(n)
        numdict[n] = [c]
    else:
        numdict[n].append(c)
    c += 1
print(numlist, numdict)
Output is:
[1, 2, 4, 7, 16] {1: [0, 5, 6], 2: [1, 2], 4: [3], 7: [4, 7], 16: [8]}
To write to a file, use:
with open('file.txt', 'w') as f:
    f.write(str(numdict))
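Note that str(numdict) writes the whole dictionary literal. If you want the exact .txt layout shown in the question (one comma-separated line of indices per value), a minimal sketch using the same numdict could be:
with open('file.txt', 'w') as f:
    for value in sorted(numdict):
        # one line per value, indices joined by commas
        f.write(','.join(str(i) for i in numdict[value]) + '\n')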
I have a DataFrame that can be produced using this Python code:
import pandas as pd
df = pd.DataFrame({'visit': [1] * 6 + [2] * 6,
'time': [t for t in range(6)] * 2,
'observations': [o for o in range(12)]})
The following code enables me to reformat the data as desired:
dflist = []
for v_ in df.visit.unique():
    for t_ in df.time[df.visit == v_]:
        dflist.append([df[(df.visit == v_) & (df.time <= t_)].groupby('visit')['observations'].apply(list)])
pd.DataFrame(pd.concat([df[0] for df in dflist], axis=0))
However this is extremely slow.
I have tried using .expanding(); however, this only returns scalars, whereas I would like a list (or numpy array).
I would appreciate any help in vectorizing or otherwise optimizing this procedure.
Thanks
Fortunately, in pandas 1.1.0 and newer, expanding produces an iterable which can be used to take advantage of the faster grouping while producing non-scalar data like lists:
new_df = pd.DataFrame({
    'observations':
        [list(x) for x in df.groupby('visit')['observations'].expanding()]
}, index=df['visit'])
new_df:
observations
visit
1 [0]
1 [0, 1]
1 [0, 1, 2]
1 [0, 1, 2, 3]
1 [0, 1, 2, 3, 4]
1 [0, 1, 2, 3, 4, 5]
2 [6]
2 [6, 7]
2 [6, 7, 8]
2 [6, 7, 8, 9]
2 [6, 7, 8, 9, 10]
2 [6, 7, 8, 9, 10, 11]
Timing via %timeit:
Setup:
import pandas as pd
df = pd.DataFrame({'visit': [1] * 6 + [2] * 6,
'time': [t for t in range(6)] * 2,
'observations': [o for o in range(12)]})
Original:
def fn():
    dflist = []
    for v_ in df.visit.unique():
        for t_ in df.time[df.visit == v_]:
            dflist.append([
                df[(df.visit == v_) & (df.time <= t_)]
                .groupby('visit')['observations'].apply(list)
            ])
    return pd.DataFrame(pd.concat([df[0] for df in dflist], axis=0))
%timeit fn()
13 ms ± 692 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
List comprehension with expanding (~13x faster on this sample):
def fn2():
    return pd.DataFrame({
        'observations':
            [list(x) for x in df.groupby('visit')['observations'].expanding()]
    }, index=df['visit'])
%timeit fn2()
967 µs ± 57.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Sanity Check:
fn().eq(fn2()).all(axis=None) # True
The double apply approach by #Quixotic22 (~3.4x faster than the original, ~3.9x slower than comprehension + expanding):
def fn3():
    return (df.
            set_index('visit')['observations'].
            apply(lambda x: [x]).
            reset_index().groupby('visit')['observations'].
            apply(lambda x: x.cumsum()))
%timeit fn3()
3.78 ms ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
*Note: this approach only produces a series of observations; it does not include the visit as the index.
fn().eq(fn3()).all(axis=None) # False
Looks like a good solution has been provided but dropping this here as a viable alternative.
(df.
 set_index('visit')['observations'].
 apply(lambda x: [x]).
 reset_index().groupby('visit')['observations'].
 apply(lambda x: x.cumsum())
)
I have two lists/arrays, I want to find the index of elements in one list if the same number exists in another list. Here's an example
list_A = [1,7,9,7,11,1,2,3,6,4,9,0,1]
list_B = [9,1,7]
#output required : [0,1,2,3,5,10,12]
Is there any method to do this, hopefully using numpy?
Using a list-comprehension and enumerate():
>>> list_A = [1,7,9,7,11,1,2,3,6,4,9,0,1]
>>> list_B = [9,1,7]
>>> [i for i, x in enumerate(list_A) if x in list_B]
[0, 1, 2, 3, 5, 10, 12]
Using numpy:
>>> import numpy as np
>>> np.where(np.isin(list_A, list_B))
(array([ 0, 1, 2, 3, 5, 10, 12], dtype=int64),)
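If you need a plain Python list rather than the array inside that tuple, a small sketch (not in the original answer) is to take the first element and call .tolist():
idx = np.where(np.isin(list_A, list_B))[0].tolist()
# idx -> [0, 1, 2, 3, 5, 10, 12]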
In addition, as #Chris_Rands points out, we could also convert list_B to a set first, since membership testing with in is O(1) for sets as opposed to O(n) for lists.
Time comparison:
import random
import numpy as np
import timeit
list_A = [random.randint(0,100000) for _ in range(100000)]
list_B = [random.randint(0,100000) for _ in range(50000)]
array_A = np.array(list_A)
array_B = np.array(list_B)
def lists_enumerate(list_A, list_B):
    return [i for i, x in enumerate(list_A) if x in list_B]
def listB_to_set_enumerate(list_A, list_B):
    set_B = set(list_B)
    return [i for i, x in enumerate(list_A) if x in set_B]
def numpy(array_A, array_B):
    return np.where(np.isin(array_A, array_B))
Results:
>>> %timeit lists_enumerate(list_A, list_B)
48.8 s ± 638 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit listB_to_set_enumerate(list_A, list_B)
11.2 ms ± 856 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit numpy(array_A, array_B)
23.3 ms ± 167 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
So clearly for larger lists the best solution is to either convert list_B to a set before applying the enumerate, or use numpy.
I have several numpy arrays; I want to build a groupby method that would have group ids for these arrays. It will then allow me to index these arrays on the group id to perform operations on the groups.
For an example:
import numpy as np
import pandas as pd
a = np.array([1,1,1,2,2,3])
b = np.array([1,2,2,2,3,3])
def group_np(groupcols):
    groupby = np.array([''.join([str(b) for b in bs]) for bs in zip(*[c for c in groupcols])])
    _, groupby = np.unique(groupby, return_inverse=True)
    return groupby
def group_pd(groupcols):
    df = pd.DataFrame(groupcols[0])
    for i in range(1, len(groupcols)):
        df[i] = groupcols[i]
    for i in range(len(groupcols)):
        df[i] = df[i].fillna(-1)
    return df.groupby(list(range(len(groupcols)))).grouper.group_info[0]
Outputs:
group_np([a,b]) -> [0, 1, 1, 2, 3, 4]
group_pd([a,b]) -> [0, 1, 1, 2, 3, 4]
Is there a more efficient way of implementing it, ideally in pure numpy? The bottleneck currently seems to be building a vector that would have unique values for each group - at the moment I am doing that by concatenating the values for each vector as strings.
I want this to work for any number of input vectors, which can have millions of elements.
Edit: here is another testcase:
a = np.array([1,2,1,1,1,2,3,1])
b = np.array([1,2,2,2,2,3,3,2])
Here, group elements 2,3,4,7 should all be the same.
Edit2: adding some benchmarks.
a = np.random.randint(1, 1000, 30000000)
b = np.random.randint(1, 1000, 30000000)
c = np.random.randint(1, 1000, 30000000)
def group_np2(groupcols):
    _, groupby = np.unique(np.stack(groupcols), return_inverse=True, axis=1)
    return groupby
%timeit group_np2([a,b,c])
# 25.1 s +/- 1.06 s per loop (mean +/- std. dev. of 7 runs, 1 loop each)
%timeit group_pd([a,b,c])
# 21.7 s +/- 646 ms per loop (mean +/- std. dev. of 7 runs, 1 loop each)
If you use np.stack on the arrays a and b and set the parameter return_inverse to True in np.unique, you get exactly the output you are looking for:
a = np.array([1,2,1,1,1,2,3,1])
b = np.array([1,2,2,2,2,3,3,2])
_, inv = np.unique(np.stack([a,b]), axis=1, return_inverse=True)
print (inv)
array([0, 2, 1, 1, 1, 3, 4, 1], dtype=int64)
and you can replace [a,b] in np.stack by a list of all the vectors.
Edit: a faster solution is to use np.unique on the sum of the arrays, each multiplied by the cumulative product (np.cumprod) of the max plus 1 of all previous arrays in groupcols, such as:
def group_np_sum(groupcols):
    groupcols_max = np.cumprod([ar.max()+1 for ar in groupcols[:-1]])
    return np.unique(sum([groupcols[0]] +
                         [ar*m for ar, m in zip(groupcols[1:], groupcols_max)]),
                     return_inverse=True)[1]
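To see why this works (an illustrative sketch, not part of the original answer): each pair is mapped to a single integer by scaling the second array with the first array's max plus 1, so two positions get the same key exactly when both components match.
aa = np.array([1, 2, 1])
bb = np.array([0, 3, 0])
m = aa.max() + 1                # 3; the values of aa lie in the range [0, m)
keys = aa + bb * m              # [1, 11, 1]: equal keys <=> equal (aa_i, bb_i) pairs
_, inv = np.unique(keys, return_inverse=True)
# inv -> [0, 1, 0]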
To check:
a = np.array([1,2,1,1,1,2,3,1])
b = np.array([1,2,2,2,2,3,3,2])
print (group_np_sum([a,b]))
array([0, 2, 1, 1, 1, 3, 4, 1], dtype=int64)
Note: the number associated with each group may not be the same (here I changed the first element of a to 3):
a = np.array([3,2,1,1,1,2,3,1])
b = np.array([1,2,2,2,2,3,3,2])
print(group_np2([a,b]))
print (group_np_sum([a,b]))
array([3, 1, 0, 0, 0, 2, 4, 0], dtype=int64)
array([0, 2, 1, 1, 1, 3, 4, 1], dtype=int64)
but the groups themselves are the same.
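One way to verify that two labelings describe the same grouping even though the numbers differ (a sketch, not part of the original answer): the labelings are equivalent exactly when the mapping between their labels is one-to-one.
def same_grouping(g1, g2):
    # unique (g1, g2) label pairs; the partitions match iff this pairing
    # is a bijection between the labels of g1 and the labels of g2
    pairs = np.unique(np.stack([g1, g2]), axis=1)
    return pairs.shape[1] == len(np.unique(pairs[0])) == len(np.unique(pairs[1]))
same_grouping(group_np2([a, b]), group_np_sum([a, b]))  # True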
Now to check for timing:
a = np.random.randint(1, 100, 30000)
b = np.random.randint(1, 100, 30000)
c = np.random.randint(1, 100, 30000)
groupcols = [a,b,c]
%timeit group_pd(groupcols)
#13.7 ms ± 1.22 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit group_np2(groupcols)
#34.2 ms ± 6.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit group_np_sum(groupcols)
#3.63 ms ± 562 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The numpy_indexed package (disclaimer: I am its author) covers this type of use case:
import numpy_indexed as npi
npi.group_by((a, b))
Passing a tuple of index-arrays like this avoids creating a copy; but if you don't mind making the copy you can use stacking as well (stacked along the last axis so that each row is one (a, b) key pair):
npi.group_by(np.stack((a, b), axis=-1))
Description:
I have a large array with simple integers (positive and not large), like 1, 2, ..., etc. For example: [1, 1, 2, 2, 1, 2]. I want to get a dict in which each distinct value from the list is a key, and the list of indices where this value occurs is the corresponding dict value.
Question:
Is there a simpler and faster way to get the expected results in python? (array can be a list or a numpy array)
Code:
a = [1, 1, 2, 2, 1, 2]
results = indexes_of_same_elements(a)
print(results)
Expected results:
{1:[0, 1, 4], 2:[2, 3, 5]}
You can avoid iteration here using vectorized methods, in particular np.unique + np.argsort:
idx = np.argsort(a)
el, c = np.unique(a, return_counts=True)
out = dict(zip(el, np.split(idx, c.cumsum()[:-1])))
{1: array([0, 1, 4], dtype=int64), 2: array([2, 3, 5], dtype=int64)}
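If you need the exact expected result with plain Python ints and lists, a small sketch on top of out from above (not in the original answer):
out_lists = {int(k): v.tolist() for k, v in out.items()}
# out_lists -> {1: [0, 1, 4], 2: [2, 3, 5]}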
Performance
a = np.random.randint(1, 100, 10000)
In [183]: %%timeit
...: idx = np.argsort(a)
...: el, c = np.unique(a, return_counts=True)
...: dict(zip(el, np.split(idx, c.cumsum()[:-1])))
...:
897 µs ± 41.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [184]: %%timeit
...: results = {}
...: for i, k in enumerate(a):
...:     results.setdefault(k, []).append(i)
...:
2.61 ms ± 18.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
We can exploit the fact that the elements are "simple" (i.e. nonnegative and not too large?) integers.
The trick is to construct a sparse matrix with just one element per row and then to transform it to a column wise representation. This is typically faster than argsort because this transform is O(M + N + nnz), if the sparse matrix is MxN with nnz nonzeros.
from scipy import sparse
def use_sprsm():
    # one entry per row: row i holds value a[i] in column a[i]
    x = sparse.csr_matrix((a, a, np.arange(a.size+1))).tocsc()
    # columns with at least one stored entry, i.e. the distinct values of a
    idx, = np.where(x.indptr[:-1] != x.indptr[1:])
    return {i: grp for i, grp in zip(idx, np.split(x.indices, x.indptr[idx[1:]]))}
# for comparison
def use_asort():
    idx = np.argsort(a)
    el, c = np.unique(a, return_counts=True)
    return dict(zip(el, np.split(idx, c.cumsum()[:-1])))
Sample run:
>>> a = np.random.randint(0, 100, (10_000,))
>>>
# sanity check, note that `use_sprsm` returns sorted indices
>>> for k, v in use_asort().items():
...     assert np.array_equal(np.sort(v), use_sprsm()[k])
...
>>> timeit(use_asort, number=1000)
0.8930604780325666
>>> timeit(use_sprsm, number=1000)
0.38419671391602606
It is pretty trivial to construct the dict:
In []:
results = {}
for i, k in enumerate(a):
    results.setdefault(k, []).append(i)  # str(k) if you really need the key to be a str
print(results)
Out[]:
{1: [0, 1, 4], 2: [2, 3, 5]}
You could also use results = collections.defaultdict(list) and then results[k].append(i) instead of results.setdefault(k, []).append(i).
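A minimal sketch of that defaultdict variant (assuming the same input a):
import collections
results = collections.defaultdict(list)
for i, k in enumerate(a):
    results[k].append(i)
# dict(results) -> {1: [0, 1, 4], 2: [2, 3, 5]}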
Given input
A = np.array([[1,2,3],[4,5,6],[7,8,9]])
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
Needed output:
array([[2, 3],
[4, 6],
[7, 8]])
It is easy to do this with iteration or a loop, but there should be a neat way to do it without using loops. Thanks
Approach #1
One approach with masking -
A[~np.eye(A.shape[0],dtype=bool)].reshape(A.shape[0],-1)
Sample run -
In [395]: A
Out[395]:
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
In [396]: A[~np.eye(A.shape[0],dtype=bool)].reshape(A.shape[0],-1)
Out[396]:
array([[2, 3],
[4, 6],
[7, 8]])
Approach #2
Using the regular pattern of non-diagonal elements that could be traced with broadcasted additions with range arrays -
m = A.shape[0]
idx = (np.arange(1,m+1) + (m+1)*np.arange(m-1)[:,None]).reshape(m,-1)
out = A.ravel()[idx]
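For the 3x3 example above, the intermediate index array works out as follows (an illustration, not part of the original answer):
m = 3
idx = (np.arange(1, m+1) + (m+1)*np.arange(m-1)[:, None]).reshape(m, -1)
# idx -> [[1, 2],
#         [3, 5],
#         [6, 7]]    (flat positions of the off-diagonal elements)
# A.ravel()[idx] -> [[2, 3], [4, 6], [7, 8]]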
Approach #3 (Strides Strikes!)
Abusing the regular pattern of non-diagonal elements from the previous approach, we can introduce np.lib.stride_tricks.as_strided and some slicing help, like so -
m = A.shape[0]
strided = np.lib.stride_tricks.as_strided
s0,s1 = A.strides
out = strided(A.ravel()[1:], shape=(m-1,m), strides=(s0+s1,s1)).reshape(m,-1)
Runtime test
Approaches as funcs :
def skip_diag_masking(A):
    return A[~np.eye(A.shape[0],dtype=bool)].reshape(A.shape[0],-1)
def skip_diag_broadcasting(A):
    m = A.shape[0]
    idx = (np.arange(1,m+1) + (m+1)*np.arange(m-1)[:,None]).reshape(m,-1)
    return A.ravel()[idx]
def skip_diag_strided(A):
    m = A.shape[0]
    strided = np.lib.stride_tricks.as_strided
    s0,s1 = A.strides
    return strided(A.ravel()[1:], shape=(m-1,m), strides=(s0+s1,s1)).reshape(m,-1)
Timings -
In [528]: A = np.random.randint(11,99,(5000,5000))
In [529]: %timeit skip_diag_masking(A)
...: %timeit skip_diag_broadcasting(A)
...: %timeit skip_diag_strided(A)
...:
10 loops, best of 3: 56.1 ms per loop
10 loops, best of 3: 82.1 ms per loop
10 loops, best of 3: 32.6 ms per loop
I know I'm late to this party, but I have what I believe is a simpler solution. So you want to remove the diagonal? Okay cool:
replace it with NaN
filter all but NaN (this converts to one dimensional as it can't assume the result will be square)
reset the dimensionality
arr = np.array([[1,2,3],[4,5,6],[7,8,9]]).astype(float)
np.fill_diagonal(arr, np.nan)
arr[~np.isnan(arr)].reshape(arr.shape[0], arr.shape[1] - 1)
Solution steps:
Flatten your array
Delete the diagonal elements, which are at the locations range(0, len(x_no_diag), len(x) + 1)
Reshape your array to (num_rows, num_columns - 1)
The function:
import numpy as np
def remove_diag(x):
    x_no_diag = np.ndarray.flatten(x)
    x_no_diag = np.delete(x_no_diag, range(0, len(x_no_diag), len(x) + 1), 0)
    x_no_diag = x_no_diag.reshape(len(x), len(x) - 1)
    return x_no_diag
Example:
>>> x = np.random.randint(5, size=(3,3))
>>> x
array([[0, 2, 3],
[3, 4, 1],
[2, 4, 0]])
>>> remove_diag(x)
array([[2, 3],
[3, 1],
[2, 4]])
Just with numpy, assuming a square matrix:
new_A = numpy.delete(A,range(0,A.shape[0]**2,(A.shape[0]+1))).reshape(A.shape[0],(A.shape[1]-1))
If you do not mind creating a new array, then you can use a list comprehension (note that this filters out elements equal to the diagonal value, so it assumes that value does not appear elsewhere in its row).
A = np.array([A[i][A[i] != A[i][i]] for i in range(len(A))])
Rerunning the same methods as #Divakar,
A = np.random.randint(11,99,(5000,5000))
skip_diag_masking
85.7 ms ± 1.55 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
skip_diag_broadcasting
163 ms ± 1.77 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
skip_diag_strided
52.5 ms ± 2.54 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
skip_diag_list_comp
101 ms ± 347 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Perhaps the cleanest way, based on Divakar's first solution but using len(array) instead of array.shape[0], is:
array_without_diagonal = array[~np.eye(len(array), dtype=bool)].reshape(len(array), -1)
I love all the answers here, but would like to add one in case your numpy object has more than 2 dimensions. In that case, you can use the following adjustment of Divakar's approach #1:
def remove_diag(A):
    removed = A[~np.eye(A.shape[0], dtype=bool)].reshape(A.shape[0], int(A.shape[0])-1, -1)
    return np.squeeze(removed)
The other approach is to use numpy.delete(). Assuming a square matrix, you can use (the step must be A.shape[0]+1 so that only diagonal positions are removed):
numpy.delete(A, range(0, A.shape[0]**2, A.shape[0]+1)).reshape(A.shape[0], A.shape[1]-1)
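A quick check on the 3x3 example from the question (a sketch, assuming numpy is imported as np):
import numpy as np
A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
out = np.delete(A, range(0, A.shape[0]**2, A.shape[0] + 1)).reshape(A.shape[0], A.shape[1] - 1)
# out -> [[2, 3], [4, 6], [7, 8]]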