Find intersecting values in multiple numpy arrays - python

I have 100 large arrays, each with more than 250,000 elements. I want to find common values that appear across these arrays. I know that there will not be values found in all 100 arrays, but a small number of values will be found in multiple arrays (I suspect 10-30%). I want to find which values are found with the highest frequency across these arrays. (Side point: the arrays have no duplicates.)
I know that I can loop through the arrays and eventually find them, but that will take a while. I also know about the np.intersect1d function, but that only gives values that are found within all of the arrays, whereas I'm looking for values that are only going to be in around 20 of the 100 arrays.
My best bet is to use np.intersect1d and loop through all possible combinations of the arrays, which would definitely take a while, but not as long as simply looping through all 250,000 x 100 values.
Example:
array_1 = array([1.98, 2.33, 3.44, ..., 11.1])
array_2 = array([1.26, 1.49, 4.14, ..., 9.0])
array_3 = array([1.58, 2.33, 3.44, ..., 19.1])
array_4 = array([4.18, 2.03, 3.74, ..., 12.1])
.
.
.
array_100 = array([1.11, 2.13, 1.74, ..., 1.1])
No value appears in all 100 arrays, but is there a value that can be found in 30 different arrays?

You can either use np.unique with the return_counts keyword, or a vanilla Python Counter.
The first option works if you can concatenate your arrays into a single 250k x 100 monolith, or even string them out one after the other:
import numpy as np

monolith = np.concatenate(arrays)   # arrays: the list holding your 100 arrays
unq, counts = np.unique(monolith, return_counts=True)
ind = np.argsort(counts)[::-1]      # sort by frequency, descending
unq = unq[ind]
counts = counts[ind]
This will leave you with an array containing all the unique values, and the frequency with which they occur.
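For the specific question ("is there a value found in 30 different arrays?"), here is a short follow-up sketch building on unq and counts above; since each array is duplicate-free, a count of 30 means 30 distinct arrays:
in_30_or_more = unq[counts >= 30]   # values that appear in at least 30 of the arrays
print(in_30_or_more[:10])           # inspect a few of them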
If the arrays have to remain separate, use collections.Counter to accomplish the same task. In the following, I assume that you have a list containing your arrays; it would be pointless to have a hundred individually named variables:
from collections import Counter

c = Counter()
for arr in arrays:
    c.update(arr)
Now c.most_common() will give you the most common elements and their counts.
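For example, to pull out the top hits and the values that reach the 30-array threshold from the question (a short sketch building on c above):
top_ten = c.most_common(10)                                  # [(value, count), ...]
in_30_or_more = [val for val, cnt in c.items() if cnt >= 30]
print(top_ten[:3], len(in_30_or_more))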

Related

efficient iterative creation of multiple numpy arrays at once

I have a file with millions of lines, each of which is a list of integers (these sublists are in the range of tens to hundreds of items). What I want is to read through the file contents once and create 3 numpy arrays -- one with the average of each sublist, one with the length of each sublist, and one which is a flattened list of all the values in all the sublists.
If I just wanted one of these things, I'd do something like:
counts = np.fromiter((len(json.loads(line.rstrip())) for line in mystream), int)
but if I write 3 of those, my code would iterate through my millions of sublists 3 times, and I obviously only want to iterate through them once. So I want to do something like this:
averages = []
counts = []
allvals = []
for line in mystream:
    sublist = json.loads(line.rstrip())
    averages.append(np.average(sublist))
    counts.append(len(sublist))
    allvals.extend(sublist)
I believe that creating regular lists as above and then doing
np_averages = np.array(averages)
is very inefficient (basically creating the list twice). What is the right/efficient way to iteratively create a numpy array if it's not practical to use fromiter? Or should I create a function that returns the 3 values and do something like a list comprehension over a function with multiple return values, but with fromiter instead of a traditional list comprehension?
Or would it be efficient to create a 2D array of
[[count1, average1, sublist1], [count2, average2, sublist2], ...]
and then do additional operations to slice off (and in the 3rd case also flatten) the columns as their own 1D arrays?
First of all, the json library is not the most optimized library for this. You can use the pysimdjson package, based on the optimized simdjson library, to speed up the parsing. For small integer lists, it is about twice as fast on my machine.
Moreover, Numpy functions are not great for relatively small arrays, as they introduce a pretty big overhead. For example, np.average takes about 8-10 us on my machine to average a list of 20 items. Meanwhile, sum(sublist)/len(sublist) only takes 0.25-0.30 us.
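A small micro-benchmark sketch of that overhead (the exact numbers are machine-dependent; the 20-item list is just an illustrative input):
import timeit
import numpy as np

sublist = list(range(20))
t_np = timeit.timeit(lambda: np.average(sublist), number=10000)
t_py = timeit.timeit(lambda: sum(sublist) / len(sublist), number=10000)
print(f"np.average : {t_np / 10000 * 1e6:.2f} us per call")
print(f"sum()/len(): {t_py / 10000 * 1e6:.2f} us per call")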
Finally, np.array needs to iterate twice to convert a list into an array because it does not know the type of all the objects in advance. You can specify the dtype to make the conversion faster: np.array(averages, np.float64).
Here is a significantly faster implementation:
import numpy as np
import simdjson

averages = []
counts = []
allvals = []
for line in mystream:
    sublist = simdjson.loads(line.rstrip())
    averages.append(sum(sublist) / len(sublist))
    counts.append(len(sublist))
    allvals.extend(sublist)
np_averages = np.array(averages, np.float64)
One issue with this implementation is that allvals will contain all the values in the form of a big list of Python objects. CPython objects are quite big in memory compared to native Numpy integers (especially compared to 32-bit = 4-byte integers), since each object usually takes 32 bytes and the reference in the list usually takes another 8 bytes (resulting in about 40 bytes per item, that is to say 10 times more than a Numpy 32-bit-integer array). Thus, it may be better to use a native implementation, possibly based on Cython.
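A lighter pure-Python mitigation, offered as an assumption rather than part of the original answer, is to accumulate the flattened values in an array.array of C integers instead of a list of Python objects, then view the buffer as a Numpy array at the end:
import array
import numpy as np

allvals = array.array('q')                     # signed 64-bit C integers, no per-item objects
for sublist in ([1, 2, 3], [40, 50]):          # placeholder for the parsed sublists
    allvals.extend(sublist)
np_allvals = np.frombuffer(allvals, dtype=np.int64)   # zero-copy view of the buffer
print(np_allvals)                                     # -> [ 1  2  3 40 50]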

Compare 2D arrays row-wise

This problem arises from the spatial analysis of unstructured grids in 3D.
I have two 2D arrays to compare, each with 3 columns for the xyz coordinates.
One of the arrays is a reference; the other is evaluated against it (it is the result of a cKDTree query against the reference array). In the end I want, for each row of the reference, the number of matching rows in the query.
I have tried to find an array concatenation solution, but I am lost in the different dimensions:
reference=np.array([[0,1,33],[0,33,36],[0,2,36],[1, 33, 34]])
query= np.array([[0,1,33],[0,1,33],[1, 33, 34],[0,33,36],[0,33,36],[0,1,33],[0,33,36]])
Something in this style is where I am heading:
filter=reference[:,:,None]==query.all(axis=0)
result = filter.sum(axis=1)
but I cannot find the right way of broadcasting to be able to compare the rows of the 2 arrays.
The result should be:
np.array([3,3,0,1])
You need to broadcast the two arrays against each other. Since comparing two rows element-wise gives one boolean per coordinate rather than a single result, you first need to do a reduction using all on the last dimension. Then you can count the matched rows with sum. Here is the resulting code:
(reference[None,:,:] == query[:,None,:]).all(axis=2).sum(axis=0)
That being said, this solution is not the most efficient for bigger arrays. Indeed, for m rows of size n in reference and k rows in query, the complexity of this solution is O(n m k), while the optimal solution is O(n m + n k). This can be achieved using hash maps (aka dict). The idea is to put the rows of the reference array in a hash map with associated values set to 0, and then, for each row of query, increment the value stored under that key. One then just needs to iterate over the hash map to get the final array. Hash-map accesses are done in (amortized) constant time. Unfortunately, a Python dict does not support arrays as keys, since arrays cannot be hashed, but tuples can be. Here is an example:
counts = {tuple(row): 0 for row in reference}
for row in query:
    key = tuple(row)
    if key in counts:
        counts[key] += 1
print(list(counts.values()))
This prints [3, 3, 0, 1].
Note that hash maps generally do not preserve order, but Python dicts (since Python 3.7) keep insertion order, so the counts come out in the order of the reference rows. Alternatively, one can use another hash map to rebuild the final array.
The resulting solution may be slower for small arrays, but it should be better for huge ones.
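As a sketch of the "rebuild the final array with another hash map" idea mentioned above, one can map each reference row to its position and accumulate counts by position, which guarantees the output order regardless of dict behaviour (reusing reference and query from the question):
import numpy as np

reference = np.array([[0, 1, 33], [0, 33, 36], [0, 2, 36], [1, 33, 34]])
query = np.array([[0, 1, 33], [0, 1, 33], [1, 33, 34], [0, 33, 36],
                  [0, 33, 36], [0, 1, 33], [0, 33, 36]])

row_index = {tuple(row): i for i, row in enumerate(reference)}   # row -> position in reference
result = np.zeros(len(reference), dtype=np.int64)
for row in query:
    i = row_index.get(tuple(row))
    if i is not None:                                            # ignore rows not in reference
        result[i] += 1
print(result)   # -> [3 3 0 1]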

Split several times a numpy array into irregular fragments

As the title of the question suggests, I am trying to find an optimal (and ideally pythonic) way of splitting a one-dimensional numpy array several times into irregular fragments, under the following conditions: the first split produces n fragments whose lengths l are contained in the LSHAPE array; the second split then divides each of the n fragments regularly into m arrays, where the values of m are stored in the MSHAPE array such that the i-th m corresponds to the i-th l. To best illustrate my problem, I include the solution I have found so far, which makes use of the numpy split method:
import numpy as np

# Define arrays (n = 3 in this example)
LSHAPE = np.array([5, 8, 3])
MSHAPE = np.array([4, 5, 2])

# Generate a random 1D array of the required length
LM_SHAP = np.sum(np.multiply(LSHAPE, MSHAPE))
REFDAT = np.random.uniform(-1, 1, size=LM_SHAP)

# Split the array twice (this is my solution so far)
SLICE_L = np.split(REFDAT, np.cumsum(np.multiply(LSHAPE, MSHAPE)))[0:-1]
SLICE_L_M = []
for idx, mfrags in enumerate(SLICE_L):
    SLICE_L_M.append(np.split(mfrags, MSHAPE[idx]))
In the code above, a random test array (REFDAT) is created to fulfill the requirements of the problem and then split twice; the results are stored in SLICE_L_M. This solution works, but I think it is hard to read and possibly not efficient, so I would like to know if it can be improved. I have read some Stackoverflow threads related to this one (like this one and this one), but I think my problem is slightly different. Thanks in advance for your help and time.
Edit:
One can gain an average ~3% improvement in CPU time if a list comprehension is used:
SLICE_L = np.split(REFDAT, np.cumsum(np.multiply(LSHAPE, MSHAPE)))[0:-1]
SLICE_L_M = [np.split(lval, mval) for lval, mval in zip(SLICE_L, MSHAPE)]
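A possible further simplification, not from the original post: since every L-fragment is divided into equal-sized pieces, a plain reshape produces the same grouping as a 2-D block instead of a list of sub-arrays (reusing SLICE_L and MSHAPE from the snippet above):
# Each SLICE_L[i] has length LSHAPE[i] * MSHAPE[i], so reshaping to
# (MSHAPE[i], -1) yields the same MSHAPE[i] regular pieces as np.split,
# but as rows of a single 2-D array.
SLICE_L_M = [lval.reshape(mval, -1) for lval, mval in zip(SLICE_L, MSHAPE)]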

Update numpy array with sparse indices and values

I have a 1-dimensional numpy array and want to store sparse updates of it.
Say I have an array of length 500000 and want to do 100 updates of 100 elements each. The updates either add to the values or just change them (I do not think it matters).
What is the best way to do it using numpy?
I wanted to just store two arrays, indices and values_to_add, and therefore have two objects: one stores the dense matrix, and the other just keeps the indices and values to add, so that I can do something like this with the dense matrix:
dense_matrix[indices] += values_to_add
And if I have multiple updates, I just concatenate them.
But this numpy syntax doesn't handle repeated indices well: only one of the duplicate updates takes effect.
Merging the (index, value) pairs whenever an update repeats an index is O(n). I thought of using a dict instead of arrays to store the updates, which looks fine from a complexity point of view, but it doesn't feel like good numpy style.
What is the most expressive way to achieve this? I know about scipy sparse objects, but (1) I want pure numpy because (2) I want to understand the most efficient way to implement it.
If you have repeated indices you could use np.add.at (a method of np.ufunc). From the documentation:
Performs unbuffered in place operation on operand ‘a’ for elements
specified by ‘indices’. For addition ufunc, this method is equivalent
to a[indices] += b, except that results are accumulated for elements
that are indexed more than once.
Code
import numpy as np

a = np.arange(10)
indices = [0, 2, 2]
np.add.at(a, indices, [-44, -55, -55])
print(a)
Output
[ -44 1 -108 3 4 5 6 7 8 9]
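Applied to the setup from the question (a sketch; the names dense_matrix, indices and values_to_add are taken from the question), the accumulation across repeated indices then comes for free:
import numpy as np

dense_matrix = np.zeros(500000)
indices = np.array([10, 10, 42])              # index 10 is updated twice
values_to_add = np.array([1.5, 2.5, 3.0])
np.add.at(dense_matrix, indices, values_to_add)
print(dense_matrix[10], dense_matrix[42])     # -> 4.0 3.0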

Finding a list of indices from master array using secondary array with non-unique entries

I have a master array of length n of id numbers that apply to other analogous arrays holding the corresponding data for the simulation elements belonging to those ids (e.g. data[id]). If I separately generate a list of id numbers of length m and need the information in the data array for those ids, what is the best method of getting a list of indices idx into the original array of ids so that I can extract data[idx]? That is, given:
a=numpy.array([1,3,4,5,6]) # master array
b=numpy.array([3,4,3,6,4,1,5]) # secondary array
I would like to generate
idx=numpy.array([1,2,1,4,2,0,3])
The array a is typically in sequential order but it's not a requirement. Also, array b will most definitely have repeats and will not be in any order.
My current method of doing this is:
idx=numpy.array([numpy.where(a==bi)[0][0] for bi in b])
I timed it using the following test:
a=(numpy.random.uniform(100,size=100)).astype('int')
b=numpy.repeat(a,100)
timeit method1(a,b)
10 loops, best of 3: 53.1 ms per loop
Is there a better way of doing this?
The way you are currently doing it, with where, searches through the whole array a for every element of b. You can make this look-up O(1) instead of O(N) by using a dict. For instance, I used the following method:
import numpy

def method2(a, b):
    tmpdict = dict(zip(a, range(len(a))))
    idx = numpy.array([tmpdict[bi] for bi in b])
    return idx
and got a very large speed-up, which will be even better for larger arrays. For the sizes in your example code, I got a speed-up of 15x. The only problem with my code is that if there are repeated elements in a, the dict will point to the last instance of each element, whereas your method points to the first instance. However, that can be remedied if there are repeated elements in the actual usage of the code.
I'm not sure if there is a way to do this automatically in python, but you're probably best off sorting the two arrays and then generating your output in one pass through b. The complexity of that operation should be O(|a|*log|a|)+O(|b|*log|b|)+O(|b|) = O(|b|*log|b|) (assuming |b| > |a|). I believe your original try has complexity O(|a|*|b|), so this should provide a noticeable improvement for a sufficiently large b.
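A related sort-based route within numpy itself, offered as a sketch rather than part of the original answer, is np.searchsorted, which binary-searches b against a sorted view of a in O(|a|*log|a| + |b|*log|a|):
import numpy as np

a = np.array([1, 3, 4, 5, 6])                 # master array (not necessarily sorted)
b = np.array([3, 4, 3, 6, 4, 1, 5])           # secondary array with repeats

order = np.argsort(a)                         # sort a once
pos = np.searchsorted(a, b, sorter=order)     # positions of b's values in sorted a
idx = order[pos]                              # map back to indices in the original a
print(idx)                                    # -> [1 2 1 4 2 0 3]
# Note: this assumes every value in b actually occurs in a, as in the question.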
