How to apply a masked array to a very large JSON fast - python

The Data
I am currently working on very large JSON files formatted as such:
{key: [1000+ * arrays of length 241],
key2: [1000+ * arrays of length 241],
(...repeat 5-8 times...)}
The data is structured so that the nth element in each key's array belongs to the nth entity. Think of each key as a descriptor such as 'height' or 'pressure': to get an entity's 'height' and 'pressure' you access that entity's index n in all the arrays. All of the keys' arrays therefore have the same length Z.
This, as you can imagine, is a pain to work with as a whole. Therefore, whenever I perform any data manipulation I return a masked array of length Z populated with 1's and 0's (1 means the data at that index in every key is to be kept, 0 means it should be omitted).
The Problem
Once all of my data manipulation has been performed I need to apply the masked array to the data to get back a copy of the original JSON data in which the length of each key's array equals the number of 1's in the masked array: if the element of the masked array at index n is 0, then index n is removed from every key's array; if it is 1, it is kept.
My attempt
# mask: masked array
# d: data to apply the mask to
def apply_mask(mask, d):
    keys = d.keys()
    print(keys)
    rem = [] #List of index to remove
    for i in range(len(mask)):
        if mask[i] == 0:
            rem.append(i) #Populate 'rem'
    for k in keys:
        d[k] = [elem for elem in d[k] if not d[k].index(elem) in rem]
    return d
This works as intended but takes a while on such large JSON data
Question
I hope everything above was clear and helps you to understand my question:
Is there a more optimal/quicker way to apply a masked array to data such as this shown above?
Cheers

This is going to be slow because
d[k] = [elem for elem in d[k] if not d[k].index(elem) in rem]
is completely recreating the inner list every time.
Since you're already modifying d in-place, you could just delete the respective elements:
def apply_mask(mask, d):
    for i, keep in enumerate(mask):
        if not keep:
            for key in d:
                del d[key][i - len(mask)]
    return d
(Negative indices i - len(mask) are being used because positive indices don't work anymore if the list has already changed its length due to previously removed elements.)
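For example, on a small dict (the key names and values here are just illustrative):
data = {"height": [10, 20, 30], "pressure": [1, 2, 3]}
apply_mask([1, 0, 1], data)
# index 1 is deleted from every key's list, in place:
# data == {"height": [10, 30], "pressure": [1, 3]}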

The problem comes from the high algorithmic complexity of the code. It is possible to design a much faster algorithm.
Let K be the number of keys in the dictionary d (i.e. len(d)). Let Z be the size of the mask (i.e. len(mask)), which is also the typical size of the array values in d (i.e. len(d[key]) for any key).
The algorithmic complexity of the initial code is O(Z^2 * K): for each of the Z elements of a key's array, the membership test in rem scans a list in linear time, and d[k].index(elem) also searches for elem in d[k] in linear time.
The solution proposed by finefoot has the same O(Z^2 * K) worst-case complexity (each del on a CPython list is done in linear time), but it is faster in practice because each deletion is a simple memory move in C rather than two Python-level scans per element.
However, it is possible to do the computation in linear time, O(K * Z). Here is how:
def apply_mask(mask, d):
    for key in d:
        d[key] = [e for i, e in enumerate(d[key]) if mask[i] != 0]
    return d
This code should be several orders of magnitude faster.
PS: I think the initial algorithm is not correct regarding the description of the problem: d[k].index(elem) returns the index of the first occurrence of elem, so when an array contains duplicate values, items that should be kept can be removed (and items that should be removed can be kept).
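Not part of the answers above, but since every key maps to a fixed-length numeric array, one more option is to convert the values to NumPy arrays once and filter with a boolean mask, which runs in C. A minimal sketch (apply_mask_np is a hypothetical name, and it returns NumPy arrays rather than lists; call .tolist() if lists are required):
import numpy as np

def apply_mask_np(mask, d):
    keep = np.asarray(mask, dtype=bool)
    # Boolean indexing copies only the entries where keep is True
    return {key: np.asarray(value)[keep] for key, value in d.items()}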

Related

Efficiency of ordering the elements of a numpy array, by occurrences in a different array

I have the following code:
import numpy as np
def suborder(x, y):
    pos = np.in1d(x, y, assume_unique=True)
    return x[pos]
x and y are 1d numpy integer arrays, the elements of y are a subset of those in x, and neither array has repeats. The result is the elements of y, in the order they appear in x. The code gives the result I want, but the intermediate array pos is the same size as x, and in many use cases y is much, much smaller than x. Is there a way I can get the result more directly, without allocating the intermediate array pos, so as to save some memory?
x is not sorted. In my case its elements are the ids of objects and take the values 0 to len(x)-1 in an unspecified order; x is sorted by a score assigned to each object. The purpose of suborder is to order subsets in that same score order.
x is around 10 million elements, and I have many different values for y, some approaching the size of x, all the way down to just a handful of elements.
Edit: I get x from doing an argsort on a set of scores for the objects. I had imagined that it would be better to sort once for all scores and then use that ordering to impose an order on the subsets. It may actually be better to take scores[y], argsort that, and take the elements of y in that order (for each y).
Solution 1
Since the items are in range(0, len(x)) and are all unique (i.e. a permutation), you can preallocate only one buffer of size len(x) (len(x)*4 bytes in RAM). The strategy is to build the reverse indices once, just after sorting x:
idx = np.empty(len(x), dtype=np.int32)     # Can be reused after each sort of `x`
idx[x] = np.arange(len(x), dtype=np.int32) # Can be filled chunk-by-chunk in a loop if memory matters
Then you need to filter the y array so all values are in range(0, len(x)). If this is already the case, skip this step. The operation can be done using yFilt = y[np.logical_and(y >= 0, y < len(x))]. Since y can be quite big, you can do this operation chunk-by-chunk. A simpler, faster and more memory-efficient solution would be to filter y on the fly using Numba.
Then you need to compute x[np.sort(idx[yFilt])] to reorder the items of y like in x. This can be done in-place using the following code:
# Should not allocate any temporary arrays
idx.take(yFilt, out=yFilt)
yFilt.sort()
x.take(yFilt, out=yFilt)
After that, yFilt is ordered like the items in x. Note that you can mutate y directly so as not to perform any temporary array allocation (although this means y cannot be used by anything else in the code after this operation).
This reordering algorithm runs in O(Ny log Ny) time with Ny = len(y). The pre-computation runs in O(Nx) time with Nx = len(x). It requires 4*(Nx + Ny) bytes of space for the out-of-place implementation and 4*Nx bytes for the in-place version that performs no allocation to reorder y.
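Putting Solution 1 together, an out-of-place variant might look like this sketch (order_like_x is a hypothetical name; it assumes x is a permutation of range(len(x)) as stated above):
import numpy as np

def order_like_x(x, y):
    # Reverse indices: idx[value] = position of that value in x
    idx = np.empty(len(x), dtype=np.int32)
    idx[x] = np.arange(len(x), dtype=np.int32)
    # Keep only values of y that can actually appear in x
    y_filt = y[np.logical_and(y >= 0, y < len(x))]
    # Sorting the positions restores x's order, then map back to values
    return x[np.sort(idx[y_filt])]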
Solution 2
If the previous solution takes too much memory for you, this one should be a good fit despite being much more computationally intensive. It uses only about 8*Ny bytes (4*Ny for the in-place implementation) and runs in O(Nx log Ny) time. Note that the output array can be preallocated once (and only filled later) to avoid any issue with the GC/allocator.
The idea is to perform a binary search for each value of x in a sorted and filtered version of y. Matching values are appended on the fly to an output array. This solution requires Numba or Cython to be fast (although a more complex pure-NumPy implementation using chunks and np.searchsorted could be written).
import numpy as np
import numba as nb

# `out` can be preallocated and passed as a parameter to
# avoid allocations in hot loops
@nb.njit('int32[:](int32[:], int32[:])')
def orderLike(x, y):
    y_sorted = np.sort(y)  # Use y.sort() for an in-place implementation
    out = np.empty(len(y), np.int32)
    cur = 0
    for v in x:
        pos = np.searchsorted(y_sorted, v)
        if pos < len(y) and y_sorted[pos] == v:  # Found
            out[cur] = v
            cur += 1
    return out[:cur]
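For reference, a small usage example with made-up values (the int32 dtypes match the compiled signature):
x = np.array([4, 0, 3, 1, 2], dtype=np.int32)
y = np.array([1, 3, 4], dtype=np.int32)
orderLike(x, y)  # array([4, 3, 1], dtype=int32): the elements of y in x's order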
in1d starts with:
if len(ar2) < 10 * len(ar1) ** 0.145 or contains_object:
    ...
    mask = np.zeros(len(ar1), dtype=bool)
    for a in ar2:
        mask |= (ar1 == a)
    return mask
In other words it does an equality test against each element of y. If the size difference isn't that large, it uses a different method instead, one based on concatenating the arrays and doing an argsort.
I can imagine using np.flatnonzero(ar1 == a) to get the equivalent indices and concatenating them, but that will preserve the y order.
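A sketch of that idea, with an extra sort of the collected positions so the result follows x's order rather than y's (suborder_flatnonzero is a hypothetical name):
import numpy as np

def suborder_flatnonzero(x, y):
    # One linear scan of x per element of y, so this only pays off when y is small
    pos = np.concatenate([np.flatnonzero(x == a) for a in y])
    return x[np.sort(pos)]  # sorting the positions restores x's order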

Sorting an array in O(n) using dictionary?

Assuming you have an array of numbers that need to be sorted and the following two conditions are true:
A low standard deviation
Memory isn't a constraint
How about using a dictionary to sort this in O(n)? Below is the Python code:
from typing import List

def sortArray(nums: List[int]) -> List[int]:
    # dictionary to store all the numbers in the array as keys and number of occurrences as values
    d = {}
    # Keep track of the upper and lower bounds of the array
    max_num = nums[0]
    min_num = nums[0]
    for e in nums:
        if e > max_num:
            max_num = e
        if e < min_num:
            min_num = e
        try:
            # increment the value for "e" if it exists in the dictionary
            d[e] = d[e] + 1
        except KeyError:
            # add a new key "e"
            d[e] = 1
    a = []
    for i in range(min_num, max_num + 1):
        try:
            for j in range(0, d[i]):
                # add the element to the new array d[i] times
                a.append(i)
        except KeyError:
            continue
    return a
Given the two conditions, are there any scenarios where this code would not run in O(n)? Is there something wrong with this approach?
It is not possible to tell the exact complexity of this procedure because the running time depends, among other things, on the range of the values.
If this range is fixed, the final double loop adds an element N times, hence O(N), provided the hash table guarantees O(N) complexity overall.
But for unbounded N a fixed range does not make much sense, and it should grow with N. Hence the complexity will be of order O(M(N) + N), where M(N) is the size of the range.
It is also worth noting that the append operation does not necessarily have constant time cost per element.
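For comparison, the same counting-sort idea can be written with collections.Counter; this is only a sketch and has the same O(M(N) + N) behaviour discussed above:
from collections import Counter
from typing import List

def sort_array(nums: List[int]) -> List[int]:
    counts = Counter(nums)                          # O(N) counting pass
    result = []
    for value in range(min(nums), max(nums) + 1):   # O(M(N)) pass over the value range
        result.extend([value] * counts[value])      # Counter returns 0 for missing keys
    return result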

Efficient bin-assignment in numpy

I have a very large 1D Python array x of somewhat repeating numbers, and along with it some data d of the same size.
x = np.array([48531, 62312, 23345, 62312, 1567, ..., 23345, 23345])
d = np.array([0 , 1 , 2 , 3 , 4 , ..., 99998, 99999])
In my context "very large" refers to 10k...100k entries. Some of them repeat, so the number of unique entries is about 5k...15k.
I would like to group them into bins. This should be done by creating two objects. One is a matrix buffer b of data items taken from d. The other is a vector v of unique x values that each of the buffer columns refers to. Here's an example:
v = [48531, 62312, 23345, 1567, ...]
b = [[0 , 1 , 2 , 4 , ...]
[X , 3 , ....., ...., ...]
[ ...., ....., ....., ...., ...]
[X , X , 99998, X , ...]
[X , X , 99999, X , ...] ]
Since the numbers of occurrences of each unique number in x vary, some of the values in the buffer b are invalid (indicated by the capital X, i.e. "don't care").
It's very easy to derive v in numpy:
v, n = np.unique(x, return_counts=True) # yay, just 5ms
and we even get n which is the number of valid entries within each column in b. Moreover, (np.max(n), v.shape[0]) returns the shape of the matrix b that needs to be allocated.
But how to efficiently generate b?
A for-loop could help
b = np.zeros((np.max(n), v.shape[0]))
for i in range(v.shape[0]):
    idx = np.flatnonzero(x == v[i])
    b[0:n[i], i] = d[idx]
This loop iterates over all columns of b and extracts the indices idx by identifying all the locations where x == v[i].
However, I don't like this solution because of the rather slow for loop (it takes about 50x longer than the unique command). I'd rather have the operation vectorized.
One vectorized approach would be to create a matrix of indices where x == v and then run the nonzero() command on it along the columns. However, this matrix would require memory in the range of 150k x 15k, so about 8GB on a 32 bit system.
It seems rather silly that np.unique can efficiently return the inverse indices, so that x = v[inv_indices], yet there is no way to get the v-to-x assignment lists for each bin in v. This should come almost for free while the function is scanning through x. Implementation-wise the only challenge would be the unknown size of the resulting index matrix.
Another way of phrasing this problem assuming that the np.unique-command is the method-to-use for binning:
given the three arrays x, v, inv_indices, where v holds the unique elements of x and x = v[inv_indices], is there an efficient way of generating the index vectors v_to_x[i] such that all(v[i] == x[v_to_x[i]]) for all bins i?
I shouldn't have to spend more time than for the np.unique-command itself. And I'm happy to provide an upper bound for the number of items in each bin (say e.g. 50).
Based on the suggestion from #user202729 I wrote this code:
from itertools import groupby

# T: number of unique values in x (len(v)); K: maximum bin size (np.max(n)) -- not defined in this snippet
x_sorted_args = np.argsort(x)
x_sorted = x[x_sorted_args]
i = 0
v = -np.ones(T)
b = np.zeros((K, T))
for k, g in groupby(enumerate(x_sorted), lambda tup: tup[1]):
    groups = np.array(list(g))[:, 0]
    size = groups.shape[0]
    v[i] = k
    b[0:size, i] = d[x_sorted_args[groups]]
    i += 1
It runs in about ~100 ms, which is a considerable speedup w.r.t. the original code posted above.
It first enumerates the values in x, adding the corresponding index information. Then the enumeration is grouped by the actual x value, which is the second value of the tuple generated by enumerate().
The for loop iterates over all the groups, turning each iterator of tuples g into a groups matrix of size (size x 2), and then throws away the second column, i.e. the x values, keeping only the indices. This leaves groups as a 1D array.
groupby() only works on sorted input.
Good work. I'm just wondering if we can do even better? A lot of unreasonable data copying still seems to happen: creating a list of tuples and then turning it into a 2D matrix just to throw away half of it feels a bit suboptimal.
I received the answer I was looking for by rephrasing the question, see here: python: vectorized cumulative counting
by "cumulative counting" the inv_indices returned by np.unique() we receive the array indices of the sparse matrix so that
c = cumcount(inv_indices)
b[inv_indices, c] = d
Cumulative counting as proposed in the thread linked above is very efficient. Run times lower than 20 ms are very realistic.
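cumcount is not a NumPy built-in; the following sketch shows one possible implementation of the cumulative counting described above, together with the fill step (here b has one row per unique value, i.e. the transpose of the (np.max(n), len(v)) buffer defined earlier):
import numpy as np

def cumcount(a):
    # For each element, count how many equal elements appeared before it (vectorized)
    order = np.argsort(a, kind="stable")
    a_sorted = a[order]
    is_start = np.r_[True, a_sorted[1:] != a_sorted[:-1]]   # True where a new group begins
    group_start = np.flatnonzero(is_start)
    group_sizes = np.diff(np.r_[group_start, len(a)])
    within = np.arange(len(a)) - np.repeat(group_start, group_sizes)
    out = np.empty(len(a), dtype=np.intp)
    out[order] = within
    return out

v, inv_indices, n = np.unique(x, return_inverse=True, return_counts=True)
b = np.zeros((len(v), np.max(n)))
b[inv_indices, cumcount(inv_indices)] = d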

Optimize finding pairs of arrays which can be compared

Definition: Array A(a1,a2,...,an) is >= B(b1,b2,...,bn) if they are equal sized and a_i >= b_i for every i from 1 to n.
For example:
[1,2,3] >= [1,2,0]
[1,2,0] not comparable with [1,0,2]
[1,0,2] >= [1,0,0]
I have a list which consists of a big number of such arrays (approx. 10000, but it can be bigger). The arrays' elements are positive integers. I need to remove all arrays from this list that are bigger than at least one of the other arrays. In other words: if there exists a B such that A >= B, then remove A.
Here is my current O(n^2) approach, which is extremely slow: I simply compare every array with all other arrays and remove it if it's bigger. Are there any ways to speed it up?
import numpy as np
import time
import random

def filter_minimal(lst):
    n = len(lst)
    to_delete = set()
    for i in xrange(n-1):
        if i in to_delete:
            continue
        for j in xrange(i+1, n):
            if j in to_delete:
                continue
            if all(lst[i] >= lst[j]):
                to_delete.add(i)
                break
            elif all(lst[i] <= lst[j]):
                to_delete.add(j)
    return [lst[i] for i in xrange(len(lst)) if i not in to_delete]

def test(number_of_arrays, size):
    x = map(np.array, [[random.randrange(0, 10) for _ in xrange(size)] for i in xrange(number_of_arrays)])
    return filter_minimal(x)

a = time.time()
result = test(400, 10)
print time.time() - a
print len(result)
P.S. I've noticed that using numpy.all instead of builtin python all slows the program dramatically. What can be the reason?
Might not be exactly what you are asking for, but this should get you started.
import numpy as np
import time
import random

def compare(x, y):
    # Reshape x to a higher dimensional array
    compare_array = x.reshape(-1, 1, x.shape[-1])
    # You can now compare every x with every y element-wise simultaneously
    mask = (y >= compare_array)
    # Create a mask that first ensures that all elements of y are greater than x and
    # then ensures that this is the case at least once.
    mask = np.any(np.all(mask, axis=-1), axis=-1)
    # Apply this mask to x
    return x[mask]

def test(number_of_arrays, size, maxval):
    # Create arrays of size (number_of_arrays, size) with maximum value maxval.
    x = np.random.randint(maxval, size=(number_of_arrays, size))
    y = np.random.randint(maxval, size=(number_of_arrays, size))
    return compare(x, y)

print test(50, 10, 20)
First of all we need to carefully check the objective. Is it true that we delete any array that is >= ANY of the other arrays, even the deleted ones? For example, if A > B and C > A and B = C, do we need to delete only A, or both A and C? If we only need to delete INCOMPATIBLE arrays, it is a much harder problem, because different partitions of the set of arrays may be compatible, so you have the problem of finding the largest valid partition.
Assuming the easy problem, a better way to define it is that you want to KEEP every array which has at least one element < the corresponding element in ALL the other arrays. (In the hard problem it is the corresponding element in the other KEPT arrays; we will not consider this.)
Stage 1
To solve this problem what you do is arrange the arrays in columns and then sort each row while maintaining the key to the array and the mapping of each array-row to position (POSITION lists). For example, you might end up with a result in stage 1 like this:
row 1: B C D A E
row 2: C A E B D
row 3: E D B C A
Meaning that for the first element (row 1) array B has a value >= C, C >= D, etc.
Now, sort and iterate the last column of this matrix ({E D A} in the example). For each item, check if the element is less than the previous element in its row. For example, in row 1, you would check if E < A. If this is true you return immediately and keep the result. For example, if E_row1 < A_row1 then you can keep array E. Only if the values in the row are equal do you need to do a stage 2 test (see below).
In the example shown you would keep E, D, A (as long as they passed the test above).
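As a rough illustration, one possible reading of the Stage 1 setup in NumPy (ties and the Stage 2 comparisons are not handled here; stage1 is a hypothetical name):
import numpy as np

def stage1(arrays):
    # Arrange the arrays as columns: m[r, j] is element r of array j
    m = np.column_stack(arrays)
    # For every row, the array indices ordered by descending value
    # (first column = array with the largest value in that row)
    order = np.argsort(m, axis=1)[:, ::-1]
    # POSITION lists: position[r, j] = where array j ended up in row r's ordering
    position = np.argsort(order, axis=1)
    return order, position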
Stage 2
This leaves B and C. Sort the POSITION list for each. For example, this will tell you that the row with B's minimum position is row 2. Now do a direct comparison between B and every array below it in the minimum row, here row 2. Here there is only one such array, D. Do a direct comparison between B and D. This shows that B < D in row 3, therefore B is compatible with D. If the item is compatible with every array below its minimum position, keep it. We keep B.
Now we do the same thing for C. In C's case we need only do one direct comparison, with A. C dominates A so we do not keep C.
Note that in addition to testing items that did not appear in the last column we need to test items that had equality in Stage 1. For example, imagine D=A=E in row 1. In this case we would have to do direct comparisons for every equality involving the array in the last column. So, in this case we direct compare E to A and E to D. This shows that E dominates D, so E is not kept.
The final result is we keep A, B, and D. C and E are discarded.
The overall performance of this algorithm is O(n^2 log n) for Stage 1, plus between O(n) (lower bound) and O(n log n) (upper bound) for Stage 2. So the maximum running time is O(n^2 log n + n log n) and the minimum running time is O(n^2 log n + n). Note that the running time of your algorithm is cubic, O(n^3): you compare every pair of arrays (n*n pairs) and each comparison takes n element comparisons, giving n*n*n.
In general, this will be much faster than the brute force approach. Most of the time will be spent sorting the original matrix, a more or less unavoidable task. Note that you could potentially improve my algorithm by using priority queues instead of sorting, but the resulting algorithm would be much more complicated.

Better algorithm (than using a dict) for enumerating pairs with a given sum.

Given a number, I have to find out all possible index-pairs in a given array whose sum equals that number. I am currently using the following algo:
def myfunc(array,num):
    dic = {}
    for x in xrange(len(array)): # if 6 is the current key,
        if dic.has_key(num-array[x]): #look at whether num-x is there in dic
            for y in dic[num-array[x]]: #if yes, print all key-pair values
                print (x,y),
        if dic.has_key(array[x]): #check whether the current keyed value exists
            dic[array[x]].append(x) #if so, append the index to the list of indexes for that keyed value
        else:
            dic[array[x]] = [x] #else create a new array
Will this run in O(N) time? If not, then what should be done to make it so? And in any case, will it be possible to make it run in O(N) time without using any auxiliary data structure?
Will this run in O(N) time?
Yes and no. The complexity is actually O(N + M) where M is the output size.
Unfortunately, the output size is O(N^2) in the worst case; for example, for the array [3,3,3,3,3,...,3] and number == 6, a quadratic number of pairs has to be produced.
However, asymptotically speaking, it cannot be done better than this, because the algorithm is linear in the input size plus the output size.
Very, very simple solution that actually does run in O(N) time by using array references. If you want to enumerate all the output pairs, then of course (as amit notes) it must take O(N^2) in the worst case.
from collections import defaultdict

def findpairs(arr, target):
    flip = defaultdict(list)
    for i, j in enumerate(arr):
        flip[j].append(i)
    for i, j in enumerate(arr):
        if target-j in flip:
            yield i, flip[target-j]
Postprocessing to get all of the output values (and filter out (i,i) answers):
def allpairs(arr, target):
    for i, js in findpairs(arr, target):
        for j in js:
            if i < j: yield (i, j)
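For example, with made-up values:
pairs = list(allpairs([1, 5, 3, 3], 6))
# pairs == [(0, 1), (2, 3)]  since 1 + 5 == 6 and 3 + 3 == 6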
This might help: Optimal Algorithm needed for finding pairs divisible by a given integer k
(With a slight modification: there we are looking for all pairs whose sum is divisible by a given integer, not necessarily equal to a given number.)
