Python v-stacking in a loop

I was going through an example in this computer-vision book and was a bit surprised by the code:
descr = []
descr.append(sift.read_features_from_file(featurefiles[0])[1])
descriptors = descr[0]  # stack all features for k-means
for i in arange(1, nbr_images):
    descr.append(sift.read_features_from_file(featurefiles[i])[1])
    descriptors = vstack((descriptors, descr[i]))
To me it looks like this is copying the array over and over again and a more efficient implementation would be:
descr = []
descr.append(sift.read_features_from_file(featurefiles[0])[1])
for i in arange(1, nbr_images):
    descr.append(sift.read_features_from_file(featurefiles[i])[1])
descriptors = vstack(descr)
Or am I missing something here, and the two snippets are not equivalent? I ran a small test:
print("ATTENTION")
print(descriptors.shape)
print("ATTENTION")
print(descriptors[1:10])
And it seems the result is different?

You're absolutely right: repeatedly concatenating numpy arrays inside a loop is extremely inefficient. Concatenation always generates a copy, which becomes more and more costly as the array grows inside the loop.
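If you want to see the cost for yourself, a quick timeit comparison of the two patterns makes the difference obvious (a rough sketch, with arbitrary array sizes standing in for the per-image descriptors):
import numpy as np
from timeit import timeit

# Stand-ins for per-image descriptor arrays; sizes here are arbitrary.
chunks = [np.random.rand(100, 128) for _ in range(100)]

def loop_vstack():
    out = chunks[0]
    for c in chunks[1:]:
        out = np.vstack((out, c))  # copies the whole growing array every iteration
    return out

def single_vstack():
    return np.vstack(chunks)  # one pass, one allocation

print(timeit(loop_vstack, number=10))
print(timeit(single_vstack, number=10))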
Instead, do one of two things:
As you have done, store the intermediate values in a regular Python list and convert it to a numpy array outside the loop. Appending to a list is amortized O(1), whereas concatenating np.ndarrays is O(n+k).
If you know how large the final array will be ahead of time, you can pre-allocate it and then fill in the rows inside your for loop, e.g.:
descr = np.empty((nbr_images, nbr_features), dtype=my_dtype)
for i in range(nbr_images):
    descr[i] = sift.read_features_from_file(featurefiles[i])[1]
Another variant would be to use np.fromiter to lazily generate the array from an iterable object, for example in this recent question.
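Note that np.fromiter consumes an iterable of scalars (one value per output element) and fills a 1-D array, so it suits things like per-image counts or averages rather than stacking 2-D descriptor blocks; a minimal sketch of the pattern:
import numpy as np

# Build a 1-D array lazily from a generator, with no intermediate list.
# Passing count lets numpy allocate the result in a single step.
squares = np.fromiter((x * x for x in range(1000)), dtype=np.float64, count=1000)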

Related

efficient iterative creation of multiple numpy arrays at once

I have a file with millions of lines, each of which is a list of integers (these sublists are in the range of tens to hundreds of items). What I want is to read through the file contents once and create 3 numpy arrays -- one with the average of each sublist, one with the length of each sublist, and one which is a flattened list of all the values in all the sublists.
If I just wanted one of these things, I'd do something like:
counts = np.fromiter((len(json.loads(line.rstrip())) for line in mystream), int)
but if I write 3 of those, my code would iterate through my millions of sublists 3 times, and I obviously only want to iterate through them once. So I want to do something like this:
averages = []
counts = []
allvals = []
for line in mystream:
    sublist = json.loads(line.rstrip())
    averages.append(np.average(sublist))
    counts.append(len(sublist))
    allvals.extend(sublist)
I believe that building regular lists as above and then doing
np_averages = np.array(averages)
is very inefficient (basically creating the data twice). What is the right/efficient way to iteratively create a numpy array if it's not practical to use fromiter? Or should I write a function that returns the three values and do something like a list comprehension for a multiple-return function, but with fromiter instead of a traditional list comprehension?
Or would it be efficient to create a 2D array like
[[count1, average1, sublist1], [count2, average2, sublist2], ...]
and then do additional operations to slice off (and, in the third case, also flatten) the columns as their own 1D arrays?
First of all, the json library is not the most optimized library for this. You can use the pysimdjson package, based on the optimized simdjson library, to speed up the parsing. For small integer lists, it is about twice as fast on my machine.
Moreover, Numpy functions are not great for relatively small arrays, as they introduce a pretty big overhead. For example, np.average takes about 8-10 us on my machine on an array of 20 items, whereas sum(sublist)/len(sublist) only takes 0.25-0.30 us.
Finally, np.array needs to iterate twice to convert the list into an array because it does not know the type of all the objects in advance. You can specify the type to make the conversion faster: np.array(averages, np.float64).
Here is a significantly faster implementation:
import simdjson
import numpy as np

averages = []
counts = []
allvals = []
for line in mystream:
    sublist = simdjson.loads(line.rstrip())
    averages.append(sum(sublist) / len(sublist))
    counts.append(len(sublist))
    allvals.extend(sublist)
np_averages = np.array(averages, np.float64)
One issue with this implementation is that allvals will contain all the values in the form of a big list of Python objects. CPython objects are quite big in memory compared to native Numpy integers (especially compared to 32-bit = 4-byte integers): each object usually takes 32 bytes and the reference in the list usually takes another 8 bytes, resulting in about 40 bytes per item, that is to say 10 times more than a Numpy array of 32-bit integers. Thus, it may be better to use a native implementation, possibly based on Cython.
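If a full Cython extension is overkill, one possible middle ground (my suggestion, not part of the answer above) is the standard-library array module, which stores native integers during the loop instead of Python objects:
from array import array
import numpy as np

allvals = array('i')  # typecode 'i' is a C int, typically 4 bytes per item
for sublist in ([1, 2, 3], [4, 5]):  # stand-in for the parsed sublists
    allvals.extend(sublist)

# numpy can wrap the existing buffer without copying (the dtype must match the typecode).
np_allvals = np.frombuffer(allvals, dtype=np.int32)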

Memory problems for multiple large arrays

I'm trying to do some calculations on over 1000 (100, 100, 1000) arrays. But as you can imagine, it doesn't take more than about 150-200 arrays before my memory is used up and it all fails (at least with my current code).
This is what I currently have now:
import numpy as np
toxicity_data_path = open("data/toxicity.txt", "r")
toxicity_data = np.array(toxicity_data_path.read().split("\n"), dtype=int)
patients = range(1, 1000, 1)
The above is just a list of 1's and 0's (indicating toxicity or not) for each array (in this case one array is data for one patient). So in this case roughly 1000 patients.
I then create two lists from the above code so I have one list with patients having toxicity and one where they have not.
patients_no_tox = [i for i, e in enumerate(toxicity_data.astype(np.str)) if e in set("0")]
patients_with_tox = [i for i, e in enumerate(toxicity_data.astype(np.str)) if e in set("1")]
I then write this function, which loads an already saved-to-disk (100, 100, 1000) array for each patient and then removes some indexes (also loaded from a saved file) from each array; these values will not work later on or just need to be removed, so it is essential to do this. The result is a final list of all patients and their 3D arrays of data. This is where things start to eat memory, when the function is used in the list comprehension.
def log_likely_list(patient, remove_index_list):
    array_data = np.load("data/{}/array.npy".format(patient)).ravel()
    return np.delete(array_data, remove_index_list)

remove_index_list = np.load("data/remove_index_list.npy")
final_list = [log_likely_list(patient, remove_index_list) for patient in patients]
The next step is to create the two lists I need for my calculations. I take the final list with all the patients and remove either the patients with toxicity or those without, respectively.
patients_no_tox_list = np.column_stack(np.delete(final_list, patients_with_tox, 0))
patients_with_tox_list = np.column_stack(np.delete(final_list, patients_no_tox, 0))
The last piece of the puzzle is to use these two lists in the following equation, with the no-tox list on the right side of the equation and the with-tox list on the left side. For each individual index it sums over all 1000 patients (i.e. the same index in each patient's 3D array), so I end up with one large array of values.
log_likely = np.sum(np.log(patients_with_tox_list), axis=1) + \
             np.sum(np.log(1 - patients_no_tox_list), axis=1)
My problem, as stated, is that when I get to around 150-200 (in the patients range) my memory is used up and it shuts down.
I have obviously tried to save stuff to disk and load it back (that's why I load so many files), but that didn't help me much. I'm thinking maybe I could feed one array at a time into the log_likely function, but in the end, before summing, I would probably have just as large an array anyway; plus, the computation might be a lot slower if I can't use numpy's sum and the like.
So is there any way I could optimize/improve on this, or is the only way to buy a hell of a lot more RAM?
Each time you use a list comprehension, you create a new copy of the data in memory. So this line:
final_list = [log_likely_list(patient, remove_index_list) for patient in patients]
contains the complete data for all 1000 patients!
The better choice is to utilize generator expressions, which process items one at a time. To form a generator expression, surround your for ... in ... expression with parentheses instead of brackets. This might look something like:
import itertools
import functools

with_tox_data = (log_likely_list(patient, remove_index_list) for patient in patients_with_tox)
with_tox_log = (np.log(data) for data in with_tox_data)
no_tox_data = (log_likely_list(patient, remove_index_list) for patient in patients_no_tox)
no_tox_log = (np.log(1 - data) for data in no_tox_data)
final_data = itertools.chain(with_tox_log, no_tox_log)
Note that no computations have actually been performed yet: generators don't do anything until you iterate over them. The fastest way to aggregate all the results in this case is to use reduce:
log_likely = functools.reduce(np.add, final_data)
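For clarity, the reduce call just keeps a running element-wise sum, so only one patient's array plus the accumulator is in memory at any time; an explicit-loop equivalent (with a tiny stand-in generator in place of the final_data defined above) would be:
import numpy as np

final_data = (np.log(np.array([0.5, 0.8, 0.9])) for _ in range(3))  # stand-in for the chained generators

log_likely = None
for data in final_data:
    # accumulate element-wise, just like functools.reduce(np.add, final_data)
    log_likely = data if log_likely is None else log_likely + data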

Creating variable number of variables in python

I am trying to create a variable number of variables (arrays) in python.
I have a database from experiments and I am extracting data from it. I do not have control over the database or how the data is written. I am extracting the data in the form of a table: the first column (the zeroth, from python's perspective) has location ids and the subsequent columns have readings over several iterations. The location ids span millions of rows. So I read over the database and create this giant table.
In the next step, I loop over column indexes 1 to n (the 0th column has the locations) and I am trying to get this: if the difference between 2 readings is more than 0.001, then write the location id to an array.
if (A[i][j+1] - A[i][j]) > 0.001:  # 1 <= j <= n, 0 <= i <= max rows in the table
    arr1[m][n] = A[i][0]           # write A[i][0], i.e. the location id, to an array
Problem: this requires creating a dynamic number of variables like arr1. I am storing the result of each loop iteration in an array, and the number of columns j is known only at runtime. So how can I create a variable number of variables like arr1? Secondly, each of these variables can have a different size.
I took a look at similar questions, but multi-dimensional arrays won't work, as each arr1 can have a different size. Also, performance is important, so I am guessing numpy arrays would be better. I am guessing that a dictionary would be slow for such a huge amount of data.
I didn't understand much from your explanation of the problem, but from what you wrote it sounds like a normal list would do the job:
arr1 = []
if (your condition here):
    arr1.append(A[i][0])
Memory management is dynamic, i.e. the list allocates new memory as needed, and afterwards, if you need a numpy array, just do numpy_array = np.asarray(arr1).
A (very) small primer on lists in python:
A list in python is a mutable container that stores references to objects of any kind. Unlike a C++ array, a python list can hold items of any type, and you don't have to specify its size when you define it.
In the example above, arr1 is initially defined as empty and every time you call arr1.append() a new reference to A[i][0] is pushed at the end of the list.
For example:
a = []
a.append(1)
a.append('a string')
b = {'dict_key':'my value'}
a.append(b)
print(a)
displays:
[1, 'a string', {'dict_key': 'my value'}]
As you can see, the list doesn't really care what you append: it stores a reference to the item and increases its size by 1.
I strongly suggest you take a look at the data structures documentation for further insight into how lists work and some of their caveats.
I took a look at similar questions, but multi dimension arrays won't work as each arr1 can have different size.
-- but a list of arrays will work, because items in a list can be anything, including arrays of different sizes.
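Concretely, instead of trying to synthesize names like arr1, arr2, ..., you can keep one list per pair of adjacent reading columns and index into it; a minimal sketch, with a toy table standing in for the extracted data:
import numpy as np

# Toy stand-in: column 0 is the location id, the rest are readings per iteration.
A = np.array([[101, 0.10, 0.10, 0.30],
              [102, 0.10, 0.25, 0.25],
              [103, 0.50, 0.50, 0.51]])

n_cols = A.shape[1]
ids_per_pair = [[] for _ in range(n_cols - 2)]  # one list per pair of adjacent reading columns

for j in range(1, n_cols - 1):  # compare reading column j with column j + 1
    for i in range(A.shape[0]):
        if A[i][j + 1] - A[i][j] > 0.001:
            ids_per_pair[j - 1].append(A[i][0])

# Each list can have a different length; convert only when an array is needed.
arrays = [np.asarray(lst) for lst in ids_per_pair]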

Array operations using multiple indices of same array

I am very new to Python, and I am trying to get used to performing Python's array operations rather than looping through arrays. Below is an example of the kind of looping operation I am doing, but I am unable to work out a suitable pure array operation that does not rely on loops:
import numpy as np

def f(arg1, arg2):
    # an arbitrary function of two elements
    ...

def myFunction(a1DNumpyArray):
    A = a1DNumpyArray
    # Create a square array with each dimension the size of the argument array.
    B = np.zeros((A.size, A.size))
    # Function f is a function of two elements of the 1D array. For each
    # element, i, I want to perform the function on it and every element
    # before it, and store the result in the square array, multiplied by
    # the difference between the ith and (i-1)th element.
    for i in range(A.size):
        B[i, :i] = f(A[i], A[:i]) * (A[i] - A[i-1])
    # Sum through j and return full sums as 1D array.
    return np.sum(B, axis=0)
In short, I am integrating a function which takes two elements of the same array as arguments, returning an array of results of the integral.
Is there a more compact way to do this, without using loops?
The use of an arbitrary function f, and this [i, :i] business, complicates bypassing a loop.
Most of the fast compiled numpy operations work on the whole array, or whole rows and/or columns, and effectively do so in parallel. Loops that are inherently sequential (value from one loop depends on the previous) don't fit well. And different size lists or arrays in each loop are also a good indicator that 'vectorizing' will be difficult.
for i in range(A.size):
    B[i, :i] = f(A[i], A[:i]) * (A[i] - A[i-1])
With a sample A and known f (as simple as arg1*arg2), I'd generate a B array, and look for patterns that treat B as a whole. At first glance it looks like your B is a lower triangle. There are functions to help index those. But that final sum might change the picture.
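For a concrete f the loop can indeed be rewritten as whole-array operations; a sketch for the simple f(arg1, arg2) = arg1*arg2 case mentioned above, using an outer product and np.tril for the lower triangle (this is my reading of the loop, not the asker's code):
import numpy as np

A = np.array([0.0, 0.5, 1.2, 2.0])

pairs = np.outer(A, A)                     # f(A[i], A[j]) for every pair, with f = arg1*arg2
diffs = A - np.roll(A, 1)                  # A[i] - A[i-1]; i == 0 wraps to A[-1], as in the loop
B = np.tril(pairs * diffs[:, None], k=-1)  # keep only j < i, matching B[i, :i]

result = np.sum(B, axis=0)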
Sometimes I tackle these problems with a bottom up approach, trying to remove inner loops first. But in this case, I think some sort of big-picture approach is needed.

using lists for storing data inside loop and conversion to np.array

Quite often I do something like:
data = []
for i in range(number_of_components):
    d = some_calculation()  # output may change size
    data.append(d)
data = np.asarray(data)
It is very convenient to store the data inside a list, especially if the data may change its size. Quite often I work with numpy arrays, and I find it easier when every function returns either an array or a list, so I end up with constructs like this. Is there a better solution?
Clearly this only works if d has the same length for every i within a given loop (though, as you suggest, it can change between runs). I often encounter such a scenario, and when memory etc. is at a premium I just create the array on the first iteration.
Something like:
data = None
for i in range(number_of_components):
    d = some_calculation()
    if data is None:
        data = np.empty((number_of_components,) + d.shape, d.dtype)
    data[i, ...] = d
This certainly avoids a superfluous interim list.
