I have a file with millions of lines, each of which is a list of integers (these sublists are in the range of tens to hundreds of items). What I want is to read through the file contents once and create 3 numpy arrays -- one with the average of each sublist, one with the length of each sublist, and one which is a flattened list of all the values in all the sublists.
If I just wanted one of these things, I'd do something like:
counts = np.fromiter((len(json.loads(line.rstrip())) for line in mystream), int)
but if I write 3 of those, my code would iterate through my millions of sublists 3 times, and I obviously only want to iterate through them once. So I want to do something like this:
averages = []
counts = []
allvals = []
for line in mystream:
    sublist = json.loads(line.rstrip())
    averages.append(np.average(sublist))
    counts.append(len(sublist))
    allvals.extend(sublist)
I believe that creating regular lists as above and then doing
np_averages = np.array(averages)
is very inefficient (basically creating the list twice). What is the right/efficient way to iteratively create a numpy array when it's not practical to use fromiter? Or should I create a function that returns the 3 values and do something like a list comprehension over that multiple-return function, with fromiter instead of a traditional list comprehension?
Or would it be efficient to create a 2D array of
[[count1, average1, sublist1], [count2, average2, sublist2], ...] and then doing additional operations to slice off (and in the 3rd case also flatten) the columns as their own 1D arrays?
First of all, the json library is not the most optimized library for this. You can use the pysimdjson package, based on the optimized simdjson library, to speed up the parsing. For small integer lists, it is about twice as fast on my machine.
Moreover, Numpy functions are not great for relatively small arrays as they introduce a pretty big overhead. For example, np.average takes about 8-10 us on my machine to average an array of 20 items. Meanwhile, sum(sublist)/len(sublist) only takes 0.25-0.30 us.
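If you want to reproduce that comparison on your own machine, here is a minimal timing sketch (the exact numbers are machine-dependent, and the 20-item list is just an illustrative size):

import timeit
import numpy as np

sublist = list(range(20))          # stand-in for one parsed line
arr = np.array(sublist)

n = 100_000
t_np = timeit.timeit(lambda: np.average(arr), number=n)
t_py = timeit.timeit(lambda: sum(sublist) / len(sublist), number=n)

print(f"np.average:  {t_np / n * 1e6:.2f} us per call")
print(f"sum()/len(): {t_py / n * 1e6:.2f} us per call")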
Finally, np.array needs to iterate twice to convert the list into an array because it does not know the type of all the items in advance. You can specify the dtype to make the conversion faster: np.array(averages, np.float64).
Here is a significantly faster implementation:
import simdjson
import numpy as np

averages = []
counts = []
allvals = []
for line in mystream:
    sublist = simdjson.loads(line.rstrip())
    averages.append(sum(sublist) / len(sublist))
    counts.append(len(sublist))
    allvals.extend(sublist)
np_averages = np.array(averages, np.float64)
One issue with this implementation is that allvals will contain all the values in the form of a big list of Python objects. CPython objects are quite big in memory compared to native Numpy integers (especially compared to 4-byte 32-bit integers): each object usually takes 32 bytes and the reference in the list usually takes another 8 bytes, resulting in 40 bytes per item, that is to say 10 times more than a Numpy array of 32-bit integers. Thus, it may be better to use a native implementation, possibly based on Cython.
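If a Cython extension is not an option, a middle-ground sketch (assuming the values fit in 32-bit integers) is to store each parsed line as a compact int32 array and concatenate once at the end, which avoids keeping millions of Python int objects alive:

import json
import numpy as np

# In-memory stand-in for the real file stream (hypothetical data).
mystream = ["[1, 2, 3]\n", "[10, 20, 30, 40]\n"]

averages = []
counts = []
chunks = []  # one compact int32 array per line, instead of Python int objects

for line in mystream:
    sublist = json.loads(line.rstrip())
    n = len(sublist)
    counts.append(n)
    averages.append(sum(sublist) / n)
    chunks.append(np.fromiter(sublist, dtype=np.int32, count=n))

np_averages = np.array(averages, np.float64)
np_counts = np.array(counts, np.int64)
np_allvals = np.concatenate(chunks)  # single copy at the end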
I have two (very long) lists. I want to find the sum of the minimum of each pair across the lists. E.g., if
X = [2,3,4]
Y = [5,4,2]
then, the sum would be 2+3+2 = 7.
At the moment, I'm doing this by zipping the lists and using a list comprehension. My lists are X and Y:
mins = [min(x,y) for x,y in zip(X,Y)]
summed_mins = sum(mins)
This is causing serious runtime issues in my program. Is there a faster way to do this? List comprehensions are the fastest that I know of.
You can use Python generators and the built-in map function to avoid the creation of the list, but this will probably be just slightly faster (thanks to Veedrac):
summed_mins = sum(map(min, X, Y))
Alternatively, you can use Numpy. Here is how:
summed_mins = np.stack((X, Y)).min(axis=0).sum()
If you can store the input lists directly as Numpy arrays, this can be much faster.
If you can even store them directly in a single 2D Numpy array, you don't need the np.stack call, resulting in much faster code.
If you cannot store/create the input directly as Numpy arrays, you can create the Numpy arrays on the fly quickly by specifying the data type (assuming you are sure the lists contain small integers). Here is an example:
summed_mins = np.stack((np.array(X, np.int64), np.array(Y, np.int64))).min(axis=0).sum()
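For completeness, a minimal end-to-end sketch tying the pieces together (np.minimum is simply the element-wise equivalent of stacking and taking the min along axis 0):

import numpy as np

X = [2, 3, 4]
Y = [5, 4, 2]

# Pure-Python version
summed_mins = sum(map(min, X, Y))                  # 7

# NumPy version: convert once with an explicit dtype, then reduce
x = np.array(X, np.int64)
y = np.array(Y, np.int64)
summed_mins_np = int(np.minimum(x, y).sum())       # 7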
I am trying to create a variable number of variables (arrays) in python.
I have a database from experiments and I am extracting data from it. I do not have control over the database or how the data is written. I am extracting data in the form of a table: the first column (the zeroth from Python's perspective) has location ids and subsequent columns have readings over several iterations. The location ids (in the 0th column) span millions of rows, and the readings of the iterations are captured in the subsequent columns. So I read over the database and create this giant table.
In the next step, I loop over column indexes 1 to n (the 0th column has locations) and I am trying to get this: if the difference between two readings is more than 0.001, then write the location id to an array.
if (A[i][j+1] - A[i][j]) > 0.001:   # 1 <= j <= n, 0 <= i <= max rows of the table
    arr1[m][n] = A[i][0]            # write A[i][0], i.e. the location id, to an array
Problem: this creates a dynamic number of variables like arr1. I am storing the result of each loop iteration in an array, and the number of columns j is known only at runtime. So how can I create a variable number of variables like arr1? Secondly, each of these variables can have a different size.
I took a look at similar questions, but multi-dimensional arrays won't work as each arr1 can have a different size. Also, performance is important, so I am guessing numpy arrays would be better, and that a dictionary would be too slow for such a huge amount of data.
I didn't understand much from your explanation of the problem, but from what you wrote it sounds like a normal list would do the job:
arr1 = []
if (your condition here):
    arr1.append(A[i][0])
Memory management is dynamic, i.e. the list allocates new memory as it is needed, and afterwards, if you need a numpy array, just make numpy_array = np.asarray(arr1).
A (very) small primer on lists in python:
A list in python is a mutable container that stores references to objects of any kind. Unlike a C++ array, the items in a python list can be anything, and you don't have to specify the list size when you define it.
In the example above, arr1 is initially defined as empty and every time you call arr1.append() a new reference to A[i][0] is pushed at the end of the list.
For example:
a = []
a.append(1)
a.append('a string')
b = {'dict_key':'my value'}
a.append(b)
print(a)
displays:
[1, 'a string', {'dict_key': 'my value'}]
As you can see, the list doesn't really care what you append; it will store a reference to the item and increase its size by 1.
I strongly suggest you take a look at the data structures documentation for further insight on how lists work and some of their caveats.
"I took a look at similar questions, but multi dimension arrays won't work as each arr1 can have different size" -- but a list of arrays will work, because items in a list can be anything, including arrays of different sizes.
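As a concrete sketch of that idea (the table A below is made-up stand-in data, and the comparison uses a vectorized mask instead of the explicit double loop):

import numpy as np

# Hypothetical stand-in for the extracted table:
# column 0 holds location ids, columns 1..n hold readings.
A = np.random.rand(1000, 5)
A[:, 0] = np.arange(1000)  # fake location ids

hit_arrays = []  # one array per pair of consecutive reading columns
for j in range(1, A.shape[1] - 1):
    mask = (A[:, j + 1] - A[:, j]) > 0.001   # rows where the difference exceeds the threshold
    hit_arrays.append(A[mask, 0])            # location ids for those rows

# hit_arrays is a plain list of 1D NumPy arrays; each can have a different length.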
I was going through an example in this computer-vision book and was a bit surprised by the code:
descr = []
descr.append(sift.read_features_from_file(featurefiles[0])[1])
descriptors = descr[0]  # stack all features for k-means
for i in arange(1, nbr_images):
    descr.append(sift.read_features_from_file(featurefiles[i])[1])
    descriptors = vstack((descriptors, descr[i]))
To me it looks like this is copying the array over and over again and a more efficient implementation would be:
descr = []
descr.append(sift.read_features_from_file(featurefiles[0])[1])
for i in arange(1, nbr_images):
    descr.append(sift.read_features_from_file(featurefiles[i])[1])
descriptors = vstack(descr)
Or am I missing something here and the two versions are not identical? I ran a small test:
print("ATTENTION")
print(descriptors.shape)
print("ATTENTION")
print(descriptors[1:10])
And it seems the list is different?
You're absolutely right - repeatedly concatenating numpy arrays inside a loop is extremely inefficient. Concatenation always generates a copy, which becomes more and more costly as your array gets bigger and bigger inside the loop.
Instead, do one of two things:
As you have done, store the intermediate values in a regular Python list and convert this to a numpy array outside the loop. Appending to a list is O(1), whereas concatenating np.ndarrays is O(n+k).
If you know how large the final array will be ahead of time, you can pre-allocate it and then fill in the rows inside your for loop, e.g.:
descr = np.empty((nbr_images, nbr_features), dtype=my_dtype)
for i in range(nbr_images):
    descr[i] = sift.read_features_from_file(featurefiles[i])[1]
Another variant would be to use np.fromiter to lazily generate the array from an iterable object, for example in this recent question.
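For reference, here is a minimal sketch of option 1 with synthetic data standing in for the SIFT features (the shapes and names are made up for illustration):

import numpy as np

nbr_images, nbr_features = 50, 128

rows = []
for i in range(nbr_images):
    # stand-in for sift.read_features_from_file(featurefiles[i])[1]
    rows.append(np.random.rand(nbr_features))

# Single copy at the end, shape (nbr_images, nbr_features)
descriptors = np.vstack(rows)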
The actual problem I have is that I want to store a long sorted list of (float, str) tuples in RAM. A plain list doesn't fit in my 4 GB of RAM, so I thought I could use two numpy.ndarrays.
The source of the data is an iterable of 2-tuples. numpy has a fromiter function, but how can I use it? The number of items in the iterable is unknown. I can't consume it to a list first due to memory limitations. I thought of itertools.tee, but it seems to add a lot of memory overhead here.
What I guess I could do is consume the iterator in chunks and add those to the arrays. Then my question is, how to do that efficiently? Should I maybe make 2 2D arrays and add rows to them? (Then later I'd need to convert them to 1D).
Or maybe there's a better approach? Everything I really need is to search through an array of strings by the value of the corresponding number in logarithmic time (that's why I want to sort by the value of float) and to keep it as compact as possible.
P.S. The iterable is not sorted.
Perhaps build a single, structured array using np.fromiter:
import numpy as np
def gendata():
    # You, of course, have a different gendata...
    for i in xrange(N):
        yield (np.random.random(), str(i))
N = 100
arr = np.fromiter(gendata(), dtype='<f8,|S20')
Sorting it by the first column, using the second for tie-breakers, will take O(N log N) time:
arr.sort(order=['f0','f1'])
Finding the row by the value in the first column can be done with searchsorted in O(log N) time:
# Some pseudo-random value in arr['f0']
val = arr['f0'][10]
print(arr[10])
# (0.049875262239617246, '46')
idx = arr['f0'].searchsorted(val)
print(arr[idx])
# (0.049875262239617246, '46')
You've asked many important questions in the comments; let me attempt to answer them here:
The basic dtypes are explained in the numpybook. There may be one or two extra dtypes (like float16) which have been added since that book was written, but the basics are all explained there. Perhaps a more thorough discussion is in the online documentation, which is a good supplement to the examples you mentioned here.
Dtypes can be used to define structured arrays with column names, or with default column names. 'f0', 'f1', etc. are the default column names. Since I defined the dtype as '<f8,|S20' without providing column names, NumPy named the first column 'f0' and the second 'f1'. If we had used
dtype=[('fval','<f8'), ('text','|S20')]
then the structured array arr would have column names 'fval' and 'text'.
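Here is a small sketch of that named variant (the field values are chosen arbitrarily):

import numpy as np

dt = np.dtype([('fval', '<f8'), ('text', '|S20')])
arr = np.array([(0.25, b'foo'), (0.75, b'bar')], dtype=dt)

print(arr['fval'])   # [0.25 0.75]
print(arr['text'])   # [b'foo' b'bar']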
Unfortunately, the dtype has to be fixed at the time np.fromiter is called. You could conceivably iterate through gendata once to discover the maximum length of the strings, build your dtype, and then call np.fromiter (and iterate through gendata a second time), but that's rather burdensome. It is of course better if you know in advance the maximum size of the strings. (|S20 defines the string field as having a fixed length of 20 bytes.)
NumPy arrays place data of a pre-defined size in arrays of a fixed size. Think of the array (even a multidimensional one) as a contiguous block of one-dimensional memory. (That's an oversimplification -- there are non-contiguous arrays -- but it will help your imagination for the following.) NumPy derives much of its speed by taking advantage of the fixed sizes (set by the dtype) to quickly compute the offsets needed to access elements in the array. If the strings had variable sizes, then it would be hard for NumPy to find the right offsets. By hard, I mean NumPy would need an index or would somehow have to be redesigned. NumPy is simply not built this way.
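A small illustration of that fixed-size layout, using the structured dtype from above (the numbers depend only on the dtype):

import numpy as np

arr = np.zeros(5, dtype=[('fval', '<f8'), ('text', '|S20')])
print(arr.itemsize)   # 28: 8 bytes for the float + 20 for the string
print(arr.strides)    # (28,): element i starts at byte offset i * 28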
NumPy does have an object dtype which lets you store a pointer to any Python object you desire. This way, you can have NumPy arrays with arbitrary Python data. Unfortunately, the np.fromiter function does not allow you to create arrays of dtype object. I'm not sure why there is this restriction...
Note that np.fromiter has better performance when the count is specified. By knowing the count (the number of rows) and the dtype (and thus the size of each row), NumPy can pre-allocate exactly enough memory for the resultant array. If you do not specify the count, then NumPy will make a guess for the initial size of the array, and if that is too small, it will try to resize the array. If the original block of memory can be extended you are in luck. But if NumPy has to allocate an entirely new hunk of memory then all the old data will have to be copied to the new location, which will slow down the performance significantly.
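For example, a quick sketch of passing count when the number of items is known up front (a simple scalar dtype is used here just to keep the example short):

import numpy as np

n = 1_000_000
gen = (i * i for i in range(n))

# With count, NumPy allocates exactly n elements up front -- no resizing.
squares = np.fromiter(gen, dtype=np.int64, count=n)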
Here is a way to build N separate arrays out of a generator of N-tuples:
import numpy as np
import itertools as IT
def gendata():
    # You, of course, have a different gendata...
    N = 100
    for i in xrange(N):
        yield (np.random.random(), str(i))

def fromiter(iterable, dtype, chunksize=7):
    chunk = np.fromiter(IT.islice(iterable, chunksize), dtype=dtype)
    result = [chunk[name].copy() for name in chunk.dtype.names]
    size = len(chunk)
    while True:
        chunk = np.fromiter(IT.islice(iterable, chunksize), dtype=dtype)
        N = len(chunk)
        if N == 0:
            break
        newsize = size + N
        for arr, name in zip(result, chunk.dtype.names):
            col = chunk[name]
            arr.resize(newsize, refcheck=0)
            arr[size:] = col
        size = newsize
    return result
x, y = fromiter(gendata(), '<f8,|S20')
order = np.argsort(x)
x = x[order]
y = y[order]
# Some pseudo-random value in x
N = 10
val = x[N]
print(x[N], y[N])
# (0.049875262239617246, '46')
idx = x.searchsorted(val)
print(x[idx], y[idx])
# (0.049875262239617246, '46')
The fromiter function above reads the iterable in chunks (of size chunksize). It calls the NumPy array method resize to extend the resultant arrays as necessary.
I used a small default chunksize since I was testing this code on small data. You, of course, will want to either change the default chunksize or pass a chunksize parameter with a larger value.
I have a master array of length n of id numbers that apply to other analogous arrays with corresponding data for elements in my simulation that belong to those id numbers (e.g. data[id]). Were I to generate a list of id numbers of length m separately and need the information in the data array for those ids, what is the best method of getting a list of indices idx of the original array of ids in order to extract data[idx]? That is, given:
a=numpy.array([1,3,4,5,6]) # master array
b=numpy.array([3,4,3,6,4,1,5]) # secondary array
I would like to generate
idx=numpy.array([1,2,1,4,2,0,3])
The array a is typically in sequential order but it's not a requirement. Also, array b will most definitely have repeats and will not be in any order.
My current method of doing this is:
idx=numpy.array([numpy.where(a==bi)[0][0] for bi in b])
I timed it using the following test:
a=(numpy.random.uniform(100,size=100)).astype('int')
b=numpy.repeat(a,100)
timeit method1(a,b)
10 loops, best of 3: 53.1 ms per loop
Is there a better way of doing this?
The current way you are doing it, with where, searches through the whole array a each time. You can make this look-up O(1) instead of O(N) using a dict. For instance, I used the following method:
def method2(a, b):
    tmpdict = dict(zip(a, range(len(a))))
    idx = numpy.array([tmpdict[bi] for bi in b])
    return idx
and got a very large speed-up, which will be even better for larger arrays. For the sizes in your example code, I got a speed-up of 15x. The only problem with my code is that if there are repeated elements in a, the dict will point to the last instance of each element, while with your method it points to the first instance. However, that can be remedied if there are going to be repeated elements in the actual usage of the code.
I'm not sure if there is a way to do this automatically in python, but you're probably best off sorting the two arrays and then generating your output in one pass through b. The complexity of that operation should be O(|a|*log|a|)+O(|b|*log|b|)+O(|b|) = O(|b|*log|b|) (assuming |b| > |a|). I believe your original try has complexity O(|a|*|b|), so this should provide a noticeable improvement for a sufficiently large b.
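As a sketch of that sort-then-lookup idea in NumPy (not from the original answers): sort a once, binary-search every element of b with searchsorted, and map back through the sort order:

import numpy as np

a = np.array([1, 3, 4, 5, 6])            # master array
b = np.array([3, 4, 3, 6, 4, 1, 5])      # secondary array

order = np.argsort(a)                    # handles an unsorted a as well
pos = np.searchsorted(a[order], b)       # O(|b| log |a|) binary searches
idx = order[pos]                         # indices back into the original a
                                         # (assumes every value in b occurs in a)
print(idx)   # [1 2 1 4 2 0 3]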