When are generators converted to lists in Dask?

In Dask, when do generators get converted to lists, or are they generally consumed lazily?
For example, with the code:
from collections import Counter
import numpy as np
import dask.bag as db

def foo(n):
    for _ in range(n):
        yield np.random.randint(10)

def add_to_count(acc, x):
    acc.update(x)
    return acc

def add(x, y):
    return x + y

b1 = db.from_sequence([1, 2, 3, 4, 5])
b2 = b1.map(foo)
result = b2.fold(add_to_count, add, Counter())
I get the following output, where the generators
have (reasonably) been converted to lists for me to inspect:
>>> b2.compute()
[[5], [5, 6], [3, 6, 1], [5, 6, 6, 0], [5, 6, 6, 0, 3]]
While reasonable, it differs from how I usually expect generators to behave in Python, which would be to require an explicit conversion to a list.
So, when computing the fold (result.compute()),
is the input argument x of add_to_count
a generator, or has it already been converted to a list?
I'm interested in the case where the lists are very long,
and so lazy evaluation is more efficient, say,
b1 = db.from_sequence([10**6]*10).
I'm guessing I could also solve the above problem with bag.frequencies, but I have similar concerns about lazy evaluation and efficient reduction.
Is there a fundamental aspect of Dask that I'm not grokking, or am I just being lazy, and where could I have looked into the code to figure this out myself?

Not exactly appropriate, but I'll provide the answer to a slightly different question:
Dask.bag adds in defensive calls to list for you, just in case you decide to branch out and use the bag twice in a single computation:
x = b.map(func1)
y = b.map(func2)
compute(x.frequencies(), y.frequencies())
This is also useful when using backends like multiprocessing or distributed because we can't send generators across a process boundary, but can send lists.
However, these defensive calls to list are optimized away before computation when possible in an effort to promote laziness.
In summary, everything should just work the way you want when possible, but will revert to concrete non-lazy values when laziness would get in the way of correctness.
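If you want to see for yourself what arrives inside the binop, a minimal sketch (assuming dask.bag is installed; the synchronous scheduler keeps the print visible in the main process) is to print the type of x inside the combining function:

from collections import Counter
import dask.bag as db

def add_to_count(acc, x):
    print(type(x))  # shows whether each bag element arrives as a list or a generator
    acc.update(x)
    return acc

bag = db.from_sequence([1, 2, 3]).map(range)
result = bag.fold(add_to_count, lambda a, b: a + b, Counter())
print(result.compute(scheduler='synchronous'))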

Related

Efficient way of sequentially adding multiple list elements

I have multiple lists. I want to merge the elements sequentially one-by-one.
Example:
a = [1, 2, 3]
b = [4, 5, 6]
c = [7, 8, 9]
Result should be:
d = [1, 4, 7, 2, 5, 8, 3, 6, 9]
One way of doing it is:
d = []
for i, j, k in zip(a, b, c):
    d.extend([i, j, k])
Is this efficient? What is most efficient way here?
A one-liner could be
import itertools
list(itertools.chain.from_iterable(zip(a,b,c)))
A variant of your method is
d = []
for i, j, k in zip(a, b, c):
    d += [i, j, k]
Out of curiosity, I've just used timeit to compare your method, that variant, my one-liner, and the one in Olvin's comment (let's call it the compound version), and the verdict is:
yours: 1.06-1.08
my variant (with += instead of extend): 0.94-0.97
my one-liner: 1.10-1.12
Olvin's one-liner: 1.28-1.34
Sometimes, the nicest methods aren't the fastest.
Timings may change for longer lists, though.
The fact that += is faster than .extend is quite interesting (since .extend changes the list in place, while += builds a new one and then replaces the old one; instinct would say that rebuilding lists should be slower than extending them, but maybe memory management says otherwise).
But, well, so far the fastest one is my second version (with +=), which, incidentally, is also the one I find the most boring among all the solutions seen here.
Edit
Since that ranking bothered me (itertools iterators are supposed to be faster, since they are a little bit less interpreted and a little bit more compiled), I've tried with longer lists. And then it is another story:
a=list(range(1000))
b=list(range(1000,2000))
c=list(range(2000,3000))
And then the timeit verdict (with 100 times fewer runs than before) is:
Your method: 1.91
My += variant: 1.59
My one-liner: 0.98
Olvin's one-liner: 1.88
So, at least, itertools does win in the long run (with big enough data).
Victory of += over .extend is confirmed (I don't really know the internals of memory management, but, coming from the C world, I would say that sometimes a fresh malloc and copy is faster than repeated reallocs; maybe that's a naive view of what happens under the hood in Python's interpreter, but, well, += is faster than .extend for this usage in the long run).
Olvin's method is quite equivalent to yours, which surprises me a little, because it is roughly the compound version of the same thing. I would have thought that, while building up a compound list, Python could skip some steps in the intermediary representation that it could not skip in your method, where all the versions of the list (the one with just [1, 4, 7], then the one with [1, 4, 7, 2, 5, 8], etc.) do exist in the interpreter. Maybe the 0.03 difference between Olvin's method and yours is because of that (it is not just noise: with this size, the timings are quite constant, and so is the 0.03 difference). But I would have thought the difference would be higher.
But well, even if the timing differences surprise me, the ranking of the methods makes more sense with big lists: itertools > += > [compound] > .extend.
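For anyone who wants to reproduce this, here is a rough sketch of the kind of timeit comparison described above (exact numbers will of course vary by machine and Python version):

import itertools
from timeit import timeit

a = list(range(1000))
b = list(range(1000, 2000))
c = list(range(2000, 3000))

def with_extend():
    d = []
    for i, j, k in zip(a, b, c):
        d.extend([i, j, k])
    return d

def with_iadd():
    d = []
    for i, j, k in zip(a, b, c):
        d += [i, j, k]
    return d

def with_chain():
    return list(itertools.chain.from_iterable(zip(a, b, c)))

for fn in (with_extend, with_iadd, with_chain):
    print(fn.__name__, timeit(fn, number=10_000))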
a = [1, 2, 3]
b = [4, 5, 6]
c = [7, 8, 9]
flat = zip(a,b,c)
d = [x for tpl in flat for x in tpl]
This list comprehension is the same as:
flat_list = []
for sublist in flat:
    for item in sublist:
        flat_list.append(item)

NumPy vectorization without the use of numpy.vectorize

I've found myself using NumPy arrays for memory management and speed of computation more and more lately, on large volumes of structured data (such as points and polygons). In doing so, there is always a situation where I need to perform some function f(x) on the entire array. From experience, and Googling, iterating over the array is not the way to do this; instead, a function should be vectorized and broadcast to the entire array.
Looking at the documentation for numpy.vectorize we get this example:
def myfunc(a, b):
    "Return a-b if a>b, otherwise return a+b"
    if a > b:
        return a - b
    else:
        return a + b
>>> vfunc = np.vectorize(myfunc)
>>> vfunc([1, 2, 3, 4], 2)
array([3, 4, 1, 2])
And per the docs it really just creates a for loop, so it doesn't access the lower-level C loops for truly vectorized operations (either in BLAS or SIMD). So that got me wondering: if the above is "vectorized", what is this?
def myfunc_2(a, b):
    cond = a > b
    a[cond] -= b
    a[~cond] += b
    return a
>>> myfunc_2(np.array([1, 2, 3, 4]), 2)
array([3, 4, 1, 2])
Or even this:
>>> a = np.array([1, 2, 3, 4])
>>> b = 2
>>> np.where(a > b, a - b, a + b)
array([3, 4, 1, 2])
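(For reference: the myfunc_3 used in the timings below isn't shown in the post; it can be assumed to be the np.where expression above wrapped in a function, roughly:)

def myfunc_3(a, b):
    # assumed definition, not shown in the original post
    return np.where(a > b, a - b, a + b)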
So I ran some tests on these, what I believe to be comparable examples:
>>> arr = np.random.randint(200, size=(1000000,))
>>> setup = 'from __main__ import vfunc, arr'
>>> timeit('vfunc(arr, 50)', setup=setup, number=1)
0.60175449999997
>>> arr = np.random.randint(200, size=(1000000,))
>>> setup = 'from __main__ import myfunc_2, arr'
>>> timeit('myfunc_2(arr, 50)', setup=setup, number=1)
0.07464979999997468
>>> arr = np.random.randint(200, size=(1000000,))
>>> setup = 'from __main__ import myfunc_3, arr'
>>> timeit('myfunc_3(arr, 50)', setup=setup, number=1)
0.0222587000000658
And with larger run windows:
>>> arr = np.random.randint(200, size=(1000000,))
>>> setup = 'from __main__ import vfunc, arr'
>>> timeit('vfunc(arr, 50)', setup=setup, number=1000)
621.5853878000003
>>> arr = np.random.randint(200, size=(1000000,))
>>> setup = 'from __main__ import myfunc_2, arr'
>>> timeit('myfunc_2(arr, 50)', setup=setup, number=1000)
98.19819199999984
>>> arr = np.random.randint(200, size=(1000000,))
>>> setup = 'from __main__ import myfunc_3, arr'
>>> timeit('myfunc_3(arr, 50)', setup=setup, number=1000)
26.128515100000186
Clearly the other options are major improvements over using numpy.vectorize. This leads me to wonder why anybody would use numpy.vectorize at all, if you can write what appear to be "purely vectorized" functions or use built-in functions like numpy.where.
Now for the questions:
What are the requirements to say a function is "vectorized" if not converted via numpy.vectorize? Just broadcastable in its entirety?
How does NumPy determine if a function is "vectorized"/broadcastable?
Why isn't this form of vectorization documented anywhere? (I.e., why doesn't NumPy have a "How to write a vectorized function" page?)
"vectorization" can mean be different things depending on context. Use of low level C code with BLAS or SIMD is just one.
In physics 101, a vector represents a point or velocity whose numeric representation can vary with coordinate system. Thus I think of "vectorization", broadly speaking, as performing math operations on the "whole" array, without explicit control over numerical elements.
numpy basically adds an ndarray class to Python. It has a large number of methods (and operators and ufuncs) that do indexing and math in compiled code (not necessarily using processor-specific SIMD). The big gain in speed, relative to Python-level iteration, is the use of compiled code optimized for the ndarray data structure. Python-level iteration (interpreted code) over arrays is actually slower than over equivalent lists.
I don't think numpy formally defines "vectorization". There isn't a "vector" class. I haven't searched the documentation for those terms. Here, and possibly on other forums, it just means writing code that makes optimal use of ndarray methods. Generally that means avoiding Python-level iteration.
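As a tiny illustration of "avoiding Python-level iteration" (a sketch; the exact ratio will vary with array size and hardware):

import numpy as np
from timeit import timeit

arr = np.arange(1_000_000)

python_loop = lambda: [x * 2 for x in arr]  # Python-level iteration over the array
whole_array = lambda: arr * 2               # one whole-array operation in compiled code

print(timeit(python_loop, number=10))
print(timeit(whole_array, number=10))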
np.vectorize is a tool for applying functions that only accept scalar inputs to whole arrays. It does not compile or otherwise "look inside" that function. But it does accept and apply arguments in a fully broadcasted sense, such as in:
In [162]: vfunc(np.arange(3)[:,None],np.arange(4))
Out[162]:
array([[0, 1, 2, 3],
[1, 2, 3, 4],
[2, 1, 4, 5]])
Speedwise np.vectorize is slower than the equivalent list comprehension, at least for smaller sample cases. Recent testing shows that it scales better, so for large inputs it may be better. But still the performance is nothing like your myfunc_2.
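To check that claim yourself, a quick sketch (relative numbers depend on input size and NumPy version):

import numpy as np
from timeit import timeit

def myfunc(a, b):
    return a - b if a > b else a + b

vfunc = np.vectorize(myfunc)
arr = np.random.randint(200, size=100_000)

print(timeit(lambda: vfunc(arr, 50), number=10))                          # np.vectorize
print(timeit(lambda: np.array([myfunc(x, 50) for x in arr]), number=10))  # list comprehension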
myfunc is not "vectorized" simply because expressions like if a > b do not work with arrays.
np.where(a > b, a - b, a + b) is "vectorized" because all arguments to the where work with arrays, and where itself uses them with full broadcasting powers.
In [163]: a,b = np.arange(3)[:,None], np.arange(4)
In [164]: np.where(a>b, a-b, a+b)
Out[164]:
array([[0, 1, 2, 3],
[1, 2, 3, 4],
[2, 1, 4, 5]])
myfunc_2 is "vectorized", at least in a:
In [168]: myfunc_2(a,2)
Out[168]:
array([[4],
[1],
[2]])
It does not work when b is an array; it's trickier to match the a[cond] shape with anything but a scalar:
In [169]: myfunc_2(a,b)
Traceback (most recent call last):
Input In [169] in <cell line: 1>
myfunc_2(a,b)
Input In [159] in myfunc_2
a[cond] -= b
IndexError: boolean index did not match indexed array along dimension 1; dimension is 1 but corresponding boolean dimension is 4
===
What are the requirements to say a function is "vectorized" if not converted via numpy.vectorize? Just broadcastable in its entirety?
In your examples, myfunc is not "vectorized" because it only works with scalars. vfunc is fully "vectorized", but not faster. where is also "vectorized" and (probably) faster, though this may be scale dependent. myfunc_2 is only "vectorized" in a.
How does NumPy determine if a function is "vectorized"/broadcastable?
numpy doesn't determine anything like this. numpy is a ndarray class with many methods. It's just the use of those methods that makes a block of code "vectorized".
Why isn't this form of vectorization documented anywhere? (I.e., why doesn't NumPy have a "How to write a vectorized function" page?)
Keep in mind the distinction between "vectorization" as a performance strategy, and the basic idea of operating on whole arrays.
Vectorize Documentation
The documentation provides a great example in mypolyval(p, x): there's no good way to write that as a where condition or using simple logic.
def mypolyval(p, x):
    _p = list(p)
    res = _p.pop(0)
    while _p:
        res = res*x + _p.pop(0)
    return res
vpolyval = np.vectorize(mypolyval, excluded=['p'])
vpolyval(p=[1, 2, 3], x=[0, 1])
array([3, 6])
That is, np.vectorize is exactly what the reference documentation states: a convenience that lets you write code in the same fashion, even without the performance benefits.
And as for the documentation telling you how to write vectorized code, it does, in the relevant places. It says what you mentioned above:
The vectorize function is provided primarily for convenience, not for performance. The implementation is essentially a for loop.
Remember: the documentation is an API reference guide with some additional caveats: it's not a NumPy tutorial.
UFunc Documentation
The appropriate reference documentation and glossary document this clearly:
A universal function (or ufunc for short) is a function that operates on ndarrays in an element-by-element fashion, supporting array broadcasting, type casting, and several other standard features. That is, a ufunc is a “vectorized” wrapper for a function that takes a fixed number of specific inputs and produces a fixed number of specific outputs. For detailed information on universal functions, see Universal functions (ufunc) basics.
NumPy hands off array processing to C, where looping and computation are much faster than in Python. To exploit this, programmers using NumPy eliminate Python loops in favor of array-to-array operations. vectorization can refer both to the C offloading and to structuring NumPy code to leverage it.
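For instance, built-in ufuncs such as np.add already give the element-by-element, broadcasting behaviour described above; a minimal illustration:

import numpy as np

a = np.arange(3)[:, None]  # shape (3, 1)
b = np.arange(4)           # shape (4,)
print(np.add(a, b))        # broadcast to shape (3, 4), loop runs in compiled code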
Summary
Simply put, np.vectorize is for code legibility, so you can write code in the same style as with actually vectorized ufuncs. It is not for performance, but there are times when you have no good alternative.

Vectorizing nested for loops in list comprehension

I have two lists of strings for which I'm calculating the Damerau–Levenshtein distance
to check which are similar. The issue is that those lists have over 200k+ entries, so with a comprehension it takes quite a lot of time. For the distance computation I'm using the pyxDamerauLevenshtein package, which is written in Cython, so there should be no bottleneck there.
series = ([damerau_levenshtein_distance(i, j) for i in original_string for j in compare_string])
That's how my code looks, and I wonder if it can be vectorized somehow to boost performance, or maybe there is some other way to speed up the computation?
What is my dataset:
Original string - a pd.Series of unique street names
Compare string - a pd.Series of manually entered street names that I want to compare against to find similarity
Output should be like that:
Original Compare Distance
0 Street1 Street1 1
1 Street2 Street1 2
2 Street3 Street1 3
3 Street4 Street3 5
4 Street5 Street3 5
5 Street6 Street6 1
If you can think of a way to use map (or imap) functions rather than nested loops, you could then try using multiprocessing to fully utilise your CPU. For example, in this case:
pool.map(lambda j: map(lambda i: damerau_levenshtein_distance(i, j),original_string),compare_string)
where 'pool.map' is the multiprocessing map, and the second 'map' is regular.
Below is a quick, but functional example of multiprocessing which could cover what you are looking for. I structured it a bit differently to avoid some pickling problems and to get it to compute in the background somewhat asynchronously since your lists are long...
(This can definitely be improved, but should hopefully serve as a proof-of-concept for your example)
import multiprocessing as mp
import itertools

list1 = range(5)
list2 = range(5)

def doSomething(a, b):
    return a + b  # Your damerau_levenshtein_distance function goes here

def mapDoSomething(args):
    i = args[0]          # An element of list2
    otherlist = args[1]  # A copy of list1
    return [doSomething(i, j) for j in otherlist]

if __name__ == '__main__':
    pool = mp.Pool()
    answer = pool.imap(mapDoSomething, zip(list2, itertools.repeat(list1)))
    print(list(answer))  # imap will compute the results in the background whilst the rest of the code runs. You can therefore ask for individual lists of results, and it won't block unless the result hasn't been computed yet. To do this, you would use answer.next() or iterate over the results somewhere else. However, by converting to a list here, I'm forcing all results to finish before printing. This is only to show you it worked. For larger lists, don't do this.
    pool.close()
    pool.join()
This code produces:
[[0, 1, 2, 3, 4], [1, 2, 3, 4, 5], [2, 3, 4, 5, 6], [3, 4, 5, 6, 7],
[4, 5, 6, 7, 8]]
which is each element of list1 operated with (I added them) each element of list2, which I think is what you've attempted to do in your code with lists of strings.
The code sets up the process pool, then uses imap to split the processing of list2 across multiple processes. The zip function lazily groups each element of list2 with a full copy of list1, since imap only supports functions with a single argument. Each group is then unpacked in mapDoSomething, which runs the doSomething function on one element of list2 against every element of list1.
Since I've used imap, the lists get printed as soon as they are computed, rather than waiting for the entire result to be finished.
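If the goal is the tabular Original/Compare/Distance output from the question, one possible sketch (assuming pandas is available; the distance function here is just a stand-in for damerau_levenshtein_distance) builds the full cross product and lets the pool work through it with starmap:

import itertools
import multiprocessing as mp
import pandas as pd

def distance(a, b):
    return abs(len(a) - len(b))  # stand-in for damerau_levenshtein_distance(a, b)

def pairwise_distances(original, compare):
    pairs = list(itertools.product(original, compare))
    with mp.Pool() as pool:
        dists = pool.starmap(distance, pairs)
    return pd.DataFrame(pairs, columns=["Original", "Compare"]).assign(Distance=dists)

if __name__ == '__main__':
    print(pairwise_distances(["Street1", "Street22"], ["Street1", "Street333"]))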

Selective flattening of a Python list

Suppose I have a list containing (among other things) sublists of different types:
[1, 2, [3, 4], {5, 6}]
that I'd like to flatten in a selective way, depending on the type of its elements (i.e. I'd like to only flatten sets, and leave the rest unflattened):
[1, 2, [3, 4], 5, 6]
My current solution is a function, but just for my intellectual curiosity, I wonder if it's possible to do it with a single list comprehension?
List comprehensions aren't designed for flattening (since they don't have a way to combine the values corresponding to multiple input items).
While you can get around this with nested list comprehensions, this requires each element in your top level list to be iterable.
Honestly, just use a function for this. It's the cleanest way.
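For reference, such a function might look like this (a sketch that flattens only sets, one level deep):

def flatten_sets(items):
    # yield items unchanged, except sets, whose members are yielded individually
    for x in items:
        if isinstance(x, (set, frozenset)):
            yield from x
        else:
            yield x

print(list(flatten_sets([1, 2, [3, 4], {5, 6}])))  # [1, 2, [3, 4], 5, 6] (set order not guaranteed)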
Amber is probably right that a function is preferable for something like this. On the other hand, there's always room for a little variation. I'm assuming the nesting is never more than one level deep -- if it is ever more than one level deep, then you should definitely prefer a function for this. But if not, this is a potentially viable approach.
>>> from itertools import chain
>>> from collections.abc import Set
>>> list(chain.from_iterable(x if isinstance(x, Set) else (x,) for x in l))
[1, 2, [3, 4], 5, 6]
The non-itertools way to do this would involve nested list comprehensions. Better to break that into two lines:
>>> packaged = (x if isinstance(x, Set) else (x,) for x in l)
>>> [x for y in packaged for x in y]
[1, 2, [3, 4], 5, 6]
I don't have a strong intuition about whether either of these would be faster or slower than a straightforward function. These create lots of singleton tuples -- that's kind of a waste -- but they also happen at LC speed, which is usually pretty good.
You can use the flatten function from the funcy library:
from funcy import flatten, isa
flat_list = flatten(your_list, follow=isa(set))
You can also peek at its implementation.

Sorting based on one of the lists in a nested list in Python

I have a list such as [[4,5,6],[2,3,1]]. Now I want to sort the list based on list[1], i.e. the output should be [[6,4,5],[1,2,3]]. So basically I am sorting 2,3,1 and reordering list[0] to match.
While searching I found a function which sorts based on the first element of every sublist, but that's not what I need. Also, I do not want to recreate the list as [[4,2],[5,3],[6,1]] and then use that function.
Since [4, 5, 6] and [2, 3, 1] serve two different purposes, I will make a function taking two arguments: the list to be reordered, and the list whose sorting will decide the order. I'll only return the reordered list.
This answer has timings of three different solutions for creating a permutation list for a sort. Using the fastest option gives this solution:
def pyargsort(seq):
    return sorted(range(len(seq)), key=seq.__getitem__)

def using_pyargsort(a, b):
    "Reorder the list a the same way as list b would be reordered by a normal sort"
    return [a[i] for i in pyargsort(b)]
print(using_pyargsort([4, 5, 6], [2, 3, 1]))  # [6, 4, 5]
The pyargsort method is inspired by the numpy argsort method, which does the same thing much faster. Numpy also has advanced indexing operations whereby an array can be used as an index, making possible very quick reordering of an array.
So if your need for speed is great, one would assume that this numpy solution would be faster:
import numpy as np
def using_numpy(a, b):
    "Reorder the list a the same way as list b would be reordered by a normal sort"
    return np.array(a)[np.argsort(b)].tolist()

print(using_numpy([4, 5, 6], [2, 3, 1]))  # [6, 4, 5]
However, for short lists (length < 1000), this solution is in fact slower than the first. This is because we're first converting the a and b lists to array and then converting the result back to list before returning. If we instead assume you're using numpy arrays throughout your application so that we do not need to convert back and forth, we get this solution:
def all_numpy(a, b):
    "Reorder array a the same way as array b would be reordered by a normal sort"
    return a[np.argsort(b)]

print(all_numpy(np.array([4, 5, 6]), np.array([2, 3, 1])))  # array([6, 4, 5])
The all_numpy function executes up to 10 times faster than the using_pyargsort function.
The following logarithmic graph compares these three solutions with the two alternative solutions from the other answers. The arguments are two randomly shuffled ranges of equal length, and the functions all receive identically ordered lists. I'm timing only the time the function takes to execute. For illustrative purposes I've added an extra graph line for each numpy solution where the 60 ms overhead for loading numpy is added to the time.
As we can see, the all-numpy solution beats the others by an order of magnitude. Converting from python list and back slows the using_numpy solution down considerably in comparison, but it still beats pure python for large lists.
For a list length of about 1,000,000, using_pyargsort takes 2.0 seconds, using_numpy + overhead is only 1.3 seconds, while all_numpy + overhead is 0.3 seconds.
The sorting you describe is not very easy to accomplish. The only way that I can think of to do it is to use zip to create the list you say you don't want to create:
lst = [[4,5,6],[2,3,1]]
# key = operator.itemgetter(1) works too, and may be slightly faster ...
transpose_sort = sorted(zip(*lst),key = lambda x: x[1])
lst = zip(*transpose_sort)
Is there a reason for this constraint?
(Also note that you could do this all in one line if you really want to:
lst = zip(*sorted(zip(*lst),key = lambda x: x[1]))
This also results in a list of tuples. If you really want a list of lists, you can map the result:
lst = map(list, lst)
Or a list comprehension would work as well:
lst = [ list(x) for x in lst ]
If the second list doesn't contain duplicates, you could just do this:
l = [[4,5,6],[2,3,1]] #the list
l1 = l[1][:] #a copy of the to-be-sorted sublist
l[1].sort() #sort the sublist
l[0] = [l[0][l1.index(x)] for x in l[1]] #order the first sublist accordingly
(As this saves the sublist l[1] it might be a bad idea if your input list is huge)
How about this one:
a = [[4,5,6],[2,3,1]]
[a[0][i] for i in sorted(range(len(a[1])), key=lambda x: a[1][x])]
It uses the same principle as the numpy approach, without having to use numpy and without the zip stuff.
Neither using numpy nor the zipping around seems to be the cheapest way for giant structures. Unfortunately the .sort() method is built into the list type and uses hard-wired access to the elements in the list (overriding __getitem__() or similar does not have any effect here).
So you can implement your own sort() which sorts two or more lists according to the values in one; this is basically what numpy does.
Or you can create a list of values to sort, sort that, and recreate the sorted original list out of it.
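A sketch of that first idea, a helper (call it co_sort; the name is just for illustration) that reorders any number of lists according to a normal sort of the first one:

def co_sort(key_list, *other_lists):
    # indices that would sort key_list, applied to every list
    order = sorted(range(len(key_list)), key=key_list.__getitem__)
    return [[lst[i] for i in order] for lst in (key_list, *other_lists)]

print(co_sort([2, 3, 1], [4, 5, 6]))  # [[1, 2, 3], [6, 4, 5]]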
