Vectorizing nested for loops in list comprehension - python

I have two lists of strings for which I'm calculating the Damerau–Levenshtein distance to check which are similar.
The issue is that the lists contain over 200k entries, so the comprehension takes quite a lot of time. For the distance computation I'm using the pyxDamerauLevenshtein package, which is written in Cython, so the distance function itself should not be a bottleneck.
series = ([damerau_levenshtein_distance(i, j) for i in original_string for j in compare_string])
That's how my code looks, and I wonder whether it can be vectorized somehow to boost performance, or whether there is some other way to speed up the computation.
What my dataset looks like:
Original string - a pd.Series of unique street names
Compare string - a pd.Series of manually entered street names that I want to compare against to find similar ones
The output should look like this:
Original Compare Distance
0 Street1 Street1 1
1 Street2 Street1 2
2 Street3 Street1 3
3 Street4 Street3 5
4 Street5 Street3 5
5 Street6 Street6 1
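For illustration only, here is a minimal sketch of how the nested loop maps onto that three-column table, assuming pandas is available and that the full cross product of the two Series is really what is wanted (the sample Series below are placeholders, not the real data, and the import path of the package is assumed):
import pandas as pd
from pyxdameraulevenshtein import damerau_levenshtein_distance  # assumed import path of the package

# Placeholder data standing in for the real Series
original_string = pd.Series(["Street1", "Street2", "Street3"])
compare_string = pd.Series(["Street1", "Street3"])

rows = [
    (orig, comp, damerau_levenshtein_distance(orig, comp))
    for comp in compare_string   # manually entered names
    for orig in original_string  # unique street names
]
result = pd.DataFrame(rows, columns=["Original", "Compare", "Distance"])
print(result)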

If you can think of a way to use map (or imap) functions rather than nested loops, you could then try using multiprocessing to fully utilise your CPU. For example, in this case:
pool.map(lambda j: map(lambda i: damerau_levenshtein_distance(i, j),original_string),compare_string)
where 'pool.map' is the multiprocessing map, and the second 'map' is regular.
Below is a quick but functional example of multiprocessing which could cover what you are looking for. I structured it a bit differently to avoid some pickling problems and to get it to compute in the background somewhat asynchronously, since your lists are long...
(This can definitely be improved, but should hopefully serve as a proof-of-concept for your example)
import multiprocessing as mp
import itertools

list1 = range(5)
list2 = range(5)

def doSomething(a, b):
    return a + b  # Your damerau_levenshtein_distance function goes here

def mapDoSomething(args):
    i = args[0]          # An element of list2
    otherlist = args[1]  # A copy of list1
    return [doSomething(i, j) for j in otherlist]

if __name__ == '__main__':
    pool = mp.Pool()
    answer = pool.imap(mapDoSomething, zip(list2, itertools.repeat(list1)))
    # imap computes the results in the background while the rest of the code runs,
    # so you can ask for individual lists of results and it won't block unless a
    # result hasn't been computed yet (use answer.next() or iterate over the results
    # somewhere else). Converting to a list here forces all results to finish before
    # printing; this is only to show it worked. For larger lists, don't do this.
    print(list(answer))
    pool.close()
    pool.join()
This code produces:
[[0, 1, 2, 3, 4], [1, 2, 3, 4, 5], [2, 3, 4, 5, 6], [3, 4, 5, 6, 7],
[4, 5, 6, 7, 8]]
which is each element of list1 operated with (I added them) each element of list2, which I think is what you've attempted to do in your code with lists of strings.
The code sets up the process Pool, then uses imap to split the processing of list2 across multiple processes. The zip function lazily groups each element of list2 with a full copy of list1, since imap only supports functions with a single argument. Each group is then unpacked in mapDoSomething, which runs the doSomething function on every element of list1 with one element of list2.
Since I've used imap, the lists get printed as soon as they are computed, rather than waiting for the entire result to be finished.
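On Python 3 you can get much the same effect without packing the arguments into a tuple yourself, by using pool.starmap, which accepts multi-argument functions. A rough sketch under the same assumptions, with the toy doSomething standing in for the distance function (note that unlike imap, starmap blocks until all results are ready):
import multiprocessing as mp
import itertools

def doSomething(a, b):
    return a + b  # stand-in for damerau_levenshtein_distance

def mapDoSomething(i, otherlist):
    # One row of results: element i of list2 against every element of list1
    return [doSomething(i, j) for j in otherlist]

if __name__ == '__main__':
    list1 = range(5)
    list2 = range(5)
    with mp.Pool() as pool:
        answer = pool.starmap(mapDoSomething, zip(list2, itertools.repeat(list1)))
    print(answer)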

Related

Efficient way to sequentially add multiple list elements

I have multiple lists. I want to merge the elements sequentially one-by-one.
Example:
a = [1, 2, 3]
b = [4, 5, 6]
c = [7, 8, 9]
Result should be:
d = [1, 4, 7, 2, 5, 8, 3, 6, 9]
One way of doing it is:
d = []
for i, j, k in zip(a, b, c):
    d.extend([i, j, k])
Is this efficient? What is most efficient way here?
A one-liner could be
import itertools
list(itertools.chain.from_iterable(zip(a,b,c)))
A variant of your method is
d = []
for i, j, k in zip(a, b, c):
    d += [i, j, k]
Out of curiosity, I've just used timeit to compare your method, that variant, my one-liner, and also the one in Olvin's comment (let's call it the compound version), and the verdict is:
yours: 1.06-1.08
my variant (with += instead of extend): 0.94-0.97
my one-liner: 1.10-1.12
Olvin's one-liner: 1.28-1.34
Sometimes, the nicest methods aren't the fastest.
Timing may change for longer lists, though.
The fact that += is faster than .extend is quite interesting (since .extend changes the list in place, while += builds a new one and then replaces the old one; instinct would say that extending a list should be faster than rebuilding it, but maybe memory management says otherwise).
But, well, so far, the fastest one is my second version (with +=), which, incidentally, is also the one I find the most boring, among all solutions seen here.
Edit
Since that ranking bothered me (itertools iterators are supposed to be faster, since they are a little less interpreted and a little more compiled), I've tried with longer lists. And then it is another story:
a=list(range(1000))
b=list(range(1000,2000))
c=list(range(2000,3000))
And then the timeit verdict (with 100 times fewer runs than before) is:
Your method: 1.91
My += variant: 1.59
My one-liner: 0.98
Olvin's one-liner: 1.88
So, at least, itertools does win in the long run (with big enough data).
The victory of += over .extend is confirmed (I don't really know the internals of memory management, but coming from the C world, I would say that sometimes a fresh malloc and copy is faster than constantly realloc'ing; maybe that's a naive view of what happens under the hood in Python's interpreter, but either way, += is faster than .extend for this usage in the long run).
Olvin's method is roughly equivalent to yours, which surprises me a little, because it is essentially the compound version of the same thing. I would have thought that, while building up a compound list, Python could skip some steps in the intermediate representation that it cannot skip in your method, where all the intermediate versions of the list (the one with just [1, 4, 7], then the one with [1, 4, 7, 2, 5, 8], etc.) exist in the interpreter. Maybe the 0.03 difference between Olvin's method and yours is because of that (it is not just noise; at this size the timings are quite consistent, and so is the 0.03 difference). But I would have expected the difference to be larger.
But even if the timing differences surprise me, the ranking of the methods makes more sense with big lists: itertools > += > [compound] > .extend.
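For anyone who wants to reproduce a comparison along these lines, a minimal timeit sketch such as the following should do; the exact numbers will of course depend on the machine and the list sizes:
import itertools
import timeit

a = list(range(1000))
b = list(range(1000, 2000))
c = list(range(2000, 3000))

def with_extend():
    d = []
    for i, j, k in zip(a, b, c):
        d.extend([i, j, k])
    return d

def with_plus_equals():
    d = []
    for i, j, k in zip(a, b, c):
        d += [i, j, k]
    return d

def with_itertools():
    return list(itertools.chain.from_iterable(zip(a, b, c)))

for f in (with_extend, with_plus_equals, with_itertools):
    print(f.__name__, timeit.timeit(f, number=10_000))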
a = [1, 2, 3]
b = [4, 5, 6]
c = [7, 8, 9]
flat = zip(a,b,c)
d = [x for tpl in flat for x in tpl]
This list comprehension is the same as:
flat_list = []
for tpl in flat:
    for x in tpl:
        flat_list.append(x)

Does Python calculate list comprehension condition in each step?

In a list comprehension with a condition that has a function call in it, does Python (specifically CPython 3.9.4) call the function each time, or does it calculate the value once and then use it?
For example if you have:
import numpy as np

list_1 = [1, 2, 3, 4, 5, 6, 7, 8, 9]
list_2 = [x for x in list_1 if x > np.average(list_1)]
Will Python actually calculate np.average(list_1) len(list_1) times? So would it be more efficient to write
list_1 = [1, 2, 3, 4, 5, 6, 7, 8, 9]
np_avg = np.average(list_1)
list_2 = [x for x in list_1 if x > np_avg]
instead? Or does Python already "know" to just calculate the average beforehand?
Python has to call the function each time. It cannot optimize that part, because successive calls of the function might return different results (for example because of side effects). There is no easy way for Python’s compiler to be sure that this can’t happen.
Therefore, if you (the programmer) know that the result will always be the same – like in this case – it is probably advisable to calculate the result of the function in advance and use it inside the list comprehension.
Assuming standard CPython - Short answer: Yes. Your second snippet is more efficient.
A function call in the filter part of a list comprehension will be called for each element.
We can test this quite easily with a trivial example:
def f(value):
    """ Allow even values only """
    print('function called')
    return value % 2 == 0

mylist = [x for x in range(5) if f(x)]
# 'function called' will be printed 5 times
The above is somewhat equivalent to doing:
mylist = []
for x in range(5):
    if f(x):
        mylist.append(x)
Since you're comparing against the same average each time, you can indeed just calculate it beforehand and use the same value as you did in your second code snippet.

Does Python keep track of when something has been sorted, internally?

For example, if I call
L = [3,4,2,1,5]
L = sorted(L)
I get a sorted list. Now, in the future, if I want to perform some other kind of sort on L, does Python automatically know "this list has been sorted before and not modified since, so we can perform some internal optimizations on how we perform this other kind of sort" such as a reverse-sort, etc?
Nope, it doesn't. The sorting algorithm is designed to exploit (partially) sorted inputs, but the list itself doesn't "remember" being sorted in any way.
(This is actually a CPython implementation detail, and future versions/different implementations could cache the fact that a list was just sorted. However, I'm not convinced that could be done without slowing down all operations that modify the list, such as append.)
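A purely illustrative aside: you can see the first point in a quick timing, since sorting an already-sorted list is far cheaper than sorting a shuffled copy, even though the list object itself carries no flag you could query:
import random
import timeit

data = list(range(100_000))
shuffled = data[:]
random.shuffle(shuffled)

# Timsort exploits the existing order, but neither list stores an "is sorted" flag.
print(timeit.timeit(lambda: sorted(data), number=10))      # already sorted: fast
print(timeit.timeit(lambda: sorted(shuffled), number=10))  # shuffled: noticeably slower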
As the commenters pointed out, normal Python lists are inherently ordered and efficiently sortable (thanks, Timsort!), but do not remember or maintain sorting status.
If you want lists that invariably retain their sorted status, you can install the SortedContainers package from PyPI.
>>> from sortedcontainers import SortedList
>>> L = SortedList([3,4,2,1,5])
>>> L
SortedList([1, 2, 3, 4, 5])
>>> L.add(3.3)
>>> L
SortedList([1, 2, 3, 3.3, 4, 5])
Note the normal append method becomes add, because the item isn't added on the end. It's added wherever appropriate given the sort order. There is also a SortedListWithKey type that allows you to set your sort key/order explicitly.
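A tiny example of the keyed variant (assuming a reasonably recent sortedcontainers, where SortedListWithKey is also exposed as SortedKeyList):
>>> from sortedcontainers import SortedKeyList
>>> L = SortedKeyList([3, 4, 2, 1, 5], key=lambda x: -x)  # keep items in descending order
>>> L.add(3.3)
>>> list(L)
[5, 4, 3.3, 3, 2, 1]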
Some of this, at least the specific reverse sort question, could be done using numpy:
import numpy as np

L = np.array([3, 4, 2, 1, 5])
a = np.argsort(L)    # indices that would sort L
b = L[a]             # normal (ascending) sort
r = L[a[::-1]]       # reverse sort
print(L)
[3 4 2 1 5]
print(b)
[1 2 3 4 5]
print(r)
[5 4 3 2 1]
That is, here we just do the sort once (to create a, the sorting indices), and then we can manipulate a to do various other sorts, like the normal sort b and the reverse sort r. Many others would be similarly easy, like taking every other element.

Optimized method of cutting/slicing sorted lists

Is there any pre-made optimized tool/library in Python to cut/slice lists for values "less than" something?
Here's the issue: Let's say I have a list like:
a=[1,3,5,7,9]
and I want to delete all the numbers which are <= 6, so the resulting list would be
[7,9]
6 is not in the list, so I can't use the built-in index(6) method of the list. I can do things like:
#!/usr/bin/env python
a = [1, 3, 5, 7, 9]
cut = 6
for i in range(len(a)-1, -2, -1):
    if a[i] <= cut:
        break
b = a[i+1:]
print("Cut list: %s" % b)
which would be a fairly quick method if the index to cut from is close to the end of the list, but which will be inefficient if it is close to the beginning (say I want to delete all the items which are > 2; there will be a lot of iterations).
I can also implement my own find method using binary search or such, but I was wondering if there's a more... wide-scope built-in library to handle this type of thing that I could reuse in other cases (for instance, if I need to delete all the numbers which are >= 6).
Thank you in advance.
You can use the bisect module to perform a sorted search:
>>> import bisect
>>> a[bisect.bisect_left(a, 6):]
[7, 9]
bisect.bisect_left is what you are looking for, I guess.
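For the other case mentioned in the question, deleting everything >= 6, the complementary slice works, again assuming the list is sorted: keep the part before the insertion point.
>>> import bisect
>>> a = [1, 3, 5, 7, 9]
>>> a[:bisect.bisect_left(a, 6)]   # keep elements < 6, i.e. drop everything >= 6
[1, 3, 5]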
If you just want to filter the list for all elements that fulfil a certain criterion, then the most straightforward way is to use the built-in filter function.
Here is an example:
a_list = [10,2,3,8,1,9]
# filter all elements smaller than 6:
filtered_list = list(filter(lambda x: x < 6, a_list))  # list() needed on Python 3, where filter returns an iterator
the filtered_list will contain:
[2, 3, 1]
Note: This method does not rely on the ordering of the list, so for very large lists it might be that a method optimised for ordered searching (such as bisect) performs better in terms of speed.
Bisect left and right helper function
#!/usr/bin/env python3
import bisect
def get_slice(list_, left, right):
    return list_[
        bisect.bisect_left(list_, left):
        bisect.bisect_left(list_, right)
    ]

assert get_slice([0, 1, 1, 3, 4, 4, 5, 6], 1, 5) == [1, 1, 3, 4, 4]
Tested in Ubuntu 16.04, Python 3.5.2.
Adding to Jon's answer, if you need to actually delete the elements less than or equal to 6, and want to keep the same reference to the list rather than returning a new one:
del a[:bisect.bisect_right(a,6)]
You should note as well that bisect will only work on a sorted list.
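A quick demonstration of what that does, on the list from the question:
>>> import bisect
>>> a = [1, 3, 5, 7, 9]
>>> del a[:bisect.bisect_right(a, 6)]
>>> a
[7, 9]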

Sorting based on one of the list among Nested list in python

I have a list such as [[4,5,6],[2,3,1]]. Now I want to sort the list based on list[1], i.e. the output should be [[6,4,5],[1,2,3]]. So basically I am sorting 2,3,1 and reordering list[0] in the same way.
While searching I found a function which sorts based on the first element of every sublist, but that doesn't help here. Also, I do not want to recreate the list as [[4,2],[5,3],[6,1]] and then use that function.
Since [4, 5, 6] and [2, 3, 1] serve two different purposes, I will make a function taking two arguments: the list to be reordered, and the list whose sorting will decide the order. I'll only return the reordered list.
This answer has timings of three different solutions for creating a permutation list for a sort. Using the fastest option gives this solution:
def pyargsort(seq):
    return sorted(range(len(seq)), key=seq.__getitem__)

def using_pyargsort(a, b):
    "Reorder the list a the same way as list b would be reordered by a normal sort"
    return [a[i] for i in pyargsort(b)]

print(using_pyargsort([4, 5, 6], [2, 3, 1]))  # [6, 4, 5]
The pyargsort method is inspired by the numpy argsort method, which does the same thing much faster. Numpy also has advanced indexing operations whereby an array can be used as an index, making possible very quick reordering of an array.
So if your need for speed is great, one would assume that this numpy solution would be faster:
import numpy as np

def using_numpy(a, b):
    "Reorder the list a the same way as list b would be reordered by a normal sort"
    return np.array(a)[np.argsort(b)].tolist()

print(using_numpy([4, 5, 6], [2, 3, 1]))  # [6, 4, 5]
However, for short lists (length < 1000), this solution is in fact slower than the first. This is because we're first converting the a and b lists to array and then converting the result back to list before returning. If we instead assume you're using numpy arrays throughout your application so that we do not need to convert back and forth, we get this solution:
def all_numpy(a, b):
    "Reorder array a the same way as array b would be reordered by a normal sort"
    return a[np.argsort(b)]

print(all_numpy(np.array([4, 5, 6]), np.array([2, 3, 1])))  # [6 4 5]
The all_numpy function executes up to 10 times faster than the using_pyargsort function.
The following logarithmic graph compares these three solutions with the two alternative solutions from the other answers. The arguments are two randomly shuffled ranges of equal length, and the functions all receive identically ordered lists. I'm timing only the time the function takes to execute. For illustrative purposes I've added in an extra graph line for each numpy solution where the 60 ms overhead for loading numpy is added to the time.
As we can see, the all-numpy solution beats the others by an order of magnitude. Converting from python list and back slows the using_numpy solution down considerably in comparison, but it still beats pure python for large lists.
For a list length of about 1'000'000, using_pyargsort takes 2.0 seconds, using_numpy + overhead is only 1.3 seconds, while all_numpy + overhead is 0.3 seconds.
The sorting you describe is not very easy to accomplish. The only way that I can think of to do it is to use zip to create the list you say you don't want to create:
lst = [[4,5,6],[2,3,1]]
# key = operator.itemgetter(1) works too, and may be slightly faster ...
transpose_sort = sorted(zip(*lst),key = lambda x: x[1])
lst = zip(*transpose_sort)
Is there a reason for this constraint?
Also note that you could do this all in one line if you really want to:
lst = zip(*sorted(zip(*lst),key = lambda x: x[1]))
This also results in a list of tuples. If you really want a list of lists, you can map the result:
lst = map(list, lst)
Or a list comprehension would work as well:
lst = [ list(x) for x in lst ]
If the second list doesn't contain duplicates, you could just do this:
l = [[4,5,6],[2,3,1]] #the list
l1 = l[1][:] #a copy of the to-be-sorted sublist
l[1].sort() #sort the sublist
l[0] = [l[0][l1.index(x)] for x in l[1]] #order the first sublist accordingly
(As this saves the sublist l[1] it might be a bad idea if your input list is huge)
How about this one:
a = [[4,5,6],[2,3,1]]
[a[0][i] for i in sorted(range(len(a[1])), key=lambda x: a[1][x])]
It uses the same principle as numpy, without having to use numpy and without the zip stuff.
Neither using numpy nor the zipping around seems to be the cheapest way for giant structures. Unfortunately the .sort() method is built into the list type and uses hard-wired access to the elements in the list (overriding __getitem__() or similar does not have any effect here).
So you can implement your own sort() which sorts two or more lists according to the values in one; this is basically what numpy does.
Or you can create a list of values to sort, sort that, and recreate the sorted original list out of it.
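A minimal pure-Python sketch of that last idea (sort a list of indices, then rebuild both lists in that order; the helper name is just for illustration):
def sort_by(values, keys):
    # Indices that would sort `keys`, then reorder both lists by those indices.
    order = sorted(range(len(keys)), key=keys.__getitem__)
    return [values[i] for i in order], [keys[i] for i in order]

a = [[4, 5, 6], [2, 3, 1]]
a[0], a[1] = sort_by(a[0], a[1])
print(a)  # [[6, 4, 5], [1, 2, 3]]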
