Searching a large nested list for a word - python

I have to find which lists in a nested list contain a word and return a boolean numpy array.
nested_list = [['a','b','c'],['a','b'],['b','c'],['c']]
word = 'c'
result = [1, 0, 1, 1]
I'm using this list comprehension to do it and it works
np.array([word in x for x in nested_list])
But I'm working with a nested list with 700k lists inside, so it takes a lot of time. Also, I have to do it many times; the lists are static but the word can change.
One pass with the list comprehension takes 0.36 s. Is there a way to do it faster?

We could flatten the elements of all sub-lists into a 1D array. Then, we simply look for any occurrence of 'c' within the limits of each sub-list in that flattened array. With that idea, we could use two approaches, based on how we count the occurrences of 'c'.
Approach #1 : One approach with np.bincount -
lens = np.array([len(i) for i in nested_list])
arr = np.concatenate(nested_list)
ids = np.repeat(np.arange(lens.size),lens)
out = np.bincount(ids, arr=='c')!=0
Since, as stated in the question, nested_list won't change across iterations, we can re-use everything and just loop for the final step.
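For example, a minimal sketch of that re-use pattern (assuming the query words arrive as a list called words, which the question doesn't spell out):
import numpy as np

# one-time setup: nested_list is static, so build these once
lens = np.array([len(i) for i in nested_list])
arr = np.concatenate(nested_list)
ids = np.repeat(np.arange(lens.size), lens)

# cheap per-word step, repeated for every query word
results = {w: np.bincount(ids, arr == w, minlength=lens.size) != 0 for w in words}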
Approach #2 : Another approach with np.add.reduceat reusing arr and lens from previous one -
grp_idx = np.append(0,lens[:-1].cumsum())
out = np.add.reduceat(arr=='c', grp_idx)!=0
When looping through a list of words, we can keep the final step vectorized by using np.add.reduceat along an axis and broadcasting to give us a 2D boolean array, like so -
np.add.reduceat(arr==np.array(words)[:,None], grp_idx, axis=1)!=0
Sample run -
In [344]: nested_list
Out[344]: [['a', 'b', 'c'], ['a', 'b'], ['b', 'c'], ['c']]
In [345]: words
Out[345]: ['c', 'b']
In [346]: lens = np.array([len(i) for i in nested_list])
...: arr = np.concatenate(nested_list)
...: grp_idx = np.append(0,lens[:-1].cumsum())
...:
In [347]: np.add.reduceat(arr==np.array(words)[:,None], grp_idx, axis=1)!=0
Out[347]:
array([[ True, False,  True,  True],    # matches for 'c'
       [ True,  True,  True, False]])   # matches for 'b'

A generator expression would be preferable when iterating once (in terms of performance). A solution using the numpy.fromiter function:
nested_list = [['a','b','c'],['a','b'],['b','c'],['c']]
words = 'c'
arr = np.fromiter((words in l for l in nested_list), int)
print(arr)
The output:
[1 0 1 1]
https://docs.scipy.org/doc/numpy/reference/generated/numpy.fromiter.html
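A small variation (a sketch, not benchmarked): passing dtype=bool plus a count hint lets fromiter preallocate the output array, which can shave off a little more time:
arr = np.fromiter((words in l for l in nested_list), dtype=bool, count=len(nested_list))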

How much time is it taking you to finish your loop? In my test case it only takes a few hundred milliseconds.
import random
# generate the nested lists
a = list('abcdefghijklmnop')
nested_list = [[random.choice(a) for x in range(random.randint(1, 30))]
               for n in range(700000)]
%%timeit -n 10
word = 'c'
b = [word in x for x in nested_list]
# 10 loops, best of 3: 191 ms per loop
Reducing each internal list to a set gives some time savings...
nested_sets = [set(x) for x in nested_list]
%%timeit -n 10
word = 'c'
b = [word in s for s in nested_sets]
# 10 loops, best of 3: 132 ms per loop
And once you have turned it into a list of sets, you can build a list of boolean tuples. No real time savings though.
%%timeit -n 10
words = list('abcde')
b = [tuple(word in s for word in words) for s in nested_sets]
# 10 loops, best of 3: 749 ms per loop
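The set conversion pays off most when nested_sets is reused across many different query words, e.g. (a sketch, assuming the words come in as a plain list):
nested_sets = [set(x) for x in nested_list]                   # built once
results = {w: [w in s for s in nested_sets] for w in words}   # repeated per batch of words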

Related

What is the fastest way to compare each element of a list with corresponding element of another list?

I want to compare each element of a list with corresponding element of another list to see if it is greater or lesser.
list1 = [4,1,3]
list2 = [2,5,2]
So compare 4 with 2, 1 with 5, 3 with 2.
Are there other fast ways to do that other than using a for loop?
You can use the numpy library for this, and it's significantly faster:
>>> import numpy as np
>>> list1 = np.array([4,1,3])
>>> list2 = np.array([2,5,2])
>>> list1 < list2
array([False, True, False])
The time taken to run it:
>>> import timeit
>>> timeit.timeit("""
... import numpy as np
... list1 = np.array([4,1,3])
... list2 = np.array([2,5,2])
... print(list1 < list2)
... """,number=1)
[False True False]
0.00011205673217773438
The fact that numpy is largely written in C/C++ makes it considerably faster, as you can see if you look into its implementation.
You can use zip() to join the list elements and a list comprehension to create the resulting list:
list1 = [4,1,3]
list2 = [2,5,2]
list1_greater_list2_value = [a > b for a, b in zip(list1, list2)]
print("list1-value greater than list2-value:", list1_greater_list2_value)
Output:
list1-value greater than list2-value: [True, False, True]
This does the same work as a normal loop - but looks more pythonic.
You can map the two lists to an operator such as int.__lt__ (the "less than" operator):
list(map(int.__lt__, list1, list2))
With your sample input, this returns:
[False, True, False]
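A minor variation (not from the original answer): operator.lt does the same thing without reaching for the dunder name directly, and it isn't tied to int elements:
from operator import lt

list(map(lt, list1, list2))   # [False, True, False]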
You can do it like this.
Using a lambda with map:
In [91]: map(lambda x,y:x<y,list1,list2)
Out[91]: [False, True, False]
With zip and for loop:
In [83]: [i<j for i,j in zip(list1,list2)]
Out[83]: [False, True, False]
Execution timings for lambda and for loop:
In [101]: def test_lambda():
...: map(lambda x,y:x>y,list1,list2)
...:
In [102]: def test_forloop():
...: [i<j for i,j in zip(list1,list2)]
...:
In [103]: %timeit test_lambda
...:
10000000 loops, best of 3: 21.9 ns per loop
In [104]: %timeit test_forloop
10000000 loops, best of 3: 21 ns per loop
Just for the sake of better comparison, if all approaches are timed in the same way:
paul_numpy: 0.15166378399999303
artner_LCzip: 0.9575707100000272
bhlsing_map__int__: 1.3945185019999826
rahul_maplambda: 1.4970900099999653
rahul_LCzip: 0.9604789950000168
Code used for timing:
setup_str = '''import numpy as np
list1 = list(map(int, np.random.randint(0, 100, 1000000)))
list2 = list(map(int, np.random.randint(0, 100, 1000000)))'''
paul_numpy = 'list1 = np.array(list1); list2 = np.array(list2);list1 < list2'
t = timeit.Timer(paul_numpy, setup_str)
print('paul_numpy: ', min(t.repeat(number=10)))
artner = '[a > b for a,b in zip(list1,list2)]'
t = timeit.Timer(artner, setup_str)
print('artner_LCzip: ', min(t.repeat(number=10)))
blhsing = 'list(map(int.__lt__, list1, list2))'
t = timeit.Timer(blhsing, setup_str)
print('bhlsing_map__int__: ', min(t.repeat(number=10)))
rahul_lambda = 'list(map(lambda x,y:x<y,list1,list2))'
t = timeit.Timer(rahul_lambda, setup_str)
print('rahul_maplambda: ', min(t.repeat(number=10)))
rahul_zipfor = '[i<j for i,j in zip(list1,list2)]'
t = timeit.Timer(rahul_zipfor, setup_str)
print('rahul_LCzip: ', min(t.repeat(number=10)))
Using enumerate eliminates the need for any zip or map here
[item > list2[idx] for idx, item in enumerate(list1)]
# [True, False, True]
Actually, I should say that there is no asymptotically faster way of doing this than a loop: if you look at the problem closely, you can see that the operation takes at least O(n). So all the answers are correct and will do what you want with zip or map or so on, but those solutions just make the implementation nicer and more pythonic, not faster. In some cases, like blhsing's answer, it is a little bit faster because of lower per-element overhead, not a better time complexity.

Extract the first letter from each string in a numpy array

I have a huge numpy array whose elements are strings. I'd like to replace each string with its first character. For example, if
C[0] = 'A90CD'
I want to replace it with
C[0] = 'A'
In a nutshell, I was thinking of applying a regex in a loop, with a dictionary of regex expressions like
'^A.+$' => 'A'
'^B.+$' => 'B'
etc
How can I apply this regex over the numpy array? Or is there a better method to achieve the same thing?
There's no need for regex here. Just convert your array to a single-character string dtype, using astype -
v = np.array(['abc', 'def', 'ghi'])
>>> v.astype('<U1')
array(['a', 'd', 'g'],
dtype='<U1')
Alternatively, you can change its view and stride. Here's a slightly optimised version for equal-sized strings -
>>> v.view('<U1')[::len(v[0])]
array(['a', 'd', 'g'],
dtype='<U1')
And here's a more generalised version of the .view method, which also works for arrays of strings with differing lengths. Thanks to Paul Panzer for the suggestion -
>>> v.view('<U1').reshape(v.shape + (-1,))[:, 0]
array(['a', 'd', 'g'],
dtype='<U1')
Performance
y = np.array([x * 20 for x in v]).repeat(100000)
y.shape
(300000,)
len(y[0]) # they're all the same length - `abcabcabc...`
60
Now, the timings -
# `astype` conversion
%timeit y.astype('<U1')
100 loops, best of 3: 5.03 ms per loop
# `view` for equal sized string arrays
%timeit y.view('<U1')[::len(y[0])]
100000 loops, best of 3: 2.43 µs per loop
# Paul Panzer's version for differing length strings
%timeit y.view('<U1').reshape(y.shape + (-1,))[:, 0]
100000 loops, best of 3: 3.1 µs per loop
The view method is faster by a huge margin.
However, use with caution, as the memory is shared.
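To make that caution concrete, here is a small illustration (a sketch reusing the y array from the setup above): writing through the view changes the original array, whereas astype returns an independent copy.
first_chars = y.view('<U1')[::len(y[0])]   # shares memory with y
first_chars[0] = 'z'                       # writes straight into y's buffer
print(y[0][:3])                            # 'zbc' -- the original string changed too

copied = y.astype('<U1')                   # independent copy
copied[0] = 'q'                            # y is unaffected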
If you're interested in a more general solution that finds you the first letter (regardless of where it may be), I'd say the fastest/easiest way would be using the re module, compiling a pattern and searching inside a list comprehension.
>>> p = re.compile('[a-zA-Z]')
>>> [p.search(x).group() for x in v]
['a', 'd', 'g']
And, its performance on the same setup above -
%timeit [p.search(x).group() for x in y]
1 loop, best of 3: 320 ms per loop

Most pythonic way of getting index of the last list item

Given a list:
l1 = [0, 211, 576, 941, 1307, 1672, 2037]
What is the most pythonic way of getting the index of the last element of the list? Given that Python lists are zero-indexed, is it:
len(l1) - 1
Or, is it the following which uses Python's list operations:
l1.index(l1[-1])
Both return the same value, that is 6.
Only the first is correct:
>>> lst = [1, 2, 3, 4, 1]
>>> len(lst) - 1
4
>>> lst.index(lst[-1])
0
However, it depends on what you mean by "the index of the last element".
Note that index must traverse the whole list in order to provide an answer:
In [1]: %%timeit lst = list(range(100000))
...: lst.index(lst[-1])
...:
1000 loops, best of 3: 1.82 ms per loop
In [2]: %%timeit lst = list(range(100000))
len(lst)-1
...:
The slowest run took 80.20 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 109 ns per loop
Note that the second timing is in nanoseconds versus milliseconds for the first one.
You should use the first. Why?
>>> l1 = [1,2,3,4,3]
>>> l1.index(l1[-1])
2
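As a side note on that ambiguity: if what you actually want is the index of the last occurrence of a particular value (rather than simply the last position), a sketch of one way to get it:
lst = [1, 2, 3, 4, 1]
idx = len(lst) - 1 - lst[::-1].index(1)   # 4, the last position holding the value 1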
Bakuriu's answer is great!
In addition, it should be mentioned that you rarely need this value. There are usually other and better ways to do what you want to do. Consider this answer as a sidenote :)
As you mention, getting the last element can be done this way:
lst = [1,2,4,2,3]
print lst[-1] # 3
If you need to iterate over a list, you should do this:
for element in lst:
    # do something with element
If you still need the index, this is the preferred method:
for i, element in enumerate(lst):
    # i is the index, element is the actual list element

efficiently compute ordering permutations in numpy array

I've got a numpy array. What is the fastest way to compute all the permutations of orderings?
What I mean is: given the first element in my array, I want a list of all the elements that sequentially follow it. Then, given the second element, a list of all the elements that follow it.
x = np.array(["a", "b", "c", "d"])
So for my list: b, c and d follow a; c and d follow b; and d follows c.
A potential output looks like:
[
["a","b"],
["a","c"],
["a","d"],
["b","c"],
["b","d"],
["c","d"],
]
I will need to do this several million times so I am looking for an efficient solution.
I tried something like:
im = np.vstack([x]*len(x))
a = np.vstack(([im], [im.T])).T
results = a[np.triu_indices(len(x),1)]
but it's actually slower than looping...
You can use itertools functions like chain.from_iterable and combinations together with np.fromiter for this. This involves no explicit Python loop, but it's still not a pure NumPy solution:
>>> from itertools import combinations, chain
>>> arr = np.fromiter(chain.from_iterable(combinations(x, 2)), dtype=x.dtype)
>>> arr.reshape(arr.size // 2, 2)
array([['a', 'b'],
['a', 'c'],
['a', 'd'],
...,
['b', 'c'],
['b', 'd'],
['c', 'd']],
dtype='|S1')
Timing comparisons:
>>> x = np.array(["a", "b", "c", "d"]*100)
>>> %%timeit
im = np.vstack([x]*len(x))
a = np.vstack(([im], [im.T])).T
results = a[np.triu_indices(len(x),1)]
...
10 loops, best of 3: 29.2 ms per loop
>>> %%timeit
arr = np.fromiter(chain.from_iterable(combinations(x, 2)), dtype=x.dtype)
arr.reshape(arr.size // 2, 2)
...
100 loops, best of 3: 6.63 ms per loop
I've been browsing the source and it seems the tri functions have had some very substantial improvements relatively recently. The file is all Python so you can just copy it into your directory if that helps.
I seem to have completely different timings to Ashwini Chaudhary, even after taking this into account.
It is very important to know the size of the arrays you want to do this on; if it is small you should cache intermediates like triu_indices.
The fastest code for me was:
import numpy

def triangalize_1(x):
    xs, ys = numpy.triu_indices(len(x), 1)
    return numpy.array([x[xs], x[ys]]).T
unless the x array is small.
If x is small, caching was best:
triu_cache = {}

def triangalize_1(x):
    if len(x) in triu_cache:
        xs, ys = triu_cache[len(x)]
    else:
        xs, ys = numpy.triu_indices(len(x), 1)
        triu_cache[len(x)] = xs, ys
    return numpy.array([x[xs], x[ys]]).T
I wouldn't do this for large x because of memory requirements.

Python - Splitting an array into two using an optimized for loop

This is a followup question to a question I posted here, but it's a very different question, so I thought I would post it separately.
I have a Python script which reads a very large array, and I needed to optimize an operation on each element (see the referenced SO question). I now need to split the output array into two separate arrays.
I have the code:
output = [True if (len(element_in_array) % 2) else False for element_in_array in master_list]
which outputs an array of length len(master_list) consisting of True or False, depending on if the length of element_in_array is odd or even. My problem is that I need to split master_list into two arrays: one array containing the element_in_array's that correspond to the True elements in output and another containing the element_in_array's corresponding to the False elements in output.
This can clearly be done with traditional list operations such as append, but I need this to be as optimized and as fast as possible. I have many millions of elements in my master_list, so is there a way to accomplish this without directly looping through master_list and using append to create two new arrays?
Any advice would be greatly appreciated.
Thanks!
You can use itertools.compress:
>>> from itertools import compress, imap
>>> import operator
>>> lis = range(10)
>>> output = [random.choice([True, False]) for _ in xrange(10)]
>>> output
[True, True, False, False, False, False, False, False, False, False]
>>> truthy = list(compress(lis, output))
>>> truthy
[0, 1]
>>> falsy = list(compress(lis, imap(operator.not_,output)))
>>> falsy
[2, 3, 4, 5, 6, 7, 8, 9]
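For Python 3, where itertools.imap and xrange no longer exist, the same idea looks like this (a sketch):
import random
from itertools import compress
from operator import not_

lis = list(range(10))
output = [random.choice([True, False]) for _ in range(10)]
truthy = list(compress(lis, output))
falsy = list(compress(lis, map(not_, output)))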
Go for NumPy if you want an even faster solution; it also allows us to filter arrays based on boolean arrays:
>>> import numpy as np
>>> a = np.random.random(10)*10
>>> a
array([ 2.94518349, 0.09536957, 8.74605883, 4.05063779, 2.11192606,
2.24215582, 7.02203768, 2.1267423 , 7.6526713 , 3.81429322])
>>> output = np.array([True, True, False, False, False, False, False, False, False, False])
>>> a[output]
array([ 2.94518349, 0.09536957])
>>> a[~output]
array([ 8.74605883, 4.05063779, 2.11192606, 2.24215582, 7.02203768,
2.1267423 , 7.6526713 , 3.81429322])
Timing comparison:
>>> lis = range(1000)
>>> output = [random.choice([True, False]) for _ in xrange(1000)]
>>> a = np.random.random(1000)*100
>>> output_n = np.array(output)
>>> %timeit list(compress(lis, output))
10000 loops, best of 3: 44.9 us per loop
>>> %timeit a[output_n]
10000 loops, best of 3: 20.9 us per loop
>>> %timeit list(compress(lis, imap(operator.not_,output)))
1000 loops, best of 3: 150 us per loop
>>> %timeit a[~output_n]
10000 loops, best of 3: 28.7 us per loop
If you can use NumPy, this will be a lot simpler. And, as a bonus, it'll also be a lot faster, and it'll use a lot less memory to store your giant array. For example:
>>> import numpy as np
>>> import random
>>> # create an array of 1000 arrays of length 1-1000
>>> a = np.array([np.random.random(random.randint(1, 1000))
...               for _ in range(1000)])
>>> lengths = np.vectorize(len)(a)
>>> even_flags = lengths % 2 == 0
>>> evens, odds = a[even_flags], a[~even_flags]
>>> len(evens), len(odds)
(502, 498)
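One caveat worth noting: np.vectorize is documented as essentially a Python-level loop, so it mainly buys convenience rather than speed. Building the lengths with fromiter is an equally simple alternative (a sketch, not benchmarked here):
lengths = np.fromiter((len(x) for x in a), dtype=np.intp, count=len(a))
even_flags = lengths % 2 == 0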
You could try using the groupby function in itertools. The key function would be the one that determines whether the length of an element is even or not. The iterator returned by groupby yields key-value pairs, where key is a value returned by the key function (here, True or False) and value is a sequence of consecutive items which all share the same key. Create a dictionary which maps a value returned by the key function to a list, and you can extend the appropriate list with each batch of values from the initial iterator.
import itertools

trues = []
falses = []
d = {True: trues, False: falses}

def has_even_length(element_in_array):
    return len(element_in_array) % 2 == 0

for k, v in itertools.groupby(master_list, has_even_length):
    d[k].extend(v)
The documentation for groupby says you typically want to make sure the list is sorted on the same key returned by the key function. In this case it's OK to leave it unsorted; you'll just get more groups back from the groupby iterator, since there can be a number of alternating true/false runs in the sequence.
