Python - Splitting an array into two using an optimized for loop

This is a follow-up question to a question I posted here, but it's a very different question, so I thought I would post it separately.
I have a Python script which reads a very large array, and I needed to optimize an operation on each element (see the referenced SO question). I now need to split the output array into two separate arrays.
I have the code:
output = [True if (len(element_in_array) % 2) else False for element_in_array in master_list]
which outputs an array of length len(master_list) consisting of True or False, depending on whether the length of element_in_array is odd or even. My problem is that I need to split master_list into two arrays: one containing the element_in_array's that correspond to the True elements in output, and another containing those corresponding to the False elements.
This can clearly be done with traditional list operations such as append, but I need this to be as optimized and as fast as possible. I have many millions of elements in my master_list, so is there a way to accomplish this without directly looping through master_list and using append to build two new arrays?
Any advice would be greatly appreciated.
Thanks!

You can use itertools.compress:
>>> from itertools import compress, imap
>>> import operator
>>> import random
>>> lis = range(10)
>>> output = [random.choice([True, False]) for _ in xrange(10)]
>>> output
[True, True, False, False, False, False, False, False, False, False]
>>> truthy = list(compress(lis, output))
>>> truthy
[0, 1]
>>> falsy = list(compress(lis, imap(operator.not_,output)))
>>> falsy
[2, 3, 4, 5, 6, 7, 8, 9]
Go for NumPy if you want an even faster solution; it also lets us filter an array directly with a boolean mask:
>>> import numpy as np
>>> a = np.random.random(10)*10
>>> a
array([ 2.94518349, 0.09536957, 8.74605883, 4.05063779, 2.11192606,
2.24215582, 7.02203768, 2.1267423 , 7.6526713 , 3.81429322])
>>> output = np.array([True, True, False, False, False, False, False, False, False, False])
>>> a[output]
array([ 2.94518349, 0.09536957])
>>> a[~output]
array([ 8.74605883, 4.05063779, 2.11192606, 2.24215582, 7.02203768,
2.1267423 , 7.6526713 , 3.81429322])
Timing comparison:
>>> lis = range(1000)
>>> output = [random.choice([True, False]) for _ in xrange(1000)]
>>> a = np.random.random(1000)*100
>>> output_n = np.array(output)
>>> %timeit list(compress(lis, output))
10000 loops, best of 3: 44.9 us per loop
>>> %timeit a[output_n]
10000 loops, best of 3: 20.9 us per loop
>>> %timeit list(compress(lis, imap(operator.not_,output)))
1000 loops, best of 3: 150 us per loop
>>> %timeit a[~output_n]
10000 loops, best of 3: 28.7 us per loop
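The session above is Python 2 (imap, xrange). A minimal Python 3 sketch of the same compress-based split, applied to a small hypothetical master_list of sub-lists:
from itertools import compress
import operator

master_list = [[1], [1, 2], [1, 2, 3], [], [1, 2, 3, 4, 5]]  # hypothetical data

flags = [len(x) % 2 == 1 for x in master_list]           # True for odd lengths
odds = list(compress(master_list, flags))                # sub-lists of odd length
evens = list(compress(master_list, map(operator.not_, flags)))

print(odds)   # [[1], [1, 2, 3], [1, 2, 3, 4, 5]]
print(evens)  # [[1, 2], []]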

If you can use NumPy, this will be a lot simpler. And, as a bonus, it'll also be a lot faster, and it'll use a lot less memory to store your giant array. For example:
>>> import numpy as np
>>> import random
>>> # create an object array of 1000 arrays of length 1-1000
>>> # (dtype=object is needed for ragged arrays on modern NumPy)
>>> a = np.array([np.random.random(random.randint(1, 1000))
...               for _ in range(1000)], dtype=object)
>>> lengths = np.vectorize(len)(a)
>>> even_flags = lengths % 2 == 0
>>> evens, odds = a[even_flags], a[~even_flags]
>>> len(evens), len(odds)
(502, 498)

You could try using the groupby function in itertools. The key function here determines whether the length of an element is even. The iterator returned by groupby yields key-group pairs, where the key is a value returned by the key function (here, True or False) and the group is a sequence of consecutive items that all share that key. Create a dictionary which maps each possible key to a list, and you can extend the appropriate list with each group from the iterator.
import itertools

trues = []
falses = []
d = {True: trues, False: falses}

def has_even_length(element_in_array):
    return len(element_in_array) % 2 == 0

for k, v in itertools.groupby(master_list, has_even_length):
    d[k].extend(v)
The documentation for groupby says you typically want to make sure the list is sorted on the same key returned by the key function. In this case it's OK to leave it unsorted; you'll just get more than one group per key from the iterator, since there can be any number of alternating true/false runs in the sequence.
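For example, with a small hypothetical master_list of strings, running the code above would give:
master_list = ['ab', 'x', 'cd', 'yz', 'q']
# after the loop:
# trues  == ['ab', 'cd', 'yz']   (even lengths)
# falses == ['x', 'q']           (odd lengths)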

Related

What is the fastest way to compare each element of a list with corresponding element of another list?

I want to compare each element of a list with the corresponding element of another list to see if it is greater or lesser.
list1 = [4,1,3]
list2 = [2,5,2]
So compare 4 with 2, 1 with 5, 3 with 2.
Are there other fast ways to do that other than using a for loop?
You can use the numpy library for this, and it's significantly faster:
>>> import numpy as np
>>> list1 = np.array([4,1,3])
>>> list2 = np.array([2,5,2])
>>> list1 < list2
array([False, True, False])
The time taken to run it:
>>> import timeit
>>> timeit.timeit("""
... import numpy as np
... list1 = np.array([4,1,3])
... list2 = np.array([2,5,2])
... print(list1 < list2)
... """,number=1)
[False True False]
0.00011205673217773438
The fact that numpy is basically written in C/C++ makes it considerably faster, as you can see if you look into its implementation.
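Note that the timeit call above also measures the numpy import and the print. A sketch that times only the comparison itself, via a setup string (the number of repetitions is arbitrary):
import timeit

setup = "import numpy as np; list1 = np.array([4,1,3]); list2 = np.array([2,5,2])"
print(timeit.timeit("list1 < list2", setup=setup, number=100000))  # total seconds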
You can use zip() to pair up the list elements and a list comprehension to create the resulting list:
list1 = [4,1,3]
list2 = [2,5,2]
list1_greater_list2_value = [a > b for a,b in zip(list1,list2)]
print ("list1-value greater than list2-value:", list1_greater_list2_value)
Output:
list1-value greater than list2-value: [True, False, True]
This does the same work as a normal loop - but looks more pythonic.
You can map the two lists to an operator such as int.__lt__ (the "less than" operator):
list(map(int.__lt__, list1, list2))
With your sample input, this returns:
[False, True, False]
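One caveat: int.__lt__ requires the left-hand elements to actually be ints (it raises a TypeError for, say, floats). operator.lt is a more general spelling of the same idea; a sketch:
import operator

list1 = [4, 1, 3]
list2 = [2, 5, 2]
print(list(map(operator.lt, list1, list2)))  # [False, True, False]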
You can do it like this.
Using a lambda:
In [91]: map(lambda x,y:x<y,list1,list2)
Out[91]: [False, True, False]
With zip and for loop:
In [83]: [i<j for i,j in zip(list1,list2)]
Out[83]: [False, True, False]
Execution timings for the lambda and the list comprehension (note that %timeit test_lambda, without parentheses, only times looking up the function name, not calling it):
In [101]: def test_lambda():
     ...:     map(lambda x,y:x>y,list1,list2)
     ...:
In [102]: def test_forloop():
     ...:     [i<j for i,j in zip(list1,list2)]
     ...:
In [103]: %timeit test_lambda
...:
10000000 loops, best of 3: 21.9 ns per loop
In [104]: %timeit test_forloop
10000000 loops, best of 3: 21 ns per loop
Just for the sake of better comparison, if all approaches are timed in the same way:
paul_numpy: 0.15166378399999303
artner_LCzip: 0.9575707100000272
bhlsing_map__int__: 1.3945185019999826
rahul_maplambda: 1.4970900099999653
rahul_LCzip: 0.9604789950000168
Code used for timing:
import timeit

setup_str = '''import numpy as np
list1 = list(map(int, np.random.randint(0, 100, 1000000)))
list2 = list(map(int, np.random.randint(0, 100, 1000000)))'''
paul_numpy = 'list1 = np.array(list1); list2 = np.array(list2);list1 < list2'
t = timeit.Timer(paul_numpy, setup_str)
print('paul_numpy: ', min(t.repeat(number=10)))
artner = '[a > b for a,b in zip(list1,list2)]'
t = timeit.Timer(artner, setup_str)
print('artner_LCzip: ', min(t.repeat(number=10)))
blhsing = 'list(map(int.__lt__, list1, list2))'
t = timeit.Timer(blhsing, setup_str)
print('bhlsing_map__int__: ', min(t.repeat(number=10)))
rahul_lambda = 'list(map(lambda x,y:x<y,list1,list2))'
t = timeit.Timer(rahul_lambda, setup_str)
print('rahul_maplambda: ', min(t.repeat(number=10)))
rahul_zipfor = '[i<j for i,j in zip(list1,list2)]'
t = timeit.Timer(rahul_zipfor, setup_str)
print('rahul_LCzip: ', min(t.repeat(number=10)))
Using enumerate eliminates the need for any zip or map here
[item > list2[idx] for idx, item in enumerate(list1)]
# [True, False, True]
Actually, I should point out that there is no faster way of doing your desired action than a loop, because if you look closely at the problem you can see that the operation takes at least O(n) under any algorithm. So all the answers are right and will do what you want with zip or map or the like, but these solutions mostly make your implementation more beautiful and Pythonic, not asymptotically faster. Some, like blhsing's answer, are a little faster in practice because of lower per-element overhead, not a better time complexity.

Searching a nested list for a word across a lot of data in Python

I have to find which lists inside a nested list contain a word, and return a boolean numpy array.
nested_list = [['a','b','c'],['a','b'],['b','c'],['c']]
words = 'c'
result=[1,0,1,1]
I'm using this list comprehension to do it and it works
np.array([words in x for x in nested_list])
But I'm working with a nested list with 700k lists inside, so it takes a lot of time. Also, I have to do it many times: the lists are static, but the words can change.
One pass with the list comprehension takes 0.36s; I need a way to do it faster. Is there one?
We could flatten out the elements in all sub-lists to give us a 1D array. Then, we simply look for any occurrence of 'c' within the limits of each sub-list in the flattened 1D array. Thus, with that philosophy, we could use two approaches, based on how we count the occurrence of any c.
Approach #1 : One approach with np.bincount -
lens = np.array([len(i) for i in nested_list])
arr = np.concatenate(nested_list)
ids = np.repeat(np.arange(lens.size),lens)
out = np.bincount(ids, arr=='c')!=0
Since, as stated in the question, nested_list won't change across iterations, we can re-use everything and just loop for the final step.
Approach #2 : Another approach with np.add.reduceat reusing arr and lens from previous one -
grp_idx = np.append(0,lens[:-1].cumsum())
out = np.add.reduceat(arr=='c', grp_idx)!=0
When looping through a list of words, we can keep this approach vectorized for the final step by using np.add.reduceat along an axis and using broadcasting to give us a 2D array boolean, like so -
np.add.reduceat(arr==np.array(words)[:,None], grp_idx, axis=1)!=0
Sample run -
In [344]: nested_list
Out[344]: [['a', 'b', 'c'], ['a', 'b'], ['b', 'c'], ['c']]
In [345]: words
Out[345]: ['c', 'b']
In [346]: lens = np.array([len(i) for i in nested_list])
...: arr = np.concatenate(nested_list)
...: grp_idx = np.append(0,lens[:-1].cumsum())
...:
In [347]: np.add.reduceat(arr==np.array(words)[:,None], grp_idx, axis=1)!=0
Out[347]:
array([[ True, False, True, True], # matches for 'c'
[ True, True, True, False]]) # matches for 'b'
A generator expression would be preferable when iterating once (in terms of performance). A solution using the numpy.fromiter function:
import numpy as np

nested_list = [['a','b','c'],['a','b'],['b','c'],['c']]
words = 'c'
arr = np.fromiter((words in l for l in nested_list), int)
print(arr)
The output:
[1 0 1 1]
https://docs.scipy.org/doc/numpy/reference/generated/numpy.fromiter.html
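As a small refinement, passing the optional count argument lets fromiter preallocate the output array instead of growing it as it goes; a sketch of the same call:
import numpy as np

nested_list = [['a','b','c'], ['a','b'], ['b','c'], ['c']]
words = 'c'
# count= lets fromiter preallocate the result array
arr = np.fromiter((words in l for l in nested_list), dtype=int,
                  count=len(nested_list))
print(arr)  # [1 0 1 1]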
How much time is it taking you to finish your loop? In my test case it only takes a few hundred milliseconds.
import random
# generate the nested lists
a = list('abcdefghijklmnop')
nested_list = [[random.choice(a) for x in range(random.randint(1,30))]
               for n in range(700000)]
%%timeit -n 10
word = 'c'
b = [word in x for x in nested_list]
# 10 loops, best of 3: 191 ms per loop
Reducing each internal list to a set gives some time savings...
nested_sets = [set(x) for x in nested_list]
%%timeit -n 10
word = 'c'
b = [word in s for s in nested_sets]
# 10 loops, best of 3: 132 ms per loop
And once you have turned it into a list of sets, you can build a list of boolean tuples. No real time savings though.
%%timeit -n 10
words = list('abcde')
b = [tuple(word in s for word in words) for s in nested_sets]
# 10 loops, best of 3: 749 ms per loop
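Since the lists are static and only the words change, another option is to precompute an inverted index once and answer each subsequent word query with a dictionary lookup. A sketch (index is a hypothetical name):
from collections import defaultdict
import numpy as np

nested_list = [['a','b','c'], ['a','b'], ['b','c'], ['c']]

# Build once: word -> boolean array marking which sub-lists contain it
index = defaultdict(lambda: np.zeros(len(nested_list), dtype=bool))
for i, sub in enumerate(nested_list):
    for w in set(sub):  # set() avoids redundant writes for repeated words
        index[w][i] = True

print(index['c'].astype(int))  # [1 0 1 1]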

Determine sum of numpy array while excluding certain values

I would like to determine the sum of a two dimensional numpy array. However, I want to exclude elements with a certain value from this summation. What is the most efficient way to do this?
For example, here I initialize a two dimensional numpy array of 1s and replace several of them by 2:
import numpy
data_set = numpy.ones((10, 10))
data_set[4][4] = 2
data_set[5][5] = 2
data_set[6][6] = 2
How can I sum over the elements in my two dimensional array while excluding all of the 2s? Note that with the 10 by 10 array the correct answer should be 97 as I replaced three elements with the value 2.
I know I can do this with nested for loops. For example:
elements = []
for idx_x in range(data_set.shape[0]):
    for idx_y in range(data_set.shape[1]):
        if data_set[idx_x][idx_y] != 2:
            elements.append(data_set[idx_x][idx_y])
data_set_sum = numpy.sum(elements)
However on my actual data (which is very large) this is too slow. What is the correct way of doing this?
Use numpy's capability of indexing with boolean arrays. In the below example data_set!=2 evaluates to a boolean array which is True whenever the element is not 2 (and has the correct shape). So data_set[data_set!=2] is a fast and convenient way to get an array which doesn't contain a certain value. Of course, the boolean expression can be more complex.
In [1]: import numpy as np
In [2]: data_set = np.ones((10, 10))
In [4]: data_set[4,4] = 2
In [5]: data_set[5,5] = 2
In [6]: data_set[6,6] = 2
In [7]: data_set[data_set != 2].sum()
Out[7]: 97.0
In [8]: data_set != 2
Out[8]:
array([[ True, True, True, True, True, True, True, True, True,
True],
[ True, True, True, True, True, True, True, True, True,
True],
...
[ True, True, True, True, True, True, True, True, True,
True]], dtype=bool)
Without numpy, the solution is not much more complex:
x = [1,2,3,4,5,6,7]
sum(y for y in x if y != 7)
# 21
Works for a list of excluded values too:
# set is faster for resolving `in`
exl = set([1,2,3])
sum(y for y in x if y not in exl)
# 22
Using np.sum's where= argument, we avoid the need for the array copying that would otherwise be triggered by using advanced array indexing:
>>> import numpy as np
>>> data_set = np.ones((10,10))
>>> data_set[(4,5,6),(4,5,6)] = 2
>>> np.sum(data_set, where=data_set != 2)
97.0
>>> data_set.sum(where=data_set != 2)
97.0
https://numpy.org/doc/stable/reference/generated/numpy.sum.html
Advanced indexing always returns a copy of the data (contrast with basic slicing that returns a view).
https://numpy.org/doc/stable/user/basics.indexing.html#advanced-indexing
How about this way, which makes use of numpy's boolean capabilities?
We simply set all the values that meet the specification to zero before taking the sum; that way we don't change the shape of the array, as we would if we filtered them out of the array.
The other benefit of this is that it means we can sum along an axis after the filter is applied.
import numpy
data_set = numpy.ones((10, 10))
data_set[4][4] = 2
data_set[5][5] = 2
data_set[6][6] = 2
print "Sum", data_set.sum()
another_set = numpy.array(data_set) # Take a copy, we'll need that later
data_set[data_set == 2] = 0 # Set all the values that are 2 to zero
print "Filtered sum", data_set.sum()
print "Along axis", data_set.sum(0), data_set.sum(1)
Equally we could use any other boolean to set the data we wish to exclude from the sum.
another_set[(another_set > 1) & (another_set < 3)] = 0
print "Another filtered sum", another_set.sum()

How to vectorize a set of items in python

I am trying to take a set of arrays and convert them into a matrix that will essentially be an indicator matrix for a set of items.
I currently have an array of N items:
A_ = [A,B,C,D,E,...,Y,Z]
In addition, I have S arrays (currently stored in an array) that each have a subset of the items in vector A_.
B_ = [A,B,C,Z]
C_ = [A,B]
D_ = [D,Y,Z]
The array they are stored in is structured like so:
X = [B_,C_,D_]
I would like to convert the data into an indicator matrix for easier operation. It would ideally look like this (an S x N matrix):
[1,1,1,0,...,0,1]
[1,1,0,0,...,0,0]
[0,0,0,1,...,1,1]
I know how I could use a for loop to iterate through this and create the matrix but I was wondering if there is a more efficient/syntactically simple way of going about this.
A concise way would be to use a list comprehension.
# Create a list containing the alphabet using a list comprehension
A_ = [chr(i) for i in range(65,91)]
# A list containing two sub-lists with some letters
M = [["A","B","C","Z"],["A","B","G"]]
# Nested list comprehension to convert character matrix
# into matrix of indicator vectors
I_M = [[1 if char in sublist else 0 for char in A_] for sublist in M]
The last line is a bit dense if you aren't familiar with comprehensions, but it's not too tricky once you take it apart. The inner part...
[1 if char in sublist else 0 for char in A_]
Is a list comprehension in itself, which creates a list containing 1's for all characters (char) in A_ which are also found in sublist, and 0's for characters not found in sublist.
The outer bit...
[ ... for sublist in M]
simply runs the inner list comprehension for each sublist found in M, resulting in a list of all the sublists created by the inner list comprehension stored in I_M.
Edit:
While I tried to keep this example simple, it is worth noting (as DSM and jterrace point out) that testing membership in vanilla lists is O(N). Converting each sublist to a hash-based structure like a set would speed up the checking for large sublists.
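A sketch of that variant, converting each sublist to a set up front so the membership tests are O(1):
# Alphabet A-Z
A_ = [chr(i) for i in range(65, 91)]
M = [["A", "B", "C", "Z"], ["A", "B", "G"]]

M_sets = [set(sublist) for sublist in M]  # O(1) membership tests
I_M = [[1 if char in s else 0 for char in A_] for s in M_sets]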
Using numpy:
>>> import numpy as np
>>> A_ = np.array(['A','B','C','D','E','Y','Z'])
>>> B_ = np.array(['A','B','C','Z'])
>>> C_ = np.array(['A','B'])
>>> D_ = np.array(['D','Y','Z'])
>>> X = [B_,C_,D_]
>>> matrix = np.array([np.in1d(A_, x) for x in X])
>>> matrix.shape
(3, 7)
>>> matrix
array([[ True, True, True, False, False, False, True],
[ True, True, False, False, False, False, False],
[False, False, False, True, False, True, True]], dtype=bool)
This is O(NS).
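On newer NumPy versions, np.isin is the preferred spelling of np.in1d; the same approach with it, cast to 0/1 to match the question's desired output:
import numpy as np

A_ = np.array(['A','B','C','D','E','Y','Z'])
X = [np.array(['A','B','C','Z']), np.array(['A','B']), np.array(['D','Y','Z'])]

matrix = np.array([np.isin(A_, x) for x in X]).astype(int)
print(matrix)
# [[1 1 1 0 0 0 1]
#  [1 1 0 0 0 0 0]
#  [0 0 0 1 0 1 1]]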

Counting the number of True Booleans in a Python List

I have a list of Booleans:
[True, True, False, False, False, True]
and I am looking for a way to count the number of True in the list (so in the example above, I want the return to be 3.) I have found examples of looking for the number of occurrences of specific elements, but is there a more efficient way to do it since I'm working with Booleans? I'm thinking of something analogous to all or any.
True is equal to 1.
>>> sum([True, True, False, False, False, True])
3
list has a count method:
>>> [True,True,False].count(True)
2
This is actually more efficient than sum, as well as being more explicit about the intent, so there's no reason to use sum:
In [1]: import random
In [2]: x = [random.choice([True, False]) for i in range(100)]
In [3]: %timeit x.count(True)
970 ns ± 41.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [4]: %timeit sum(x)
1.72 µs ± 161 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
If you are only concerned with the constant True, a simple sum is fine. However, keep in mind that in Python other values evaluate as True as well. A more robust solution would be to use the bool builtin:
>>> l = [1, 2, True, False]
>>> sum(bool(x) for x in l)
3
UPDATE: Here's another similarly robust solution that has the advantage of being more transparent:
>>> sum(1 for x in l if x)
3
P.S. Python trivia: True could be true without being 1. Warning: do not try this at work!
>>> True = 2
>>> if True: print('true')
...
true
>>> l = [True, True, False, True]
>>> sum(l)
6
>>> sum(bool(x) for x in l)
3
>>> sum(1 for x in l if x)
3
Much more evil:
True = False
You can use sum():
>>> sum([True, True, False, False, False, True])
3
After reading all the answers and comments on this question, I thought to do a small experiment.
I generated 50,000 random booleans and called sum and count on them.
Here are my results:
>>> import random, time
>>> a = [bool(random.getrandbits(1)) for x in range(50000)]
>>> len(a)
50000
>>> a.count(False)
24884
>>> a.count(True)
25116
>>> def count_it(a):
...     curr = time.time()
...     counting = a.count(True)
...     print("Count it = " + str(time.time() - curr))
...     return counting
...
>>> def sum_it(a):
...     curr = time.time()
...     counting = sum(a)
...     print("Sum it = " + str(time.time() - curr))
...     return counting
...
>>> count_it(a)
Count it = 0.00121307373046875
25015
>>> sum_it(a)
Sum it = 0.004102230072021484
25015
Just to be sure, I repeated it several more times:
>>> count_it(a)
Count it = 0.0013530254364013672
25015
>>> count_it(a)
Count it = 0.0014507770538330078
25015
>>> count_it(a)
Count it = 0.0013344287872314453
25015
>>> sum_it(a)
Sum it = 0.003480195999145508
25015
>>> sum_it(a)
Sum it = 0.0035257339477539062
25015
>>> sum_it(a)
Sum it = 0.003350496292114258
25015
>>> sum_it(a)
Sum it = 0.003744363784790039
25015
And as you can see, count is 3 times faster than sum. So I would suggest using count, as I did in count_it.
Python version: 3.6.7
CPU cores: 4
RAM size: 16 GB
OS: Ubuntu 18.04.1 LTS
Just for completeness' sake (sum is usually preferable), I wanted to mention that we can also use filter to get the truthy values. In the usual case, filter accepts a function as the first argument, but if you pass it None, it will filter for all "truthy" values. This feature is somewhat surprising, but is well documented and works in both Python 2 and 3.
The difference between the versions, is that in Python 2 filter returns a list, so we can use len:
>>> bool_list = [True, True, False, False, False, True]
>>> filter(None, bool_list)
[True, True, True]
>>> len(filter(None, bool_list))
3
But in Python 3, filter returns an iterator, so we can't use len, and if we want to avoid using sum (for any reason) we need to resort to converting the iterator to a list (which makes this much less pretty):
>>> bool_list = [True, True, False, False, False, True]
>>> filter(None, bool_list)
<builtins.filter at 0x7f64feba5710>
>>> list(filter(None, bool_list))
[True, True, True]
>>> len(list(filter(None, bool_list)))
3
It is safer to run through bool first. This is easily done:
>>> sum(map(bool,[True, True, False, False, False, True]))
3
Then you will catch everything that Python considers True or False into the appropriate bucket:
>>> allTrue=[True, not False, True+1,'0', ' ', 1, [0], {0:0}, set([0])]
>>> list(map(bool,allTrue))
[True, True, True, True, True, True, True, True, True]
If you prefer, you can use a comprehension:
>>> allFalse=['',[],{},False,0,set(),(), not True, True-1]
>>> [bool(i) for i in allFalse]
[False, False, False, False, False, False, False, False, False]
I prefer len([b for b in boollist if b is True]) (or sum(1 for b in boollist if b is True) as the generator equivalent), as it's quite self-explanatory. Less 'magical' than the answer proposed by Ignacio Vazquez-Abrams.
Alternatively, you can do this, which still assumes that bool is convertable to int, but makes no assumptions about the value of True:
ntrue = sum(boollist) / int(True)
