Python input for itertools.product

Looking for a way to simulate nested loops (or a Cartesian product), I came across the itertools.product function.
I need a function or piece of code that receives a list of integers as input and returns a specific generator.
Example:
input = [3,2,4] -> gen = product(xrange(3), xrange(2), xrange(4))
or
input = [2,4,5,6] -> gen = product(xrange(2), xrange(4), xrange(5), xrange(6))
Since the size of the list varies, I am very confused about how to do this without a lot of pre-coding based on a crazy number of ifs keyed to the size of the list.
Also, is there a difference between calling product(range(3)) and product(xrange(3))?

import itertools

def bigproduct(*args):
    # build one xrange per requested size, then take their Cartesian product
    newargs = [xrange(x) for x in args]
    return itertools.product(*newargs)

for i in bigproduct(3, 2, 4):
    ....
range() generates the whole list up front, so it uses more time and space at creation but takes less time to return each element. xrange() generates each element on the fly, so it uses less space and initial time but does slightly more work per element. (In Python 3, range() behaves like Python 2's xrange().)

This can be easily accomplished using map:
from itertools import product
for i in product(*map(range, shape)):
    print i
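For a quick check that the map-based form matches the hand-written calls, here is a minimal sketch (Python 3, where range() is lazy like xrange(); the name shape is just the question's input list):
from itertools import product

shape = [3, 2, 4]  # the question's first example input
gen = product(*map(range, shape))
print(next(gen))  # (0, 0, 0) -- the last axis counts fastest
print(next(gen))  # (0, 0, 1)
# In total the generator yields 3 * 2 * 4 = 24 tuples.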


Python - Does iterating a generator expression impact the order of iterating another generator expression?

I am iterating over two different generator expressions with two different for loops, but I can see that iterating one generator expression affects the order of iteration of the other.
I understood this to be impossible, so I am not sure why I am seeing this weird behaviour.
we = KeyedVectors.load_word2vec_format('../input/nlpword2vecembeddingspretrained/GoogleNews-vectors-negative300.bin', binary=True)
data1 = (['this','is','an','example','text1'],['this','is','an','example','text2'],....)
data2 = (['test data1','test data2'],['test data3','test data4'],....)
txt_emb = (sum([we[token] for token in doc if token in we.key_to_index])/len(doc) for doc in data1)
phr_emb = ([sum([we[token] for token in phrase.split(' ') if token in we.key_to_index])/len(phrase.split(' ')) for phrase in phrases] for phrases in data2)
for i in txt_emb:
    print(i)
    break
for j in phr_emb:
    print(j)
    break
txt_emb :
([-0.06002714 0.00999211 0.0358354 ....],..........[0.07940271 -0.02072765 -0.03981323...])
phr_emb:
([array([-0.13269043,0.03266907,...]),array([0.04994202,0.15716553,...])],
[array([-0.06970215,0.01029968,...]),array([0.02503967,0.13970947,...])],.......)
Here txt_emb is a generator expression where each element is a list.
phr_emb is a generator expression where each element is a list containing a varying number of arrays (say 2-6).
When I iterate txt_emb first, as in the code above, I get the first element (the list at index 0) of txt_emb, as expected. Similarly, when I iterate phr_emb, I expect to get its first element (the list at index 0), but I get the second element (the list at index 1) instead.
Likewise, if I then iterate txt_emb again, I get the third element (the list at index 2) rather than the element at index 1, even though I have iterated txt_emb only once before.
I face similar issues when I zip the two generator expressions and try to iterate through the result.
I am running all this in a Kaggle notebook. If I iterate the two generator expressions separately in different cells of the notebook, I get the elements in order, as expected.
From your comments, this appears to be an issue with your two generator expressions pulling data from some other, shared iterator (probably another generator expression). When the first generator expression advances, it takes data from this third iterator, which makes it unavailable to the second generator expression.
You can recreate this issue with simpler code like this:
data = range(10) # our underlying data
data_iterator = iter(data) # our shared iterator, which could be a generator expression
doubles = (x * 2 for x in data_iterator) # first generator expression
squares = (x * x for x in data_iterator) # second generator expression
print(next(doubles), next(doubles), next(doubles)) # prints 0 2 4
print(next(squares), next(squares), next(squares)) # prints 9 16 25, not 0 1 4 as you might expect
If you take a few values from one of the generator expressions, the corresponding values are skipped in the other one. That's because both are advancing the shared data_iterator in the background, and it yields each value only once.
The solution is either to create a separate iterator for each generator expression (e.g. multiple versions of data_iterator), or, if the data is awkward or time-consuming to recompute, to dump it into a sequence such as a list, which can be iterated over repeatedly.
For instance, we could dump data_iterator into data_list, like this, and then build the generator expressions off the list:
data = range(10)
data_iterator = iter(data)
data_list = list(data_iterator) # this can be iterated upon repeatedly
doubles = (x * 2 for x in data_list)
squares = (x * x for x in data_list)
print(next(doubles), next(doubles), next(doubles)) # prints 0 2 4
print(next(squares), next(squares), next(squares)) # prints 0 1 4 as expected
Now, storing the data in a list like that may take more memory than you want. One of the nice things about generators and generator expressions is the lazy computation they allow. If you want to keep that lazy approach and only need a few values to be available to one generator before the other, because they're consumed mostly in parallel (e.g. by zip), then itertools.tee may be just what you need.
import itertools
data = range(10)
data_iterator = iter(data)
data_it1, data_it2 = itertools.tee(data_iterator) # two iterators that will yield the same results
doubles = (x * 2 for x in data_it1)
squares = (x * x for x in data_it2)
for d, s in zip(doubles, squares):  # consume the values in parallel
    print(d, s)
The iterators that tee returns can still be used if you're planning on fully consuming one generator before starting on the other, but they're a lot less efficient in that situation (just dumping the whole intermediate iterator into a list is probably better).
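To make that trade-off concrete, here is a minimal sketch (Python 3; names chosen to mirror the example above): fully draining one tee branch first forces tee to buffer every value for the other branch, at which point a plain list copy would have been simpler.
import itertools

data_it1, data_it2 = itertools.tee(iter(range(10)))
doubles = [x * 2 for x in data_it1]  # drains branch 1; tee buffers all 10 values for branch 2
squares = [x * x for x in data_it2]  # reads the buffered values back out
print(doubles[:3], squares[:3])      # [0, 2, 4] [0, 1, 4]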

Itertools Python

Is there any way to use the itertools product function so that it returns each combination of the lists step by step?
For example:
itertools.product(*mylist)
-> the solution should return the first combination of the lists, then the second one, and so on.
As @ggorlen has explained, itertools.product(...) returns an iterator. For example, if you have
import itertools
mylist = [('Hello','Hi'),('Andy','Betty')]
iterator = itertools.product(*mylist)
then next(iterator) (or iterator.__next__()) will evaluate to ('Hello', 'Andy') the first time you call it. The next call to next(iterator) will return ('Hello', 'Betty'), then ('Hi', 'Andy'), and finally ('Hi', 'Betty'), after which further calls raise StopIteration.
You can also convert an iterator into a list with list(iterator), if you are more comfortable with a list, but if you just need the first few values and mylist is big, this would be really inefficient, and it might be worth the time familiarising yourself with iterators.
Do consider whether you really just want to iterate through iterator, though. If so, just use
for combination in iterator:
    pass  # Body of loop
Even if you just need the first n elements and n is large, you can use
for combination in itertools.islice(iterator, n):
    pass  # Body of loop
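As a concrete sketch of that step-by-step use (Python 3; mylist as in the answer above), islice takes just the first few combinations without materialising the rest:
from itertools import islice, product

mylist = [('Hello', 'Hi'), ('Andy', 'Betty')]
for combo in islice(product(*mylist), 2):
    print(combo)
# ('Hello', 'Andy')
# ('Hello', 'Betty')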

Why does len() not support iterators?

Many of Python's built-in functions (any(), all(), sum() to name some) take iterables, but why does len() not?
One could always use sum(1 for i in iterable) as an equivalent, but why doesn't len() take iterables in the first place?
Many iterables are defined by generator functions, which don't have a well-defined len. Take the following, which iterates forever:
def sequence(i=0):
    while True:
        i += 1
        yield i
Basically, to have a well defined length, you need to know the entire object up front. Contrast that to a function like sum. You don't need to know the entire object at once to sum it -- Just take one element at a time and add it to what you've already summed.
Be careful with idioms like sum(1 for i in iterable): often they just exhaust iterable, so you can't use it anymore. Or it could be slow to get the i'th element if there is a lot of computation involved. It might be worth asking yourself why you need to know the length a priori. This might give you some insight into what type of data structure to use (frequently list and tuple work just fine) - or you may be able to perform your operation without needing to call len at all.
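A minimal sketch of that exhaustion pitfall (shown in Python 3; the counting idiom itself works in Python 2 as well):
it = iter([1, 2, 3])
print(sum(1 for _ in it))  # 3
print(list(it))            # [] -- the count consumed the iterator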
This is an iterable:
def forever():
    while True:
        yield 1
Yet it has no length. If you want to find the length of a finite iterable, the only way to do so, by definition of what an iterable is (something you can repeatedly call to get the next element until you reach the end), is to expand the iterable out fully, e.g.:
len(list(the_iterable))
As mgilson pointed out, you might want to ask yourself - why do you want to know the length of a particular iterable? Feel free to comment and I'll add a specific example.
If you want to keep track of how many elements you have processed, instead of doing:
num_elements = len(the_iterable)
for element in the_iterable:
    ...
do:
num_elements = 0
for element in the_iterable:
    num_elements += 1
    ...
If you want a memory-efficient way of counting how many elements end up in a comprehension, you might wish you could write:
num_relevant = len(x for x in xrange(100000) if x%14==0)
but that raises a TypeError, since a generator has no length. It also wouldn't be efficient to build the whole list just to count it (you don't need the whole list):
num_relevant = len([x for x in xrange(100000) if x%14==0])
sum would probably be the most handy way, but it looks quite weird and it isn't immediately clear what you're doing:
num_relevant = sum(1 for _ in (x for x in xrange(100000) if x%14==0))
So, you should probably write your own function:
def exhaustive_len(iterable):
    length = 0
    for _ in iterable:
        length += 1
    return length

exhaustive_len(x for x in xrange(100000) if x%14==0)
The long name is to help remind you that it does consume the iterable; for example, this won't work as you might think:
def yield_numbers():
    yield 1; yield 2; yield 3; yield 5; yield 7

the_nums = yield_numbers()
total_nums = exhaustive_len(the_nums)
for num in the_nums:
    print num
because exhaustive_len has already consumed all the elements.
EDIT: Ah in that case you would use exhaustive_len(open("file.txt")), as you have to process all lines in the file one-by-one to see how many there are, and it would be wasteful to store the entire file in memory by calling list.
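As a sketch of that file case (same hypothetical file name; Python 2 and 3), the one-pass count can also be written with sum so the file is never held in memory:
with open("file.txt") as f:
    num_lines = sum(1 for _ in f)  # consumes the file line by line, keeping only the count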

Python: fastest way to create a list of n lists

So I was wondering how to best create a list of blank lists:
[[],[],[]...]
Because of how Python works with lists in memory, this doesn't work:
[[]]*n
This does create [[],[],...] but each element is the same list:
d = [[]]*n
d[0].append(1)
#[[1],[1],...]
Something like a list comprehension works:
d = [[] for x in xrange(0,n)]
But this uses the Python VM for looping. Is there any way to use an implied loop (taking advantage of it being written in C)?
d = []
map(lambda _: d.append([]), xrange(0, n))
This is actually slower. :(
Probably the only way which is marginally faster than
d = [[] for x in xrange(n)]
is
from itertools import repeat
d = [[] for i in repeat(None, n)]
It does not have to create a new int object on every iteration and is about 15% faster on my machine.
Edit: Using NumPy, you can avoid the Python loop using
d = numpy.empty((n, 0)).tolist()
but this is actually 2.5 times slower than the list comprehension.
List comprehensions actually are implemented more efficiently than explicit looping (see the dis output for example functions), and the map way has to invoke an opaque callable object on every iteration, which incurs considerable overhead.
Regardless, [[] for _dummy in xrange(n)] is the right way to do it, and none of the tiny (if existent at all) speed differences between the various other ways should matter - unless, of course, you spend most of your time doing this, in which case you should work on your algorithms instead. How often do you create these lists?
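For anyone who wants to reproduce the comparison, a minimal timeit sketch along these lines should do (Python 3, where xrange is spelled range; absolute numbers vary by machine and interpreter version):
import timeit

n = 10000
print(timeit.timeit("[[] for _ in range(n)]", globals={"n": n}, number=1000))
print(timeit.timeit("[[] for _ in repeat(None, n)]",
                    setup="from itertools import repeat", globals={"n": n}, number=1000))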
Here are two methods: one sweet, simple, and conceptual; the other more formal, and extensible to a variety of situations after having read a dataset.
Method 1: Conceptual
X2=[]
X1=[1,2,3]
X2.append(X1)
X3=[4,5,6]
X2.append(X3)
X2 thus holds [[1,2,3],[4,5,6]], i.e. a list of lists.
Method 2: Formal and extensible
Another way is to build a list of lists of different numbers as they are read from a file. (The file here holds the dataset train.)
train is a dataset with, say, 50 rows and 20 columns, i.e. train[0] gives me the 1st row of a CSV file, train[1] gives me the 2nd row, and so on. I want to turn the 50-row dataset into one list of lists, except for column 0, which is my explained variable here and so must be removed from the original train dataset, building it up list by list - i.e. a list of lists. Here's the code that does that.
Note that I read from index 1 in the inner loop, since I am interested in the explanatory variables only. And I re-initialize X1=[] in the outer loop, else X2.append(X1[0:(len(train[0])-1)]) would append the same ever-growing X1 over and over again - besides, it is more memory efficient.
X2 = []
for j in range(0, len(train)):
    X1 = []
    for k in range(1, len(train[0])):
        txt2 = train[j][k]
        X1.append(txt2)
    X2.append(X1[0:(len(train[0])-1)])
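The nested loop above can be collapsed into a single comprehension; a minimal sketch, assuming train is a list of rows:
X2 = [row[1:] for row in train]  # drop column 0 (the explained variable) from every row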
To create a list of empty lists, use the syntax below:
x = [[] for i in range(10)]
This creates a 1-d list of empty lists; to initialize the inner lists, put a number inside the inner brackets ([[number] ...]), and set the length of the outer list via range(length).
To create a list of lists of lists, use the syntax below:
x = [[[0] for i in range(3)] for i in range(10)]
This initializes a 10x3 structure of [0] lists.
To access/manipulate an element (indices must stay within the 10x3 bounds):
x[1][2][0] = value
So I did some speed comparisons to find the fastest way.
List comprehensions are indeed very fast. The only way to get close is to avoid bytecode being executed during construction of the list.
My first attempt was the following method, which would appear to be faster in principle:
l = [[]]
for _ in range(n):
    l.extend(list(map(list, l)))  # materialize the copies first so l isn't consumed while it grows (needed on Python 3, where map is lazy)
(produces a list of length 2**n, of course)
This construction is twice as slow as the list comprehension, according to timeit, for both short and long (a million) lists.
My second attempt was to use starmap to call the list constructor for me. There is one construction that appears to run the list constructor at top speed, but it is still slower, if only by a tiny amount:
from itertools import starmap
l = list(starmap(list,[()]*(1<<n)))
Interestingly enough, the execution time suggests that it is the final list call that makes the starmap solution slow, since its execution time is almost exactly equal to the speed of:
l = list([] for _ in range(1<<n))
My third attempt came when I realized that list(()) also produces a list, so I tried the apparently simple:
l = list(map(list, [()]*(1<<n)))
but this was slower than the starmap call.
Conclusion: for the speed maniacs:
Do use the list comprehension.
Only call functions, if you have to.
Use builtins.
