How does zip(*) generate n-grams?


I am reviewing some notes on n-grams, and I came across a couple of interesting functions. First, there's this one to generate bigrams:
def bigrams(word):
    return sorted(list(set(''.join(bigram)
                           for bigram in zip(word, word[1:]))))

def bigram_print(word):
    print("The bigrams of", word, "are:")
    print(bigrams(word))

bigram_print("ababa")
bigram_print("babab")
After doing some reading and playing on my own with Python, I understand why this works. However, when looking at the function below, I am very puzzled by the use of zip(*[word[i:] for i in range(n)]). I understand that * is an unpacking operator, but I am really getting tripped up by how it works in combination with the list comprehension. Can anyone explain?
def ngrams(word, n):
    return sorted(list(set(''.join(ngram)
                           for ngram in zip(*[word[i:]
                                              for i in range(n)]))))

def ngram_print(word, n):
    print("The {}-grams of {} are:".format(n, word))
    print(ngrams(word, n))

for n in [2, 3, 4]:
    ngram_print("ababa", n)
    ngram_print("babab", n)
    print()

The following example should explain how this works. I have added code and a visual representation of it.
Intuition
The core idea is to zip together multiple versions of the same list where each of them starts from the next subsequent element.
Let's say L is a list of words/elements ['A', 'B', 'C', 'D'].
Then L, L[1:] and L[2:] get zipped, which means the first elements of each (the 1st, 2nd and 3rd elements of L) are grouped together, then the second elements are grouped together, and so on.
Visually, the statement we are worried about breaks down as:
  zip ( * [L[i:] for i in range(n)] )
# |_|   | |_______________________|
#  |    |             |
# zip unpack  versions of L with the first 0 to n-1 elements skipped
Code example
l = ['A','B','C','D']
print('original list: '.ljust(27),l)
print('list skipping 1st element: ',l[1:])
print('list skipping 2 elements: '.ljust(27),l[2:])
print('bi-gram: '.ljust(27), list(zip(l,l[1:])))
print('tri-gram: '.ljust(27), list(zip(l,l[1:],l[2:])))
original list:              ['A', 'B', 'C', 'D']
list skipping 1st element:  ['B', 'C', 'D']
list skipping 2 elements:   ['C', 'D']
bi-gram:                    [('A', 'B'), ('B', 'C'), ('C', 'D')]
tri-gram:                   [('A', 'B', 'C'), ('B', 'C', 'D')]
As you can see, you are basically zipping the same list but with one skipped. This zips (A, B) and (B, C) ... together for bigrams.
The * operator is for unpacking. When you change the i value to skip elements, you are basically zipping a list of [l[0:], l[1:], l[2:]...]. This is passed to the zip() and unpacked inside it with *.
zip(*[word[i:] for i in range(n)])  # where word is the list of words
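To see the unpacking at work, here is a minimal sketch (the names shifted, unpacked and explicit are only illustrative). It builds the list of shifted slices for a string and checks that zip(*shifted) gives the same result as passing the slices to zip() one by one.

```python
word = "ababa"
n = 3

# The list comprehension builds one slice per starting offset.
shifted = [word[i:] for i in range(n)]
print(shifted)  # ['ababa', 'baba', 'aba']

# The * operator passes each slice to zip() as its own argument.
unpacked = list(zip(*shifted))

# Writing the same call out by hand for n = 3.
explicit = list(zip(word[0:], word[1:], word[2:]))

print(unpacked)              # [('a', 'b', 'a'), ('b', 'a', 'b'), ('a', 'b', 'a')]
print(unpacked == explicit)  # True
```

zip() stops as soon as the shortest slice runs out, which is why only complete 3-grams appear.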
Alternative to the list comprehension
The above list comprehension is equivalent to:
n = 3
lists = []
for i in range(n):
    print(l[i:])  # comment this out if not needed
    lists.append(l[i:])
out = list(zip(*lists))
print(out)
['A', 'B', 'C', 'D']
['B', 'C', 'D']
['C', 'D']
[('A', 'B', 'C'), ('B', 'C', 'D')]

If you break down
zip(*[word[i:] for i in range(n)])
You get:
[word[i:] for i in range(n)]
Which is equivalent to:
[word[0:], word[1:], word[2:], ... word[n-1:]]
Each of these is a slice of word that starts from a different position.
Now, if you apply the unpacking * operator to it:
*[word[0:], word[1:], word[2:], ... word[n-1:]]
You get each of the slices word[0:], word[1:], etc. passed to zip() as a separate argument.
So, zip is getting called like this:
zip(word[0:], word[1:], word[2:], ... word[n-1:])
Which, according to how zip works, creates n-tuples, with each entry coming from the corresponding argument:
[(word[0:][0], word[1:][0], ...),
 (word[0:][1], word[1:][1], ...),
 ...]
If you map the indexes, you'll see that these values correspond to the n-grams of word.
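The index mapping can be checked directly. The sketch below is an illustration rather than part of the original functions: the helper names ngrams_by_zip and ngrams_by_slicing are made up here. It compares the zip-based construction with plain slicing, where the k-th n-gram is word[k:k + n].

```python
def ngrams_by_zip(word, n):
    # word[i:][k] is word[i + k], so the k-th tuple holds word[k], ..., word[k + n - 1].
    return [''.join(t) for t in zip(*[word[i:] for i in range(n)])]

def ngrams_by_slicing(word, n):
    # The k-th n-gram is simply the n characters starting at position k.
    return [word[k:k + n] for k in range(len(word) - n + 1)]

word = "ababa"
for n in (2, 3, 4):
    print(n, ngrams_by_zip(word, n), ngrams_by_slicing(word, n))
```

Both functions keep duplicates and the original order; the ngrams function in the question additionally passes the result through set() and sorted().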

Related

How to get elements out from list of lists when having the same position

I have a list of lists, and I want to extract the element at the same position from each inner list. How do I do so? For example, I have:
L = [[A,B,C,D][B,C,D,E][C,D,E,F]]
Now I want all the letters from position 0 which would give me:
A, B, C -> L[0][0], L[1][0], L[2][0]
I tried to use:
[row[0] for row in L]
and
L[:-1][0]
But neither of them works for me.

This is happening because of the way you wrote your list.
[[A,B,C,D][B,C,D,E][C,D,E,F]]
You have to separate the inner lists (i.e. you forgot the commas between them). Change your list to something like this:
[[A,B,C,D],[B,C,D,E],[C,D,E,F]]
Also, when testing this, it doesn't work because the elements are not in quotation marks, but I'm guessing there's a reason for that.
Hope I could help :3

You are very close. Try this,
[v[0] for i, v in enumerate(L)]

This should work; I don't understand why it doesn't work for you:
[row[0] for row in L]
output:
['A','B', 'C']

Transpose your list with zip.
>>> L = [['A','B','C','D'],['B','C','D','E'],['C','D','E','F']]
>>> t = zip(*L) # list(zip(*L)) in Python 3
t[position] will give you all the elements for a specific position.
>>> t[0]
('A', 'B', 'C')
>>> t[1]
('B', 'C', 'D')
>>> t[2]
('C', 'D', 'E')
>>> t[3]
('D', 'E', 'F')
By the way, your attempted solution should have worked.
>>> [row[0] for row in L]
['A', 'B', 'C']
If you only care about one specific index, this is perfectly fine. If you want the information for all indices, transposing the whole list of lists with zip is the way to go.
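For Python 3, where zip() returns an iterator that cannot be indexed, a minimal version of the transposition could look like this (same L as in the session above):

```python
L = [['A', 'B', 'C', 'D'], ['B', 'C', 'D', 'E'], ['C', 'D', 'E', 'F']]

# Wrap zip() in list() so the transposed rows can be indexed.
t = list(zip(*L))

print(t[0])                   # ('A', 'B', 'C')
print(t[3])                   # ('D', 'E', 'F')

# For a single position, a comprehension avoids building the whole transpose.
print([row[0] for row in L])  # ['A', 'B', 'C']
```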

Pythonic way to convert two adjacent list elements to a tuple and preserve rest of list

I'm looking for an elegant way to convert
lst = [A, B, C, D, E..]
to
lst = [A, B, (C, D), E]
That is, I want to do this at indices 2 and 3 while preserving the rest of the list. Is there an elegant way to do this? I tried to find a solution with a lambda function but could not come up with one.

Just alter in-place:
lst[2:4] = [tuple(lst[2:4])]
The slice assignment ensures we are replacing the old elements with the contents of the list on the right-hand side of the assignment, which contains just the one tuple.
Demo:
>>> lst = ['A', 'B', 'C', 'D', 'E']
>>> lst[2:4] = [tuple(lst[2:4])]
>>> lst
['A', 'B', ('C', 'D'), 'E']

You could use:
lst[2] = lst[2], lst.pop(3)
or more generally:
lst[i] = lst[i], lst.pop(i+1)
However, you must ensure that both indices are valid in order to avoid IndexError exceptions.
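One way to make the index check explicit is to wrap the slice assignment in a small helper. This is only a sketch: merge_pair is a hypothetical name, and raising IndexError for an out-of-range position is a design choice made here, not something the answers above prescribe.

```python
def merge_pair(lst, i):
    """Replace lst[i] and lst[i + 1] with a single tuple, in place."""
    # Both i and i + 1 must be valid positions.
    if not 0 <= i < len(lst) - 1:
        raise IndexError("need valid positions i and i + 1")
    # Slice assignment swaps two elements for a one-element list.
    lst[i:i + 2] = [tuple(lst[i:i + 2])]
    return lst

lst = ['A', 'B', 'C', 'D', 'E']
print(merge_pair(lst, 2))  # ['A', 'B', ('C', 'D'), 'E']
```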

group the string elements with size two

myl = ['a','b','c','d','e']
I want to group the string elements with size two as shown below:
[['a','b'],['c','d'],['e']]
I tried as follows:
print [myl[i:i+2] for i in len(myl)]
What is wrong with it?

for i in len(myl) doesn't make any sense. len(myl) is just a number, so you can't iterate through it.
You can use:
[myl[i:i+2] for i in range(0,len(myl),2)]
Here range(0,len(myl),2) means "every other number in the given range", so i takes the values 0, 2, 4 and you get myl[0:2], myl[2:4], and myl[4:6].
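The same idea works for any chunk size. A minimal sketch (the function name chunks is illustrative):

```python
def chunks(seq, size):
    # Start indices are 0, size, 2*size, ...
    # Slicing past the end is safe: the last chunk is simply shorter.
    return [seq[i:i + size] for i in range(0, len(seq), size)]

myl = ['a', 'b', 'c', 'd', 'e']
print(chunks(myl, 2))  # [['a', 'b'], ['c', 'd'], ['e']]
print(chunks(myl, 3))  # [['a', 'b', 'c'], ['d', 'e']]
```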

You can use this:
>>> myl = ['a','b','c','d','e']
>>> [myl[2*i:2*i+2] for i in range((len(myl) +1)//2)]
[['a', 'b'], ['c', 'd'], ['e']]
>>> myl = ['a','b','c','d','e','f']
>>> [myl[2*i:2*i+2] for i in range((len(myl) +1)//2)]
[['a', 'b'], ['c', 'd'], ['e', 'f']]
>>> myl = ['a','b','c','d','e', 'f', 'g']
>>> [myl[2*i:2*i+2] for i in range((len(myl) +1)//2)]
[['a', 'b'], ['c', 'd'], ['e', 'f'], ['g']]

This is close to what you want:
from itertools import zip_longest  # izip_longest in Python 2
myl = ['a','b','c','d','e']
print(list(zip_longest(myl[::2], myl[1::2])))
result:
[('a', 'b'), ('c', 'd'), ('e', None)]
Here is another solution:
myl = ['a','b','c','d','e']
l = list(zip(myl[::2], myl[1::2]))
if len(myl) % 2:  # odd length: the last element has no partner
    l.append((myl[-1],))
print(l)
result:
[('a', 'b'), ('c', 'd'), ('e',)]
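The two ideas can be combined. The sketch below assumes Python 3 (itertools.zip_longest) and uses a private sentinel as the fill value, so that a genuine None in the data is not mistaken for padding; the names pairs and _MISSING are made up for this example.

```python
from itertools import zip_longest

_MISSING = object()  # padding marker that cannot collide with real data

def pairs(seq):
    # Pair even-indexed items with odd-indexed ones, padding an odd-length input.
    padded = zip_longest(seq[::2], seq[1::2], fillvalue=_MISSING)
    # Drop the padding so a trailing single item becomes a 1-tuple.
    return [tuple(x for x in p if x is not _MISSING) for p in padded]

print(pairs(['a', 'b', 'c', 'd', 'e']))  # [('a', 'b'), ('c', 'd'), ('e',)]
print(pairs(['a', 'b', 'c', 'd']))       # [('a', 'b'), ('c', 'd')]
```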

I found an improved version of the function below:
def chunker(size, seq):
    return [seq[n-size:n] for n in range(size, len(seq)+size, size)]
The tricky part of this is setting up the range arguments to start at the "size" offset and step across the sequence in size steps. It relies on a quirk of Python slices, which allow an index larger than the end of the sequence (though some calls to max() would get around that if it were necessary). When I've tried to write this with a range starting at zero and reversing the index (looking "forward" rather than backwards from the base index), I've found all sorts of boundary errors that this "backwards" version doesn't exhibit. Probably I'm being stupid with that.
Here's the old, verbose version (but suitable for any iterable, even those which can't take slices):
def chunker(stride, seq):
    '''Return a list of lists by breaking seq into chunks of length stride
    '''
    results = list()
    part = list()
    for each in seq:
        if len(part) == stride:
            results.append(part)
            part = list()
        part.append(each)
    if part:
        results.append(part)  # Keep any trailing partial chunk
        # but don't append an empty part if the seq ended on an even
        # stride
    return results
I realize this may seem awfully verbose. But it is robust and will work with any iterable sequence and it is generalized for any positive integer as a stride.
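For comparison, here is a shorter sketch that also accepts any iterable. It swaps in a different technique, itertools.islice inside a generator, rather than reproducing the loop above; the name iter_chunks is hypothetical and stride is assumed to be a positive integer.

```python
from itertools import islice

def iter_chunks(iterable, stride):
    """Yield lists of up to `stride` items from any iterable."""
    it = iter(iterable)
    while True:
        # islice pulls at most `stride` items from the shared iterator.
        part = list(islice(it, stride))
        if not part:
            return  # iterator exhausted: no empty trailing chunk
        yield part

print(list(iter_chunks(range(7), 3)))               # [[0, 1, 2], [3, 4, 5], [6]]
print(list(iter_chunks((c for c in "abcde"), 2)))   # [['a', 'b'], ['c', 'd'], ['e']]
```

Because it is a generator, chunks are produced one at a time instead of being collected into a list first.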

Python list comparison issues

I need to write a program in Python that compares two parallel lists to grade a multiple choice exam. One list has the exam solution and the second list has a student's answers. The question number for each missed question is to be stored in a third list using the natural index numbers. The solution must use indexing.
I keep getting an empty list returned for the third list. All help much appreciated!
def main():
    exam_solution = ['B', 'D', 'A', 'A', 'C', 'A', 'B', 'A', 'C', 'D', 'B', 'C',\
                     'D', 'A', 'D', 'C', 'C', 'B', 'D', 'A']
    student_answers = ['B', 'D', 'B', 'A', 'C', 'A', 'A', 'A', 'C', 'D', 'B', 'C',\
                       'D', 'B', 'D', 'C', 'C', 'B', 'D', 'A']
    questions_missed = []
    for item in exam_solution:
        if item not in student_answers:
            questions_missed.append(item)

questions_missed = [i for i, (ex,st) in enumerate(zip(exam_solution, student_answers)) if ex != st]
or alternatively, if you prefer loops over list comprehensions:
questions_missed = []
for i, (ex,st) in enumerate(zip(exam_solution, student_answers)):
    if ex != st:
        questions_missed.append(i)
Both give [2,6,13]
Explanation:
enumerate is a utility function that returns an iterable yielding (index, value) tuples; loosely speaking, it lets you "have the current index available during an iteration".
zip creates a list of tuples (an iterator of tuples in Python 3) containing corresponding elements from two or more iterable objects (in your case, lists).
I'd prefer the list comprehension version.
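To make the intermediate values visible, here is a small sketch on shortened, made-up lists (Python 3, so zip and enumerate are wrapped in list() for printing). The last line assumes that 1-based question numbers are wanted, which the question does not state outright; enumerate's start parameter covers that case.

```python
solution = ['B', 'D', 'A']
answers = ['B', 'C', 'A']

# zip pairs up corresponding elements.
pairs = list(zip(solution, answers))
print(pairs)    # [('B', 'B'), ('D', 'C'), ('A', 'A')]

# enumerate attaches the index to each pair.
indexed = list(enumerate(pairs))
print(indexed)  # [(0, ('B', 'B')), (1, ('D', 'C')), (2, ('A', 'A'))]

# Keep the indices where the two values differ.
missed = [i for i, (ex, st) in indexed if ex != st]
print(missed)   # [1]

# Assumption: 1-based question numbers instead of 0-based indices.
question_numbers = [q for q, (ex, st) in enumerate(pairs, start=1) if ex != st]
print(question_numbers)  # [2]
```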
If I add some timing code, I see that performance doesn't really differ here:
def list_comprehension_version():
    questions_missed = [i for i, (ex,st) in enumerate(zip(exam_solution, student_answers)) if ex != st]
    return questions_missed

def loop_version():
    questions_missed = []
    for i, (ex,st) in enumerate(zip(exam_solution, student_answers)):
        if ex != st:
            questions_missed.append(i)
    return questions_missed

import timeit
print "list comprehension:", timeit.timeit("list_comprehension_version", "from __main__ import exam_solution, student_answers, list_comprehension_version", number=10000000)
print "loop:", timeit.timeit("loop_version", "from __main__ import exam_solution, student_answers, loop_version", number=10000000)
gives:
list comprehension: 0.895029446804
loop: 0.877159359719
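Note that the statement strings passed to timeit above name the functions without calling them, so those figures mostly reflect a name lookup. A possible Python 3 variant, offered as a sketch, passes the function objects themselves, which timeit.timeit accepts and calls on every repetition. The repetition count of 10000 is an arbitrary choice, and no timings are quoted here because they depend on the machine.

```python
import timeit

exam_solution = ['B', 'D', 'A', 'A', 'C', 'A', 'B', 'A', 'C', 'D', 'B', 'C',
                 'D', 'A', 'D', 'C', 'C', 'B', 'D', 'A']
student_answers = ['B', 'D', 'B', 'A', 'C', 'A', 'A', 'A', 'C', 'D', 'B', 'C',
                   'D', 'B', 'D', 'C', 'C', 'B', 'D', 'A']

def list_comprehension_version():
    return [i for i, (ex, st) in enumerate(zip(exam_solution, student_answers))
            if ex != st]

def loop_version():
    questions_missed = []
    for i, (ex, st) in enumerate(zip(exam_solution, student_answers)):
        if ex != st:
            questions_missed.append(i)
    return questions_missed

# Passing a callable makes timeit invoke it once per repetition.
for func in (list_comprehension_version, loop_version):
    seconds = timeit.timeit(func, number=10000)
    print("{}: {:.4f} s".format(func.__name__, seconds))
```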

A solution based on iterators
questions_missed = list(index for (index, _)
                        in filter(
                            lambda (_, (answer, solution)): answer != solution,
                            enumerate(zip(student_answers, exam_solution))))
For the purists, note that in Python 2 you should import the lazy equivalents of zip and filter (izip and ifilter) from itertools.
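The lambda above unpacks a tuple in its parameter list, which only Python 2 allows. A Python 3 sketch of the same filter-based approach, shown on shortened lists, indexes into the item instead; zip and filter are already lazy there, so no itertools imports are needed.

```python
exam_solution = ['B', 'D', 'A', 'A']
student_answers = ['B', 'D', 'B', 'A']

# Each item has the shape (index, (answer, solution)).
differing = filter(lambda item: item[1][0] != item[1][1],
                   enumerate(zip(student_answers, exam_solution)))

# Keep only the index of each differing pair.
questions_missed = [index for index, _ in differing]
print(questions_missed)  # [2]
```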

One more solution comes to my mind. I put it in a separate answer as it is "special".
Using numpy this task can be accomplished by:
import numpy as np
exam_solution = np.array(exam_solution)
student_answers = np.array(student_answers)
(exam_solution!=student_answers).nonzero()[0]
With numpy-arrays, elementwise comparison is possible via == and !=. .nonzero() returns the indices of the array elements that are not zero. That's it.
Timing is really interesting now. For your 19-element lists, the timings are (N=19, repetitions=100,000):
list comprehension: 0.904024521544
loop: 0.936516107421
numpy: 0.349371968612
This is already a factor of almost 3. Nice, but not amazing.
But when I increase the size of your lists by a factor of 100, I get (N=19*100=1900, repetitions=1000):
list comprehension: 0.866544042939
loop: 1.04464069977
numpy: 0.0334220694495
Now we have a factor of 26 or 31 - that is definitely a lot.
Probably, performance won't be your problem, but, nevertheless, I thought it's worth pointing out.

Returning lists of tuple's keys and values

I understand that the zip() function is used to construct a list of tuples like this:
x = ['a', 'b', 'c']
y = ['x', 'y', 'z', 'l']
lstTupA = zip(x,y)
lstTupA would be [('a', 'x'), ('b', 'y'), ('c', 'z')].
lstA, lstB = zip(*lstTupA)
The above operation extracts the keys in the list of tuples to lstA and values in the list of tuples to lstB.
lstA was ('a', 'b', 'c') and lstB was ('x', 'y', 'z').
My query is this: Why are lstA and lstB tuples instead of lists? a, b and c are homogeneous and so are x, y and z. It's not logical to group them as tuples, is it?
Ideally lstA, lstB = zip(*lstTupA) should have assigned ['a', 'b', 'c'] to lstA and ['x', 'y', 'z'] to lstB (lists) right?
Someone please clarify!
Thanks.

"It's not logical to group them as tuples, is it?"
Yes. It is logical.
There are two kinds of built-in sequences. Lists and tuples.
The zip() function takes n arguments, which fixes the cardinality of each resulting tuple at n.
A list would only be appropriate if other arguments were somehow, magically, appended or not appended to the resulting sequence. This would mean sequences of variable length, not defined by the number of arguments to zip(). That would be a rather complex structure to build with a single function call.

zip is simply defined to behave this way:
In [2]: help(zip)
Help on built-in function zip in module __builtin__:

zip(...)
    zip(seq1 [, seq2 [...]]) -> [(seq1[0], seq2[0] ...), (...)]

    --> Return a list of tuples <--, where each tuple contains the i-th element
    from each of the argument sequences. The returned list is truncated
    in length to the length of the shortest argument sequence.

What *lstTupA does in lstA, lstB = zip(*lstTupA) (or, more generally, what the * operator does) is unpack an iterable into separate arguments. So zip(*lstTupA) is equivalent to zip(lstTupA[0], lstTupA[1], ...). zip always produces tuples, whatever the type of its arguments, and that is why lstA and lstB are tuples.

zip doesn't know what is on the left-hand side of the equals sign. As far as it knows, lstTupA = zip(x,y) and lstA, lstB = zip(*lstTupA) are the same thing. zip is defined to do one thing, and it is consistent in doing that one thing. You have decided to break apart the list of tuples in the second statement, so you are the one adding extra context to the second statement.

Ideally lstA, lstB = zip(*lstTupA) should have assigned ['a', 'b', 'c'] to lstA and ['x', 'y', 'z'] to lstB (lists) right?
No, that is not right. Remember that zip returns a list of tuples; that's exactly the way you expect it to behave when you say
lstTupA would be [('a', 'x'), ('b', 'y'), ('c', 'z')].
So, why would it return something different in the case of zip(*lstTupA)? It would still return the list of tuples, in this case [('a', 'b', 'c'), ('x', 'y', 'z')]. By performing assignment to lstA and lstB, you simply extract the tuples from the list.
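The round trip can be checked with a short sketch (Python 3, hence the list() around the first zip call; listA and listB are names introduced here). It also shows one way to get lists back when mutability matters.

```python
x = ['a', 'b', 'c']
y = ['x', 'y', 'z', 'l']

# zip stops at the shorter input, so 'l' is dropped.
lstTupA = list(zip(x, y))
print(lstTupA)     # [('a', 'x'), ('b', 'y'), ('c', 'z')]

# Unzipping yields tuples, exactly like the first call did.
lstA, lstB = zip(*lstTupA)
print(lstA, lstB)  # ('a', 'b', 'c') ('x', 'y', 'z')

# Convert explicitly when mutable lists are needed.
listA, listB = map(list, zip(*lstTupA))
print(listA, listB)  # ['a', 'b', 'c'] ['x', 'y', 'z']
```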

Yes you have to do something stupid like
[list(t) for t in zip(*lst)]
Just to get lists.
What the 'pythonistas' rushing to defend the braindead choice of lists of tuples fail to remember is that tuples cannot be assigned to. Which makes zip(*m) useless for matrices or anything else where you want to alter items later.
