group the string elements with size two - python

myl = ['a','b','c','d','e']
I want to group the string elements with size two as shown below:
[['a','b'],['c','d'],['e']]
I tried as follows:
print [myl[i:i+2] for i in len(myl)]
What is wrong with it?

for i in len(myl) doesn't work: len(myl) is just an integer, and an integer is not iterable, so Python raises a TypeError.
You can use:
[myl[i:i+2] for i in range(0,len(myl),2)]
Here range(0,len(myl),2) means "every other number in the given range", so i takes the values 0, 2, 4 and you get myl[0:2], myl[2:4], and myl[4:6].
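The same idea can be wrapped in a small helper that accepts any chunk size. This is only a sketch in Python 3 syntax; the name chunks and the size check are assumptions added for illustration, not part of the answer above.

```python
def chunks(seq, size):
    """Split a sliceable sequence into consecutive pieces of at most `size` items."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    # Slicing past the end of the sequence is safe; the last piece is just shorter.
    return [seq[i:i + size] for i in range(0, len(seq), size)]


myl = ['a', 'b', 'c', 'd', 'e']
print(chunks(myl, 2))  # [['a', 'b'], ['c', 'd'], ['e']]
```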

You can use this:
>>> myl = ['a','b','c','d','e']
>>> [myl[2*i:2*i+2] for i in range((len(myl) +1)//2)]
[['a', 'b'], ['c', 'd'], ['e']]
>>> myl = ['a','b','c','d','e','f']
>>> [myl[2*i:2*i+2] for i in range((len(myl) +1)//2)]
[['a', 'b'], ['c', 'd'], ['e', 'f']]
>>> myl = ['a','b','c','d','e', 'f', 'g']
>>> [myl[2*i:2*i+2] for i in range((len(myl) +1)//2)]
[['a', 'b'], ['c', 'd'], ['e', 'f'], ['g']]

This is close to what you want, except that the last group is padded with None:
from itertools import izip_longest
myl = ['a','b','c','d','e']
print list(izip_longest(myl[::2], myl[1::2]))
result:
[('a', 'b'), ('c', 'd'), ('e', None)]
Here is another solution:
myl = ['a','b','c','d','e']
l = list(zip(myl[::2], myl[1::2]))
if len(myl) % 2:  # odd length: the last element has no partner
    l.append((myl[-1],))
print(l)
result:
[('a', 'b'), ('c', 'd'), ('e',)]
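In Python 3, izip_longest is named zip_longest. A possible way to combine the two approaches above is to pad with a private sentinel and then strip it, so that the last tuple is simply shorter. This is a sketch; the names pairs and _PAD are made up for illustration.

```python
from itertools import zip_longest

_PAD = object()  # hypothetical sentinel, distinct from any real element


def pairs(seq):
    """Group seq into tuples of two; a trailing odd element becomes a 1-tuple."""
    padded = zip_longest(seq[::2], seq[1::2], fillvalue=_PAD)
    # Drop the padding so the last tuple is shorter instead of holding a filler.
    return [tuple(x for x in pair if x is not _PAD) for pair in padded]


print(pairs(['a', 'b', 'c', 'd', 'e']))  # [('a', 'b'), ('c', 'd'), ('e',)]
```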

I found an improved version of the function below:
def chunker(size, seq):
    return [seq[n-size:n] for n in range(size, len(seq)+size, size)]
The tricky part of this is setting up the range arguments to start at the "size" offset and step across the sequence in size steps. It relies on the quirk of Python slices that allows an index larger than the end of the sequence (though some calls to max() would get around that if it were necessary). When I tried to write this with a range starting at zero and reversing the index (looking "forward" rather than backwards from the base index), I found all sorts of boundary errors that this "backwards" version doesn't exhibit. Probably I'm being stupid with that.
Here's the old, verbose version (but suitable for any iterable, even those which can't take slices):
def chunker(stride, seq):
    '''Return a list of lists by breaking seq into chunks of length stride
    '''
    results = list()
    part = list()
    for each in seq:
        if len(part) == stride:
            results.append(part)
            part = list()
        part.append(each)
    if part:
        results.append(part)  # Keep any trailing partial chunk
        # but don't append an empty part if the seq ended on an even
        # stride
    return results
I realize this may seem awfully verbose. But it is robust and will work with any iterable sequence and it is generalized for any positive integer as a stride.
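For iterables that cannot be sliced, a more compact alternative is to pull items with itertools.islice and yield the chunks lazily. This is a sketch, not the author's code; the name iter_chunks and the size check are assumptions.

```python
from itertools import islice


def iter_chunks(iterable, size):
    """Lazily yield lists of up to `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    it = iter(iterable)
    while True:
        part = list(islice(it, size))  # take the next `size` items, or fewer at the end
        if not part:
            return  # iterator exhausted; never yield an empty chunk
        yield part


print(list(iter_chunks(iter('abcde'), 2)))  # [['a', 'b'], ['c', 'd'], ['e']]
```

Because it is a generator, the chunks are produced one at a time, which may matter for long or unbounded inputs.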


How does zip(*) generate n-grams?

I am reviewing some notes on n-grams, and I came across a couple of interesting functions. First there's this one to generate bigrams:
def bigrams(word):
    return sorted(list(set(''.join(bigram)
                           for bigram in zip(word, word[1:]))))
def bigram_print(word):
    print("The bigrams of", word, "are:")
    print(bigrams(word))
bigram_print("ababa")
bigram_print("babab")
After doing some reading and playing on my own with Python, I understand why this works. However, when looking at the function below, I am very puzzled by the use of zip(*[word[i:] for i in range(n)]). I understand that the * is an unpacking operator (as explained here), but I really am getting tripped up by how it works in combination with the list comprehension. Can anyone explain?
def ngrams(word, n):
    return sorted(list(set(''.join(ngram)
                           for ngram in zip(*[word[i:]
                                              for i in range(n)]))))
def ngram_print(word, n):
    print("The {}-grams of {} are:".format(n, word))
    print(ngrams(word, n))
for n in [2, 3, 4]:
    ngram_print("ababa", n)
    ngram_print("babab", n)
    print()
The following example should explain how this works. I have added code and a visual representation of it.
Intuition
The core idea is to zip together multiple versions of the same list where each of them starts from the next subsequent element.
Let's say L is a list of words/elements ['A', 'B', 'C', 'D'].
Then what's happening here is that L, L[1:], and L[2:] get zipped, which means the first elements of each of these (which are the 1st, 2nd, and 3rd elements of L) get grouped together, the second elements get grouped together, and so on.
Visually this can be shown as:
The statement we are worried about:
zip   (   *    [L[i:] for i in range(n)])
#|___||_______||________________________|
#  |      |                |
# zip  unpack  versions of L with subsequent 0 to n elements skipped
Code example
l = ['A','B','C','D']
print('original list: '.ljust(27),l)
print('list skipping 1st element: ',l[1:])
print('list skipping 2 elements: '.ljust(27),l[2:])
print('bi-gram: '.ljust(27), list(zip(l,l[1:])))
print('tri-gram: '.ljust(27), list(zip(l,l[1:],l[2:])))
original list: ['A', 'B', 'C', 'D']
list skipping 1st element: ['B', 'C', 'D']
list skipping 2 elements: ['C', 'D']
bi-gram: [('A', 'B'), ('B', 'C'), ('C', 'D')]
tri-gram: [('A', 'B', 'C'), ('B', 'C', 'D')]
As you can see, you are basically zipping the same list but with one skipped. This zips (A, B) and (B, C) ... together for bigrams.
The * operator is for unpacking. When you change the i value to skip elements, you are basically zipping a list of [l[0:], l[1:], l[2:]...]. This is passed to the zip() and unpacked inside it with *.
zip(*[word[i:] for i in range(n)])  # where word is the list of words
Alternate to list comprehension
The above list comprehension is equivalent to -
n = 3
lists = []
for i in range(n):
    print(l[i:])  # comment this out if not needed
    lists.append(l[i:])
out = list(zip(*lists))
print(out)
['A', 'B', 'C', 'D']
['B', 'C', 'D']
['C', 'D']
[('A', 'B', 'C'), ('B', 'C', 'D')]
If you break down
zip(*[word[i:] for i in range(n)])
You get:
[word[i:] for i in range(n)]
Which is equivalent to:
[word[0:], word[1:], word[2:], ... word[n-1:]]
Which are each strings that start from different positions in word
Now, if you apply the unpacking * operator to it:
*[word[0:], word[1:], word[2:], ... word[n-1:]]
You get each of the lists word[0:], word[1:] etc passed to zip()
So, zip is getting called like this:
zip(word[0:], word[1:], word[2:], ... word[n-1:])
Which - according to how zip works - would create n-tuples, with each entry coming from one of the corresponding arguments:
[(word[0:][0], word[1:][0], ...),
 (word[0:][1], word[1:][1], ...),
 ...]
If you map the indexes, you'll see that these values correspond to the n-gram definitions for word
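The breakdown above can be followed step by step on a concrete word. This is a small illustrative trace using "ababa" and n = 3; the variable names shifted and tuples are made up for the example.

```python
word = "ababa"
n = 3

# Step 1: build the shifted copies of the word.
shifted = [word[i:] for i in range(n)]
print(shifted)  # ['ababa', 'baba', 'aba']

# Step 2: zip(*shifted) is the same call as zip(shifted[0], shifted[1], shifted[2]).
# zip stops at the shortest argument, so only complete n-grams are produced.
tuples = list(zip(*shifted))
print(tuples)  # [('a', 'b', 'a'), ('b', 'a', 'b'), ('a', 'b', 'a')]

# Step 3: join each tuple and drop duplicates, as ngrams() does.
print(sorted(set(''.join(t) for t in tuples)))  # ['aba', 'bab']
```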

Python indirect list indexing [duplicate]

In Python I have a list of elements aList and a list of indices myIndices. Is there any way I can retrieve all at once those items in aList having as indices the values in myIndices?
Example:
>>> aList = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
>>> myIndices = [0, 3, 4]
>>> aList.A_FUNCTION(myIndices)
['a', 'd', 'e']
I don't know any method to do it. But you could use a list comprehension:
>>> [aList[i] for i in myIndices]
Definitely use a list comprehension, but here is a function that does it (no list method does this). This is admittedly a poor use of itemgetter; I am posting it only for the sake of completeness.
>>> from operator import itemgetter
>>> a_list = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
>>> my_indices = [0, 3, 4]
>>> itemgetter(*my_indices)(a_list)
('a', 'd', 'e')
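One reason itemgetter is awkward here is that its return type depends on the number of indices: with several indices it returns a tuple, with a single index it returns the bare item, and with no indices it raises a TypeError. A sketch of a wrapper that smooths this over, where pick is a hypothetical helper name:

```python
from operator import itemgetter


def pick(seq, indices):
    """Return the items of seq at the given indices, always as a list."""
    indices = list(indices)
    if not indices:
        return []  # itemgetter() requires at least one argument
    if len(indices) == 1:
        return [seq[indices[0]]]  # itemgetter with one index returns a bare item
    return list(itemgetter(*indices)(seq))


a_list = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
print(pick(a_list, [0, 3, 4]))  # ['a', 'd', 'e']
print(pick(a_list, [2]))        # ['c']
```

The list comprehension needs none of these special cases, which is part of why it is the better choice.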
Indexing by lists can be done in numpy. Convert your base list to a numpy array and then apply another list as an index:
>>> from numpy import array
>>> array(aList)[myIndices]
array(['a', 'd', 'e'],
dtype='|S1')
If you need, convert back to a list at the end:
>>> from numpy import array
>>> a = array(aList)[myIndices]
>>> list(a)
['a', 'd', 'e']
In some cases this solution can be more convenient than list comprehension.
You could use map
map(aList.__getitem__, myIndices)
or operator.itemgetter
import operator
f = operator.itemgetter(*myIndices)
f(aList)
If you do not require a list with simultaneous access to all elements, but just wish to use the items in the sub-list iteratively (or pass them to something that will), it's more efficient to use a generator expression rather than a list comprehension:
(aList[i] for i in myIndices)
Alternatively, you could go with functional approach using map and a lambda function.
>>> list(map(lambda i: aList[i], myIndices))
['a', 'd', 'e']
I wasn't happy with these solutions, so I created a Flexlist class that simply extends the list class, and allows for flexible indexing by integer, slice or index-list:
class Flexlist(list):
    def __getitem__(self, keys):
        if isinstance(keys, (int, slice)): return list.__getitem__(self, keys)
        return [self[k] for k in keys]
Then, for your example, you could use it with:
aList = Flexlist(['a', 'b', 'c', 'd', 'e', 'f', 'g'])
myIndices = [0, 3, 4]
vals = aList[myIndices]
print(vals) # ['a', 'd', 'e']

Alternative to using the sort function when adding to a list?

I want to insert a word alphabetically into a list. Originally I would append the word I'm adding to the end of the list and then sort the list, but I am not allowed to use the sort() function.
Is there a way to do this through a function?
Based on @SheshankS.'s answer, here is a function that does this for you:
def insert(item, _list):
    for index, element in enumerate(_list):
        if item < element:  # in Python, < on strings compares them lexicographically
            _list.insert(index, item)
            return  # exit the function since we already inserted
    # if the item was not inserted, it sorts after every element, so just append it
    _list.append(item)
Note that since lists are mutable, this will actually mutate the given instance.
So, this:
someList = ["a", "b", "d"]
insert("c", someList)
Will actually change someList instead of just returning the new value.
Try doing this:
array = ["asdf", "bsdf", "kkkk", "zssdd"]
insertion_string = "zzat"
for i, element in enumerate(array):
    if insertion_string < element:
        array.insert(i, insertion_string)
        break
else:
    # no larger element was found, so the string belongs at the end
    array.append(insertion_string)
print(array)
Repl.it = https://repl.it/repls/VitalAvariciousCodec
You did not say if you are allowed to use third-party modules, and you did not say if speed is a factor. If you want to add a new item to your sorted list quickly and you are allowed to use a module, use the SortedList class from sortedcontainers. This is a module included in many distributions of Python, such as Anaconda.
This will be simple and fast, even for large lists.
from sortedcontainers import SortedList
someList = SortedList(["a", "b", "d"])
someList.add("c")
print(someList)
The printout from that is
SortedList(['a', 'b', 'c', 'd'])
>>> import bisect
>>> someList = ["a", "b", "d"]
>>> bisect.insort(someList,'c')
>>> someList
['a', 'b', 'c', 'd']
If standard lib is allowed you can use bisect:
>>> import bisect
>>> lst = list('abcefg')
>>> for x in 'Adh':
... lst.insert(bisect.bisect(lst, x), x)
... print(lst)
...
['A', 'a', 'b', 'c', 'e', 'f', 'g']
['A', 'a', 'b', 'c', 'd', 'e', 'f', 'g']
['A', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
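If no library helper is allowed at all, the binary search that bisect performs can be written by hand. The following is a sketch that assumes the list is already sorted; insert_sorted is a hypothetical name.

```python
def insert_sorted(items, new_item):
    """Insert new_item into the already-sorted list `items`, keeping it sorted."""
    lo, hi = 0, len(items)
    # Binary search for the first position whose element is greater than new_item.
    while lo < hi:
        mid = (lo + hi) // 2
        if new_item < items[mid]:
            hi = mid
        else:
            lo = mid + 1
    items.insert(lo, new_item)


words = ["a", "b", "d"]
insert_sorted(words, "c")
print(words)  # ['a', 'b', 'c', 'd']
```

The search takes O(log n) comparisons, although list.insert itself still shifts elements and is O(n).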

Combining two lists ( so [['a', 'b'], ['c', 'd']] = ['ac', 'ad', 'bc', 'bd'] ) the pythonic way [duplicate]

This question already has answers here:
How to get the cartesian product of multiple lists
Given a list of lists such as:
[['a', 'b'], ['c', 'd'], ['e']]
Result should be:
['ace', 'ade', 'bce', 'bde']
The nested lists will be different lengths. Order must be maintained -- i.e. first letter must come from first list, second letter from second list, etc.
Here is my current, recursive solution:
def combine_letters(l):
    if len(l) == 1:
        return l[0]
    temp = [x + y for x in l[0] for y in l[1]]
    new = [temp] + l[2:]
    return combine_letters(new)
However, I feel like there should be a quick, maybe even one-line, way to do this, possibly using the reduce function. Any thoughts?
Thank you!
Edit: this is not exactly analogous to the linked question. First, it is for an arbitrarily large number of sublists. Second, it returns strings rather than tuples.
>>> L = [['a', 'b'], ['c', 'd'], ['e']]
>>> import itertools
>>> list(itertools.product(*L))
[('a', 'c', 'e'), ('a', 'd', 'e'), ('b', 'c', 'e'), ('b', 'd', 'e')]
>>> list(''.join(tup) for tup in list(itertools.product(*L)))
['ace', 'ade', 'bce', 'bde']
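Since the question asks about reduce, here is a sketch of how the recursive solution could be folded into one expression with functools.reduce. The initial value [''] is an assumption added so that an empty input does not raise.

```python
from functools import reduce


def combine_letters(lists):
    """Concatenate one string from each sublist, in order, in every possible way."""
    # Start from [''] so that an empty input yields [''] instead of raising.
    return reduce(lambda acc, nxt: [x + y for x in acc for y in nxt], lists, [''])


L = [['a', 'b'], ['c', 'd'], ['e']]
print(combine_letters(L))  # ['ace', 'ade', 'bce', 'bde']
```

The itertools.product version is probably still the more readable choice; this only shows that the reduce formulation is possible.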

Python list comparison issues

I need to write a program in Python that compares two parallel lists to grade a multiple choice exam. One list has the exam solution and the second list has a student's answers. The question number for each missed question is to be stored in a third list using the natural index numbers. The solution must use indexing.
I keep getting an empty list returned for the third list. All help much appreciated!
def main():
    exam_solution = ['B', 'D', 'A', 'A', 'C', 'A', 'B', 'A', 'C', 'D', 'B', 'C',\
                     'D', 'A', 'D', 'C', 'C', 'B', 'D', 'A']
    student_answers = ['B', 'D', 'B', 'A', 'C', 'A', 'A', 'A', 'C', 'D', 'B', 'C',\
                       'D', 'B', 'D', 'C', 'C', 'B', 'D', 'A']
    questions_missed = []
    for item in exam_solution:
        if item not in student_answers:
            questions_missed.append(item)
questions_missed = [i for i, (ex,st) in enumerate(zip(exam_solution, student_answers)) if ex != st]
or alternatively, if you prefer loops over list comprehensions:
questions_missed = []
for i, (ex,st) in enumerate(zip(exam_solution, student_answers)):
    if ex != st:
        questions_missed.append(i)
Both give [2,6,13]
Explanation:
enumerate is a utility function that returns an iterable object which yields tuples of indices and values. Loosely speaking, it lets you "have the current index available during an iteration".
zip pairs up corresponding elements from two or more iterable objects (in your case, lists). In Python 2 it returns a list of tuples; in Python 3 it returns an iterator over them.
I'd prefer the list comprehension version.
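Since the question requires indexing, an explicit index-based loop is another option. The original loop returns an empty list because item not in student_answers asks whether the letter occurs anywhere in the student's list, and every letter from A to D does. The sketch below compares position by position instead; adding 1 to the index is an assumption about what "natural" question numbers means (drop it to get the zero-based result [2, 6, 13]).

```python
exam_solution = ['B', 'D', 'A', 'A', 'C', 'A', 'B', 'A', 'C', 'D', 'B', 'C',
                 'D', 'A', 'D', 'C', 'C', 'B', 'D', 'A']
student_answers = ['B', 'D', 'B', 'A', 'C', 'A', 'A', 'A', 'C', 'D', 'B', 'C',
                   'D', 'B', 'D', 'C', 'C', 'B', 'D', 'A']

questions_missed = []
for i in range(len(exam_solution)):
    # Compare the two answers at the same position, not membership in the whole list.
    if exam_solution[i] != student_answers[i]:
        questions_missed.append(i + 1)  # assumption: question numbers start at 1

print(questions_missed)  # [3, 7, 14]
```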
If I add some timing code, I see that performance doesn't really differ here:
def list_comprehension_version():
    questions_missed = [i for i, (ex,st) in enumerate(zip(exam_solution, student_answers)) if ex != st]
    return questions_missed
def loop_version():
    questions_missed = []
    for i, (ex,st) in enumerate(zip(exam_solution, student_answers)):
        if ex != st:
            questions_missed.append(i)
    return questions_missed
import timeit
print "list comprehension:", timeit.timeit("list_comprehension_version", "from __main__ import exam_solution, student_answers, list_comprehension_version", number=10000000)
print "loop:", timeit.timeit("loop_version", "from __main__ import exam_solution, student_answers, loop_version", number=10000000)
gives:
list comprehension: 0.895029446804
loop: 0.877159359719
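Note that the statement strings passed to timeit above name the functions without call parentheses, so each repetition only evaluates the name rather than running the function, which may be why the two timings look nearly identical. A sketch that times the actual calls could look like this (Python 3 syntax; the repetition count is an arbitrary assumption, and no timings are claimed here):

```python
import timeit

exam_solution = ['B', 'D', 'A', 'A', 'C', 'A', 'B', 'A', 'C', 'D', 'B', 'C',
                 'D', 'A', 'D', 'C', 'C', 'B', 'D', 'A']
student_answers = ['B', 'D', 'B', 'A', 'C', 'A', 'A', 'A', 'C', 'D', 'B', 'C',
                   'D', 'B', 'D', 'C', 'C', 'B', 'D', 'A']


def list_comprehension_version():
    return [i for i, (ex, st) in enumerate(zip(exam_solution, student_answers)) if ex != st]


def loop_version():
    questions_missed = []
    for i, (ex, st) in enumerate(zip(exam_solution, student_answers)):
        if ex != st:
            questions_missed.append(i)
    return questions_missed


# Passing the function object makes timeit call it on every repetition.
t_comp = timeit.timeit(list_comprehension_version, number=10000)
t_loop = timeit.timeit(loop_version, number=10000)
print("list comprehension:", t_comp)
print("loop:", t_loop)
```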
A solution based on iterators
questions_missed = list(index for (index, _)
                        in filter(
                            lambda (_, (answer, solution)): answer != solution,
                            enumerate(zip(student_answers, exam_solution))))
For the purists, note that you should import the equivalents of zip and filter (izip and ifilter) from itertools.
One more solution comes to mind. I put it in a separate answer as it is "special".
Using numpy this task can be accomplished by:
import numpy as np
exam_solution = np.array(exam_solution)
student_answers = np.array(student_answers)
(exam_solution!=student_answers).nonzero()[0]
With numpy-arrays, elementwise comparison is possible via == and !=. .nonzero() returns the indices of the array elements that are not zero. That's it.
Timing is really interesting now. For your 19-elements lists, performances are (N=19,repetitions=100,000):
list comprehension: 0.904024521544
loop: 0.936516107421
numpy: 0.349371968612
This is already a factor of almost 3. Nice, but not amazing.
But when I increase the size of your lists by a factor of 100, I get (N=19*100=1900, repetitions=1000):
list comprehension: 0.866544042939
loop: 1.04464069977
numpy: 0.0334220694495
Now we have a factor of 26 or 31 - that is definitely a lot.
Probably, performance won't be your problem, but, nevertheless, I thought it's worth pointing out.
