Group a list by word length - python

For example, I have a list, say
list = ['sight', 'first', 'love', 'was', 'at', 'It']
I want to group this list by word length, say
newlist = [['sight', 'first'], ['love'], ['was'], ['at', 'It']]
How can I do this? Thanks!

Use itertools.groupby:
>>> from itertools import groupby
>>> lis = ['sight', 'first', 'love', 'was', 'at', 'It']
>>> [list(g) for k, g in groupby(lis, key=len)]
[['sight', 'first'], ['love'], ['was'], ['at', 'It']]
Note that for itertools.groupby to work properly, all the items must already be sorted by length; otherwise either use collections.defaultdict (O(N)) or sort the list by length first and then apply itertools.groupby (O(N log N)):
>>> from collections import defaultdict
>>> d = defaultdict(list)
>>> lis = ['sight', 'first', 'foo', 'love', 'at', 'was', 'at', 'It']
>>> for x in lis:
...     d[len(x)].append(x)
...
>>> list(d.values())    # dicts preserve insertion order in Python 3.7+
[['sight', 'first'], ['foo', 'was'], ['love'], ['at', 'at', 'It']]
If you want the final output list to be sorted too then better sort the list items by length and apply itertools.groupby to it.
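For instance, a minimal sketch of that sort-then-group approach (the shuffled input here is just for illustration):
>>> from itertools import groupby
>>> words = ['love', 'sight', 'at', 'first', 'was', 'It']   # not sorted by length
>>> [list(g) for k, g in groupby(sorted(words, key=len), key=len)]
[['at', 'It'], ['was'], ['love'], ['sight', 'first']]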

You can use a temporary dictionary, then sort by length:
li = ['sight', 'first', 'love', 'was', 'at', 'It']
d = {}
for word in li:
    d.setdefault(len(word), []).append(word)
result = [d[n] for n in sorted(d, reverse=True)]
print(result)
# [['sight', 'first'], ['love'], ['was'], ['at', 'It']]
You can use defaultdict:
from collections import defaultdict
d = defaultdict(list)
for word in li:
    d[len(word)].append(word)
result = [d[n] for n in sorted(d, reverse=True)]
print(result)
or use __missing__ like so:
class Dicto(dict):
    def __missing__(self, key):
        self[key] = []
        return self[key]

d = Dicto()
for word in li:
    d[len(word)].append(word)
result = [d[n] for n in sorted(d, reverse=True)]
print(result)

Since the groupby solution was already taken ;-)
from collections import defaultdict
lt = ['sight', 'first', 'love', 'was', 'at', 'It']
d = defaultdict(list)
for x in lt:
    d[len(x)].append(x)
print(list(d.values()))    # insertion order in Python 3.7+
# [['sight', 'first'], ['love'], ['was'], ['at', 'It']]

Related

How to Remove Duplicate Lists from a List in Python [duplicate]

The following list has some duplicated sublists, with elements in different order:
l1 = [
['The', 'quick', 'brown', 'fox'],
['hi', 'there'],
['jumps', 'over', 'the', 'lazy', 'dog'],
['there', 'hi'],
['jumps', 'dog', 'over','lazy', 'the'],
]
How can I remove duplicates, retaining the first instance seen, to get:
l1 = [
['The', 'quick', 'brown', 'fox'],
['hi', 'there'],
['jumps', 'over', 'the', 'lazy', 'dog'],
]
I tried to:
[list(i) for i in set(map(tuple, l1))]
Nevertheless, I do not know if this is the fastest way of doing it for large lists, and my attempt is not working as desired. Any idea of how to remove them efficiently?
This one is a little tricky. You want to key a dict off of frozen counters, but counters are not hashable in Python. For a small degradation in the asymptotic complexity, you could use sorted tuples as a substitute for frozen counters:
seen = set()
result = []
for x in l1:
    key = tuple(sorted(x))
    if key not in seen:
        result.append(x)
        seen.add(key)
The same idea in a one-liner would look like this:
[*{tuple(sorted(k)): k for k in reversed(l1)}.values()][::-1]
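For the sample l1 in the question, both the loop's result and the one-liner should evaluate to:
[['The', 'quick', 'brown', 'fox'], ['hi', 'there'], ['jumps', 'over', 'the', 'lazy', 'dog']]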
I did a quick benchmark, comparing the various answers:
l1 = [['The', 'quick', 'brown', 'fox'], ['hi', 'there'], ['jumps', 'over', 'the', 'lazy', 'dog'], ['there', 'hi'], ['jumps', 'dog', 'over','lazy', 'the']]
from collections import Counter
def method1():
    """manually construct set, keyed on sorted tuple"""
    seen = set()
    result = []
    for x in l1:
        key = tuple(sorted(x))
        if key not in seen:
            result.append(x)
            seen.add(key)
    return result

def method2():
    """frozenset-of-Counter"""
    return list({frozenset(Counter(lst).items()): lst for lst in reversed(l1)}.values())

def method3():
    """wim"""
    return [*{tuple(sorted(k)): k for k in reversed(l1)}.values()][::-1]
from timeit import timeit
print(timeit(lambda: method1(), number=1000))
print(timeit(lambda: method2(), number=1000))
print(timeit(lambda: method3(), number=1000))
Prints:
0.0025010189856402576
0.016385524009820074
0.0026451340527273715
This:
l1 = [['The', 'quick', 'brown', 'fox'], ['hi', 'there'], ['jumps', 'over', 'the', 'lazy', 'dog'], ['there', 'hi'], ['jumps', 'dog', 'over','lazy', 'the']]
s = {tuple(item) for item in map(sorted, l1)}
l2 = [list(item) for item in s]
l2 gives the list with reverse duplicates removed. Note, however, that each surviving sublist comes back with its words sorted, and the overall order is arbitrary, since sets are unordered.
Compare with: Pythonic way of removing reversed duplicates in list
@wim's answer is inefficient since it sorts the list items as a way to uniquely identify a set of counts of list items, which costs O(n log n) per sublist.
To achieve the same in linear time, you can instead key on a frozenset of item counts built with the collections.Counter class. Because a dict comprehension keeps the last value for a duplicated key, while the question asks to keep the first occurrence, you have to build the dict over the reversed list and then reverse the resulting list of de-duplicated sublists:
from collections import Counter
list({frozenset(Counter(lst).items()): lst for lst in reversed(l1)}.values())[::-1]
This returns:
[['The', 'quick', 'brown', 'fox'], ['hi', 'there'], ['jumps', 'over', 'the', 'lazy', 'dog']]
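To make the hashable key concrete, here is what it looks like for one sublist (set display order may vary):
>>> from collections import Counter
>>> frozenset(Counter(['hi', 'there']).items())
frozenset({('hi', 1), ('there', 1)})
>>> frozenset(Counter(['there', 'hi']).items()) == frozenset(Counter(['hi', 'there']).items())
True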

How to separate a list into two lists at '\n'?

I would like to separate a list into different lists at '\n'. For example, if I have a list like this one:
l = ['hi', 'my', 'name', 'is', 'john', '\n', '\n', 'nice', 'to', 'meet', 'you']
I'd like to separate the items this way:
l = [['hi', 'my', 'name', 'is', 'john'], ['nice', 'to', 'meet', 'you']]
Can someone help me?
Some code that I tried to write:
l = ['hi', 'my', 'name', 'is', 'john', '\n', '\n', 'nice', 'to', 'meet', 'you']
lst = []
ls = []
for word in l:
    if word != '\n':
        ls.append(l)
    else:
        lst.append(ls)
print(lst)
I think you just wanted to append word to the list ls. Also, clear the partial list at the newlines like so:
lst = []
ls = []
for word in l:
    if word != '\n':
        ls.append(word)
    else:
        if len(ls) > 0:
            lst.append(ls)
        ls = []
if len(ls) > 0:
    lst.append(ls)
print(lst)
resulting in
[['hi', 'my', 'name', 'is', 'john'], ['nice', 'to', 'meet', 'you']]
You could use itertools.groupby:
>>> from itertools import groupby
>>> l = ['hi', 'my', 'name', 'is', 'john', '\n', '\n', 'nice', 'to', 'meet', 'you']
>>> l = [list(group) for key, group in groupby(l, lambda s: s != '\n') if key]
>>> l
[['hi', 'my', 'name', 'is', 'john'], ['nice', 'to', 'meet', 'you']]
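To see why the if key filter is needed, here is what groupby yields before filtering; the orig name below is just for illustration, holding the original un-split list:
>>> orig = ['hi', 'my', 'name', 'is', 'john', '\n', '\n', 'nice', 'to', 'meet', 'you']
>>> [(key, list(group)) for key, group in groupby(orig, lambda s: s != '\n')]
[(True, ['hi', 'my', 'name', 'is', 'john']), (False, ['\n', '\n']), (True, ['nice', 'to', 'meet', 'you'])]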

Splitting a list of strings into sub lists based on their length

If for instance I have a list
['My', 'Is', 'Name', 'Hello', 'William']
How can I manipulate it such that I can create a new list
[['My', 'Is'], ['Name'], ['Hello'], ['William']]
You could use itertools.groupby:
>>> from itertools import groupby
>>> l = ['My', 'Is', 'Name', 'Hello', 'William']
>>> [list(g) for k, g in groupby(l, key=len)]
[['My', 'Is'], ['Name'], ['Hello'], ['William']]
If, however, the list is not already sorted by length, you will need to sort it first, as @recnac pointed out:
>>> l2 = ['My', 'Name', 'Hello', 'Is', 'William']
>>> [list(g) for k, g in groupby(sorted(l2, key=len), key=len)]
[['My', 'Is'], ['Name'], ['Hello'], ['William']]
You can build a dict that maps word lengths to a list of matching words, and then get the list of the dict's values:
l = ['My', 'Is', 'Name', 'Hello', 'William']
d = {}
for w in l:
    d.setdefault(len(w), []).append(w)
print(list(d.values()))
This outputs:
[['My', 'Is'], ['Name'], ['Hello'], ['William']]
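This relies on dicts preserving insertion order (guaranteed since Python 3.7), so the groups come out in the order the lengths first appear. If you specifically want the groups ordered by word length, a minimal sketch on top of the same d:
result = [d[length] for length in sorted(d)]
print(result)
# [['My', 'Is'], ['Name'], ['Hello'], ['William']]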
I have also found a solution; while it is not the most concise, I thought it would be worth sharing:
data = ['My', 'Is', 'Name', 'Hello', 'William']
dict0 = {}
for index in data:
    if len(index) not in dict0:
        dict0[len(index)] = [index]
    elif len(index) in dict0:
        dict0[len(index)] += [index]
list0 = []
for i in dict0:
    list0.append(dict0[i])
print(list0)
You can use a dict to record the strings grouped by length; defaultdict is used here for convenience.
from collections import defaultdict
str_list = ['My', 'Is', 'Name', 'Hello', 'William']
group_by_len = defaultdict(list)
for s in str_list:
    group_by_len[len(s)].append(s)
result = list(group_by_len.values())
output:
[['My', 'Is'], ['Name'], ['Hello'], ['William']]
Hope that will help you, and comment if you have further questions. : )

Python Sequential Dictionary

How do you create a dictionary (e.g. food_dictionary) with the keys being the unique words in 'word_list' and the value being the list of words immediately following it (i.e. for words that have a word immediately following it)?
word_list = [['always', 'want', 'pizza'], ['we', 'want', 'potato', 'chips']]
food_dictionary = {'always': ['want'], 'want': ['pizza', 'potato'], 'we': ['want'], 'potato': ['chips']}
Try this -
from collections import defaultdict
word_list = [ ['always', 'want', 'pizza' ], ['we', 'want', 'potato', 'chips' ] ]
food_dict = defaultdict(list)
for wl in word_list:
    for w1, w2 in zip(wl, wl[1:]):
        food_dict[w1].append(w2)
print(food_dict)
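For the sample word_list, this should print something like (exact repr formatting may vary):
defaultdict(<class 'list'>, {'always': ['want'], 'want': ['pizza', 'potato'], 'we': ['want'], 'potato': ['chips']})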
Useful links -
1. defaultdict
2. zip()
As always for these types of problems, consider this good ol' zip trick:
lst = ['always', 'want', 'want', 'pizza']
pairs = list(zip(lst[:-1], lst[1:]))
Gives:
>>> pairs
[('always', 'want'), ('want', 'want'), ('want', 'pizza')]
Next up, we want to group all tuples beginning with the same word:
from itertools import groupby
groups = groupby(sorted(pairs, key=lambda x: x[0]), lambda x: x[0])
And finally convert to dictionary:
dict((k, [x[1] for x in g]) for k, g in groups)
This can be done for your entire word_list in the following manner:
from itertools import groupby
word_list = [['always', 'want', 'pizza'], ['we', 'want', 'potato', 'chips']]
pairs = [x for lst in word_list for x in zip(lst[:-1], lst[1:])]
sorted_pairs = sorted(pairs, key=lambda x: x[0])
groups = groupby(sorted_pairs, lambda x: x[0])
food_dict = dict((k, [x[1] for x in g]) for k, g in groups)
Gives:
>>> food_dict
{'always': ['want'], 'potato': ['chips'], 'want': ['pizza', 'potato'], 'we': ['want']}
