Python Sequential Dictionary - python

How do you create a dictionary (e.g. food_dictionary) with the keys being the unique words in 'word_list' and the value being the list of words immediately following it (i.e. for words that have a word immediately following it)?
word_list = [ ['always', 'want', 'pizza' ], ['we', 'want', 'potato', 'chips' ] ]
food_dictionary = { 'always' : ['want'], 'want': ['pizza', 'potato'], 'we': ['want'], 'potato': ['chips'] }

Try this -
from collections import defaultdict
word_list = [ ['always', 'want', 'pizza' ], ['we', 'want', 'potato', 'chips' ] ]
food_dict = defaultdict(list)
for wl in word_list:
    # pair each word with the word that immediately follows it
    for w1, w2 in zip(wl, wl[1:]):
        food_dict[w1].append(w2)
print(food_dict)
Useful links -
1. defaultdict
2. zip()
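For reuse, the loop above can be wrapped in a function. This is only a minimal sketch, not part of the original answer: the name build_followers is hypothetical, and it returns a plain dict so that the result prints like the food_dictionary in the question.

```python
from collections import defaultdict


def build_followers(word_list):
    """Map each word to the list of words that immediately follow it."""
    followers = defaultdict(list)
    for sentence in word_list:
        # zip stops at the shorter input, so the last word gets no pair
        for current_word, next_word in zip(sentence, sentence[1:]):
            followers[current_word].append(next_word)
    # convert to a plain dict so that missing keys raise KeyError again
    return dict(followers)


word_list = [['always', 'want', 'pizza'], ['we', 'want', 'potato', 'chips']]
food_dictionary = build_followers(word_list)
print(food_dictionary)
```

Words that never have a successor (here 'pizza' and 'chips') do not appear as keys, which matches the expected output in the question.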

As always for these types of problems, consider this good ol' zip trick:
lst = ['always', 'want', 'want', 'pizza']
pairs = list(zip(lst[:-1], lst[1:]))
Gives:
>>> pairs
[('always', 'want'), ('want', 'want'), ('want', 'pizza')]
Next up, we want to group all tuples beginning with the same word:
from itertools import groupby
groups = groupby(sorted(pairs, key=lambda x: x[0]), lambda x: x[0])
And finally convert to dictionary:
dict((k, [x[1] for x in g]) for k, g in groups)
This can be done for your entire word_list in the following manner:
from itertools import groupby
word_list = [['always', 'want', 'pizza'], ['we', 'want', 'potato', 'chips']]
pairs = [x for lst in word_list for x in zip(lst[:-1], lst[1:])]
sorted_pairs = sorted(pairs, key=lambda x: x[0])
groups = groupby(sorted_pairs, lambda x: x[0])
food_dict = dict((k, [x[1] for x in g]) for k, g in groups)
Gives:
>>> food_dict
{'always': ['want'], 'potato': ['chips'], 'want': ['pizza', 'potato'], 'we': ['want']}
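A small aside on the zip trick above: because zip stops at the shortest input, slicing off the last element with lst[:-1] is optional. A quick sketch to check that both spellings give the same pairs (the variable names pairs_explicit and pairs_short are only for illustration):

```python
lst = ['always', 'want', 'want', 'pizza']

# explicit form used above: drop the last element on the left side
pairs_explicit = list(zip(lst[:-1], lst[1:]))

# shorter form: zip stops as soon as lst[1:] is exhausted
pairs_short = list(zip(lst, lst[1:]))

print(pairs_short)
```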

Related

Splitting a list of strings into sub lists based on their length

If for instance I have a list
['My', 'Is', 'Name', 'Hello', 'William']
How can I manipulate it such that I can create a new list
[['My', 'Is'], ['Name'], ['Hello'], ['William']]
You could use itertools.groupby:
>>> from itertools import groupby
>>> l = ['My', 'Is', 'Name', 'Hello', 'William']
>>> [list(g) for k, g in groupby(l, key=len)]
[['My', 'Is'], ['Name'], ['Hello'], ['William']]
If, however, the list is not already sorted by length, you will need to sort it first, as @recnac mentions in the comments below:
>>> l2 = ['My', 'Name', 'Hello', 'Is', 'William']
>>> [list(g) for k, g in groupby(sorted(l2, key=len), key=len)]
[['My', 'Is'], ['Name'], ['Hello'], ['William']]
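To see why the sort matters, here is a small sketch comparing groupby on the unsorted and the sorted list. The names unsorted_groups and sorted_groups are made up for this illustration.

```python
from itertools import groupby

l2 = ['My', 'Name', 'Hello', 'Is', 'William']

# groupby only merges *adjacent* items with the same key,
# so 'My' and 'Is' end up in separate groups here
unsorted_groups = [list(g) for k, g in groupby(l2, key=len)]

# sorting by the same key first brings equal-length words together
sorted_groups = [list(g) for k, g in groupby(sorted(l2, key=len), key=len)]

print(unsorted_groups)
print(sorted_groups)
```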
You can build a dict that maps word lengths to a list of matching words, and then get the list of the dict's values:
l = ['My', 'Is', 'Name', 'Hello', 'William']
d = {}
for w in l:
    d.setdefault(len(w), []).append(w)
print(list(d.values()))
This outputs:
[['My', 'Is'], ['Name'], ['Hello'], ['William']]
I have also found a solution. It is not the most concise, but I thought it would be worth sharing:
data = ['My', 'Is', 'Name', 'Hello', 'William']
dict0 = {}
for index in data:
    if len(index) not in dict0:
        dict0[len(index)] = [index]
    else:
        dict0[len(index)] += [index]
list0 = []
for i in dict0:
    list0.append(dict0[i])
print(list0)
You can use a dict to record the strings grouped by length; defaultdict is used here for convenience.
from collections import defaultdict
str_list = ['My', 'Is', 'Name', 'Hello', 'William']
group_by_len = defaultdict(list)
for s in str_list:
    group_by_len[len(s)].append(s)
result = list(group_by_len.values())
output:
[['My', 'Is'], ['Name'], ['Hello'], ['William']]
Hope that will help you, and comment if you have further questions. : )

Match adjacent list elements with a list of tuples in Python

I have an ordered list of individual words from a document, like so:
words = ['apple', 'orange', 'boat', 'car', 'happy', 'day', 'cow', ...]
I have a second list of tuples of significant bigrams/collocations, like so:
bigrams = [('apple', 'orange'), ('happy', 'day'), ('big', 'house'), ...]
I would like to iterate through the list of individual words and replace adjacent words with an underscore-separated bigram, ending up with a list like this:
words_fixed = ['apple_orange', 'boat', 'car', 'happy_day', 'cow', ...]
I've considered flattening words and bigrams into strings (" ".join(words), etc.) and then using regex to find and replace the adjacent words, but that seems horribly inefficient and unpythonic.
What's the best way to quickly match and combine adjacent list elements from a list of tuples?
Not as flashy as @inspectorG4dget:
words_fixed = []
last = None
for word in words:
    if (last,word) in bigrams:
        words_fixed.append( "%s_%s" % (last,word) )
        last = None
    else:
        if last:
            words_fixed.append( last )
        last = word
if last:
    words_fixed.append( last )
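The same left-to-right merge can also be sketched with an index and a set for faster lookups. This is an illustrative variant rather than the answer's own code: merge_bigrams and bigram_set are hypothetical names, and it assumes bigrams never overlap in a way that needs backtracking.

```python
def merge_bigrams(words, bigrams):
    """Join adjacent words that form a known bigram with an underscore."""
    bigram_set = set(bigrams)  # O(1) membership tests instead of a list scan
    merged = []
    i = 0
    while i < len(words):
        # look at the current word and its right neighbour, if there is one
        if i + 1 < len(words) and (words[i], words[i + 1]) in bigram_set:
            merged.append(words[i] + '_' + words[i + 1])
            i += 2  # both words are consumed
        else:
            merged.append(words[i])
            i += 1
    return merged


words = ['apple', 'orange', 'boat', 'car', 'happy', 'day', 'cow']
bigrams = [('apple', 'orange'), ('happy', 'day'), ('big', 'house')]
print(merge_bigrams(words, bigrams))
```

Using an index instead of a truthiness check on the previous word means empty strings in the input are kept rather than silently dropped.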
words = ['apple', 'orange', 'boat', 'car', 'happy', 'day', 'cow']
bigrams = [('apple', 'orange'), ('happy', 'day'), ('big', 'house')]
bigrams_dict = dict(item for item in bigrams)
bigrams_dict.update(item[::-1] for item in bigrams)
words_fixed = ["{}_{}".format(word, bigrams_dict[word])
if word in bigrams_dict else word
for word in words]
[edit] another way to create dictionary:
from itertools import chain
bigrams_rev = (reversed(x) for x in bigrams)
bigrams_dict = dict(chain(bigrams, bigrams_rev))
words = ['apple', 'orange', 'boat', 'car', 'happy', 'day', 'cow', ...]
bigrams = [('apple', 'orange'), ('happy', 'day'), ('big', 'house'), ...]
First, some optimization:
import collections
bigram_pairs = bigrams  # keep the original list of tuples before rebinding the name
bigrams = collections.defaultdict(set)
for w1,w2 in bigram_pairs:
    bigrams[w1].add(w2)
Now, onto the fun stuff:
words_fixed = []
# zip(words, words[1:]) yields each pair of adjacent words
for w1,w2 in zip(words, words[1:]):
    if w1 in bigrams and w2 in bigrams[w1]:
        words_fixed.append("%s_%s" %(w1, w2))
If you want to see words that are not in your bigrams, in addition to the words you've recorded in your bigrams, then this should do the trick:
words_fixed = []
skip = False
# pad with None so that the last word is also visited
for w1,w2 in zip(words, words[1:] + [None]):
    if skip:
        skip = False  # w1 was already merged into the previous bigram
    elif w1 in bigrams and w2 in bigrams[w1]:
        words_fixed.append("%s_%s" %(w1, w2))
        skip = True
    else:
        words_fixed.append(w1)
words = ['apple', 'orange', 'boat', 'car', 'happy', 'day', 'cow', 'big']
bigrams = [('apple', 'orange'), ('happy', 'day'), ('big', 'house')]
print('words :', words)
print('bigrams :', bigrams)
print()
def zwii(words,bigrams):
    it = iter(words)
    dict_bigrams = dict(bigrams)
    for x in it:
        if x in dict_bigrams:
            try:
                y = next(it)
                if dict_bigrams[x] == y:
                    yield '-'.join((x,y))
                else:
                    yield x
                    yield y
            except StopIteration:
                # x was the last word, so there is nothing to pair it with
                yield x
        else:
            yield x
print(list(zwii(words,bigrams)))
result
words : ['apple', 'orange', 'boat', 'car', 'happy', 'day', 'cow', 'big']
bigrams : [('apple', 'orange'), ('happy', 'day'), ('big', 'house')]
['apple-orange', 'boat', 'car', 'happy-day', 'cow', 'big']

Group a list by word length

For example, I have a list, say
list = ['sight', 'first', 'love', 'was', 'at', 'It']
I want to group this list by word length, say
newlist = [['sight', 'first'],['love'], ['was'], ['at', 'It']]
Please help me on it.
Appreciation!
Use itertools.groupby:
>>> from itertools import groupby
>>> lis = ['sight', 'first', 'love', 'was', 'at', 'It']
>>> [list(g) for k, g in groupby(lis, key=len)]
[['sight', 'first'], ['love'], ['was'], ['at', 'It']]
Note that for itertools.groupby to work properly, all the items must already be sorted by length. Otherwise, either use collections.defaultdict (O(N)), or sort the list first and then use itertools.groupby (O(N log N)):
>>> from collections import defaultdict
>>> d = defaultdict(list)
>>> lis = ['sight', 'first', 'foo', 'love', 'at', 'was', 'at', 'It']
>>> for x in lis:
...     d[len(x)].append(x)
...
>>> d.values()
[['at', 'at', 'It'], ['foo', 'was'], ['love'], ['sight', 'first']]
If you want the final output list to be sorted too then better sort the list items by length and apply itertools.groupby to it.
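The sort-then-group variant mentioned above could look roughly like this. It is a sketch under the assumption that longer words should come first, as in the question's expected output; by_length and grouped are illustrative names.

```python
from itertools import groupby

lis = ['sight', 'first', 'foo', 'love', 'at', 'was', 'at', 'It']

# sorted() is stable, so words of equal length keep their original order
by_length = sorted(lis, key=len, reverse=True)
grouped = [list(g) for k, g in groupby(by_length, key=len)]
print(grouped)
```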
You can use a temp dictionary then sort by length:
li=['sight', 'first', 'love', 'was', 'at', 'It']
d={}
for word in li:
    d.setdefault(len(word), []).append(word)
result=[d[n] for n in sorted(d, reverse=True)]
print(result)
# [['sight', 'first'], ['love'], ['was'], ['at', 'It']]
You can use defaultdict:
from collections import defaultdict
d=defaultdict(list)
for word in li:
    d[len(word)].append(word)
result=[d[n] for n in sorted(d, reverse=True)]
print(result)
or use __missing__ like so:
class Dicto(dict):
    def __missing__(self, key):
        self[key]=[]
        return self[key]
d=Dicto()
for word in li:
    d[len(word)].append(word)
result=[d[n] for n in sorted(d, reverse=True)]
print(result)
Since the groupby solution was already taken ;-)
from collections import defaultdict
lt = ['sight', 'first', 'love', 'was', 'at', 'It']
d = defaultdict(list)
for x in lt:
    d[len(x)].append(x)
d.values()
[['at', 'It'], ['was'], ['love'], ['sight', 'first']]

Groupby the list according to category

my code gives me output as a list
def extractKeywords():
    <code>
    return list
list = []
data = extractKeywords()
for x in range(0,5):
    get = data[0][x]
    list.append(get)
print list12
Output list is
['LION', 'tv', 'TIGER', 'keyboard', 'cd-writer','ELEPHANT']
How can I categorize this list into two groups, like this (expected output)?
Animals = ['LION', 'TIGER', 'ELEPHANT']
Electronics = ['tv', 'keyboard', 'cd-writer']
All animals are in capital letters and all electronics are in lowercase letters.
This solution uses itertools.groupby to avoid traversing the list twice.
>>> from itertools import groupby
>>> data = ['LION', 'tv', 'TIGER', 'keyboard', 'cd-writer','ELEPHANT']
>>> # upper case letters have lower `ord` values than lower case letters
>>> sort_by_case = sorted(data, key=lambda word: ord(word[0]))
>>> sort_by_case
['ELEPHANT', 'LION', 'TIGER', 'cd-writer', 'keyboard', 'tv']
>>> # group the words according to whether their first letter is upper case or not
>>> group_by_case = groupby(sort_by_case, lambda word: word[0].isupper())
>>> # use tuple unpacking to assign the two groups to appropriate variables
>>> upper_case, lower_case = [list(g) for (k, g) in group_by_case]
>>> upper_case
['ELEPHANT', 'LION', 'TIGER']
>>> lower_case
['cd-writer', 'keyboard', 'tv']
mylist = ['LION', 'tv', 'TIGER', 'keyboard', 'cd-writer','ELEPHANT']
[word for word in mylist if word==word.lower()]
Here is one possible solution
>>> from itertools import tee
>>>
>>> def splitOnCondition(lst, condition):
...     l1, l2 = tee((condition(i), i) for i in lst)
...     return [i for c, i in l1 if c], [i for c, i in l2 if not c]
...
>>> splitOnCondition(['LION', 'tv', 'TIGER', 'keyboard',
...                   'cd-writer','ELEPHANT'], lambda x: x==x.lower())
(['tv', 'keyboard', 'cd-writer'], ['LION', 'TIGER', 'ELEPHANT'])
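If the original order within each group should be kept, a plain single-pass loop may be the simplest option. A minimal sketch, assuming that "animal" simply means the word is fully upper case; partition_by_case is a hypothetical helper name.

```python
def partition_by_case(words):
    """Split words into (upper-case words, all other words) in one pass."""
    upper, other = [], []
    for word in words:
        # str.isupper() is True when every cased character is upper case
        (upper if word.isupper() else other).append(word)
    return upper, other


data = ['LION', 'tv', 'TIGER', 'keyboard', 'cd-writer', 'ELEPHANT']
animals, electronics = partition_by_case(data)
print(animals)
print(electronics)
```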

Find all the strings with max length using max() function [duplicate]

lst = [str1, str2, str3, ...]
max(lst, key=len)
This returns only one of the strings with max length. Is there any way to do that without defining another procedure?
How about:
maxlen = len(max(l, key=len))
maxlist = [s for s in l if len(s) == maxlen]
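The two lines above fail on an empty list, because max() raises ValueError when given nothing to compare. A hedged sketch of a wrapper that handles this case (longest_strings is a made-up name, and the default argument of max needs Python 3.4 or later):

```python
def longest_strings(strings):
    """Return every string in the list whose length equals the maximum."""
    # default=0 avoids a ValueError when the list is empty
    max_len = max(map(len, strings), default=0)
    return [s for s in strings if len(s) == max_len]


print(longest_strings(['abc', 'd', 'ef', 'ghi', 'j']))
print(longest_strings([]))
```

Note that the input is traversed twice, so it should be a list or another re-iterable sequence rather than a one-shot iterator.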
If you want to get all the values with the max length, you probably want to sort the list by length; then you just need to take all the values until the length changes. itertools provides multiple ways to do that—takewhile, groupby, etc. For example:
>>> import itertools
>>> l = ['abc', 'd', 'ef', 'ghi', 'j']
>>> l2 = sorted(l, key=len, reverse=True)
>>> groups = itertools.groupby(l2, key=len)
>>> maxlen, maxvalues = next(groups)
>>> print(maxlen, list(maxvalues))
3 ['abc', 'ghi']
If you want a one-liner:
>>> maxlen, maxvalues = next(itertools.groupby(sorted(l, key=len, reverse=True), key=len))
>>> print(maxlen, list(maxvalues))
Of course you can always just make two passes over the list if you prefer—first to find the max length, then to find all matching values:
>>> maxlen = len(max(l, key=len))
>>> maxvalues = (value for value in l if len(value) == maxlen)
>>> print(maxlen, list(maxvalues))
Just for the sake of completeness, filter is also an option:
maxlens = list(filter(lambda s: len(s) == len(max(myList, key=len)), myList))
Here is a one-pass solution, collecting longest-seen-so-far words as they are found.
def findLongest(words):
    if not words:
        return []
    worditer = iter(words)
    ret = [next(worditer)]
    cur_len = len(ret[0])
    for wd in worditer:
        len_wd = len(wd)
        if len_wd > cur_len:
            ret = [wd]
            cur_len = len_wd
        else:
            if len_wd == cur_len:
                ret.append(wd)
    return ret
Here are the results from some test lists:
tests = [
[],
"Four score and seven years ago".split(),
"To be or not to be".split(),
"Now is the winter of our discontent made glorious summer by this sun of York".split(),
]
for test in tests:
    print(test)
    print(findLongest(test))
    print()
[]
[]
['Four', 'score', 'and', 'seven', 'years', 'ago']
['score', 'seven', 'years']
['To', 'be', 'or', 'not', 'to', 'be']
['not']
['Now', 'is', 'the', 'winter', 'of', 'our', 'discontent', 'made', 'glorious', 'summer', 'by', 'this', 'sun', 'of', 'York']
['discontent']
