I'm not sure the title describes my question well; I'll edit it if something is wrong. I've checked many related questions, but because the code is so nested, I need to use combinations, and I'm not very experienced in programming, I couldn't adapt them.
I have a nested dict, which is similar to this:
example_dictionary = {
    'I want to eat peach and egg.': {'apple': 3, 'orange': 2, 'banana': 5},
    'Peach juice is so delicious.': {'apple': 3, 'orange': 5, 'banana': 2},
    'Goddamn monkey ate my banana.': {'rice': 4, 'apple': 6, 'monkey': 2},
    'They say apple is good for health.': {'grape': 10, 'monkey': 5, 'peach': 5, 'egg': 8}}
What I'm trying to do is build an adjacency matrix by following some rules.
The rules are:
1) If a word from any inner dict appears in any sentence (outer dict key), add that word's value as weight between the related sentences.
2) If two sentences share an inner dict key (word) but with different values, multiply the values and add the product as weight between the related sentences.
Extra note: inner dicts can have different lengths, and the same inner dict key (word) can carry different values. I only want the multiplication in that case; if the values are equal, the shared key should be ignored.
Example:
Sentence1 (0): I want to eat peach and egg. {'apple': 3, 'orange': 2, 'banana': 5}
Sentence2 (1): Peach juice is so delicious. {'apple': 3, 'orange': 5, 'banana': 2}
Sentence3 (2): Goddamn monkey ate my banana. {'rice': 4, 'apple': 6, 'monkey': 2}
Sentence4 (3): They say apple is good for health. {'grape': 10, 'monkey': 5, 'peach': 5, 'egg': 8}
Between 0 and 1: 2*5 + 5*2 = 20 (their 'apple' values are equal, so only 'orange' and 'banana' are multiplied, and none of the inner-dict words appears in the other sentence).
Between 2 and 3: 2*5 = 10 ('monkey' is a shared key with different values) + 6 (sentence3's key 'apple' appears in sentence4) + 5 (sentence4's key 'monkey' appears in sentence3) = 21.
Between 0 and 3: 3 + 5 + 8 = 16 (sentence1's key 'apple' appears in sentence4, and sentence4's keys 'egg' and 'peach' appear in sentence1).
I hope these examples make it clear.
What I have tried (this was pretty confusing for me due to the nested structure and the combinations):
from itertools import combinations, zip_longest
import networkx as nx

def compare_inner_dicts(d1, d2):
    # this is for comparing the inner dict keys and multiplying them
    # if they have the same key but different value
    values = []
    inner_values = 0
    for common_key in d1.keys() & d2.keys():
        if d1[common_key] != d2[common_key]:
            _value = d1[common_key]*d2[common_key]
            values.append(_value)
            inner_values = sum([p for p in values])
    inner_dict_values = inner_values
    del inner_values
    return inner_dict_values

def build_adj_mat(a_dict):
    gr = nx.Graph()
    for sentence, words in a_dict.items():
        sentences = list(a_dict.keys())
        gr.add_nodes_from(sentences)
        sentence_pairs = combinations(gr.nodes, 2)
        dict_pairs = combinations(a_dict.values(), 2)
        for pair, _pair in zip_longest(sentence_pairs, dict_pairs):
            numbers = []
            x_numbers = []
            #y_numbers = []
            sentence1 = pair[0]
            sentence2 = pair[1]
            dict1 = _pair[0]
            dict2 = _pair[1]
            inner_dict_numbers = compare_inner_dicts(dict1, dict2)
            numbers.append(inner_dict_numbers)
            for word, num in words.items():
                if sentence2.find(word) > -1:
                    x = words[word]
                    x_numbers.append(x)
                    numbers.extend(x_numbers)
                    # if sentence1.find(word)>-1: #reverse case
                    #     y = words[word]
                    #     y_numbers.append(y)
                    #     numbers.extend(y_numbers)
                    total = sum([p for p in numbers if len(numbers) > 0])
                    if total > 0:
                        gr.add_edge(sentence1, sentence2, weight=total)
                        del total
                    else:
                        del total
                else:
                    continue
            numbers.clear()
            x_numbers.clear()
            #y_numbers.clear()
    return gr

G = build_adj_mat(example_dictionary)
print(nx.adjacency_matrix(G))
G = build_adj_mat(example_dictionary)
print(nx.adjacency_matrix(G))
Expected result:
(0, 1) 2*5 + 5*2 = 20
(0, 2) 3*6 + 5 = 23
(0, 3) 3 + 5 + 8 = 16
(1, 0) 20
(1, 2) 3*6 + 2 = 20
(1, 3) 3 + 5 = 8
(2, 0) 23
(2, 1) 20
(2, 3) 2*5 + 5 + 6 = 21
(3, 0) 16
(3, 1) 8
(3, 2) 21
Output:
(0, 2) 23
(0, 3) 6
(1, 2) 23
(1, 3) 6
(2, 0) 23
(2, 1) 23
(2, 3) 16
(3, 0) 6
(3, 1) 6
(3, 2) 16
Comparing the expected and actual outputs, I can see one problem: my code only checks whether a word from sentence1 appears in sentence2, never the reverse. I tried to solve that with the commented-out part, but it returned even more nonsensical results. I'm also not sure whether there are other problems; the two combinations and the nested structure have me totally lost. Sorry for the long question; I described everything to make it clear. Any help would be greatly appreciated, thanks in advance.
You can use the following function:
from collections import defaultdict
import itertools as it
import re

def compute_scores(sentence_dict):
    scores = defaultdict(int)
    for (j, (s1, d1)), (k, (s2, d2)) in it.combinations(enumerate(sentence_dict.items()), 2):
        shared_keys = d1.keys() & d2.keys()
        scores[j, k] += sum(d1[k]*d2[k] for k in shared_keys if d1[k] != d2[k])
        scores[j, k] += sum(d1[k] for k in d1.keys() & get_words(s2))
        scores[j, k] += sum(d2[k] for k in d2.keys() & get_words(s1))
    return scores

def get_words(sentence):
    return set(map(str.lower, re.findall(r'(?<=\b)\w+(?=\b)', sentence)))
The result depends of course on what you define as a word, so you'd need to fill in your own definition in the function get_words. The default implementation seems to fit your example data. Since the score for a sentence pair is symmetric according to your definition, there's no need to consider the reverse pairing as well (it has the same score); i.e. (0, 1) has the same score as (1, 0). That's why the code uses itertools.combinations.
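For example, the default get_words simply lowercases the sentence and extracts runs of word characters (set order may vary):
>>> get_words('I want to eat peach and egg.')
{'i', 'want', 'to', 'eat', 'peach', 'and', 'egg'}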
Running the example data:
from pprint import pprint

example_dictionary = {
    'I want to eat peach and egg.': {'apple': 3, 'orange': 2, 'banana': 5},
    'Peach juice is so delicious.': {'apple': 3, 'orange': 5, 'banana': 2},
    'Goddamn monkey ate my banana.': {'rice': 4, 'apple': 6, 'monkey': 2},
    'They say apple is good for health.': {'grape': 10, 'monkey': 5, 'peach': 5, 'egg': 8}}

pprint(compute_scores(example_dictionary))
gives the following scores:
defaultdict(<class 'int'>,
            {(0, 1): 20,
             (0, 2): 23,
             (0, 3): 16,
             (1, 2): 20,
             (1, 3): 8,
             (2, 3): 21})
In case the dicts can contain not only single words but also phrases (i.e. multiple words), a slight modification of the original implementation will do (it also works for single words):
scores[j, k] += sum(weight for phrase, weight in d1.items() if phrase in s2.lower())
scores[j, k] += sum(weight for phrase, weight in d2.items() if phrase in s1.lower())
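If you still want the networkx adjacency matrix you were originally building, a minimal sketch of the glue code (assuming compute_scores and example_dictionary as above, and Python 3.7+ so the dict preserves insertion order):
import networkx as nx

gr = nx.Graph()
sentences = list(example_dictionary)  # node order matches the indices produced by compute_scores
gr.add_nodes_from(sentences)
for (j, k), score in compute_scores(example_dictionary).items():
    if score > 0:
        gr.add_edge(sentences[j], sentences[k], weight=score)
print(nx.adjacency_matrix(gr))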
The code below is a brute force method of searching a list of words and creating sub-lists of any that are Anagrams.
Searching the entire English dictionary is prohibitively time consuming, so I'm curious if anyone has tips for reducing the computational complexity of the code.
def anogramtastic(anagrms):
    d = []
    e = []
    for j in range(len(anagrms)):
        if anagrms[j] in e:
            pass
        else:
            templist = []
            tester = anagrms[j]
            tester = list(tester)
            tester.sort()
            tester = ''.join(tester)
            for k in range(len(anagrms)):
                if k == j:
                    pass
                else:
                    testers = anagrms[k]
                    testers = list(testers)
                    testers.sort()
                    testers = ''.join(testers)
                    if testers == tester:
                        templist.append(anagrms[k])
                        e.append(anagrms[k])
            if len(templist) > 0:
                templist.append(anagrms[j])
                d.append(templist)
    d.sort(key=len, reverse=True)
    return d

print(anogramtastic(wordlist))
How about using a dictionary of frozensets? Frozensets are immutable, meaning you can hash them for constant lookup. And when it comes to anagrams, what makes two words anagrams of each other is that they have the same letters with the same count. So you can construct a frozenset of {(letter, count), ...} pairs, and hash these for efficient lookup.
Here's a quick little function to convert a word to a multiset using collections.Counter:
from collections import Counter, defaultdict

def word2multiset(word):
    return frozenset(Counter(word).items())
Now, given a list of words, populate your anagram dictionary like this:
list_of_words = [... ]
anagram_dict = defaultdict(set)
for word in list_of_words:
    anagram_dict[word2multiset(word)].add(word)
For example, when list_of_words = ['hello', 'olleh', 'test', 'apple'], this is the output of anagram_dict after a run of the loop above:
print(anagram_dict)

defaultdict(set,
            {frozenset({('e', 1), ('h', 1), ('l', 2), ('o', 1)}): {'hello', 'olleh'},
             frozenset({('e', 1), ('s', 1), ('t', 2)}): {'test'},
             frozenset({('a', 1), ('e', 1), ('l', 1), ('p', 2)}): {'apple'}})
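If you then want only the groups that actually contain anagrams (the analogue of your function's return value), filter on group size; a small sketch using the anagram_dict built above:
# keep only multisets that mapped to more than one word
anagram_groups = [group for group in anagram_dict.values() if len(group) > 1]
# [{'hello', 'olleh'}] for the example list_of_words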
Unless I'm misunderstanding the problem, simply grouping the words by sorting their characters should be an efficient solution -- as you've already realized. The trick is to avoid comparing every word to all the other ones. A dict with the char-sorted string as key will make finding the right group for each word fast; a lookup/insertion is O(1) on average.
#!/usr/bin/env python3
# coding=utf8

from sys import stdin

groups = {}
for line in stdin:
    w = line.strip()
    g = ''.join(sorted(w))
    if g not in groups:
        groups[g] = []
    groups[g].append(w)

for g, words in groups.items():
    if len(words) > 1:
        print('%2d %-20s' % (len(words), g), ' '.join(words))
Testing on my words file (99171 words), it seems to work well:
anagram$ wc /usr/share/dict/words
99171 99171 938848 /usr/share/dict/words
anagram$ time ./anagram.py < /usr/share/dict/words | tail
2 eeeprsw sweeper weepers
2 brsu burs rubs
2 aeegnrv avenger engrave
2 ddenoru redound rounded
3 aesy ayes easy yeas
2 gimnpu impugn umping
2 deeiinsst densities destinies
2 abinost bastion obtains
2 degilr girdle glider
2 orsttu trouts tutors
real 0m0.366s
user 0m0.357s
sys 0m0.012s
You can speed things up considerably by using a dictionary for checking membership instead of doing linear searches. The only "trick" is to devise a way to create keys for it such that it will be the same for anagrammatical words (and not for others).
In the code below this is being done by creating a sorted tuple from the letters in each word.
def anagramtastic(words):
    dct = {}
    for word in words:
        key = tuple(sorted(word))  # Identifier based on letters.
        dct.setdefault(key, []).append(word)
    # Return a list of all that had an anagram.
    return [words for words in dct.values() if len(words) > 1]

wordlist = ['act', 'cat', 'binary', 'brainy', 'case', 'aces',
            'aide', 'idea', 'earth', 'heart', 'tea', 'tee']

print('result:', anagramtastic(wordlist))
Output produced:
result: [['act', 'cat'], ['binary', 'brainy'], ['case', 'aces'], ['aide', 'idea'], ['earth', 'heart']]
I am trying to iterate over a list with duplicate values. 101 correctly gets 101.A and 101.B, but 102 starts from 102.C instead of 102.A.
import string

room_numbers = ['101','103','101','102','104','105','106','107','102','108']
door_numbers = []
num_count = 0

for el in room_numbers:
    if room_numbers.count(el) == 1:
        door_numbers.append("%s.%s" % (el, string.ascii_uppercase[0]))
    elif room_numbers.count(el) > 1:
        door_numbers.append("%s.%s" % (el, string.ascii_uppercase[num_count]))
        num_count += 1
door_numbers = ['101.A','103.A','101.B','102.C','104.A',
'105.A','106.A','107.A','102.D','108.A']
Given
import string
import itertools as it
import collections as ct
room_numbers = ['101','103','101','102','104','105','106','107','102','108']
letters = string.ascii_uppercase
Code
Simple, Two-Line Solution
dd = ct.defaultdict(it.count)
print([".".join([room, letters[next(dd[room])]]) for room in room_numbers])
or
dd = ct.defaultdict(lambda: iter(letters))
print([".".join([room, next(dd[room])]) for room in room_numbers])
Output
['101.A', '103.A', '101.B', '102.A', '104.A', '105.A', '106.A', '107.A', '102.B', '108.A']
Details
In the first example we are using itertools.count as a default factory. This means that a new count() iterator is made whenever a new room number is added to the defaultdict dd. Iterators are useful because they are lazily evaluated and memory efficient.
In the list comprehension, these iterators get initialized per room number. The next number of the counter is yielded, the number is used as an index to get a letter, and the result is simply joined as a suffix to each room number.
In the second example (recommended), we use an iterator of strings as the default factory. The callable requirement is satisfied by returning the iterator in a lambda function. An iterator of strings enables us to simply call next() and directly get the next letter. Consequently, the comprehension is simplified since slicing letters is no longer required.
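As a quick illustration of why this works (hypothetical keys, not part of the solution), each distinct key gets its own independent iterator:
dd = ct.defaultdict(it.count)
next(dd['101'])  # 0 -> letters[0] == 'A'
next(dd['101'])  # 1 -> letters[1] == 'B'
next(dd['102'])  # 0 again: a fresh count() is created for the new key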
The problem in your implementation is that num_count is incremented for every duplicate item in the list, rather than being tracked per room number. What you have to do instead is count the number of times each specific room has occurred so far.
Pseudocode would be
1. For each room in room numbers
2. Add the room to a list of visited rooms
3. Count the number of times the room number is available in visited room
4. Add the count to 64 and convert it to an ascii uppercase character where 65=A
5. Join the required strings in the way you want to and then append it to the door_numbers list.
Here's an implementation
import string

room_numbers = ['101','103','101','102','104','105','106','107','102','108']
door_numbers = []
visited_rooms = []

for room in room_numbers:
    visited_rooms.append(room)
    room_count = visited_rooms.count(room)
    door_value = chr(64 + room_count)  # Since 65 = 'A' when the 1st item is present
    door_numbers.append("%s.%s" % (room, door_value))
door_numbers now contains the final list you're expecting for the given room_numbers input:
['101.A', '103.A', '101.B', '102.A', '104.A', '105.A', '106.A', '107.A', '102.B', '108.A']
The naive way, simply count the number of times the element is contained in the list up until that index:
>>> door_numbers = []
>>> for i in xrange(len(room_numbers)):
...     el = room_numbers[i]
...     n = 0
...     for j in xrange(0, i):
...         n += el == room_numbers[j]
...     c = string.ascii_uppercase[n]
...     door_numbers.append("{}.{}".format(el, c))
...
>>> door_numbers
['101.A', '103.A', '101.B', '102.A', '104.A', '105.A', '106.A', '107.A', '102.B', '108.A']
These two explicit for-loops make the quadratic complexity obvious: (1/2) * N * (N-1) iterations are made. In most cases you'd be better off keeping a dict of counts instead of re-counting each time.
>>> door_numbers = []
>>> counts = {}
>>> for el in room_numbers:
...     count = counts.get(el, 0)
...     c = string.ascii_uppercase[count]
...     counts[el] = count + 1
...     door_numbers.append("{}.{}".format(el, c))
...
>>> door_numbers
['101.A', '103.A', '101.B', '102.A', '104.A', '105.A', '106.A', '107.A', '102.B', '108.A']
That way, there's no messing around with indices, and it's more time efficient (at the expense of auxiliary space).
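If you prefer, collections.Counter handles the same bookkeeping, since missing keys default to 0; a small sketch of the equivalent loop under the same room_numbers input:
from collections import Counter
import string

counts = Counter()
door_numbers = []
for el in room_numbers:
    # a missing key yields 0, so the first door for each room gets 'A'
    door_numbers.append("{}.{}".format(el, string.ascii_uppercase[counts[el]]))
    counts[el] += 1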
Using iterators and comprehensions:
Enumerate the rooms to preserve the original order
Group rooms by room number, sorting first as required by groupby()
For each room in a group, append .A, .B, etc.
Sort by the enumeration values from step 1 to restore the original order
Extract the door numbers, e.g. '101.A'
#!/usr/bin/env python3

import operator
from itertools import groupby
import string

room_numbers = ['101', '103', '101', '102', '104',
                '105', '106', '107', '102', '108']

get_room_number = operator.itemgetter(1)

enumerated_and_sorted = sorted(list(enumerate(room_numbers)),
                               key=get_room_number)
# [(0, '101'), (2, '101'), (3, '102'), (8, '102'), (1, '103'),
#  (4, '104'), (5, '105'), (6, '106'), (7, '107'), (9, '108')]

grouped_by_room = groupby(enumerated_and_sorted, key=get_room_number)
# [('101', [(0, '101'), (2, '101')]),
#  ('102', [(3, '102'), (8, '102')]),
#  ('103', [(1, '103')]),
#  ('104', [(4, '104')]),
#  ('105', [(5, '105')]),
#  ('106', [(6, '106')]),
#  ('107', [(7, '107')]),
#  ('108', [(9, '108')])]

door_numbers = ((order, '{}.{}'.format(room, char))
                for _, room_list in grouped_by_room
                for (order, room), char in zip(room_list,
                                               string.ascii_uppercase))
# [(0, '101.A'), (2, '101.B'), (3, '102.A'), (8, '102.B'),
#  (1, '103.A'), (4, '104.A'), (5, '105.A'), (6, '106.A'),
#  (7, '107.A'), (9, '108.A')]

door_numbers = [room for _, room in sorted(door_numbers)]
# ['101.A', '103.A', '101.B', '102.A', '104.A',
#  '105.A', '106.A', '107.A', '102.B', '108.A']
I am trying to take the Spark word count example and aggregate word counts by some other value (for example, words and counts by person, where person is "VI" or "MO" in the case below).
I have an RDD which is a list of tuples whose values are lists of tuples:
from operator import add
reduced_tokens = tokenized.reduceByKey(add)
reduced_tokens.take(2)
Which gives me:
[(u'VI', [(u'word1', 1), (u'word2', 1), (u'word3', 1)]),
(u'MO',
[(u'word4', 1),
(u'word4', 1),
(u'word5', 1),
(u'word8', 1),
(u'word10', 1),
(u'word1', 1),
(u'word4', 1),
(u'word6', 1),
(u'word9', 1),
...
)]
I want something like:
[
('VI',
[(u'word1', 1), (u'word2', 1), (u'word3', 1)],
('MO',
[(u'word4', 58), (u'word8', 2), (u'word9', 23) ...)
]
Similar to the word count example here, I would like to be able to filter out words with a count below some threshold for some person. Thanks!
The keys that you're trying to reduce across are (name, word) pairs, not just names. So you need to do a .map step to fix up your data:
from operator import add

def key_by_name_word(record):
    name, (word, count) = record
    return (name, word), count

tokenized_by_name_word = tokenized.map(key_by_name_word)
counts_by_name_word = tokenized_by_name_word.reduceByKey(add)
This should give you
[
(('VI', 'word1'), 1),
(('VI', 'word2'), 1),
(('VI', 'word3'), 1),
(('MO', 'word4'), 58),
...
]
To get it into exactly the same format you mentioned, you can then do:
def key_by_name(record):
    # inverse of key_by_name_word, except the value is wrapped in a
    # list so that reduceByKey(add) concatenates the (word, count) pairs
    (name, word), count = record
    return name, [(word, count)]

output = counts_by_name_word.map(key_by_name).reduceByKey(add)
But it might actually be easier to work with the data in the flat format that counts_by_name_word is in.
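For instance, the threshold filter you asked about is a one-liner on the flat format; a small sketch, assuming counts_by_name_word from above and an arbitrary threshold of 5:
# keep only ((name, word), count) records whose count is at least 5
frequent = counts_by_name_word.filter(lambda record: record[1] >= 5)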
For completeness, here is how I solved each part of the question:
Ask 1: Aggregate word counts by some key
import re

def restructure_data(name_and_freetext):
    name = name_and_freetext[0]
    # replace the listed punctuation characters with spaces, then split on whitespace
    tokens = re.sub('[&|/|\d{4}|\.|\,|\:|\-|\(|\)|\+|\$|\!]', ' ', name_and_freetext[1]).split()
    return [((name, token), 1) for token in tokens]

filtered_data = data.filter((data.flag==1)).select('name', 'item')
tokenized = filtered_data.rdd.flatMap(restructure_data)
Ask 2: Filter out words with a count below some threshold:
from operator import add

# keep words which have counts >= 5
counts_by_name_word = tokenized.reduceByKey(add).filter(lambda x: x[1] >= 5)

# map filtered word counts into a list by key so we can sort them
restruct = counts_by_name_word.map(lambda x: (x[0][0], [(x[0][1], x[1])]))
Bonus: Sort words from most frequent to least frequent
# sort the word counts from most frequent to least frequent words
output = restruct.reduceByKey(add).map(lambda x: (x[0], sorted(x[1], key=lambda y: y[1], reverse=True))).collect()
I'm trying to count the occurrence of each character for any given string input; the occurrences must be output in ascending order (including numbers and exclamation marks).
I have this for my code so far. I'm aware of the Counter function, but it doesn't output the answer in the format I'd like, and I don't know how to format Counter. Instead I'm trying to find a way to use count() to count each character. I've also seen the dictionary approach, but I'm hoping there is an easier way to do it with count().
from collections import Counter

sentence = input("Enter a sentence b'y: ")
lowercase = sentence.lower()
list1 = list(lowercase)
list1.sort()
length = len(list1)
list2 = list1.count(list1)
print(list2)
p = Counter(list1)
print(p)
collections.Counter objects provide a most_common() method that returns a list of tuples in decreasing frequency. So, if you want it in ascending frequency, reverse the list:
from collections import Counter
sentence = input("Enter a sentence: ")
c = Counter(sentence.lower())
result = reversed(c.most_common())
print(list(result))
Demo run
Enter a sentence: Here are 3 sentences. This is the first one. Here is the second. The end!
[('a', 1), ('!', 1), ('3', 1), ('f', 1), ('d', 2), ('o', 2), ('c', 2), ('.', 3), ('r', 4), ('i', 4), ('n', 5), ('t', 6), ('h', 6), ('s', 7), (' ', 14), ('e', 14)]
Just call .most_common and reverse the output with reversed to get the output from least to most common:
from collections import Counter

sentence = "foobar bar"
lowercase = sentence.lower()

for k, count in reversed(Counter(lowercase).most_common()):
    print(k, count)
If you just want to format the Counter output differently:
for key, value in Counter(list1).items():
    print('%s: %s' % (key, value))
Your best bet is to use Counter (which does work on a string) and then sort its output.
from collections import Counter

sentence = input("Enter a sentence b'y: ")
lowercase = sentence.lower()

# Counter will work on strings
p = Counter(lowercase)
count = p.items()
# count is now (more or less) equivalent to
# [('a', 1), ('r', 1), ('b', 1), ('o', 2), ('f', 1)]

# And now you can run your sort
sorted_count = sorted(count)
# Which will sort by the letter. If you wanted to
# sort by quantity, tell the sort to use the
# second element of the tuple by setting key:
# sorted_count = sorted(count, key=lambda x: x[1])

for letter, count in sorted_count:
    # will cycle through in order of letters.
    # format as you wish
    print(letter, count)
Another way to avoid using Counter.
sentence = 'abc 11 222 a AAnn zzz?? !'
list1 = list(sentence.lower())
# If you want to remove the spaces:
# list1 = list(sentence.replace(" ", ""))

# Remove duplicate characters from the string
sentence = ''.join(set(list1))

counts = {}
for char in sentence:
    counts[char] = list1.count(char)

for item in sorted(counts.items(), key=lambda x: x[1]):
    print('Number of Occurrences of %s is %d.' % (item[0], item[1]))
Output:
Number of Occurrences of c is 1.
Number of Occurrences of b is 1.
Number of Occurrences of ! is 1.
Number of Occurrences of n is 2.
Number of Occurrences of 1 is 2.
Number of Occurrences of ? is 2.
Number of Occurrences of 2 is 3.
Number of Occurrences of z is 3.
Number of Occurrences of a is 4.
Number of Occurrences of   is 6.
One way to do this would be by removing instances of your substring and looking at the length:
def nofsub(s, ss):
    return (len(s) - len(s.replace(ss, ""))) // len(ss)
Alternatively you could use re (regular expressions):
import re

def nofsub(s, ss):
    return len(re.findall(re.compile(ss), s))
Finally you could count them manually:
def nofsub(s, ss):
    return len([k for n, k in enumerate(s) if s[n:n+len(ss)] == ss])
Test any of the three with...
>>> nofsub("asdfasdfasdfasdfasdf",'asdf')
5
Now that you can count any given character you can iterate through your string's unique characters and apply a counter for each unique character you find. Then sort and print the result.
def countChars(s):
    s = s.lower()
    d = {}
    for k in set(s):
        d[k] = nofsub(s, k)
    for key, value in sorted(d.items(), key=lambda kv: (kv[1], kv[0])):
        print("%s: %s" % (key, value))
You could use the list function to break the words apart:
from collections import Counter

sentence = raw_input("Enter a sentence b'y: ")
lowercase = sentence.lower()
list1 = list(lowercase)
list(list1)
length = len(list1)
list2 = list1.count(list1)
print(list2)
p = Counter(list1)
print(p)