How can it search inside the dictionary, if the dictionary is empty? - python

I just started learning python and found this snippet. It's supposed to count how many times a word appears. I guess, for all of you this will seem very logical, but unfortunately for me, it doesn't make any sense.
str = "house is where you live, you don't leave the house."
dict = {}
list = str.split(" ")
for word in list: # Loop over the list
    if word in dict: # How can I loop over the dictionary if it's empty?
        dict[word] = dict[word] + 1
    else:
        dict[word] = 1
So, my question is: how can I loop over the dictionary? Shouldn't the dictionary be empty, because I didn't put anything inside it?
Maybe I am not smart enough, but I don't see the logic. Can anybody explain to me how it works?
Many thanks

As somebody else pointed out, str, dict, and list shouldn't be used as variable names, because they are the names of Python's built-in types. For example, str(33) turns the number 33 into the string "33". Python lets you reassign these names, but doing so hides (shadows) the built-ins, so a later call such as str(33) would fail. To avoid that, you really should use something else. So here's the same code with different variable names, plus some print statements at the end of the loop:
mystring = "house is where you live, you don't leave the house."
mydict = {}
mylist = mystring.split(" ")
for word in mylist: # Loop over the list
    if word in mydict:
        mydict[word] = mydict[word] + 1
    else:
        mydict[word] = 1
    print("\nmydict is now:")
    print(mydict)
If you run this, you'll get the following output:
mydict is now:
{'house': 1}
mydict is now:
{'house': 1, 'is': 1}
mydict is now:
{'house': 1, 'is': 1, 'where': 1}
mydict is now:
{'house': 1, 'is': 1, 'where': 1, 'you': 1}
mydict is now:
{'house': 1, 'is': 1, 'live,': 1, 'where': 1, 'you': 1}
mydict is now:
{'house': 1, 'is': 1, 'live,': 1, 'where': 1, 'you': 2}
mydict is now:
{"don't": 1, 'house': 1, 'is': 1, 'live,': 1, 'you': 2, 'where': 1}
mydict is now:
{"don't": 1, 'house': 1, 'is': 1, 'live,': 1, 'leave': 1, 'you': 2, 'where': 1}
mydict is now:
{"don't": 1, 'house': 1, 'is': 1, 'live,': 1, 'leave': 1, 'you': 2, 'where': 1, 'the': 1}
mydict is now:
{"don't": 1, 'house': 1, 'is': 1, 'live,': 1, 'house.': 1, 'leave': 1, 'you': 2, 'where': 1, 'the': 1}
So mydict is indeed updating with every word it finds. This should also give you a better idea of how dictionaries work in Python.
To be clear, you're not "looping" over the dictionary. The for statement starts a loop; if word in mydict: isn't a loop, just a membership test. It checks whether mydict has a key equal to word. On an empty dictionary that test is simply False (no error is raised), so the else branch runs and adds the word with a count of 1.
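A quick way to convince yourself is to try the membership test on an empty dictionary. This is a minimal sketch; the name counts is only an illustration and is not part of the original snippet.

```python
counts = {}                       # empty dictionary, like mydict before the loop

first_check = "house" in counts   # False: there are no keys yet, and no error is raised
counts["house"] = 1               # this is what the else branch does

second_check = "house" in counts  # True: the key exists now
counts["house"] = counts["house"] + 1  # this is what the if branch does

print(first_check, second_check, counts)  # False True {'house': 2}
```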
Also, note that since you only split your sentence on spaces, your list of words includes, for example, both "house" and "house.". Since these two don't exactly match, they're treated as two different words, which is why you see 'house': 1 and 'house.': 1 in your dictionary instead of 'house': 2.
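If you want "house" and "house." to count as the same word, one option is to trim punctuation from the ends of each word before counting. The sketch below assumes that is the behaviour you want; count_words is a hypothetical helper name, and lowercasing the words is an extra assumption.

```python
import string

def count_words(sentence):
    """Count words, ignoring case and leading/trailing punctuation."""
    counts = {}
    for raw in sentence.split():
        # strip() only trims the ends, so the apostrophe in "don't" survives
        word = raw.strip(string.punctuation).lower()
        if not word:  # the token was nothing but punctuation
            continue
        counts[word] = counts.get(word, 0) + 1
    return counts

result = count_words("house is where you live, you don't leave the house.")
print(result["house"], result["you"])  # 2 2
```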

Dictionary with a query of sets in python

I am trying to get the position of each word in a list and store it in a dictionary that has the word as the key and, as the value, a set of the indices of the list items in which the word appears.
list_x = ["this is the first", "this is the second"]
my_dict = {}
for i in range(len(list_x)):
    for x in list_x[i].split():
        if x in my_dict:
            my_dict[x] += 1
        else:
            my_dict[x] = 1
print(my_dict)
This is the code I tried, but it gives me the total number of times each word appears in the list.
What I am trying to get is this format:
{'this': {0, 1}, 'is': {0, 1}, 'the': {0, 1}, 'first': {0}, 'second': {1}}
As you can see, 'this' is a key, and it appears once in position 0 and once in position 1. Any idea how I might get to this point?
Fixed two lines:
list_x = ["this is the first", "this is the second"]
my_dict = {}
for i in range(len(list_x)):
    for x in list_x[i].split():
        if x in my_dict:
            my_dict[x].append(i)
        else:
            my_dict[x] = [i]
print(my_dict)
Returns:
{'this': [0, 1], 'is': [0, 1], 'the': [0, 1], 'first': [0], 'second': [1]}
Rather than storing integer counts in your dict, you should store a set of positions for each word:
for i in range(len(list_x)):
    for x in list_x[i].split():
        if x in my_dict:
            my_dict[x].add(i)
        else:
            my_dict[x] = set([i])
Or, more briefly,
for i in range(len(list_x)):
    for x in list_x[i].split():
        my_dict.setdefault(x, set()).add(i)
You can also do this with defaultdict and enumerate:
from collections import defaultdict
list_x = ["this is the first",
          "this is the second",
          "third is this"]
pos = defaultdict(set)
for i, sublist in enumerate(list_x):
    for word in sublist.split():
        pos[word].add(i)
Output:
>>> from pprint import pprint
>>> pprint(dict(pos))
{'first': {0},
 'is': {0, 1, 2},
 'second': {1},
 'the': {0, 1},
 'third': {2},
 'this': {0, 1, 2}}
The purpose of enumerate is to provide the index (position) of each string within list_x. For each word encountered, the position of its sentence within list_x will be added to the set for its corresponding key in the result, pos.
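Wrapped up as a function, the same idea might look like the sketch below. word_positions is a hypothetical name, and converting the result back to a plain dict is only a convenience, so that looking up a missing word raises KeyError instead of silently creating an empty set.

```python
from collections import defaultdict

def word_positions(sentences):
    """Map each word to the set of indices of the sentences that contain it."""
    pos = defaultdict(set)
    for i, sentence in enumerate(sentences):  # i is the sentence's position
        for word in sentence.split():
            pos[word].add(i)                  # a missing key starts as an empty set
    return dict(pos)

positions = word_positions(["this is the first", "this is the second"])
print(positions["this"], positions["second"])  # {0, 1} {1}
```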

Finding an unknown pattern in a string python

I am well aware of the following question, which also exists on Stack Overflow (String Unknown pattern Matching), but the answer there doesn't really work for me.
My problem is as follows. I get a string of characters, e.g.:
1. '1211', where what I need to see is that 1 is most often repeated, and this 2 times in a row.
2. But it can also be "121212112", where 12 is repeated 3 times in a row.
3. But with 12221221 it is 221 that is repeated 2 times rather than 2 that repeats 3 times.
Here are some results I would like to get (the only digits ever used are 1 and 2):
>>> counter('1211')
1
>>> counter('1212')
2
>>> counter('21212')
2
The outcome I want is how many times it occurs.
I have no idea how to even start looking for a pattern, since it is not known beforehand. I did some research online and didn't find anything similar.
Does anyone have any idea how to start tackling this problem? All help is welcome, and if you want more information, don't hesitate to let me know.
Really inefficient, but you can:
1. find all substrings (https://stackoverflow.com/a/22470047/264596)
2. put them into a set to avoid duplicates
3. for each substring, find all its occurrences, and use some function to find the max (I am not sure how you choose between short strings occurring many times and long strings occurring few times)
Obviously you can use some data structure to pass through the string once and do some counting on the way, but since I am not sure what your constraints and desired output are, I can give you only this.
I agree with Jirka, not sure how you score long vs short to select the optimal results but this function will give you the menu:
#Func1
def sub_string_cts(string):
    combos = {}
    for i in range(len(string)):
        u_start = len(string) - i
        for start in range(u_start):
            c_str = string[start:i+start+1]
            if c_str in combos:
                combos[c_str] += 1
            else:
                combos[c_str] = 1
    return combos
sub_string_cts('21212')
{'2': 3,
'1': 2,
'21': 2,
'12': 2,
'212': 2,
'121': 1,
'2121': 1,
'1212': 1,
'21212': 1}
After your comment I think this is more what you're looking for:
#Func2
import re

def sub_string_cts(string):
    combos = {}
    for i in range(len(string)):
        u_start = len(string) - i
        substrs = set([string[start:i+start+1] for start in range(u_start)])
        for substring in substrs:
            # Find runs of back-to-back copies; re.escape keeps special characters literal
            runs = re.findall("((?:{})+)".format(re.escape(substring)), string)
            combos[substring] = max([len(run) for run in runs])//len(substring)
    return combos
sub_string_cts('21212')
{'2': 1,
'1': 1,
'21': 2,
'12': 2,
'212': 1,
'121': 1,
'2121': 1,
'1212': 1,
'21212': 1}
You could narrow that down to the 'best' candidates by collapsing on the highest occurring instance of each string length:
def max_by_len(result_dict):
    results = {}
    for k, v in result_dict.items():
        if len(k) not in results:
            results[len(k)] = {}
    for c_len in [ln for ln in results]:
        len_max_count = max([v for (k, v) in result_dict.items() if len(k) == c_len])
        for k,v in result_dict.items():
            if len(k) == c_len:
                if v == len_max_count:
                    results[c_len][k] = v
    return results
#Func1:
max_by_len(sub_string_cts('21212'))
{1: {'2': 3},
2: {'21': 2, '12': 2},
3: {'212': 2},
4: {'2121': 1, '1212': 1},
5: {'21212': 1}}
#Func2:
max_by_len(sub_string_cts('21212'))
{1: {'2': 1, '1': 1},
2: {'21': 2, '12': 2},
3: {'212': 1, '121': 1},
4: {'2121': 1, '1212': 1},
5: {'21212': 1}}
Assuming we wouldn't select '2121' or '1212' because their occurrence matches '21212' and they're shorter in length, and that similarly we wouldn't select '21' or '12' as they occur at the same frequency as '212' we could limit our viable candidates down to '2', '212', and '21212' with the following code:
def remove_lesser_patterns(result_dict):
    len_lst = sorted([k for k in result_dict], reverse=True)
    #len_lst = sorted([k for k in max_len_results])
    len_crosswalk = {i_len: max([v for (k,v) in result_dict[i_len].items()]) for i_len in len_lst}
    for i_len in len_lst[:-1]:
        eval_lst = [i for i in len_lst if i < i_len]
        for i in eval_lst:
            if len_crosswalk[i] <= len_crosswalk[i_len]:
                if i in result_dict:
                    del result_dict[i]
    return result_dict
#Func1
remove_lesser_patterns(max_by_len(sub_string_cts('21212')))
{1: {'2': 3}, 3: {'212': 2}, 5: {'21212': 1}}
#Func2
remove_lesser_patterns(max_by_len(sub_string_cts('21212')))
{2: {'21': 2, '12': 2}, 5: {'21212': 1}}
results:
test_string = ["1211", "1212", "21212", "12221221"]
for string in test_string:
    print("<Input: '{}'".format(string))
    c_answer = remove_lesser_patterns(max_by_len(sub_string_cts(string)))
    print("<Output: {}\n".format(c_answer))
<Input: '1211'
<Output: {1: {'1': 2}, 4: {'1211': 1}}
# '1' is repeated twice
<Input: '1212'
<Output: {2: {'12': 2}, 4: {'1212': 1}}
# '12' is repeated twice
<Input: '21212'
<Output: {2: {'21': 2, '12': 2}, 5: {'21212': 1}}
# '21' and '12' are both repeated twice
<Input: '12221221'
<Output: {1: {'2': 3}, 3: {'221': 2}, 8: {'12221221': 1}}
# '2' is repeated 3 times, '221' is repeated twice
These functions together give you the highest occurrence of each pattern by length. The key for the dictionary is the length, with a sub-dictionary of the highest (multiple if tied) occurring patterns.
Func2 requires the patterns be sequential, whereas Func1 does not -- it is strictly occurrence based.
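To make that difference concrete, here is a small sketch of the two ways of counting for a single pattern. The helper names count_anywhere and max_consecutive_repeats are made up for this illustration; the regular expression is the one used in Func2, with re.escape added as a precaution so that special characters in the pattern are matched literally.

```python
import re

def count_anywhere(pattern, text):
    """Occurrences of pattern anywhere in text, overlaps included (how Func1 counts)."""
    last_start = len(text) - len(pattern)
    return sum(text.startswith(pattern, i) for i in range(last_start + 1))

def max_consecutive_repeats(pattern, text):
    """Length, in copies, of the longest back-to-back run of pattern (how Func2 counts)."""
    runs = re.findall("((?:{})+)".format(re.escape(pattern)), text)
    return max((len(run) for run in runs), default=0) // len(pattern)

print(count_anywhere("2", "21212"), max_consecutive_repeats("2", "21212"))            # 3 1
print(count_anywhere("221", "12221221"), max_consecutive_repeats("221", "12221221"))  # 2 2
```

Both helpers assume a non-empty pattern.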
Note:
With your example:
3. But with 12221221 it is 221 that is repeated 2 times rather than 2 that repeats 3 times.
the code solves this ambiguity in your desired output (2 or 3) by giving you both:
<Input: '12221221'
<Output: {1: {'2': 3}, 3: {'221': 2}, 8: {'12221221': 1}}
# '2' is repeated 3 times, '221' is repeated twice
If you're only interested in the 2 char lengths you can easily pull those out of the max_by_len results as follows:
test_string = ["1211", "1212", "21212", "12221221"]
for string in test_string:
    print("<Input: '{}'".format(string))
    c_answer = remove_lesser_patterns({k:v for (k,v) in max_by_len(sub_string_cts(string)).items() if k == 2})
    print("<Output: {}\n".format(max([v for (k,v) in c_answer[2].items()])))
#Func2
<Input: '1211'
<Output: 1
<Input: '1212'
<Output: 2
<Input: '21212'
<Output: 2
<Input: '12221221'
<Output: 1

Python increment values in a dictionary

I am trying to count every word in my text files and add each word and its count to a dictionary as key-value pairs. It throws this error:
if key not in wordDict:
TypeError: unhashable type: 'list'
Also, I am wondering if .split() is good enough, because my text files contain different punctuation marks.
fileref = open(mypath + '/' + i, 'r')
wordDict = {}
for line in fileref.readlines():
    key = line.split()
    if key not in wordDict:
        wordDict[key] = 1
    else:
        wordDict[key] += 1
from collections import Counter
text = '''I am trying to count every word from text files and appending the word and count to a dictionary as the key-value pairs. It throws me this error: if key not in wordDict: TypeError: unhashable type: 'list' Also, I am wondering of .split() is good because my text files contain different punctuation marks. Thanks ahead for those who help!'''
split_text = text.split()
counter = Counter(split_text)
print(counter)
out:
Counter({'count': 2, 'and': 2, 'text': 2, 'to': 2, 'I': 2, 'files': 2, 'word': 2, 'am': 2, 'the': 2, 'dictionary': 1, 'a': 1, 'not': 1, 'in': 1, 'ahead': 1, 'me': 1, 'trying': 1, 'every': 1, '.split()': 1, 'type:': 1, 'my': 1, 'punctuation': 1, 'is': 1, 'key': 1, 'error:': 1, 'help!': 1, 'those': 1, 'different': 1, 'throws': 1, 'TypeError:': 1, 'contain': 1, 'wordDict:': 1, 'appending': 1, 'if': 1, 'It': 1, 'Also,': 1, 'unhashable': 1, 'from': 1, 'because': 1, 'marks.': 1, 'pairs.': 1, 'this': 1, 'key-value': 1, 'wondering': 1, 'Thanks': 1, 'of': 1, 'good': 1, "'list'": 1, 'for': 1, 'who': 1, 'as': 1})
key is a list of space-delimited words found in the current line. You would need to iterate over that list as well.
for line in fileref:
    keys = line.split()
    for key in keys:
        if key not in wordDict:
            wordDict[key] = 1
        else:
            wordDict[key] += 1
This can be cleaned up considerably by either using the setdefault method or a defaultdict from the collections module; both allow you to avoid explicitly checking for a key by automatically adding the key with an initial value if it isn't already in the dict.
for key in keys:
    # A method call cannot be the target of +=, so assign the result back explicitly
    wordDict[key] = wordDict.setdefault(key, 0) + 1
or
from collections import defaultdict
wordDict = defaultdict(int) # Default to 0, since int() == 0
...
for key in keys:
    wordDict[key] += 1
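As a quick check, the sketch below counts the same words three ways (explicit membership test, dict.get with a default, and defaultdict) and confirms that they agree. The sample sentence and variable names are made up for the illustration.

```python
from collections import defaultdict

keys = "the cat and the hat and the bat".split()

# 1. Explicit membership test
explicit = {}
for key in keys:
    if key not in explicit:
        explicit[key] = 1
    else:
        explicit[key] += 1

# 2. dict.get returns 0 when the key is missing
with_get = {}
for key in keys:
    with_get[key] = with_get.get(key, 0) + 1

# 3. defaultdict(int) creates missing keys with the value 0
with_default = defaultdict(int)
for key in keys:
    with_default[key] += 1

print(explicit == with_get == dict(with_default))  # True
```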
key is a list, and you're trying to see if a list is in a dictionary, which is equivalent to checking whether it is one of the keys. Dictionary keys cannot be lists, hence the "unhashable type" error.
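The following sketch reproduces the error in isolation and shows that a tuple, which is hashable, is accepted where a list is not. The variable names are illustrative only.

```python
counts = {}
words = "hello world".split()  # a list: ['hello', 'world']

try:
    counts[words] = 1          # a list cannot be hashed, so this raises TypeError
    error_message = None
except TypeError as exc:
    error_message = str(exc)

counts[tuple(words)] = 1       # a tuple of strings is hashable, so this works

for word in words:             # usually what you want: one key per word
    counts[word] = counts.get(word, 0) + 1

print(error_message)
```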
str.split returns a list of words:
>>> "hello world".split()
['hello', 'world']
>>>
and lists, like Python's other built-in mutable containers (dicts and sets), are unhashable and cannot be used as dictionary keys, which is why you get the error TypeError: unhashable type: 'list'.
You need to iterate over the list to count each word separately. Also, the recommended way to work with a file is the with statement:
wordDict = {}
with open(mypath + '/' + i, 'r') as fileref:
    for line in fileref:
        for word in line.split():
            if word not in wordDict:
                wordDict[word] = 1
            else:
                wordDict[word] += 1
The above can be shortened by using Counter with a generator expression:
from collections import Counter
with open(mypath + '/' + i, 'r') as fileref:
    wordDict = Counter( word for line in fileref for word in line.split() )
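On the punctuation part of the question: .split() alone leaves tokens such as 'marks.' and 'Also,' in the counts, as the Counter output earlier in this thread shows. One possible approach, sketched here under the assumption that a word is a run of letters and apostrophes and that case should be ignored, is to extract words with a regular expression. count_words and the sample lines are hypothetical.

```python
import re
from collections import Counter

def count_words(lines):
    """Count lowercase words made of letters and apostrophes across all lines."""
    counts = Counter()
    for line in lines:
        counts.update(re.findall(r"[a-z']+", line.lower()))
    return counts

sample = ["Also, I am wondering.", "I am counting words, also!"]
word_counts = count_words(sample)
print(word_counts["also"], word_counts["wondering"])  # 2 1
```

Because an open file object yields its lines when iterated, the same function could be given fileref instead of a list.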

counting words from a dictionary?

My function is supposed to have:
One parameter as a tweet.
This tweet can involve numbers, words, hashtags, links and punctuations.
A second parameter is a dictionary that counts the words in that tweet, disregarding any hashtags, mentions, links, and punctuation included in it.
The function returns all individual words in the dictionary as lowercase letters without any punctuation.
If the tweet had Don't then the dictionary would count it as dont.
Here is my function:
def count_words(tweet, num_words):
    ''' (str, dict of {str: int}) -> None
    Return a NoneType that updates the count of words in the dictionary.
    >>> count_words('We have made too much progress', num_words)
    >>> num_words
    {'we': 1, 'have': 1, 'made': 1, 'too': 1, 'much': 1, 'progress': 1}
    >>> count_words("#utmandrew Don't you wish you could vote? #MakeAmericaGreatAgain", num_words)
    >>> num_words
    {'dont': 1, 'wish': 1, 'you': 2, 'could': 1, 'vote': 1}
    >>> count_words('I am fighting for you! #FollowTheMoney', num_words)
    >>> num_words
    {'i': 1, 'am': 1, 'fighting': 1, 'for': 1, 'you': 1}
    >>> count_words('', num_words)
    >>> num_words
    {'': 0}
    '''
I might misunderstand your question, but if you want to update the dictionary you can do it in this manner:
d = {}
def update_dict(tweet):
    for i in tweet.split():
        if i not in d:
            d[i] = 1
        else:
            d[i] += 1
    return d
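The snippet above counts every token exactly as it appears. To get closer to the behaviour described in the question, one could skip hashtags, mentions and links and remove punctuation before counting. This is only a sketch: treating any token that starts with '#', '@' or 'http' as something to ignore is an assumption, as is removing only ASCII punctuation.

```python
import string

def count_words(tweet, num_words):
    """Update num_words in place with the cleaned, lowercase words found in tweet."""
    for token in tweet.split():
        # Assumption: hashtags, mentions and links are recognisable by their prefix
        if token.startswith(("#", "@", "http")):
            continue
        # Drop every punctuation character, so "Don't" becomes "dont"
        word = "".join(ch for ch in token.lower() if ch not in string.punctuation)
        if word:
            num_words[word] = num_words.get(word, 0) + 1

num_words = {}
count_words("#utmandrew Don't you wish you could vote? #MakeAmericaGreatAgain", num_words)
print(num_words)  # {'dont': 1, 'you': 2, 'wish': 1, 'could': 1, 'vote': 1}
```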

How to pass a dictionary as value to a function in python

In python, I am using the mincemeat map-reduce framework
From my map function I would like to yield (k,v) in a loop, which would send the output to the reduce function (sample data given which is the output of my map function )
auth3 {'practical': 1, 'volume': 1, 'physics': 1}
auth34 {'practical': 1, 'volume': 1, 'chemistry': 1}
....
There would be many such entries; this is just a few as an example.
Here, auth3 and auth34 are keys and the respective values are dictionary items
Inside the reduce function when I try to print the key,values, I am getting "too many values to unpack" error. My reduce function looks like this
def reducefn(k, v):
    for k,val in (k,v):
        print k, v
Please let me know how to resolve this error.
First, define your dictionary with python built-in dict
>>> dic1 = dict(auth3 = {'practical': 1, 'volume': 1, 'physics': 1},
...             auth34 = {'practical': 1, 'volume': 1, 'chemistry': 1} )
>>> dic1
{'auth3': {'practical': 1, 'volume': 1, 'physics': 1},
'auth34': {'practical': 1, 'volume': 1, 'chemistry': 1}}
Then, your reduce function may go as
def reducefn(dictofdicts):
    for key, value in dictofdicts.iteritems() :
        print key, value
In the end,
>>> reducefn(dic1)
auth3 {'practical': 1, 'volume': 1, 'physics': 1}
auth34 {'practical': 1, 'volume': 1, 'chemistry': 1}
Use zip
def reducefn(k, v):
    for k,val in zip(k,v):
        print k, v
>>> reducefn({'practical': 1, 'volume': 1, 'physics': 1} ,{'practical': 1, 'volume': 1, 'chemistry': 1})
practical {'practical': 1, 'volume': 1, 'chemistry': 1}
volume {'practical': 1, 'volume': 1, 'chemistry': 1}
physics {'practical': 1, 'volume': 1, 'chemistry': 1}
>>>
reducefn(k,v): (k,v) constitutes a tuple of tuples ((k1,k2,k3..), (v1,v2,v3...)).
Zipping them gives you ((k1,v1), (k2,v2), (k3,v3)...), and that's what you want.
def reducefn(*dicts): #collects multiple arguments and stores in dicts
    for dic in dicts: #go over each dictionary passed in
        for k,v in dic.items(): #go over key,value pairs in the dic
            print(k,v)
reducefn({'practical': 1, 'volume': 1, 'physics': 1} ,{'practical': 1, 'volume': 1, 'chemistry': 1})
Produces
>>>
physics 1
practical 1
volume 1
chemistry 1
practical 1
volume 1
Now, regarding your implementation:
def reducefn(k, v):
The function signature above takes two arguments. The arguments passed to the function are accessed via k and v respectively. So an invocation of reducefn({"key1":"value"},{"key2":"value"}) results in k being assigned {"key1":"value"} and v being assigned {"key2":"value"}.
When you try to invoke it like so: reducefn(dic1,dic2,dic3,...) you are passing in more than the allowed number of parameters as defined by the declaration/signature of reducefn.
for k,val in (k,v):
Now, assuming you passed in two dictionaries to reducefn, both k and v would be dictionaries. The for loop above would be equivalent to:
>>> a = {"Name":"A"}
>>> b = {"Name":"B"}
>>> for (d1,d2) in (a,b):
...     print(d1,d2)
Which gives the following error:
ValueError: need more than 1 value to unpack
This occurs because you're essentially doing this when the for loop is invoked:
d1,d2=a
You can see we get this error when we try that in a REPL
>>> d1,d2=a
Traceback (most recent call last):
File "<pyshell#24>", line 1, in <module>
d1,d2=a
ValueError: need more than 1 value to unpack
We could do this:
>>> for (d1,d2) in [(a,b)]:
...     print(d1,d2)
{'Name': 'A'} {'Name': 'B'}
Which assigns the tuple (a,b) to d1,d2. This is called unpacking and would look like this:
d1,d2 = (a,b)
However, in our for loop for k,val in (k,v): it wouldn't make sense, as we would end up with k and val representing the same things as k and v did originally. Instead, we need to go over the key, value pairs in the dictionaries. But seeing as we need to cope with n dictionaries, we need to rethink the function definition.
Hence:
def reducefn(*dicts):
When you invoke the function like this:
reducefn({'physics': 1},{'volume': 1, 'chemistry': 1},{'chemistry': 1})
*dicts collects the arguments, in such a way that dicts ends up as:
({'physics': 1}, {'volume': 1, 'chemistry': 1}, {'chemistry': 1})
As you can see, the three dictionaries passed into the function were collected into a tuple. Now we iterate over the tuple:
for dic in dicts:
So now, on each iteration, dic is one of the dictionaries we passed in, so now we go ahead and print out the key,value pairs inside it:
for k,v in dic.items():
    print(k,v)
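If the goal is to combine the per-author dictionaries rather than just print them, the same iteration can feed a Counter. The sketch below is written for Python 3; merge_counts is a hypothetical helper, and it assumes the reduce step is handed a key plus a list of dictionaries for that key. Check how your framework actually calls reducefn before relying on that shape.

```python
from collections import Counter

def merge_counts(key, dicts):
    """Sum the word counts from several dictionaries that belong to one key."""
    total = Counter()
    for word_counts in dicts:      # each item is a dict such as {'practical': 1, ...}
        total.update(word_counts)  # update() adds counts; it does not replace them
    return dict(total)

merged = merge_counts("auth3", [
    {'practical': 1, 'volume': 1, 'physics': 1},
    {'practical': 1, 'volume': 1, 'chemistry': 1},
])
print(merged)  # {'practical': 2, 'volume': 2, 'physics': 1, 'chemistry': 1}
```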
