Dictionary with a query of sets in Python

I am trying to get the position of each word in a list of strings and store it in a dictionary whose keys are the words and whose values are sets of the list indices where each word appears.
list_x = ["this is the first", "this is the second"]
my_dict = {}
for i in range(len(list_x)):
    for x in list_x[i].split():
        if x in my_dict:
            my_dict[x] += 1
        else:
            my_dict[x] = 1
print(my_dict)
This is the code I tried, but it gives me the total number of times each word appears in the list.
What I am trying to get is this format:
{'this': {0, 1}, 'is': {0, 1}, 'the': {0, 1}, 'first': {0}, 'second': {1}}
As you can see, 'this' is a key, and it appears once in the string at position 0 and once in the string at position 1, and so on. Any idea how I might get to this point?

Fixed two lines:
list_x = ["this is the first", "this is the second"]
my_dict = {}
for i in range(len(list_x)):
    for x in list_x[i].split():
        if x in my_dict:
            my_dict[x].append(i)
        else:
            my_dict[x] = [i]
print(my_dict)
Returns:
{'this': [0, 1], 'is': [0, 1], 'the': [0, 1], 'first': [0], 'second': [1]}

Rather than using integers in your dict, you should use a set:
for i in range(len(list_x)):
    for x in list_x[i].split():
        if x in my_dict:
            my_dict[x].add(i)
        else:
            my_dict[x] = {i}
Or, more briefly,
for i in range(len(list_x)):
    for x in list_x[i].split():
        my_dict.setdefault(x, set()).add(i)

You can also do this with defaultdict and enumerate:
from collections import defaultdict
list_x = ["this is the first",
          "this is the second",
          "third is this"]
pos = defaultdict(set)
for i, sublist in enumerate(list_x):
    for word in sublist.split():
        pos[word].add(i)
Output:
>>> from pprint import pprint
>>> pprint(dict(pos))
{'first': {0},
 'is': {0, 1, 2},
 'second': {1},
 'the': {0, 1},
 'third': {2},
 'this': {0, 1, 2}}
The purpose of enumerate is to provide the index (position) of each string within list_x. For each word encountered, the position of its sentence within list_x will be added to the set for its corresponding key in the result, pos.
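Once such an index exists, the sets make it cheap to ask which strings contain several words at once. The following is a minimal sketch of that idea; build_positions and lines_with_all are hypothetical helper names introduced here for illustration, not part of the answer above.

```python
from collections import defaultdict

def build_positions(lines):
    """Map each word to the set of indices of the strings that contain it."""
    pos = defaultdict(set)
    for i, line in enumerate(lines):
        for word in line.split():
            pos[word].add(i)
    return pos

def lines_with_all(pos, *words):
    """Return the indices of the strings that contain every given word."""
    # pos.get avoids creating empty entries for unknown words in the defaultdict
    sets = [pos.get(word, set()) for word in words]
    return set.intersection(*sets) if sets else set()

list_x = ["this is the first", "this is the second", "third is this"]
pos = build_positions(list_x)
print(sorted(lines_with_all(pos, "this", "the")))  # [0, 1]
print(sorted(lines_with_all(pos, "third", "is")))  # [2]
```

Looking words up with pos.get rather than pos[word] is a deliberate choice here: indexing a defaultdict with an unknown word would silently add an empty set to the index.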

Related

index word in dictionary

I have a text file, and I want to store each word from it in a dictionary and then print out the index positions at which the word occurs in the file.
The code I have is only giving me the number of times the word is in the text file. How can I change this?
I have already converted the text to lowercase.
dicti = {}
for eachword in wordsintxt:
    freq = dicti.get(eachword, None)
    if freq == None:
        dicti[eachword] = 1
    else:
        dicti[eachword] = freq + 1
print(dicti)
Change your code to keep the indices themselves, rather than merely count them:
for index, eachword in enumerate(wordsintxt):
    freq = dicti.get(eachword, None)
    if freq is None:
        # first occurrence: start the list with this index so it is not lost
        dicti[eachword] = [index]
    else:
        dicti[eachword].append(index)
If you still need the word frequency, that's easy to recover:
freq = len(dicti[word])
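As a small worked example of recovering frequencies from stored indices, here is a sketch using a made-up word list. The sample words and the names word_indices and frequencies are illustrative assumptions, not taken from the question.

```python
wordsintxt = ["the", "cat", "saw", "the", "dog"]  # made-up sample data

word_indices = {}
for index, word in enumerate(wordsintxt):
    # setdefault creates the list the first time a word is seen
    word_indices.setdefault(word, []).append(index)

# The frequency of a word is the number of positions stored for it
frequencies = {word: len(indices) for word, indices in word_indices.items()}

print(word_indices)  # {'the': [0, 3], 'cat': [1], 'saw': [2], 'dog': [4]}
print(frequencies)   # {'the': 2, 'cat': 1, 'saw': 1, 'dog': 1}
```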
Update per OP comment
Without enumerate, simply provide that functionality yourself:
for index in range(len(wordsintxt)):
    eachword = wordsintxt[index]
I'm not sure why you'd want to do that; the operation is idiomatic and common enough that Python developers created enumerate for exactly that purpose.
You can use this:
wordsintxt = ["hello", "world", "the", "a", "Hello", "my", "name", "is", "the"]
words_data = {}
for i, word in enumerate(wordsintxt):
    word = word.lower()
    words_data[word] = words_data.get(word, {'freq': 0, 'indexes': []})
    words_data[word]['freq'] += 1
    words_data[word]['indexes'].append(i)
for k, v in words_data.items():
    print(k, '\t', v)
Which prints:
hello {'freq': 2, 'indexes': [0, 4]}
world {'freq': 1, 'indexes': [1]}
the {'freq': 2, 'indexes': [2, 8]}
a {'freq': 1, 'indexes': [3]}
my {'freq': 1, 'indexes': [5]}
name {'freq': 1, 'indexes': [6]}
is {'freq': 1, 'indexes': [7]}
You can avoid checking whether the key already exists in your dictionary before performing a custom action by using data[key] = data.get(key, STARTING_VALUE).
Greetings!
Use collections.defaultdict with enumerate, and append each index you get from enumerate:
from collections import defaultdict
with open('test.txt') as f:
    content = f.read()
words = content.split()
dd = defaultdict(list)
for i, v in enumerate(words):
    dd[v.lower()].append(i)
print(dd)
# defaultdict(<class 'list'>, {'i': [0, 6, 35, 54, 57], 'have': [1, 36, 58],... 'lowercase.': [62]})

Finding an unknown pattern in a string python

I am well aware of the following question, which also exists on Stack Overflow, "String Unknown pattern Matching", but the answer there doesn't really work for me.
My problem is the following. I get a string of characters, e.g.
1. '1211', and what I need to do is see that 1 is most often repeated, and this 2 times in a row.
2. But it can also be "121212112", where 12 is repeated 3 times in a row.
3. But with 12221221 it is 221 that is repeated 2 times rather than 2 that repeats 3 times.
Here are some results I would like to get (the only digits ever used are 1 and 2):
>>> counter('1211')
1
>>> counter('1212')
2
>>> counter('21212')
2
The outcome I want is how many times the pattern occurs.
I have no idea how to even start looking for a pattern, since it is not known beforehand, and I did some research online and didn't find anything similar.
Does anyone have any idea how I can even start to tackle this problem? All help is welcome, and if you want more information, don't hesitate to let me know.
Really inefficient, but you can:
1. find all substrings (https://stackoverflow.com/a/22470047/264596)
2. put them into a set to avoid duplicates
3. for each substring, find all its occurrences, and use some function to find the max (I am not sure how you choose between short strings occurring many times and long strings occurring few times)
Obviously you can use some data structure to pass through the string once and do some counting on the way, but since I am not sure what your constraints and desired output are, I can give you only this.
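The brute-force steps above can be sketched roughly as follows. This is only one possible reading: substring_counts is a hypothetical name, and counting overlapping occurrences is an assumption, since the answer does not say how occurrences should be counted.

```python
def substring_counts(s):
    """Count (possibly overlapping) occurrences of every distinct substring of s."""
    # Steps 1 and 2: collect all substrings in a set to drop duplicates
    substrings = {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}
    counts = {}
    for sub in substrings:
        # Step 3: check every start position, so overlapping matches are counted
        counts[sub] = sum(
            1 for k in range(len(s) - len(sub) + 1) if s.startswith(sub, k)
        )
    return counts

counts = substring_counts("21212")
print(counts["2"], counts["21"], counts["212"])  # 3 2 2
```

Choosing "the" answer from these counts still requires a scoring rule that weighs pattern length against frequency, which the question leaves open.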
I agree with Jirka; I'm not sure how you score long vs. short patterns to select the optimal result, but this function will give you the menu:
#Func1
def sub_string_cts(string):
    combos = {}
    for i in range(len(string)):
        u_start = len(string) - i
        for start in range(u_start):
            c_str = string[start:i+start+1]
            if c_str in combos:
                combos[c_str] += 1
            else:
                combos[c_str] = 1
    return combos
sub_string_cts('21212')
{'2': 3,
'1': 2,
'21': 2,
'12': 2,
'212': 2,
'121': 1,
'2121': 1,
'1212': 1,
'21212': 1}
After your comment I think this is more what you're looking for:
#Func2
import re

def sub_string_cts(string):
    combos = {}
    for i in range(len(string)):
        u_start = len(string) - i
        substrs = set([string[start:i+start+1] for start in range(u_start)])
        for substring in substrs:
            # longest run of back-to-back copies of substring, measured in copies
            combos[substring] = max([len(m) for m in re.findall("((?:{})+)".format(re.escape(substring)), string)])//len(substring)
    return combos
sub_string_cts('21212')
{'2': 1,
'1': 1,
'21': 2,
'12': 2,
'212': 1,
'121': 1,
'2121': 1,
'1212': 1,
'21212': 1}
You could narrow that down to the 'best' candidates by keeping only the highest-occurring pattern(s) of each string length:
def max_by_len(result_dict):
    results = {}
    for k, v in result_dict.items():
        if len(k) not in results:
            results[len(k)] = {}
    for c_len in [ln for ln in results]:
        len_max_count = max([v for (k, v) in result_dict.items() if len(k) == c_len])
        for k,v in result_dict.items():
            if len(k) == c_len:
                if v == len_max_count:
                    results[c_len][k] = v
    return results
#Func1:
max_by_len(sub_string_cts('21212'))
{1: {'2': 3},
2: {'21': 2, '12': 2},
3: {'212': 2},
4: {'2121': 1, '1212': 1},
5: {'21212': 1}}
#Func2:
max_by_len(sub_string_cts('21212'))
{1: {'2': 1, '1': 1},
2: {'21': 2, '12': 2},
3: {'212': 1, '121': 1},
4: {'2121': 1, '1212': 1},
5: {'21212': 1}}
Assuming we wouldn't select '2121' or '1212' because their occurrence count matches that of '21212' and they're shorter in length, and that similarly we wouldn't select '21' or '12' as they occur at the same frequency as '212', we could limit our viable candidates down to '2', '212', and '21212' with the following code:
def remove_lesser_patterns(result_dict):
    len_lst = sorted([k for k in result_dict], reverse=True)
    #len_lst = sorted([k for k in max_len_results])
    len_crosswalk = {i_len: max([v for (k,v) in result_dict[i_len].items()]) for i_len in len_lst}
    for i_len in len_lst[:-1]:
        eval_lst = [i for i in len_lst if i < i_len]
        for i in eval_lst:
            if len_crosswalk[i] <= len_crosswalk[i_len]:
                if i in result_dict:
                    del result_dict[i]
    return result_dict
#Func1
remove_lesser_patterns(max_by_len(sub_string_cts('21212')))
{1: {'2': 3}, 3: {'212': 2}, 5: {'21212': 1}}
#Func2
remove_lesser_patterns(max_by_len(sub_string_cts('21212')))
{2: {'21': 2, '12': 2}, 5: {'21212': 1}}
results:
test_string = ["1211", "1212", "21212", "12221221"]
for string in test_string:
    print("<Input: '{}'".format(string))
    c_answer = remove_lesser_patterns(max_by_len(sub_string_cts(string)))
    print("<Output: {}\n".format(c_answer))
<Input: '1211'
<Output: {1: {'1': 2}, 4: {'1211': 1}}
# '1' is repeated twice
<Input: '1212'
<Output: {2: {'12': 2}, 4: {'1212': 1}}
# '12' is repeated twice
<Input: '21212'
<Output: {2: {'21': 2, '12': 2}, 5: {'21212': 1}}
# '21' and '12' are both repeated twice
<Input: '12221221'
<Output: {1: {'2': 3}, 3: {'221': 2}, 8: {'12221221': 1}}
# '2' is repeated 3 times, '221' is repeated twice
These functions together give you the highest occurrence count of each pattern by length. The dictionary key is the pattern length, and its value is a sub-dictionary of the highest-occurring patterns (several if tied).
Func2 requires the repetitions to be consecutive, whereas Func1 does not; it is strictly occurrence-based.
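To make the difference between the two counting rules concrete, here is a small sketch for a single, known pattern. The helper names count_occurrences and max_consecutive_repeats are hypothetical, and counting overlapping matches in the first helper is an assumption chosen to mirror Func1.

```python
import re

def count_occurrences(pattern, text):
    """Count all (possibly overlapping) occurrences, in the spirit of Func1."""
    return sum(
        1 for k in range(len(text) - len(pattern) + 1) if text.startswith(pattern, k)
    )

def max_consecutive_repeats(pattern, text):
    """Longest run of back-to-back copies of pattern, in the spirit of Func2."""
    # re.escape keeps the pattern literal even if it contains regex metacharacters
    runs = re.findall("(?:{})+".format(re.escape(pattern)), text)
    return max((len(run) // len(pattern) for run in runs), default=0)

print(count_occurrences("2", "21212"))             # 3
print(max_consecutive_repeats("2", "21212"))       # 1
print(max_consecutive_repeats("221", "12221221"))  # 2
```

In '21212' the digit 2 occurs three times but never twice in a row, which is why the two rules disagree for that input.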
Note:
With your example:
3. But with 12221221 it is 221 that is repeated 2 times rather than 2 that repeats 3 times.
the code solves this ambiguity in your desired output (2 or 3) by giving you both:
<Input: '12221221'
<Output: {1: {'2': 3}, 3: {'221': 2}, 8: {'12221221': 1}}
# '2' is repeated 3 times, '221' is repeated twice
If you're only interested in the 2-character patterns, you can easily pull those out of the max_by_len results as follows:
test_string = ["1211", "1212", "21212", "12221221"]
for string in test_string:
    print("<Input: '{}'".format(string))
    c_answer = remove_lesser_patterns({k:v for (k,v) in max_by_len(sub_string_cts(string)).items() if k == 2})
    print("<Output: {}\n".format(max([v for (k,v) in c_answer[2].items()])))
#Func2
<Input: '1211'
<Output: 1
<Input: '1212'
<Output: 2
<Input: '21212'
<Output: 2
<Input: '12221221'
<Output: 1

How do I filter a dictionary based on the partial string matches?

I have a dictionary of values:
dic = {1: "a1+b+c", 2: "a1+c+v", 3: "a1+z+e", 4: "a2+p+a", 5: "a2+z+v", 6: "a3+q+v", ...}
I have a page in Flask that has a checkbox for each partial string value in the dictionary, e.g. checkboxes "a", "b", "c", etc. On the page, the checkboxes are arranged in groups a1, a2, a3, etc.
I need to filter the dictionary by the partial values based on the values of the selected checkboxes, for example, when selecting "c" in group a1, it would return:
1: a1+b+c
2: a1+c+v
When selecting "z" from group a2, it would return:
5: "a2+z+v"
The code that generates an error is:
sol = [k for k in dic if 'a1' in k]
Can someone point me in the right direction?
You can easily solve this with a quite short function:
def lookup(dct, *args):
    for needle in args:
        dct = {key: value for key, value in dct.items() if needle in value}
    return dct
For example:
>>> dic = {1: "a1+b+c", 2: "a1+c+v", 3: "a1+z+e", 4: "a2+p+a", 5: "a2+z+v", 6: "a3+q+v"}
>>> lookup(dic, "a1", "c")
{1: 'a1+b+c', 2: 'a1+c+v'}
However that always needs to iterate over all keys for each "needle". You can do better if you have a helper dictionary (I'll use a collections.defaultdict here) that stores all keys that match one needle (assuming + is supposed to be a delimiter in your dictionary):
from collections import defaultdict
helperdict = defaultdict(set)
for key, value in dic.items():
    for needle in value.split('+'):
        helperdict[needle].add(key)
That helperdict now contains all keys that match one particular part of a value:
>>> print(dict(helperdict))
{'z': {3, 5}, 'p': {4}, 'a1': {1, 2, 3}, 'a3': {6}, 'v': {2, 5, 6}, 'a2': {4, 5}, 'e': {3}, 'b': {1}, 'a': {4}, 'c': {1, 2}, 'q': {6}}
And using set.intersection allows you to quickly get all matches for different combinations:
>>> search = ['a2', 'z']
>>> matches = set.intersection(*[helperdict[needle] for needle in search])
>>> {match: dic[match] for match in matches}
{5: 'a2+z+v'}
It's definitely longer than the first approach and requires more memory, but if you plan to do several queries it will be much faster.
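The helper-dictionary approach can be wrapped into two small functions so the index is built once and reused for every query. This is a sketch under the same assumption that + is the delimiter; build_index and query are hypothetical names, not part of the answer.

```python
from collections import defaultdict

def build_index(dic, sep="+"):
    """Map each delimited part of a value to the set of keys that contain it."""
    index = defaultdict(set)
    for key, value in dic.items():
        for part in value.split(sep):
            index[part].add(key)
    return index

def query(dic, index, *needles):
    """Return the entries of dic whose values contain every needle as a part."""
    if not needles:
        return dict(dic)  # no filter selected: return everything
    # index.get avoids adding empty sets for unknown needles
    keys = set.intersection(*(index.get(needle, set()) for needle in needles))
    return {key: dic[key] for key in sorted(keys)}

dic = {1: "a1+b+c", 2: "a1+c+v", 3: "a1+z+e", 4: "a2+p+a", 5: "a2+z+v", 6: "a3+q+v"}
index = build_index(dic)
print(query(dic, index, "a1", "c"))  # {1: 'a1+b+c', 2: 'a1+c+v'}
print(query(dic, index, "a2", "z"))  # {5: 'a2+z+v'}
```

Note that this matches whole parts only, whereas the shorter lookup function uses substring matching, so searching for "a" there would also match "a1".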

counting words from a dictionary?

My function is supposed to have:
One parameter, a tweet.
This tweet can contain numbers, words, hashtags, links and punctuation.
A second parameter, a dictionary that counts the words in the tweet, disregarding any hashtags, mentions, links, and punctuation.
The function should record all individual words in the dictionary as lowercase letters without any punctuation.
If the tweet had Don't then the dictionary would count it as dont.
Here is my function:
def count_words(tweet, num_words):
    ''' (str, dict of {str: int}) -> None
    Return a NoneType that updates the count of words in the dictionary.
    >>> count_words('We have made too much progress', num_words)
    >>> num_words
    {'we': 1, 'have': 1, 'made': 1, 'too': 1, 'much': 1, 'progress': 1}
    >>> count_words("#utmandrew Don't you wish you could vote? #MakeAmericaGreatAgain", num_words)
    >>> num_words
    {'dont': 1, 'wish': 1, 'you': 2, 'could': 1, 'vote': 1}
    >>> count_words('I am fighting for you! #FollowTheMoney', num_words)
    >>> num_words
    {'i': 1, 'am': 1, 'fighting': 1, 'for': 1, 'you': 1}
    >>> count_words('', num_words)
    >>> num_words
    {'': 0}
    '''
I might misunderstand your question, but if you want to update the dictionary you can do it in this manner:
d = {}
def update_dict(tweet):
    for i in tweet.split():
        if i not in d:
            d[i] = 1
        else:
            d[i] += 1
    return d
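The snippet above counts raw tokens, so hashtags and punctuation would still end up in the dictionary. A possible sketch that also applies the filtering described in the question is shown below. It rests on assumptions the question does not spell out: tokens starting with # or @ are treated as hashtags and mentions, tokens starting with http are treated as links, and every non-alphanumeric character is dropped.

```python
def count_words(tweet, num_words):
    """Update num_words in place with the lowercase word counts of tweet."""
    for token in tweet.split():
        # Assumption: these prefixes identify hashtags, mentions and links
        if token.startswith(('#', '@', 'http')):
            continue
        # Keep letters and digits only, so "Don't" becomes "dont"
        word = ''.join(ch for ch in token.lower() if ch.isalnum())
        if word:
            num_words[word] = num_words.get(word, 0) + 1

num_words = {}
count_words("#utmandrew Don't you wish you could vote? #MakeAmericaGreatAgain", num_words)
print(num_words)  # {'dont': 1, 'you': 2, 'wish': 1, 'could': 1, 'vote': 1}
```

With this sketch an empty tweet leaves the dictionary unchanged, which differs from the {'': 0} shown in the question's docstring.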

Counting items inside tuples in Python

I am fairly new to python and I could not figure out how to do the following.
I have a list of (word, tag) tuples
a = [('Run', 'Noun'),('Run', 'Verb'),('The', 'Article'),('Run', 'Noun'),('The', 'DT')]
I am trying to find all tags that have been assigned to each word and collect their counts. For example, the word "Run" has been tagged twice as 'Noun' and once as 'Verb'.
To clarify: I would like to create another list of tuples that contains (word, tag, count)
You can use collections.Counter:
>>> import collections
>>> a = [('Run', 'Noun'),('Run', 'Verb'),('The', 'Article'),('Run', 'Noun'),('The', 'DT')]
>>> counter = collections.Counter(a)
>>> counter
Counter({('Run', 'Noun'): 2, ('Run', 'Verb'): 1, ... })
>>> result = {}
>>> for (word, tag), count in counter.items():
...     result.setdefault(word, []).append({tag: count})
...
>>> print(result)
{'Run': [{'Noun': 2}, {'Verb': 1}], 'The': [{'Article': 1}, {'DT': 1}]}
Pretty easy with a defaultdict:
>>> from collections import defaultdict
>>> output = defaultdict(defaultdict(int).copy)
>>> for word, tag in a:
...     output[word][tag] += 1
...
>>> output
defaultdict(<function copy>,
{'Run': defaultdict(int, {'Noun': 2, 'Verb': 1}),
'The': defaultdict(int, {'Article': 1, 'DT': 1})})
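Since the question asks for a list of (word, tag, count) tuples rather than a nested dictionary, the Counter result can also be flattened directly. A minimal sketch, where triples is a name chosen here for illustration:

```python
from collections import Counter

a = [('Run', 'Noun'), ('Run', 'Verb'), ('The', 'Article'), ('Run', 'Noun'), ('The', 'DT')]

# Each Counter key is a (word, tag) pair; unpack it and append the count
counts = Counter(a)
triples = [(word, tag, n) for (word, tag), n in counts.items()]

print(triples)
# [('Run', 'Noun', 2), ('Run', 'Verb', 1), ('The', 'Article', 1), ('The', 'DT', 1)]
```

The order shown relies on dictionaries preserving insertion order, which holds for Python 3.7 and later.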
