clean_data is a list of over 9000 text files. rules is a list of dictionaries with over 500 elements. Below is the rules list:
rules = [{'id': 1, 'kwd_root': 'add', 'kwd_sub': 'price target', 'word_count': 5, 'occurance': 1, 'kwd_search': 1, 'status': 1}, {'id': 2, 'kwd_root': 'add', 'kwd_sub': 'PT', 'word_count': 5, 'occurance': 1, 'kwd_search': 1, 'status': 1},.....]
My question: I need to apply the rules to every element of the clean_data list. Below is the code I have used:
for word in clean_data:
for i,d in enumerate(rules):
if any(d['kwd_root'] in word and d['kwd_sub'] in word):
if abs(word.index(d['kwd_root']) - word.index(d['kwd_sub'])) <= d['word_count']:
research.append(word)
else:
non_research.append(word)
else:
non_research.append(word)
After running this code, len(non_research) is 110000 and len(research) is 5500.
But I expected len(non_research) + len(research) to be equal to len(clean_data).
Thanks
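The indentation of the posted code is wrong. Also, on line 3 you call any(), which expects an iterable (such as a list) as its argument, not a single boolean. In addition, research/non_research get a value appended for every word and every rule (word x rule times), which is why the two lengths add up to far more than len(clean_data). Maybe you can use: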
for word in clean_data:
    flag_rules = False
    for i,d in enumerate(rules):
        if d['kwd_root'] in word and d['kwd_sub'] in word:
            if abs(word.index(d['kwd_root']) - word.index(d['kwd_sub'])) <= d['word_count']:
                flag_rules = True
    if flag_rules:
        research.append(word)
    else:
        non_research.append(word)
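For what it's worth, here is a minimal sketch of the same idea written with any() over a generator, which is probably what the original code was reaching for. The helper names matches_rule and split_by_rules and the three sample texts are made up for illustration; the distance check is the same character-offset comparison used in the loop above.

```python
def matches_rule(text, rule):
    """Return True if both keywords occur and their first occurrences are close enough."""
    root, sub = rule['kwd_root'], rule['kwd_sub']
    if root not in text or sub not in text:
        return False
    # Same measure as the loop above: difference of character offsets
    return abs(text.index(root) - text.index(sub)) <= rule['word_count']


def split_by_rules(texts, rules):
    """Append each text exactly once, to research or to non_research."""
    research, non_research = [], []
    for text in texts:
        # any() takes an iterable of booleans, here one per rule
        if any(matches_rule(text, rule) for rule in rules):
            research.append(text)
        else:
            non_research.append(text)
    return research, non_research


# Hypothetical sample data, only to show the call
sample_rules = [
    {'id': 1, 'kwd_root': 'add', 'kwd_sub': 'price target', 'word_count': 5},
    {'id': 2, 'kwd_root': 'add', 'kwd_sub': 'PT', 'word_count': 5},
]
sample_texts = [
    'add PT now',
    'nothing relevant here',
    'add a very long sentence before the price target',
]
research, non_research = split_by_rules(sample_texts, sample_rules)
print(len(research), len(non_research))
```

Because every text lands in exactly one of the two lists, len(research) + len(non_research) always equals the number of texts.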
I'm having trouble turning a string into a dictionary in which every word is a key and the value is the number of times that word appears.
For example
string = 'How many times times appeared in this many times'
The dict I want is:
dict = {'times':3, 'many':2, 'how':1 ...}
Using Counter
from collections import Counter
res = dict(Counter(string.split()))
#{'How': 1, 'many': 2, 'times': 3, 'appeared': 1, 'in': 1, 'this': 1}
You can loop through the words and increment the count like so:
d = {}
for word in string.split(" "):
    d.setdefault(word, 0)
    d[word] += 1
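Note that the desired dict in the question has a lowercase 'how' and is ordered by count. A small sketch of one way to get that, assuming lowercasing every word is acceptable for your data:

```python
from collections import Counter

string = 'How many times times appeared in this many times'

# Lowercase first so that 'How' and 'how' are counted as the same word
counts = Counter(string.lower().split())

# most_common() sorts by count, highest first; dict() keeps that order (Python 3.7+)
ordered = dict(counts.most_common())
print(ordered)
```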
I have made a text string, removed all non-alphabetic symbols and put whitespace between the words, but when I add them to a dictionary to count the frequency of the words, it counts the letters instead. How do I count the words with a dictionary?
dictionary = {}
for item in text_string:
    if item in dictionary:
        dictionary[item] = dictionary[item]+1
    else:
        dictionary[item] = 1
print(dictionary)
Change this
for item in text_string:
to this
for item in text_string.split():
The .split() method splits the string into words, using whitespace characters (including tabs and newlines) as delimiters.
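To see why the original loop counted letters, compare what each form of iteration yields. This is only a small illustration; the sample sentence is made up:

```python
text_string = 'to be\tor not\nto be'

# Iterating over a string yields single characters (whitespace included)
chars = [item for item in text_string]

# split() with no argument yields the words, splitting on any whitespace
words = text_string.split()

print(chars[:5])
print(words)
```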
You are very close. Since you state that your words are already whitespace separated, you need to use str.split to make a list of words.
An example is below:
dictionary = {}
text_string = 'there are repeated words in this sring with many words many are repeated'
for item in text_string.split():
    if item in dictionary:
        dictionary[item] = dictionary[item]+1
    else:
        dictionary[item] = 1
print(dictionary)
{'there': 1, 'are': 2, 'repeated': 2, 'words': 2, 'in': 1,
'this': 1, 'sring': 1, 'with': 1, 'many': 2}
Another solution is to use collections.Counter, available in the standard library:
from collections import Counter
text_string = 'there are repeated words in this sring with many words many are repeated'
c = Counter(text_string.split())
print(c)
Counter({'are': 2, 'repeated': 2, 'words': 2, 'many': 2, 'there': 1,
'in': 1, 'this': 1, 'sring': 1, 'with': 1})
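If you only need the few most frequent words, Counter.most_common(n) can pick them out directly. A short sketch on the same sentence:

```python
from collections import Counter

text_string = 'there are repeated words in this sring with many words many are repeated'
c = Counter(text_string.split())

# The two most frequent words; ties keep the order in which words were first seen
top_two = c.most_common(2)
print(top_two)
```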
I have a lengthy Python list and would like to count the number of occurrences of a single character. For example, how many total times does 'o' occur? I want N=4.
lexicon = ['yuo', 'want', 'to', 'sioo', 'D6', 'bUk', 'lUk'], etc.
list.count() is the obvious solution. However, it consistently returns 0. It doesn't matter which character I look for. I have double checked my file - the characters I am searching for are definitely there. I happen to be calculating count() in a for loop:
for i in range(100):
    # random sample 500 words
    sample = list(set(random.sample(lexicon, 500)))
    C1 = ['k']
    total = sum(len(i) for i in sample) # total words
    sample_count_C1 = sample.count(C1) / total
But it returns 0 outside of the for loop, over the list 'lexicon' as well. I don't want a list of overall counts so I don't think Counter will work.
Ideas?
If we take your list (the shortened version you supplied):
lexicon = ['yu', 'want', 'to', 'si', 'D6', 'bUk', 'lUk']
then we can get the count using sum() and a generator-expression:
count = sum(s.count(c) for s in lexicon)
so if c were, say, 'k', this would give 2, as there are two occurrences of k.
This will work in a for-loop or not, so you should be able to incorporate this into your wider code by yourself.
With your latest edit, I can confirm that this produces a count of 4 for 'o' in your modified list.
If I understand your question correctly, you would like to count the number of occurrences of each character for each word in the list. This is known as a frequency distribution.
Here is a simple implementation using Counter
from collections import Counter
lexicon = ['yu', 'want', 'to', 'si', 'D6', 'bUk', 'lUk']
chars = [char for word in lexicon for char in word]
freq_dist = Counter(chars)
Counter({'t': 2, 'U': 2, 'k': 2, 'a': 1, 'u': 1, 'l': 1, 'i': 1, 'y': 1, 'D': 1, '6': 1, 'b': 1, 's': 1, 'w': 1, 'n': 1, 'o': 1})
Using freq_dist, you can return the number of occurrences for a character.
freq_dist.get('a')
1
# get() method returns None if character is not in dict
freq_dist.get('4')
None
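As a side note, a Counter can also be indexed directly, and in that case a missing character gives 0 rather than None, which is often more convenient for arithmetic. A small sketch on the same list:

```python
from collections import Counter

lexicon = ['yu', 'want', 'to', 'si', 'D6', 'bUk', 'lUk']
freq_dist = Counter(char for word in lexicon for char in word)

missing_via_get = freq_dist.get('4')          # None: plain dict behaviour
missing_via_index = freq_dist['4']            # 0: Counter returns 0 for missing keys
missing_with_default = freq_dist.get('4', 0)  # 0: explicit default

print(missing_via_get, missing_via_index, missing_with_default)
```

Indexing a missing key this way does not add it to the Counter.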
It's giving zero because sample.count(C1) only counts list elements that are equal to C1 as a whole. It does not look inside the strings, so the k in 'bUk' or 'lUk' is never considered.
If you want to calculate the frequency of a character, do it like this:
for i in range(100):
    # random sample 500 words
    sample = list(set(random.sample(lexicon, 500)))
    C1 = 'k'  # a string, because str.count() needs a string argument
    total = sum(len(w) for w in sample)  # total number of characters in the sample
    sample_count = sum(x.count(C1) for x in sample)  # occurrences of C1 inside the words
    sample_count_C1 = sample_count / total
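The difference between list.count() and str.count() is easy to check on the short list from the question. A minimal illustration:

```python
lexicon = ['yuo', 'want', 'to', 'sioo', 'D6', 'bUk', 'lUk']

# list.count() compares whole elements, so a single letter never matches a word
whole_element_matches = lexicon.count('o')

# A complete word does match
word_matches = lexicon.count('to')

# str.count() looks inside each word; summing gives the total for the list
char_total = sum(word.count('o') for word in lexicon)

print(whole_element_matches, word_matches, char_total)
```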
I have a dictionary. The keys are words and the values are the number of times those words occur.
countDict = {'house': 2, 'who': 41, 'joey': 409, 'boy': 2, 'girl':2}
I'd like to find out how many elements occur with a value of more than 1, with a value of more than 20 and with a value of more than 50.
I found this code
a = sum(1 for i in countDict if countDict.values() >= 2)
but I get an error that I'm guessing means that values in dictionaries can't be processed as integers.
builtin.TypeError: unorderable types: dict_values() >= int()
I tried modifying the above code to make the dictionary value be an integer but that did not work either.
a = sum(1 for i in countDict if int(countDict.values()) >= 2)
builtins.TypeError: int() argument must be a string or a number, not 'dict_values'
Any suggestions?
countDict.items() gives you key-value pairs in countDict so you can write:
>>> countDict = {'house': 2, 'who': 41, 'joey': 409, 'boy': 2, 'girl':2}
>>> [word for word, occurrences in countDict.items() if occurrences >= 20]
['who', 'joey']
If you just want the number of words, use len:
>>> countDict = {'house': 2, 'who': 41, 'joey': 409, 'boy': 2, 'girl':2}
>>> wordlist = [word for word, occurrences in countDict.items() if occurrences >= 20]
>>> len(wordlist)
2
Note that Python variable names conventionally use lowercase and underscores (snake case): count_dict rather than countDict. By convention, camel case with a leading capital (CapWords) is used for class names in Python:
breakfast = SpamEggs() # breakfast is new instance of class SpamEggs
lunch = spam_eggs() # call function spam_eggs and store result in lunch
dinner = spam_eggs # assign value of spam_eggs variable to dinner
See PEP8 for more details.
You need this:
>>> countDict = {'house': 2, 'who': 41, 'joey': 409, 'boy': 2, 'girl':2}
>>> sum(1 for i in countDict.values() if i >= 2)
5
values() returns all the values of the dictionary at once (a list in Python 2, a dict_values view in Python 3), not a single number, which is why you can't compare it with an integer or convert it to one.
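A quick way to convince yourself, assuming Python 3 as in the error messages from the question (the name count_dict is just a renamed copy of the question's dictionary):

```python
count_dict = {'house': 2, 'who': 41, 'joey': 409, 'boy': 2, 'girl': 2}

values = count_dict.values()
print(type(values).__name__)  # the whole collection of values, not one number

# Comparing the collection itself with an int is what raised the TypeError
try:
    values >= 2
    comparison_failed = False
except TypeError:
    comparison_failed = True

# Comparing each individual value works
at_least_two = sum(1 for v in values if v >= 2)
print(comparison_failed, at_least_two)
```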
You could use collections.Counter and a "classification function" to get the result in one-pass:
def classify(val):
    res = []
    if val > 1:
        res.append('> 1')
    if val > 20:
        res.append('> 20')
    if val > 50:
        res.append('> 50')
    return res
from collections import Counter
countDict = {'house': 2, 'who': 41, 'joey': 409, 'boy': 2, 'girl':2}
Counter(classification for val in countDict.values() for classification in classify(val))
# Counter({'> 1': 5, '> 20': 2, '> 50': 1})
Of course you can alter the return values or thresholds in case you want a different result.
But you were actually pretty close, you probably just mixed up the syntax - correct would be:
a = sum(1 for i in countDict.values() if i >= 2)
because you want to iterate over the values() and check the condition for each value.
What you got was an exception because the comparison between
>>> countDict.values()
dict_values([2, 409, 2, 41, 2])
and an integer like 2 doesn't make any sense.
Try using .items() before applying your discrimination logic.
in -
for key, value in countDict.items():
    if value == 2: #edit for use
        print(key)
out -
house
boy
girl
You're quite close. You just need to remember that i, the current key, is available in the if clause. Additionally, values() gives you all the values in the dictionary at once, which is not what you want; you want the value for the key currently being evaluated, countDict[i]. I added all three examples to get you started.
moreThan2 = sum(1 for i in countDict if countDict[i] >= 2)
moreThan20 = sum(1 for i in countDict if countDict[i] >= 20)
moreThan50 = sum(1 for i in countDict if countDict[i] >= 50)
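If the thresholds might change, the three lines can be folded into one dict comprehension. This is just a sketch; it uses strictly greater than, following the "more than 1 / 20 / 50" wording of the question, and the names thresholds and above are made up:

```python
countDict = {'house': 2, 'who': 41, 'joey': 409, 'boy': 2, 'girl': 2}

thresholds = (1, 20, 50)

# For each threshold, count how many words occur more often than that
above = {t: sum(1 for v in countDict.values() if v > t) for t in thresholds}
print(above)
```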
I have a dictionary that's two levels deep. That is, each key in the first dictionary is a url and the value is another dictionary with each key being words and each value being the number of times the word appeared on that url. It looks something like this:
dic = {
    'http://www.cs.rpi.edu/news/seminars.html': {
        'hyper': 1,
        'summer': 2,
        'expert': 1,
        'koushk': 1,
        'semantic': 1,
        'feedback': 1,
        'sandia': 1,
        'lewis': 1,
        'global': 1,
        'yener': 1,
        'laura': 1,
        'troy': 1,
        'session': 1,
        'greenhouse': 1,
        'human': 1
...and so on...
The dictionary itself is very long and has 25 urls in it. Each url has another dictionary as its value, containing every word found at that url and the number of times it is found.
I want to find the word or words that appear in the most different urls in the dictionary. So the output should look something like this:
The following words appear x times on y pages: list of words
It seems that you should use a Counter for this:
from collections import Counter
print(sum((Counter(x) for x in dic.values()), Counter()).most_common())
Or the multiline version:
c = Counter()
for d in dic.values():
    c += Counter(d)
print(c.most_common())
To get the words which are common in all of the subdicts:
subdicts = iter(dic.values())
s = set(next(subdicts)).intersection(*subdicts)
Now you can use that set to filter the resulting counter, removing words which don't appear in every subdict:
c = Counter({k: v for k, v in c.items() if k in s})  # build from a dict so the counts are kept
print(c.most_common())
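Note that the Counter above adds up total occurrences. If what you want is the number of different urls each word appears on, one possible approach is to count the keys of each sub-dictionary instead of its values. A minimal sketch, with three made-up urls standing in for the real data:

```python
from collections import Counter

# Hypothetical stand-in for the real 25-url dictionary
dic = {
    'url_a': {'summer': 2, 'troy': 1},
    'url_b': {'summer': 1, 'human': 3},
    'url_c': {'summer': 4, 'troy': 2},
}

# Each sub-dictionary contributes every one of its words exactly once
page_counts = Counter(word for words in dic.values() for word in words)

# Words that appear on the largest number of pages
top = max(page_counts.values())
most_widespread = sorted(w for w, n in page_counts.items() if n == top)

print('The following words appear on {} pages: {}'.format(top, ', '.join(most_widespread)))
```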
A Counter isn't quite what you want. From the output you show, it looks like you want to keep track of both the total number of occurrences, and the number of pages the word occurs on.
data = {
    'page1': {
        'word1': 5,
        'word2': 10,
        'word3': 2,
    },
    'page2': {
        'word2': 2,
        'word3': 1,
    }
}
from collections import defaultdict
class Entry(object):
    def __init__(self):
        self.pages = 0
        self.occurrences = 0
    def __iadd__(self, occurrences):
        self.pages += 1
        self.occurrences += occurrences
        return self
    def __str__(self):
        return '{} occurrences on {} pages'.format(self.occurrences, self.pages)
    def __repr__(self):
        return '<Entry {} occurrences, {} pages>'.format(self.occurrences, self.pages)
counts = defaultdict(Entry)
for page_words in data.values():
    for word, count in page_words.items():
        counts[word] += count
for word, entry in counts.items():
    print(word, ':', entry)
This produces the following output (on Python 3.7+, where dictionaries keep insertion order):
word1 : 5 occurrences on 1 pages
word2 : 12 occurrences on 2 pages
word3 : 3 occurrences on 2 pages
That captures the information you want. The next step is to find the n most common words. You could do that with a heap via heapq.nlargest, which has the handy feature of not requiring a full sort of the word list by number of pages and then occurrences. That might matter if you have a lot of words in total but the n of 'top n' is relatively small.
from heapq import nlargest
def by_pages_then_occurrences(item):
    entry = item[1]
    return entry.pages, entry.occurrences
print(nlargest(2, counts.items(), key=by_pages_then_occurrences))
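The same ranking can be tried in isolation with plain tuples instead of Entry objects. This is only a sketch; the stats dictionary is hand-written from the counts in the example above, stored as (pages, occurrences):

```python
from heapq import nlargest

# (pages, occurrences) per word, copied by hand from the example output
stats = {'word1': (1, 5), 'word2': (2, 12), 'word3': (2, 3)}

# Tuples compare element by element: pages first, then occurrences
top_two = nlargest(2, stats.items(), key=lambda item: item[1])
print(top_two)
```

Because the key is a tuple, words on the same number of pages are ranked by their total occurrences.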