Efficient way of frequency counting of continuous words?

I have a string like this:
inputString = "this is the first sentence in this book the first sentence is really the most interesting the first sentence is always first"
and a dictionary like this:
{
'always first': 0,
'book the': 0,
'first': 0,
'first sentence': 0,
'in this': 0,
'interesting the': 0,
'is always': 0,
'is really': 0,
'is the': 0,
'most interesting': 0,
'really the': 0,
'sentence in': 0,
'sentence is': 0,
'the first': 0,
'the first sentence': 0,
'the first sentence is': 0,
'the most': 0,
'this': 0,
'this book': 0,
'this is': 0
}
What is the most efficient way of updating the frequency counts of this dictionary in one pass of the input string (if it is possible)? I get a feeling that there must be a parser technique to do this but am not an expert in this area so am stuck. Any suggestions?

Check out the Aho-Corasick algorithm.

Aho–Corasick definitely seems the way to go, but if I needed a simple Python implementation, I'd write:
import collections

def consecutive_groups(seq, n):
    # Every window of up to n words. Windows near the end are shorter,
    # so snippets that finish the string are still seen.
    return (seq[i:i+n] for i in range(len(seq)))

def get_snippet_occurrences(text, snippets):
    split_snippets = [s.split() for s in snippets]
    max_snippet_length = max(len(sp) for sp in split_snippets)
    for group in consecutive_groups(text.split(), max_snippet_length):
        for lst in split_snippets:
            if group[:len(lst)] == lst:
                yield " ".join(lst)

# snippets is any iterable of the phrases to count, e.g. the keys of the dictionary in the question
print(collections.Counter(get_snippet_occurrences(inputString, snippets)))
# Counts: 'first': 4; 'the first', 'first sentence', 'the first sentence': 3 each;
# 'the first sentence is', 'sentence is', 'this': 2 each; every other snippet: 1

When confronted with this problem, I think, "I know, I'll use regular expressions".
Start off by making a list of all the patterns, sorted by decreasing length:
import re
patterns = sorted(counts.keys(), key=len, reverse=True)
Now make that into a single massive regular expression which is an alternation between each of the patterns (escaped, in case a pattern contains regex metacharacters):
allPatterns = re.compile("|".join(re.escape(p) for p in patterns))
Now run that pattern over the input string, and count up the number of hits on each pattern as you go:
pos = 0
while True:
    match = allPatterns.search(inputString, pos)
    if match is None:
        break
    pos = match.start() + 1
    counts[match.group()] = counts[match.group()] + 1
You will end up with the counts of each of the strings, with one problem: where a pattern is a prefix of another pattern (e.g. 'first' and 'first sentence'), only the longer pattern will have a count against it. This is by design: that's what the sort by length at the start was for.
(An aside: I believe some regular expression libraries compile a large alternation over fixed strings like this using the Aho-Corasick algorithm that e.dan mentioned. Python's re module is a backtracking engine that tries the alternatives in order, so do not count on that optimisation here; it is still probably the easiest way to get this kind of multi-pattern scan.)
We can deal with the prefix problem as a postprocessing step: go through the counts, and whenever one pattern is a prefix of another, add the longer pattern's count to the shorter pattern's. Be careful not to double-add. That's simply done as a nested loop (every pattern is a prefix of itself, so its own count is carried over too):
correctedCounts = {}
for donor in counts:
    for recipient in counts:
        if donor.startswith(recipient):
            correctedCounts[recipient] = correctedCounts.get(recipient, 0) + counts[donor]
That dictionary now contains the actual counts.

Try with Suffix tree or Trie to store words instead of characters.
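A minimal sketch of the word-level trie idea, assuming phrases are matched on whitespace-separated words. The names `build_word_trie`, `count_phrases` and the `END` sentinel are hypothetical, chosen only for this illustration:

```python
from collections import Counter

END = object()  # hypothetical sentinel key: "a phrase ends at this node"

def build_word_trie(phrases):
    """Build a nested-dict trie whose edges are whole words."""
    root = {}
    for phrase in phrases:
        node = root
        for word in phrase.split():
            node = node.setdefault(word, {})
        node[END] = phrase
    return root

def count_phrases(text, phrases):
    """Count every occurrence of every phrase, overlaps included."""
    trie = build_word_trie(phrases)
    words = text.split()
    counts = Counter(dict.fromkeys(phrases, 0))
    for start in range(len(words)):
        node = trie
        # Walk forward from this start word for as long as the trie allows.
        for i in range(start, len(words)):
            node = node.get(words[i])
            if node is None:
                break
            if END in node:
                counts[node[END]] += 1
    return counts

input_string = ("this is the first sentence in this book the first sentence "
                "is really the most interesting the first sentence is always first")
phrases = ["first", "the first", "the first sentence",
           "the first sentence is", "sentence is", "always first"]
counts = count_phrases(input_string, phrases)
print(counts["first"], counts["the first sentence is"], counts["always first"])
# prints: 4 2 1
```

Each start position walks at most as many words as the longest phrase, so the scan is roughly linear in the number of words when the phrases are short.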

Just go through the string and use the dictionary as you would normally to increment any occurance. This is O(n), since dictionary lookup is often O(1). I do this regularly, even for large word collections.
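One way to read this suggestion, as a rough sketch: slide over the words, build every n-gram up to the length of the longest key, and look each one up in the dictionary. The helper name `count_ngrams` is hypothetical, and the sketch assumes the dictionary is non-empty and that phrases are separated by single spaces once split:

```python
def count_ngrams(text, counts):
    """Increment counts[phrase] for every word n-gram of text that is a key."""
    words = text.split()
    max_len = max(len(key.split()) for key in counts)
    for start in range(len(words)):
        # Never build an n-gram that would run past the end of the text.
        longest = min(max_len, len(words) - start)
        for n in range(1, longest + 1):
            phrase = " ".join(words[start:start + n])
            if phrase in counts:
                counts[phrase] += 1
    return counts

counts = dict.fromkeys(["this", "this is", "first sentence", "is always"], 0)
count_ngrams("this is the first sentence in this book the first sentence "
             "is really the most interesting the first sentence is always first",
             counts)
print(counts)
# {'this': 2, 'this is': 1, 'first sentence': 3, 'is always': 1}
```

The work per word is bounded by the longest key, so the whole pass stays linear in the text length for a fixed dictionary.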

Related

Count total number of modal verbs in text

I am trying to create a custom collection of words as shown in the following Categories:
Modal     Tentative    Certainty    Generalizing
Can       Anyhow       Undoubtedly  Generally
May       anytime      Ofcourse     Overall
Might     anything     Definitely   On the Whole
Must      hazy         No doubt     In general
Shall     hope         Doubtless    All in all
ought to  hoped        Never        Basically
will      uncertain    always       Essentially
need      undecidable  absolute     Most
Be to     occasional   assure       Every
Have to   somebody     certain      Some
Would     someone      clear        Often
Should    something    clearly      Rarely
Could     sort         inevitable   None
Used to   sorta        forever      Always
I am reading text from a CSV file row by row:
import nltk
import numpy as np
import pandas as pd
from collections import Counter, defaultdict
from nltk.tokenize import word_tokenize
count = defaultdict(int)
header_list = ["modal","Tentative","Certainity","Generalization"]
categorydf = pd.read_csv('Custom-Dictionary1.csv', names=header_list)
def analyze(file):
    df = pd.read_csv(file)
    modals = str(categorydf['modal'])
    tentative = str(categorydf['Tentative'])
    certainity = str(categorydf['Certainity'])
    generalization = str(categorydf['Generalization'])
    for text in df["Text"]:
        tokenize_text = text.split()
        for w in tokenize_text:
            if w in modals:
                count[w] += 1
analyze("test1.csv")
print(sum(count.values()))
print(count)
I want to find the number of Modal/Tentative/Certainty words from the above table that appear in each row of test1.csv, but I am not able to do so. Instead, the code produces these word frequencies:
19
defaultdict(<class 'int'>, {'to': 7, 'an': 1, 'will': 2, 'a': 7, 'all': 2})
Note that 'an' and 'a' are not present in the table. What I want is the number of modal verbs, i.e. the total count of modal verbs present in each row of text in test1.csv.
test1.csv:
"When LIWC was first developed, the goal was to devise an efficient will system"
"Within a few years, it became clear that there are two very broad categories of words"
"Content words are generally nouns, regular verbs, and many adjectives and adverbs."
"They convey the content of a communication."
"To go back to the phrase “It was a dark and stormy night” the content words are: “dark,” “stormy,” and “night.”"
I am stuck and not getting anything. How can I proceed?

I've solved your task for the initial CSV format; it could of course be adapted to XML input if needed.
I wrote a fairly fancy solution using NumPy, so it may be a bit complex, but it runs very fast and is suitable for large data, even gigabytes.
It uses a sorted table of words, sorts the words of each text to count them, and does a sorted search in the table, hence it works in O(n log n) time.
For each text it prints the original text line, then a Found line listing each word found in the table, in sorted order, as (Count, Modality, (TableRow, TableCol)), then a Non-Found line listing the words not found in the table with their counts (the number of occurrences of the word in the text).
A much simpler (but slower) solution is located after the first one.
import io, pandas as pd, numpy as np
# Instead of io.StringIO(...) provide filename.
tab = pd.read_csv(io.StringIO("""
Modal,Tentative,Certainty,Generalizing
Can,Anyhow,Undoubtedly,Generally
May,anytime,Ofcourse,Overall
Might,anything,Definitely,On the Whole
Must,hazy,No doubt,In general
Shall,hope,Doubtless,All in all
ought to,hoped,Never,Basically
will,uncertain,always,Essentially
need,undecidable,absolute,Most
Be to,occasional,assure,Every
Have to,somebody,certain,Some
Would,someone,clear,Often
Should,something,clearly,Rarely
Could,sort,inevitable,None
Used to,sorta,forever,Always
"""))
tabc = np.array(tab.columns.values.tolist(), dtype = np.str_)
taba = tab.values.astype(np.str_)
tabw = np.char.lower(taba.ravel())
tabi = np.zeros([tabw.size, 2], dtype = np.int64)
tabi[:, 0], tabi[:, 1] = [e.ravel() for e in np.split(np.mgrid[:taba.shape[0], :taba.shape[1]], 2, axis = 0)]
t = np.argsort(tabw)
tabw, tabi = tabw[t], tabi[t, :]
texts = pd.read_csv(io.StringIO("""
Text
"When LIWC was first developed, the goal was to devise an efficient will system"
"Within a few years, it became clear that there are two very broad categories of words"
"Content words are generally nouns, regular verbs, and many adjectives and adverbs."
They convey the content of a communication.
"To go back to the phrase “It was a dark and stormy night” the content words are: “dark,” “stormy,” and “night.”"
""")).values[:, 0].astype(np.str_)
for i, (a, text) in enumerate(zip(map(np.array, np.char.split(texts)), texts)):
    vs, cs = np.unique(np.char.lower(a), return_counts = True)
    ps = np.searchsorted(tabw, vs)
    unc = np.zeros_like(a, dtype = np.bool_)
    psm = ps < tabi.shape[0]
    psm[psm] = tabw[ps[psm]] == vs[psm]
    print(
        i, ': Text:', text,
        '\nFound:',
        ', '.join([f'"{vs[i]}": ({cs[i]}, {tabc[tabi[ps[i], 1]]}, ({tabi[ps[i], 0]}, {tabi[ps[i], 1]}))'
            for i in np.flatnonzero(psm).tolist()]),
        '\nNon-Found:',
        ', '.join([f'"{vs[i]}": {cs[i]}'
            for i in np.flatnonzero(~psm).tolist()]),
        '\n',
    )
Outputs:
0 : Text: When LIWC was first developed, the goal was to devise an efficient will system
Found: "will": (1, Modal, (6, 0))
Non-Found: "an": 1, "developed,": 1, "devise": 1, "efficient": 1, "first": 1, "goal": 1, "liwc": 1, "system": 1, "the": 1, "to": 1, "was": 2, "when": 1
1 : Text: Within a few years, it became clear that there are two very broad categories of words
Found: "clear": (1, Certainty, (10, 2))
Non-Found: "a": 1, "are": 1, "became": 1, "broad": 1, "categories": 1, "few": 1, "it": 1, "of": 1, "that": 1, "there": 1, "two": 1, "very": 1, "within": 1, "words": 1, "years,": 1
2 : Text: Content words are generally nouns, regular verbs, and many adjectives and adverbs.
Found: "generally": (1, Generalizing, (0, 3))
Non-Found: "adjectives": 1, "adverbs.": 1, "and": 2, "are": 1, "content": 1, "many": 1, "nouns,": 1, "regular": 1, "verbs,": 1, "words": 1
3 : Text: They convey the content of a communication.
Found:
Non-Found: "a": 1, "communication.": 1, "content": 1, "convey": 1, "of": 1, "the": 1, "they": 1
4 : Text: To go back to the phrase “It was a dark and stormy night” the content words are: “dark,” “stormy,” and “night.”
Found:
Non-Found: "a": 1, "and": 2, "are:": 1, "back": 1, "content": 1, "dark": 1, "go": 1, "night”": 1, "phrase": 1, "stormy": 1, "the": 2, "to": 2, "was": 1, "words": 1, "“dark,”": 1, "“it": 1, "“night.”": 1, "“stormy,”": 1
The second solution is implemented in pure Python for simplicity; only the standard Python modules io and csv are used.
import io, csv
# Instead of io.StringIO(...) just read from filename.
tab = csv.DictReader(io.StringIO("""Modal,Tentative,Certainty,Generalizing
Can,Anyhow,Undoubtedly,Generally
May,anytime,Ofcourse,Overall
Might,anything,Definitely,On the Whole
Must,hazy,No doubt,In general
Shall,hope,Doubtless,All in all
ought to,hoped,Never,Basically
will,uncertain,always,Essentially
need,undecidable,absolute,Most
Be to,occasional,assure,Every
Have to,somebody,certain,Some
Would,someone,clear,Often
Should,something,clearly,Rarely
Could,sort,inevitable,None
Used to,sorta,forever,Always
"""))
texts = csv.DictReader(io.StringIO("""
"When LIWC was first developed, the goal was to devise an efficient will system"
"Within a few years, it became clear that there are two very broad categories of words"
"Content words are generally nouns, regular verbs, and many adjectives and adverbs."
They convey the content of a communication.
"To go back to the phrase “It was a dark and stormy night” the content words are: “dark,” “stormy,” and “night.”"
"""), fieldnames = ['Text'])
tabi = dict(sorted([(v.lower(), k) for e in tab for k, v in e.items()]))
texts = [e['Text'] for e in texts]
for text in texts:
    cnt, mod = {}, {}
    for word in text.lower().split():
        if word in tabi:
            cnt[word], mod[word] = cnt.get(word, 0) + 1, tabi[word]
    print(', '.join([f"'{word}': ({cnt[word]}, {mod[word]})" for word, _ in sorted(cnt.items(), key = lambda e: e[0])]))
It outputs:
'will': (1, Modal)
'clear': (1, Certainty)
'generally': (1, Generalizing)
I'm reading the CSV content from StringIO for convenience, so that the code contains everything without needing extra files. In your case you will need to read directly from files, which you can do as in the following code:
import io, csv
tab = csv.DictReader(open('table.csv', 'r', encoding = 'utf-8-sig'))
texts = csv.DictReader(open('texts.csv', 'r', encoding = 'utf-8-sig'), fieldnames = ['Text'])
tabi = dict(sorted([(v.lower(), k) for e in tab for k, v in e.items()]))
texts = [e['Text'] for e in texts]
for text in texts:
    cnt, mod = {}, {}
    for word in text.lower().split():
        if word in tabi:
            cnt[word], mod[word] = cnt.get(word, 0) + 1, tabi[word]
    print(', '.join([f"'{word}': ({cnt[word]}, {mod[word]})" for word, _ in sorted(cnt.items(), key = lambda e: e[0])]))
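The question asks for a total per category for each row rather than per-word counts. A rough sketch of that, using only the standard library: the `word_to_category` excerpt, the helper name `category_totals`, and the punctuation-stripping rule are all assumptions made for this illustration, not part of the answer above. Multi-word entries such as "ought to" are handled by also looking up word pairs and triples.

```python
import string
from collections import Counter

# A small, hypothetical excerpt of the category table, lower-cased.
word_to_category = {
    "can": "Modal", "will": "Modal", "ought to": "Modal",
    "clear": "Certainty", "no doubt": "Certainty",
    "generally": "Generalizing", "in general": "Generalizing",
}
MAX_WORDS = max(len(entry.split()) for entry in word_to_category)

def category_totals(text):
    """Return a Counter mapping category -> number of table entries found in text."""
    # Lower-case, split on whitespace, and strip punctuation around each token.
    strip_chars = string.punctuation + "“”"
    words = [w.strip(strip_chars) for w in text.lower().split()]
    words = [w for w in words if w]
    totals = Counter()
    for start in range(len(words)):
        longest = min(MAX_WORDS, len(words) - start)
        for n in range(1, longest + 1):
            category = word_to_category.get(" ".join(words[start:start + n]))
            if category is not None:
                totals[category] += 1
    return totals

rows = [
    "When LIWC was first developed, the goal was to devise an efficient will system",
    "Content words are generally nouns, regular verbs, and many adjectives and adverbs.",
    "They convey the content of a communication.",
]
for row in rows:
    print(dict(category_totals(row)))
# {'Modal': 1}
# {'Generalizing': 1}
# {}
```

With the full table loaded from the CSV, `totals["Modal"]` would give the number of modal verbs in a row.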

python count the number of words in the list of strings [duplicate]

Consider
doc = ["i am a fellow student", "we both are the good student", "a student works hard"]
I have this as input, and I just want to print the number of times each word occurs in the whole list.
For example, student occurs 3 times, so the expected output is student=3, a=2, etc.
I was able to print the unique words in the doc, but not their occurrences. Here is the function I used:
def fit(dataset):
    unique_words = set()
    if isinstance(dataset, (list,)):
        for row in dataset:
            for word in row.split(" "):
                if len(word) < 2:
                    continue
                unique_words.add(word)
        unique_words = sorted(list(unique_words))
        return unique_words
unique = fit(doc)
print(unique)
['am', 'are', 'both', 'fellow', 'good', 'hard', 'student', 'the', 'we', 'works']
I got this as output, but I just want the number of occurrences of each of the unique_words. How do I do this?

You just need to use Counter, and you will solve the problem by using a single line of code:
from collections import Counter
doc = ["i am a fellow student",
"we both are the good student",
"a student works hard"]
count = dict(Counter(word for sentence in doc for word in sentence.split()))
count is your desired dictionary:
{
'i': 1,
'am': 1,
'a': 2,
'fellow': 1,
'student': 3,
'we': 1,
'both': 1,
'are': 1,
'the': 1,
'good': 1,
'works': 1,
'hard': 1
}
So for example count['student'] == 3, count['a'] == 2 etc.
Here it's important to use split() instead of split(' '): in this way you will not end up with having an "empty" word within count. Example:
>>> sentence = "Hello     world"
>>> dict(Counter(sentence.split(' ')))
{'Hello': 1, '': 4, 'world': 1}
>>> dict(Counter(sentence.split()))
{'Hello': 1, 'world': 1}

Use
from collections import Counter
Counter(" ".join(doc).split())
results in
Counter({'i': 1,
'am': 1,
'a': 2,
'fellow': 1,
'student': 3,
'we': 1,
'both': 1,
'are': 1,
'the': 1,
'good': 1,
'works': 1,
'hard': 1})
Explanation: first join the sentences into one string with join, then split it on whitespace with split to get a list of single words. Counter then counts the appearances of each word.

doc = ["i am a fellow student", "we both are the good student", "a student works hard"]
p = doc[0].split()   # words of the first sentence
p1 = doc[1].split()  # words of the second sentence
p2 = doc[2].split()  # words of the third sentence
f1 = p + p1 + p2
# Visit each distinct word once, in first-seen order, including the last one.
for word in dict.fromkeys(f1):
    print(word, "is found", f1.count(word), "times")

You can use a set to collect the unique words and a string to aggregate all the words of every sentence, then use a dictionary comprehension to build a dictionary keyed by word whose value is the word's count:
doc = ["i am a fellow student", "we both are the good student", "a student works hard"]
uniques = set()
all_words = ''
for i in doc:
    for word in i.split(" "):
        uniques.add(word)
        all_words += f" {word}"
# Pad with a trailing space so the last word is also matched by " word ".
all_words += " "
print({i: all_words.count(f" {i} ") for i in uniques})
Output (key order may vary)
{'the': 1, 'hard': 1, 'student': 3, 'both': 1, 'fellow': 1, 'works': 1, 'a': 2, 'are': 1, 'am': 1, 'good': 1, 'i': 1, 'we': 1}

Thanks for posting on Stack Overflow. I have written some sample code that does what you need; check it and ask if there is anything you don't understand.
doc = ["i am a fellow student", "we both are the good student", "a student works hard"]
checked = []
occurence = []
for sentence in doc:
    for word in sentence.split(" "):
        if word in checked:
            occurence[checked.index(word)] = occurence[checked.index(word)] + 1
        else:
            checked.append(word)
            occurence.append(1)
for i in range(len(checked)):
    print(checked[i]+" : "+str(occurence[i]))

Try this one:
doc = ["i am a fellow student", "we both are the good student", "a student works hard"]
words = []
for a in doc:
    b = a.split()
    for c in b:
        # Optional: add "if len(c) > 3:" here to keep only the longer words
        words.append(c)
wc = []
for a in words:
    count = 0
    for b in words:
        if a == b:
            count += 1
    wc.append([a, count])
print(wc)

Unique word frequency using NLTK

Code to get the unique Word Frequency for the following using NLTK.
Seq Sentence
1 Let's try to be Good.
2 Being good doesn't make sense.
3 Good is always good.
Output:
{'good':3, 'let':1, 'try':1, 'to':1, 'be':1, 'being':1, 'doesn':1, 't':1, 'make':1, 'sense':1, 'is':1, 'always':1, '.':3, "'":2, 's':1}

If you specifically want to use nltk, you can refer to the following code snippet:
import nltk
text1 = '''Seq Sentence
1 Let's try to be Good.
2 Being good doesn't make sense.
3 Good is always good.'''
words = nltk.tokenize.word_tokenize(text1)
fdist1 = nltk.FreqDist(words)
filtered_word_freq = dict((word, freq) for word, freq in fdist1.items() if not word.isdigit())
print(filtered_word_freq)
Hope it helps.
Referred some parts from:
How to check if string input is a number?
Dropping specific words out of an NLTK distribution beyond stopwords

Try this
from collections import Counter
import pandas as pd
import nltk
sno = nltk.stem.SnowballStemmer('english')
s = "1 Let's try to be Good. 2 Being good doesn't make sense. 3 Good is always good."
s1 = s.split(' ')
d = pd.DataFrame(s1)
s2 = d[0].apply(lambda x: sno.stem(x))
counts = Counter(s2)
print(counts)
Output will be:
Counter({'': 6, 'be': 2, 'good.': 2, 'good': 2, '1': 1, 'let': 1, 'tri': 1, 'to': 1, '2': 1, "doesn't": 1, 'make': 1, 'sense.': 1, '3': 1, 'is': 1, 'alway': 1})

Determine the proximity distance of two string in python

clean_data is a list of over 9000 text files. rules is a list of dictionaries containing over 500 elements. Below is the rules list:
rules = [{'id': 1, 'kwd_root': 'add', 'kwd_sub': 'price target', 'word_count': 5, 'occurance': 1, 'kwd_search': 1, 'status': 1}, {'id': 2, 'kwd_root': 'add', 'kwd_sub': 'PT', 'word_count': 5, 'occurance': 1, 'kwd_search': 1, 'status': 1},.....]
My question is: I need to apply the rules to each and every element in the clean_data list. Below is the code I have used:
for word in clean_data:
for i,d in enumerate(rules):
if any(d['kwd_root'] in word and d['kwd_sub'] in word):
if abs(word.index(d['kwd_root']) - word.index(d['kwd_sub'])) <= d['word_count']:
research.append(word)
else:
non_research.append(word)
else:
non_research.append(word)
After running this code, len(non_research) is 110000 and len(research) is 5500.
But the expected output is that len(non_research) + len(research) equals len(clean_data).
Thanks

The indentation of the posted code is wrong. Also, on line 3 you call any(), which expects an iterable as its argument rather than a single boolean. In addition, research/non_research get a value appended for every word and every rule (words × rules times), which is why the two lengths do not add up to len(clean_data). Maybe you can use:
for word in clean_data:
    flag_rules = False
    for i, d in enumerate(rules):
        if d['kwd_root'] in word and d['kwd_sub'] in word:
            if abs(word.index(d['kwd_root']) - word.index(d['kwd_sub'])) <= d['word_count']:
                flag_rules = True
    if flag_rules:
        research.append(word)
    else:
        non_research.append(word)
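Note that str.index returns a character offset and only looks at the first occurrence, while the rule field is named word_count. If the intent is a distance measured in words, a sketch along the following lines may be closer. This reading of word_count is an assumption, and the helper names `find_phrase_positions` and `within_word_distance` are hypothetical:

```python
def find_phrase_positions(words, phrase_words):
    """Indices at which phrase_words occurs as a contiguous run in words."""
    n = len(phrase_words)
    return [i for i in range(len(words) - n + 1) if words[i:i + n] == phrase_words]

def within_word_distance(text, root, sub, max_words):
    """True if some occurrence of root starts within max_words words of sub."""
    words = text.lower().split()
    root_pos = find_phrase_positions(words, root.lower().split())
    sub_pos = find_phrase_positions(words, sub.lower().split())
    return any(abs(r - s) <= max_words for r in root_pos for s in sub_pos)

rule = {'kwd_root': 'add', 'kwd_sub': 'price target', 'word_count': 5}
near = "we add two dollars to our price target for the stock"
far = "add this stock to the watch list because analysts raised the price target"
print(within_word_distance(near, rule['kwd_root'], rule['kwd_sub'], rule['word_count']))
print(within_word_distance(far, rule['kwd_root'], rule['kwd_sub'], rule['word_count']))
# True
# False
```

Because matching is on whole words, 'add' no longer matches inside words such as 'address', which the substring test `in` would accept.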

Trying to create variables from elements in a list

this is my second time trying to ask this question, hoping it comes across more concise this time round.
I have a list of batsmen for a cricket score calculator I am trying to get up and running.
eg.
batsmen = ['S Ganguly', 'M Brown', 'R Uthappa', 'A Majumdar', 'S Smith', 'A Mathews', 'M Manhas', 'W Parnell', 'B Kumar', 'M Kartik', 'A Nehra']
With this list I have a for loop that currently runs without using the list; it just calculates the totals for 2 teams and finds a winner.
for overs in range(overlimit):
    for balls in range(6):
        balls_team = balls_team + 1
        runtotal_team = input('Enter runs: ')
I am trying to utilise the list as a means of keeping score as well as holding values for each of the batsmen.
I'm hoping one of you can help. I'm assuming a while loop can be used, but I am unable to figure out how to implement the list.

Dictionary?
batsmenDict = {
'A Majumdar': 0,
'A Mathews': 0,
'A Nehra': 0,
'B Kumar': 0,
'M Brown': 0,
'M Kartik': 0,
'M Manhas': 0,
'R Uthappa': 0,
'S Ganguly': 0,
'S Smith': 0,
'W Parnell': 0}
batsmenDict['M Manhas'] += 1
There is even a special collection type called a defaultdict that would let you make the default value 0 for each player:
from collections import defaultdict
batsmenDict = defaultdict(int)
print(batsmenDict['R Uthappa'])
# 0
batsmenDict['R Uthappa'] += 1
print(batsmenDict['R Uthappa'])
# 1

You want to use a dict, then the names of the batsmen can become the keys for the dict, and their runs/scores the value.
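As a small sketch of that idea: the `balls` list below is hypothetical data standing in for the input() prompts in the question, and `record_ball` is an illustrative helper name, not something from the original code.

```python
batsmen = ['S Ganguly', 'M Brown', 'R Uthappa']

# One entry per batsman: name -> runs scored so far.
runs = {name: 0 for name in batsmen}

def record_ball(runs, striker, scored):
    """Add the runs from one ball to the striker's total."""
    runs[striker] += scored

# Hypothetical ball-by-ball data: (batsman on strike, runs off that ball).
balls = [('S Ganguly', 4), ('S Ganguly', 1), ('M Brown', 6), ('M Brown', 0)]
for striker, scored in balls:
    record_ball(runs, striker, scored)

team_total = sum(runs.values())
print(runs)
print(team_total)
# {'S Ganguly': 5, 'M Brown': 6, 'R Uthappa': 0}
# 11
```

The team total falls out of the per-batsman values, so there is no separate running total to keep in sync.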

Use a dictionary
>>> batsmen = [ ... ]
>>> d = dict.fromkeys(batsmen)
>>> d['A Mathews'] = 14
>>> d['A Mathews']
14
