For example if I had a string without any punctuation:
"She walked the dog to the park and played ball with the dog When she threw the ball to the dog the dog missed the ball and ran to the other side of the park to fetch it"
I know how to do it by converting the string to uppercase/lowercase and using the function
from collections import Counter
but I can't think of any other way to count without using built-in functions (this includes setdefault, get, sorted, etc.)
It should come out in a key:value format. Any ideas?
Forget about libraries and "fast" ways of doing it, use simpler logic:
Start by splitting your string using stringName.split(). This returns a list of words. Now create an empty dictionary. Then iterate through the list and do one of two things: if the word is already in the dictionary, increment its count by 1; otherwise, create the key-value pair with the word as key and 1 as value.
At the end, you'll have a count of words.
The code:
testString = "She walked the dog to the park and played ball with the dog When she threw the ball to the dog the dog missed the ball and ran to the other side of the park to fetch it"
dic = {}
words = testString.split()
for raw_word in words:
    # Lowercase so that "She" and "she" are counted as the same word
    word = raw_word.lower()
    if word in dic:
        dic[word] += 1
    else:
        dic[word] = 1
print(dic)
Related
I would like to write a function that outputs a list of how many times each word occurs in the sentence
def words(text:str,number:int):
    text=text.split()
    for freq in text:
        print(freq, ",", text.count(freq))
I know this isn't correct, because I want the output to look like
[(cat,4),(dog,10),(pig,3)]
I am unsure how to approach this problem.
I also want it to count only as many different repeated words as the number passed to the function.
After splitting the string I created a set with unique entries. With re.findall you can find the count of every word in the text. The function words returns a dictionary with the word as key and the count as value.
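import re
def words(text:str):
    textsp = text.split()
    unique = set(textsp)
    count = []
    for word in unique:
        # Match whole whitespace-separated tokens only, so "cat" is not counted inside "catalog"
        count.append(len(re.findall(r'(?<!\S)%s(?!\S)' % re.escape(word), text)))
    out = dict(zip(unique, count))
    return out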
te = "cat dog cat pig dog dog pig cat"
print(words(te))
Learn how to use sets - https://realpython.com/python-sets/
This is a possible approach using sets:
s = "cat cat cat cat dog dog dog".split()
def words(s):
    counter = []
    for i in set(s):
        counter.append((i,s.count(i)))
    return counter
print(words(s))
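A minimal sketch that combines the ideas above: one pass with a plain dictionary, returning the list of (word, count) tuples the question asks for. The name count_words is hypothetical, and reading number as "report at most this many distinct words, in first-seen order" is only one possible interpretation, since the question does not pin it down. The ordering relies on dictionaries preserving insertion order (Python 3.7+).

```python
def count_words(text, number=None):
    """Return a list of (word, count) tuples in first-seen order."""
    counts = {}
    for word in text.split():
        if word in counts:
            counts[word] += 1
        else:
            counts[word] = 1
    pairs = list(counts.items())
    # Assumed meaning of `number`: cap on how many distinct words to report.
    if number is not None:
        pairs = pairs[:number]
    return pairs


result = count_words("cat dog cat pig dog dog pig cat")
print(result)
```

Unlike calling s.count(i) for every unique word, this walks the word list only once, which matters for long texts.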
So I have this textfile, and in that file it goes like this... (just a bit of it)
"The truest love that ever heart
Felt at its kindled core
Did through each vein in quickened start
The tide of being pour
Her coming was my hope each day
Her parting was my pain
The chance that did her steps delay
Was ice in every vein
I dreamed it would be nameless bliss
As I loved loved to be
And to this object did I press
As blind as eagerly
But wide as pathless was the space
That lay our lives between
And dangerous as the foamy race
Of ocean surges green
And haunted as a robber path
Through wilderness or wood
For Might and Right and Woe and Wrath
Between our spirits stood
I dangers dared I hindrance scorned
I omens did defy
Whatever menaced harassed warned
I passed impetuous by
On sped my rainbow fast as light
I flew as in a dream
For glorious rose upon my sight
That child of Shower and Gleam"
Now, I need to calculate the total length of the words without the letter 'e' in each line of text. So the first line should give 4, then 5, then 17, etc.
My current code is
for line in open("textname.txt"):
    line_strip = line.strip()
    line_strip_split = line_strip.split()
    for word in line_strip_split:
        if "e" not in word:
            word_e = word
            print (len(word_e))
My explanation is: strip the whitespace from each line, then split the line into words, so it becomes ['Felt','at','its','kindled','core'], etc. Then loop over the words so that each one can be checked individually for the letter 'e'. For the words without an 'e', print the length of the string.
HOWEVER, this prints the length of each word on a separate line instead of adding up the words in each line, so the answer becomes "4 / 2 / 3".
Try this:
for line in open("textname.txt"):
    line_strip = line.strip()
    line_strip_split = line_strip.split()
    words_with_no_e = []
    for word in line_strip_split:
        if "e" not in word:
            # Add words without an "e" to a new list
            words_with_no_e.append(word)
    # ''.join() returns all the elements of the list concatenated
    # len() counts the length of the result
    print(len(''.join(words_with_no_e)))
For each line, it appends all the words without an 'e' to a new list, concatenates them, and prints the length of the result.
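The same per-line total can also be computed without building an intermediate list. This is only a sketch: the helper name length_without_e is hypothetical, and the lines are hard-coded from the start of the poem so the example runs without textname.txt.

```python
def length_without_e(line):
    """Total number of characters in the words of `line` that contain no 'e'."""
    return sum(len(word) for word in line.split() if "e" not in word)


sample_lines = [
    "The truest love that ever heart",
    "Felt at its kindled core",
    "Did through each vein in quickened start",
]
totals = [length_without_e(line) for line in sample_lines]
print(totals)  # [4, 5, 17]
```

Note that the check is case-sensitive, exactly like the loop above: a word containing only a capital 'E' would still be counted.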
I have a list of reviews and a list of keywords, and I am trying to count how many times each keyword appears in each review. The list of keywords is around 30 entries and could grow or change. The current population of reviews is roughly 5000, with the review word count ranging from 3 to several hundred words. The number of reviews will definitely grow. Right now the keyword list is static and the number of reviews will not be growing too much, so any solution that gets the counts of keywords in each review will work, but ideally it will be one without a major performance issue if the number of reviews drastically increases or the keywords change and all the reviews have to be reanalyzed.
I have been reading through different methods on Stack Overflow and haven't been able to get any to work. I know you can use scikit-learn to get the count of each word, but I haven't figured out if there is a way to count a phrase. I have also tried various regular expressions. If the keyword list were all single words, I know I could very easily use scikit-learn, a loop, or regex, but I am having issues when a keyword has multiple words.
Two links I have tried
Python - Check If Word Is In A String
Phrase matching using regex and Python
the solution here is close, but it doesn't count all occurrences of the same word
How to return the count of words from a list of words that appear in a list of lists?
Both the list of keywords and the reviews are being pulled from a MySQL DB. All keywords are in lowercase. All text has been made lowercase and all non-alphanumeric characters except spaces have been stripped from the reviews. My original thought was to use scikit-learn's CountVectorizer to count the words, but not knowing how to handle counting a phrase, I switched. I am currently attempting this with loops and regex, but I am open to any solution.
# Example of what I am currently attempting with regex
keywords = ['test','blue sky','grass is green']
reviews = ['this is a test. test should come back twice and not 3 times for testing','this pharse contains test and blue sky and look another test','the grass is green test']
for review in reviews:
    for word in keywords:
        results = re.findall(r'\bword\b',review) #this returns no results, the variable word is not getting picked up
        #--also tried variations of this to no avail
        #--tried creating the pattern first and passing it
        # pattern = "r'\\b" + word + "\\b'"
        # results = re.findall(pattern,review) #this errors with the msg: sre_constants.error: multiple repeat at position 9
#The results would be
review1: test=2; 'blue sky'=0;'grass is green'=0
review2: test=2; 'blue sky'=1;'grass is green'=0
review3: test=1; 'blue sky'=0;'grass is green'=1
I would first do it in brute force rather than overcomplicating it and try to optimize it later.
from collections import defaultdict
keywords = ['test','blue sky','grass is green']
reviews = ['this is a test. test should come back twice and not 3 times for testing','this pharse contains test and blue sky and look another test','the grass is green test']
results = dict()
for i in keywords:
    for j in reviews:
        results[i] = results.get(i, 0) + j.count(i)
print(results)
>{'test': 6, 'blue sky': 1, 'grass is green': 1}
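It's important that we query the dict with .get: if a key has not been set yet, we get the default 0 instead of having to handle a KeyError exception. Note that str.count matches substrings, so "testing" also counts toward "test", and that the totals here are summed across all reviews rather than kept per review.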
If you want to go the complicated route, you can build your own trie and counter structure to do searches in large text files.
Parsing one terabyte of text and efficiently counting the number of occurrences of each word
None of the options you tried search for the value of word:
results = re.findall(r'\bword\b', review) checks for the word word in the string.
When you try pattern = "r'\\b" + word + "\\b'" you check for the string "r'\b[value of word]\b'".
You can use the first option, but the pattern should be r'\b%s\b' % word. That will search for the value of word.
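Putting that together, a minimal sketch of per-review counting with word boundaries could look like the following. The names patterns and per_review are hypothetical, and wrapping each keyword in re.escape is an addition not mentioned above; it is there only so that a keyword containing regex metacharacters is matched literally.

```python
import re

keywords = ['test', 'blue sky', 'grass is green']
reviews = [
    'this is a test. test should come back twice and not 3 times for testing',
    'this pharse contains test and blue sky and look another test',
    'the grass is green test',
]

# Compile each keyword once; re.escape guards against regex metacharacters.
patterns = {kw: re.compile(r'\b%s\b' % re.escape(kw)) for kw in keywords}

# One dict of counts per review, in the same order as `reviews`.
per_review = [
    {kw: len(pattern.findall(review)) for kw, pattern in patterns.items()}
    for review in reviews
]

for number, counts in enumerate(per_review, start=1):
    print('review%d: %s' % (number, counts))
```

Because of the \b anchors, "testing" in the first review is not counted as "test", which matches the expected results listed in the question.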
I just want to ask how I can find words from an array in my string.
I need to build a filter that finds the words saved in my array in text that a user types into a text box on my website.
I need to have 30+ words in an array or list or something.
Then the user types text in the text box.
Then the script should find all the words.
Something like a spam filter, I guess.
Thanks
import re
words = ['word1', 'word2', 'word4']
s = 'Word1 qwerty word2, word3 word44'
r = re.compile('|'.join([r'\b%s\b' % w for w in words]), flags=re.I)
r.findall(s)
>> ['Word1', 'word2']
Solution 1 uses the regex approach, which returns all instances of the keyword found in the data. Solution 2 returns the indexes of all instances of the keyword found in the data.
import re
dataString = '''Life morning don't were in multiply yielding multiply gathered from it. She'd of evening kind creature lesser years us every, without Abundantly fly land there there sixth creature it. All form every for a signs without very grass. Behold our bring can't one So itself fill bring together their rule from, let, given winged our. Creepeth Sixth earth saying also unto to his kind midst of. Living male without for fruitful earth open fruit for. Lesser beast replenish evening gathering.
Behold own, don't place, winged. After said without of divide female signs blessed subdue wherein all were meat shall that living his tree morning cattle divide cattle creeping rule morning. Light he which he sea from fill. Of shall shall. Creature blessed.
Our. Days under form stars so over shall which seed doesn't lesser rule waters. Saying whose. Seasons, place may brought over. All she'd thing male Stars their won't firmament above make earth to blessed set man shall two it abundantly in bring living green creepeth all air make stars under for let a great divided Void Wherein night light image fish one. Fowl, thing. Moved fruit i fill saw likeness seas Tree won't Don't moving days seed darkness.
'''
keyWords = ['Life', 'stars', 'seed', 'rule']
#---------------------- SOLUTION 1
print('Solution 1 output:')
for keyWord in keyWords:
    print(re.findall(keyWord, dataString))
#---------------------- SOLUTION 2
print('\nSolution 2 output:')
for keyWord in keyWords:
    index = 0
    indexes = []
    indexFound = 0
    # Search from every start position until find() reports -1 (no more matches)
    while indexFound != -1:
        indexFound = dataString.find(keyWord, index)
        if indexFound not in indexes:
            indexes.append(indexFound)
        index += 1
    # Drop the trailing -1 that ended the loop
    indexes.pop(-1)
    print(indexes)
Output:
Solution 1 output:
['Life']
['stars', 'stars']
['seed', 'seed']
['rule', 'rule', 'rule']
Solution 2 output:
[0]
[765, 1024]
[791, 1180]
[295, 663, 811]
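The character-by-character loop in Solution 2 can also be expressed with re.finditer, which yields one match object per non-overlapping occurrence. The sketch below uses a short hypothetical sample string instead of the long dataString, so the positions are easy to check by eye.

```python
import re

# Hypothetical sample text, kept short so the positions are easy to verify.
text = "rule one, rule two"
keywords = ['rule', 'two', 'missing']

positions = {}
for keyword in keywords:
    # re.escape makes the keyword match literally, even if it contains
    # characters that are special in regular expressions.
    pattern = re.compile(re.escape(keyword))
    positions[keyword] = [match.start() for match in pattern.finditer(text)]

print(positions)
```

A keyword that never occurs simply maps to an empty list, so no sentinel value such as -1 has to be removed afterwards.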
Try
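words = ['word1', 'word2', 'word4']
s = 'word1 qwerty word2, word3 word44'
count = 0
for token in s.split():
    # Drop surrounding punctuation such as the comma after "word2"
    token = token.strip(',.')
    if token in words:
        print(token)
        count += 1
print("count is " + str(count))
output
word1
word2
count is 2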
I am trying to compare two strings: devLine and fin_h.
When I split the words in both strings and iterate over words in both strings, I keep getting the error too many values to unpack.
If I add iteritems() like other SO posts mention, then, I get this error: list has no attribute 'iteritems'.
The strings are:
Rob car Mary bike George House Jerry Condo
Rob car Mary dc George dc Jerry dc
I want to check if the word in string1 matches the word in string2. The words I want to compare are the alternate words such as car, bike, house, condo. I want to compare those words with car, dc, dc, dc. If the words are equal, then print true or else, it is false. If the word tuple is (bike, dc), then still print out true because dc signifies any value that can be accepted as an input.
My code looks like this:
def compareLines(devLine, final_hypoth):
    devSplit = devLine.split()
    hypSplit = final_hypoth.split()
    for word in hypSplit.iteritems():
        #if hword != "?":
        print(word)
I had also tried using the zip() function because it seemed more pythonic to use it:
def compareLines(devLine, final_hypoth):
    devSplit = devLine.split()
    hypSplit = final_hypoth.split()
    wordSet = [" ".join(tup) for tup in zip(devSplit[1::2], hypSplit[1::2])]
    # what to do next?
This prints out the odd words in both strings together in an array like (car, car), (bike dc), (house dc), (condo, dc). However, how do I compare both of these values? This way seems easier to print out either true or false if the two words are equal or if the two comparisons include a dc.
Loop over pairs of words with zip:
for word1, word2 in zip(devSplit[1::2], hypSplit[1::2]):
    if word1 == word2 or word2 == 'dc':
        print('true')
    else:
        print('false')
Note that printing true or false for each pair might not be the most useful behavior. You might want to only print a single value summarizing whether all pairs matched, or you might want to create a list of boolean comparison values, or you might want to do something else.
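As a sketch of those alternatives, the function below returns a list of booleans, one per pair, and all() then collapses it into a single summary value. The name compare_lines and the wildcard parameter are hypothetical; the comparison is case-sensitive, just like the loop above.

```python
def compare_lines(dev_line, final_hypoth, wildcard='dc'):
    """Compare every second word of the two lines; `wildcard` matches anything."""
    dev_words = dev_line.split()[1::2]
    hyp_words = final_hypoth.split()[1::2]
    return [dev == hyp or hyp == wildcard
            for dev, hyp in zip(dev_words, hyp_words)]


matches = compare_lines("Rob car Mary bike George House Jerry Condo",
                        "Rob car Mary dc George dc Jerry dc")
print(matches)        # one boolean per pair
print(all(matches))   # single summary value
```

Keep in mind that zip stops at the shorter input, so lines with a different number of words are compared only up to the shorter length.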