Word frequencies in a text file in Python

I want to find the frequencies of certain words in wanted, and while my code does find the frequencies, the displayed result contains lots of unnecessary data.
Code:
from collections import Counter
import re
wanted = "whereby also thus"
cnt = Counter()
words = re.findall('\w+', open('C:/Users/user/desktop/text.txt').read().lower())
for word in words:
    if word in wanted:
        cnt[word] += 1
print(cnt)
Results:
Counter({'e': 131, 'a': 119, 'by': 38, 'where': 16, 's': 14, 'also': 13, 'he': 4, 'whereby': 2, 'al': 2, 'b': 2, 'o': 1, 't': 1})
Questions:
How do I omit all those 'e', 'a', 'by', 'where', etc.?
If I then wanted to sum up the frequencies of words (also, thus, whereby) and divide them by total number of words in text, would that be possible?
Disclaimer: this is not a school assignment. I just have lots of free time at work now, and since I spend a lot of time reading texts, I decided to do this little project of mine to remind myself a bit of what I was taught a couple of years ago.
Thanks in advance for any help.

As others have pointed out, you need to change your string wanted to a list. I just hardcoded a list, but you could use str.split(" ") if you were passed a string in a function. I also implemented the frequency calculation for you. Just as a note, make sure you close your files; the easiest (and recommended) way is to open them with a with statement, as below.
from collections import Counter
import re
wanted = ["whereby", "also", "thus"]
cnt = Counter()
with open('C:/Users/user/desktop/text.txt', 'r') as fp:
    fp_contents = fp.read().lower()
words = re.findall('\w+', fp_contents)
for word in words:
    if word in wanted:
        cnt[word] += 1
print(cnt)
total_cnt = sum(cnt.values())
# divide the summed frequencies by the total number of words in the text
print(float(total_cnt) / len(words))

Reading from the web
I made this little mod of Axel's code to read from a txt file on the web (Alice in Wonderland), since I don't have your txt file and wanted to try the code. I am publishing it here in case someone needs something like this.
from collections import Counter
import re
from urllib.request import urlopen
testo = str(urlopen("https://www.gutenberg.org/files/11/11.txt").read())  # .read().decode('utf-8') would be cleaner
wanted = ["whereby", "also", "thus", "Alice", "down", "up", "cup"]
cnt = Counter()
words = re.findall('\w+', testo)
for word in words:
    if word in wanted:
        cnt[word] += 1
print(cnt)
total_cnt = sum(cnt.values())
print(float(total_cnt) / len(cnt))  # average count per wanted word found
output
Counter({'Alice': 334, 'up': 97, 'down': 90, 'also': 4, 'cup': 2})
105.4
How many times the same word is found in adjacent sentences
This answers the question author's follow-up request: counting how many times a word is found in adjacent sentences. If the same word (e.g. 'had') appears more than once in a sentence and then again in the next one, I count that as a single repetition; that is why I used the wordfound list.
from collections import Counter
import re
testo = """There was nothing so VERY remarkable in that; nor did Alice think it so? Thanks VERY much. Out of the way to hear the Rabbit say to itself, 'Oh dear! Oh dear! I shall be late!' (when she thought it over afterwards, it occurred to her that she ought to have wondered at this, but at the time it all seemed. Quite natural); but when the Rabbit actually TOOK A WATCH OUT OF ITS? WAISTCOAT-POCKET, and looked at it, and then hurried on.
Alice started to her feet, for it flashed across her mind that she had never before seen a rabbit. with either a waistcoat-pocket, or a watch to take out of it! and burning with curiosity, she ran across the field after it, and fortunately was just in time to see it pop? Down a large rabbit-hole under the hedge.
Alice opened the door and found that it led into a small passage, not much larger than a rat-hole: she knelt down and looked along the passage into the loveliest garden you ever saw. How she longed to get out of that dark hall, and wander about among those beds of bright flowers and those cool fountains, but she could not even get her head through the doorway; 'and even if my head would go through,' thought poor Alice, 'it would be of very little use without my shoulders. Oh, how I wish I could shut up like a telescope! I think I could, if I only knew how to begin.'For, you see, so many out-of-the-way things had happened lately, that Alice had begun to think that very few things indeed were really impossible. There seemed to be no use in waiting by the little door, so she went back to the table, half hoping she might find another key on it, or at any rate a book of rules for shutting people up like telescopes: this time she found a little bottle on it, ('which certainly was not here before,' said Alice,) and round the neck of the bottle was a paper label, with the words 'DRINK ME' beautifully printed on it in large letters. It was all very well to say 'Drink me,' but the wise little Alice was not going to do THAT in a hurry. 'No, I'll look first,' she said, 'and see whether it's marked "poison" or not'; for she had read several nice little histories about children who had got burnt, and eaten up by wild beasts and other unpleasant things, all because they WOULD not remember the simple rules their friends had taught them: such as, that a red-hot poker will burn you if you hold it too long; and that if you cut your finger VERY deeply with a knife, it usually bleeds; and she had never forgotten that, if you drink much from a bottle marked 'poison,' it is almost certain to disagree with you, sooner or later. However, this bottle was NOT marked 'poison,' so Alice ventured to taste it, and finding it very nice, (it had, in fact, a sort of mixed flavour of cherry-tart, custard, pine-apple, roast turkey, toffee, and hot buttered toast,) she very soon finished it off. """
frasi = re.findall("[A-Z].*?[\.!?]", testo, re.MULTILINE | re.DOTALL)
print("How many times this words are repeated in adjacent sentences:")
cnt2 = Counter()
for n, s in enumerate(frasi):
    words = re.findall("\w+", s)
    wordfound = []
    for word in words:
        try:
            # substring check against the next sentence
            if word in frasi[n + 1]:
                wordfound.append(word)
                if wordfound.count(word) < 2:
                    cnt2[word] += 1
        except IndexError:
            pass
for k, v in cnt2.items():
    print(k, v)
output
How many times these words are repeated in adjacent sentences:
had 1
hole 1
or 1
as 1
little 2
that 1
hot 1
large 1
it 5
to 5
a 6
not 3
and 2
s 1
me 1
bottle 1
is 1
no 1
the 6
how 1
Oh 1
she 2
at 1
marked 1
think 1
VERY 1
I 2
door 1
red 1
of 1
dear 1
see 1
could 2
in 2
so 1
was 1
poison 1
A 1
Alice 3
all 1
nice 1
rabbit 1

Related

Python How to count how many times each of the vocabulary words shows in the sentence? [closed]

Hey guys, I'm confused and very unsure why my code is not working. What I am doing in this code is trying to find certain words from a list in a sentence I have, and to output the number of times each is repeated within the sentence.
vocabulary =["in","on","to","www"]
numwords = [0,0,0,0]
mysentence = (" ʺAlso the sea tosses itself and breaks itself, and should any  sleeper fancying that he might find on the beach an answer to his doubts, a  sharer of his solitude, throw off his bedclothes and go down by himself to  walk on the sand, no image with semblance of serving and divine  promptitude comes readily to hand bringing the night to order and making  the world reflect the compass of the soul.ʺ)
for word in mysentence.split():
    if (word == vocabulary):
    else:
        numwords[0] += 1
    if (word == vocabulary):
    else:
        numwords[1] += 1
    if (word == vocabulary):
    else:
        numwords[2] += 1
    if (word == vocabulary):
    else:
        numwords[3] += 1
    if (word == vocabulary):
    else:
        numwords[4] += 1
print "total number of words : " + str(len(mysentence))
The easiest way to do this is to use collections.Counter to count all the words in the sentence, and then look up the ones you're interested in.
from collections import Counter
vocabulary =["in","on","to","www"]
mysentence = "Also the sea tosses itself and breaks itself, and should any sleeper fancying that he might find on the beach an answer to his doubts, a sharer of his solitude, throw off his bedclothes and go down by himself to walk on the sand, no image with semblance of serving and divine promptitude comes readily to hand bringing the night to order and making the world reflect the compass of the soul."
mysentence = mysentence.split()
c = Counter(mysentence)
numwords = [c[i] for i in vocabulary]
print(numwords)
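If you also want the counts labelled by their words, you can zip them back together (a small optional follow-up using the same names as above):
word_counts = dict(zip(vocabulary, numwords))
print(word_counts)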
Presumably you could iterate through the list with a for loop, checking whether each item matches the word and incrementing the counter - an example implementation might look like:
def find_word(word, words):
    # words is expected to be a list of words, e.g. from str.split()
    word_count = 0
    for i in range(len(words)):
        if words[i] == word:
            word_count += 1
    return word_count
This might be a little inefficient, but I'm sure it is easier to understand than collections.Counter :)
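For example, a hedged usage sketch (find_word is the helper defined above; the sentence is just sample input, not from the question):
vocabulary = ["in", "on", "to", "www"]
mysentence = "to be or not to be"
numwords = [find_word(v, mysentence.split()) for v in vocabulary]
print(numwords)  # [0, 0, 2, 0]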
Honestly, I would just check it like this:
for word in mysentence.split():
    if word in vocabulary:
        numwords[vocabulary.index(word)] += 1
Therefore your entire code would look like this:
vocabulary = ["in", "on", "to", "www"]
numwords = [0, 0, 0, 0]
mysentence = (" ʺAlso the sea tosses itself and breaks itself, and should any sleeper fancying that he might find on the beach an answer to his doubts, a sharer of his solitude, throw off his bedclothes and go down by himself to walk on the sand, no image with semblance of serving and divine promptitude comes readily to hand bringing the night to order and making the world reflect the compass of the soul.ʺ")
for word in mysentence.replace('.', '').replace(',', '').split():
    if word in vocabulary:
        numwords[vocabulary.index(word)] += 1
print("total number of words : " + str(len(mysentence)))  # note: this is the character count; use len(mysentence.split()) for words
As @Jacob suggested, replacing the '.' and ',' characters before the split avoids any possible conflicts.
Consider the issue that characters like “ and ” may not parse well unless an appropriate encoding scheme has been specified.
this_is_how_you_define_a_string = "The string goes here"
# and thus:
mysentence = "Also the sea tosses itself and breaks itself, and should any sleeper fancying that he might find on the beach an answer to his doubts, a sharer of his solitude, throw off his bedclothes and go down by himself to walk on the sand, no image with semblance of serving and divine promptitude comes readily to hand bringing the night to order and making the world reflect the compass of the soul."
for v in vocabulary:
    v in mysentence  # notice the indentation of 4 spaces
This expression evaluates to True or False depending on whether v is in mysentence. I think I will leave how to accumulate the values as an exercise. Hint: True == 1 and False == 0; you need the sum of the true values for each word v.
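A minimal sketch of that hint (note I compare whole words here rather than using the substring test above, so 'in' does not also match inside 'find'; the sentence is just sample input):
vocabulary = ["in", "on", "to", "www"]
mysentence = "go down by himself to walk on the sand"
# True sums as 1 and False as 0, so this counts matching words
numwords = [sum(word == v for word in mysentence.split()) for v in vocabulary]
print(numwords)  # [0, 1, 1, 0]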

how to search char from file in python

Actually I am coming from C++ and I am new here as well, and I am having an iteration problem. I am using Python 2.7.8 and unable to solve it the way I want. I have a file named "foo.txt". Through code I am trying to find how many "a e i o u" occur in the file. I have created an array, vowels = ['a','e','i','o','u'], and my code should give me the combined count of all vowels. But I am facing this
error:
TypeError: list indices must be integers, not str
file foo.txt
Chronobiology might sound a little futuristic – like something from a science fiction novel, perhaps – but it’s actually a field of study that concerns one of the oldest processes life on this planet has ever known: short-term rhythms of time and their effect on flora and fauna.
This can take many forms. Marine life, for example, is influenced by tidal patterns. Animals tend to be active or inactive depending on the position of the sun or moon. Numerous creatures, humans included, are largely diurnal – that is, they like to come out during the hours of sunlight. Nocturnal animals, such as bats and possums, prefer to forage by night. A third group are known as crepuscular: they thrive in the low-light of dawn and dusk and remain inactive at other hours.
When it comes to humans, chronobiologists are interested in what is known as the circadian rhythm. This is the complete cycle our bodies are naturally geared to undergo within the passage of a twenty-four hour day. Aside from sleeping at night and waking during the day, each cycle involves many other factors such as changes in blood pressure and body temperature. Not everyone has an identical circadian rhythm. ‘Night people’, for example, often describe how they find it very hard to operate during the morning, but become alert and focused by evening. This is a benign variation within circadian rhythms known as a chronotype.
my code:
fo = open("foo.txt", "r")
count = 0
for i in fo:
word = i
vowels = ['a','e','i','o','u','y']
word = word.lower().strip(".:;?!")
#print word
for j in word: # wanting that loop shd iterate till the end of file
for k in vowels: # wanting to index string array until **vowels.length()**
if (vowels[k] == word[j]):
count +=1
#print word[0]
print count
Python has a wonderful module called collections with a function Counter. You can use it like this:
import collections
with open('foo.txt') as f:
    letters = collections.Counter(f.read())
vowels = ['a','e','i','o','u','y']
## you just want the sum
print(sum(letters[vowel] for vowel in vowels))
You can also do it without collections.Counter():
import itertools
vowels = {'a','e','i','o','u','y'}
with open("foo.txt") as f:
print(sum(1 for char in itertools.chain.from_iterable(f) if char in vowels))
Please note that the time complexity of a set {} lookup is O(1), whereas the time complexity for a list [] lookup is O(n) according to this page on wiki.python.org.
I tested both methods with the timeit module and, as expected, the first method using collections.Counter() is slightly faster:
0.13573385099880397
0.16710168996360153
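For reference, the comparison might have been run along these lines (a sketch; the file name and the repeat count are assumptions, not taken from the answer):
import timeit

setup = "import collections, itertools"

method1 = """
with open('foo.txt') as f:
    letters = collections.Counter(f.read())
vowels = ['a', 'e', 'i', 'o', 'u', 'y']
total = sum(letters[vowel] for vowel in vowels)
"""

method2 = """
vowels = {'a', 'e', 'i', 'o', 'u', 'y'}
with open('foo.txt') as f:
    total = sum(1 for char in itertools.chain.from_iterable(f) if char in vowels)
"""

print(timeit.timeit(method1, setup=setup, number=100))
print(timeit.timeit(method2, setup=setup, number=100))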
Use for k in range(len(vowels)) instead, because if you use for k in vowels, k will be 'a', then 'e', then 'i', etc. However, the syntax for getting items via indexes is vowels[index_number], not vowels[content]. So you have to iterate over the length of the list, and use vowels[0] to get 'a', vowels[1] to get 'e', and so on.
fo = open("foo.txt", "r")
count = 0
for i in fo:
word = i
vowels = ['a','e','i','o','u','y']
word = word.lower().strip(".:;?!")
#print word
for j in range(len(word)): # wanting that loop shd iterate till the end of file
if (word[j] in vowels):
count +=1
#print word[0]
print count
Python prides itself on its abstraction and standard library data structures. Check out collections.Counter. It takes an iterable and returns a dict of value -> frequency.
import collections

with open('foo.txt') as f:
    string = f.read()
counter = collections.Counter(string)  # a string is an iterable of characters
vowel_counts = {vowel: counter[vowel] for vowel in "aeiou"}

Let Python take in sentence by sentence instead of word by word?

I have a series of strings, and I want Python to take them sentence by sentence when creating a tuple. For example:
string = [("I am a good boy"), ("I am a good girl")]
tuple = [("I am a good boy", -1), ("I am a good girl", -1)]
But apparently it's doing:
tuple = [("I", -1), ("am", -1), ("a", -1), ("good", -1), ("boy", -1).....]
What went wrong and how do I resolve it?
import re
def cleanedthings(trainset):
    cleanedtrain = []
    specialch = "!##$%^&*-=_+:;\".,/?`~][}{|)("
    for line in trainset:
        for word in line.split():
            lowword = word.lower()
            for ch in specialch:
                if ch in lowword:
                    lowword = lowword.replace(ch,"")
            if len(lowword) >= 3:
                cleanedtrain.append(lowword)
    return cleanedtrain
poslinesTrain = [('I just wanted to drop you a note to let you know how happy I am with my cabinet'), ('The end result is a truly amazing transformation!'), ('Who can I thank for this?'), ('For without his artistry and craftmanship this transformation would not have been possible.')]
neglinesTrain = [('I have no family and no friends, very little food, no viable job and very poor future prospects.'), ('I have therefore decided that there is no further point in continuing my life.'), ('It is my intention to drive to a secluded area, near my home, feed the car exhaust into the car, take some sleeping pills and use the remaining gas in the car to end my life.')]
poslinesTest = [('Another excellent resource from Teacher\'s Clubhouse!'), ('This cake tastes awesome! It\'s almost like I\'m in heaven already oh God!'), ('Don\'t worry too much, I\'ll always be here for you when you need me. We will be playing games or watching movies together everytime to get your mind off things!'), ('Hey, this is just a simple note for you to tell you that you\'re such a great friend to be around. You\'re always being the listening ear to us, and giving us good advices. Thanks!')]
neglinesTest = [('Mum, I could write you for days, but I know nothing would actually make a difference to you.'), ('You are much too ignorant and self-concerned to even attempt to listen or understand. Everyone knows that.'), ('If I were, your BITCHY comments that I\'m assuming were your attempt to help, wouldn\'t have.'), ('If I have stayed another minute I would have painted the walls and stained the carpets with my blood, so you could clean it up... I wish I were never born.')]
clpostrain = cleanedthings(poslinesTrain)
clnegtrain = cleanedthings(neglinesTrain)
clpostest = cleanedthings(poslinesTest)
clnegtest = cleanedthings(neglinesTest)
trainset= [(x,1) for x in clpostrain] + [(x,-1) for x in clnegtrain]
testset= [(x,1) for x in clpostest] + [(x,-1) for x in clnegtest]
print testset
You built the final result word by word instead of sentence by sentence. Adding a variable that collects each sentence's words will fix your error:
def cleanedthings(trainset):
    cleanedtrain = []
    specialch = "!##$%^&*-=_+:;\".,/?`~][}{|)("
    for line in trainset:
        # the cleaned words of the current sentence are collected here
        sentence = []
        for word in line.split():
            lowword = word.lower()
            for ch in specialch:
                if ch in lowword:
                    lowword = lowword.replace(ch,"")
            if len(lowword) >= 3:
                sentence.append(lowword)
        # once all words are checked, recreate the sentence joined by whitespace
        # and append it to the list of cleaned sentences
        cleanedtrain.append(' '.join(sentence))
    return cleanedtrain
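For instance, with a shortened version of the poslinesTrain data from the question, the fixed function keeps one cleaned string per sentence (a quick sketch):
poslinesTrain = [
    'I just wanted to drop you a note to let you know how happy I am with my cabinet',
    'The end result is a truly amazing transformation!',
]
clpostrain = cleanedthings(poslinesTrain)
trainset = [(x, 1) for x in clpostrain]
print(trainset)
# [('just wanted drop you note let you know how happy with cabinet', 1),
#  ('the end result truly amazing transformation', 1)]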

Function that insert words into text

I have a text that goes like this:
text = "All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood."
How do I write a function hedging(text) that processes my text and produces a new version that inserts the word "like" after every third word of the text?
The outcome should look like this:
text2 = "All human beings like are born free like and equal in like..."
Thank you!
Instead of giving you something like
solution=' like '.join(map(' '.join, zip(*[iter(text.split())]*3)))
I'm posting general advice on how to approach the problem. The "algorithm" is not particularly "pythonic", but hopefully it is easy to understand:
words = split text into words
number of words processed = 0
for each word in words
    output word
    number of words processed += 1
    if number of words processed is divisible by 3 then
        output like
Let us know if you have questions.
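A direct translation of that pseudocode into Python might look like this (a sketch; the function name hedging is taken from the question):
def hedging(text):
    result = []
    for count, word in enumerate(text.split(), start=1):
        result.append(word)
        if count % 3 == 0:  # every third word, append "like"
            result.append("like")
    return ' '.join(result)

text = "All human beings are born free and equal in dignity and rights."
print(hedging(text))
# All human beings like are born free like and equal in like dignity and rights. like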
You could go with something like this:
' '.join([n + ' like' if i % 3 == 2 else n for i, n in enumerate(text.split())])
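Applied to the first sentence of the question's text, this one-liner produces the requested interleaving:
text = "All human beings are born free and equal in dignity and rights."
print(' '.join([n + ' like' if i % 3 == 2 else n for i, n in enumerate(text.split())]))
# All human beings like are born free like and equal in like dignity and rights. like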

NER naive algorithm

I never really dealt with NLP, but I had an idea about NER which should NOT have worked and somehow DOES work exceptionally well in one case. I do not understand why it works, why it doesn't work elsewhere, or whether it can be extended.
The idea was to extract names of the main characters in a story through:
Building a dictionary for each word
Filling for each word a list with the words that appear right next to it in the text
Finding for each word the word whose list correlates with it most (meaning that the two words are used similarly in the text)
Given one name of a character in the story, assuming that the words used like it should be character names as well (bogus, which is why this should not work; but since I never dealt with NLP until this morning, I started the day naive)
I ran the overly simple code (attached below) on Alice in Wonderland, which for "Alice" returns:
21 ['Mouse', 'Latitude', 'William', 'Rabbit', 'Dodo', 'Gryphon', 'Crab', 'Queen', 'Duchess', 'Footman', 'Panther', 'Caterpillar', 'Hearts', 'King', 'Bill', 'Pigeon', 'Cat', 'Hatter', 'Hare', 'Turtle', 'Dormouse']
Though it filters for upper-case words (and receives "Alice" as the word to cluster around), originally there are ~500 upper-case words, and the result is still pretty spot-on as far as main characters go.
It does not work as well for other characters or in other stories, though it gives interesting results.
Any idea whether this approach is usable or extendable, or why it works at all in this story for "Alice"?
Thanks!
#English Name recognition
import re
import sys
import random
from string import upper
def mimic_dict(filename):
    # maps each word to the list of words that follow it in the text
    dict = {}
    f = open(filename)
    text = f.read()
    f.close()
    prev = ""
    words = text.split()
    for word in words:
        m = re.search("\w+", word)
        if m == None:
            continue
        word = m.group()
        if not prev in dict:
            dict[prev] = [word]
        else:
            dict[prev] = dict[prev] + [word]
        prev = word
    return dict

def main():
    if len(sys.argv) != 2:
        print 'usage: ./main.py file-to-read'
        sys.exit(1)
    dict = mimic_dict(sys.argv[1])
    upper = []
    for e in dict.keys():
        if len(e) > 1 and e[0].isupper():
            upper.append(e)
    print len(upper), upper
    exclude = ["ME","Yes","English","Which","When","WOULD","ONE","THAT","That","Here","and","And","it","It","me"]
    exclude = [x for x in exclude if dict.has_key(x)]
    for s in exclude:
        del dict[s]
    # for each word, find the word whose follower list overlaps with it most
    scores = {}
    for key1 in dict.keys():
        max = 0
        for key2 in dict.keys():
            if key1 == key2: continue
            a = dict[key1]
            k = dict[key2]
            diff = []
            for ia in a:
                if ia in k and ia not in diff:
                    diff.append(ia)
            if len(diff) > max:
                max = len(diff)
                scores[key1] = (key2, max)
    dictscores = {}
    names = []
    for e in scores.keys():
        if scores[e][0] == "Alice" and e[0].isupper():
            names.append(e)
    print len(names), names

if __name__ == '__main__':
    main()
From the looks of your program and previous experience with NER, I'd say this "works" because you're not doing a proper evaluation. You've found "Hare" where you should have found "March Hare".
The difficulty in NER (at least for English) is not finding the names; it's detecting their full extent (the "March Hare" example); detecting them even at the start of a sentence, where all words are capitalized; classifying them as person/organisation/location/etc.
Also, Alice in Wonderland, being a children's novel, is a rather easy text to process. Newswire phrases like "Microsoft CEO Steve Ballmer" pose a much harder problem; here, you'd want to detect
[ORG Microsoft] CEO [PER Steve Ballmer]
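For comparison, a modern off-the-shelf NER tagger produces exactly this kind of span-plus-label output. A minimal sketch with spaCy, assuming the library and its small English model are installed (an illustration, not part of the original answer):
import spacy

nlp = spacy.load("en_core_web_sm")  # requires: python -m spacy download en_core_web_sm
doc = nlp("Microsoft CEO Steve Ballmer announced the deal in Redmond.")
for ent in doc.ents:
    print(ent.text, ent.label_)
# typically: Microsoft ORG / Steve Ballmer PERSON / Redmond GPE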
What you are doing is building a distributional thesaurus: finding words which are distributionally similar to a query (e.g. Alice), i.e. words that appear in similar contexts. This does not automatically make them synonyms, but it means they are in a way similar to the query. The fact that your query is a named entity does not on its own guarantee that the similar words you retrieve will be named entities. However, since Alice, the Hare and the Queen tend to appear in similar contexts because they share some characteristics (e.g. they all speak, walk, cry, etc.; the details of Alice in Wonderland escape me), they are more likely to be retrieved. It turns out that whether a word is capitalised is a very useful piece of information when working out if something is a named entity. If you do not filter out the non-capitalised words, you will see many other neighbours that are not named entities.
Have a look at the following papers to get an idea of what people do with distributional semantics:
Lin 1998
Grefenstette 1994
Schuetze 1998
To put your idea in the terminology used in these papers: Step 2 builds a context vector for each word from a window of size 1. Step 3 resembles several well-known similarity measures in distributional semantics (most notably the so-called Jaccard coefficient).
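As a concrete illustration of that last point: the overlap computed in Step 3 is close to the Jaccard coefficient |A ∩ B| / |A ∪ B| of the two words' context sets. A hedged sketch with toy data, not the original code:
def jaccard(context_a, context_b):
    # Jaccard coefficient of two context-word sets
    a, b = set(context_a), set(context_b)
    return len(a & b) / len(a | b)

# toy context lists, as Step 2 would collect them
contexts = {
    "Alice": ["said", "thought", "went", "cried"],
    "Queen": ["said", "shouted", "went"],
}
print(jaccard(contexts["Alice"], contexts["Queen"]))  # 2/5 = 0.4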
As larsmans pointed out, this seems to work so well because you are not doing a proper evaluation. If you ran this against a hand-annotated corpus, you would find it is very bad at identifying the boundaries of named entities, and it does not even attempt to guess whether they are people, places, or organisations... Nevertheless, it is a great first attempt at NLP; keep it up!
