Python: Creating a function counting specific words in a textfile


I want to create a function that returns how many times a specific word occurs in a text file.
Here's what I currently have:
def Word_Counter(Text_File, Word):
    Data = open(Text_File, 'r').read().lower()
    count = Data.count(Word)
    print(Word, "; ", count)
Word_Counter('Example.txt', "the")
Which returns: "the ; 35"
That is pretty much what I want it to do. But what if I want to test a text for a range of words? I want the words (keys) and their counts (values) in, say, a list or dictionary. What's a way of doing that without using modules?
Say I tested the function with this list of words: ['time', 'when', 'left', 'I', 'do', 'an', 'who', 'what', 'sometimes'].
The results I would like would be something like:
Word Counts = {'time': 1, 'when': 4, 'left': 0, 'I': 5, 'do': 2, 'an': 0, 'who': 1, 'what': 3, 'sometimes': 1}
I have been able to create a dictionary which does a word count for every word, like example below.
wordfreq = {}
for word in words.replace(',', ' ').split():
    wordfreq[word] = wordfreq.setdefault(word, 0) + 1
I'd like to do a similar style but only targeting specific words, any suggestions?

Based on your code (untested):
def Word_Counter(Text_File, word_list):
    Data = open(Text_File, 'r').read().lower()
    output = {}
    for word in word_list:
        # The text was lowercased, so lowercase the search word as well
        output[word] = Data.count(word.lower())
    return output
Or you can do this
text = open("sample.txt", "r")
# Create an empty dictionary
d = dict()
# Loop through each line of the file
for line in text:
    # Remove the leading spaces and newline character
    line = line.strip()
    # Convert the characters in line to
    # lowercase to avoid case mismatch
    line = line.lower()
    # Split the line into words
    words = line.split(" ")
    # Iterate over each word in line
    for word in words:
        # Check if the word is already in dictionary
        if word in d:
            # Increment count of word by 1
            d[word] = d[word] + 1
        else:
            # Add the word to dictionary with count 1
            d[word] = 1

UPDATE
Try the following:
keywords = ['the', 'that']
worddict = {}
with open('out.txt', 'r') as f:
    text = f.read().split()  # or f.read().split(',') for comma-separated text
for word in text:
    worddict[word] = worddict[word] + 1 if word in worddict else 1
# Build a dictionary of the keywords only; a keyword that never occurs gets 0
print({x: worddict.get(x, 0) for x in keywords})
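One caveat with the str.count approach above: it counts substrings, so 'an' is also found inside 'want'. A minimal sketch of whole-word counting for a chosen list of words, without any imports, might look like the following. The function name count_target_words and the punctuation set are illustrative assumptions, not part of the original answers.

```python
def count_target_words(text, targets):
    """Count whole-word, case-insensitive occurrences of each target word."""
    # Assumed punctuation set; extend it to suit the input text.
    punctuation = '.,;:!?"()[]'
    counts = {word: 0 for word in targets}
    # Map each lowercase form back to the key the caller supplied (e.g. 'I').
    lookup = {word.lower(): word for word in targets}
    for token in text.lower().split():
        token = token.strip(punctuation)
        if token in lookup:
            counts[lookup[token]] += 1
    return counts


sample = "I left when I do. What an odd time, I want to say!"
print(count_target_words(sample, ["time", "when", "left", "I", "do", "an", "want"]))
```

To run it on a file, read the contents inside a with open(...) block and pass the resulting string as text.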

Related

Word Frequency HW

Write a program that asks a user for a file name, then reads in the file. The program should then determine how frequently each word in the file is used. The words should be counted regardless of case; for example, Spam and spam would both be counted as the same word. You should disregard punctuation. The program should then output the words and how frequently each word is used. The output should be sorted from the most frequent word to the least frequent word.
Only problem I am having is getting the code to count "The" and "the" as the same thing. The code counts them as different words.
userinput = input("Enter a file to open:")
if len(userinput) < 1 : userinput = 'ran.txt'
f = open(userinput)
di = dict()
for lin in f:
    lin = lin.rstrip()
    wds = lin.split()
    for w in wds:
        di[w] = di.get(w,0) + 1
lst = list()
for k,v in di.items():
    newtup = (v, k)
    lst.append(newtup)
lst = sorted(lst, reverse=True)
print(lst)
I need to count "the" and "The" as one single word.

We start by getting the words in a list and converting them all to lowercase. You can disregard punctuation by replacing each punctuation character in the string with a space.
punctuations = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
s = "I want to count how many Words are there.i Want to Count how Many words are There"
for punc in punctuations:
    s = s.replace(punc, ' ')
# split() with no argument also discards empty strings left by repeated spaces
words = s.split()
words = [word.lower() for word in words]
We then iterate through the list, and update a frequency map.
freq = {}
for word in words:
    if word in freq:
        freq[word] += 1
    else:
        freq[word] = 1
print(freq)
#{'i': 2, 'want': 2, 'to': 2, 'count': 2, 'how': 2, 'many': 2,
#'words': 2, 'are': 2, 'there': 2}

You can use Counter and re like this:
from collections import Counter
import re
sentence = 'Egg ? egg Bird, Goat afterDoubleSpace\nnewline'
# some punctuation (you can add more here); a raw string keeps \? a regex escape
punctuationsToBeremoved = r",|\n|\?"
# convert everything to lower case
sentence = sentence.lower()
# replace the punctuation with spaces
sentence = re.sub(punctuationsToBeremoved, " ", sentence)
# getting the word list
words = sentence.split()
# printing the frequency of each word
print(Counter(words))
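The assignment also asks for output sorted from most to least frequent, which the answers above do not show. A minimal sketch, assuming ASCII punctuation only and using the hypothetical helper name word_frequencies, could combine the lowercasing, punctuation removal and sorting steps like this:

```python
import string
from collections import Counter


def word_frequencies(text):
    """Return (word, count) pairs, most frequent first, ignoring case and punctuation."""
    # Replace every ASCII punctuation character with a space, then lowercase.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    words = text.translate(table).lower().split()
    # most_common() orders the pairs by descending count.
    return Counter(words).most_common()


for word, count in word_frequencies("The spam, the eggs. SPAM and the ham!"):
    print(word, count)
```

Note that replacing apostrophes with spaces splits contractions such as "don't" into two tokens; whether that is acceptable depends on the assignment.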

Python - Locating Duplicate Words in a Text File

I was wondering if you could help me with a Python programming issue. I'm currently trying to write a program that reads a text file and outputs "word 1 True" if the word has already occurred in that file, or "word 1 False" if this is the first time the word appears.
Here's what I came up with:
fh = open(fname)
lst = list ()
for line in fh:
    words = line.split()
    for word in words:
        if word in words:
            print("word 1 True", word)
        else:
            print("word 1 False", word)
However, it only returns "word 1 True"
Please advise.
Thanks!

A simple (and fast) way to implement this would be with a python dictionary. These can be thought of like an array, but the index-key is a string rather than a number.
This gives some code fragments like:
found_words = {} # empty dictionary
words1 = open("words1.txt","rt").read().split(' ') # TODO - handle punctuation
for word in words1:
    if word in found_words:
        print(word + " already in file")
    else:
        found_words[word] = True # could be set to anything
Now when processing your words, simply checking to see if the word already exists in the dictionary indicates that it was seen already.
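Since only membership matters here, a set works as well as a dictionary. The sketch below is one way to produce the True/False flag the question asks for; the function name flag_repeats is an illustrative assumption, and it takes a list of words so it can be tried without a file.

```python
def flag_repeats(words):
    """Return (word, seen_before) pairs in the order the words appear."""
    seen = set()
    flags = []
    for word in words:
        # The membership test runs before the word is recorded,
        # so the first occurrence is flagged False.
        flags.append((word, word in seen))
        seen.add(word)
    return flags


for word, seen_before in flag_repeats("the cat saw the other cat".split()):
    print(word, seen_before)
```

The original attempt tested each word against the list it was taken from (word in words), which is always true; keeping a separate collection of words already seen avoids that.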

You might also want to track previous locations, something like this:
with open(fname) as fh:
    vocab = {}
    for i, line in enumerate(fh):
        words = line.split()
        for j, word in enumerate(words):
            if word in vocab:
                # vocab[word] holds the (line, column) pairs seen so far
                locations = vocab[word]
                print(word, "occurs at", locations)
                locations.append((i, j))
            else:
                vocab[word] = [(i, j)]
                # print("First occurrence of", word)

This code snippet doesn't use a file, but it's easy to test and study. The main difference is that you must open the file and read it line by line, as you did in your example.
example_file = """
This is a text file example
Let's see how many time example is typed.
"""
result = {}
words = example_file.split()
for word in words:
    # if the word is not in the result dictionary, the default value is 0 + 1
    result[word] = result.get(word, 0) + 1
for word, occurrence in result.items():
    print("word:%s; occurrence:%s" % (word, occurrence))
UPDATE:
As suggested by @khachik, a better solution is to use Counter.
>>> # Find the ten most common words in Hamlet
>>> import re
>>> from collections import Counter
>>> words = re.findall(r'\w+', open('hamlet.txt').read().lower())
>>> Counter(words).most_common(10)
[('the', 1143), ('and', 966), ('to', 762), ('of', 669), ('i', 631),
('you', 554), ('a', 546), ('my', 514), ('hamlet', 471), ('in', 451)]

Following your route you could do this:
with open('tyger.txt', 'r') as f:
    lines = (f.read()).split()
    for word in lines:
        if lines.count(word) > 1:
            print(f"{word}: True")
        else:
            print(f"{word}: False")
Output
(xenial)vash@localhost:~/python/stack_overflow$ python3.7 read_true.py
When: False
the: True
stars: False
threw: False
down: False
their: True
spears: False
...
You could also count every word:
with open('tyger.txt', 'r') as f:
    count = {}
    lines = f.read()
    lines = lines.split()
    for i in lines:
        count[i] = lines.count(i)
print(count)
Output
{'When': 1, 'the': 2, 'stars': 1, 'threw': 1, 'down': 1, 'their': 2,
'spears': 1, 'And': 1, "water'd": 1, 'heaven': 1, 'with': 1, 'tears:':
1, 'Did': 2, 'he': 2, 'smile': 1, 'his': 1, 'work': 1, 'to': 1,
'see?': 1, 'who': 1, 'made': 1, 'Lamb': 1, 'make': 1, 'thee?': 1}
You can use the dictionary like so:
for k in count:
    if count[k] > 1:
        print(f"{k}: True")
    else:
        print(f"{k}: False")
Output
When: False
the: True
stars: False
threw: False
down: False
their: True
spears: False

Change a file to a list to a dictionary

I am trying to write a code that takes the text from a novel and converts it to a dictionary where the keys are each unique word and the values are the number of occurrences of the word in the text.
For example it could look like: {'the': 25, 'girl': 59...etc}
I have been trying to make the text first into a list and then use the Counter function to make a dictionary of all the words:
source = open('novel.html', 'r', encoding = "UTF-8")
soup = BeautifulSoup(source, 'html.parser')
#make a list of all the words in file, get rid of words that aren't content
mylist = []
mylist.append(soup.find_all('p'))
newlist = filter(None, mylist)
cnt = collections.Counter()
for line in newlist:
    try:
        if line is not None:
            words = line.split(" ")
            for word in line:
                cnt[word] += 1
    except:
        pass
print(cnt)
This code doesn't work because of an error with "NoneType" or it just prints an empty list. I'm wondering if there is an easier way to do what I'm trying to do or how I can fix this code so it won't have this error.

import collections
from bs4 import BeautifulSoup
with open('novel.html', 'r', encoding='UTF-8') as source:
    soup = BeautifulSoup(source, 'html.parser')
cnt = collections.Counter()
for tag in soup.find_all('p'):
    # get_text() also works when the <p> contains nested tags, where tag.string would be None
    for word in tag.get_text().split():
        word = ''.join(ch for ch in word.lower() if ch.isalnum())
        if word != '':
            cnt[word] += 1
print(cnt)
The with statement is simply a safer way to open the file.
soup.find_all returns a list of Tags.
tag.get_text().split() gets all the words (separated by whitespace) from the Tag.
word = ''.join(ch for ch in word.lower() if ch.isalnum()) removes punctuation and converts to lowercase so that 'Hello' and 'hello!' count as the same word.

For the counter, just do:
from collections import Counter
cnt = Counter(mylist)
Are you sure your list is getting items to begin with? After what step are you getting an empty list?

Once you've converted your page to a list, try something like this out:
#create dictionary and fake list
d = {}
x = ["hi", "hi", "hello", "hey", "hi", "hello", "hey", "hi"]
#count the times a unique word occurs and add that pair to your dictionary
for word in set(x):
    count = x.count(word)
    d[word] = count
Output:
{'hello': 2, 'hey': 2, 'hi': 4}

How can I create a list that is created using the indexes and items of two other lists?

I have two lists. One is made up of positions from a sentence and the other is made up of words that make up the sentence. I want to recreate the variable sentence using poslist and wordlist.
recreate = []
sentence = "This and that, and this and this."
poslist = [1, 2, 3, 2, 4, 2, 5]
wordlist = ['This', 'and', 'that', 'this', 'this.']
I wanted to use a for loop to go through poslist and if the item in poslist was equal to the position of a word in wordlist it would append it to a new list, recreating the original list. My first try was:
for index in poslist:
    recreate.append(wordlist[index])
print (recreate)
I had to convert the lists to strings to write them into a text file. When I tried splitting them again and using the code shown above, it did not work. The error said that list indices must be integers or slices, not a list. I would like a solution to this problem. Thank you.
The list of words is gotten using:
sentence = input("Enter a sentence >>") #asking the user for an input
sentence_lower = sentence.lower() #making the sentence lower case
wordlist = [] #creating a empty list
sentencelist = sentence.split() #making the sentence into a list
for word in sentencelist: #for loop iterating over the sentence as a list
    if word not in wordlist:
        wordlist.append(word)
txtfile = open ("part1.txt", "wt")
for word in wordlist:
    txtfile.write(word +"\n")
txtfile.close()
txtfile = open ("part1.txt", "rt")
for item in txtfile:
    print (item)
txtfile.close()
print (wordlist)
And the positions are gotten using:
poslist = []
textfile = open ("part2.txt", "wt")
for word in sentencelist:
    poslist.append([position + 1 for position, i in enumerate(wordlist) if i == word])
print (poslist)
str1 = " ".join(str(x) for x in poslist)
textfile = open ("part2.txt", "wt")
textfile.write (str1)
textfile.close()

Lists are 0-indexed (the first item has index 0, the second index 1, ...), so you have to subtract 1 from the indexes if you want to use "human" indexes in poslist:
for index in poslist:
    recreate.append(wordlist[index-1])
print (recreate)
Afterwards, you can glue them together again and write them to a file:
with open("thefile.txt", "w") as f:
    # join with a space so the words are not run together
    f.write(" ".join(recreate))

First, your code can be simplified to:
sentence = input("Enter a sentence >>") #asking the user for an input
sentence_lower = sentence.lower() #making the sentence lower case
wordlist = [] #creating a empty list
sentencelist = sentence.split() #making the sentence into a list
with open ("part1.txt", "wt") as txtfile:
    for word in sentencelist: #for loop iterating over the sentence as a list
        if word not in wordlist:
            wordlist.append(word)
            txtfile.write(word +"\n")
poslist = [wordlist.index(word) for word in sentencelist]
print (poslist)
str1 = " ".join(str(x) for x in poslist)
with open ("part2.txt", "wt") as textfile:
    textfile.write (str1)
In your original code, poslist was a list of lists instead of a list of integers.
Then, if you want to reconstruct your sentence from poslist (which is now a list of int and not a list of lists as in the code you provided) and wordlist, you can do the following:
sentence = ' '.join(wordlist[pos] for pos in poslist)
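The whole round trip can be sketched as below, including the step that caused the original error: positions read back from a text file are strings and must be converted to int before they can index a list. The helper names encode and decode are illustrative assumptions, and the file round trip is simulated with a string so the example runs on its own.

```python
def encode(sentence):
    """Split a sentence into unique words and 0-based positions into that list."""
    wordlist = []
    poslist = []
    for word in sentence.split():
        if word not in wordlist:
            wordlist.append(word)
        poslist.append(wordlist.index(word))
    return wordlist, poslist


def decode(wordlist, poslist):
    """Rebuild the sentence from the unique words and their positions."""
    return " ".join(wordlist[pos] for pos in poslist)


original = "This and that, and this and this."
wordlist, poslist = encode(original)
# Simulate writing the positions to a file and reading them back:
# they come back as strings, so convert them to int before indexing.
stored = " ".join(str(pos) for pos in poslist)
restored_positions = [int(p) for p in stored.split()]
assert decode(wordlist, restored_positions) == original
```

This uses 0-based positions throughout; with 1-based positions, subtract 1 when indexing, as in the answer above.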

You can also do it using a generator expression and the string join method:
sentence = ' '.join(wordlist[pos-1] for pos in poslist if 0 < pos <= len(wordlist))
# 'This and that, and this and this.'

You can use operator.itemgetter() for this.
from operator import itemgetter
poslist = [0, 1, 2, 1, 3, 1, 4]
wordlist = ['This', 'and', 'that', 'this', 'this.']
print(' '.join(itemgetter(*poslist)(wordlist)))
Note that I had to subtract one from all of the items in poslist, as Python is a zero-indexed language. If you need to programmatically change poslist, you could do poslist = (n - 1 for n in poslist) right after you declare it.

Trying to Find Most Occurring Name in a File

I have 4 text files that I want to read and find the top 5 most occurring names of. The text files have names in the following format "Rasmus,M,11". Below is my code which right now is able to call all of the text files and then read them. Right now, this code prints out all of the names in the files.
def top_male_names ():
    for x in range (2008, 2012):
        txt = "yob" + str(x) + ".txt"
        file_handle = open(txt, "r", encoding="utf-8")
        file_handle.seek(0)
        line = file_handle.readline().strip()
        while line != "":
            print (line)
            line = file_handle.readline().strip()
top_male_names()
My question is, how can I keep track of all of these names and find the 5 that occur most often? The only way I could think of was creating a variable for each name, but that wouldn't work because there are hundreds of entries in each text file, probably with hundreds of different names.

This is the gist of it:
from collections import Counter
counter = Counter()
for line in file_handle:
    # strip the trailing newline before splitting the three fields
    name, gender, age = line.strip().split(',')
    counter[name] += 1
print(counter.most_common(5))
You can adapt it to your program.
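One possible adaptation is sketched below. It assumes each line has exactly three comma-separated fields, as in "Rasmus,M,11", and it counts how many lines mention each name, as the snippet above does. The function name top_names and its parameters are hypothetical, and it takes any iterable of lines so it can be tried without the yob files.

```python
from collections import Counter


def top_names(lines, gender="M", n=5):
    """Tally names from 'name,gender,value' lines and return the n most common."""
    counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        name, line_gender, _value = line.split(",")
        if line_gender == gender:
            counter[name] += 1
    return counter.most_common(n)


sample_lines = ["Rasmus,M,11", "Anna,F,9", "Rasmus,M,7", "Oskar,M,3"]
print(top_names(sample_lines))
```

For the four files, one option is to collect the lines of every file into a single list (or keep one Counter and update it per file) before asking for the top 5. If the third field is actually a count rather than an age, adding int(_value) instead of 1 may be closer to what is wanted.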

If you need to count a number of words in a text, use regex.
For example
import re
my_string = "Wow! Is this true? Really!?!? This is crazy!"
words = re.findall(r'\w+', my_string) #This finds words in the document
Output:
>>> words
['Wow', 'Is', 'this', 'true', 'Really', 'This', 'is', 'crazy']
"Is" and "is" are two different words. So we can just capitalize all the words, and then count them.
from collections import Counter
cap_words = [word.upper() for word in words] #capitalizes all the words
word_counts = Counter(cap_words) #counts the number each time a word appears
Output:
>>> word_counts
Counter({'THIS': 2, 'IS': 2, 'CRAZY': 1, 'WOW': 1, 'TRUE': 1, 'REALLY': 1})
Now reading a file :
import re
from collections import Counter
with open('file.txt') as f: text = f.read()
words = re.findall(r'\w+', text )
cap_words = [word.upper() for word in words]
word_counts = Counter(cap_words)
Then you only have to sort the counts by value rather than by key and take the top 5 words; word_counts.most_common(5) does exactly that.
