Trying to Find Most Occurring Name in a File - python

I have 4 text files that I want to read and find the top 5 most occurring names of. The text files have names in the following format "Rasmus,M,11". Below is my code which right now is able to call all of the text files and then read them. Right now, this code prints out all of the names in the files.
def top_male_names ():
for x in range (2008, 2012):
txt = "yob" + str(x) + ".txt"
file_handle = open(txt, "r", encoding="utf-8")
file_handle.seek(0)
line = file_handle.readline().strip()
while line != "":
print (line)
line = file_handle.readline().strip()
top_male_names()
My question is, how can I keep track of all of these names, and find the top 5 that occur the most? The only way I could think of was creating a variable for each name, but that wouldn't work because there are 100s of entries in each text file, probably with 100s of different of names.

This is the gist of it:
from collections import Counter
counter = Counter()
for line in file_handle:
name, gender, age = line.split(',')
counter[name] += 1
print counter.most_common()
You can adapt it to your program.

If you need to count a number of words in a text, use regex.
For example
import re
my_string = "Wow! Is this true? Really!?!? This is crazy!"
words = re.findall(r'\w+', my_string) #This finds words in the document
Output::
>>> words
['Wow', 'Is', 'this', 'true', 'Really', 'This', 'is', 'crazy']
"Is" and "is" are two different words. So we can just capitalize all the words, and then count them.
from collections import Counter
cap_words = [word.upper() for word in words] #capitalizes all the words
word_counts = Counter(cap_words) #counts the number each time a word appears
Output:
>>> word_counts
Counter({'THIS': 2, 'IS': 2, 'CRAZY': 1, 'WOW': 1, 'TRUE': 1, 'REALLY': 1})
Now reading a file :
import re
from collections import Counter
with open('file.txt') as f: text = f.read()
words = re.findall(r'\w+', text )
cap_words = [word.upper() for word in words]
word_counts = Counter(cap_words)
Then you only have to sort the dict containing all the words, for the values not for keys and see the top 5 words.

Related

Matching 2 words in 2 lines and +1 to the matching pair?

So Ive got a variable list which is always being fed a new line
And variable words which is a big list of single word strings
Every time list updates I want to compare it to words and see if any strings from words are in list
If they do match, lets say the word and is in both of them, I then want to print "And : 1". Then if next sentence has that as well, to print "And : 2", etc. If another word comes in like The I want to print +1 to that
So far I have split the incoming text into an array with text.split() - unfortunately that is where im stuck. I do see some use in [x for x in words if x in list] but dont know how I would use that. Also how I would extract the specific word that is matching
You can use a collections.Counter object to keep a tally for each of the words that you are tracking. To improve performance, use a set for your word list (you said it's big). To keep things simple assume there is no punctuation in the incoming line data. Case is handled by converting all incoming words to lowercase.
from collections import Counter
words = {'and', 'the', 'in', 'of', 'had', 'is'} # words to keep counts for
word_counts = Counter()
lines = ['The rabbit and the mole live in the ground',
'Here is a sentence with the word had in it',
'Oh, it also had in in it. AND the and is too']
for line in lines:
tracked_words = [w for word in line.split() if (w:=word.lower()) in words]
word_counts.update(tracked_words)
print(*[f'{word}: {word_counts[word]}'
for word in set(tracked_words)], sep=', ')
Output
the: 3, and: 1, in: 1
the: 4, in: 2, is: 1, had: 1
the: 5, and: 3, in: 4, is: 2, had: 2
Basically this code takes a line of input, splits it into words (assuming no punctuation), converts these words to lowercase, and discards any words that are not in the main list of words. Then the counter is updated. Finally the current values of the relevant words is printed.
This does the trick:
sentence = "Hello this is a sentence"
list_of_words = ["this", "sentence"]
dict_of_counts = {} #This will hold all words that have a minimum count of 1.
for word in sentence.split(): #sentence.split() returns a list with each word of the sentence, and we loop over it.
if word in list_of_words:
if word in dict_of_counts: #Check if the current sentence_word is in list_of_words.
dict_of_counts[word] += 1 #If this key already exists in the dictionary, then add one to its value.
else:
dict_of_counts[word] = 1 #If key does not exists, create it with value of 1.
print(f"{word}: {dict_of_counts[word]}") #Print your statement.
The total count is kept in dict_of_counts and would look like this if you print it:
{'this': 1, 'sentence': 1}
You should use defaultdict here for the fastest processing.
from collections import defaultdict
input_string = "This is an input string"
list_of_words = ["input", "is"]
counts = defaultdict(int)
for word in input_string.split():
if word in list_of_words:
counts[word] +=1

Python: Creating a function counting specific words in a textfile

I want to create a function that returns the value of word count of a specific word in a text file.
Here's what I currently have:
def Word_Counter(Text_File, Word):
Data = open(Text_File, 'r').read().lower()
count = Data.count(Word)
print(Word, "; ", count)
Word_Counter('Example.txt', "the")
Which returns: "the ; 35"
That is pretty much what I want it to do. But what if I want to test a text for a range of words. I want the words (key) and values in say a list or dictionary. What's a way of doing that without using modules?
Say if I tested the function with this list of words: [time, when, left, I, do, an, who, what, sometimes].
The results I would like would be something like:
Word Counts = {'time': 1, 'when': 4, 'left': 0, 'I': 5, 'do': 2, 'an': 0, 'who': 1, 'what': 3, 'sometimes': 1}
I have been able to create a dictionary which does a word count for every word, like example below.
wordfreq = {}
for word in words.replace(',', ' ').split():
wordfreq[word] = wordfreq.setdefault(word, 0) + 1
I'd like to do a similar style but only targeting specific words, any suggestions?
From your given code, I did not test this.
def Word_Counter(Text_File, word_list):
Data = open(Text_File, 'r').read().lower()
output = {}
for word in word_list:
output[word] = Data.count(Word)
Or you can do this
text = open("sample.txt", "r")
# Create an empty dictionary
d = dict()
# Loop through each line of the file
for line in text:
# Remove the leading spaces and newline character
line = line.strip()
# Convert the characters in line to
# lowercase to avoid case mismatch
line = line.lower()
# Split the line into words
words = line.split(" ")
# Iterate over each word in line
for word in words:
# Check if the word is already in dictionary
if word in d:
# Increment count of word by 1
d[word] = d[word] + 1
else:
# Add the word to dictionary with count 1
d[word] = 1
UPDATE
Try the following:
keywords = ['the', 'that']
worddict = {}
with open('out.txt', 'r') as f:
text = f.read().split(' ') # or f.read().split(',')
for word in text:
worddict[word] = worddict[word]+1 if word in worddict else 1
print([{x, worddict[x]} for x in keywords])

Word Frequency HW

Write a program that asks a user for a file name, then reads in the file. The program should then determine how frequently each word in the file is used. The words should be counted regardless of case, for example Spam and spam would both be counted as the same word. You should disregard punctuation. The program should then output the the words and how frequently each word is used. The output should be sorted by the most frequent word to the least frequent word.
Only problem I am having is getting the code to count "The" and "the" as the same thing. The code counts them as different words.
userinput = input("Enter a file to open:")
if len(userinput) < 1 : userinput = 'ran.txt'
f = open(userinput)
di = dict()
for lin in f:
lin = lin.rstrip()
wds = lin.split()
for w in wds:
di[w] = di.get(w,0) + 1
lst = list()
for k,v in di.items():
newtup = (v, k)
lst.append(newtup)
lst = sorted(lst, reverse=True)
print(lst)
Need to count "the" and "The" as on single word.
We start by getting the words in a list, updating the list so that all words are in lowercase. You can disregard punctuation by replacing them from the string with an empty character
punctuations = '!"#$%&\'()*+,-./:;<=>?#[\\]^_`{|}~'
s = "I want to count how many Words are there.i Want to Count how Many words are There"
for punc in punctuations:
s = s.replace(punc,' ')
words = s.split(' ')
words = [word.lower() for word in words]
We then iterate through the list, and update a frequency map.
freq = {}
for word in words:
if word in freq:
freq[word] += 1
else:
freq[word] = 1
print(freq)
#{'i': 2, 'want': 2, 'to': 2, 'count': 2, 'how': 2, 'many': 2,
#'words': 2, 'are': #2, 'there': 2}
You can use counter and re like this,
from collections import Counter
import re
sentence = 'Egg ? egg Bird, Goat afterDoubleSpace\nnewline'
# some punctuations (you can add more here)
punctuationsToBeremoved = ",|\n|\?"
#to make all of them in lower case
sentence = sentence.lower()
#to clean up the punctuations
sentence = re.sub(punctuationsToBeremoved, " ", sentence)
# getting the word list
words = sentence.split()
# printing the frequency of each word
print(Counter(words))

Change a file to a list to a dictionary

I am trying to write a code that takes the text from a novel and converts it to a dictionary where the keys are each unique word and the values are the number of occurrences of the word in the text.
For example it could look like: {'the': 25, 'girl': 59...etc}
I have been trying to make the text first into a list and then use the Counter function to make a dictionary of all the words:
source = open('novel.html', 'r', encoding = "UTF-8")
soup = BeautifulSoup(source, 'html.parser')
#make a list of all the words in file, get rid of words that aren't content
mylist = []
mylist.append(soup.find_all('p'))
newlist = filter(None, mylist)
cnt = collections.Counter()
for line in newlist:
try:
if line is not None:
words = line.split(" ")
for word in line:
cnt[word] += 1
except:
pass
print(cnt)
This code doesn't work because of an error with "NoneType" or it just prints an empty list. I'm wondering if there is an easier way to do what I'm trying to do or how I can fix this code so it won't have this error.
import collections
from bs4 import BeautifulSoup
with open('novel.html', 'r', encoding='UTF-8') as source:
soup = BeautifulSoup(source, 'html.parser')
cnt = collections.Counter()
for tag in soup.find_all('p'):
for word in tag.string.split():
word = ''.join(ch for ch in word.lower() if ch.isalnum())
if word != '':
cnt[word] += 1
print(cnt)
with statement is simply a safer way to open the file
soup.find_all returns a list of Tag's
tag.string.split() gets all the words (separated by spaces) from the Tag
word = ''.join(ch for ch in word.lower() if ch.isalnum()) removes punctuation and convertes to lowercase so that 'Hello' and 'hello!' count as the same word
For the counter just do a
from collections import Counter
cnt = Counter(mylist)
Are you sure your list is getting items to begin with? After what step are you getting an empty list?
Once you've converted your page to a list, try something like this out:
#create dictionary and fake list
d = {}
x = ["hi", "hi", "hello", "hey", "hi", "hello", "hey", "hi"]
#count the times a unique word occurs and add that pair to your dictionary
for word in set(x):
count = x.count(word)
d[word] = count
Output:
{'hello': 2, 'hey': 2, 'hi': 4}

How to check for words in a list and then return a count?

I am relatively new to python and have a question:
I am trying to write a script that will read a .txt file and check if words are in a list that I've provided and then return a count as to how many words were in that list.
So far,
import string
#this is just an example of a list
list = ['hi', 'how', 'are', 'you']
filename="hi.txt"
infile=open(filename, "r")
lines = infile.readlines()
for line in lines:
words = line.split()
for word in words:
word = word.strip(string.punctuation)
I've tried to split the file into lines and then the lines into words without punctuation.
I am not sure where to go after this. I would like ultimately for the output to be something like this:
"your file has x words that are in the list".
Thank you!
You can split your file to words using the following command :
words=reduce(lambda x,y:x+y,[line.split() for line in f])
Then count the number of words in your word list with loop over it and using count function :
w_list = ['hi', 'how', 'are', 'you']
with open("hi.txt", "r") as f :
words=reduce(lambda x,y:x+y,[line.split() for line in f])
for w in w_list:
print "your file has {} {}".format(words.count(w),w)
# words to search for;
# (stored as a set so `word in search_for` is O(1))
search_for = set(["hi", "how", "are", "you"])
# get search text
# (no need to split into lines)
with open("hi.txt") as inf:
text = inf.read().lower()
# create translation table
# - converts non-word chars to spaces (this maintains appropriate word-breaks)
# - keeps apostrophe (for words like "don't" or "couldn't")
trans = str.maketrans(
"0123456789abcdefghijklmnopqrstuvwxyz'!#$%&()*+,-./:;<=>?#[\\]^_`{|}~\"\\",
" abcdefghijklmnopqrstuvwxyz' "
)
# apply translation table and split into words
words = text.translate(trans).split()
# count desired words
word_count = sum(word in search_for for word in words)
# show result
print("your file has {} words that are in the list".format(word_count))
Read file content by with statement and open() method.
Remove punctuation from the file content by string module.
Split file content by split() method and iterate every word by for loop.
Check if word is present in the input list or not and increment count value according yo that.
input file: hi.txt
hi, how are you?
hi, how are you?
code:
import string
input_list = ['hi', 'how', 'are', 'you']
filename="hi.txt"
count = 0
with open(filename, "rb") as fp:
data = fp.read()
data = data.translate(string.maketrans("",""), string.punctuation)
for word in data.split():
if word in input_list:
count += 1
print "Total number of word present in file from the list are %d"%(count)
Output:
vivek#vivek:~/Desktop/stackoverflow$ python 18.py
Total number of word present in file from the list are 8
vivek#vivek:~/Desktop/stackoverflow$
Do not use variable names which already define by python interpreter
e.g. list in your code.
>>> list
<type 'list'>
>>> a = list([1,2,3])
>>> a
[1, 2, 3]
>>> list = ["hi", "how"]
>>> b = list([1,2,3])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'list' object is not callable
>>>
use the len()
for example for a list;
enter code here
myList = ["c","b","a"]
len(myList)
it will return 3 meaning there are three items on your list.

Categories

Resources