Change a file to a list to a dictionary - python

I am trying to write code that takes the text from a novel and converts it to a dictionary where the keys are the unique words and the values are the number of occurrences of each word in the text.
For example it could look like: {'the': 25, 'girl': 59...etc}
I have been trying to make the text first into a list and then use the Counter function to make a dictionary of all the words:
source = open('novel.html', 'r', encoding = "UTF-8")
soup = BeautifulSoup(source, 'html.parser')
#make a list of all the words in file, get rid of words that aren't content
mylist = []
mylist.append(soup.find_all('p'))
newlist = filter(None, mylist)
cnt = collections.Counter()
for line in newlist:
    try:
        if line is not None:
            words = line.split(" ")
            for word in line:
                cnt[word] += 1
    except:
        pass
print(cnt)
This code doesn't work because of an error with "NoneType" or it just prints an empty list. I'm wondering if there is an easier way to do what I'm trying to do or how I can fix this code so it won't have this error.

import collections
from bs4 import BeautifulSoup
with open('novel.html', 'r', encoding='UTF-8') as source:
    soup = BeautifulSoup(source, 'html.parser')
cnt = collections.Counter()
for tag in soup.find_all('p'):
    # get_text() is used rather than tag.string, because tag.string
    # is None whenever a <p> contains nested tags
    for word in tag.get_text().split():
        # keep only letters and digits, lowercased
        word = ''.join(ch for ch in word.lower() if ch.isalnum())
        if word != '':
            cnt[word] += 1
print(cnt)
The with statement is simply a safer way to open the file: it closes the file automatically.
soup.find_all returns a list of Tag objects.
tag.get_text().split() gets all the words (separated by whitespace) from the Tag.
word = ''.join(ch for ch in word.lower() if ch.isalnum()) removes punctuation and converts to lowercase so that 'Hello' and 'hello!' count as the same word.

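For the counter, you can simply do:
from collections import Counter
cnt = Counter(mylist)
Note that Counter needs a flat list of words (hashable items), not a list that contains another list.
Are you sure your list is getting items to begin with? After which step do you get an empty list?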
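As a minimal sketch of what Counter does with a flat list of words (the sample list below is made up and simply stands in for the words extracted from the page):

```python
from collections import Counter

# Made-up sample data standing in for the words extracted from the page.
words = ["the", "girl", "saw", "the", "dog", "the"]

cnt = Counter(words)      # counts every hashable item in the iterable
word_counts = dict(cnt)   # a plain dict, if that is what you need

print(word_counts)
print(cnt.most_common(1))
```

A Counter is already a dict subclass, so the dict() conversion is only needed if you want a plain dictionary.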

Once you've converted your page to a list, try something like this out:
#create dictionary and fake list
d = {}
x = ["hi", "hi", "hello", "hey", "hi", "hello", "hey", "hi"]
#count the times a unique word occurs and add that pair to your dictionary
for word in set(x):
    count = x.count(word)
    d[word] = count
print(d)
Output:
{'hello': 2, 'hey': 2, 'hi': 4}

Related

Python: Creating a function counting specific words in a textfile

I want to create a function that returns the value of word count of a specific word in a text file.
Here's what I currently have:
def Word_Counter(Text_File, Word):
    Data = open(Text_File, 'r').read().lower()
    count = Data.count(Word)
    print(Word, "; ", count)
Word_Counter('Example.txt', "the")
Which returns: "the ; 35"
That is pretty much what I want it to do. But what if I want to test a text for a range of words? I want the words (keys) and their counts (values) in, say, a list or dictionary. What's a way of doing that without using modules?
Say if I tested the function with this list of words: [time, when, left, I, do, an, who, what, sometimes].
The results I would like would be something like:
Word Counts = {'time': 1, 'when': 4, 'left': 0, 'I': 5, 'do': 2, 'an': 0, 'who': 1, 'what': 3, 'sometimes': 1}
I have been able to create a dictionary which does a word count for every word, like example below.
wordfreq = {}
for word in words.replace(',', ' ').split():
    wordfreq[word] = wordfreq.setdefault(word, 0) + 1
I'd like to do a similar style but only targeting specific words, any suggestions?
This is based on your code; I have not tested it.
def Word_Counter(Text_File, word_list):
    Data = open(Text_File, 'r').read().lower()
    output = {}
    for word in word_list:
        # the text was lowercased, so lowercase the search word too
        output[word] = Data.count(word.lower())
    return output
Or you can do this
text = open("sample.txt", "r")
# Create an empty dictionary
d = dict()
# Loop through each line of the file
for line in text:
    # Remove the leading spaces and newline character
    line = line.strip()
    # Convert the characters in line to
    # lowercase to avoid case mismatch
    line = line.lower()
    # Split the line into words
    words = line.split(" ")
    # Iterate over each word in line
    for word in words:
        # Check if the word is already in dictionary
        if word in d:
            # Increment count of word by 1
            d[word] = d[word] + 1
        else:
            # Add the word to dictionary with count 1
            d[word] = 1
UPDATE
Try the following:
keywords = ['the', 'that']
worddict = {}
with open('out.txt', 'r') as f:
    text = f.read().split() # or f.read().split(',') for comma-separated text
for word in text:
    worddict[word] = worddict[word] + 1 if word in worddict else 1
# report only the keywords; a keyword that never occurs gets 0
print({x: worddict.get(x, 0) for x in keywords})
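One thing to watch for: str.count counts substrings, so "the" is also found inside "others". The following is a rough sketch of whole-word counting for a list of target words without importing anything. The function name count_target_words, the sample sentence and the set of stripped punctuation characters are my own assumptions, not something from the question.

```python
def count_target_words(text, targets):
    """Return {target: count} using whole-word, case-insensitive matching."""
    counts = {}
    lookup = {}
    for target in targets:
        counts[target] = 0
        # Map the lowercased form back to the spelling the caller asked for.
        lookup[target.lower()] = target
    for raw in text.lower().split():
        # Strip a few common punctuation marks from both ends of the token.
        word = raw.strip('.,;:!?"\'()')
        if word in lookup:
            counts[lookup[word]] += 1
    return counts


sample = "The time I left, the others did not know what time it was."
result = count_target_words(sample, ["time", "the", "I", "an"])
print(result)
```

Words that never occur keep a count of 0, which matches the kind of output asked for in the question.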

Word Frequency HW

Write a program that asks a user for a file name, then reads in the file. The program should then determine how frequently each word in the file is used. The words should be counted regardless of case, for example Spam and spam would both be counted as the same word. You should disregard punctuation. The program should then output the words and how frequently each word is used. The output should be sorted from the most frequent word to the least frequent word.
The only problem I am having is getting the code to count "The" and "the" as the same thing. The code counts them as different words.
userinput = input("Enter a file to open:")
if len(userinput) < 1 : userinput = 'ran.txt'
f = open(userinput)
di = dict()
for lin in f:
    lin = lin.rstrip()
    wds = lin.split()
    for w in wds:
        di[w] = di.get(w,0) + 1
lst = list()
for k,v in di.items():
    newtup = (v, k)
    lst.append(newtup)
lst = sorted(lst, reverse=True)
print(lst)
I need to count "the" and "The" as one single word.
We start by getting the words into a list and converting them all to lowercase. You can disregard punctuation by replacing each punctuation character in the string with a space.
punctuations = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
s = "I want to count how many Words are there.i Want to Count how Many words are There"
for punc in punctuations:
    s = s.replace(punc,' ')
# split() without arguments also drops the empty strings left by repeated spaces
words = s.split()
words = [word.lower() for word in words]
We then iterate through the list, and update a frequency map.
freq = {}
for word in words:
    if word in freq:
        freq[word] += 1
    else:
        freq[word] = 1
print(freq)
#{'i': 2, 'want': 2, 'to': 2, 'count': 2, 'how': 2, 'many': 2,
#'words': 2, 'are': 2, 'there': 2}
You can use Counter and re like this:
from collections import Counter
import re
sentence = 'Egg ? egg Bird, Goat afterDoubleSpace\nnewline'
# some punctuation marks (you can add more here); a raw string keeps \? intact for the regex
punctuationsToBeremoved = r",|\n|\?"
# convert everything to lowercase
sentence = sentence.lower()
#to clean up the punctuations
sentence = re.sub(punctuationsToBeremoved, " ", sentence)
# getting the word list
words = sentence.split()
# printing the frequency of each word
print(Counter(words))
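The assignment also asks for the output to be sorted from the most frequent word to the least frequent one. Here is a hedged sketch that puts the pieces together; the function name word_frequencies and the sample sentence are made up, and punctuation is assumed to mean the ASCII characters in string.punctuation.

```python
import string
from collections import Counter


def word_frequencies(text):
    """Count words case-insensitively, ignoring ASCII punctuation."""
    # Replace each punctuation character with a space.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    cleaned = text.lower().translate(table)
    return Counter(cleaned.split())


sample = "The spam, the Spam! Eggs and spam."
freq = word_frequencies(sample)
# Most frequent first; ties are broken alphabetically for a stable order.
ranked = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))
print(ranked)
```

Because every punctuation character becomes a space, a word such as "don't" is split into "don" and "t"; whether that is acceptable depends on how the assignment defines a word.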

I'm trying to append multiple values to a key in a dictionary in Python

I am trying to read a file and convert it into a dictionary. After reading it, I have to take each word and use its first character as the key and the word itself as a value. If another word with the same first character comes up, it should be appended to the values of the existing key.
import io
file1 = open("text.txt")
line = file1.read()
words = line.split()
Dict={}
for w in words:
    if w[0] in Dict.keys():
        key1=w[0]
        wor=str(w)
        Dict.setdefault(key1,[])
        Dict[key1].append(wor)
    else:
        Dict[w[0]] = w
print Dict
I just simplified your code. There is no point in having an else branch if you use setdefault:
words = 'hello how are you'.split()
dictionary = {}
for word in words:
    key = word[0]
    # setdefault creates the empty list the first time a key is seen
    dictionary.setdefault(key, []).append(word)
print(dictionary)
To get rid of setdefault, use defaultdict:
from collections import defaultdict
words = 'hello how are you'.split()
dictionary = defaultdict(list)
for word in words:
    key = word[0]
    dictionary[key].append(word)
print(dictionary.items())

Prompts the user for a word and prints all words in the file containing all characters of the word in python

For example, I have a list of words in a file (listed below):
aback
abacus
abandon
abandoned
logo
loincloth
loiter
loll
and many more; it is a really big list of words! Now the user can enter a word,
for example "go", and the program should then show all words that contain the characters 'g' and 'o': "go", "logo", "goo", and so on.
I also have to turn the file into a dictionary first, and I really have no idea how to do it.
This is what I have done so far; I was trying to group all the words that start with the same letter,
for example:
words = {'a': ['airport'], 'b': ['bathroom', 'boss', 'bottle'], 'e':['elephant']}
import operator
file = open("d1.txt","r")
words = {}
for line in file:
    line = line.strip()
    first_char = line[0]
    if first_char not in words:
        words[first_char] = []
    words[first_char].append(line)
sorted_words = sorted(words.items(),key = operator.itemgetter(1))
print(sorted_words)
user_input = str(input("Please enter a word: "))
v1 = words[user_input]
print(v1)
Unfortunately, this is all I have done, can anyone help me out please!
That looks somewhat strange, but anyway it will be easier to do something like this:
word_to_search = 'gosh' # assume that this is user input
letters_list = list(word_to_search)
# read the file once; a second file.read() would return an empty string
words_list = file.read().split('\n') # choose the separator your words are split by
result = []
for letter in letters_list:
    for word in words_list:
        if letter in word:
            result.append(word) # collects every word containing at least one of the letters
Note that there will be duplicates; to get rid of them you can just do
result = set(result) # a set holding only the unique words
If you want to go with a dictionary:
import string
alphabet = list(string.ascii_lowercase)
words_list = file.read().split('\n')
# each letter maps to a list, so that append() works
words_dict = dict((letter, []) for letter in alphabet)
for letter in alphabet:
    for word in words_list:
        if word.startswith(letter):
            words_dict[letter].append(word)
This will give you a dict with alphabet letters as keys and lists of words as values.
Hope you can figure out how to iterate over the lists in your dict.
Hint: you can join the values of the dict and iterate over them.
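The question asks for words that contain all characters of the input, not just one of them. One possible way to express that is a set comparison, sketched below. The function name words_containing_all is hypothetical, and the sketch assumes that repeated letters in the input count only once (so "goo" requires just 'g' and 'o').

```python
def words_containing_all(letters, words):
    """Return the words that contain every distinct character in `letters`."""
    needed = set(letters.lower())
    # issubset accepts any iterable, so each word can be passed in directly.
    return [w for w in words if needed.issubset(w.lower())]


# Sample taken from the word list in the question.
word_list = ["aback", "abacus", "abandon", "abandoned",
             "logo", "loincloth", "loiter", "loll"]
print(words_containing_all("go", word_list))
```

If the words are already grouped in a dictionary by first letter, the same function can be applied to each list of values in turn.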

Trying to Find Most Occurring Name in a File

I have 4 text files that I want to read in order to find the 5 most frequently occurring names. The text files have names in the following format: "Rasmus,M,11". Below is my code, which is currently able to open all of the text files and read them. Right now, this code prints out all of the names in the files.
def top_male_names ():
    for x in range (2008, 2012):
        txt = "yob" + str(x) + ".txt"
        file_handle = open(txt, "r", encoding="utf-8")
        file_handle.seek(0)
        line = file_handle.readline().strip()
        while line != "":
            print (line)
            line = file_handle.readline().strip()
top_male_names()
My question is, how can I keep track of all of these names and find the 5 that occur most often? The only way I could think of was creating a variable for each name, but that wouldn't work because there are hundreds of entries in each text file, probably with hundreds of different names.
This is the gist of it:
from collections import Counter
counter = Counter()
for line in file_handle:
    name, gender, age = line.strip().split(',')
    counter[name] += 1
print(counter.most_common(5)) # the five most frequent names
You can adapt it to your program.
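As a rough sketch of such an adaptation, a single Counter can be shared across all files. The function name count_names, the gender filter and the sample rows are assumptions of mine, and io.StringIO objects stand in for the real yob files so that the example runs on its own. It also assumes that each line counts as one occurrence; if the third field is actually a count, you would add int(value) instead of 1.

```python
import io
from collections import Counter


def count_names(file_handles, gender="M"):
    """Tally names across several open files of 'name,gender,value' lines."""
    counter = Counter()
    for handle in file_handles:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            name, line_gender, _ = line.split(",")
            if line_gender == gender:
                counter[name] += 1
    return counter


# In-memory stand-ins for the real files (made-up rows).
files = [
    io.StringIO("Rasmus,M,11\nAnna,F,9\nOlle,M,3\n"),
    io.StringIO("Rasmus,M,7\nOlle,M,5\nErik,M,2\n"),
    io.StringIO("Rasmus,M,4\nAnna,F,8\n"),
]
top = count_names(files).most_common(5)
print(top)
```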
If you need to count the words in a text, use a regex.
For example:
import re
my_string = "Wow! Is this true? Really!?!? This is crazy!"
words = re.findall(r'\w+', my_string) #This finds words in the document
Output:
>>> words
['Wow', 'Is', 'this', 'true', 'Really', 'This', 'is', 'crazy']
"Is" and "is" are two different words. So we can just capitalize all the words, and then count them.
from collections import Counter
cap_words = [word.upper() for word in words] #converts all the words to uppercase
word_counts = Counter(cap_words) #counts the number each time a word appears
Output:
>>> word_counts
Counter({'THIS': 2, 'IS': 2, 'CRAZY': 1, 'WOW': 1, 'TRUE': 1, 'REALLY': 1})
Now reading a file:
import re
from collections import Counter
with open('file.txt') as f:
    text = f.read()
words = re.findall(r'\w+', text)
cap_words = [word.upper() for word in words]
word_counts = Counter(cap_words)
Then you only have to sort the word counts by value, not by key, and take the top 5 words; Counter.most_common(5) does exactly that.
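A small sketch of both ways of ranking by count; the sample words are made up, and the top 2 are taken instead of the top 5 only to keep the example short.

```python
from collections import Counter

word_counts = Counter(["IS", "THIS", "IS", "CRAZY", "THIS", "IS"])

# Option 1: let Counter do the ranking.
top_two = word_counts.most_common(2)

# Option 2: sort the (word, count) pairs by count, largest first.
by_count = sorted(word_counts.items(), key=lambda item: item[1], reverse=True)

print(top_two)
print(by_count[:2])
```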
