I need to count all the unique five-letter words in a text file, ignoring any word that contains an apostrophe. I'm new to Python, so I'm confused about how to keep only the five-letter words and how to skip words that contain a ' .
What I have written so far seems to work for finding the unique words, but not for keeping only the five-letter ones.
with open("names.txt", 'r') as f:  # open the file
    # read the contents into a string, lower-case it, and split it into a list of words
    words = f.read().lower().split()
print(words)
unique_words = set(words)  # get unique words
print(len(unique_words))
for w in unique_words:
    if len(w) == 5:
        print(unique_words)
    else:
        pass
Your code looks good. I think the only bit you did wrong was to print(unique_words) instead of print(w) when you found a word w of length 5.
To ignore the words containing ' you can add this condition:
for w in unique_words:
    if len(w) == 5 and "'" not in w:
        print(w)
By the way, you don't need the else: pass branch. When the if condition is false, the loop simply moves on to the next word.
This should do the trick
with open("names.txt", 'r') as f:  # open the file
    # read the contents into a string, lower-case it, and split it into a list of words
    words = f.read().lower().split()
print(words)
unique_words = set()  # create an empty set
for w in words:
    if len(w) == 5 and "'" not in w:
        unique_words.add(w)  # add the word to the set
print(len(unique_words))
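The same filter can also be wrapped in a small function, which makes it easy to try out without a file. This is only a sketch: the name count_five_letter_words and the sample text are made up for illustration, and it assumes words are separated by whitespace, as in the code above.

```python
def count_five_letter_words(text):
    """Return the set of unique five-character words that contain no apostrophe."""
    words = text.lower().split()
    # A set comprehension filters and removes duplicates in one step.
    return {w for w in words if len(w) == 5 and "'" not in w}


sample = "Hello world hello can't apple Don't pears"
five = count_five_letter_words(sample)
print(sorted(five))
print(len(five))
```

Note that punctuation attached to a word (for example a trailing comma) still counts toward its length here, so the text may need cleaning first.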
Related
I was given a .txt file containing some text. I have already cleaned the text (removed punctuation, uppercase, symbols), and now I have a string with the words.
I am now trying to get the number of characters, len(), of each word in the string. Then make a plot where N of characters is on the X-axis and the Y-axis is the number of words that have such N len() of characters.
So far I have:
text = "sample.txt"
def count_chars(txt):
    result = 0
    for char in txt:
        result += 1  # same as result = result + 1
    return result
print(count_chars(text))
So far this counts the total number of characters in the text instead of the length of each word.
I would like to get something like Counter(), which returns each word together with the number of times it appears in the text.
from collections import Counter
word_count=Counter(text)
I want to get the # of characters per word. Once we have such a count the plotting should be easier.
Thanks and anything helps!
Okay, first of all you need to open the sample.txt file.
with open('sample.txt', 'r') as text_file:
    text = text_file.read()
or
text = open('sample.txt', 'r').read()
Now we can record the length of each word in the text and put it, for example, in a dict.
counter_dict = {}
for word in text.split():  # split on any whitespace, including newlines
    counter_dict[word] = len(word)
print(counter_dict)
It looks like the accepted answer doesn't solve the problem as it was posed by the querent
Then make a plot where N of characters is on the X-axis and the Y-axis is the number of words that have such N len() of characters
import matplotlib.pyplot as plt
# ch10 = ... the text of "Moby Dick"'s chapter 10, as found
# in https://www.gutenberg.org/files/2701/2701-h/2701-h.htm
# split ch10 into a list of words
words = [w for w in ch10.split() if w]
# some words are joined by an em-dash
words = sum((w.split('—') for w in words), [])
# remove suffixes and one prefix (str.removesuffix/removeprefix need Python 3.9+)
for suffix in (',','.',':',';','!','?','"'):
    words = [w.removesuffix(suffix) for w in words]
words = [w.removeprefix('"') for w in words]
# count the different lengths using a dict
d = {}
for w in words:
    l = len(w)
    d[l] = d.get(l, 0) + 1
# retrieve the relevant info from the dict
lengths, counts = zip(*d.items())
# plot the relevant info
plt.bar(lengths, counts)
plt.xticks(range(1, max(lengths)+1))
plt.xlabel('Word lengths')
plt.ylabel('Word counts')
# what is the longest word?
plt.title(' '.join(w for w in words if len(w)==max(lengths)))
# T H E E N D
plt.show()
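Since the question mentions Counter, the counting step can also be done with collections.Counter applied to the word lengths rather than to the words themselves. A minimal sketch, assuming the text is already cleaned and whitespace-separated; length_histogram and the sample sentence are invented for the example:

```python
from collections import Counter


def length_histogram(text):
    """Map each word length to the number of words that have that length."""
    return Counter(len(word) for word in text.split())


sample = "the quick brown fox jumps over the lazy dog"
hist = length_histogram(sample)
for n in sorted(hist):
    print(n, hist[n])
```

The sorted keys can serve as the x values and the matching counts as the bar heights for a bar chart like the one above.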
I'm writing a program that reads lines (the words of a dictionary), and I want to exclude words that are shorter than a minimum length, for example 8 characters.
I searched on Google, but I didn't find an answer.
This is the kind of program I would like to write:
with open('words.txt', 'r') as f:
    lines = f.read().split('\n')
length = 8
# and now the part I struggle with (pseudocode):
# if length under 8:
#     exclude it
# else:
#     print(goodword)
You can use len(line) to find the length of the string.
with open('words.txt', 'r') as f:
    lines = f.read().split('\n')
length = 8
goodwords = [w for w in lines if len(w) >= length]
print(*goodwords, sep='\n')
The line goodwords = [... is a list comprehension that can be replaced with a standard for loop:
goodwords = []  # initialize your list
for word in lines:  # evaluate each word
    if len(word) >= length:  # word is accepted
        goodwords.append(word)
    # else:
    #     no need for an else clause:
    #     if the word has fewer than 8 letters,
    #     the code just continues with the next word
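If the word list is large, the same filter can run line by line instead of reading the whole file into memory first. The sketch below is one way to do that; words_with_min_length is a made-up name, and a plain list stands in for the open file so the example runs on its own (a file object opened with open('words.txt') could be passed instead, since iterating over it yields lines).

```python
def words_with_min_length(lines, min_length=8):
    """Yield stripped lines that have at least min_length characters."""
    for line in lines:
        word = line.strip()  # drop the trailing newline and stray spaces
        if len(word) >= min_length:
            yield word


# Stand-in for the lines of words.txt.
sample_lines = ["cat\n", "elephant\n", "hippopotamus\n", "giraffe\n", "\n"]
good = list(words_with_min_length(sample_lines))
print(*good, sep="\n")
```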
I'm trying to find every place where one of a few specific words is used in a TXT file and report its position, counting words from the start of the file. My code returns the position for some but not all of the words, and I have no idea why.
My code right now goes through the file word by word with a counter and returns the number if the word matches one of the words I want.
def wordnumber(file, filewrite, word1, word2, word3):
    import os
    wordlist = [word1, word2, word3]
    infile = open(file, 'r')
    g = open(filewrite, 'w')
    g.write("start")
    g.write(os.linesep)
    lines = infile.read().splitlines()
    infile.close()
    wordsString = ' '.join(lines)
    words = wordsString.split()
    n = 1
    for w in words:
        if w in wordlist:
            g.write(str(n))
            g.write(os.linesep)
        n = n+1
This works sometimes, but for some text files it only returns some of the numbers and leaves others blank.
If you want to find the first occurrence of the word w in your words, just use
wordIndex = words.index(w) if w in words else None
and for all occurrences use
wordIndexes = [i for i, x in enumerate(words) if x == w]
(taken from Python: Find in list)
But beware: if your text is "cat, dog, mouse", your code wouldn't find the index of "cat" or "dog", because "cat, dog, mouse".split() returns ['cat,', 'dog,', 'mouse'], and 'cat,' is not 'cat'.
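One way to deal with the punctuation problem described above is to strip punctuation from both ends of each word before comparing. This is a sketch under a few assumptions: word_positions is a hypothetical helper, positions are counted from 1 as in the question, and matching is made case-insensitive, which the question does not ask for.

```python
import string


def word_positions(text, targets):
    """Return (position, word) pairs, counting from 1, for words found in targets."""
    wanted = {t.lower() for t in targets}
    hits = []
    for n, raw in enumerate(text.split(), start=1):
        # Remove leading/trailing punctuation so 'cat,' matches 'cat'.
        word = raw.strip(string.punctuation).lower()
        if word in wanted:
            hits.append((n, word))
    return hits


sample = "Cat, dog, mouse. The dog chased the cat!"
positions = word_positions(sample, ["cat", "dog"])
print(positions)
```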
The program correctly identifies the words regardless of punctuation. I am having trouble integrating this into spam_indicator(text).
def spam_indicator(text):
    text=text.split()
    w=0
    s=0
    words=[]
    for char in string.punctuation:
        text = text.replace(char, '')
    return word
    for word in text:
        if word.lower() not in words:
            words.append(word.lower())
            w=w+1
        if word.lower() in SPAM_WORDS:
            s=s+1
    return float("{:.2f}".format(s/w))
The second block is wrong. I am trying to remove punctuation so that the function can run.
Try removing the punctuation first, then split the text into words.
def spam_indicator(text):
    for char in string.punctuation:
        text = text.replace(char, ' ')  # N.B. replace with ' ', not ''
    text = text.split()
    w = 0
    s = 0
    words = []
    for word in text:
        if word.lower() not in words:
            words.append(word.lower())
            w=w+1
        if word.lower() in SPAM_WORDS:
            s=s+1
    return float("{:.2f}".format(s/w))
There are many improvements that could be made to your code.
Use a set for words rather than a list. Since a set cannot contain duplicates, you don't need to check whether you've already seen the word before adding it to the set.
Use str.translate() to remove the punctuation. You want to replace punctuation with whitespace so that the split() will split the text into words.
Use round() instead of converting to a string then to a float.
Here is an example:
import string
def spam_indicator(text):
    trans_table = {ord(c): ' ' for c in string.punctuation}
    text = text.translate(trans_table).lower()
    text = text.split()
    word_count = 0
    spam_count = 0
    words = set()
    for word in text:
        if word not in SPAM_WORDS:
            words.add(word)
            word_count += 1
        else:
            spam_count += 1
    return round(spam_count / word_count, 2)
You need to take care not to divide by 0 if there are no non-spam words. Anyway, I'm not sure what you want as the spam indicator value. Perhaps it should be the number of spam words divided by the total number of words (both spam and non-spam) to make it a value between 0 and 1?
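To illustrate the two points just raised, here is one possible variant that guards against dividing by zero and keeps the result between 0 and 1. It is a sketch, not necessarily what the assignment expects: it defines the indicator as the share of unique words that are spam words, and the SPAM_WORDS set below is a made-up stand-in, since the real one is not shown in the question.

```python
import string

# Hypothetical spam vocabulary; the real SPAM_WORDS is not shown in the question.
SPAM_WORDS = {"free", "winner", "prize"}


def spam_indicator(text):
    """Return the fraction of unique words that are spam words, rounded to 2 places."""
    # Replace punctuation with spaces so split() separates the words cleanly.
    table = str.maketrans({c: " " for c in string.punctuation})
    words = set(text.translate(table).lower().split())
    if not words:
        return 0.0  # no words at all, so avoid dividing by zero
    spam = words & SPAM_WORDS  # set intersection: unique spam words present
    return round(len(spam) / len(words), 2)


print(spam_indicator("FREE prize!!! Click now, winner."))
print(spam_indicator(""))
```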
I need to create a word list from a text file. The list is going to be used in a hangman game and needs to exclude the following:
duplicate words
words containing fewer than 5 letters
words that contain 'xx' as a substring
words that contain upper case letters
The word list then needs to be written to a file so that every word appears on its own line.
The program also needs to output the number of words in the final list.
This is what I have, but it's not working properly.
def MakeWordList():
    infile=open(('possible.rtf'),'r')
    whole = infile.readlines()
    infile.close()
    L=[]
    for line in whole:
        word= line.split(' ')
        if word not in L:
            L.append(word)
            if len(word) in range(5,100):
                L.append(word)
                if not word.endswith('xx'):
                    L.append(word)
                    if word == word.lower():
                        L.append(word)
    print(L)
MakeWordList()
You're appending the word many times with this code.
You aren't actually filtering out any words, just adding them a different number of times depending on how many ifs they pass.
You should combine all the ifs:
if word not in L and len(word) >= 5 and 'xx' not in word and word.islower():
    L.append(word)
Or if you want it more readable you can split them:
if word not in L and len(word) >= 5:
    if 'xx' not in word and word.islower():
        L.append(word)
But don't append after each one.
Think about it: in your nested if-statements, ANY word that is not already in the list will make it through on your first line. Then if it is 5 or more characters, it will get added again (I bet), and again, etc. You need to rethink your logic in the if statements.
Improved code:
def MakeWordList():
    with open('possible.rtf','r') as f:
        data = f.read().split()  # split the text into words; iterating over the raw string would give single characters
    return set([word for word in data if len(word) >= 5 and word.islower() and 'xx' not in word])
set(iterable) returns a set object, which cannot contain duplicates (all set items must be unique). [word for word...] is a list comprehension, which is a shorter way of creating simple lists. It iterates over every word in data (split() breaks the text on any whitespace, so one word per line works too). if len(word) >= 5 and word.islower() and 'xx' not in word covers the final three requirements (at least 5 letters, only lowercase letters, and no 'xx').
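The question also asks for the list to be written out one word per line and for the number of words to be reported, which the code above does not do. A minimal sketch of those two steps, assuming the input is plain text (a real .rtf file carries formatting markup that would have to be stripped first); make_word_list and the sample text are invented for the example:

```python
def make_word_list(text):
    """Return the sorted unique words that satisfy the hangman rules."""
    return sorted({
        word for word in text.split()
        if len(word) >= 5 and word.islower() and "xx" not in word
    })


sample = "apple\nApple\nbanana\nfoxxy\npear\napple\ncherry\n"
words = make_word_list(sample)

# One word per line; this string could be written to a file with f.write(output).
output = "\n".join(words) + "\n"
print(output, end="")
print(len(words))
```

Sorting is optional; it is used here only so the output order is predictable, since a set has no defined order.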