I'm writing a program that reads lines (the words of a dictionary), and I want to exclude words that are shorter than a minimum length, for example 8 characters.
I searched on Google but couldn't find an answer.
This is the kind of program I would like to write:
with open('words.txt', 'r') as f:
    lines = f.read().split('\n')
length = 8
# and now the part I struggle with (pseudocode):
# if the word's length is under 8:
#     exclude it
# else:
#     print(goodword)
You can use len(line) to find the length of the string.
with open('words.txt', 'r') as f:
    lines = f.read().split('\n')
length = 8
goodwords = [w for w in lines if len(w) >= length]
print(*goodwords, sep='\n')
The line goodwords = [... is a list comprehension that can be replaced with a standard for loop:
goodwords = []  # initiate your list
for word in lines:  # evaluate each word
    if len(word) >= length:  # word is accepted
        goodwords.append(word)
    # else:
    #     no need for an else clause
    #     in the event that word has less than 8 letters
    #     the code will just continue with the next word
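If the word list is large, the same filter can be applied while streaming the lines instead of reading the whole file at once. This is only a minimal sketch, assuming one word per line; the function name words_with_min_length is made up for illustration and is not part of the answer above.

```python
def words_with_min_length(lines, min_length=8):
    """Yield each word that has at least min_length characters."""
    for line in lines:
        word = line.strip()  # drop the trailing newline and stray spaces
        if len(word) >= min_length:
            yield word


# In-memory sample standing in for the lines of words.txt
sample = ["cat\n", "elephant\n", "dictionary\n", "tree\n"]
print(list(words_with_min_length(sample)))  # ['elephant', 'dictionary']
```

An open file object can be passed in place of the sample list, since iterating over a file yields its lines one at a time.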
Related
I need to count all unique five-letter words in a txt file and ignore any word with an apostrophe. I'm new to Python, so I am quite confused about how to get just the five-letter words, and I'm not sure how to ignore words that contain a ' character.
What I wrote so far seems to work for finding the unique words, but not for keeping just the five-letter words.
with open ("names.txt", 'r') as f: #open the file
    words = f.read().lower().split() #read the contents into a string, make all the characters lower case and split the string into a list of words
print(words)
unique_words = set(words) #get unique words
print(len(unique_words))
for w in unique_words:
    if len(w) == 5:
        print(unique_words)
    else:
        pass
Your code looks good. I think the only bit you did wrong was to print(unique_words) instead of print(w) when you found a word w of length 5.
To ignore the words containing ' you can add this condition:
for w in unique_words:
    if len(w) == 5 and "'" not in w:
        print(w)
By the way, the else: pass branch is unnecessary: an if without an else simply moves on to the next iteration of the for loop.
This should do the trick:
with open("names.txt", 'r') as f: #open the file
    words = f.read().lower().split() #read the contents into a string, make all the characters lower case and split the string into a list of words
print(words)
unique_words = set() #Create empty set
for w in words:
    if len(w) == 5 and "'" not in w:
        unique_words.add(w) #add words to set
print(len(unique_words))
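The loop that fills the set can also be written as a set comprehension. The following is a small sketch of that idea, working on a string rather than on names.txt so it runs on its own; the function name unique_five_letter_words and the sample text are assumptions for illustration.

```python
def unique_five_letter_words(text):
    """Return the distinct 5-character words that contain no apostrophe."""
    return {w for w in text.lower().split() if len(w) == 5 and "'" not in w}


sample = "Alice alice can't Bobby don't James"
result = unique_five_letter_words(sample)
print(len(result), sorted(result))  # 3 ['alice', 'bobby', 'james']
```

Note that "can't" and "don't" are five characters long, so they would be counted without the apostrophe check.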
def inverted_index(doc):
    words = word_count(doc)
    ln = 0
    for word in words:
        temp = []
        with open(doc) as file:
            for line in file:
                ln += 1
                li = line.split()
                if word in li:
                    temp.append(ln)
        words[word] = temp
    return words
I am trying to create an inverted index from a text file, where words is a dictionary containing all 19,000 unique words in the file. The text file has over 5,000 lines. I want to iterate through the file and the dictionary to build an inverted index that maps each word to the line numbers on which it appears, but it takes too long to run because of the nested for loops. Is there a more efficient way to do this?
Here is my approach to solving this; please read the note below the code for a practical tip.
def inverted_index(doc):
    d = {}
    # this will open the file (and close it again when the block ends)
    with open(doc, encoding='utf8') as file:
        # enumerate numbers the lines from 1, so no separate line count is needed
        for i, line in enumerate(file, start=1):
            l = line.lower().split()  # split() also drops the trailing newline
            for item in l:
                if item not in d:
                    d[item] = [i]
                elif d[item][-1] != i:  # record each line only once per word
                    d[item].append(i)
    return d

print(inverted_index('file.txt'))
I would suggest removing stopwords before creating the inverted index for any meaningful analysis. You can use the nltk package for that.
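As a rough illustration of the stopword suggestion, here is a single-pass sketch that skips stopwords while building the index. It does not use nltk: STOPWORDS is a tiny hand-written stand-in list, and the function name inverted_index_from_lines is likewise an assumption, chosen so the example can run on a list of lines without a file.

```python
from collections import defaultdict

# Tiny stand-in stopword list; an assumption for illustration only
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}


def inverted_index_from_lines(lines, stopwords=STOPWORDS):
    """Map each non-stopword to the 1-based line numbers it appears on."""
    index = defaultdict(list)
    for line_number, line in enumerate(lines, start=1):
        # set() so a word repeated on one line is recorded only once
        for word in set(line.lower().split()):
            if word not in stopwords:
                index[word].append(line_number)
    return dict(index)


sample = ["The cat sat", "a cat and a dog", "the dog"]
print(inverted_index_from_lines(sample))
```

Each line is visited exactly once, so the work grows with the size of the file rather than with the number of unique words times the number of lines.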
I was given a .txt file containing a text. I have already cleaned the text (removed punctuation, uppercase, symbols), and now I have a string with the words.
I am now trying to get the character count, len(), of each word in the string, and then make a plot where the number of characters N is on the X-axis and the Y-axis is the number of words that have N characters.
So far I have:
text = "sample.txt"
def count_chars(txt):
    result = 0
    for char in txt:
        result += 1 # same as result = result + 1
    return result
print(count_chars(text))
So far this is looking for the total len() of the text instead of by word.
I would like to get something like Counter() from collections, which returns each word together with the number of times it is repeated throughout the text.
from collections import Counter
word_count=Counter(text)
I want to get the # of characters per word. Once we have such a count the plotting should be easier.
Thanks and anything helps!
Okay, first of all you need to open the sample.txt file.
with open('sample.txt', 'r') as text_file:
    text = text_file.read()
or
text = open('sample.txt', 'r').read()
Now we can compute the length of each word in the text and store it, for example, in a dict.
counter_dict = {}
for word in text.split(" "):
    counter_dict[word] = len(word)
print(counter_dict)
It looks like the accepted answer doesn't solve the problem as it was posed by the querent:
Then make a plot where N of characters is on the X-axis and the Y-axis is the number of words that have such N len() of characters
import matplotlib.pyplot as plt
# ch10 = ... the text of "Moby Dick"'s chapter 10, as found
# in https://www.gutenberg.org/files/2701/2701-h/2701-h.htm
# split chap10 into a list of words,
words = [w for w in ch10.split() if w]
# some words are joined by an em-dash
words = sum((w.split('—') for w in words), [])
# remove suffixes and one prefix
for suffix in (',','.',':',';','!','?','"'):
    words = [w.removesuffix(suffix) for w in words]
words = [w.removeprefix('"') for w in words]
# count the different lengths using a dict
d = {}
for w in words:
    l = len(w)
    d[l] = d.get(l, 0) + 1
# retrieve the relevant info from the dict
lengths, counts = zip(*d.items())
# plot the relevant info
plt.bar(lengths, counts)
plt.xticks(range(1, max(lengths)+1))
plt.xlabel('Word lengths')
plt.ylabel('Word counts')
# what is the longest word?
plt.title(' '.join(w for w in words if len(w)==max(lengths)))
# T H E E N D
plt.show()
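Since the question mentions Counter, the dict-based tally in the answer above could also be written with collections.Counter. This is only a sketch of the counting step, without the plot; the function name word_length_counts and the sample sentence are assumptions for illustration.

```python
from collections import Counter


def word_length_counts(text):
    """Map each word length to the number of words of that length."""
    return Counter(len(word) for word in text.split())


counts = word_length_counts("the quick brown fox jumps over the lazy dog")
for length in sorted(counts):
    print(length, counts[length])
# 3 4
# 4 2
# 5 3
```

The lengths and their counts can then be handed to a bar plot, for example as list(counts) and list(counts.values()).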
I have a file.txt with thousands of words, and I need to create a new file based on certain parameters, and then sort them a certain way.
Assuming the user imports the proper libraries when they test, what is wrong with my code? (There are 3 separate functions)
For the first, I must create a file with words containing certain letters, and sort them lexicographically, then put them into a new file list.txt.
def getSortedContain(s,ifile,ofile):
    toWrite = ""
    toWrites = ""
    for line in ifile:
        word = line[:-1]
        if s in word:
            toWrite += word + "\n"
    newList = []
    newList.append(toWrite)
    newList.sort()
    for h in newList:
        toWrites += h
    ofile.write(toWrites[:-1])
The second is similar, but must be sorted reverse lexicographically, if the string inputted is NOT in the word.
def getReverseSortedNotContain(s,ifile,ofile):
    toWrite = ""
    toWrites = ""
    for line in ifile:
        word = line[:-1]
        if s not in word:
            toWrite += word + "\n"
    newList = []
    newList.append(toWrite)
    newList.sort()
    newList.reverse()
    for h in newList:
        toWrites += h
    ofile.write(toWrites[:-1])
For the third, I must sort words that contain a certain amount of integers, and sort lexicographically by the last character in each word.
def getRhymeSortedCount(n, ifile, ofile):
    toWrite = ""
    for line in ifile:
        word = line[:-1] #gets rid of \n
        if len(word) == n:
            toWrite += word + "\n"
    reversetoWrite = toWrite[::-1]
    newList = []
    newList.append(toWrite)
    newList.sort()
    newList.reverse()
    for h in newList:
        toWrites += h
    reversetoWrite = toWrites[::-1]
    ofile.write(reversetoWrites[:-1])
Could someone please point me in the right direction for these? Right now they are not sorting as they're supposed to.
A lot is unclear here, so I'll do my best to clean this up.
You're concatenating the strings into one big string and then appending that one big string to a list. You then sort that 1-element list, which does nothing. Instead, put each word into the list as a separate element and then sort that list.
For your first example, do the following:
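def getSortedContain(s,ifile,ofile):
    # strip the trailing newline from each line so it is not written twice
    stripped = (line.rstrip("\n") for line in ifile)
    words = [word for word in stripped if s in word]
    words.sort()
    ofile.write("\n".join(words))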
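The other two functions from the question could follow the same pattern. The sketch below is one possible reading, not a definitive solution: it assumes one word per line, skips empty lines in the second function, and interprets the "rhyme" sort as ordering words by their reversed spelling (last character first). The in-memory io.StringIO files are only there so the example runs without real files.

```python
import io


def getReverseSortedNotContain(s, ifile, ofile):
    """Write the words that do NOT contain s, in reverse lexicographic order."""
    words = [line.rstrip("\n") for line in ifile]
    kept = sorted((w for w in words if w and s not in w), reverse=True)
    ofile.write("\n".join(kept))


def getRhymeSortedCount(n, ifile, ofile):
    """Write the words of length n, ordered by their reversed spelling."""
    words = [line.rstrip("\n") for line in ifile]
    kept = sorted((w for w in words if len(w) == n), key=lambda w: w[::-1])
    ofile.write("\n".join(kept))


out = io.StringIO()
getRhymeSortedCount(3, io.StringIO("bat\ncab\ndog\nzebra\n"), out)
print(out.getvalue())
```

Using the key argument of sorted avoids reversing the strings, sorting them, and reversing them back by hand.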
I would like to define a function scaryDict() which takes one parameter (a textfile) and returns the words from the textfile in alphabetical order, basically produce a dictionary but does not print any one or two letter words.
Here is what I have so far...it isn't much but I don't know the next step
def scaryDict(fileName):
    inFile = open(fileName,'r')
    lines = inFile.read()
    line = lines.split()
    myDict = {}
    for word in inFile:
        myDict[words] = []
        #I am not sure what goes between the line above and below
    for x in lines:
        print(word, end='\n')
You are doing fine up to line = lines.split(). But your for loop must iterate over the line list, not over inFile.
for word in line:
    if len(word) > 2: # Make sure to check the word length!
        myDict[word] = 'something'
I'm not sure what you want with the dictionary (maybe get the word count?), but once you have it, you can get the words you added to it by,
allWords = list(myDict.keys()) # so allWords is now a list of words
And then you can sort allWords to get them in alphabetical order.
allWords.sort()
I would store all of the words into a set (to eliminate dups), then sort that set:
#!/usr/bin/python3
def scaryDict(fileName):
    with open(fileName) as inFile:
        return sorted(set(word
                          for line in inFile
                          for word in line.split()
                          if len(word) > 2))

scaryWords = scaryDict('frankenstein.txt')
print('\n'.join(scaryWords))
Also keep in mind that, as of Python 2.5, file objects work with the with statement (they have __enter__ and __exit__ methods), which prevents issues such as the file never getting closed:
with open(...) as f:
    for line in f:
        <do something with line>
Unique set
Sort the set
Now you can put it all together.
Sorry that I am 3 years late :) Here is my version:
def scaryDict(filename):
    infile = open(filename, 'r')
    content = infile.read()
    infile.close()
    # replace each of these 15 punctuation characters with a space
    table = str.maketrans('.`/()|,\';!:"?=-', 15 * ' ')
    content = content.translate(table)
    words = content.split()
    new_words = list()
    for word in words:
        if len(word) > 2:
            new_words.append(word)
    # remove duplicates, then sort alphabetically
    new_words = list(set(new_words))
    new_words.sort()
    for word in new_words:
        print(word)