Recreating a sentence using text files and lists - python

I am relatively new to Python and I am currently working on a compression program that uses a list containing the positions of words in a sentence and a list of the words that make up the sentence. So far I have written my program inside two functions. The first function, 'compression', gets the words that make up the sentence and the positions of those words. My second function is called 'recreate'; it uses the lists to recreate the sentence. The recreated sentence is then stored in a file called recreate.txt. My issue is that the positions of words and the words that make up the sentence are not being written to their respective files, and the 'recreate' file is not being created or written to. Any help would be greatly appreciated. Thanks :)
    sentence = input("Input the sentence that you wish to be compressed")
    sentence.lower()
    sentencelist = sentence.split()
    d = {}
    plist = []
    wds = []
    def compress():
        for i in sentencelist:
            if i not in wds:
                wds.append(i)
        for i ,j in enumerate(sentencelist):
            if j in (d):
                plist.append(d[j])
            else:
                plist.append(i)
        print (plist)
        tsk3pos = open ("tsk3pos.txt", "wt")
        for item in plist:
            tsk3pos.write("%s\n" % item)
        tsk3pos.close()
        tsk3wds = open ("tsk3wds.txt", "wt")
        for item in wds:
            tsk3wds.write("%s\n" % item)
        tsk3wds.close()
        print (wds)
    def recreate(compress):
        compress()
        num = list()
        wds = list()
        with open("tsk3wds.txt", "r") as txt:
            for line in txt:
                words += line.split()
        with open("tsk3pos.txt", "r") as txt:
            for line in txt:
                num += [int(i) for i in line.split()]
        recreate = ' '.join(words[pos] for pos in num)
        with open("recreate.txt", "wt") as txt:
            txt.write(recreate)
UPDATED
I have fixed all the other problems except the recreate function, which will not create the 'recreate' file and will not recreate the sentence with the words, although it recreates the sentence with the positions.
    def recreate(compress): #function that will be used to recreate the compressed sentence.
        compress()
        num = list()
        wds = list()
        with open("words.txt", "r") as txt: #with statement opening the word text file
            for line in txt: #iterating over each line in the text file.
                words += line.split() #turning the textfile into a list and appending it to num
        with open("tsk3pos.txt", "r") as txt:
            for line in txt:
                num += [int(i) for i in line.split()]
        recreate = ' '.join(wds[pos] for pos in num)
        with open("recreate.txt", "wt") as txt:
            txt.write(recreate)
        main()
    def main():
        print("Do you want to compress an input or recreate a compressed input?")
        user = input("Type 'a' if you want to compress an input. Type 'b' if you wan to recreate an input").lower()
        if user not in ("a","b"):
            print ("That's not an option. Please try again")
        elif user == "a":
            compress()
        elif user == "b":
            recreate(compress)
        main()
    main()

A simpler (yet less efficient) approach:
    recreate_file_object = open("C:/FullPathToWriteFolder/recreate.txt", "w")
    recreate_file_object.write(recreate)
    recreate_file_object.close()
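For comparison, here is a minimal sketch of the whole round trip. It is not the asker's exact program: passing the sentence and the file paths as parameters is an assumption made so the example can be run without prompts, and the default file names are simply the ones used in the question. It also avoids two problems visible in the question's code: the result of sentence.lower() is discarded there, and recreate() initialises wds but then appends to an undefined name, words.

```python
def compress(sentence, words_path="tsk3wds.txt", pos_path="tsk3pos.txt"):
    """Write the unique words and the index of each word to two files."""
    # str.lower() returns a new string, so the result has to be kept
    sentence_list = sentence.lower().split()
    words = []      # unique words, in order of first appearance
    positions = []  # for each word of the sentence, its index in `words`
    for word in sentence_list:
        if word not in words:
            words.append(word)
        positions.append(words.index(word))
    with open(words_path, "w") as f:
        f.write("\n".join(words))
    with open(pos_path, "w") as f:
        f.write("\n".join(str(p) for p in positions))
    return words, positions


def recreate(words_path="tsk3wds.txt", pos_path="tsk3pos.txt",
             out_path="recreate.txt"):
    """Rebuild the sentence from the two files and save it to out_path."""
    with open(words_path) as f:
        words = f.read().split()
    with open(pos_path) as f:
        positions = [int(p) for p in f.read().split()]
    sentence = " ".join(words[p] for p in positions)
    with open(out_path, "w") as f:
        f.write(sentence)
    return sentence
```

Calling compress("some sentence") followed by recreate() should leave the lowercased sentence in recreate.txt. Using with blocks means each file is closed (and flushed to disk) before the next step reads it.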

Related

Is there a more efficient way to create an inverted index from a large text file?

    def inverted_index(doc):
        words = word_count(doc)
        ln = 0
        for word in words:
            temp = []
            with open(doc) as file:
                for line in file:
                    ln += 1
                    li = line.split()
                    if word in li:
                        temp.append(ln)
            words[word] = temp
        return words
I am trying to create an inverted index from a text file, where words is a dictionary with all the 19000 unique words in the file. The text file has around 5000+ lines. I want to iterate through the file and dictionary to create an inverted index that maps each word to the line numbers on which it appears, but it is taking too long to run because of the nested for loop. Is there a more efficient way to do this?
Here is my approach to solving this; please read the note below the code for a pragmatic tip.
    def inverted_index(doc):
        # Map each word to the list of line numbers on which it appears
        d = {}
        with open(doc, encoding='utf8') as file:
            # A single pass over the file; enumerate numbers the lines from 1
            for i, line in enumerate(file, start=1):
                # split() with no argument also drops the trailing newline
                for item in line.lower().split():
                    if item not in d:
                        d[item] = [i]
                    elif d[item][-1] != i:
                        # record each line only once per word
                        d[item].append(i)
        return d
    print(inverted_index('file.txt'))
I would suggest removing stopwords before creating the inverted index for any meaningful analysis. You can use the nltk package for that.
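As a rough illustration of the stopword idea, the sketch below filters words against a set while building the index. The STOPWORDS set here is a deliberately tiny, hypothetical stand-in, and build_index is an illustrative name; with nltk installed and its stopword corpus downloaded, a fuller English list could be obtained from nltk.corpus.stopwords.words('english') instead.

```python
# Hypothetical, deliberately tiny stopword set used only for illustration
STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def build_index(lines, stopwords=STOPWORDS):
    """Map each non-stopword to the line numbers on which it appears."""
    index = {}
    for line_no, line in enumerate(lines, start=1):
        # set() so that a word repeated on one line is recorded once
        for word in set(line.lower().split()):
            if word not in stopwords:
                index.setdefault(word, []).append(line_no)
    return index


sample = ["The cat sat", "the dog and the cat"]
print(build_index(sample))
```

Because membership tests on a set are constant time on average, the filter adds very little cost to the single pass over the file.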

find words in txt files Python 3

I'd like to create a program in Python 3 to find how many times specific words appear in txt files and then build an Excel table with these values.
I wrote this function, but when I call it with my input, the program doesn't work. This message appears: unindent does not match any outer indentation level
    def wordcount(filename, listwords):
        try:
            file = open( filename, "r")
            read = file.readlines()
            file.close()
            for x in listwords:
                y = x.lower()
                counter = 0
                for z in read:
                    line = z.split()
                    for ss in line:
                        l = ss.lower()
                        if y == l:
                            counter += 1
                print(y , counter)
Now I try to call the function with a txt file and the word to find:
    wordcount("aaa.txt" , 'word' )
As output I'd like to see:
    word 4
Thanks to everybody!
Here is an example you can use to find the number of times a specific word appears in a text file:
    def searching(filename,word):
        counter = 0
        with open(filename) as f:
            for line in f:
                # count whole-word matches on this line, not just lines containing the word
                counter += line.split().count(word)
        return counter
    x = searching("filename","wordtofind")
    print("wordtofind", x)
The output will be the word you are looking for and the number of times it occurs.
As short as possible:
    def wordcount(filename, listwords):
        with open(filename) as file_object:
            file_text = file_object.read()
        return {word: file_text.count(word) for word in listwords}
    for word, count in wordcount('aaa.txt', ['a', 'list', 'of', 'words']).items():
        print("Count of {}: {}".format(word, count))
Getting back to mij's comment about passing listwords as an actual list: if you pass a string to code that expects a list, Python will iterate over the string character by character, which can be confusing if this behaviour is unfamiliar.
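To make both points concrete, here is a small sketch. count_words is a hypothetical helper, not the code from the answers above. It counts whole words rather than substrings, and the last call shows what happens when a bare string is passed where a list is expected. It assumes words are separated by whitespace and does not strip punctuation.

```python
from collections import Counter


def count_words(text, listwords):
    """Count whole-word, case-insensitive matches for each requested word."""
    counts = Counter(text.lower().split())
    return {word: counts[word.lower()] for word in listwords}


text = "A cat and a caterpillar sat on a mat"
print(count_words(text, ["a", "cat"]))  # {'a': 3, 'cat': 1}
print(text.count("cat"))                # 2, because "caterpillar" also matches
print(count_words(text, "cat"))         # {'c': 0, 'a': 3, 't': 0}
```

Note that str.count() matches substrings, which is why it reports 2 for "cat" here, while splitting into words first reports 1.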

Sorting words in a text file (with parameters) and writing them into a new file with Python

I have a file.txt with thousands of words, and I need to create a new file based on certain parameters, and then sort them a certain way.
Assuming the user imports the proper libraries when they test, what is wrong with my code? (There are 3 separate functions)
For the first, I must create a file with words containing certain letters, and sort them lexicographically, then put them into a new file list.txt.
    def getSortedContain(s,ifile,ofile):
        toWrite = ""
        toWrites = ""
        for line in ifile:
            word = line[:-1]
            if s in word:
                toWrite += word + "\n"
        newList = []
        newList.append(toWrite)
        newList.sort()
        for h in newList:
            toWrites += h
        ofile.write(toWrites[:-1])
The second is similar, but must be sorted reverse lexicographically, if the string inputted is NOT in the word.
    def getReverseSortedNotContain(s,ifile,ofile):
        toWrite = ""
        toWrites = ""
        for line in ifile:
            word = line[:-1]
            if s not in word:
                toWrite += word + "\n"
        newList = []
        newList.append(toWrite)
        newList.sort()
        newList.reverse()
        for h in newList:
            toWrites += h
        ofile.write(toWrites[:-1])
For the third, I must select words that contain a certain number of characters (n), and sort them lexicographically by the last character in each word.
    def getRhymeSortedCount(n, ifile, ofile):
        toWrite = ""
        for line in ifile:
            word = line[:-1] #gets rid of \n
            if len(word) == n:
                toWrite += word + "\n"
        reversetoWrite = toWrite[::-1]
        newList = []
        newList.append(toWrite)
        newList.sort()
        newList.reverse()
        for h in newList:
            toWrites += h
        reversetoWrite = toWrites[::-1]
        ofile.write(reversetoWrites[:-1])
Could someone please point me in the right direction for these? Right now they are not sorting as they're supposed to.
There is a lot that is unclear here, so I'll try my best to clean this up.
You're concatenating strings together into one big string and then appending that one big string to a list. You then try to sort your 1-element list, which does nothing. Instead, put all the strings into a list and then sort that list.
For example, for your first function do the following:
    def getSortedContain(s,ifile,ofile):
        # strip each line so the trailing newline is not written twice
        words = [line.strip() for line in ifile if s in line]
        words.sort()
        ofile.write("\n".join(words))
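The same list-based idea can be carried over to the other two functions. The sketch below is one possible reading of the requirements, not a definitive solution: in particular, it assumes that "sorting by the last character" means comparing words from their last letter backwards, which is implemented by sorting on the reversed word. io.StringIO stands in for real files so the example runs on its own.

```python
import io


def getReverseSortedNotContain(s, ifile, ofile):
    # keep non-empty words that do NOT contain s, then sort Z to A
    words = [w for w in (line.strip() for line in ifile) if w and s not in w]
    ofile.write("\n".join(sorted(words, reverse=True)))


def getRhymeSortedCount(n, ifile, ofile):
    # keep words of exactly n characters
    words = [w for w in (line.strip() for line in ifile) if len(w) == n]
    # assumption: sorting on the reversed word compares last letters first
    ofile.write("\n".join(sorted(words, key=lambda w: w[::-1])))


source = "bat\ncab\napple\ncat\n"

out1 = io.StringIO()
getReverseSortedNotContain("c", io.StringIO(source), out1)
print(out1.getvalue().split("\n"))  # ['bat', 'apple']

out2 = io.StringIO()
getRhymeSortedCount(3, io.StringIO(source), out2)
print(out2.getvalue().split("\n"))  # ['cab', 'bat', 'cat']
```

Passing key= to sorted() leaves the words themselves unchanged, so there is no need to reverse the output string afterwards.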

Python create a sorted list from file using append remove duplicate words

I am currently working through a Python course and I am lost on this after 6+ hours. The assignment directs the student to create a program where the user enters a file name and Python opens the file and builds a sorted word list without duplicates. The directions are very clear that a for loop and append must be used: "For each word on each line check to see if the word is already in the list and if not append it to the list."
    fname = raw_input("Enter file name: ")
    fh = open(fname)
    lst = list()
    for line in fh:
        line = line.strip()
        words = line.split()
    for words in fh:
        if words in 1st:continue
        elif 1st.append
    1st.sort()
    print 1st
It would be easier to just use the set() by itself, but this would be a good implementation per the assignment instructions. It's really fast compared to a list only version!
    def get_exclusive_list(fname):
        words = []
        with open(fname, 'r') as file_d:
            for line in file_d:
                words.extend(line.split())
        data = []
        already_seen = set()
        for thing in words:
            # set lookups are fast, so this scales better than "thing not in data"
            if thing not in already_seen:
                data.append(thing)
                already_seen.add(thing)
        return data
    # The better implementation (call sorted() on the result if order matters)
    def get_exclusive_list_improved(fname):
        with open(fname, 'r') as file_d:
            words = file_d.read().split()
        return list(set(words))
Not sure what the following loop is supposed to do:
    for words in fh:
        if words in 1st:continue
        elif 1st.append
The above does not do anything because you have already exhausted the file fh before control reaches this part.
You should put an inner loop inside "for line in fh:" that goes over the words in the words list one by one and appends each to lst if it's not already there.
Also, you should call lst.append(word); note that the list is named lst (with a lowercase L), not 1st.
Also, I do not think your if..elif block is valid syntax either.
You should be doing something like this example:
    for line in fh:
        line = line.strip()
        words = line.split()
        for word in words:
            if word not in lst:
                lst.append(word)
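Put together with the final sort, the loop might look like the sketch below. It is written for Python 3 and wrapped in a function with the hypothetical name sorted_unique_words so it can be tried without a file; an open file handle such as fh can be passed in place of the list of lines.

```python
def sorted_unique_words(lines):
    """Build a sorted list of unique words using only a for loop and append."""
    lst = []
    for line in lines:
        for word in line.split():
            if word not in lst:
                lst.append(word)
    lst.sort()
    return lst


print(sorted_unique_words(["but soft what light", "what light through yonder"]))
# ['but', 'light', 'soft', 'through', 'what', 'yonder']
```

Sorting once after the loop is enough; there is no need to keep the list sorted while words are being appended.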

python dictionary function, textfile

I would like to define a function scaryDict() which takes one parameter (a textfile) and returns the words from the textfile in alphabetical order, basically produce a dictionary but does not print any one or two letter words.
Here is what I have so far...it isn't much but I don't know the next step
    def scaryDict(fineName):
        inFile = open(fileName,'r')
        lines = inFile.read()
        line = lines.split()
        myDict = {}
        for word in inFile:
            myDict[words] = []
            #I am not sure what goes between the line above and below
        for x in lines:
            print(word, end='\n')
You are doing fine up to line = lines.split(). But your for loop must loop through the line list, not inFile.
    for word in line:
        if len(word) > 2: # Make sure to check the word length!
            myDict[word] = 'something'
I'm not sure what you want with the dictionary (maybe get the word count?), but once you have it, you can get the words you added to it with
    allWords = list(myDict.keys()) # so allWords is now a list of words
And then you can sort allWords to get them in alphabetical order.
    allWords.sort()
I would store all of the words into a set (to eliminate dups), then sort that set:
    #!/usr/bin/python3
    def scaryDict(fileName):
        with open(fileName) as inFile:
            return sorted(set(word
                              for line in inFile
                              for word in line.split()
                              if len(word) > 2))
    scaryWords = scaryDict('frankenstein.txt')
    print ('\n'.join(scaryWords))
Also keep in mind that as of Python 2.5, file objects have __enter__ and __exit__ methods, so they can be used in a 'with' statement. This prevents some issues (such as the file never getting closed):
    with open(...) as f:
        for line in f:
            <do something with line>
Unique set
Sort the set
Now you can put it all together.
sorry that i am 3 years late : ) here is my version
    def scaryDict():
        infile = open('filename', 'r')
        content = infile.read()
        infile.close()
        table = str.maketrans('.`/()|,\';!:"?=-', 15 * ' ')
        content = content.translate(table)
        words = content.split()
        new_words = list()
        for word in words:
            if len(word) > 2:
                new_words.append(word)
        new_words = list(set(new_words))
        new_words.sort()
        for word in new_words:
            print(word)
