This is supposed to be a compression function. We're supposed to read in a text file containing just words, and sort them by frequency and by the number of words, handling both upper and lower case. I don't want an answer that solves it for me, I just want help.
for each word in the input list
    if the word is new to the list of unique words
        insert it into the list and set its frequency to 1
    otherwise
        increase its frequency by 1
sort unique word list by frequencies (function)
open input file again
open output file with appropriate name based on input filename (.cmp)
write out to the output file all the words in the unique words list
for each line in the file (delimited by newlines only!)
    split the line into words
    for each word in the line
        look up each word in the unique words list
        and write its location out to the output file
        don't output a newline character
    output a newline character after the line is finished
close both files
tell user compression is done
This is my next step:
def compression():
    for line in infile:
        words = line.split()

def get_file():
    opened = False
    fn = input("enter a filename ")
    while not opened:
        try:
            infile = open(fn, "r")
            opened = True
        except:
            print("Won't open")
            fn = input("enter a filename ")
    return infile

def compression():
    infile = get_file()  # keep the file object returned by get_file()
    data = infile.readlines()
    infile.close()
    for line in data:  # loop over the lines already read, since the file is closed
        words = line.split()
    words = []
    word_frequencies = []

def main():
    choice = input("Do you want to compress or decompress? Enter 'C' or 'D' ")

main()
So I'm not going to do an entire assignment for you, but I can try my best and walk you through it one by one.
It would seem the first step is to create an array of all the words in the text file (assuming you know file reading methods). For this, you should look into Python's split function (regular expressions can be used for more complex variations of this). So you need to store each word somewhere, and pair that value with the amount of times it appears. Sounds like the job of a dictionary, to me. That should get you off on the right track.
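For instance, a minimal sketch of that dictionary idea (the file name here is just a placeholder):

counts = {}
with open("words.txt") as infile:  # placeholder file name
    for line in infile:
        for word in line.split():
            # get() supplies 0 the first time a word is seen
            counts[word] = counts.get(word, 0) + 1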
Thanks to the pseudocode I can more or less figure out what this should look like. I have to do some guesswork since you say you're restricted to what you've covered in class, and we have no way to know what that includes and what it doesn't.
I am assuming you can handle opening the input file and splitting it into words. That's pretty basic stuff - if your class is asking you to handle input files it must have covered it. If not, the main thing to look back to is that you can iterate over a file and get its lines:
with open('example.txt') as infile:
    for line in infile:
        words = line.split()
Now, for each word you need to keep track of two things - the word itself and its frequency. Your problem requires you to use lists to store your information. This means you have to either use two different lists, one storing the words and one storing their frequencies, or use one list that stores two facts for each of its positions. There are disadvantages either way - lists are not the best data structure to use for this, but your problem definition puts the better tools off limits for now.
I would probably use two different lists, one containing the words and one containing the frequency counts for the word in the same position in the word list. The advantage is that this allows using the index method on the word list to find the list position of a given word, and then incrementing the matching frequency count. That will be much faster than searching a list that stores both the word and its frequency count using a for loop. The downside is that sorting is harder - you have to find a way to retain the list position of each frequency when you sort, or to combine the words and their frequencies before sorting. In this approach, you probably need to build a list whose entries each store two pieces of information - the frequency count and then either the word list index or the word itself - and then sort that list by the frequency count.
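For example, a minimal sketch of that combine-then-sort step, assuming the parallel lists words and word_frequencies have already been filled in:

# pair each count with its word; tuples sort by their first element,
# so sorting puts the highest frequencies first
combined = list(zip(word_frequencies, words))
combined.sort(reverse=True)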
I hope that part of the point of this exercise is to help drive home how useful the better data structures are when you're allowed to use them.
So the inner loop is going to look something like this:
words = []
word_frequencies = []
for line in infile:
    for word in line.split():
        try:
            word_position = words.index(word)
        except ValueError:
            # word is not in words
            words.append(word)
            # what do you think should happen to word_frequencies here?
        else:
            # now word_position is a number giving the position in both
            # words and word_frequencies for the word
            # what do you think should happen to word_frequencies here?
            pass
Related
I'm pretty new to this and I was trying to write a program which counts the words in txt files. There is probably a better way of doing this, but this was the idea I came up with, so I wanted to go through with it. I just don't understand why i, or any variable, doesn't work as an index into the string of the page that I'm counting on...
Do you guys have a solution, or should I just take a different approach?
page = open("venv\harrry_potter.txt", "r")
alphabet = "qwertzuiopüasdfghjklöäyxcvbnmßQWERTZUIOPÜASDFGHJKLÖÄYXCVBNM"

# Counting the characters
list_of_lines = page.readlines()
characternum = 0
textstr = ""  # to convert the .txt file to string
for line in list_of_lines:
    for character in line:
        characternum += 1
        textstr += character

# Counting the words
i = 0
wordnum = 1
while i <= characternum:
    if textstr[i] not in alphabet and textstr[i+1] in alphabet:
        wordnum += 1
    i += 1
print(wordnum)
page.close()
Counting the characters and converting the .txt file to a string is done a bit weirdly, because I thought the other way could be the source of the problem...
Can you help me, please?
Typically you want to use split() for simplistic word counting. The way you are doing it, you will get "right-minded" counted as two words, or "don't" as two words. If you can just rely on spaces, then you can use split() like this:
book = "Hello, my name is Inigo Montoya, you killed my father, prepare to die."
words = book.split()
print(f'word count = {len(words)}')
You can also pass arguments to split() for more options if the default doesn't suit you.
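For example (the string here is made up, just to show the options):

line = "alpha,beta gamma"
print(line.split())        # default: any whitespace -> ['alpha,beta', 'gamma']
print(line.split(","))     # split on commas -> ['alpha', 'beta gamma']
print(line.split(" ", 1))  # split on spaces, at most one split -> ['alpha,beta', 'gamma']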
https://pythonexamples.org/python-count-number-of-words-in-text-file/
You want to get the word count of a text file
The shortest code is this (that I could come up with):
with open('lorem.txt', 'r') as file:
    print(len(file.read().split()))
First off, for smaller files this is fine, but it loads all of the data into memory, so it's not that great for large files. Also, use a context manager (with); it helps with error handling and other things. What happens here is that you print the length of the whole file read and split on whitespace: file.read() reads the whole file and returns a string, .split() splits that string and returns a list of the words between the spaces, and you take the length of that list.
A better approach would be this:
word_count = 0
with open('lorem.txt', 'r') as file:
    for line in file:
        word_count += len(line.split())
print(word_count)
This is better because the whole file is not held in memory; you read each line separately, overwriting the previous one in memory. Again, for each line you split it on whitespace, measure the length of the returned list, and add it to the total word count. At the end, simply print out the total word count.
Useful sources:
about with
Context Managers - Efficiently Managing Resources (to learn how they work a bit in detail) by Corey Schafer
.split() "docs"
After a long time researching and asking friends, I am still a dumb-dumb and don't know how to solve this.
So, for homework, we are supposed to define a function which accesses two files, the first of which is a text file with the following sentence, from which we are to calculate the word frequencies:
In a Berlin divided by the Berlin Wall , two angels , Damiel and Cassiel , watch the city , unseen and unheard by its human inhabitants .
We are also to include commas and periods: each single item has already been tokenised (individual items are surrounded by whitespaces - including the commas and periods). Then, the word frequencies must be entered into a new txt-file as "word:count", and in the order in which the words appear, i.e.:
In:1
a:1
Berlin:2
divided:1
etc.
I have tried the following:
def find_token_frequency(x, y):
    with open(x, encoding='utf-8') as fobj_1:
        with open(y, 'w', encoding='utf-8') as fobj_2:
            fobj_1list = fobj_1.split()
            unique_string = []
            for i in fobj_1list:
                if i not in unique_string:
                    unique_string.append(i)
            for i in range(0, len(unique_string)):
                fobj_2.write("{}: {}".format(unique_string[i], fobj_1list.count(unique_string[i])))
I am not sure I need to actually use .split() at all, but I don't know what else to do, and it does not work anyway, since it tells me I cannot split that object.
I am told:
Traceback (most recent call last):
[...]
fobj_1list = fobj_1.split()
AttributeError: '_io.TextIOWrapper' object has no attribute 'split'
When I remove the .split(), the displayed error is:
fobj_2.write("{}: {}".format(unique_string[i], fobj_1list.count(unique_string[i])))
AttributeError: '_io.TextIOWrapper' object has no attribute 'count'
Let's divide your problem into smaller problems so we can more easily solve this.
First we need to read a file, so let's do so and save it into a variable:
with open("myfile.txt") as fobj_1:
sentences = fobj_1.read()
Ok, so now we have your file as a string stored in sentences. Let's turn it into a list and count the occurrence of each word:
words = sentences.split(" ")
frequency = {word:words.count(word) for word in set(words)}
Here frequency is a dictionary where each word in the sentences is a key, with the value being how many times it appears in the sentences. Note the usage of set(words). A set does not have repeated elements; that's why we are iterating over the set of words and not the word list.
Finally, we can save the word frequencies into a file
with open("results.txt", 'w') as fobj_2:
for word in frequency: fobj_2.write(f"{word}:{frequency[word]}\n")
Here we use f strings to format each line into the desired output. Note that f-strings are available for python3.6+.
I'm unable to comment as I don't have the required reputation, but the reason split() isn't working is because you're calling it on the file object itself, not a string. Try calling:
fobj_1list = fobj_1.readline().split()
instead. Also, when I ran this locally, I got an error saying that TypeError: 'encoding' is an invalid keyword argument for this function. You may want to remove the encoding argument from your function calls.
I think that should be enough to get you going.
The following script should do what you want.
#!/usr/local/bin/python3

def find_token_frequency(inputFileName, outputFileName):
    # wordOrderList to maintain order
    # dict to keep track of count
    wordOrderList = []
    wordCountDict = dict()

    # read the file
    inputFile = open(inputFileName, encoding='utf-8')
    lines = inputFile.readlines()
    inputFile.close()

    # iterate over all lines in the file
    for line in lines:
        # and split them into words
        words = line.split()
        # now, iterate over all words
        for word in words:
            # and add them to the list and dict
            if word not in wordOrderList:
                wordOrderList.append(word)
                wordCountDict[word] = 1
            else:
                # or increment their count
                wordCountDict[word] = wordCountDict[word] + 1

    # store result in outputFile
    outputFile = open(outputFileName, 'w', encoding='utf-8')
    for index in range(0, len(wordOrderList)):
        word = wordOrderList[index]
        outputFile.write(f'{word}:{wordCountDict[word]}\n')
    outputFile.close()

find_token_frequency("input.txt", "output.txt")
I changed your variable names a bit to make the code more readable.
Problem: Program seems to get stuck opening a file to read.
My problem is that at the very beginning the program seems to be broken. It just displays
[(1, 'C:\Users\....\Desktop\Sense_and_Sensibility.txt')]
over and over, never-ending.
(NOTE: .... is a replacement for the purpose of posting because my computer username is my full name).
I'm not sure if I've coded this completely incorrectly, or if it's having problems opening the file. Any help is appreciated.
The program should:
1: open a file, replace all punctuation with spaces, change all words to lowercase, then store them in a dictionary.
2: look at a list of words (stop words) that will be removed from the original dictionary.
3: count the remaining words and sort based on frequency.
fname = r"C:\Users\....\Desktop\Sense_and_Sensibility.txt" # file to read
swfilename = r"C:\Users\....\Desktop\stopwords.txt" # words to delete
with open(fname) as file: # have the program run the file
for line in file: # loop through
fname.replace('-.,"!?', " ") # replace punc. with space
words = fname.lower() # make all words lowercase
word_list = fname.split() # separate the words, store
word_dict = {} # create a dictionary
with open(swfilename) as delete: # open stop word list
for line in delete:
sw_list = swfilename.split() # separate the words, store them
sw_dict = {}
for key in sw_dict:
word_dict.pop(key, None) # delete common words
for word in word_list: # loop through
word_dict[word] = word_dict.get(word, 0) + 1 # count frequency
word_freq = [] # create index
for key, value in word_dict.items(): # count occurrences
word_freq.append((value, key)) # append freq list
word_freq.sort(reverse=True) # sort the words by freq
print(word_freq) # print most to least
Opening files on Windows with Python is somewhat different compared to Mac and Linux.
Just change the file path from fname = r"C:\Users\....\Desktop\Sense_and_Sensibility.txt"
to fname = "C:\\Users\\....\\Desktop\\Sense_and_Sensibility.txt"
Use double backslashes.
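For what it's worth, the two spellings produce the same string, so this is worth a try but may not be the root cause; a quick check (path shortened here):

p1 = r"C:\Users\me\Desktop\file.txt"     # raw string: backslashes taken literally
p2 = "C:\\Users\\me\\Desktop\\file.txt"  # doubled backslashes
print(p1 == p2)  # True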
There are a couple of issues with your code. I would only discuss the most obvious one, given that it is impossible to reproduce your exact observations because the input you are using is not accessible to the readers.
I will first report your code verbatim and mark weak points with ??? followed by a number, which I will address after the code.
fname = r"C:\Users\....\Desktop\Sense_and_Sensibility.txt" #file to read
swfilename = r"C:\Users\....\Desktop\stopwords.txt" #words to delete
with open(fname) as file: #???(1) have the program run the file
for line in file: #loop through
fname.replace ('-.,"!?', " ") #???(2) replace punc. with space
words = fname.lower() #???(3) make all words lowercase
word_list = fname.split() #separate the words, store
word_dict = {} #???(4) create a dictionary
with open(swfilename) as delete: #open stop word list
for line in delete:
sw_list = swfilename.split() #separate the words, store them
sw_dict = {}
for key in sw_dict:
word_dict.pop(key, None) #???(5) delete common words
for word in word_list: #???(6) loop through
word_dict[word] = word_dict.get(word, 0) + 1 #???(7) count frequency
word_freq = [] #???(8)create index
for key, value in word_dict.items(): #count occurrences
word_freq.append((value, key)) #append freq list
word_freq.sort(reverse = True) #sort the words by freq
print(word_freq) #print most to least
(minor) file shadows a built-in name in Python, and it is good practice not to reuse it for custom purposes as you are doing
(major) .replace() will replace the exact string on the left with the exact string on the right, but what you would like to do is perform some sort of multi_replace(), which you could implement yourself (for example as a function) by consecutive calls to .replace() in a loop (or using functools.reduce()).
(major) fname contains the file name (path, actually) and not the content of the file you want to work with.
(major) You are looping through the lines of the file, but if you create your word_list and word_dict for each line, you will "overwrite" the content at each iteration. Also, the word_dict is created empty and never filled.
(major) The logic you are trying to implement will not work on a dictionary, because dictionaries cannot contain multiple identical keys. A more effective approach would be to create a filtered_list from the word_list by excluding the stop_words. The dictionary can then be used to implement a counter. I do understand that at your level it may be worth learning how to implement a counter, but please keep in mind that collections.Counter from the standard library (accessible via import collections) does exactly what you want - see the sketch after this list.
(major) At this point there is nothing useful left from the code above; in any case, looping through the original list instead of through the filtered list means the stop-word filtering never takes effect.
(major) dictionary[key] can be used both for accessing (which you do not do) and for writing (which you do) the value associated to a specific key in a dictionary.
(minor) Obviously, your approach for sorting according to word frequency would work, but a much better approach would be to use the parameter key of .sort() and sorted().
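As mentioned in point 5, here is a minimal sketch of the collections.Counter approach (the file names are assumptions, and the punctuation handling from point 2 is omitted for brevity):

import collections

with open("Sense_and_Sensibility.txt") as f:   # assumed input file
    words = f.read().lower().split()

with open("stopwords.txt") as f:               # assumed stop word file
    stop_words = set(f.read().split())

filtered = [w for w in words if w not in stop_words]

# Counter does the counting; most_common() sorts by descending frequency
for word, count in collections.Counter(filtered).most_common():
    print(count, word)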
Hope this helps!
I have a very large array with many rows and many columns (called "self.csvFileArray") that is composed of rows that I've read from a CSV file, using the following code in a class that deals with CSV files...
with open(self.nounDef["Noun Source File Name"], 'rU') as csvFile:
    for idx, row in enumerate(csv.reader(csvFile, delimiter=',')):
        if idx == 0:
            self.csvHeader = row
        self.csvFileArray.append(row)
I have a very long dictionary of replacement mappings that I'd like to use for replacements...
replacements = {"str1a":"str1b", "str2a":"str2b", "str3a":"str3b", etc.}
I'd like to do this in a class method that looks as follows...
def m_globalSearchAndReplace(self, replacements):
    # apply replacements dictionary to self.csvFileArray...
MY QUESTION: What is the most efficient way to replace strings throughout the array "self.csvFileArray", using the "replacements" dictionary?
NOTES FOR CLARIFICATION:
I took a look at this post but can't seem to get it to work for this case.
Also, I want to replace strings within words that match, not just entire words. So, working with a replacement mapping of "SomeCompanyName":"xyz", I may have a sentence like "The company SomeCompanyName has a patent for product called abcSomeCompanyNamedef." You'll notice that the string has to be replaced, twice, in the sentence... once as a whole word and once as an embedded string.
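To illustrate, str.replace() does handle both cases at once, since it replaces every occurrence, embedded or not:

s = "The company SomeCompanyName has a patent for product called abcSomeCompanyNamedef."
print(s.replace("SomeCompanyName", "xyz"))
# prints: The company xyz has a patent for product called abcxyzdef.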
The following works with the above and has been fully tested...
def m_globalSearchAndReplace(self, dataMap):
    replacements = dataMap.m_getMappingDictionary()
    keys = replacements.keys()
    for row in self.csvFileArray:  # Loop through each row/list
        for idx, w in enumerate(row):  # Loop through each word in the row/list
            for key in keys:  # For every key in the dictionary...
                if key != 'NULL' and key != '-' and key != '.' and key != '':
                    w = w.replace(key, replacements[key])
            row[idx] = w
In short, loop through every row in the csvFileArray and get every word.
Then, for every word in the row, loop through the dictionary's (called "replacements") keys to access and apply each mapping.
Then (assuming the right conditions) replace the value with its mapped value (in the dictionary).
NOTE: While it works, I don't believe that this triple-nested loop is the most efficient way to solve the problem, and I believe there has to be a better way using regular expressions. So I'll leave this open for a bit to see if anyone can improve on the answer.
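One possible shape of that regex idea (an untested sketch; it mirrors the names used above, and re.escape guards keys containing regex metacharacters):

import re

def m_globalSearchAndReplace(self, dataMap):
    replacements = dataMap.m_getMappingDictionary()
    # skip the same keys the loop above skips, and try longer keys first
    # so a short key can't shadow a longer one it is contained in
    keys = [k for k in replacements if k not in ('NULL', '-', '.', '')]
    pattern = re.compile("|".join(re.escape(k) for k in sorted(keys, key=len, reverse=True)))
    for row in self.csvFileArray:
        # one compiled-regex pass per cell instead of one .replace() per key
        row[:] = [pattern.sub(lambda m: replacements[m.group(0)], cell) for cell in row]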
In a big loop? You could just load the csv file as a string so you only have to look through your list once instead of for every item. Though it's not really very efficient, since Python strings are immutable; you're still facing the same problem either way.
According to this answer, Optimizing find and replace over large files in Python (re the efficiency), maybe line by line would work better, so you don't have the giant string in memory if that actually becomes a problem.
edit: So something like this...
# open original and new file.
with open(old_file, 'r') as old_f, open(new_file, 'w') as new_f:
    # loop through each line of the original file (old file)
    for old_line in old_f:
        new_line = old_line
        # loop through your dictionary of replacements and make them.
        for r in replacements:
            new_line = new_line.replace(r, replacements[r])
        # write each line to the new file.
        new_f.write(new_line)
Anyway, I would forget the file is a csv file and just treat it like a big collection of lines or characters.
I have multiple files, each with a line with, say ~10M numbers each. I want to check each file and print a 0 for each file that has numbers repeated and 1 for each that doesn't.
I am using a list for counting frequency. Because of the large amount of numbers per line I want to update the frequency after accepting each number and break as soon as I find a repeated number. While this is simple in C, I have no idea how to do this in Python.
How do I input a line in a word-by-word manner without storing (or taking as input) the whole line?
EDIT: I also need a way for doing this from live input rather than a file.
Read the line, split the line, copy the array result into a set. If the size of the set is less than the size of the array, the file contains repeated elements
with open('filename', 'r') as f:
    for line in f:
        # Here is where you do what I said above
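A sketch of what that comment describes, printing 0 for a file with repeats and 1 otherwise, per the question:

with open('filename', 'r') as f:
    for line in f:
        nums = line.split()
        # a set keeps only unique values, so a smaller set means repeats
        print(0 if len(set(nums)) < len(nums) else 1)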
To read the file word by word, try this
import itertools

def readWords(file_object):
    # note: itertools.imap exists only on Python 2; on Python 3 use the built-in map
    word = ""
    for ch in itertools.takewhile(lambda c: bool(c),
                                  itertools.imap(file_object.read, itertools.repeat(1))):
        if ch.isspace():
            if word:  # In case of multiple spaces
                yield word
                word = ""
            continue
        word += ch
    if word:
        yield word  # Handles last word before EOF
Then you can do:
with open('filename', 'r') as f:
    for num in itertools.imap(int, readWords(f)):
        # Store the numbers in a set, and use the set to check if the number already exists
        pass
This method should also work for streams because it only reads one byte at a time and outputs a single space delimited string from the input stream.
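If it helps, here is a sketch of what that loop body could look like, breaking at the first repeat as the question asks (using the built-in map, which replaces itertools.imap on Python 3):

seen = set()
repeated = False
with open('filename', 'r') as f:
    for num in map(int, readWords(f)):
        if num in seen:
            repeated = True
            break  # stop as soon as a repeated number shows up
        seen.add(num)
print(0 if repeated else 1)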
After giving this answer, I've updated this method quite a bit. Have a look
https://gist.github.com/smac89/bddb27d975c59a5f053256c893630cdc
The way you are asking, it is not possible, I guess. You can't read word by word as such in Python. Something like this can be done:
f = open('words.txt')
for word in f.read().split():
    print(word)