I CANNOT USE ANY IMPORTED LIBRARY. I have a task where I have some directories containing some files; every file contains, besides some words, the name of the next file to be opened in its first line. Once every word of every file contained in a directory has been read, the words have to be processed so that a single string is returned; that string contains, in its first position, the most frequent first letter of all the words seen before, in its second position the most frequent second letter, and so on. I have managed to do this with a directory containing 3 files, but it's not using any kind of chain-like mechanism, rather a passing of local variables. Some of my university colleagues suggested I use slicing of lists, but I can't figure out how. Again: I CANNOT USE ANY IMPORTED LIBRARY.
This is what I got:
'''
The objective of the homework assignment is to design and implement a function
that reads some strings contained in a series of files and generates a new
string from all the strings read.
The strings to be read are contained in several files, linked together to
form a closed chain. The first string in each file is the name of another
file that belongs to the chain: starting from any file and following the
chain, you always return to the starting file.
Example: the first line of file "A.txt" is "B.txt," the first line of file
"B.txt" is "C.txt," and the first line of "C.txt" is "A.txt," forming the
chain "A.txt"-"B.txt"-"C.txt".
In addition to the string with the name of the next file, each file also
contains other strings separated by spaces, tabs, or carriage return
characters. The function must read all the strings in the files in the chain
and construct the string obtained by concatenating the characters with the
highest frequency in each position. That is, in the string to be constructed,
at position p, there will be the character with the highest frequency at
position p of each string read from the files. In the case where there are
multiple characters with the same frequency, consider the alphabetical order.
The generated string has a length equal to the maximum length of the strings
read from the files.
Therefore, you must write a function that takes as input a string "filename"
representing the name of a file and returns a string.
The function must construct the string according to the directions outlined
above and return the constructed string.
Example: if the contents of the three files A.txt, B.txt, and C.txt in the
directory test01 are as follows
test01/A.txt    test01/B.txt    test01/C.txt
-------------------------------------------------------------------------------
test01/B.txt    test01/C.txt    test01/A.txt
house           home            kite
garden          park            hello
kitchen         affair          portrait
balloon         angel
surfing
the function most_frequent_chars ("test01/A.txt") will return "hareennt".
'''
def file_names_list(filename):
    intermezzo = []
    lista_file = []
    a_file = open(filename)
    lines = a_file.readlines()
    for line in lines:
        intermezzo.extend(line.split())
        del intermezzo[1:]
        lista_file.append(intermezzo[0])
        intermezzo.pop(0)
    return lista_file
def words_list(filename):
    lista_file = []
    a_file = open(filename)
    lines = a_file.readlines()[1:]
    for line in lines:
        lista_file.extend(line.split())
    return lista_file
def stuff_list(filename):
    file_list = file_names_list(filename)
    the_rest = words_list(filename)
    second_file_name = file_names_list(file_list[0])
    the_lists = words_list(file_list[0]) and words_list(second_file_name[0])
    the_rest += the_lists[0:]
    return the_rest
def most_frequent_chars(filename):
    huge_words_list = stuff_list(filename)
    maxOccurs = ""
    list_of_chars = []
    for i in range(len(max(huge_words_list, key=len))):
        for item in huge_words_list:
            try:
                list_of_chars.append(item[i])
            except IndexError:
                pass
        maxOccurs += max(sorted(set(list_of_chars)), key=list_of_chars.count)
        list_of_chars.clear()
    return maxOccurs

print(most_frequent_chars("test01/A.txt"))
This assignment is relatively easy, if the code has a good structure. Here is a full implementation:
def read_file(fname):
    with open(fname, 'r') as f:
        return list(filter(None, [y.rstrip(' \n').lstrip(' ') for x in f for y in x.split()]))

def read_chain(fname):
    seen = set()
    new = fname
    result = []
    while new not in seen:
        A = read_file(new)
        seen.add(new)
        new, words = A[0], A[1:]
        result.extend(words)
    return result

def most_frequent_chars(fname):
    all_words = read_chain(fname)
    result = []
    for i in range(max(map(len, all_words))):
        chars = [word[i] for word in all_words if i < len(word)]
        result.append(max(sorted(set(chars)), key=chars.count))
    return ''.join(result)

print(most_frequent_chars("test01/A.txt"))
# output: "hareennt"
In the code above, we define 3 functions:
read_file: a simple function that reads the contents of a file and returns a list of strings. The call x.split() takes care of any spaces or tabs used to separate words. The outer list(filter(None, ...)) makes sure that empty strings are removed from the result.
read_chain: a simple routine that iterates through the chain of files and returns all the words contained in them.
most_frequent_chars: the main routine, which for each position picks the most frequent character, breaking ties alphabetically.
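For the example directory from the assignment text, these helpers should behave roughly as follows (a sketch, with the return values derived from the sample files listed above):

read_file("test01/A.txt")
# -> ['test01/B.txt', 'house', 'garden', 'kitchen', 'balloon', 'surfing']
read_chain("test01/A.txt")
# -> every word from A.txt, B.txt and C.txt, without the file names:
# ['house', 'garden', 'kitchen', 'balloon', 'surfing',
#  'home', 'park', 'affair', 'angel', 'kite', 'hello', 'portrait']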
PS. This line of code you had is very interesting:
maxOccurs += max(sorted(set(list_of_chars)), key = list_of_chars.count)
I edited my code to include it.
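As a side note, the reason it breaks ties correctly: sorted() puts the candidate characters in alphabetical order, and max() keeps the first of the equally frequent ones. A quick self-contained sketch:

# 'a' and 'o' both appear twice; the alphabetically first one wins.
chars = ['o', 'a', 'i', 'a', 'o']
print(max(sorted(set(chars)), key=chars.count))   # -> 'a'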
Space complexity optimization
The space complexity of the previous code can be improved by orders of magnitude, if the files are scanned without storing all the words:
def scan_file(fname, database):
    with open(fname, 'r') as f:
        next_file = None
        for x in f:
            for y in x.split():
                if next_file is None:
                    next_file = y
                else:
                    for i, c in enumerate(y):
                        while len(database) <= i:
                            database.append({})
                        if c in database[i]:
                            database[i][c] += 1
                        else:
                            database[i][c] = 1
    return next_file

def most_frequent_chars(fname):
    database = []
    seen = set()
    new = fname
    while new not in seen:
        seen.add(new)
        new = scan_file(new, database)
    return ''.join(max(sorted(d.keys()), key=d.get) for d in database)

print(most_frequent_chars("test01/A.txt"))
# output: "hareennt"
Now we scan the files tracking the frequency of the characters in database, without storing intermediate arrays.
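To make the data structure concrete: each entry of database is a dict mapping a character to its count at that position. For the twelve sample words above, the first entry would look like this (a self-contained sketch, counts written out by hand):

# Character counts at position 0 for the example chain (house, garden, ..., portrait):
first_pos = {'h': 3, 'g': 1, 'k': 2, 'b': 1, 's': 1, 'p': 2, 'a': 2}
print(max(sorted(first_pos), key=first_pos.get))   # -> 'h'; ties would break alphabetically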
Ok, here's my solution:
def parsi_file(filename):
    visited_files = set()
    words_list = []
    # Getting words from all files
    while filename not in visited_files:
        visited_files.add(filename)
        with open(filename) as f:
            filename = f.readline().strip()
            words_list += [line.strip() for line in f.readlines()]
    # Creating dictionaries of letters:count for each index
    letters_dicts = []
    for word in words_list:
        for i in range(len(word)):
            if i > len(letters_dicts) - 1:
                letters_dicts.append({})
            letter = word[i]
            if letters_dicts[i].get(letter):
                letters_dicts[i][letter] += 1
            else:
                letters_dicts[i][letter] = 1
    # Sorting dicts and getting the "best" letter
    code = ""
    for dic in letters_dicts:
        sorted_letters = sorted(dic, key=lambda letter: (-dic[letter], letter))
        code += sorted_letters[0]
    return code
We first get the words_list from all files.
Then, for each index, we create a dictionary of the letters in all words at that index, with their count.
Now we sort the dictionary keys by descending count (-count) then by alphabetical order.
Finally we take the first letter (thus the one with the maximum count, ties broken alphabetically) and append it to the "code" string for this chain of files.
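A quick illustration of that sort key, using a hypothetical count dictionary for a single position:

# 'a' and 'o' tie on frequency, so the negated count leaves them equal and the letter decides.
dic = {'o': 2, 'a': 2, 'i': 1}
print(sorted(dic, key=lambda letter: (-dic[letter], letter)))   # -> ['a', 'o', 'i']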
Edit: in terms of efficiency, parsing through all the words once per index gets worse as the number of words grows, so it is better to build the dictionaries for every index during a single pass over the list of words. Done.
Related
I have a program that pulls out text around a specific keyword. I'm trying to modify it so that if two keywords are close enough together, it just shows one longer snippet of text instead of two individual snippets.
My current code, below, adds words after the keyword to a list and resets the counter if another keyword is found. However, I've found two problems with this. The first is that the data rate limit in my Spyder notebook is exceeded, and I haven't been able to deal with that. The second is that although this would make a longer snippet, it wouldn't get rid of the duplicate.
Does anyone know a way to get rid of the duplicate snippet, or a way to merge the snippets that doesn't exceed the data rate limit (or how to change the Spyder rate limit)? Thank you!!
def occurs(word1, word2, file, filewrite):
    import os
    infile = open(file, 'r')  # opens file, reads, splits into lines
    lines = infile.read().splitlines()
    infile.close()
    wordlist = [word1, word2]  # this list allows for multiple words
    wordsString = ''.join(lines)  # splits file into individual words
    words = wordsString.split()
    f = open(file, 'w')
    f.write("start")
    f.write(os.linesep)
    g = open(filewrite, 'w')
    g.write("start")
    g.write(os.linesep)
    for item in wordlist:  # multiple words
        matches = [i for i, w in enumerate(words) if w.lower().find(item) != -1]
        # above line goes through lines, finds the specific words we want
        for m in matches:  # next three lines find each instance of the word, print out surrounding words
            list = []
            s = ""
            l = " ".join(words[m-20:m+1])
            j = 0
            while j < 20:
                list.append(words[m+i])
                j = j + 1
                if words[m+i] == word1 or words[m+i] == word2:
                    j = 0
            print(list)
            k = " ".join(list)
            f.write(f"{s}...{l}{k}...")  # writes the data to the external file
            f.write(os.linesep)
            g.write(str(m))
            g.write(os.linesep)
    f.close
    g.close
I am making a mini-translator for Japanese words in a given file.
The script has an expandable dictionary file that contains 13k+ lines in this format:
JapaneseWord<:to:>EnglishWord
So I have to pick a line from the dictionary, then .strip() the newline and .split() it to make a list in this format:
[JapaneseWord, EnglishWord]
Then I have to pick a line from the given file, find the first item of this list in that line, and replace it with its English equivalent, repeating this in the same line for the number of times the Japanese word appears (found with the .count() function).
The problem is that this takes a long time, because this way I have to read the file again and again, 14k+ times, and it will only get worse as I expand the dictionary.
I tried looking for a way to load the whole dictionary into memory and then apply all the replacements to the given file in a single pass, so I would only have to read the file once, but I couldn't get it to work.
Here's the function I am using right now; it takes a variable that contains the file's lines as a list, obtained with file.readlines():
def replacer(text):
    # Current Dictionary.
    cdic = open(argv[4], 'r', encoding='utf-8')
    # Part To Replace.
    for ptorep in cdic:
        ptorep = ptorep.strip('\n')
        ptorep = ptorep.split('<:to:>')
        for line in text:
            for clone in range(0, line.count(ptorep[0])):
                line = line.replace(ptorep[0], ptorep[1])
    text = ''.join(text)
    return text
This takes around 1 min for a single small file.
Dictionary Method:
import re

with open(argv[4], 'r', encoding='utf-8') as file:
    translations = [line.strip('\n').split('<:to:>') for line in file.readlines()]
    translations = {t[0]: t[1] for t in translations}  # Convert to a dictionary where the key is the Japanese word and the value is its English translation

output = []
for word in re.split(r'\W+', text):  # Split the input into words (may require tweaking); `text` here is the full input as one string
    output.append(translations.get(word, word))  # Look up the key `word`; if it does not exist, keep `word` unchanged
output = ''.join(output)  # Note: the separators removed by the split are not restored here
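A tiny, self-contained illustration of the lookup with fallback (hypothetical dictionary entries, not taken from the real dictionary file):

translations = {'犬': 'dog', '猫': 'cat'}
words = ['犬', 'と', '猫']
print([translations.get(w, w) for w in words])   # -> ['dog', 'と', 'cat']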
Original Method:
Maybe keep the full dictionary in memory as a list:
cdic = open(argv[4], 'r', encoding='utf-8')
translations = []
for line in cdic.readlines():
    translations.append(line.strip('\n').split('<:to:>'))

# Note: I would use a list comprehension for this
with open(argv[4], 'r', encoding='utf-8') as file:
    translations = [line.strip('\n').split('<:to:>') for line in file.readlines()]
And make the replacements off of that:
def replacer(text, translations):
    for entry in translations:
        text = text.replace(entry[0], entry[1])
    return text
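A hypothetical usage sketch, assuming the whole input file is read into a single string first (the file name is only a placeholder):

with open('input.txt', 'r', encoding='utf-8') as f:   # 'input.txt' is a placeholder name
    text = f.read()
print(replacer(text, translations))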
Problem: Program seems to get stuck opening a file to read.
My problem is that at the very beginning the program seems to be broken. It just displays
[(1, 'C:\Users\....\Desktop\Sense_and_Sensibility.txt')]
over and over, never-ending.
(NOTE: .... is a replacement for the purpose of posting because my computer username is my full name).
I'm not sure if I've coded this completely incorrectly, or if it's having problems opening the file. Any help is appreciated.
The program should:
1: open a file, replace all punctuation with spaces, change all words to lowercase, then store them in a dictionary.
2: look at a list of words (stop words) that will be removed from the original dictionary.
3: count the remaining words and sort based on frequency.
fname = r"C:\Users\....\Desktop\Sense_and_Sensibility.txt" # file to read
swfilename = r"C:\Users\....\Desktop\stopwords.txt" # words to delete
with open(fname) as file: # have the program run the file
for line in file: # loop through
fname.replace('-.,"!?', " ") # replace punc. with space
words = fname.lower() # make all words lowercase
word_list = fname.split() # separate the words, store
word_dict = {} # create a dictionary
with open(swfilename) as delete: # open stop word list
for line in delete:
sw_list = swfilename.split() # separate the words, store them
sw_dict = {}
for key in sw_dict:
word_dict.pop(key, None) # delete common words
for word in word_list: # loop through
word_dict[word] = word_dict.get(word, 0) + 1 # count frequency
word_freq = [] # create index
for key, value in word_dict.items(): # count occurrences
word_freq.append((value, key)) # append freq list
word_freq.sort(reverse=True) # sort the words by freq
print(word_freq) # print most to least
Opening files on Windows with Python is somewhat different from doing it on Mac and Linux.
Just change the path of the file from fname = r"C:\Users\....\Desktop\Sense_and_Sensibility.txt"
to fname = "C:\\Users\\....\\Desktop\\Sense_and_Sensibility.txt"
Use double backslashes.
There are a couple of issues with your code. I will only discuss the most obvious ones, given that it is impossible to reproduce your exact observations because the input you are using is not accessible to the readers.
I will first report your code verbatim and mark weak points with ??? followed by a number, which I will address after the code.
fname = r"C:\Users\....\Desktop\Sense_and_Sensibility.txt" #file to read
swfilename = r"C:\Users\....\Desktop\stopwords.txt" #words to delete
with open(fname) as file: #???(1) have the program run the file
for line in file: #loop through
fname.replace ('-.,"!?', " ") #???(2) replace punc. with space
words = fname.lower() #???(3) make all words lowercase
word_list = fname.split() #separate the words, store
word_dict = {} #???(4) create a dictionary
with open(swfilename) as delete: #open stop word list
for line in delete:
sw_list = swfilename.split() #separate the words, store them
sw_dict = {}
for key in sw_dict:
word_dict.pop(key, None) #???(5) delete common words
for word in word_list: #???(6) loop through
word_dict[word] = word_dict.get(word, 0) + 1 #???(7) count frequency
word_freq = [] #???(8)create index
for key, value in word_dict.items(): #count occurrences
word_freq.append((value, key)) #append freq list
word_freq.sort(reverse = True) #sort the words by freq
print(word_freq) #print most to least
(1) (minor) file was a built-in name in Python 2, and it is good practice not to shadow such names for custom purposes as you are doing.
(2) (major) .replace() replaces the exact string on the left with the exact string on the right, and it returns a new string rather than modifying the original, so the result must be assigned back. What you would like to do is perform some sort of multi_replace(), which you could implement yourself (for example as a function) by consecutive calls to .replace(), for example in a loop (or using functools.reduce()).
(3) (major) fname contains the file name (the path, actually) and not the content of the file you want to work with.
(4) (major) You are looping through the lines of the file, but if you create your word_list and word_dict for each line, you "overwrite" their content at each iteration. Also, word_dict is created empty and never filled.
(5) (major) The logic you are trying to implement will not work on a dictionary, because dictionaries cannot contain multiple identical keys. A more effective approach would be to create a filtered_list from the word_list by excluding the stop words. The dictionary can then be used to implement a counter. I do understand that at your level it may be worth learning how to implement a counter, but please keep in mind that collections.Counter from the standard library (accessible via import collections) does exactly what you want (a minimal corrected sketch follows this list).
(6) (major) At this point there is little useful left of your code: looping through the original list instead of through a filtered list means the stop words are never actually excluded.
(7) (major) dictionary[key] can be used both for accessing (which you do not do) and for writing (which you do) the value associated with a specific key in a dictionary.
(8) (minor) Obviously, your approach to sorting according to word frequency would work, but a much better approach is to use the key parameter of .sort() and sorted().
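For reference, here is a minimal corrected sketch of the whole flow, following the points above (not tested against your files, and the punctuation set is only an example):

fname = r"C:\Users\....\Desktop\Sense_and_Sensibility.txt"
swfilename = r"C:\Users\....\Desktop\stopwords.txt"

with open(fname) as book_file:
    text = book_file.read().lower()          # work on the file content, not on the path
for ch in '-.,"!?':                          # poor man's multi_replace(): one .replace() per character
    text = text.replace(ch, ' ')
word_list = text.split()

with open(swfilename) as sw_file:
    stop_words = set(sw_file.read().split())

word_dict = {}
for word in word_list:
    if word not in stop_words:               # exclude the stop words before counting
        word_dict[word] = word_dict.get(word, 0) + 1

word_freq = sorted(word_dict.items(), key=lambda kv: kv[1], reverse=True)
print(word_freq)                             # (word, count) pairs, most frequent first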
Hope this helps!
I have a textfile with the following format:
word_form root_form morphological_form frequency
word_form root_form morphological_form frequency
word_form root_form morphological_form frequency
... with 1 million items
But some of the word_forms contain an apostrophe ('), others do not, so I would like to count them as instances of the same word, that's to say I would like to merge lines like these two:
cup'board cup blabla 12
cupboard cup blabla2 10
into this one (frequencies added):
cupboard cup blabla2 22
I am looking for a solution in Python 2.7 to do that. My first idea was to read the text file, store the words with an apostrophe and the words without in two different dictionaries, then go over the dictionary of words with an apostrophe and test whether each of these words is already in the dictionary without apostrophes; if it is, update the frequency, and if not, simply add the line with the apostrophe removed. Here is my code:
class Lemma:
    """Creates a Lemma with the word form, the root, the morphological analysis and the frequency in the corpus"""
    def __init__(self, lop):
        self.word_form = lop[0]
        self.root = lop[1]
        self.morph = lop[2]
        self.freq = int(lop[3])

def Reader(filename):
    """Keeps the lines of a file in memory for a single reading, memory efficient"""
    with open(filename) as f:
        for line in f:
            yield line
def get_word_dict(filename):
    '''Separates the word list into two dictionaries, one for words without apostrophe and one for words with apostrophe'''
    '''Works in a reasonable time'''
    '''This step can be done writing line by line, avoiding all storage in memory'''
    word_dict = {}
    word_dict_striped = {}
    # We store the lemmas in two dictionaries, word_dict for words without apostrophe, word_dict_striped for words with apostrophe
    with open('word_dict.txt', 'wb') as f:
        with open('word_dict_striped.txt', 'wb') as g:
            reader = Reader(filename)
            for line in reader:
                items = line.split("\t")
                word_form = items[0]
                if "'" in word_form:
                    # we remove the apostrophe in the word form and morphological analysis and add the lemma to the dictionary word_dict_striped
                    items[0] = word_form.replace("'", "")
                    items[2] = items[2].replace("\+Apos", "")
                    g.write("%s\t%s\t%s\t%s" % (items[0], items[1], items[2], items[3]))
                    word_dict_striped.update({items[0]: Lemma(items)})
                else:
                    # we just add the lemma to the dictionary word_dict
                    f.write("%s\t%s\t%s\t%s" % (items[0], items[1], items[2], items[3]))
                    word_dict.update({items[0]: Lemma(items)})
    return word_dict, word_dict_striped
def merge_word_dict(word_dict, word_dict_striped):
    '''Takes two dictionaries and merge them by adding the count of their frequencies if there is a common key'''
    ''' Does not run in reasonable time on the whole list '''
    with open('word_compiled_dict.txt', 'wb') as f:
        for word in word_dict_striped.keys():
            if word in word_dict.keys():
                word_dict[word].freq += word_dict_striped[word].freq
                f.write("%s\t%s\t%s\t%s" % (word_dict[word].word_form, word_dict[word].root, word_dict[word].morph, word_dict[word].freq))
            else:
                word_dict.update(word_dict_striped[word])
    print "Number of words: ",
    print(len(word_dict))
    for x in word_dict:
        print x, word_dict[x].root, word_dict[x].morph, word_dict[x].freq
    return word_dict
This solution works in a reasonable time up to the point where the two dictionaries are stored, whether I write two text files line by line to avoid any storage or keep them as dict objects in the program. But the merging of the two dictionaries never ends!
The update function for dictionaries would work, but it overrides one frequency count instead of adding the two. I saw some solutions for merging dictionaries with addition using Counter:
Python: Elegantly merge dictionaries with sum() of values
Merge and sum of two dictionaries
How to sum dict elements
How to merge two Python dictionaries in a single expression?
Is there any pythonic way to combine two dicts (adding values for keys that appear in both)?
but they seem to work only when the dictionaries are of the form (word, count) whereas I want to carry the other fields in the dictionary as well.
I am open to all your ideas or reframings of the problem, since my goal is to run this program only once to obtain the merged list in a text file. Thank you in advance!
Here's something that does more or less what you want. Just change the file names at the top. It doesn't modify the original file.
input_file_name = "input.txt"
output_file_name = "output.txt"
def custom_comp(s1, s2):
word1 = s1.split()[0]
word2 = s2.split()[0]
stripped1 = word1.translate(None, "'")
stripped2 = word2.translate(None, "'")
if stripped1 > stripped2:
return 1
elif stripped1 < stripped2:
return -1
else:
if "'" in word1:
return -1
else:
return 1
def get_word(line):
return line.split()[0].translate(None, "'")
def get_num(line):
return int(line.split()[-1])
print "Reading file and sorting..."
lines = []
with open(input_file_name, 'r') as f:
for line in sorted(f, cmp=custom_comp):
lines.append(line)
print "File read and sorted"
combined_lines = []
print "Combining entries..."
i = 0
while i < len(lines) - 1:
if get_word(lines[i]) == get_word(lines[i+1]):
total = get_num(lines[i]) + get_num(lines[i+1])
new_parts = lines[i+1].split()
new_parts[-1] = str(total)
combined_lines.append(" ".join(new_parts))
i += 2
else:
combined_lines.append(lines[i].strip())
i += 1
print "Entries combined"
print "Writing to file..."
with open(output_file_name, 'w+') as f:
for line in combined_lines:
f.write(line + "\n")
print "Finished"
It sorts the words and messes up the spacing a bit. If that's important, let me know and it can be adjusted.
Another thing is that it sorts the whole thing. For only a million lines, that probably won't take overly long, but again, let me know if that's an issue.
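For comparison, the single-pass dictionary aggregation described in the question could be sketched roughly like this (untested; it keeps the extra fields of the first occurrence of each word, splits on whitespace, and reuses the placeholder file names from above):

# Rough sketch: aggregate frequencies by the apostrophe-stripped word form in one pass.
merged = {}
with open("input.txt") as f:                      # same placeholder file name as above
    for line in f:
        items = line.split()                      # use line.split("\t") if the fields are tab-separated
        if not items:
            continue
        key = items[0].replace("'", "")
        if key in merged:
            merged[key][-1] = str(int(merged[key][-1]) + int(items[-1]))
        else:
            items[0] = key
            merged[key] = items                   # the non-frequency fields of the first occurrence are kept
with open("output.txt", "w") as out:
    for items in merged.values():
        out.write(" ".join(items) + "\n")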
I have to search through a text file for anagrams of a given word. The text file has one word per line. So far I've managed to write a function that makes a dictionary from a given word, with each key being a letter in the word and its value being the number of times that letter occurs in the word. The second function loops through each line of the text file, creates a second dictionary with the same kind of keys and values, and compares the two. If the two are equal, the function will add that word to a list. Once the function finishes looping through the text file, it should print the list of anagrams, but it's printing an empty list. Here is my code; I have no clue where it's going wrong.
This is for creating the dictionary of the given word:
word= input("Enter a word: ")
letterdict = {}
def count_letters(word,letterdict):
for letter in word:
letterdict[letter] = letterdict.get(letter,0) + 1
return letterdict
print(count_letters(word,letterdict))
This is for looping through the text file and comparing:
def search():
    count_letters(word, letterdict)
    anagrams = []
    letterdict2 = {}
    f = open('EnglishWords.txt', 'r')
    for letter in f:
        letterdict2[letter] = letterdict2.get(letter, 0) + 1
        if letterdict == letterdict2:
            anagrams.append[f]
        letterdict2.clear()
    f.close()
    anagrams.sort()  # put list in alphabetical order
    return print(anagrams)

search()
Much faster algorithm (inside the loop, anyway): go through your entire dictionary just once, creating a new file with two words on each line; the first is the word with its letters alphabetized, and then the word itself, for example:
aaadkrrv aardvark
aabcsu abacus
. . .
Then, sort that file. Now, looking for all the anagrams of a word is a simple direct lookup into a sorted list.
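A minimal sketch of that idea, building the index in memory instead of writing a separate sorted file (EnglishWords.txt is the word list from the question, one word per line):

# Index: alphabetized letters -> list of words made of exactly those letters.
anagram_index = {}
with open('EnglishWords.txt') as f:
    for line in f:
        word = line.strip().lower()
        anagram_index.setdefault(''.join(sorted(word)), []).append(word)

# Looking up all anagrams of a word is now a single dictionary access.
query = input("Enter a word: ").lower()
print(sorted(anagram_index.get(''.join(sorted(query)), [])))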
This looks like a problem of not using the global keyword to access (and write to) the letterdict you create. Declare your variables locally and pass them through your program using function parameters. Python doesn't provide strong global variable support (it's there, but takes a lot of attention to detail to use).
Consider rewriting your functions:
def count_letters(word):
    letterdict = dict()
    for letter in word:
        letterdict[letter] = letterdict.get(letter, 0) + 1
    return letterdict

def search(word):
    letterdict = count_letters(word)
    anagrams = []
    letterdict2 = {}
    with open('EnglishWords.txt', 'r') as f:
        for line in f:
            candidate = line.strip()          # drop the trailing newline before counting letters
            for letter in candidate:
                letterdict2[letter] = letterdict2.get(letter, 0) + 1
            if letterdict == letterdict2:
                anagrams.append(candidate)
            letterdict2.clear()
    anagrams.sort()  # put list in alphabetical order
    return anagrams
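Hypothetical usage of the rewritten functions (note that the word itself will also appear in the result if it is present in EnglishWords.txt):

word = input("Enter a word: ")
print(search(word))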
A few notes about the original code:
return print(anagrams) returns None, because print() itself returns None; return the list and let the caller decide whether to print it.
for letter in f: iterates over the lines of the file, so letter is actually a whole line (including its trailing newline).
count_letters(word, letterdict) does nothing with the calculated value; the return value is discarded.
You may or may not want to include spaces and numbers in your letterdict.