Faster method for replacing multiple words in a file - python

I am making a mini-translator for Japanese words for a given file.
The script has an expandable dictionary file with 13k+ lines in this format:
JapaneseWord<:to:>EnglishWord
For each dictionary line, I strip the newline and split on '<:to:>' to get a list in this format:
[JapaneseWord, EnglishWord]
Then I pick a line from the given file, find the first item of that list in the line, and replace it with its English equivalent. I repeat this in the same line as many times as the Japanese word appears, using the .count() function.
The problem is that this takes a long time, because I end up scanning the file once per dictionary entry, and it only gets worse as the dictionary grows.
I tried to find a way to load the whole dictionary into memory and compare all of its entries against the file in a single pass, so the file only has to be read once, but I couldn't get it to work.
Here's the function I am using right now. It takes a variable holding the file's lines as a list, produced by file.readlines():
def replacer(text):
    # Current Dictionary.
    cdic = open(argv[4], 'r', encoding='utf-8')
    # Part To Replace.
    for ptorep in cdic:
        ptorep = ptorep.strip('\n')
        ptorep = ptorep.split('<:to:>')
        for i, line in enumerate(text):
            for clone in range(0, line.count(ptorep[0])):
                line = line.replace(ptorep[0], ptorep[1])
            text[i] = line
    text = ''.join(text)
    return text
This takes around 1 min for a single small file.

Dictionary Method:
import re
from sys import argv  # argv[4] is the dictionary file, as in the question's script

with open(argv[4], 'r', encoding='utf-8') as file:
    translations = [line.strip('\n').split('<:to:>') for line in file.readlines()]
translations = {t[0]: t[1] for t in translations}  # Convert to a dictionary where the key is the Japanese word and the value is its English translation

# `text` is the file to translate, read as one string (e.g. text = open(...).read())
output = []
for word in re.split(r'(\W+)', text):  # Split into words, keeping the separators so the join below rebuilds the text (may require tweaking)
    output.append(translations.get(word, word))  # Look up `word`; if it is not in the dictionary, keep it unchanged
output = ''.join(output)
Original Method:
Maybe keep the full dictionary in memory as a list:
cdic = open(argv[4], 'r', encoding='utf-8')
translations = []
for line in cdic.readlines():
    translations.append(line.strip('\n').split('<:to:>'))

# Note: I would use a list comprehension for this
with open(argv[4], 'r', encoding='utf-8') as file:
    translations = [line.strip('\n').split('<:to:>') for line in file.readlines()]
And make the replacements off of that:
def replacer(text, translations):
    for entry in translations:
        text = text.replace(entry[0], entry[1])
    return text
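If the goal is a single pass over the text, one further option (a sketch, not taken from the answers above) is to combine the in-memory {japanese: english} mapping from the Dictionary Method with one compiled regex that matches any key and replaces it via a dictionary lookup:

import re

def build_pattern(translations):
    # Longest keys first, so longer Japanese words are not shadowed by their prefixes.
    keys = sorted(translations, key=len, reverse=True)
    return re.compile('|'.join(map(re.escape, keys)))

def replace_all(text, translations):
    pattern = build_pattern(translations)
    # One scan of the text; every match is looked up in the dict once.
    return pattern.sub(lambda m: translations[m.group(0)], text)

The alternation is large with 13k+ entries, but it is compiled once, so each file then costs a single scan instead of one scan per dictionary line.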


File Names Chain in python

I CANNOT USE ANY IMPORTED LIBRARY. I have this task where I have some directories containing some files; every file contains, besides some words, the name of the next file to be opened in its first line. Once every word of every file in a directory has been read, they have to be combined into a single string: its first character is the most frequent first letter of all the words seen, its second character is the most frequent second letter, and so on. I have managed to do this with a directory containing 3 files, but it's not using any chain-like mechanism, rather a passing of local variables. Some of my college colleagues suggested I use slicing of lists, but I can't figure out how. I CANNOT USE ANY IMPORTED LIBRARY.
This is what I got:
'''
The objective of the homework assignment is to design and implement a function
that reads some strings contained in a series of files and generates a new
string from all the strings read.
The strings to be read are contained in several files, linked together to
form a closed chain. The first string in each file is the name of another
file that belongs to the chain: starting from any file and following the
chain, you always return to the starting file.
Example: the first line of file "A.txt" is "B.txt," the first line of file
"B.txt" is "C.txt," and the first line of "C.txt" is "A.txt," forming the
chain "A.txt"-"B.txt"-"C.txt".
In addition to the string with the name of the next file, each file also
contains other strings separated by spaces, tabs, or carriage return
characters. The function must read all the strings in the files in the chain
and construct the string obtained by concatenating the characters with the
highest frequency in each position. That is, in the string to be constructed,
at position p, there will be the character with the highest frequency at
position p of each string read from the files. In the case where there are
multiple characters with the same frequency, consider the alphabetical order.
The generated string has a length equal to the maximum length of the strings
read from the files.
Therefore, you must write a function that takes as input a string "filename"
representing the name of a file and returns a string.
The function must construct the string according to the directions outlined
above and return the constructed string.
Example: if the contents of the three files A.txt, B.txt, and C.txt in the
directory test01 are as follows
test01/A.txt test01/B.txt test01/C.txt
-------------------------------------------------------------------------------
test01/B.txt test01/C.txt test01/A.txt
house home kite
garden park hello
kitchen affair portrait
balloon angel
surfing
the function most_frequent_chars ("test01/A.txt") will return "hareennt".
'''
def file_names_list(filename):
    intermezzo = []
    lista_file = []
    a_file = open(filename)
    lines = a_file.readlines()
    for line in lines:
        intermezzo.extend(line.split())
        del intermezzo[1:]
        lista_file.append(intermezzo[0])
        intermezzo.pop(0)
    return lista_file

def words_list(filename):
    lista_file = []
    a_file = open(filename)
    lines = a_file.readlines()[1:]
    for line in lines:
        lista_file.extend(line.split())
    return lista_file

def stuff_list(filename):
    file_list = file_names_list(filename)
    the_rest = words_list(filename)
    second_file_name = file_names_list(file_list[0])
    the_lists = words_list(file_list[0]) and words_list(second_file_name[0])
    the_rest += the_lists[0:]
    return the_rest

def most_frequent_chars(filename):
    huge_words_list = stuff_list(filename)
    maxOccurs = ""
    list_of_chars = []
    for i in range(len(max(huge_words_list, key=len))):
        for item in huge_words_list:
            try:
                list_of_chars.append(item[i])
            except IndexError:
                pass
        maxOccurs += max(sorted(set(list_of_chars)), key=list_of_chars.count)
        list_of_chars.clear()
    return maxOccurs

print(most_frequent_chars("test01/A.txt"))
This assignment is relatively easy, if the code has a good structure. Here is a full implementation:
def read_file(fname):
    with open(fname, 'r') as f:
        return list(filter(None, [y.rstrip(' \n').lstrip(' ') for x in f for y in x.split()]))

def read_chain(fname):
    seen = set()
    new = fname
    result = []
    while not new in seen:
        A = read_file(new)
        seen.add(new)
        new, words = A[0], A[1:]
        result.extend(words)
    return result

def most_frequent_chars(fname):
    all_words = read_chain(fname)
    result = []
    for i in range(max(map(len, all_words))):
        chars = [word[i] for word in all_words if i < len(word)]
        result.append(max(sorted(set(chars)), key=chars.count))
    return ''.join(result)

print(most_frequent_chars("test01/A.txt"))
# output: "hareennt"
In the code above, we define 3 functions:
read_file: simple function to read the contents of a file and return a list of strings. The command x.split() takes care of any spaces or tabs used to separate words. The final command list(filter(None, arr)) makes sure that empty strings are erased from the solution.
read_chain: Simple routine to iterate through the chain of files, and return all the words contained in them.
most_frequent_chars: Easy routine, where the most frequent characters are counted carefully.
PS. This line of code you had is very interesting:
maxOccurs += max(sorted(set(list_of_chars)), key = list_of_chars.count)
I edited my code to include it.
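A tiny toy illustration of why the sorted() matters there (not from the original post): when two characters are tied on count, max() keeps the first maximal element it sees, so sorting the set first makes the tie-break alphabetical.

chars = ['b', 'a', 'b', 'a', 'c']                 # 'a' and 'b' both occur twice
print(max(sorted(set(chars)), key=chars.count))   # -> 'a', the alphabetically smaller of the tied pair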
Space complexity optimization
The space complexity of the previous code can be improved by orders of magnitude, if the files are scanned without storing all the words:
def scan_file(fname, database):
    with open(fname, 'r') as f:
        next_file = None
        for x in f:
            for y in x.split():
                if next_file is None:
                    next_file = y
                else:
                    for i, c in enumerate(y):
                        while len(database) <= i:
                            database.append({})
                        if c in database[i]:
                            database[i][c] += 1
                        else:
                            database[i][c] = 1
    return next_file

def most_frequent_chars(fname):
    database = []
    seen = set()
    new = fname
    while not new in seen:
        seen.add(new)
        new = scan_file(new, database)
    return ''.join(max(sorted(d.keys()), key=d.get) for d in database)

print(most_frequent_chars("test01/A.txt"))
# output: "hareennt"
Now we scan the files tracking the frequency of the characters in database, without storing intermediate arrays.
Ok, here's my solution:
def parsi_file(filename):
    visited_files = set()
    words_list = []

    # Getting words from all files
    while filename not in visited_files:
        visited_files.add(filename)
        with open(filename) as f:
            filename = f.readline().strip()
            words_list += [line.strip() for line in f.readlines()]

    # Creating dictionaries of letters:count for each index
    letters_dicts = []
    for word in words_list:
        for i in range(len(word)):
            if i > len(letters_dicts) - 1:
                letters_dicts.append({})
            letter = word[i]
            if letters_dicts[i].get(letter):
                letters_dicts[i][letter] += 1
            else:
                letters_dicts[i][letter] = 1

    # Sorting dicts and getting the "best" letter
    code = ""
    for dic in letters_dicts:
        sorted_letters = sorted(dic, key=lambda letter: (-dic[letter], letter))
        code += sorted_letters[0]
    return code
We first get the words_list from all files.
Then, for each index, we create a dictionary of the letters in all words at that index, with their count.
Now we sort the dictionary keys by descending count (-count) then by alphabetical order.
Finally we get the first letter (thus the one with the max count) and add it to the "code" word for this test battery.
Edit: in terms of efficiency, parsing through all words for each index will get worse as the number of words grows, so it would be better to tweak the code to simultaneously create the dictionaries for each index and parse through the list of words only once. Done.

Take tokens from a text file, calculate their frequency, and return them in a new text file in Python

After a long time researching and asking friends, I am still a dumb-dumb and don't know how to solve this.
So, for homework, we are supposed to define a function which accesses two files, the first of which is a text file with the following sentence, from which we are to calculate the word frequencies:
In a Berlin divided by the Berlin Wall , two angels , Damiel and Cassiel , watch the city , unseen and unheard by its human inhabitants .
We are also to include commas and periods: each single item has already been tokenised (individual items are surrounded by whitespaces - including the commas and periods). Then, the word frequencies must be entered into a new txt-file as "word:count", and in the order in which the words appear, i.e.:
In:1
a:1
Berlin:2
divided:1
etc.
I have tried the following:
def find_token_frequency(x, y):
    with open(x, encoding='utf-8') as fobj_1:
        with open(y, 'w', encoding='utf-8') as fobj_2:
            fobj_1list = fobj_1.split()
            unique_string = []
            for i in fobj_1list:
                if i not in unique_string:
                    unique_string.append(i)
            for i in range(0, len(unique_string)):
                fobj_2.write("{}: {}".format(unique_string[i], fobj_1list.count(unique_string[i])))
I am not sure I need to actually use .split() at all, but I don't know what else to do, and it does not work anyway, since it tells me I cannot split that object.
I am told:
Traceback (most recent call last):
[...]
fobj_1list = fobj_1.split()
AttributeError: '_io.TextIOWrapper' object has no attribute 'split'
When I remove the .split(), the displayed error is:
fobj_2.write("{}: {}".format(unique_string[i], fobj_1list.count(unique_string[i])))
AttributeError: '_io.TextIOWrapper' object has no attribute 'count'
Let's divide your problem into smaller problems so we can more easily solve this.
First we need to read a file, so let's do so and save it into a variable:
with open("myfile.txt") as fobj_1:
sentences = fobj_1.read()
Ok, so now we have your file as a string stored in sentences. Let's turn it into a list and count the occurrence of each word:
words = sentences.split(" ")
frequency = {word: words.count(word) for word in set(words)}
Here frequency is a dictionary where each word in the sentences is a key with the value being how many times they appear on the sentence. Note the usage of set(words). A set does not have repeated elements, that's why we are iterating over the set of words and not the word list.
Finally, we can save the word frequencies into a file
with open("results.txt", 'w') as fobj_2:
for word in frequency: fobj_2.write(f"{word}:{frequency[word]}\n")
Here we use f strings to format each line into the desired output. Note that f-strings are available for python3.6+.
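One caveat worth flagging: the assignment asks for the words in the order they first appear, and iterating over set(words) does not guarantee that order. A minimal order-preserving variant (just a sketch; it relies on plain dicts keeping insertion order, which they do in Python 3.7+):

frequency = {}
for word in words:
    frequency[word] = frequency.get(word, 0) + 1   # first occurrence fixes the word's position

with open("results.txt", 'w') as fobj_2:
    for word, count in frequency.items():
        fobj_2.write(f"{word}:{count}\n")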
I'm unable to comment as I don't have the required reputation, but the reason split() isn't working is because you're calling it on the file object itself, not a string. Try calling:
fobj_1list = fobj_1.readline().split()
instead. Also, when I ran this locally, I got an error saying that TypeError: 'encoding' is an invalid keyword argument for this function. You may want to remove the encoding argument from your function calls.
I think that should be enough to get you going.
The following script should do what you want.
#!/usr/local/bin/python3

def find_token_frequency(inputFileName, outputFileName):
    # wordOrderList to maintain order
    # dict to keep track of count
    wordOrderList = []
    wordCountDict = dict()

    # read the file
    inputFile = open(inputFileName, encoding='utf-8')
    lines = inputFile.readlines()
    inputFile.close()

    # iterate over all lines in the file
    for line in lines:
        # and split them into words
        words = line.split()
        # now, iterate over all words
        for word in words:
            # and add them to the list and dict
            if word not in wordOrderList:
                wordOrderList.append(word)
                wordCountDict[word] = 1
            else:
                # or increment their count
                wordCountDict[word] = wordCountDict[word] + 1

    # store result in outputFile
    outputFile = open(outputFileName, 'w', encoding='utf-8')
    for index in range(0, len(wordOrderList)):
        word = wordOrderList[index]
        outputFile.write(f'{word}:{wordCountDict[word]}\n')
    outputFile.close()

find_token_frequency("input.txt", "output.txt")
I changed your variable names a bit to make the code more readable.

Splitting / Slicing Text File with Python

I'm learning Python. I've been trying to split this txt file into multiple files, grouped by a sliced string at the beginning of each line.
Currently I have two issues:
1 - The string can have 5 or 6 chars and is marked by a space at the end (as in WSON33 and JHSF3 etc.).
Here is an example of the file I would like to split (the first line is a header):
H24/06/202000003TORDISTD
BWSON33 0803805000000000016400000003250C000002980002415324C1 0000000000000000
BJHSF3 0804608800000000003500000000715V000020280000031810C1 0000000000000000
2 - I've come up with a lot of code, but I'm not able to put everything together so this can work:
This code here I adapted from another post, and it kind of works at breaking the file into multiple files, but it requires sorting the lines before I start writing files. I also need to copy the header into each file instead of isolating it in its own file.
import itertools

with open('tordist.txt', 'r') as fin:
    # group each line in input file by first part of split
    for i, (k, g) in enumerate(itertools.groupby(fin, lambda l: l.split()[0]), 1):
        # create file to write to suffixed with group number - start = 1
        with open('{0} tordist.txt'.format(i), 'w') as fout:
            # for each line in group write it to file
            for line in g:
                fout.write(line.strip() + '\n')
So from what I can gather, you have a text file with many lines, where every line begins with a short string of 5 or six characters. It sounds like you want all the lines that begin with the same string to go into the same file, so that after the code is run you have as many new files as there are unique starting strings. Is that accurate?
Like you, I'm fairly new to Python, so I'm sure there are more compact ways to do this. The code below loops through the file a number of times and makes the new files in the same folder as your text and Python files.
# code which separates lines in a file by an identifier,
# and makes new files for each identifier group

filename = input('type filename')
if len(filename) < 1:
    filename = "mk_newfiles.txt"
filehandle = open(filename)

# This chunk loops through the file, looking at the beginning of each line,
# and adding it to a list of identifiers if it is not on the list already.
Unique = list()
for line in filehandle:
    # like Lalit said, split is a simple way to separate a longer string
    line = line.split()
    if line[0] not in Unique:
        Unique.append(line[0])

# For each item in the list of identifiers, this code goes through
# the file, and if a line starts with that identifier then it is
# added to a new file.
for item in Unique:
    # this 'if' skips the header, which has a '/' in it
    if '/' not in item:
        # the .seek(0) 'rewinds' the iteration variable, which is apparently
        # needed if looping through files multiple times
        filehandle.seek(0)
        # makes new file
        newfile = open(str(item) + ".txt", "w+")
        # inserts header, and goes to next line
        newfile.write(Unique[0])
        newfile.write('\n')
        # goes through old file, and adds relevant lines to new file
        for line in filehandle:
            split_line = line.split()
            if item == split_line[0]:
                newfile.write(line)
print(Unique)
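As a possible refinement (a sketch only, not part of the answer above): instead of re-reading the input once per identifier, you can keep one output handle per identifier in a dict and write each line as you meet it, copying the header into every new file:

handles = {}
with open('tordist.txt') as fin:
    header = next(fin)             # first line is the header
    for line in fin:
        key = line.split()[0]      # same grouping key used above
        if key not in handles:
            handles[key] = open(key + '.txt', 'w')
            handles[key].write(header)  # every output file gets a copy of the header
        handles[key].write(line)
for f in handles.values():
    f.close()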

Replace words of a long document in Python

I have a dictionary dict with some words (2000) and a huge text, like a Wikipedia corpus, in text format. For each word that is both in the dictionary and in the text file, I would like to replace it with word_1.
with open("wiki.txt",'r') as original, open("new.txt",'w') as mod:
for line in original:
new_line = line
for word in line.split():
if (dict.get(word.lower()) is not None):
new_line = new_line.replace(word,word+"_1")
mod.write(new_line)
This code creates a new file called new.txt with the words that appear in the dictionary replaced as I want.
This works for short files, but for the longer that I am using as input, it "freezes" my computer.
Is there a more efficient way to do that?
Edit for Adi219:
Your code seems to work, but there is a problem:
if a line is like this: Albert is a friend of Albert, and my dictionary contains Albert, then after the for loop the line will look like this: Albert_1_1 is a friend of Albert_1. How can I replace only the exact word I want, to avoid repetitions like _1_1_1_1?
Edit2:
To solve the previous problem, I changed your code:
with open("wiki.txt", "r") as original, open("new.txt", "w") as mod:
for line in original:
words = line.split()
for word in words:
if dict.get(word.lower()) is not None:
mod.write(word+"_1 ")
else:
mod.write(word+" ")
mod.write("\n")
Now everything should work
A few things:
You could remove the declaration of new_line. Then change the new_line = new_line.replace(...) line to line = line.replace(...). You would also have to write(line) afterwards.
You could add words = line.split() and use for word in words: for the for loop, as this removes a call to .split() for every iteration through the words.
You could (manually(?)) split your large .txt file into multiple smaller files and have multiple instances of your program running on each file, and then you could combine the multiple outputs into one file. Note: You would have to remember to change the filename for each file you're reading/writing to.
So, your code would look like:
with open("wiki.txt", "r") as original, open("new.txt", "w") as mod:
for line in original:
words = line.split()
for word in words:
if dict.get(word.lower()) is not None:
line = line.replace(word, word + "_1")
mod.write(line)
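Regarding the _1_1 issue raised in the edit: str.replace matches substrings, so a word that has already been tagged can be matched again. One way around that (a sketch, not from the answers; translations here stands in for the question's dict of 2000 words, with lower-cased keys assumed) is a whole-word regex substitution:

import re

word_re = re.compile(r'\b\w+\b')

def tag_known_words(line, translations):
    # Each word is matched exactly once at its own position, and \w includes "_",
    # so an already-tagged "Albert_1" is seen as one token and never re-tagged.
    return word_re.sub(
        lambda m: m.group(0) + "_1" if m.group(0).lower() in translations else m.group(0),
        line)

# translations: the question's dictionary of words (assumed defined elsewhere)
with open("wiki.txt", "r") as original, open("new.txt", "w") as mod:
    for line in original:
        mod.write(tag_known_words(line, translations))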

Referring to a list of names using Python

I am new to Python, so please bear with me.
I can't get this little script to work properly:
genome = open('refT.txt','r')
datafile - a reference genome with a bunch (2 million) of contigs:
Contig_01
TGCAGGTAAAAAACTGTCACCTGCTGGT
Contig_02
TGCAGGTCTTCCCACTTTATGATCCCTTA
Contig_03
TGCAGTGTGTCACTGGCCAAGCCCAGCGC
Contig_04
TGCAGTGAGCAGACCCCAAAGGGAACCAT
Contig_05
TGCAGTAAGGGTAAGATTTGCTTGACCTA
The file is opened:
cont_list = open('dataT.txt','r')
a list of contigs that I want to extract from the dataset listed above:
Contig_01
Contig_02
Contig_03
Contig_05
My hopeless script:
for line in cont_list:
    if genome.readline() not in line:
        continue
    else:
        a = genome.readline()
        s = line + a
        data_out = open('output.txt', 'a')
        data_out.write("%s" % s)
        data_out.close()
input('Press ENTER to exit')
The script successfully writes the first three contigs to the output file, but for some reason it doesn't seem able to skip "Contig_04", which is not in the list, and move on to "Contig_05".
I might seem a lazy bastard for posting this, but I've spent all afternoon on this tiny bit of code -_-
I would first try to generate an iterable which gives you a tuple: (contig, genome):
def pair(file_obj):
    for line in file_obj:
        yield line, next(file_obj)
Now, I would use that to get the desired elements:
wanted = {'Contig_01', 'Contig_02', 'Contig_03', 'Contig_05'}
with open('filename') as fin:
    pairs = pair(fin)
    while wanted:
        p = next(pairs)
        if p[0].strip() in wanted:
            # write to output file, store in a list, or dict, ...
            wanted.discard(p[0].strip())
I would recommend several things:
Try using with open(filename, 'r') as f instead of f = open(...)/f.close(). with will handle the closing for you. It also encourages you to handle all of your file IO in one place.
Try to read in all the contigs you want into a list or other structure. It is a pain to have many files open at once. Read all the lines at once and store them.
Here's some example code that might do what you're looking for
from itertools import zip_longest  # izip_longest if you are on Python 2

# Read in contigs from file and store in list
contigs = []
with open('dataT.txt', 'r') as contigfile:
    for line in contigfile:
        contigs.append(line.rstrip())  # rstrip() removes '\n' from EOL

# Read through genome file, open up an output file
with open('refT.txt', 'r') as genomefile, open('out.txt', 'w') as outfile:
    # Nifty way to sort through fasta files 2 lines at a time
    for name, seq in zip_longest(*[genomefile] * 2):
        # compare the contig name to your list of contigs
        if name.rstrip() in contigs:
            outfile.write(name)  # optional. remove if you only want the seq
            outfile.write(seq)
Here's a pretty compact approach to get the sequences you'd like.
def get_sequences(data_file, valid_contigs):
    sequences = []
    with open(data_file) as cont_list:
        for line in cont_list:
            if line.startswith(valid_contigs):
                sequence = next(cont_list).strip()
                sequences.append(sequence)
    return sequences

if __name__ == '__main__':
    valid_contigs = ('Contig_01', 'Contig_02', 'Contig_03', 'Contig_05')
    sequences = get_sequences('dataT.txt', valid_contigs)
    print(sequences)
This utilizes the ability of startswith() to accept a tuple as a parameter and check for any match. If the line matches what you want (a desired contig), it will grab the next line and append it to sequences after stripping out the unwanted whitespace characters.
From there, writing the sequences grabbed to an output file is pretty straightforward.
Example output:
['TGCAGGTAAAAAACTGTCACCTGCTGGT',
'TGCAGGTCTTCCCACTTTATGATCCCTTA',
'TGCAGTGTGTCACTGGCCAAGCCCAGCGC',
'TGCAGTAAGGGTAAGATTTGCTTGACCTA']
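As noted above, writing the grabbed sequences to an output file is straightforward; for example (the output filename here is just a placeholder):

# write one sequence per line to an output file
with open('sequences.txt', 'w') as outfile:
    outfile.write('\n'.join(sequences) + '\n')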
