Replace words of a long document in Python

I have a dictionary dict with some words (about 2000) and a huge text, like a Wikipedia corpus, in text format. For each word that is both in the dictionary and in the text file, I would like to replace it with word_1 (that is, append _1 to it).
with open("wiki.txt",'r') as original, open("new.txt",'w') as mod:
for line in original:
new_line = line
for word in line.split():
if (dict.get(word.lower()) is not None):
new_line = new_line.replace(word,word+"_1")
mod.write(new_line)
This code creates a new file called new.txt with the words that appear in the dictionary replaced as I want.
This works for short files, but for the longer ones I am using as input, it "freezes" my computer.
Is there a more efficient way to do that?
Edit for Adi219:
Your code seems to work, but there is a problem:
if a line is Albert is a friend of Albert and my dictionary contains Albert, then after the for loop the line will look like this: Albert_1_1 is a friend of Albert_1. How can I replace only the exact word I want, to avoid repetitions like _1_1_1_1?
Edit2:
To solve the previous problem, I changed your code:
with open("wiki.txt", "r") as original, open("new.txt", "w") as mod:
for line in original:
words = line.split()
for word in words:
if dict.get(word.lower()) is not None:
mod.write(word+"_1 ")
else:
mod.write(word+" ")
mod.write("\n")
Now everything should work
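For reference, a single regular-expression pass anchored at word boundaries also avoids the substring problem. This is only a sketch; word_dict stands in for the question's 2000-word dictionary:

import re

# Hypothetical stand-in for the question's dictionary.
word_dict = {"albert": 1}

# \b anchors matches at word boundaries, so "Albert" is tagged exactly once
# and never matched again inside "Albert_1".
pattern = re.compile(r'\b(' + '|'.join(map(re.escape, word_dict)) + r')\b',
                     re.IGNORECASE)

with open("wiki.txt", "r") as original, open("new.txt", "w") as mod:
    for line in original:
        mod.write(pattern.sub(lambda m: m.group(0) + "_1", line))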

A few things:
You could remove the declaration of new_line and change the new_line = new_line.replace(...) line to line = line.replace(...). You would then write(line) afterwards.
You could assign words = line.split() once and loop with for word in words:, which makes the loop a little clearer.
You could (manually(?)) split your large .txt file into multiple smaller files, run an instance of your program on each file, and then combine the multiple outputs into one file; see the multiprocessing sketch after the code below. Note: you would have to remember to change the filename each instance reads from and writes to.
So, your code would look like:
with open("wiki.txt", "r") as original, open("new.txt", "w") as mod:
for line in original:
words = line.split()
for word in words:
if dict.get(word.lower()) is not None:
line = line.replace(word, word + "_1")
mod.write(line)
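To parallelize across pre-split part files, a minimal multiprocessing sketch might look like this; the part-file names and word_dict are hypothetical stand-ins, not from the question:

from multiprocessing import Pool

# Hypothetical stand-in for the question's dictionary.
word_dict = {"albert": 1}

def tag_file(filenames):
    infile, outfile = filenames
    with open(infile, "r") as original, open(outfile, "w") as mod:
        for line in original:
            for word in line.split():
                if word_dict.get(word.lower()) is not None:
                    line = line.replace(word, word + "_1")
            mod.write(line)

if __name__ == "__main__":
    # Assumes wiki.txt was pre-split into wiki_part0.txt ... wiki_part3.txt.
    parts = [("wiki_part%d.txt" % i, "new_part%d.txt" % i) for i in range(4)]
    with Pool(4) as pool:
        pool.map(tag_file, parts)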

Related

save Python list to a txt file in nice order [duplicate]

I have a file that is like this:
word, number
word, number
[...]
and I want to take/keep just the words, again one word per line:
word
word
[...]
My code so far
f = open("new_file.txt", "w")
with open("initial_file.txt" , "r+") as l:
for line in l:
word = line.split(", ")[0]
f.write(word)
print word # debugging purposes
gives me all the words in one line in the new file
wordwordwordword[...]
What is the most pythonic and optimized way to do this?
I tried to use f.write("\n".join(word)) but what I got was
wordw
ordw
[...]
You can just use f.write(str(word) + "\n") to do this; str makes sure the value is a string before "\n" is concatenated. Your "\n".join(word) attempt iterates over word character by character, which is why the newlines landed inside the words.
If you open the file in binary mode on Windows, use "\r\n" instead; in text mode, "\n" is translated to the platform's line ending automatically.
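A minimal corrected version of the whole loop (a sketch, keeping the question's file names):

with open("initial_file.txt", "r") as src, open("new_file.txt", "w") as dst:
    for line in src:
        word = line.split(", ")[0].strip()
        dst.write(word + "\n")  # one word per line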

Faster method for replacing multiple words in a file

I am making a mini-translator for Japanese words for a given file.
The script has an expandable dictionary file that includes 13k+ lines in this format:
JapaneseWord<:to:>EnglishWord
So I have to pick a line from the dictionary, then do a .strip('\n') and a .split('<:to:>') to make a list in this format:
[JapaneseWord, EnglishWord]
Then I have to pick a line from the given file, find the first item of this list in the line, and replace it with its English equivalent, repeating the replacement for the number of times the Japanese word appears in the line (found with the .count() function).
The problem is that this takes a long time, because I have to scan the file again and again, 14k+ times, and this will get worse as I expand the dictionary.
I tried looking for a way to load the whole dictionary into memory and compare all its entries against the given file at once, so that the file would only be read one time, but I couldn't do it.
Here's the function I am using right now; it takes a variable containing the file's lines as a list (from the file.readlines() function):
def replacer(text):
    # Current Dictionary.
    cdic = open(argv[4], 'r', encoding='utf-8')
    # Part To Replace.
    for ptorep in cdic:
        ptorep = ptorep.strip('\n')
        ptorep = ptorep.split('<:to:>')
        for line in text:
            for clone in range(0, line.count(ptorep[0])):
                line = line.replace(ptorep[0], ptorep[1])
    text = ''.join(text)
    return text
This takes around 1 min for a single small file.
Dictionary Method:
import re

with open(argv[4], 'r', encoding='utf-8') as file:
    translations = [line.strip('\n').split('<:to:>') for line in file.readlines()]
# Convert to a dictionary where the key is the Japanese word and the value is its English translation
translations = {t[0]: t[1] for t in translations}

output = []
# `text` holds the file's contents as a single string.
for word in re.split(r'(\W+)', text):  # split into words, keeping the separators (may require tweaking)
    output.append(translations.get(word, word))  # look up `word`; if it has no entry, keep it unchanged
output = ''.join(output)
Original Method:
Maybe keep the full dictionary in memory as a list:
cdic = open(argv[4], 'r', encoding='utf-8')
translations = []
for line in cdic.readlines():
    translations.append(line.strip('\n').split('<:to:>'))

# Note: I would use a list comprehension for this
with open(argv[4], 'r', encoding='utf-8') as file:
    translations = [line.strip('\n').split('<:to:>') for line in file.readlines()]
And make the replacements off of that:
def replacer(text, translations):
    for entry in translations:
        text = text.replace(entry[0], entry[1])
    return text
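Hypothetical usage, assuming the file to translate has been read into a single string (input.txt is a stand-in name, not from the question):

with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

translated = replacer(text, translations)  # the file is read once; each dictionary entry is one str.replace pass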

How to append lines from file into a list while keeping the number of lines - python 3

I am trying to stem words from a file that contains about 90000 lines (each line has three to several hundred words).
I want to append the lines to a list after stemming the words. I was able to insert the stemmed words into a list, but the list holds everything as one line. I want to insert the words into the list while keeping the 90000 lines. Any ideas?
clean_sentence = []
with open(folder_path + text_file_name, 'r', encoding='utf-8') as f:
    for line in f:
        sentence = line.split()
        for word in sentence:
            if word.endswith('er'):
                clean_sentence.append(word[:-2])
            else:
                clean_sentence.append(word)
x = ' '.join(clean_sentence)
with open('StemmingOutFile.txt', 'w', encoding="utf8") as StemmingOutFile:
    StemmingOutFile.write(x)
The file is not in English, but here is an example that illustrates the issue. The current code yields:
why don't you like to watch TV? are there any more fruits? why not?
I want the output file to be:
why don't you like to watch TV?
are there any more fruits?
why not?
Read the file in lines:
with open('file.txt', 'r') as f:
    lines = f.read().splitlines()
and then do the stemming:
new_lines = []
for line in lines:
    new_lines.append(' '.join(stemmed(word) for word in line.split()))
where stemmed is a function as follows:
def stemmed(word):
    return word[:-2] if word.endswith('er') else word
Then write each line of new_lines to StemmingOutFile.txt.
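A minimal sketch of that last step, reusing the question's output file name:

with open('StemmingOutFile.txt', 'w', encoding='utf8') as StemmingOutFile:
    StemmingOutFile.write('\n'.join(new_lines))  # one stemmed line per original line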

Python Unique words in text file

I need to insert unique words into a text file, one word per line, with each word unique in the whole file.
Whenever a new word arrives in the variable "word", I need a way to check whether it already exists in the file. If it does, the program should pick another word and check again, repeating until it finds a word that is not yet in the file.
How can I do that?
By the way, I was doing:
newword = "learn"
f = open('wordlist.txt')
s = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
while s.find(newword) != -1:
return newword
else:
return newword
Thanks in advance.
Assuming the file isn't too large, you could just read it all to memory, and check if the word is there:
def word_in_file(filename, word):
    with open(filename) as f:
        words = f.read().splitlines()
    return word in words
If you're going to do this more than once, you'd better keep the words list and append to it each word you add to the file instead of reading it multiple times.
Additionally, creating a set from words should improve the search time and may be worth the "effort" if you're adding multiple words to the file.
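A sketch of that repeated-use variant: keep the words in a set in memory and append each genuinely new word to the file. Here pick_word is a hypothetical function that produces candidate words:

with open('wordlist.txt') as f:
    seen = set(f.read().splitlines())  # a set gives O(1) membership tests

def add_unique_word(pick_word):
    word = pick_word()
    while word in seen:  # keep picking until the word is unseen
        word = pick_word()
    seen.add(word)
    with open('wordlist.txt', 'a') as f:
        f.write(word + '\n')  # one word per line, as in the question
    return word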

compare two files and find matching words in python

I have two files: the first one contains terms and their frequencies:
table 2
apple 4
pencil 89
The second file is a dictionary:
abroad
apple
bread
...
I want to check whether the first file contains any words from the second file. For example, both the first file and the second file contain "apple".
I am new to python.
I tried something but it does not work. Could you help me? Thank you.
for line in dictionary:
    words = line.split()
    print words[0]
for line2 in test:
    words2 = line2.split()
    print words2[0]
Something like this:
with open("file1") as f1,open("file2") as f2:
words=set(line.strip() for line in f1) #create a set of words from dictionary file
#why sets? sets provide an O(1) lookup, so overall complexity is O(N)
#now loop over each line of other file (word, freq file)
for line in f2:
word,freq=line.split() #fetch word,freq
if word in words: #if word is found in words set then print it
print word
output:
apple
It may help you:
file1 = set(line.split()[0] for line in open('file1.txt') if line.strip())  # keep only the term, not its frequency
file2 = set(line.strip() for line in open('file2.txt'))
for line in file1 & file2:
    if line:
        print line
Here's what you should do:
First, you need to put all the dictionary words somewhere you can look them up easily. If you don't do that, you would have to read the whole dictionary file every time you want to check a single word from the other file.
Second, you need to check whether each word in the file appears among the words you extracted from the dictionary file.
For the first part, you can use either a list or a set. The difference between the two is that a list keeps the order you put the items in, while a set is unordered, so it doesn't matter which word you read first from the dictionary file. A set is also faster for looking up an item, because that is what it is for.
To see if an item is in a set, you can do: item in my_set, which is either True or False.
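A minimal sketch of that approach (the file names dictionary.txt and terms.txt are hypothetical stand-ins):

with open('dictionary.txt') as f:
    dictionary_words = set(line.strip() for line in f)  # O(1) membership tests

with open('terms.txt') as f:
    for line in f:
        word = line.split()[0]  # the term; its frequency is ignored
        if word in dictionary_words:
            print(word)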
I have your first double list in try.txt and the single list in try_match.txt
f = open('try.txt', 'r')
f_match = open('try_match.txt', 'r')
print f
dictionary = []
for line in f:
    a, b = line.split()
    dictionary.append(a)
for line in f_match:
    if line.split()[0] in dictionary:
        print line.split()[0]
