Removing extra words in a file using Python

Hi, I am learning Python and, out of curiosity, I have written a program to remove the extra words in a file.
I am comparing the text in the files 'text1.txt' and 'text2.txt', and based upon the text in text1, I am removing the words which are extra in text2.
#!/bin/python
text1 = open('text1.txt', 'r')
text2 = open('text2.txt', 'r')
t_l1 = text1.readlines()
t_l2 = text2.readlines()

# Printing to check if the file 1 contents were read properly.
print 'Printing the file 1 contents:'
w_t1 = []
for i in range(len(t_l1)):
    w_t1.extend(t_l1[i].split(' '))
for j in range(len(w_t1)):
    print w_t1[j]

# Printing to see if the file 2 contents were read properly.
print 'File 2 contents:'
w_t2 = []
for i in range(len(t_l2)):
    w_t2.extend(t_l2[i].split(' '))
for j in range(len(w_t2)):
    print w_t2[j]

print 'Comparing and deleting the excess words.'
# I put all the words of file1 in list w_t1 and of file2 in list w_t2. Now I
# check whether each word in w_t1 is the same as the word in the same place
# of w_t2; if not, I delete that word from w_t2 and re-check the same position.
w = []  # collects the extra words
i = 1
while i <= len(w_t1):
    if w_t1[i-1] == w_t2[i-1]:
        print w_t1[i-1]
        i += 1
    else:
        w.append(str(w_t2[i-1]))
        del w_t2[i-1]  # delete by index; remove() would hit the first equal value

print 'The extra words are: ' + str(w) + '\n'
print 'The original words are: ' + str(w_t2) + '\n'
print 'The extra values are: '
for item in w:
    print item

# Opening the file out.txt to write the output.
out = open('out.txt', 'w')
out.write(str(w))

# Closing the files.
text1.close()
text2.close()
out.close()
Say text1.txt has the words "Happy birthday dear Friend"
and text2.txt has the words "Happy claps birthday to you my dear Best Friend".
The program should give out the extra words in text2.txt, which are "claps, to, you, my, Best".
The above program works fine, but what if I have to do this for a file containing millions of words, or a million lines? Checking each and every word doesn't seem like a good idea. Do we have any predefined Python functions for that?
P.S.: Kindly bear with me if this is a wrong question; I am learning Python. Very soon I'll stop asking these.

This looks like a set problem. First, add your words to a set structure:

textSet1 = set()
with open('text1.txt', 'r') as text1:
    for line in text1:
        for word in line.split():  # split() also discards the trailing newline
            textSet1.add(word)

textSet2 = set()
with open('text2.txt', 'r') as text2:
    for line in text2:
        for word in line.split():
            textSet2.add(word)
Then simply apply the set difference operator:

textSet2.difference(textSet1)

which gives you this result:

set(['claps', 'to', 'you', 'my', 'Best'])
You can obtain a list from the previous structure in this way:

list(textSet2.difference(textSet1))
['claps', 'to', 'you', 'my', 'Best']
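As an aside (not part of the original answer), the same difference can also be written with the binary operator form:

textSet2 - textSet1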
Then, as you can read here, you shouldn't worry about large file sizes, because with the given loader:

"When the next line is read, the previous one will be garbage collected unless you have stored a reference to it somewhere else."

More about lazy file loading here.
Finally, in a real problem I suppose there is a first set (the reference words) with a relatively small size and a second file with a huge amount of data. If this is the case, then you can avoid the creation of the second set and collect the extra words as you stream the file:

diff = []
with open('text2.txt', 'r') as text2:
    for line in text2:
        for word in line.split():
            if word not in textSet1:
                diff.append(word)
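For what it's worth, the same two sets can also be built more compactly with set comprehensions (a sketch equivalent to the loops above):

with open('text1.txt') as text1:
    textSet1 = {word for line in text1 for word in line.split()}

with open('text2.txt') as text2:
    textSet2 = {word for line in text2 for word in line.split()}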

Related

Number of replacements performed by str.replace [duplicate]

I am currently working on a beginner problem
(https://www.reddit.com/r/beginnerprojects/comments/1i6sax/challenge_count_and_fix_green_eggs_and_ham/).
The challenge is to read through a file, replacing lowercase 'i' with 'I' and writing a new, corrected file.
I am at a point where the program reads the input file, replaces the relevant lowercase characters, and writes a new corrected file. However, I also need to count the number of corrections.
I have looked through the .replace() documentation and I cannot see that it is possible to find out the number of replacements made. Is it possible to count corrections using the replace method?
def capitalize_i(file):
    file = file.replace('i ', 'I ')
    file = file.replace('-i-', '-I-')
    return file

with open("green_eggs.txt", "r") as f_open:
    file_1 = f_open.read()

file_2 = open("result.txt", "w")
file_2.write(capitalize_i(file_1))
You can just use the count method, taking each count before the corresponding replacement:

i_count = file.count('i ')
file = file.replace('i ', 'I ')
i_count += file.count('-i-')
file = file.replace('-i-', '-I-')

i_count will then hold the total number of replacements made. You can also keep the counts separate by creating new variables if you want.
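If you'd rather get the count and the replacement in one step, the standard library's re.subn returns a (new_string, number_of_substitutions) tuple. A minimal sketch with the same two patterns:

import re

text = "i like -i- green eggs and ham, i do"
# re.subn performs the substitution and reports how many were made
text, n1 = re.subn('i ', 'I ', text)
text, n2 = re.subn('-i-', '-I-', text)
print(f"{n1 + n2} replacements made")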

Remove words from a subtitle file that aren't in a wordlist (of common words)

I have some subtitle files, and I'm not intending to learn every single word in these subtitles; there is no need to learn hard terms like: cleidocranial, dysplasia...
I found this script here: Remove words from a cell that aren't in a list. But I have no idea how to modify it or run it. (I'm using Linux.)
Here is our example:
subtitle file (.srt):
2
00:00:13,000 --> 00:00:15,000
People with cleidocranial dysplasia are good.
wordlist of 3000 common words (.txt):
...
people
with
are
good
...
Output we need (.srt):
2
00:00:13,000 --> 00:00:15,000
People with * * are good.
Or just mark them if it's possible (.srt):
2
00:00:13,000 --> 00:00:15,000
People with cleidocranial* dysplasia* are good.
If there is a solution working just with plain text (without timecodes), that's OK too; just explain how to run it.
Thank you.
The following processes only the 3rd line of every '.srt' file. It can easily be adapted to process other lines and/or other files.
import os
import re
from glob import glob

with open('words.txt') as f:
    keep_words = {line.strip().lower() for line in f}

for filename_in in glob('*.srt'):
    filename_out = f'{os.path.splitext(filename_in)[0]}_new.srt'
    with open(filename_in) as fin, open(filename_out, 'w') as fout:
        for i, line in enumerate(fin):
            if i == 2:
                parts = re.split(r"([\w']+)", line.strip())
                parts[1::2] = [w if w.lower() in keep_words else '*' for w in parts[1::2]]
                line = ''.join(parts) + '\n'
            fout.write(line)
Result (for the subtitle.srt you gave as an example):

! cat subtitle_new.srt
2
00:00:13,000 --> 00:00:15,000
People with * * are good.
Alternative: just add a '*' next to out-of-vocabulary words:
# replace:
# parts[1::2] = [w if w.lower() in keep_words else '*' for w in parts[1::2]]
parts[1::2] = [w if w.lower() in keep_words else f'{w}*' for w in parts[1::2]]
The output is then:
2
00:00:13,000 --> 00:00:15,000
People with cleidocranial* dysplasia* are good.
Explanation:
The first open is used to read in all wanted words, make sure they are lowercase, and put them into a set (for fast membership tests).
We use glob to find all filenames ending in '.srt'.
For each such file, we construct a new filename derived from it as '..._new.srt'.
We read in all lines, but modify only line i == 2 (i.e. the 3rd line, since enumerate by default starts at 0).
line.strip() removes the trailing newline.
We could have used line.strip().split() to split the line into words, but it would have left 'good.' as the last word; not good. The regex used here is a common way to split text into words (in particular, it keeps single quotes inside words such as "don't"; that may or may not be what you want, so adapt at will).
We use a capturing-group split, r"([\w']+)", instead of splitting on non-word chars, so that parts contains both the words and what separates them. For example, 'People, who are good.' becomes ['', 'People', ', ', 'who', ' ', 'are', ' ', 'good', '.'].
The words themselves are every other element of parts, starting at index 1.
We replace the words with '*' if their lowercase form is not in keep_words.
Finally, we re-assemble the line and write all lines to the new file.
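To see the capturing-group split in action, here is a standalone demo of the mechanism, using the example sentence from the question:

import re

line = "People with cleidocranial dysplasia are good."
parts = re.split(r"([\w']+)", line.strip())
# parts alternates separators and words:
# ['', 'People', ' ', 'with', ' ', 'cleidocranial', ' ', 'dysplasia', ' ', 'are', ' ', 'good', '.']
keep_words = {'people', 'with', 'are', 'good'}
parts[1::2] = [w if w.lower() in keep_words else '*' for w in parts[1::2]]
print(''.join(parts))  # People with * * are good.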
You could simply run a Python script like this:

with open("words.txt", "rt") as words:
    # create a list with every common word
    wordList = words.read().split("\n")

with open("subtitle.srt", "rt") as subtitles, open("subtitle_output.srt", "wt") as out:
    for line in subtitles.readlines():
        if line[:1].isdigit():
            # copy the line through unchanged, as it starts with a digit
            # (an index or timecode line)
            out.write(line)
        else:
            for word in line.split():
                if word not in wordList:
                    line = line.replace(word, f"*{word}*")
            out.write(line)

This script replaces every word that is not in the common-words file with the modified *word*, keeping the original file intact and putting everything into a new output file.
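A small performance note on the sketch above: word not in wordList scans the whole list for every word. For a 3000-word vocabulary it is worth building a set once and testing against that instead:

wordSet = set(wordList)  # membership tests are then effectively constant-time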

Calculating how many times sentence words are repeating in the file

I want to check how many times a word is repeated in a file. I have seen other codes on finding words in a file, but they won't solve my problem. What I mean is: if I search for "Python is my favourite language", the program will split the text and tell how many times each word is repeated in the file.
def search_tand_export():
    file = open("mine.txt")
    contentlist = file.read().split(" ")
    string = input("search box").split(" ")
    print(string)
    fre = {}
    outputfile = open("outputfile.txt", 'w')
    for word in contentlist:
        print(word)
        for i in string:
            if i == word:
                print(f"'{string}' is in text file ")
                outputfile.write(word)
                print(word)
                spl = tuple(string)
                for j in range(0, len(contentlist)):
                    if spl in contentlist:
                        fre[spl] += 1
                    else:
                        fre[spl] = 1
                sor_list = sorted(fre.items(), key=lambda x: x[1])
                for x, y in sor_list:
                    print(f"Word\tFrequency")
                    print(f"{x}\t{y}")
            else:
                continue
    print(f"The word or collection of words is not present")

search_tand_export()
I don't quite understand what you're trying to do, but I suppose you are trying to find how many times every word from a given sentence is repeated in the file. If this is the case, you can try something like this:
sentence = "Python is my favorite programming language"
words = sentence.split()
with open("file.txt") as fp:
file_data = fp.read()
for word in words:
print(f"{file_data.count(word)} occurence(s) of '{word}' found")
Note that the code above is case-sensitive (that is, "Python" and "python" are different words). To make it case-insensitive, you can bring file_data and every word during comparison to lowercase using str.lower().
sentence = "Python is my favorite programming language"
words = sentence.split()
with open("file.txt") as fp:
file_data = fp.read().lower()
for word in words:
print(f"{file_data.count(word.lower())} occurence(s) of '{word}' found")
A couple of things to note:
You are opening a file and never closing it (although you should). It's better to use with open(...) as ... (a context manager), so the file is closed automatically.
Python strings (as well as lists, tuples, etc.) have a .count(what) method. It returns how many occurrences of what are found in the object.
Read about the PEP 8 coding style and give better names to your variables. For example, it is not easy to understand what fre means in your code. But if you name it frequency, the code becomes more readable and easier to work with.
to be continued
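One caveat about the str.count approach above (an editorial note, not from the original answer): it counts substring occurrences, so 'is' also matches inside 'this'. A sketch that counts whole words instead, using collections.Counter:

from collections import Counter

sentence = "Python is my favorite programming language"
words = sentence.split()

with open("file.txt") as fp:
    # splitting into words first means 'is' will not match inside 'this'
    counts = Counter(fp.read().lower().split())

for word in words:
    print(f"{counts[word.lower()]} occurrence(s) of '{word}' found")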
Try this script. It counts how many times word occurs in the file:

file = open('hello.txt', 'r')
word = 'Python'
count = 0
for line in file:
    # split the line into words and count the exact matches
    count += line.split().count(word)
file.close()
print('File contains ' + word + ' ' + str(count) + ' times')

Replace words of a long document in Python

I have a dictionary dict with some words (2000), and I have a huge text, like a Wikipedia corpus, in text format. For each word that is both in the dictionary and in the text file, I would like to replace it with word_1.
with open("wiki.txt",'r') as original, open("new.txt",'w') as mod:
for line in original:
new_line = line
for word in line.split():
if (dict.get(word.lower()) is not None):
new_line = new_line.replace(word,word+"_1")
mod.write(new_line)
This code creates a new file called new.txt with the words that appear in the dictionary replaced as I want.
This works for short files, but for the longer ones that I am using as input, it "freezes" my computer.
Is there a more efficient way to do that?
Edit for Adi219:
Your code seems to work, but there is a problem:
if a line is like Albert is a friend of Albert and my dictionary contains Albert, after the for cycle the line will be like this: Albert_1_1 is a friend of Albert_1. How can I replace only the exact word that I want, to avoid repetitions like _1_1_1_1?
Edit 2:
To solve the previous problem, I changed your code:

with open("wiki.txt", "r") as original, open("new.txt", "w") as mod:
    for line in original:
        words = line.split()
        for word in words:
            if dict.get(word.lower()) is not None:
                mod.write(word + "_1 ")
            else:
                mod.write(word + " ")
        mod.write("\n")
Now everything should work
A few things:
You could remove the declaration of new_line. Then, change the new_line = new_line.replace(...) line to line = line.replace(...). You would also have to write(line) afterwards.
You could add words = line.split() and use for word in words: in the for loop, so the split is done once up front rather than in the loop header.
You could (manually?) split your large .txt file into multiple smaller files, have multiple instances of your program running on each file, and then combine the multiple outputs into one file. Note: you would have to remember to change the filename for each file you're reading/writing.
So, your code would look like:

with open("wiki.txt", "r") as original, open("new.txt", "w") as mod:
    for line in original:
        words = line.split()
        for word in words:
            if dict.get(word.lower()) is not None:
                line = line.replace(word, word + "_1")
        mod.write(line)
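For the '_1_1' repetition problem raised in the question's edit, one alternative (a sketch, assuming the dictionary keys are lowercase; word_dict is a hypothetical stand-in for the asker's 2000-word dict) is a single regex pass per line. With a word-boundary pattern, each occurrence is rewritten exactly once, so 'Albert' can never become 'Albert_1_1':

import re

word_dict = {"albert": 1}  # hypothetical stand-in for the real dictionary

with open("wiki.txt", "r") as original, open("new.txt", "w") as mod:
    for line in original:
        # \b marks word boundaries; each word is matched and rewritten at most once
        line = re.sub(r"\b\w+\b",
                      lambda m: m.group(0) + "_1" if m.group(0).lower() in word_dict else m.group(0),
                      line)
        mod.write(line)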

python file manipulation

I have a file with entries such as:
26 1
33 2
.
.
.
and another file with sentences in English.
I have to write a script to print the 1st word of sentence number 26 and the 2nd word of sentence number 33.
How do I do it?
The following code should do the task, assuming the files are not too large. You may have to do some modification to deal with edge cases (double spaces, etc.):

# Get the (sentence number, word number) pairs from the first file.
num = []
with open('1.txt') as file:
    for line in file:
        parts = line.split()
        if parts:
            num.append((int(parts[0]), int(parts[1])))

# Get the text from the second file.
with open('2.txt') as file:
    text = file.readlines()

# Parse the text into a list of word lists, one per sentence.
data = []
for line in text:  # for each paragraph in the text
    sentences = line.strip().split('.')  # split it into sentences
    for sentence in sentences:  # for each sentence in the paragraph
        words = sentence.split()  # split it into a word list
        if len(words) > 0:
            data.append(words)

# Get the desired result (the numbering in the input is 1-based).
for sentence_no, word_no in num:
    print(data[sentence_no - 1][word_no - 1])
Here's a general sketch:
Read the first file into a list (a numeric pair in each element).
Read the second file into a list (a sentence in each element).
Iterate over the entry list; for each pair, find the sentence and print its relevant word.
Now, if you show some effort at implementing this in Python, you will probably get more help.
The big issue is that you have to decide what separates "sentences". For example, is a '.' the end of a sentence? Or maybe part of an abbreviation, e.g. the one I've just used?-) Secondarily, and less difficult, what separates "words"; e.g., is "TCP/IP" one word, or two?
Once you have sharply defined these rules, you can easily read the file of text into a list of "sentences", each of which is a list of "words". Then you read the other file as a sequence of pairs of numbers and use them as indices into the overall list and into the sublist thus identified. But the problem of sentence and word separation is really the hard part.
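A toy illustration (an editorial addition) of why a bare '.' split is fragile:

text = "Mr. Smith uses TCP/IP. It works."
print(text.split('.'))
# ['Mr', ' Smith uses TCP/IP', ' It works', ''] -- the abbreviation 'Mr.' falsely ends a sentence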
In the following code, I am assuming that sentences end with '.'. You can modify it easily to accommodate other sentence delimiters as well; note that abbreviations will therefore be a source of bugs. Also, I am going to assume that words are delimited by spaces.

# file1 and file2 are assumed to be already-open file objects:
# file1 holds the number pairs, file2 holds the English text.
sentences = []
queries = []

english = ""
for line in file2:
    english += line
while english:
    period = english.find('.')
    if period == -1:
        break  # no more complete sentences
    sentences.append(english[:period].split())
    english = english[period + 1:]

q = ""
for line in file1:
    q += " " + line.strip()
q = [int(x) for x in q.split()]
for i in range(0, len(q) - 1, 2):
    sentence = q[i]
    word = q[i + 1]
    queries.append((sentence, word))

for s, w in queries:
    print(sentences[s - 1][w - 1])
I haven't tested this, so please let me know (preferably with the case that broke it) if it doesn't work, and I will look into the bugs.
Hope this helps.
