Search large file for text and write result to file - Python

I have file one that is 2.4 million lines (256MB) and file two that is 32 thousand lines (1.5MB).
I need to go through file two line by line and print the matching line from file one.
Pseudocode:
open file 1, read
open file 2, read
open results, write
for line2 in file 2:
    for line1 in file 1:
        if line2 in line1:
            write line1 to results
            stop inner loop
My Code:
p = open("file1.txt", "r")
d = open("file2.txt", "r")
o = open("results.txt", "w")
for hash1 in p:
    hash1 = hash1.strip('\n')
    for data in d:
        hash2 = data.split(',')[1].strip('\n')
        if hash1 in hash2:
            o.write(data)
o.close()
d.close()
p.close()
I am expecting 32k results.

Your file2 is not too large, so it is perfectly fine to load it into memory.
Load file2.txt into a set to speed up the search and remove duplicates;
Remove the empty line from the set;
Scan file1.txt line by line and write the matches to results.txt.
with open("file2.txt","r") as f:
lines = set(f.readlines())
lines.discard("\n")
with open("results.txt", "w") as o:
with open("file1.txt","r") as f:
for line in f:
if line in lines:
o.write(line)
If file2 were larger, we could split it into chunks and repeat the same process for every chunk, but in that case it would be harder to combine the results.
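If, as in your original code, the value to match is the second comma-separated field of each line in file2.txt and you want the matching file2 line written to the results, a dict keyed on that field works the same way. A rough sketch under those assumptions (the field index and the strip calls are guesses based on your snippet):
with open("file2.txt", "r") as f:
    by_hash = {}
    for data in f:
        parts = data.split(",")
        if len(parts) > 1:
            by_hash[parts[1].strip()] = data  # map the hash field to the full file2 line

with open("file1.txt", "r") as f, open("results.txt", "w") as o:
    for line in f:
        match = by_hash.get(line.strip())  # O(1) dict lookup instead of a nested loop
        if match is not None:
            o.write(match)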

How to add empty lines between lines in text.txt document?

The purpose of the code is to add an empty line between the lines of the text.txt document and write some words in those empty lines.
I tried looping through every line, but for that the file should be open in read mode only;
iushnaufihsnuesa
fsuhadnfuisgadnfuigasdf
asfhasndfusaugdf
suhdfnciusgenfuigsaueifcas
This is a sample of the text.txt document.
How can I implement this with this txt file?
f = open("text.txt", 'w+')
for x in f:
    f.write("\n Words between spacing")
f.close()
First I tried to directly just make a new line between each line and add a couple of things.
I also thought of first making empty lines between each line and then adding some words in the empty spaces, but I didn't figure this out.
OK, for files in the region of 200 lines, you can store the whole file as a list of strings and add the extra lines when re-writing the file:
with open("text.txt", 'r') as f:
data = [line for line in f]
with open("text.txt", 'w') as f:
for line in data:
f.write(line)
f.write("Words between spacing\n")
You can divide this operation into three steps.
In the first step, you read all the lines from the file into a list[str] using f.readlines():
with open("text.txt", "r") as f: # using "read" mode
lines = f.readlines()
The second step is to join the lines in the list using the str.join(...) method with the text you want in between:
lines = "My line between the lines\n".join(lines)
The third step is to write it back to the file:
with open("text.txt", "w") as f: # using "write" mode
f.write(lines)
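Putting the three steps together, a minimal sketch of the whole operation (using the same placeholder text as above) could be:
with open("text.txt", "r") as f:  # step 1: read the original lines
    lines = f.readlines()

text = "My line between the lines\n".join(lines)  # step 2: join with the new text in between

with open("text.txt", "w") as f:  # step 3: write the result back
    f.write(text)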
Also, you can use f.read() in conjunction with text.replace("\n", ...):
with open("text.txt", "r") as f:
full_text = f.read()
full_text = full_text.replace("\n", "\nMy desirable text between the lines\n")
with open("text.txt", "w") as f:
f.write(full_text)
Initial text:
iushnaufihsnuesa
fsuhadnfuisgadnfuigasdf
asfhasndfusaugdf
suhdfnciusgenfuigsaueifcas
Final text:
iushnaufihsnuesa
My desirable text between the lines
fsuhadnfuisgadnfuigasdf
My desirable text between the lines
asfhasndfusaugdf
My desirable text between the lines
suhdfnciusgenfuigsaueifcas

Python won't read file line by line

fh = open('Spam.mbox', encoding='utf-8')
data = fh.read()
for line in data:
    print(line)
When I execute the above code, Python prints out the data one character at a time instead of line by line.
Please advise.
fh.read() returns the whole file as a single string, and iterating over a string yields individual characters. To get lines instead, you can use the readlines() function:
with open('Spam.mbox', encoding='utf-8') as f:
    data = f.readlines()
You can then iterate over the data variable and print each line:
for i in data:
    print(i)
When reading files, use the with statement, because then the file will be closed automatically after it has been processed.
Read line by line:
with open("textfile.txt", "r") as f:
for line in f:
print(line)
Read all the lines first and then loop through them:
with open("textfile.txt", "r") as f2:
lines = f2.readlines()
for ln in lines:
print(ln)
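If you prefer to keep fh.read(), splitting the resulting string with splitlines() also gives you lines rather than single characters; a small alternative sketch (not from the answers above):
with open('Spam.mbox', encoding='utf-8') as fh:
    for line in fh.read().splitlines():  # split the whole string into a list of lines
        print(line)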

Replacing text in one file with text from another file

The f1.write(line2) call works, but it does not replace the text in the file; it just adds to it. I want file1 to be identical to file2: if they differ, the text from file1 should be overwritten with the text from file2.
Here is my code:
with open("file1.txt", "r+") as f1, open("file2.txt", "r") as f2:
for line1 in f1:
for line2 in f2:
if line1 == line2:
print("same")
else:
print("different")
f1.write(line2)
break
f1.close()
f2.close()
I would read both files, create a new list with the different elements replaced, and then write the entire list back to the file:
with open('file2.txt', 'r') as f:
    content = [line.strip() for line in f]

with open('file1.txt', 'r') as j:
    content_a = [line.strip() for line in j]

for idx, item in enumerate(content_a):
    if item == content[idx]:
        print('same')
    else:
        print('different')
        content_a[idx] = content[idx]

with open('file1.txt', 'w') as k:
    k.write('\n'.join(content_a))
file1.txt before:
chrx#chrx:~/python/stackoverflow/9.28$ cat file1.txt
this
that
this
that
who #replacing
that
what
blah
code output:
same
same
same
same
different
same
same
same
file1.txt after:
chrx#chrx:~/python/stackoverflow/9.28$ cat file1.txt
this
that
this
that
vash #replaced who
that
what
blah
I want the file1 to be identical to file2
import shutil

with open('file2', 'rb') as f2, open('file1', 'wb') as f1:
    shutil.copyfileobj(f2, f1)
This will be faster as you don't have to read file1.
Your code is not working because you'd have to position the file's current pointer (with f1.seek()) at the correct position before writing the line.
In your code, you're reading a line first, and that positions the pointer after the line just read. When writing, the line data will be written to the file at that point, thus duplicating the line just read.
Since lines can have different sizes, making this work won't be easy: even if you position the pointer correctly, a modified line that gets longer would overwrite part of the next line in the file when you write it. You would end up having to cache at least part of the file contents in memory anyway.
It is better to truncate the file (erase its contents) and write the other file's data directly; then they will be identical. That's what the code in the answer above does.
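For reference, the same truncate-and-rewrite idea without shutil could look roughly like this (a sketch, assuming file2 is small enough to read in one go):
with open("file2.txt", "r") as f2, open("file1.txt", "r+") as f1:
    f1.truncate(0)       # erase file1's current contents
    f1.seek(0)           # make sure the pointer is back at the start
    f1.write(f2.read())  # copy file2's text into file1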

Split text file into lines, Python

I want to split a text file in Python using the following piece of code:
import sys

inputfile = open(sys.argv[1]).read()
for line in inputfile.strip().split("\n"):
    print line
The problem is that it only reads the first 12 lines! The file is large, more than 10 thousand lines!
What is the possible reason?
Thanks,
import sys

with open(sys.argv[1]) as inputfile:
    for line in inputfile:
        print(line)
Use readlines(), which builds the list automatically, so there is no need to split on "\n".
Try this:
text = r"C:\Users\Desktop\Test\Text.txt"
oFile = open(text, 'r')
line = oFile.readline()[:-1]
while line:
    splitLine = line.split(' ')
    print(splitLine)
    line = oFile.readline()[:-1]
oFile.close()
I use this style to iterate through huge text files at work.
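A version of the same loop using a with block, so the file is closed automatically, might look like this (a sketch, not from the original answer; the path is the same example path):
path = r"C:\Users\Desktop\Test\Text.txt"
with open(path, 'r') as oFile:
    for line in oFile:
        splitLine = line.rstrip('\n').split(' ')  # strip the newline, then split on spaces
        print(splitLine)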

Match lines in two text files

I have two text files: the first file is 40GB (data2) and the second is around 50MB (data1).
I want to check if any line in file1 has a match in file2, so I've written a Python script (below) to do so. The process with this script takes too much time, as it takes a line from file1 and then checks the whole of file2 line by line.
for line in open("data1.txt","r"):
for line2 in open("data2.txt","r"):
if line==line2:
print(line)
Is there any way to make this faster? The script has been running for 5 days and still hasn't finished. Is there also a way to show a percentage or the current line number as it processes?
Use a set and reverse the logic, checking whether each line from the large data file is in a set built from the lines of the smaller 50MB file:
with open("data1.txt", "r") as f1, open("data2.txt", "r") as f2:
lines = set(f1) # efficient 0(1) lookups using a set
for line in f2: # single pass over large file
if line in lines:
print(line)
If you want the line number, use enumerate:
with open("data1.txt", "r") as f1, open("data2.txt", "r") as f2:
lines = set(f1) # efficient 0(1) lookups using a set
for lined_no, line in enumerate(f2, 1): # single pass over large file
# print(line_no) # uncomment if you want to see every line number
if line in lines:
print(line,line_no)
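For the percentage, one rough approach is to compare the characters read so far against the file size from os.path.getsize. A sketch under the assumption that the file uses a single-byte encoding, so that character counts approximate byte counts:
import os

total = os.path.getsize("data2.txt")  # size of the large file in bytes
read_so_far = 0

with open("data1.txt", "r") as f1, open("data2.txt", "r") as f2:
    lines = set(f1)
    for line_no, line in enumerate(f2, 1):
        read_so_far += len(line)  # approximate bytes read, assuming 1 byte per character
        if line in lines:
            print(line, end="")
        if line_no % 1000000 == 0:  # report progress every million lines
            print("progress: {:.1%}".format(read_so_far / total))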
