I have written a program to compare the file new1.txt with new2.txt; the lines that are in new1.txt but not in new2.txt have to be written to the file difference.txt.
Can someone please have a look and let me know what changes are required in the code below? It prints the same value multiple times.
file1 = open("new1.txt",'r')
file2 = open("new2.txt",'r')
NewFile = open("difference.txt",'w')
for line1 in file1:
for line2 in file2:
if line2 != line1:
NewFile.write(line1)
file1.close()
file2.close()
NewFile.close()
Here's an example using the with statement, assuming the files are small enough to fit in memory:
# Open 'new1.txt' as f1, 'new2.txt' as f2 and 'diff.txt' as outf
with open('new1.txt') as f1, open('new2.txt') as f2, open('diff.txt', 'w') as outf:
    # Read the lines from 'new2.txt' and store them in a set
    lines = set(f2.readlines())
    # Loop through each line in 'new1.txt'
    for line in f1:
        # If the line was not in 'new2.txt'
        if line not in lines:
            # Write the line to the output file
            outf.write(line)
The with statement simply closes the opened file(s) automatically. These two pieces of code are equivalent:
with open('temp.log', 'w') as temp:
    temp.write('Temporary logging.')

# equivalent to:
temp = open('temp.log', 'w')
temp.write('Temporary logging.')
temp.close()
Yet another way uses two sets, but this again isn't very memory efficient. If your files are big, this won't work:
# Again, open the three files as f1, f2 and outf
with open('new1.txt') as f1, open('new2.txt') as f2, open('diff.txt', 'w') as outf:
    # Read the lines in 'new1.txt' and 'new2.txt'
    s1, s2 = set(f1.readlines()), set(f2.readlines())
    # `s1 - s2 | s2 - s1` returns the lines that appear in only one of the two sets
    # Now we simply loop through the differing lines
    for line in s1 - s2 | s2 - s1:
        # And output all the differing lines
        outf.write(line)
Keep in mind that this last snippet might not preserve the order of your lines. If order matters, see the order-preserving sketch below.
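One possible order-preserving variant (a minimal sketch, assuming both files still fit in memory and reusing the same file names) keeps the set only for membership tests and walks the files in their original order:

with open('new1.txt') as f1, open('new2.txt') as f2, open('diff.txt', 'w') as outf:
    lines1, lines2 = f1.readlines(), f2.readlines()
    # The set is used only for fast membership tests
    diff = set(lines1).symmetric_difference(lines2)
    # Walk both files in their original order and keep the lines that differ
    for line in lines1 + lines2:
        if line in diff:
            outf.write(line)
            diff.discard(line)  # write each differing line only once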
For example, suppose you have
file1:
line1
line2
and file2:
line1
line3
line4
When you compare line1 and line3, they differ, so you write line1 to your output file. Then you compare line1 and line4; again they are not equal, so you write line1 to the output file again... You need to break out of both for loops when your condition is true. You can use a helper variable to break out of the outer for, as sketched below.
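For illustration, here is a minimal sketch of that helper-variable idea applied to the code from the question (the flag name found is my own placeholder):

file1 = open("new1.txt",'r')
file2 = open("new2.txt",'r')
NewFile = open("difference.txt",'w')
lines2 = file2.readlines()      # read once, so the inner loop can run for every line1
for line1 in file1:
    found = False               # helper variable
    for line2 in lines2:
        if line2 == line1:
            found = True
            break               # stop scanning as soon as a match is found
    if not found:
        NewFile.write(line1)    # write only after all of file2 has been checked
file1.close()
file2.close()
NewFile.close()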
If your file is a big one, you could use this for-else method:
The else clause below the second for loop executes only when that for loop completes without hitting break, i.e. when there is no match.
Modification:
with open('new1.txt') as file1, open('diff.txt', 'w') as NewFile:
    for line1 in file1:
        with open('new2.txt') as file2:
            for line2 in file2:
                if line2 == line1:
                    break
            else:
                NewFile.write(line1)
For more on the for-else construct, see this Stack Overflow question: for-else
It is because of your for loops.
If I understand well, you want to see which lines in file1 are not present in file2.
So for each line in file1, you have to check whether the same line appears in file2. But this is not what your code does: for each line in file1, you check every line in file2 (this is right), but each time the line in file2 differs from the line in file1, you print the line from file1! You should print the line from file1 only AFTER having checked ALL the lines in file2, to be sure the line does not appear at least once.
It could look something like this:
file1 = open("new1.txt",'r')
file2 = open("new2.txt",'r')
NewFile = open("difference.txt",'w')
for line1 in file1:
if line1 not in file2:
NewFile.write(line1)
file1.close()
file2.close()
NewFile.close()
I always find that working with sets makes comparing two collections easier, especially because "does this collection contain this?" checks run in O(1), and most nested loops can be reduced to a single loop (easier to read in my opinion).
with open('test1.txt') as file1, open('test2.txt') as file2, open('diff.txt', 'w') as diff:
    s1 = set(file1)
    s2 = set(file2)
    for e in s1:
        if e not in s2:
            diff.write(e)
Your inner loop is executed multiple times for each line of file1. To avoid that, use this (note that it compares the files line by line, in parallel):
from itertools import izip  # Python 2; on Python 3 the built-in zip does the same

file1 = open("new1.txt",'r')
file2 = open("new2.txt",'r')
NewFile = open("difference.txt",'w')
# compare corresponding lines of the two files pairwise
for line1, line2 in izip(file1, file2):
    if line2 != line1:
        NewFile.write(line1)
file1.close()
file2.close()
NewFile.close()
Print to NewFile only after comparing with all the lines of file2:
present = False
for line2 in file2:
    if line2 == line1:
        present = True
if not present:
    NewFile.write(line1)
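A minimal sketch of how that fragment could sit inside the full program; I'm assuming file2's lines are first read into a list so they can be scanned again for every line of file1:

file1 = open("new1.txt",'r')
file2 = open("new2.txt",'r')
NewFile = open("difference.txt",'w')
lines2 = file2.readlines()  # a bare file object would be exhausted after one pass
for line1 in file1:
    present = False
    for line2 in lines2:
        if line2 == line1:
            present = True
    if not present:
        NewFile.write(line1)
file1.close()
file2.close()
NewFile.close()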
You can use basic set operations for this:
with open('new1.txt') as f1, open('new2.txt') as f2, open('diffs.txt', 'w') as diffs:
    diffs.writelines(set(f1).difference(f2))
According to this reference, this will execute with O(n) where n is the number of lines in the first file. If you know that the second file is significantly smaller than the first you can optimise with set.difference_update(). This has complexity O(n) where n is the number of lines in the second file. For example:
with open('new1.txt') as f1, open('new2.txt') as f2, open('diffs.txt', 'w') as diffs:
    s = set(f1)
    s.difference_update(f2)
    diffs.writelines(s)
Related
I have 2 text files named f1 & f2 with 100k lines of names each. I want to compare the first line of f1 with every line of f2, then the second line of f1 with every line of f2, and so on. I already tried using a nested for loop, like the code below, but it doesn't work.
What am I doing wrong that I can't seem to spot? Can someone please tell me?
Thanks in advance.
old.txt
sourcreameggnest
saturnnixgreentea
saxophonedesertham
footballplumvirgo
soybeansthesting
cauliflowertornado
sourcreameggnest
saturnnixgreentea
new.txt
goldfishpebbleduck
saxophonedesertham
footballplumvirgo
abloomtheavengers
venisonflowersea
goodfellaswalker
saturnnixgreentea
Code:
with open('old.txt', 'r') as f1, open('new.txt', 'r') as f2:
    for line1 in f1:
        print('Line 1:- ' + line1, end='')
        for line2 in f2:
            print('Line 2:- ' + line2, end='')
            if line1.strip() == line2:
                print("Inside comparison" + line1, end='')
Output:
Line 1:- goldfishpebbleduck
Line 2:- sourcreameggnest
Line 2:- saturnnixgreentea
Line 2:- saxophonedesertham
Line 2:- footballplumvirgo
Line 2:- soybeansthesting
Line 2:- cauliflowertornado
Line 2:- sourcreameggnest
Line 2:- saturnnixgreentea
Line 1:- saxophonedesertham
Line 1:- footballplumvirgo
Line 1:- abloomtheavengers
Line 1:- venisonflowersea
Line 1:- goodfellaswalker
Line 1:- saturnnixgreentea
Considering the number of lines in the files I would entirely avoid the nested loop (O(n^2)) approach and load the lines of the second text file in a dictionary (if you care about the lines and/or the lines could be repeated), or in a set otherwise.
Then I would loop over the lines in the first file and check whether they are in the dictionary and act accordingly. This will use some extra space linear to the number of lines in the second file but reduce your time complexity to O(n) since dictionary lookups are constant.
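A minimal sketch of that idea, with the dictionary built from new.txt so each lookup is constant time; the dictionary here also counts repeated lines, and the variable names are just placeholders of mine:

# Build a dictionary from the second file; values count how often each line occurs
counts = {}
with open('new.txt', 'r') as f2:
    for line in f2:
        counts[line] = counts.get(line, 0) + 1

with open('old.txt', 'r') as f1:
    for line1 in f1:
        if line1 in counts:  # O(1) lookup instead of scanning new.txt again
            print("Inside comparison" + line1, end='')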
As to your current solution's incorrectness, as pointed out by #Thierry Lathuille, the second iterator is exhausted after the first run of the outer loop, so it won't be checked in the remaining iterations. One mitigation is to read the lines of the files into lists that you can loop over repeatedly (lines1 = f1.readlines(); lines2 = f2.readlines()). Also, your use of strip is not correct if you intend to skip whitespace-only lines; they will still be compared as empty strings, with the added downside that stripping one line and not the other can create unwanted differences.
In any case, for such large numbers, an approach of quadratic time complexity is not feasible.
Combining the answers of #LukasNeugebauer and #Thierry Lathuille, here's what your code should look like:
with open('old.txt', 'r') as f1, open('new.txt', 'r') as f2:
    lines1 = f1.readlines()
    lines2 = f2.readlines()
    for line1 in lines1:
        print('Line 1:- ' + line1, end='')
        if line1 in lines2:
            print("Inside comparison" + line1, end='')
If you are wondering whether using the in check is faster than iterating through the second list and comparing each value with ==, I tested it. For two files each containing 10,000 lines of random strings, it took ~2.8 seconds to process them fully with two loops and only ~0.8 seconds using the in operator.
If your files are not bigger than a megabyte, I wouldn't really bother optimizing this, but otherwise you should really think about what you are actually comparing and what shortcuts you can use.
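If you want to reproduce that kind of timing yourself, here is a rough sketch using timeit and random strings; the list sizes and names are arbitrary choices of mine, and the exact numbers will vary by machine:

import random
import string
import timeit

# Two lists of 10,000 random 10-character lines each
def random_lines(n=10000, length=10):
    return [''.join(random.choices(string.ascii_lowercase, k=length)) for _ in range(n)]

lines1, lines2 = random_lines(), random_lines()

def nested_loops():
    # compare every pair of lines explicitly (this one can take a while)
    return [l1 for l1 in lines1 for l2 in lines2 if l1 == l2]

def in_operator():
    # let the in operator do the inner scan in C
    return [l1 for l1 in lines1 if l1 in lines2]

print('nested loops:', timeit.timeit(nested_loops, number=1))
print('in operator: ', timeit.timeit(in_operator, number=1))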
EDIT:
Some comments suggested making the second list of lines a set (change the third line to lines2 = set(f2.readlines())); it would make the code much faster (the same example I used above now runs in only 4 milliseconds, more than 200 times faster), but it may not actually solve the problem, since converting a list to a set removes all duplicates. Only use that if you are sure you can discard duplicates.
You already read to the end of the file after the first iteration of the outer loop. By the way, I didn't know you could just loop over an opened file. Just store the lines first. Also, I don't see why you would strip the '\n' from only one of the lines.
with open('old.txt', 'r') as f1, open('new.txt', 'r') as f2:
    lines1 = f1.readlines()
    lines2 = f2.readlines()
    for line1 in lines1:
        print('Line 1:- ' + line1, end='')
        for line2 in lines2:
            print('Line 2:- ' + line2, end='')
            if line1 == line2:
                print("Inside comparison" + line1, end='')
with open('old.txt', 'r') as f1, open('new.txt', 'r') as f2:
    lines2 = f2.readlines()
    l2 = dict()
    # fill the dictionary for file2 with each name and the lines on which it occurs
    for i in range(len(lines2)):
        l2.setdefault(lines2[i], []).append(i)
    for line in f1.readlines():
        if line in l2:
            for j in l2[line]:
                print(line, j, ...)
Each of these two loops should have a complexity of O(n), and the lookup via in on the dictionary should be O(1).
I'm trying to compare two .txt files for changed or deleted lines. If a line is deleted, I want to output what the deleted line was, and if it was changed, I want to output the new line. I originally tried comparing line to line, but when something was deleted it wouldn't work for my purpose:
for line1 in f1:
    for line1 in f2:
        if line1==line1:
            print("SAME",file=x)
        else:
            print(f"Original:{line1} / New:{line1}", file=x)
Then I tried not comparing line to line, so I could figure out if something was deleted, but I'm not getting any output:
def check_diff(f1,f2):
    check = {}
    for file in [f1,f2]:
        with open(file,'r') as f:
            check[file] = []
            for line in f:
                check[file].append(line)
    diff = set(check[f1]) - set(check[f2])
    for line in diff:
        print(line.rstrip(),file=x)
I tried combining a lot of other questions previously asked similar to my problem to get this far, but I'm new to python so I need a little extra help. Thanks! Please let me know if I need to add any additional information.
The concept is simple. Let's say file1.txt is the original file, and file2.txt is the one we need to check for changes:
with open('file1.txt', 'r') as f:
    f1 = f.readlines()
with open('file2.txt', 'r') as f:
    f2 = f.readlines()

deleted = []
added = []
for l in f1:
    if l not in f2:
        deleted.append(l)
for l in f2:
    if l not in f1:
        added.append(l)

print('Deleted lines:')
print(''.join(deleted))
print('Added lines:')
print(''.join(added))
For every line in the original file, if that line isn't in the other file, that means the line has been deleted. If it's the other way around, that means the line has been added.
I am not sure how you would quantify a changed line (since you could count it as one deleted plus one added line), but perhaps something like the code below would be of some aid. Note that if your files are large, it might be faster to store the data in a set instead of a list, since the former typically has a search time complexity of O(1), while the latter has O(n):
with open('file1.txt', 'r') as f1, open('file2.txt', 'r') as f2:
    file1 = set(f1.read().splitlines())
    file2 = set(f2.read().splitlines())
    changed_lines = [line for line in file1 if line not in file2]
    deleted_lines = [line for line in file2 if line not in file1]
    print('Changed lines:\n' + '\n'.join(changed_lines))
    print('Deleted lines:\n' + '\n'.join(deleted_lines))
The f1.write(line2) works, but it does not replace the text in the file; it just adds to it. I want file1 to be identical to file2 if they are different, by overwriting the text in file1 with the text from file2.
Here is my code:
with open("file1.txt", "r+") as f1, open("file2.txt", "r") as f2:
for line1 in f1:
for line2 in f2:
if line1 == line2:
print("same")
else:
print("different")
f1.write(line2)
break
f1.close()
f2.close()
I would read both files, create a new list with the different elements replaced, and then write the entire list to the file:
with open('file2.txt', 'r') as f:
    content = [line.strip() for line in f]

with open('file1.txt', 'r') as j:
    content_a = [line.strip() for line in j]

for idx, item in enumerate(content_a):
    if content_a[idx] == content[idx]:
        print('same')
        pass
    else:
        print('different')
        content_a[idx] = content[idx]

with open('file1.txt', 'w') as k:
    k.write('\n'.join(content_a))
file1.txt before:
chrx#chrx:~/python/stackoverflow/9.28$ cat file1.txt
this
that
this
that
who #replacing
that
what
blah
code output:
same
same
same
same
different
same
same
same
file1.txt after:
chrx#chrx:~/python/stackoverflow/9.28$ cat file1.txt
this
that
this
that
vash #replaced who
that
what
blah
I want the file1 to be identical to file2
import shutil

with open('file2', 'rb') as f2, open('file1', 'wb') as f1:
    shutil.copyfileobj(f2, f1)
This will be faster as you don't have to read file1.
Your code is not working because you'd have to position the file's current pointer (with f1.seek()) at the correct position before writing the line.
In your code, you're reading a line first, and that positions the pointer after the line just read. When writing, the line data is written into the file at that point, thus duplicating the line.
Since lines can have different sizes, making this work won't be easy: even if you position the pointer correctly, a line that is modified to be bigger would overwrite part of the next line in the file when you write it. You would end up having to cache at least part of the file contents in memory anyway.
It is better to truncate the file (erase its contents) and write the other file's data directly; then they will be identical. That's what the code in the answer above does.
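For completeness, a minimal sketch of that truncate-and-rewrite idea without shutil, assuming the files are small enough to read into memory at once:

with open("file2.txt", "r") as f2:
    data = f2.read()

# Opening with "w" truncates file1.txt before writing, so afterwards
# it holds exactly the contents of file2.txt
with open("file1.txt", "w") as f1:
    f1.write(data)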
I have two files: one is user input (f1), and the other is a database (f2). I want to check whether the strings from f1 are in the database (f2), and if not, print the ones that don't exist in f2. I have a problem with my code; it is not working fine:
Here is f1:
rbs003491
rbs003499
rbs003531
rbs003539
rbs111111
Here is f2:
AHPTUR13,rbs003411
AHPTUR13,rbs003419
AHPTUR13,rbs003451
AHPTUR13,rbs003459
AHPTUR13,rbs003469
AHPTUR13,rbs003471
AHPTUR13,rbs003479
AHPTUR13,rbs003491
AHPTUR13,rbs003499
AHPTUR13,rbs003531
AHPTUR13,rbs003539
AHPTUR13,rbs003541
AHPTUR13,rbs003549
AHPTUR13,rbs003581
In this case it would return rbs111111, because it is not in f2.
Code is:
with open(c,'r') as f1:
    s1 = set(x.strip() for x in f1)
    print s1

with open("/tmp/ARNE/blt",'r') as f2:
    for line in f2:
        if line not in s1:
            print line
If you only care about the second part of each line (rbs003411 from AHPTUR13,rbs003411):
with open(user_input_path) as f1, open('/tmp/ARNE/blt') as f2:
    not_found = set(f1.read().split())
    for line in f2:
        _, found = line.strip().split(',')
        not_found.discard(found)  # remove found word

print not_found
# for x in not_found:
#     print x
Your line variable in the for loop will contain something like "AHPTUR13,rbs003411", but you are only interested in the second part. You should do something like:
for line in f2:
    line = line.strip().split(",")[1]
    if line not in s1:
        print line
You need to check only the last part of each line, not all of it: you can split your lines from f2 on ',' and then choose the last part (x.strip().split(',')[-1]). Also, if you want to check whether strings from f1 are in the database (f2), your LOGIC here is wrong; you need to build your set from f2:
with open(c,'r') as f1, open("/tmp/ARNE/blt",'r') as f2:
    s1 = set(x.strip().split(',')[-1] for x in f2)
    print s1
    for line in f1:
        if line.strip() not in s1:
            print line
I have two text files: file1 has 40 lines and file2 has 1.3 million lines.
I would like to compare every line in file1 with file2.
If a line from file1 appears once or multiple times in file2,
this line (or lines) should be deleted from file2, and the remaining lines of file2 returned to a third file, file3.
I could painfully delete one line of file1 from file2 by manually copying the line,
indicated as "unwanted_line" in my code. Does anyone know how to do this in Python?
Thanks in advance for your assistance.
Here's my code:
import os

fname = open(raw_input('Enter input filename: ')) #file2
outfile = open('Value.txt','w')
unwanted_line = "222" #This is in file1
for line in fname.readlines():
    if not unwanted_line in line:
        # now remove unwanted_line from fname
        data = line.strip("unwanted_line")
        # write it to the output file
        outfile.write(data)
print 'results written to:\n', os.getcwd()+'\Value.txt'
NOTE:
This is how I got it to work for me. I would like to thank everyone who contributed towards the solution; I took your ideas from here. I used set(), where the intersection (common lines) of file1 with file2 is removed and then the lines unique to file2 are returned to file3. It might not be the most elegant way of doing it, but it works for me. I respect every one of your ideas; they are great and wonderful, and it makes me feel like Python is the only programming language in the whole world.
Thanks guys.
def diff_lines(filenameA, filenameB):
    fnameA = set(filenameA)
    fnameB = set(filenameB)
    data = []
    # identify lines not common to both files
    # diff_line = fnameB ^ fnameA
    diff_line = fnameA.symmetric_difference(fnameB)
    data = list(diff_line)
    data.sort()
    return data
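If only the lines unique to file2 are needed (as the note describes), a plain set difference is enough. A minimal sketch in the same style, where the arguments are again assumed to be iterables of lines:

def lines_only_in_b(filenameA, filenameB):
    # keep the lines of B that do not appear anywhere in A
    fnameA = set(filenameA)
    data = [line for line in filenameB if line not in fnameA]
    data.sort()
    return data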
Read file1; put the lines into a set or dict (it'll have to be a dict if you're using a really old version of Python); now go through file2 and say something like if line not in things_seen_in_file_1: outfile.write(line) for each line.
Incidentally, in recent Python versions you shouldn't bother calling readlines: an open file is an iterator and you can just say for line in open(filename2): ... to process each line of the file.
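Putting those two points together, a minimal sketch of the approach described above (the file names are placeholders):

# Remember every line of file1, then keep only the file2 lines not seen there
things_seen_in_file_1 = set(open('file1'))

with open('file3', 'w') as outfile:
    for line in open('file2'):
        if line not in things_seen_in_file_1:
            outfile.write(line)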
Here is my version, but be aware that minuscule variations can cause lines not to be considered the same (like one space before the newline).
file1, file2, file3 = 'verysmalldict.txt', 'uk.txt', 'not_small.txt'
drop_these = set(open(file1))
with open(file3, 'w') as outfile:
    outfile.write(''.join(line for line in open(file2) if line not in drop_these))
with open(path1) as f1:
    lines1 = set(f1)

with open(path2) as f2:
    lines2 = tuple(f2)

lines3 = [x for x in lines2 if x in lines1]
lines2 = [x for x in lines2 if x not in lines1]

with open(path2, 'w') as f2:
    f2.writelines(lines2)

with open(path3, 'w') as f3:
    f3.writelines(lines3)
Closing f2 by using 2 separate with statements is a matter of personal preference/design choice.
What you can do is load file1 completely into memory (since it is small) and check whether each line in file2 matches a line in file1. If it doesn't, write it to file3. Sort of like this:
file1 = open('file1')
file2 = open('file2')
file3 = open('file3','w')

lines_from_file1 = []

# Read in all lines from file1
for line in file1:
    lines_from_file1.append(line)
file1.close()

# Now iterate over lines of file2
for line2 in file2:
    keep_this_line = True
    for line1 in lines_from_file1:
        if line1 == line2:
            keep_this_line = False
            break  # break out of inner for loop
    if keep_this_line:
        # line from file2 is not in file1 so save it into file3
        file3.write(line2)

file2.close()
file3.close()
Maybe not the most elegant solution, but if you don't have to run it every 3 seconds, it should work.
EDIT: By the way, the question in the text somewhat differs from the title. I tried to answer the question in the text.