python, compare two files and get difference - python

I have two files, one is user input f1, and the other one is a database f2. I want to search whether the strings from f1 are in the database (f2). If not, print the ones that don't exist in f2. I have a problem with my code, it is not working correctly:
Here is f1:
rbs003491
rbs003499
rbs003531
rbs003539
rbs111111
Here is f2:
AHPTUR13,rbs003411
AHPTUR13,rbs003419
AHPTUR13,rbs003451
AHPTUR13,rbs003459
AHPTUR13,rbs003469
AHPTUR13,rbs003471
AHPTUR13,rbs003479
AHPTUR13,rbs003491
AHPTUR13,rbs003499
AHPTUR13,rbs003531
AHPTUR13,rbs003539
AHPTUR13,rbs003541
AHPTUR13,rbs003549
AHPTUR13,rbs003581
In this case it would return rbs111111, because it is not in f2.
Code is:
with open(c, 'r') as f1:
    s1 = set(x.strip() for x in f1)
    print s1
with open("/tmp/ARNE/blt", 'r') as f2:
    for line in f2:
        if line not in s1:
            print line

If you only care about the second part of each line (rbs003411 from AHPTUR13,rbs003411):
with open(user_input_path) as f1, open('/tmp/ARNE/blt') as f2:
    not_found = set(f1.read().split())
    for line in f2:
        _, found = line.strip().split(',')
        not_found.discard(found)  # remove found word
    print not_found
    # for x in not_found:
    #     print x

Your line variable in the for loop will contain something like "AHPTUR13,rbs003411", but you are only interested in the second part. You should do something like:
for line in f2:
    line = line.strip().split(",")[1]
    if line not in s1:
        print line

You need to check only the last part of each line, not the whole line. You can split the lines from f2 on , and take the last part (x.strip().split(',')[-1]). Also, if you want to search whether strings from f1 are in the database (f2), your logic here is reversed: you need to build your set from f2:
with open(c, 'r') as f1, open("/tmp/ARNE/blt", 'r') as f2:
    s1 = set(x.strip().split(',')[-1] for x in f2)
    print s1
    for line in f1:
        if line.strip() not in s1:
            print line

Related

Comparing contents of two txt.files for deleted lines or changes in python

I'm trying to compare two .txt files for changes or deleted lines. If a line was deleted I want to output what the deleted line was, and if it was changed I want to output the new line. I originally tried comparing line to line, but when something was deleted it wouldn't work for my purpose:
for line1 in f1:
    for line2 in f2:
        if line1 == line2:
            print("SAME", file=x)
        else:
            print(f"Original:{line1} / New:{line2}", file=x)
Then I tried not comparing line to line so I could figure out if something was deleted but I'm not getting any output:
def check_diff(f1, f2):
    check = {}
    for file in [f1, f2]:
        with open(file, 'r') as f:
            check[file] = []
            for line in f:
                check[file].append(line)
    diff = set(check[f1]) - set(check[f2])
    for line in diff:
        print(line.rstrip(), file=x)
I tried combining a lot of previously asked questions similar to my problem to get this far, but I'm new to Python, so I need a little extra help. Thanks! Please let me know if I need to add any additional information.
The concept is simple. Let's say file1.txt is the original file, and file2.txt is the one we need to check for changes:
with open('file1.txt', 'r') as f:
    f1 = f.readlines()
with open('file2.txt', 'r') as f:
    f2 = f.readlines()
deleted = []
added = []
for l in f1:
    if l not in f2:
        deleted.append(l)
for l in f2:
    if l not in f1:
        added.append(l)
print('Deleted lines:')
print(''.join(deleted))
print('Added lines:')
print(''.join(added))
For every line in the original file, if that line isn't in the other file, that means the line has been deleted. If it's the other way around, that means the line has been added.
I am not sure how you would quantify a changed line (since you could count it as one deleted plus one added line), but perhaps something like the below would be of some aid. Note that if your files are large, it might be faster to store the data in a set instead of a list, since the former typically has a lookup time complexity of O(1), while the latter has O(n):
with open('file1.txt', 'r') as f1, open('file2.txt', 'r') as f2:
    file1 = set(f1.read().splitlines())
    file2 = set(f2.read().splitlines())
# lines only in the original file were deleted; lines only in the new file are new/changed
deleted_lines = [line for line in file1 if line not in file2]
changed_lines = [line for line in file2 if line not in file1]
print('Deleted lines:\n' + '\n'.join(deleted_lines))
print('Changed lines:\n' + '\n'.join(changed_lines))
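If you also want to see changed lines reported as such (rather than as one deletion plus one addition), a minimal sketch using the standard library's difflib is shown below. This is a side note, not part of the answers above; it assumes file1.txt is the original and file2.txt is the modified file, as in the example:
import difflib

# difflib works on lists of lines, so read both files with readlines()
with open('file1.txt') as f1, open('file2.txt') as f2:
    old = f1.readlines()
    new = f2.readlines()

# Lines prefixed with '-' were removed from file1.txt, lines prefixed with '+'
# were added in file2.txt; a changed line appears as a '-'/'+' pair.
for line in difflib.unified_diff(old, new, fromfile='file1.txt', tofile='file2.txt'):
    print(line, end='')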

remove specific lines if some specific word is found

Let's say I have a text file that contains the following words:
a
b
c
d
e>
f
g
h
I>
j
Whenever I find a word that contains >, I would like to remove the two lines before it, and the line itself too.
For example, the output would be this.
a
b
f
j
Is it possible to achieve this? For a simple replace, I can do this:
with open('Final.txt', 'w') as f2:
    with open('initial.txt', 'r') as f1:
        for line in f1:
            f2.write(line.replace('>', ''))
But I am stuck on how to go back and delete the previous two lines, and also the line where the replace happens.
This is one approach using a simple iteration and list slicing.
Ex:
res = []
with open('initial.txt') as infile:
    for line in infile:
        if ">" in line:
            res = res[:-2]
        else:
            res.append(line)
with open('Final.txt', "w") as f2:
    for line in res:
        f2.write(line)
Output:
a
b
f
j
Use re.
Here I am assuming that your data is a flat list of lines.
import re
print(re.sub('.*\n.*\n.*>\n','',''.join(data)))
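As a hedged usage sketch of the re approach: the file names initial.txt and Final.txt come from the question, and data is assumed to be the flat list of lines produced by readlines() (the answer itself does not define it):
import re

with open('initial.txt') as infile:
    data = infile.readlines()  # a flat list of lines, as assumed above

# Each match covers the two lines before a line containing '>' plus that line itself.
cleaned = re.sub('.*\n.*\n.*>\n', '', ''.join(data))

with open('Final.txt', 'w') as outfile:
    outfile.write(cleaned)  # a, b, f, j for the sample input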

How to skip a lot of text or values in 2 files and do another task with the data

In the following code I wanted to skip over the unneeded content (a lot of unusable content in the files ex1.idl and ex2.idl) and get to the data I want to work with. This data begins at line 905 of each file. Snippets of the files are:
ex1.idl
0.11158E-13 0.11195E-13 0.11233E-13 ...
ex2.idl
0.11010E-13 0.11070E-13 0.11117E-13 ...
I can successfully skip the unneeded values. I can also do some splitting, slicing and calculating. But when I combine the two, the code does not seem to work. The following is the combined code that I have:
with open('ex1.idl') as f1, open('ex2.idl') as f2:
    with open('ex3.txt', 'w') as f3:
        a = 905  # the first part
        f1 = f1.readlines(905:)[a-1:]  # the first part
        f2 = f2.readlines(905:)[a-1:]  # the first part
        f1 = map(float, f1.read().strip().split())  # the second part
        f2 = map(float, f2.read().strip().split())  # the second part
        for result in map(lambda v: v[0]/v[1], zip(f1, f2)):  # the second part
            f3.write(str(result) + "\n")  # the second part
This is the code where I just read the data and do the splitting and calculating alone. This works:
with open('primer1.idl') as f1, open('primer2.idl') as f2:
    with open('primer3.txt', 'w') as f3:
        f1 = map(float, f1.read().strip().split())
        f2 = map(float, f2.read().strip().split())
        for result in map(lambda v: v[0]/v[1], zip(f1, f2)):
            f3.write(str(result) + "\n")
So I only want to add that the program starts the reading and computing at line 905.
Thanks in advance for the answer.
I have done some work here and found that this also works:
with open('ex1.idl') as f1, open('ex2.idl') as f2:
    with open('ex3.txt', 'w') as f3:
        start_line = 905  # reading from this line forward
        for i in range(start_line - 1):
            next(f1)
            next(f2)
        f1 = list(map(float, f1.read().split()))
        f2 = list(map(float, f2.read().split()))
        for result in map(lambda v: v[0]/v[1], zip(f1, f2)):
            f3.write(str(result) + '\n')
Try this:
from itertools import islice

with open('ex1.idl') as f1, open('ex2.idl') as f2:
    with open('ex3.txt', 'w') as f3:
        f1 = islice(f1, 905, None)  # skip first 905 lines
        f2 = islice(f2, 905, None)  # skip first 905 lines
        for f1_line, f2_line in zip(f1, f2):
            f1_vals = map(float, f1_line.strip().split())
            f2_vals = map(float, f2_line.strip().split())
            for result in map(lambda v: v[0]/v[1], zip(f1_vals, f2_vals)):
                f3.write(str(result) + "\n")
This skips the first 905 lines in each file. Then it zips the files together so that the corresponding lines of each file end up in the same tuple. You can then loop through this zipped data, split each line on whitespace, convert the values to floats and do the division.
Please note that if a line in the second file contains zeros, you will get a float division by zero exception. Please make sure you don't have zeros in that file.
If it is impossible to make sure that you do not have zeros in there, then you should handle that with a try-except and then skip:
with open('ex1.idl') as f1, open('ex2.idl') as f2:
    with open('ex3.txt', 'w') as f3:
        f1 = islice(f1, 905, None)  # skip first 905 lines
        f2 = islice(f2, 905, None)  # skip first 905 lines
        for f1_line, f2_line in zip(f1, f2):
            f1_vals = map(float, f1_line.strip().split())
            f2_vals = map(float, f2_line.strip().split())
            for v1, v2 in zip(f1_vals, f2_vals):
                try:
                    result = v1/v2
                    f3.write(str(result) + "\n")
                except ZeroDivisionError:
                    print("Encountered a value equal to zero in the second file. Skipping...")
                    continue
You can do whatever else you like, other than skipping. That's up to you.

How to compare 2 txt files in Python

I have written a program to compare file new1.txt with new2.txt; the lines which are in new1.txt but not in new2.txt have to be written to the difference.txt file.
Can someone please have a look and let me know what changes are required in the code below? It prints the same value multiple times.
file1 = open("new1.txt",'r')
file2 = open("new2.txt",'r')
NewFile = open("difference.txt",'w')
for line1 in file1:
for line2 in file2:
if line2 != line1:
NewFile.write(line1)
file1.close()
file2.close()
NewFile.close()
Here's an example using the with statement, supposing the files are not too big to fit in memory:
# Open 'new1.txt' as f1, 'new2.txt' as f2 and 'diff.txt' as outf
with open('new1.txt') as f1, open('new2.txt') as f2, open('diff.txt', 'w') as outf:
    # Read the lines from 'new2.txt' and store them in a set
    lines = set(f2.readlines())
    # Loop through each line in 'new1.txt'
    for line in f1:
        # If the line was not in 'new2.txt'
        if line not in lines:
            # Write the line to the output file
            outf.write(line)
The with statement simply closes the opened file(s) automatically. These two pieces of code are equivalent:
with open('temp.log', 'w') as temp:
    temp.write('Temporary logging.')
# equal to:
temp = open('temp.log', 'w')
temp.write('Temporary logging.')
temp.close()
Yet another way is to use two sets, but again this isn't too memory efficient. If your files are big, this won't work:
# Again, open the three files as f1, f2 and outf
with open('new1.txt') as f1, open('new2.txt') as f2, open('diff.txt', 'w') as outf:
    # Read the lines in 'new1.txt' and 'new2.txt'
    s1, s2 = set(f1.readlines()), set(f2.readlines())
    # `s1 - s2 | s2 - s1` returns the lines that appear in only one of the two sets
    # Now we simply loop through the different lines
    for line in s1 - s2 | s2 - s1:
        # And output all the different lines
        outf.write(line)
Keep in mind that this last approach might not keep the order of your lines.
For example, say you have
file1:
line1
line2
and file2:
line1
line3
line4
When you compare line1 and line3, they are not equal, so you write line1 to your output file; then you compare line1 and line4, again they are not equal, so again you write line1 to your output file... You need to stop comparing as soon as your condition is true: break out of the inner for loop, and use a helper variable to remember whether a match was found.
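A minimal sketch of that helper-variable idea, reusing the file names from the question (the flag name found is illustrative, not from the original answer):
with open("new1.txt") as file1, open("difference.txt", "w") as NewFile:
    for line1 in file1:
        found = False  # helper variable: set to True when a matching line is seen
        with open("new2.txt") as file2:
            for line2 in file2:
                if line2 == line1:
                    found = True
                    break  # stop scanning new2.txt once a match is found
        if not found:
            NewFile.write(line1)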
If your file is a big one, you could use this for-else method:
The else clause below the second for loop executes only when the second for loop completes without hitting break, that is, if there is no match.
Modification:
with open('new1.txt') as file1, open('diff.txt', 'w') as NewFile:
    for line1 in file1:
        with open('new2.txt') as file2:
            for line2 in file2:
                if line2 == line1:
                    break
            else:
                NewFile.write(line1)
For more on the for-else construct, see this Stack Overflow question on for-else.
It is because of your for loops.
If I understand correctly, you want to see which lines in file1 are not present in file2.
So for each line in file1, you have to check whether the same line appears in file2. But this is not what your code does: for each line in file1, you check every line in file2 (this is right), but each time the line in file2 is different from the line in file1, you print the line from file1! You should print the line from file1 only AFTER having checked ALL the lines in file2, to be sure the line does not appear even once.
It could look something like this:
file1 = open("new1.txt",'r')
file2 = open("new2.txt",'r')
NewFile = open("difference.txt",'w')
for line1 in file1:
if line1 not in file2:
NewFile.write(line1)
file1.close()
file2.close()
NewFile.close()
I always find that working with sets makes comparing two collections easier, especially because "does this collection contain this element" runs in O(1), and most nested loops can be reduced to a single loop (easier to read, in my opinion).
with open('test1.txt') as file1, open('test2.txt') as file2, open('diff.txt', 'w') as diff:
    s1 = set(file1)
    s2 = set(file2)
    for e in s1:
        if e not in s2:
            diff.write(e)
Your loop is executed multiple times. To avoid that, use this:
file1 = open("new1.txt",'r')
file2 = open("new2.txt",'r')
NewFile = open("difference.txt",'w')
for line1, line2 in izip(file1, file2):
if line2 != line1:
NewFile.write(line1)
file1.close()
file2.close()
NewFile.close()
Print to NewFile only after comparing line1 with all lines of file2:
lines2 = file2.readlines()  # read file2 once so it can be scanned for every line1
for line1 in file1:
    present = False
    for line2 in lines2:
        if line2 == line1:
            present = True
    if not present:
        NewFile.write(line1)
You can use basic set operations for this:
with open('new1.txt') as f1, open('new2.txt') as f2, open('diffs.txt', 'w') as diffs:
    diffs.writelines(set(f1).difference(f2))
set.difference() executes in O(n), where n is the number of lines in the first file. If you know that the second file is significantly smaller than the first, you can optimise with set.difference_update(), which is O(n) where n is the number of lines in the second file. For example:
with open('new1.txt') as f1, open('new2.txt') as f2, open('diffs.txt', 'w') as diffs:
    s = set(f1)
    s.difference_update(f2)
    diffs.writelines(s)

Comparing some data of a file with another file in python

I have a file f1 which contains some text, let's say "All is well".
In another file f2 I have maybe 100 lines, and one of them is "All is well".
Now I want to see if file f2 contains the content of file f1.
I will appreciate it if someone comes up with a solution.
Thanks
with open("f1") as f1,open("f2") as f2:
if f1.read().strip() in f2.read():
print 'found'
Edit:
As Python 2.6 doesn't support multiple context managers on a single line:
with open("f1") as f1:
with open("f2") as f2:
if f1.read().strip() in f2.read():
print 'found'
template = file('your_name').read()
for i in file('2_filename'):
    if template in i:
        print 'found'
        break
with open(r'path1', 'r') as f1, open(r'path2', 'r') as f2:
    t1 = f1.read()
    t2 = f2.read()
    if t1 in t2:
        print "found"
Using the other methods won't work if there's '\n' inside the string you want to search for.
fileOne = f1.readlines()  # f1 and f2 are already-opened file objects
fileTwo = f2.readlines()
Now fileOne and fileTwo are lists of the lines in the files; simply check:
if set(fileOne) <= set(fileTwo):
    print "file1 is in file2"
