Comparing some data of a file with another file in python - python

I have a file f1 that contains some text, say "All is well".
In another file f2 I have maybe 100 lines, and one of them is "All is well".
I want to check whether file f2 contains the content of file f1.
I would appreciate it if someone could suggest a solution.
Thanks

with open("f1") as f1, open("f2") as f2:
    if f1.read().strip() in f2.read():
        print('found')
Edit:
As Python 2.6 doesn't support multiple context managers in a single with statement:
with open("f1") as f1:
    with open("f2") as f2:
        if f1.read().strip() in f2.read():
            print('found')

# open() replaces the Python 2-only file() built-in
template = open('your_name').read().strip()
for line in open('2_filename'):
    if template in line:
        print('found')
        break

with open(r'path1', 'r') as f1, open(r'path2', 'r') as f2:
    t1 = f1.read()
    t2 = f2.read()
    if t1 in t2:
        print("found")
The line-by-line methods won't work if there's a '\n' inside the string you want to search for.

fileOne = f1.readlines()
fileTwo = f2.readlines()
Now fileOne and fileTwo are lists of the lines in the files, so simply check whether every line of file1 also occurs in file2 (this ignores order and repetition):
if set(fileOne) <= set(fileTwo):
    print("file1 is in file2")
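The answers above can be combined into one small helper. This is only a sketch, assuming both files fit in memory; `file_contains` and its two path parameters are hypothetical names, not something from the original answers. It strips the needle so that a trailing newline in f1 does not prevent a match, and because it searches the whole text it also copes with a needle that spans several lines.

```python
def file_contains(needle_path, haystack_path):
    """Return True if the stripped text of needle_path occurs in haystack_path."""
    with open(needle_path) as needle_file:
        needle = needle_file.read().strip()
    with open(haystack_path) as haystack_file:
        haystack = haystack_file.read()
    # An empty needle would match any text, so report it as "not found".
    return bool(needle) and needle in haystack
```

Calling `file_contains("f1", "f2")` then answers the original question with a single boolean.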

Related

Comparing contents of two txt.files for deleted lines or changes in python

I'm trying to compare two .txt files for changed or deleted lines. If a line was deleted I want to output the deleted line, and if it was changed I want to output the new line. I originally tried comparing line by line, but when something was deleted it wouldn't work for my purpose:
for line1 in f1:
    for line2 in f2:
        if line1 == line2:
            print("SAME", file=x)
        else:
            print(f"Original:{line1} / New:{line2}", file=x)
Then I tried not comparing line to line so I could figure out if something was deleted but I'm not getting any output:
def check_diff(f1, f2):
    check = {}
    for file in [f1, f2]:
        with open(file, 'r') as f:
            check[file] = []
            for line in f:
                check[file].append(line)
    diff = set(check[f1]) - set(check[f2])
    for line in diff:
        print(line.rstrip(), file=x)
I tried combining a lot of other questions previously asked similar to my problem to get this far, but I'm new to python so I need a little extra help. Thanks! Please let me know if I need to add any additional information.
The concept is simple. Let's say file1.txt is the original file, and file2.txt is the one we want to inspect for changes:
with open('file1.txt', 'r') as f:
    f1 = f.readlines()
with open('file2.txt', 'r') as f:
    f2 = f.readlines()
deleted = []
added = []
for l in f1:
    if l not in f2:
        deleted.append(l)
for l in f2:
    if l not in f1:
        added.append(l)
print('Deleted lines:')
print(''.join(deleted))
print('Added lines:')
print(''.join(added))
For every line in the original file, if that line isn't in the other file, the line has been deleted. If it's the other way around, the line has been added.
I am not sure how you would quantify a changed line (since you could count it as one deleted plus one added line), but perhaps something like the below would be of some aid. Note that if your files are large, it might be faster to store the data in a set instead of a list, since a set lookup is typically O(1) on average while a list lookup is O(n):
with open('file1.txt', 'r') as f1, open('file2.txt', 'r') as f2:
    file1 = set(f1.read().splitlines())
    file2 = set(f2.read().splitlines())
# Lines only in the original file were deleted (or are the old version of a changed line)
deleted_lines = [line for line in file1 if line not in file2]
# Lines only in the new file were added (or are the new version of a changed line)
changed_lines = [line for line in file2 if line not in file1]
print('Changed lines:\n' + '\n'.join(changed_lines))
print('Deleted lines:\n' + '\n'.join(deleted_lines))
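If the position of lines matters, the standard library's difflib can separate deletions from changes. The following is a sketch rather than a definitive answer: `classify_changes` is a hypothetical helper name, and pairing old and new lines positionally inside a "replace" block is an assumption about what counts as a "changed" line.

```python
import difflib


def classify_changes(old_lines, new_lines):
    """Split the differences between two line lists into deleted, added and changed."""
    deleted, added, changed = [], [], []
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == 'delete':
            deleted.extend(old_lines[i1:i2])
        elif tag == 'insert':
            added.extend(new_lines[j1:j2])
        elif tag == 'replace':
            old_block, new_block = old_lines[i1:i2], new_lines[j1:j2]
            # Pair lines positionally; leftovers count as deleted or added.
            paired = min(len(old_block), len(new_block))
            changed.extend(zip(old_block[:paired], new_block[:paired]))
            deleted.extend(old_block[paired:])
            added.extend(new_block[paired:])
    return deleted, added, changed
```

Each entry of `changed` is an (original, new) pair, so printing the second element gives the new line that the question asks for.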

How to compare 2 txt files in Python

I have written a program to compare file new1.txt with new2.txt; the lines that are in new1.txt but not in new2.txt should be written to the file difference.txt.
Can someone please have a look and let me know what changes are required in the code below? It writes the same value multiple times.
file1 = open("new1.txt", 'r')
file2 = open("new2.txt", 'r')
NewFile = open("difference.txt", 'w')
for line1 in file1:
    for line2 in file2:
        if line2 != line1:
            NewFile.write(line1)
file1.close()
file2.close()
NewFile.close()
Here's an example using the with statement, supposing the files are not too big to fit in memory:
# Open 'new1.txt' as f1, 'new2.txt' as f2 and 'diff.txt' as outf
with open('new1.txt') as f1, open('new2.txt') as f2, open('diff.txt', 'w') as outf:
    # Read the lines from 'new2.txt' and store them into a python set
    lines = set(f2.readlines())
    # Loop through each line in 'new1.txt'
    for line in f1:
        # If the line was not in 'new2.txt'
        if line not in lines:
            # Write the line to the output file
            outf.write(line)
The with statement simply closes the opened file(s) automatically when the block ends. These two pieces of code do the same thing when no error occurs:
with open('temp.log', 'w') as temp:
    temp.write('Temporary logging.')
# equivalent to:
temp = open('temp.log', 'w')
temp.write('Temporary logging.')
temp.close()
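One difference is worth spelling out: with also closes the file when the body raises an exception, so a closer hand-written equivalent uses try/finally. A minimal sketch, where `write_log` is a hypothetical helper name used only for illustration:

```python
def write_log(path, message):
    """Write message to path and make sure the file is closed afterwards."""
    temp = open(path, 'w')
    try:
        temp.write(message)
    finally:
        # Runs whether or not write() raised, like leaving a with block.
        temp.close()
    return temp.closed
```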
Yet another way uses two sets, but this again isn't very memory efficient. If your files are big, this won't work:
# Again, open the three files as f1, f2 and outf
with open('new1.txt') as f1, open('new2.txt') as f2, open('diff.txt', 'w') as outf:
    # Read the lines in 'new1.txt' and 'new2.txt'
    s1, s2 = set(f1.readlines()), set(f2.readlines())
    # `s1 - s2 | s2 - s1` returns the lines that appear in only one of the two sets
    # Now we simply loop through the different lines
    for line in s1 - s2 | s2 - s1:
        # And output all the different lines
        outf.write(line)
Keep in mind that this last piece of code might not keep the order of your lines.
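If the order matters, the sets can be used only for lookups while the output is built from the original lists. This is a sketch under the same "fits in memory" assumption; `ordered_symmetric_difference` is a hypothetical name, and listing the lines unique to the first file before those unique to the second is an arbitrary choice.

```python
def ordered_symmetric_difference(lines_a, lines_b):
    """Lines found in only one of the two lists, in their original order."""
    set_a, set_b = set(lines_a), set(lines_b)
    only_in_a = [line for line in lines_a if line not in set_b]
    only_in_b = [line for line in lines_b if line not in set_a]
    return only_in_a + only_in_b
```

The result can be passed straight to `outf.writelines(...)` in place of the loop over the set expression.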
For example, suppose you have
file1:
line1
line2
and file2:
line1
line3
line4
When you compare line1 with line3, they differ, so you write line1 to your output file; then you compare line1 with line4, which again are not equal, so you write line1 to your output file again... You need to break out of both for loops when your condition is true. You can use a helper variable to break out of the outer for.
If your file is a big one, you could use the for-else construct:
the else clause below the second for loop executes only when that loop completes without hitting break, that is, when there is no match.
Modification:
with open('new1.txt') as file1, open('diff.txt', 'w') as NewFile:
    for line1 in file1:
        with open('new2.txt') as file2:
            for line2 in file2:
                if line2 == line1:
                    break
            else:
                NewFile.write(line1)
For more on the for-else construct, see this Stack Overflow question: for-else
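The for-else behaviour can be seen in isolation with in-memory lists. This is only an illustrative sketch; `first_missing` is a hypothetical function, not part of the answer above.

```python
def first_missing(wanted, available):
    """Return the first item of wanted that is not in available, or None."""
    for item in wanted:
        for candidate in available:
            if candidate == item:
                break  # match found: the else clause below is skipped
        else:
            # The inner loop finished without break, so item has no match.
            return item
    return None
```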
It is because of your for loops.
If I understand correctly, you want to see which lines in file1 are not present in file2.
So for each line in file1, you have to check whether the same line appears in file2. But this is not what your code does: for each line in file1, you check every line in file2 (this is right), but each time the line in file2 differs from the line in file1, you write the line from file1! You should write the line from file1 only AFTER having checked ALL the lines in file2, to be sure it does not appear even once.
It could look something like this:
file1 = open("new1.txt", 'r')
file2 = open("new2.txt", 'r')
NewFile = open("difference.txt", 'w')
# Read file2 once; testing `in` against the open file itself would consume it
lines2 = file2.readlines()
for line1 in file1:
    if line1 not in lines2:
        NewFile.write(line1)
file1.close()
file2.close()
NewFile.close()
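One pitfall to keep in mind when testing membership against an open file: the `in` operator iterates over the file and consumes it, so a second test starts where the first one stopped. The sketch below shows this with `io.StringIO` standing in for a real file (an assumption made so the example needs no files on disk).

```python
import io

# A file-like object with three lines, standing in for an open text file.
fake_file = io.StringIO(u"a\nb\nc\n")

# The first test reads lines until it finds "b\n", consuming "a\n" and "b\n".
found_b = u"b\n" in fake_file
# "a\n" has already been consumed, so this test only sees "c\n".
found_a = u"a\n" in fake_file

# Reading the lines into a set first gives repeatable answers.
lines = set(io.StringIO(u"a\nb\nc\n"))
found_a_in_set = u"a\n" in lines
```

Here `found_b` is True, `found_a` is False even though the line exists, and `found_a_in_set` is True.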
I always find that working with sets makes comparing two collections easier, especially because "does this collection contain this" operations run in O(1), and most nested loops can be reduced to a single loop (easier to read, in my opinion).
with open('test1.txt') as file1, open('test2.txt') as file2, open('diff.txt', 'w') as diff:
    s1 = set(file1)
    s2 = set(file2)
    for e in s1:
        if e not in s2:
            diff.write(e)
Your inner loop is executed multiple times. To avoid that, compare the files line by line at the same position:
from itertools import izip  # Python 2; on Python 3 use the built-in zip

file1 = open("new1.txt", 'r')
file2 = open("new2.txt", 'r')
NewFile = open("difference.txt", 'w')
for line1, line2 in izip(file1, file2):
    if line2 != line1:
        NewFile.write(line1)
file1.close()
file2.close()
NewFile.close()
Write to NewFile only after comparing with all lines of file2:
for line1 in file1:
    file2.seek(0)  # rewind file2 before scanning it again
    present = False
    for line2 in file2:
        if line2 == line1:
            present = True
    if not present:
        NewFile.write(line1)
You can use basic set operations for this:
with open('new1.txt') as f1, open('new2.txt') as f2, open('diffs.txt', 'w') as diffs:
    diffs.writelines(set(f1).difference(f2))
According to this reference, this will execute with O(n) where n is the number of lines in the first file. If you know that the second file is significantly smaller than the first you can optimise with set.difference_update(). This has complexity O(n) where n is the number of lines in the second file. For example:
with open('new1.txt') as f1, open('new2.txt') as f2, open('diffs.txt', 'w') as diffs:
    s = set(f1)
    s.difference_update(f2)
    diffs.writelines(s)

python, compare two files and get difference

I have two files: one is user input (f1) and the other is a database (f2). I want to check whether the strings from f1 are in the database (f2) and, if not, print the ones that don't exist in f2. My code is not working properly:
Here is f1:
rbs003491
rbs003499
rbs003531
rbs003539
rbs111111
Here is f2:
AHPTUR13,rbs003411
AHPTUR13,rbs003419
AHPTUR13,rbs003451
AHPTUR13,rbs003459
AHPTUR13,rbs003469
AHPTUR13,rbs003471
AHPTUR13,rbs003479
AHPTUR13,rbs003491
AHPTUR13,rbs003499
AHPTUR13,rbs003531
AHPTUR13,rbs003539
AHPTUR13,rbs003541
AHPTUR13,rbs003549
AHPTUR13,rbs003581
In this case it should return rbs111111, because it is not in f2.
Code is:
with open(c, 'r') as f1:
    s1 = set(x.strip() for x in f1)
    print(s1)
with open("/tmp/ARNE/blt", 'r') as f2:
    for line in f2:
        if line not in s1:
            print(line)
If you only care about the second part of each line (rbs003411 from AHPTUR13,rbs003411):
with open(user_input_path) as f1, open('/tmp/ARNE/blt') as f2:
    not_found = set(f1.read().split())
    for line in f2:
        _, found = line.strip().split(',')
        not_found.discard(found)  # remove found word
print(not_found)
# for x in not_found:
#     print(x)
Your line variable in the for loop will contain something like "AHPTUR13,rbs003411", but you are only interested in the second part. You should do something like:
for line in f2:
    line = line.strip().split(",")[1]
    if line not in s1:
        print(line)
You need to check only the last part of each line, not the whole line: split each line from f2 on ',' and take the last part (x.strip().split(',')[-1]). Also, if you want to check whether the strings from f1 are in the database (f2), your logic is reversed: you need to build your set from f2:
with open(c, 'r') as f1, open("/tmp/ARNE/blt", 'r') as f2:
    s1 = set(x.strip().split(',')[-1] for x in f2)
    print(s1)
    for line in f1:
        if line.strip() not in s1:
            print(line.strip())
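The same idea can be tried on in-memory data before pointing it at real files. This is a sketch: `missing_ids` is a hypothetical helper, and it assumes, as in the sample data, that the identifier is the last comma-separated field of each database line.

```python
def missing_ids(user_lines, database_lines):
    """Return the user identifiers that do not occur in the database lines."""
    known = set()
    for line in database_lines:
        line = line.strip()
        if line:
            # Keep only the last comma-separated field, e.g. "rbs003491".
            known.add(line.split(',')[-1])
    cleaned = (entry.strip() for entry in user_lines)
    return [entry for entry in cleaned if entry and entry not in known]
```

Because open files are iterables of lines, `missing_ids(f1, f2)` should also work on the two file objects from the question.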

Fixing a CSV using Python

I am trying to clean up the formatting of a CSV file in order to import it into a database, and I'm using the following to edit it:
f1 = open('visit_summary.csv', 'r')
f2 = open('clinics.csv', 'w')
for line in f1:
    f2.write(line.replace('Calendar: ', ''))
f1.close()
f2.close()
This works fine if there is only one edit to make. However, I have to repeat this code 19 times to make all the required changes, opening and closing each file several times and keeping multiple placeholder files for the intermediate steps between the first and last edit. Is there a simpler way to do this? I tried adding more "f2.write(line.replace"... lines, but this creates a final file with duplicated lines, each of which has only one edit. I think I see my problem (I am writing each line multiple times, once per edit), but I cannot seem to find a solution. I am very new to Python and am teaching myself, so any help or pointers to better resources would be appreciated.
There's no reason you can't do lots of things to the line before you write it:
with open('visit_summary.csv', 'r') as f1, open('clinics.csv', 'w') as f2:
    for line in f1:
        line = line.replace('Calendar: ', '')
        line = line.replace('Something else', '')
        f2.write(line)
(I also replaced open, close with the with statement)
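With 19 edits, it may be tidier to keep the replacements in a list and loop over them. A sketch, in which the pairs in `REPLACEMENTS` and the helper name `clean_line` are placeholders for illustration, not the asker's real strings:

```python
# Hypothetical (old, new) pairs; they are applied in this order.
REPLACEMENTS = [
    ('Calendar: ', ''),
    ('Something else', ''),
]


def clean_line(line, replacements=REPLACEMENTS):
    """Apply every replacement to the line and return the result."""
    for old, new in replacements:
        line = line.replace(old, new)
    return line
```

Inside the loop over the input file, `f2.write(clean_line(line))` then writes each line exactly once with all edits applied.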
f1 = open('visit_summary.csv', 'r')
f2 = open('clinics.csv', 'w')
for line in f1:
    f2.write(line.replace('Calendar: ', '').replace('String2', '').replace('String3', ''))
f1.close()
f2.close()
Will this work? Although I don't think it's very "pythonic". In this case, you have to be careful about the order of the replacements!
import csv

# Python 3: newline='' lets the csv module handle line endings itself
with open('visit_summary.csv', 'r', newline='') as src, open('clinics.csv', 'w', newline='') as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst, delimiter=",", quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for row in reader:
        # Apply every replacement to each field, then write the row once
        writer.writerow([field.replace('Calendar: ', '').replace('string', '') for field in row])

How to delete a line in file1 that appeared once or multiple times in file2 in python?

I have two text files: file1 has 40 lines and file2 has 1.3 million lines.
I would like to compare every line in file1 with file2.
If a line in file1 appears once or multiple times in file2,
that line should be deleted from file2 and the remaining lines of file2 written to a third file, file3.
I could painfully delete one line of file1 from file2 by manually copying the line,
indicated as "unwanted_line" in my code. Does anyone know how to do this in Python?
Thanks in advance for your assistance.
Here's my code:
import os

fname = open(raw_input('Enter input filename: '))  # file2
outfile = open('Value.txt', 'w')
unwanted_line = "222"  # This is in file1
for line in fname.readlines():
    if not unwanted_line in line:
        # now remove unwanted_line from fname
        data = line.strip("unwanted_line")
        # write it to the output file
        outfile.write(data)
print 'results written to:\n', os.getcwd()+'\Value.txt'
NOTE:
This is how I got it to work. I would like to thank everyone who contributed to the solution; I took your ideas from here. I used set(): the intersection (common lines) of file1 and file2 is removed, and the lines unique to file2 are written to file3. It might not be the most elegant way of doing it, but it works for me. I respect every one of your ideas; they are great and wonderful, and they make me feel Python is the only programming language in the whole world.
Thanks guys.
def diff_lines(filenameA, filenameB):
    fnameA = set(filenameA)
    fnameB = set(filenameB)
    data = []
    # identify lines not common to both files
    # diff_line = fnameB ^ fnameA
    diff_line = fnameA.symmetric_difference(fnameB)
    data = list(diff_line)
    data.sort()
    return data
Read file1; put the lines into a set or dict (it'll have to be a dict if you're using a really old version of Python); now go through file2 and say something like if line not in things_seen_in_file_1: outfile.write(line) for each line.
Incidentally, in recent Python versions you shouldn't bother calling readlines: an open file is an iterator and you can just say for line in open(filename2): ... to process each line of the file.
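Put together, the approach described above might look like the sketch below. The function name `filter_lines` and its parameters are hypothetical; it assumes the small file fits in memory, streams the large one line by line, and compares lines without their trailing newline so that a missing newline on the last line does not hide a match.

```python
def filter_lines(unwanted_path, source_path, output_path):
    """Copy source_path to output_path, skipping lines listed in unwanted_path.

    Returns the number of lines written.
    """
    with open(unwanted_path) as unwanted_file:
        unwanted = set(line.rstrip('\n') for line in unwanted_file)
    kept = 0
    with open(source_path) as source, open(output_path, 'w') as output:
        for line in source:  # the large file is never loaded whole
            if line.rstrip('\n') not in unwanted:
                output.write(line)
                kept += 1
    return kept
```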
Here is my version, but be aware that minuscule variations (like a space before the newline) can cause lines not to be considered the same.
file1, file2, file3 = 'verysmalldict.txt', 'uk.txt', 'not_small.txt'
drop_these = set(open(file1))
with open(file3, 'w') as outfile:
    outfile.write(''.join(line for line in open(file2) if line not in drop_these))
with open(path1) as f1:
    lines1 = set(f1)
with open(path2) as f2:
    lines2 = tuple(f2)
# Lines of file2 that also appear in file1, and those that do not
lines3 = [x for x in lines2 if x in lines1]
lines2 = [x for x in lines2 if x not in lines1]
with open(path2, 'w') as f2:
    f2.writelines(lines2)
with open(path3, 'w') as f3:
    f3.writelines(lines3)
Closing f2 by using two separate with statements is a matter of personal preference/design choice.
What you can do is load file1 completely into memory (since it is small) and check whether each line in file2 matches a line in file1. If it doesn't, write it to file3. Sort of like this:
file1 = open('file1')
file2 = open('file2')
file3 = open('file3', 'w')
lines_from_file1 = []
# Read in all lines from file1
for line in file1:
    lines_from_file1.append(line)
file1.close()
# Now iterate over lines of file2
for line2 in file2:
    keep_this_line = True
    for line1 in lines_from_file1:
        if line1 == line2:
            keep_this_line = False
            break  # break out of inner for loop
    if keep_this_line:
        # line from file2 is not in file1 so save it into file3
        file3.write(line2)
file2.close()
file3.close()
Maybe not the most elegant solution, but if you don't have to do it every 3 seconds, it should work.
EDIT: By the way, the question in the text differs somewhat from the title. I tried to answer the question in the text.
