Compare content of two files - python

I am comparing two files with the same name in two directories. This is my pseudocode, and I have also included the code I wrote for this program.

    dir1 = "/home/1"
    dir2 = "/home/2"
    loop through all the files in dir1 and dir2
        if a file in dir1 has the same name as a file in dir2:
            for that particular pair:
                read all the lines of file1 in dir1 and file2 in dir2

file1.txt in dir1 has the data below (just an example; my files' data are different):

    20 30 40
    2 7 8

file1.txt in dir2 has:

    31 41 51
    11 14 14

I now want to compare these files in the following way: compare each line of file1.txt in dir1 with every line of file1.txt in dir2 (i.e. the first line of file1.txt in dir1 against the 1st, 2nd, 3rd, ..., last line of file1.txt in dir2, then the second line against the 1st, 2nd, 3rd, ..., last line, and so on). If all the differences of corresponding elements are within 10 (i.e. |20-31| <= 10 and |30-41| <= 10 and |40-51| <= 10), then print that line of file1.txt in dir1. Do the same for each line of each file and print out the result.
This is my code:

    dir1 = "/home/one"
    dir2 = "/home/two"
    for file in os.listdir(dir1):
        file2 = os.path.join(dir2, file)
        if os.path.exists(file2):
            file1 = os.path.join(dir1, file)
            print(file)
            with open(file1, "r") as f1, open(file2, "r") as f2:
                # how to do the comparison?
                # how to compare the first four elements of line 1 of f1
                # with the first four elements of each line of f2?
                same = True
                while True:
                    line1 = f1.readline().split()[:4]
                    print(line1)
                    line2 = f2.readline().split()[:4]
                    print(line2)
                    # one way to compare, but it is not very logical or thorough
                    if all(float(el1) - float(el2) < 11 for el1, el2 in zip(line2, line1)):
                        print(el1, el2)
                        same = True
                    if len(line1) == 0:
                        break
                if same:
                    print(line1)
                    # print(el1, el2)
                    # print(line2)
                else:
                    print("files are different")
I think I am missing something, because it doesn't print all the similar lines as it should.
If the input files are as follows:

file1.txt in dir1:

    10 20 30
    100 200 300
    1000 2000 3000

file1.txt in dir2:

    15 30 40
    120 215 315
    27 25 35

Expected output:

    10 20 30

(the 1st line of file1.txt in dir1 satisfies the condition only with the 1st line of file1.txt in dir2, where |10-15| <= 10, |20-30| <= 10 and |30-40| = 10 <= 10; take the absolute value of every difference)
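The comparison rule described here (print a line from dir1 when every element is within 10 of the corresponding element of some line in dir2) boils down to a single predicate. A minimal sketch, with `lines_close` as a hypothetical helper name:

```python
def lines_close(row1, row2, tol=10):
    # True when the rows have equal length and every corresponding
    # pair of values differs by at most tol in absolute value.
    return len(row1) == len(row2) and all(abs(a - b) <= tol for a, b in zip(row1, row2))

print(lines_close([10, 20, 30], [15, 30, 40]))       # True: |diffs| are 5, 10, 10
print(lines_close([100, 200, 300], [120, 215, 315])) # False: the first diff is 20
```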

This problem is fairly tricky, as there are a number of requirements. If I understand correctly, they are:
get a list of files in first directory
for each file find file with same name in second directory
get corresponding line in paired files
compare corresponding value in each line
if all values in a line match criteria print line in first file
It can be easier to create such a script if you break the problem down into smaller problems and test each piece. Python allows you to do this with functions for each piece.
As an example you could break your problem down as follows:
    from pathlib import Path


    def get_matching_file(known_file, search_dir):
        f2_file = search_dir.joinpath(known_file.name)
        if f2_file.exists():
            return f2_file
        raise FileNotFoundError


    def get_lines(file_loc):
        result = []
        lines = file_loc.read_text().splitlines()
        for row in lines:
            row_as = []
            for num in row.split():
                row_as.append(int(num))
            result.append(row_as)
        return result


    def compare_content(f1_rows, f2_rows):
        result = []
        for row1 in f1_rows:
            for row2 in f2_rows:
                row_result = []
                for v1, v2 in zip(row1, row2):
                    # print(f'{v2} - {v1} = {v2 - v1}')
                    row_result.append(abs(v2 - v1) <= 10)
                if all(row_result):
                    result.append(row1)
        return result


    def print_result(result):
        for line in result:
            print(' '.join(str(int(num)) for num in line))


    def main(dir1, dir2):
        for file_one in dir1.glob('*.txt'):
            file_two = get_matching_file(file_one, dir2)
            f1_content = get_lines(file_one)
            f2_content = get_lines(file_two)
            result = compare_content(f1_content, f2_content)
            print_result(result)


    if __name__ == '__main__':
        main(dir1=Path(r'/Temp/text_compare/one'),
             dir2=Path(r'/Temp/text_compare/two'))
The above gave me the output you were looking for:

    10 20 30
Structuring the code this way then allows you to test small parts of your script at a time. For example if I wanted to just test the compare_content function I could change the code at the bottom to read:
    if __name__ == '__main__':
        test_result = compare_content(f1_rows=[[10, 20, 30]],
                                      f2_rows=[[15, 30, 40]])
        print('Test 1:', test_result)
        test_result = compare_content(f1_rows=[[100, 200, 300]],
                                      f2_rows=[[120, 215, 315]])
        print('Test 2:', test_result)
which would give the output:

    Test 1: [[10, 20, 30]]
    Test 2: []
Or test reading values from a file:

    if __name__ == '__main__':
        test_result = get_lines(Path(r'/Temp/text_compare/one/log.txt'))
        print(test_result)

which gave the output:

    [[10, 20, 30], [100, 200, 300], [1000, 2000, 3000]]
This also allows you to optimize individual parts of your script. For example, you might want to use Python's csv module to simplify the reading of the files, which would require changing just one function:
    import csv


    def get_lines(file_loc):
        with file_loc.open() as ssv:
            space_reader = csv.reader(ssv,
                                      delimiter=' ',
                                      quoting=csv.QUOTE_NONNUMERIC)
            return list(space_reader)
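One thing to note about `csv.QUOTE_NONNUMERIC`: the reader converts every unquoted field to `float`, so rows come back as floats rather than ints (which is why `print_result` above casts with `int`). A small self-contained sketch:

```python
import csv
import io

# QUOTE_NONNUMERIC tells the reader to convert unquoted fields to float,
# so "10 20 30" is parsed as [10.0, 20.0, 30.0], not as strings.
reader = csv.reader(io.StringIO("10 20 30\n100 200 300\n"),
                    delimiter=' ', quoting=csv.QUOTE_NONNUMERIC)
print(list(reader))  # [[10.0, 20.0, 30.0], [100.0, 200.0, 300.0]]
```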

Related

How do I compare two text files from different folders?

Assume that I have two folders with 1000 text files in them, for example, folder 1 and folder 2.
Those two folders have text files with the same names, for example:
folder 1: ab.txt, bc.txt, cd.txt, ac.txt, etc.
folder 2: ab.txt, bc.txt, cd.txt, ac.txt, etc.
Each text file contains a bunch of numbers. Here is an example of the text inside these files; ab.txt from folder 1 has:

    5 0.796 0.440 0.407 0.399
    24 0.973 0.185 0.052 0.070
    3 0.91 0.11 0.12 0.1

and ab.txt from folder 2 has:

    1 0.8 0.45 0.407 0.499
    24 0.973 0.185 0.052 0.070
    5 5.91 6.2 2.22 0.2
I want to read the text files inside those two folders and compare the first column of each pair of text files that share a name (as indicated above). For example, if the first columns of two paired text files contain different numbers, I want to move the file from folder_1 to another folder called "output". Here is what I wrote. I can compare two text files; however, how do I compare text files with the same name located in two different folders?
    import difflib

    with open(r'path to txt file') as file_1:
        file_1_text = file_1.readlines()
    with open(r'path to txt file') as file_2:
        file_2_text = file_2.readlines()

    # Find and print the diff:
    for line in difflib.unified_diff(
            file_1_text, file_2_text, fromfile='file1.txt',
            tofile='file2.txt', lineterm=''):
        print(line)
You can create a list of all files in a folder with os.listdir():

    folder1_files = os.listdir(folder_path1)
    folder2_files = os.listdir(folder_path2)
Then you can iterate over both lists and check whether the file names are equal:

    for file1 in folder1_files:
        for file2 in folder2_files:
            if file1 == file2:
                ...
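As an aside, the nested loop compares every name in folder 1 against every name in folder 2; a set intersection finds the shared names directly. A minimal sketch with hypothetical directory listings:

```python
# Hypothetical directory listings standing in for os.listdir() results.
folder1_files = ["ab.txt", "bc.txt", "cd.txt", "ac.txt"]
folder2_files = ["bc.txt", "cd.txt", "xy.txt"]

# Intersecting the two name sets finds the files present in both folders
# without the O(n*m) nested loop.
common = sorted(set(folder1_files) & set(folder2_files))
print(common)  # ['bc.txt', 'cd.txt']
```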
Comparing the first line is also not that difficult: read the lines of both files and check whether they differ.

    file1_path = os.path.join(folder_path1, file1)
    file2_path = os.path.join(folder_path2, file2)
    file1_file = open(file1_path, 'r')
    file2_file = open(file2_path, 'r')
    file1_lines = file1_file.readlines()
    file2_lines = file2_file.readlines()
    if file1_lines[0] != file2_lines[0]:
        ...
I would use either shutil.move or shutil.copy to move/copy the files:

    shutil.copy(file1_path, "output/" + file1)
Closing the files:

Note: the term "file descriptor" is not 100% accurate in this context, because open() creates a file object, not a file descriptor. A file object is built on top of a file descriptor, so file.close() does close the underlying descriptor, but the two are not the same thing. Read more here: what is the difference between os.open and os.fdopen in python.

    file1_file.close()
    file2_file.close()
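Alternatively, a `with` block closes the file objects automatically (even if an exception is raised), making the explicit `close()` calls unnecessary. A self-contained sketch using a temporary file:

```python
import os
import tempfile

# Create a throwaway file so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "w") as f:
    f.write("1 2 3\n")

# The with-block closes the file object as soon as the block exits,
# even if an exception is raised, so no explicit close() is needed.
with open(path) as file1_file:
    file1_lines = file1_file.readlines()

print(file1_lines)        # ['1 2 3\n']
print(file1_file.closed)  # True
```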
All together in a function:

    def compare_files(folder_path1, folder_path2):
        import os
        import shutil
        folder1_files = os.listdir(folder_path1)
        folder2_files = os.listdir(folder_path2)
        for file1 in folder1_files:
            for file2 in folder2_files:
                if file1 == file2:
                    file1_path = os.path.join(folder_path1, file1)
                    file2_path = os.path.join(folder_path2, file2)
                    file1_file = open(file1_path, 'r')
                    file2_file = open(file2_path, 'r')
                    file1_lines = file1_file.readlines()
                    file2_lines = file2_file.readlines()
                    output_path = "output"
                    if not os.path.exists(output_path):
                        os.makedirs(output_path)
                    if file1_lines[0] != file2_lines[0]:
                        shutil.copy(file1_path, output_path + "/" + file1)
                    file1_file.close()
                    file2_file.close()

    compare_files("folder1", "folder2")
If you want to compare the numbers so that, e.g., 1 is treated the same as 1.0, you can do the following:

    l1 = file1_lines[0].split()
    l2 = file2_lines[0].split()
    for i in range(len(l1 if len(l1) < len(l2) else l2)):
        if float(l1[i]) != float(l2[i]):
            output_path = "output"
            if not os.path.exists(output_path):
                os.makedirs(output_path)
            shutil.copy(file1_path, output_path)
            break
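The `range(len(...))` loop over whichever list is shorter can also be written with `zip`, which stops at the shorter sequence on its own. A sketch over the sample first lines:

```python
# Two hypothetical first lines from the example files, split into columns.
l1 = "5 0.796 0.440 0.407 0.399".split()
l2 = "1 0.8 0.45 0.407 0.499".split()

# zip() pairs up the columns and stops at the shorter line automatically.
differs = any(float(a) != float(b) for a, b in zip(l1, l2))
print(differs)  # True: the first columns (5 vs 1) already differ
```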

python regex: Parsing file name

I have a text file (filename.txt) that contains file names together with their extensions:

    [AW] One Piece - 629 [1080P][Dub].mkv
    EP.585.1080p.mp4
    EP609.m4v
    EP 610.m4v
    One Piece 0696 A Tearful Reunion! Rebecca and Kyros!.mp4
    One_Piece_0745_Sons'_Cups!.mp4
    One Piece - 591 (1080P Funi Web-Dl -Ks-)-1.m4v
    One Piece - 621 1080P.mkv
    One_Piece_S10E577_Zs_Ambition_A_Great_and_Desperate_Escape_Plan.mp4
These are example file names with their extensions. I need to rename each file to its episode number (without changing its extension).
Example:

Input:

    EP609.m4v
    EP 610.m4v
    EP.585.1080p.mp4
    One Piece - 621 1080P.mkv
    [AW] One Piece - 629 [1080P][Dub].mkv
    One_Piece_0745_Sons'_Cups!.mp4
    One Piece 0696 A Tearful Reunion! Rebecca and Kyros!.mp4
    One Piece - 591 (1080P Funi Web-Dl -Ks-)-1.m4v
    One_Piece_S10E577_Zs_Ambition_A_Great_and_Desperate_Escape_Plan.mp4

Expected output:

    609.m4v
    610.m4v
    585.mp4
    621.mkv
    629.mkv
    745.mp4 (or) 0745.mp4
    696.mp4 (or) 0696.mp4
    591.m4v
    577.mp4
Hope someone will help me parse and rename these filenames. Thanks in advance!!!
As you tagged python, I guess you are willing to use Python.

(Edit: I've realized a loop in my original code was unnecessary.)

    import re

    with open('filename.txt', 'r') as f:
        files = f.read().splitlines()  # read filenames

    # assume: an episode number comprises 3 digits, possibly preceded by a 0
    p = re.compile(r'0?(\d{3})')

    for file in files:
        if m := p.search(file):  # requires Python 3.8+ (walrus operator)
            print(m.group(1) + '.' + file.split('.')[-1])
        else:
            print(file)
This will output:

    609.m4v
    610.m4v
    585.mp4
    621.mkv
    629.mkv
    745.mp4
    696.mp4
    591.m4v
    577.mp4
Basically, it searches for the first 3-digit number, possibly preceded by 0.
I strongly advise you to check the output; in particular, you would want to run sort OUTPUTFILENAME | uniq -d to see whether there are duplicate target names.
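The duplicate check can also be done in Python instead of `sort | uniq -d`; a sketch over the output names above:

```python
from collections import Counter

# The renamed targets produced above.
new_names = ['609.m4v', '610.m4v', '585.mp4', '621.mkv', '629.mkv',
             '745.mp4', '696.mp4', '591.m4v', '577.mp4']

# Any target name appearing more than once would clobber a file on rename.
dupes = [name for name, count in Counter(new_names).items() if count > 1]
print(dupes)  # []: no collisions in this sample
```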
(Original answer:)

    p = re.compile(r'\d{3,4}')
    for file in files:
        for m in p.finditer(file):
            ep = m.group(0)
            if int(ep) < 1000:
                print(ep.lstrip('0') + '.' + file.split('.')[-1])
                break  # go to next file if ep found (avoid the else clause)
        else:  # if ep not found, just print the filename as is
            print(file)
A program to parse the episode number and rename the file.

Modules used:

- re - to parse the file name
- os - to rename the file

full/path/to/folder/ is the path to the folder where your files live.

    import re
    import os

    for file in os.listdir(path="full/path/to/folder/"):
        # search for the first 3- or 4-digit number less than 1000 in each name
        for match_obj in re.finditer(r'\d{3,4}', file):
            episode = match_obj.group(0)
            if int(episode) < 1000:
                new_filename = episode.lstrip('0') + '.' + file.split('.')[-1]
                old_name = "full/path/to/folder/" + file
                new_name = "full/path/to/folder/" + new_filename
                os.rename(old_name, new_name)
                # go to the next file once the episode is found (avoid the else clause)
                break
        else:
            # if no episode number is found, just leave the filename as it is
            pass

How do I print a range of lines after a specific pattern into separate files when this pattern appears several times in an input file

Sorry for my previous post; I had no idea what I was doing. I am trying to cut certain ranges of lines out of an input file and print each range to a separate file. The input file looks like:
    18
    generated by VMD
    C 1.514895 -3.887949 2.104134
    C 2.371076 -2.780954 1.718424
    C 3.561071 -3.004933 1.087316
    C 4.080424 -4.331872 1.114878
    C 3.289761 -5.434047 1.607808
    C 2.018473 -5.142150 2.078551
    C 3.997237 -6.725186 1.709355
    C 5.235126 -6.905640 1.295296
    C 5.923666 -5.844841 0.553037
    O 6.955216 -5.826197 -0.042920
    O 5.269004 -4.590026 0.590033
    H 4.054002 -2.184680 0.654838
    H 1.389704 -5.910354 2.488783
    H 5.814723 -7.796634 1.451618
    O 1.825325 -1.537706 1.986256
    H 2.319215 -0.796042 1.550394
    H 3.390707 -7.564847 2.136680
    H 0.535358 -3.663175 2.483943
    18
    generated by VMD
    C 1.519866 -3.892621 2.109595
I would like to print every 100th frame, starting from the first frame (frame 0), into its own file named "snapshot0.xyz", "snapshot1.xyz", and so on.

For example, the input above shows two snapshots. I would like to print lines 1-20 into their own file named snapshot0.xyz, then skip 100 snapshots (2000 lines) and print snapshot1.xyz (containing the 100th snapshot). My attempt was in Python, but answers in grep, awk, sed, or Python are all fine.

My input file is frames.dat. My attempt:
    #!/usr/bin/python

    mest = open('frames.dat', 'r')
    test = mest.read().strip().split('\n')

    for i in range(len(test)):
        if test[i] == '18':
            f = open("out" + `i` + ".dat", "w")
            for j in range(19):
                print >> f, test[j]
            f.close()
I suggest using the csv module for this input.

    import csv

    def strip_empty_columns(line):
        # list() is needed in Python 3, where filter() returns an iterator
        return list(filter(lambda s: s.strip() != "", line))

    def is_count(line):
        return len(line) == 1 and line[0].strip().isdigit()

    def is_float(s):
        try:
            float(s.strip())
            return True
        except ValueError:
            return False

    def is_data_line(line):
        return len(line) == 4 and is_float(line[1]) and is_float(line[2]) and is_float(line[3])

    with open('frames.dat', 'r') as mest:
        r = csv.reader(mest, delimiter=' ')
        frame_nr = 0
        outfile = None
        for line in r:
            line = strip_empty_columns(line)
            if is_count(line):
                if frame_nr % 100 == 0:
                    outfile = open("snapshot%d.xyz" % frame_nr, "w+")
                elif outfile:
                    outfile.close()
                    outfile = None
                # increment the frame counter every time this header line (like '18') appears
                frame_nr += 1
            elif is_data_line(line):
                if outfile:
                    outfile.write(" ".join(line) + "\n")
The opening post mentions writing every 100th frame to an output file named snapshot0.xyz. I assume the 0 should be a counter, or you would continuously overwrite the file. I updated the code with a frame_nr counter and a few lines that open/close an output file depending on frame_nr and write data while an output file is open.
This might work for you (GNU sed and csplit):

    sed -rn '/^18/{x;/x{100}/z;s/^/x/;x};G;/\nx$/P' file | csplit -f snapshot -b '%d.xyz' -z - '/^18/' '{*}'

Filter every 100th frame using sed and pass the result to csplit to create the individual files.
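Since each frame in this XYZ-style file announces its own atom count, another approach (a sketch, assuming every frame starts with a count line N followed by a comment line and N atom lines) is to slice frames with a small generator:

```python
from io import StringIO

def frames(handle):
    # Each frame: a count line N, a comment line, then N atom lines.
    while True:
        header = handle.readline()
        if not header.strip():
            return
        n = int(header)
        yield [header] + [handle.readline() for _ in range(n + 1)]

# A tiny two-frame sample with 2 atoms per frame, standing in for frames.dat.
sample = StringIO(
    "2\ngenerated by VMD\nC 0 0 0\nO 1 1 1\n"
    "2\ngenerated by VMD\nC 2 2 2\nO 3 3 3\n"
)
# Keep every 100th frame (frame 0, 100, 200, ...).
kept = [frame for i, frame in enumerate(frames(sample)) if i % 100 == 0]
print(len(kept))  # 1: only frame 0 of this two-frame sample qualifies
```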

Creating column data from multiple sources with varying formats in python

So as part of my code, I'm reading file paths that have varying names but tend to stick to the following format:

    p(number)_(temperature)C

What I've done with those paths is separate them into 2 columns (along with 2 more columns of actual data), so I end up with a row that looks like this:

    p2 18 some number some number

However, I've found a few folders that use the following format:

    p(number number)_(temperature)C
As it stands, for the first case I use the following code to separate the file path into the proper columns:

    def finale():
        for root, dirs, files in os.walk('/Users/Bashe/Desktop/12/'):
            file_name = os.path.join(root, "Graph_Info.txt")
            file_name_out = os.path.join(root, "Graph.txt")
            file = os.path.join(root, "StDev.txt")
            if os.path.exists(os.path.join(root, "Graph_Info.txt")):
                with open(file_name) as fh, open(file) as th, open(file_name_out, "w") as fh_out:
                    first_line = fh.readline()
                    values = eval(first_line)
                    for value, line in zip(values, fh):
                        first_column = value[0:2]
                        second_column = value[3:5]
                        third_column = line.strip()
                        fourth_column = th.readline().strip()
                        fh_out.write("%s\t%s\t%s\t%s\n" % (first_column, second_column, third_column, fourth_column))
            else:
                pass
I've played around with things and found that if I make the following changes, the program works properly:

    first_column = value[0:3]
    second_column = value[4:6]
Is there a way I can get the program to look and see what the file path is and act accordingly?
Welcome to the fabulous world of regex.

    import re

    # ...

    # case 0
    if re.match(r"p\(\d+\).*", path):
        pass  # stuff
    # case 1
    elif re.match(r"p\(\d+\s\d+\).*", path):
        pass  # other stuff
    >>> for line in s.splitlines():
    ...     first, second = re.search("p([0-9 ]+)_(\d+)C", line).groups()
    ...     print first, " +", second
    ...
    22 + 66
    33 44 + 44
    23 33 + 22
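The same pattern handles both observed path formats, because `[0-9 ]+` accepts a space inside the number group. A sketch with hypothetical names in the two formats from the question:

```python
import re

# Hypothetical folder names: p(number)_(temperature)C and
# p(number number)_(temperature)C.
for path in ["p2_18C", "p10 12_25C"]:
    first, second = re.search(r"p([0-9 ]+)_(\d+)C", path).groups()
    print(first, "+", second)
# Output:
# 2 + 18
# 10 12 + 25
```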

Testing each line in a file

I am trying to write a Python program that reads each line from an input file, which is a list of dates. I want to test each line with a function isValid(), which returns True if the date is valid and False if it is not. If the date is valid, it is written to an output file; if not, an "invalid" message is written instead. I have the function; all I want to know is the best way to test each line with it. I know this should be done with a loop, but I'm uncertain how to set up the loop to test each line of the file one by one.
Edit: I now have a program that basically works, but I'm getting strange results in the output file. Hopefully those with Python 3 experience can explain why.
    def main():
        datefile = input("Enter filename: ")
        t = open(datefile, "r")
        c = t.readlines()
        ofile = input("Enter filename: ")
        o = open(ofile, "w")
        for line in c:
            b = line.split("/")
            e = b[0]
            f = b[1]
            g = b[2]
            text = str(e) + " " + str(f) + ", " + str(g)
            text2 = "The date " + text + " is invalid"
            if isValid(e, f, g) == True:
                o.write(text)
            else:
                o.write(text2)

    def isValid(m, d, y):
        if m == 1 or m == 3 or m == 5 or m == 7 or m == 8 or m == 10 or m == 12:
            if d is range(1, 31):
                return True
        elif m == 2:
            if d is range(1, 28):
                return True
        elif m == 4 or m == 6 or m == 9 or m == 11:
            if d is range(1, 30):
                return True
        else:
            return False
This is the output I'm getting:

    The date 5 19, 1998
    is invalidThe date 7 21, 1984
    is invalidThe date 12 7, 1862
    is invalidThe date 13 4, 2000
    is invalidThe date 11 40, 1460
    is invalidThe date 5 7, 1970
    is invalidThe date 8 31, 2001
    is invalidThe date 6 26, 1800
    is invalidThe date 3 32, 400
    is invalidThe date 1 1, 1111
    is invalid
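For what it's worth, two details explain the run-together output above: `o.write()` does not append a newline (so writing `text + "\n"` would keep each result on its own line), and `d is range(1, 31)` tests object identity, which is always False here; membership needs `in`, and the fields from `line.split("/")` are strings, not ints. A hedged sketch of a corrected checker (note that `range` excludes its end point, so 31-day months need `range(1, 32)`; leap years are ignored, as in the original):

```python
def is_valid(m, d, y):
    # The fields arrive as strings from line.split("/"), so convert first.
    m, d = int(m), int(d)
    if m in (1, 3, 5, 7, 8, 10, 12):
        return d in range(1, 32)   # 31-day months
    if m == 2:
        return d in range(1, 29)   # February, ignoring leap years
    if m in (4, 6, 9, 11):
        return d in range(1, 31)   # 30-day months
    return False

print(is_valid("5", "19", "1998"))   # True
print(is_valid("13", "4", "2000"))   # False
```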
In the most recent versions of Python you can use the context-management features that are built in for files:

    results = list()
    with open(some_file) as f:
        for line in f:
            if isValid(line, date):
                results.append(line)

...or even more tersely with a list comprehension:

    with open(some_file) as f:
        results = [line for line in f if isValid(line, date)]

For progressively older versions of Python you might need to explicitly open and close the file (with simple implicit iteration over the file: for line in file:) or add more explicit iteration (f.readline(), or f.readlines() (plural), depending on whether you want to "slurp" in the entire file, with the memory overhead that implies, or iterate line by line).

Also note that you may wish to strip the trailing newlines off these lines, perhaps by calling line.rstrip('\n'), or just line.strip() if you want to remove all leading and trailing whitespace from each line.
(Edit based on an additional comment to a previous answer:)

The function signature isValid(m, d, y) suggests that you're passing a date to this function (month, day, year), but that doesn't make sense given that you must also, somehow, pass in the data to be validated (a line of text, a string, etc.). To help you further, you'll have to provide more information, preferably the source (or a relevant portion of the source) of this isValid() function.

In my initial answer I assumed that your isValid() function was merely scanning for any valid date in its single argument. I've modified my code examples to show how one might pass a specific date, as a single argument, to a function using the calling signature isValid(some_data, some_date).
    with open(fname) as f:
        for line in f.readlines():
            test(line)
