Assume that I have two folders, folder 1 and folder 2, each containing 1000 text files.
Those two folders have text files with the same names, for example:
folder 1: ab.txt, bc.txt, cd.txt, ac.txt, etc.
folder 2: ab.txt, bc.txt, cd.txt, ac.txt, etc.
Each text file contains a bunch of numbers. Here is an example of the text inside the files; for example, ab.txt from folder 1 has:
5 0.796 0.440 0.407 0.399
24 0.973 0.185 0.052 0.070
3 0.91 0.11 0.12 0.1
and ab.txt from folder 2 has:
1 0.8 0.45 0.407 0.499
24 0.973 0.185 0.052 0.070
5 5.91 6.2 2.22 0.2
I want to read the text files inside those two folders and compare the first column of each pair of text files that have the same name (as shown above). For example, if the first columns of the two text files contain different numbers, I want to move that file from folder 1 to another folder called "output". Here is what I wrote. I can compare two text files, but how do I compare text files with the same name that are located in two different folders?
import difflib

with open(r'path to txt file in folder 1') as file_1:
    file_1_text = file_1.readlines()

with open(r'path to txt file in folder 2') as file_2:
    file_2_text = file_2.readlines()

# Find and print the diff:
for line in difflib.unified_diff(
        file_1_text, file_2_text, fromfile='file1.txt',
        tofile='file2.txt', lineterm=''):
    print(line)
You can create a list of all files in a folder with os.listdir().
folder1_files = os.listdir(folder_path1)
folder2_files = os.listdir(folder_path2)
Then you can iterate over both lists and check whether the file names are equal.
for file1 in folder1_files:
    for file2 in folder2_files:
        if file1 == file2:
            ...
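As a side note, here is a minimal sketch of an alternative that avoids the nested loop, assuming you only need the names that exist in both folders:

# Sketch: match file names via set intersection instead of a nested loop
common_files = set(folder1_files) & set(folder2_files)
for file_name in sorted(common_files):
    ...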
Comparing the first line is also not that difficult. Read the lines of both files and check if they are different.
file1_path = os.path.join(folder_path1, file1)
file2_path = os.path.join(folder_path2, file2)
file1_file = open(file1_path, 'r')
file2_file = open(file2_path, 'r')
file1_lines = file1_file.readlines()
file2_lines = file2_file.readlines()
if file1_lines[0] != file2_lines[0]:
    ...
I would either use shutil.move or shutil.copy to move/copy the files.
shutil.copy(file1_path, "output/" + file1)
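Since the question asks to move the files rather than copy them, a minimal sketch with shutil.move instead (assuming the "output" folder already exists):

import shutil

# Moves the file instead of copying it, so it disappears from folder 1
shutil.move(file1_path, "output/" + file1)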
Closing the file descriptors
Note
The Term "file descriptor" might Not be 100% accurate in this context because open() creates a file object not a file descriptor. The basis of a file object is a file descriptor so file.close() is closing the file descriptor but I still think you can't say it like that. Read more here: what is the difference between os.open and os.fdopen in python
file1_file.close()
file2_file.close()
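A minimal sketch of the same reads using a with statement, which closes both files automatically even if an exception is raised:

with open(file1_path, 'r') as file1_file, open(file2_path, 'r') as file2_file:
    file1_lines = file1_file.readlines()
    file2_lines = file2_file.readlines()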
All together in a function:
def compare_files(folder_path1, folder_path2):
    import os
    import shutil

    folder1_files = os.listdir(folder_path1)
    folder2_files = os.listdir(folder_path2)

    for file1 in folder1_files:
        for file2 in folder2_files:
            if file1 == file2:
                file1_path = os.path.join(folder_path1, file1)
                file2_path = os.path.join(folder_path2, file2)

                file1_file = open(file1_path, 'r')
                file2_file = open(file2_path, 'r')

                file1_lines = file1_file.readlines()
                file2_lines = file2_file.readlines()

                output_path = "output"
                if not os.path.exists(output_path):
                    os.makedirs(output_path)

                if file1_lines[0] != file2_lines[0]:
                    shutil.copy(file1_path, output_path + "/" + file1)

                file1_file.close()
                file2_file.close()


compare_files("folder1", "folder2")
If you want to compare the numbers so that e.g. 1 is treated as equal to 1.0, you can do the following.
l1 = file1_lines[0].split()
l2 = file2_lines[0].split()

for i in range(len(l1 if len(l1) < len(l2) else l2)):
    if float(l1[i]) != float(l2[i]):
        output_path = "output"
        if not os.path.exists(output_path):
            os.makedirs(output_path)
        shutil.copy(file1_path, output_path)
        break
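If the goal is to compare the whole first column rather than only the first line (my reading of the question, not something covered by the snippets above), a minimal sketch could look like this:

# Sketch: compare the first number on every line of both files
col1 = [float(line.split()[0]) for line in file1_lines if line.strip()]
col2 = [float(line.split()[0]) for line in file2_lines if line.strip()]

if col1 != col2:
    shutil.copy(file1_path, os.path.join("output", file1))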
Related
I am comparing 2 files with the same name in 2 directories as follows. Below is my pseudocode, and I have also included the code I wrote for this program.
dir1 = "/home/1"
dir2 = "/home/2"
loop through all the files in dir1 and dir2
if the name of file in dir1 is same as the name then:
for those particular files:
read all the lines in file1 in dir1 and file2 in dir2
#the file1.txt dir1 has below data( just an example my files data are different):
20 30 40
2 7 8
#file1.txt in dir2 has below data:
31 41 51
11 14 14
#I want to now compare these files in the following way:
#compare each line in file1.txt in dir with each line of
file1.txt in dir2
(i.e first line of file1.txt in dir1 with 1st, 2nd, 3rd..last
line of file1.txt in dir2....
second line with 1st,2nd,3rd...last line and so on)
If the difference of all the corresponding elements(i.e 20-
31<=10 and 30-41<=10 and 40-51<=10 then
print the 1st line of file1.txt of dir1)
do the same thing for each line of each files print out the
result.
This is my code:
dir1 = "/home/one"
dir2 = "/home/two"
for file in os.listdir(dir1):
file2 = os.path.join(dir2, file)
if os.path.exists(file2):
file1 = os.path.join(dir1, file)
print(file)
with open(file1, "r") as f1, open(file2, "r") as f2:
# how to do the comparison?
# how to compare first four element of line 1 of f1 with all the first four
#element of each line of f2 ?**
same = True
while True:
line1 = f1.readline().split()[:4]
print(line1)
line2 = f2.readline().split()[:4]
print(line2)
# one way to compare but is not very logical and thorough
if all(float(el1)-float(el2) < 11 for el1, el2 in zip(line2, line1)):
print(el1,el2)
same = True
if len(line1) == 0:
break
if same:
print(line1)
#print(el1, el2)
#print(line2)
else:
print("files are different")
I think I am missing something, because it doesn't print all the similar lines as it should.
If the input files are as follows:
file1.txt in dir1:
10 20 30
100 200 300
1000 2000 3000
file1.txt in dir2:
15 30 40
120 215 315
27 25 35
Expected output:
10 20 30
(as the 1st line of file1.txt in dir1 satisfies the condition only with the first line of file1.txt in dir2, where |10-15| <= 10, |20-30| <= 10 and |30-40| = 10 <= 10; take the absolute value of every difference)
This problem is fairly tricky as there are a number of requirements. If I understand correctly, they are:
get a list of files in first directory
for each file find file with same name in second directory
get corresponding line in paired files
compare corresponding value in each line
if all values in a line match criteria print line in first file
It can be easier to create such a script if you break the problem down into smaller problems and test each piece. Python allows you to do this with functions for each piece.
As an example you could break your problem down as follows:
from pathlib import Path


def get_matching_file(known_file, search_dir):
    f2_file = search_dir.joinpath(known_file.name)
    if f2_file.exists():
        return f2_file
    raise FileNotFoundError


def get_lines(file_loc):
    result = []
    lines = file_loc.read_text().splitlines()
    for row in lines:
        row_as = []
        for num in row.split():
            row_as.append(int(num))
        result.append(row_as)
    return result


def compare_content(f1_rows, f2_rows):
    result = []
    for row1 in f1_rows:
        for row2 in f2_rows:
            row_result = []
            for v1, v2 in zip(row1, row2):
                # print(f'{v2} - {v1} = {v2 - v1}')
                row_result.append(abs(v2 - v1) <= 10)
            if all(row_result):
                result.append(row1)
    return result


def print_result(result):
    for line in result:
        print(' '.join(str(int(num)) for num in line))


def main(dir1, dir2):
    for file_one in dir1.glob('*.txt'):
        file_two = get_matching_file(file_one, dir2)
        f1_content = get_lines(file_one)
        f2_content = get_lines(file_two)
        result = compare_content(f1_content, f2_content)
        print_result(result)


if __name__ == '__main__':
    main(dir1=Path(r'/Temp/text_compare/one'),
         dir2=Path(r'/Temp/text_compare/two'))
The above gave me the output you were looking for of:
10 20 30
Structuring the code this way then allows you to test small parts of your script at a time. For example if I wanted to just test the compare_content function I could change the code at the bottom to read:
if __name__ == '__main__':
    test_result = compare_content(f1_rows=[[10, 20, 30]],
                                  f2_rows=[[15, 30, 40]])
    print('Test 1:', test_result)

    test_result = compare_content(f1_rows=[[100, 200, 300]],
                                  f2_rows=[[120, 215, 315]])
    print('Test 2:', test_result)
which would give the output:
Test 1: [[10, 20, 30]]
Test 2: []
Or test reading values from file:
if __name__ == '__main__':
    test_result = get_lines(Path(r'/Temp/text_compare/one/log.txt'))
    print(test_result)
Gave the output of:
[[10, 20, 30], [100, 200, 300], [1000, 2000, 3000]]
This also allows you to optimize just parts of your script. For example, you might want to use Python's csv module to simplify the reading of the files. This would require the change of just one function. For example:
import csv


def get_lines(file_loc):
    with file_loc.open() as ssv:
        space_reader = csv.reader(ssv,
                                  delimiter=' ',
                                  quoting=csv.QUOTE_NONNUMERIC)
        return list(space_reader)
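Note that csv.QUOTE_NONNUMERIC converts every unquoted field to a float, so this version of get_lines returns floats rather than ints; print_result above already converts back with int(), so the printed result is unchanged. A quick check, using the same assumed path as before:

if __name__ == '__main__':
    print(get_lines(Path(r'/Temp/text_compare/one/log.txt')))
    # [[10.0, 20.0, 30.0], [100.0, 200.0, 300.0], [1000.0, 2000.0, 3000.0]]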
I have a text file (filename.txt) that contains file names together with their file extensions.
filename.txt
[AW] One Piece - 629 [1080P][Dub].mkv
EP.585.1080p.mp4
EP609.m4v
EP 610.m4v
One Piece 0696 A Tearful Reunion! Rebecca and Kyros!.mp4
One_Piece_0745_Sons'_Cups!.mp4
One Piece - 591 (1080P Funi Web-Dl -Ks-)-1.m4v
One Piece - 621 1080P.mkv
One_Piece_S10E577_Zs_Ambition_A_Great_and_Desperate_Escape_Plan.mp4
These are example file names with their extensions. I need to rename each file to its episode number (without changing its extension).
Example:
Input:
EP609.m4v
EP 610.m4v
EP.585.1080p.mp4
One Piece - 621 1080P.mkv
[AW] One Piece - 629 [1080P][Dub].mkv
One_Piece_0745_Sons'_Cups!.mp4
One Piece 0696 A Tearful Reunion! Rebecca and Kyros!.mp4
One Piece - 591 (1080P Funi Web-Dl -Ks-)-1.m4v
One_Piece_S10E577_Zs_Ambition_A_Great_and_Desperate_Escape_Plan.mp4
Expected Output:
609.m4v
610.m4v
585.mp4
621.mkv
629.mkv
745.mp4 (or) 0745.mp4
696.mp4 (or) 0696.mp4
591.m4v
577.mp4
Hope someone will help me parse and rename these filenames. Thanks in advance!!!
As you tagged python, I guess you are willing to use python.
(Edit: I've realized a loop in my original code is unnecessary.)
import re

with open('filename.txt', 'r') as f:
    files = f.read().splitlines()  # read filenames

# assume: an episode number consists of 3 digits, possibly preceded by a 0
p = re.compile(r'0?(\d{3})')

for file in files:
    if m := p.search(file):
        print(m.group(1) + '.' + file.split('.')[-1])
    else:
        print(file)
This will output
609.m4v
610.m4v
585.mp4
621.mkv
629.mkv
745.mp4
696.mp4
591.m4v
577.mp4
Basically, it searches for the first 3-digit number, possibly preceded by 0.
I strongly advise you to check the output; in particular, you would want to run sort OUTPUTFILENAME | uniq -d to see whether there are duplicate target names.
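If you would rather do that check in Python, here is a minimal sketch; new_names is an assumed list of the generated target names, e.g. ['609.m4v', '610.m4v', ...], collected from the loop above:

from collections import Counter

# new_names: the generated target file names collected from the loop above
duplicates = [name for name, count in Counter(new_names).items() if count > 1]
if duplicates:
    print('Duplicate target names:', duplicates)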
(Original answer:)
p = re.compile(r'\d{3,4}')

for file in files:
    for m in p.finditer(file):
        ep = m.group(0)
        if int(ep) < 1000:
            print(ep.lstrip('0') + '.' + file.split('.')[-1])
            break  # go to next file if ep found (avoid the else clause)
    else:  # if ep not found, just print the filename as is
        print(file)
Program to parse the episode number and rename the file.
Modules used:
re - to parse the file name
os - to rename the file
full/path/to/folder is the path to the folder where your files live
import re
import os

for file in os.listdir(path="full/path/to/folder/"):
    # search each file name for the first 3- or 4-digit number that is less than 1000
    for match_obj in re.finditer(r'\d{3,4}', file):
        episode = match_obj.group(0)
        if int(episode) < 1000:
            new_filename = episode.lstrip('0') + '.' + file.split('.')[-1]
            old_name = "full/path/to/folder/" + file
            new_name = "full/path/to/folder/" + new_filename
            os.rename(old_name, new_name)
            # go to the next file once the episode is found (avoid the else clause)
            break
    else:
        # if no episode number was found, just leave the filename as it is
        pass
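A suggestion on top of this (not part of the original answer): do a dry run first, printing the planned renames instead of calling os.rename, so you can spot name clashes before any file is touched. Inside the if block you could temporarily use:

# Dry run: print what would happen instead of renaming
print(old_name, '->', new_name)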
I have a dataset that is composed of 130 folders containing 32 photos each.
From each folder, I want to randomly copy the photos (26 for training, 3 for testing and 3 for validation) to the respective subfolder (001, 002, 003, ...) in the train, validation and test folders.
So I'll have something like this:
Train set
  001 (folder contains 26 photos)
  002
  003
  ....
Validation set
  001 (folder contains 3 photos)
  002
  003
  ....
Test set
  001 (folder contains 3 photos)
  002
  003
  ....
This is the code:
import os
import random
import shutil

n_photo_train = 26
n_photo_validation = 3
n_photo_test = 3

for idx in range(130):
    source = '/Users/john/photodb_original/{d:03d}'.format(d=(idx + 1))
    dest_train = '/Users/john/photodb_sets/Train/{d:03d}'.format(d=(idx + 1))
    dest_validation = '/Users/john/photodb_sets/Validation/{d:03d}'.format(d=(idx + 1))
    dest_test = '/Users/john/photodb_sets/Test/{d:03d}'.format(d=(idx + 1))

    files = random.choice(os.listdir(source))

    photo_train = files[:n_photo_train]
    photo_test = files[26:29]
    photo_val = files[29:]

    shutil.copyfile(os.path.join(source, photo_train), dest_train)
    shutil.copyfile(os.path.join(source, photo_val), dest_validation)
    shutil.copyfile(os.path.join(source, photo_test), dest_test)
I get this error: IsADirectoryError: [Errno 21] Is a directory: '/Users/john/photodb_original/001/'.
Did I use shutil.copyfile wrongly? And is there a way to write the code in a more compact and clearer way?
random.choice(os.listdir(source)) will only return a single element (one file name). When you then slice that string, you can end up with an empty string, and os.path.join with an empty string returns just the directory path, which causes your exception.
From your code it looks like you were aiming for random.shuffle. Note that shuffle mutates the list in place (and returns None), so your code should be split into two statements:
files = os.listdir(source)
random.shuffle(files)
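A minimal sketch of how the rest of the loop could then look, reusing source, dest_train, dest_validation and dest_test from the question (creating the destination folders first is an assumption on my part):

import os
import random
import shutil

files = os.listdir(source)
random.shuffle(files)

photo_train = files[:26]
photo_test = files[26:29]
photo_val = files[29:]

for dest, photos in [(dest_train, photo_train),
                     (dest_test, photo_test),
                     (dest_validation, photo_val)]:
    os.makedirs(dest, exist_ok=True)  # make sure the destination folder exists
    for photo in photos:
        shutil.copyfile(os.path.join(source, photo), os.path.join(dest, photo))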
I think you need to create the directories before copying files into them, or, when you get an exception about a missing directory, create it first and then try the copy again. Anyway, here is example code that I think does what you are looking for.
import os
from random import shuffle
from shutil import copyfile, rmtree

org = os.path.realpath('org')
trn = os.path.realpath('trn')
tst = os.path.realpath('tst')
val = os.path.realpath('val')

# How the split will be performed: 26 / 3 / 3
rnd = [trn] * 26 + [tst] * 3 + [val] * 3

# ignore_errors so the first run doesn't fail if the folders don't exist yet
rmtree(trn, ignore_errors=True)
rmtree(tst, ignore_errors=True)
rmtree(val, ignore_errors=True)
rmtree(org, ignore_errors=True)

# CREATE DUMMY DATA
for i in range(1, 131):
    d = os.path.join(org, "{:03d}".format(i))
    os.makedirs(d, exist_ok=True)
    for f in range(1, 33):
        f = os.path.join(d, "{:02d}".format(f))
        open(f, 'a').close()

# ACTUAL STUFF
for d in os.listdir(org):
    os.makedirs(os.path.join(trn, d))
    os.makedirs(os.path.join(tst, d))
    os.makedirs(os.path.join(val, d))
    files = os.listdir(os.path.join(org, d))
    shuffle(rnd)
    for f, trg in zip(files, rnd):
        scr = os.path.join(org, d, f)
        dst = os.path.join(trg, d, f)
        copyfile(scr, dst)
So as part of my code, I'm reading file paths that have varying names, but tend to stick to the following format
p(number)_(temperature)C
What I've done with those paths is separate them into 2 columns (along with 2 more columns of actual data), so I end up with a row that looks like this:
p2 18 some number some number
However, I've found a few folders that use the following format:
p(number number)_(temperature)C
As it stands, for the first case, I use the following code to separate the file path into the proper columns:
def finale():
    for root, dirs, files in os.walk('/Users/Bashe/Desktop/12/'):
        file_name = os.path.join(root, "Graph_Info.txt")
        file_name_out = os.path.join(root, "Graph.txt")
        file = os.path.join(root, "StDev.txt")
        if os.path.exists(os.path.join(root, "Graph_Info.txt")):
            with open(file_name) as fh, open(file) as th, open(file_name_out, "w") as fh_out:
                first_line = fh.readline()
                values = eval(first_line)
                for value, line in zip(values, fh):
                    first_column = value[0:2]
                    second_column = value[3:5]
                    third_column = line.strip()
                    fourth_column = th.readline().strip()
                    fh_out.write("%s\t%s\t%s\t%s\n" % (first_column, second_column, third_column, fourth_column))
        else:
            pass
I've played around with things and found that if I make the following changes, the program works properly.
first_column = value[0:3]
second_column = value[4:6]
Is there a way I can get the program to look and see what the file path is and act accordingly?
welcome to the fabulous world of regex.
import re

# ..........

# case 0: p(number), e.g. "p2_18C"
if re.match(r"p\d+_.*", path):
    # stuff
    pass
# case 1: p(number number), e.g. "p23 33_22C"
elif re.match(r"p\d+\s\d+_.*", path):
    # other stuff
    pass
>>> for line in s.splitlines():
...     first, second = re.search(r"p([0-9 ]+)_(\d+)C", line).groups()
...     print(first, " +", second)
...
22 + 66
33 44 + 44
23 33 + 22
Here is my question. I used os.walk to get all the file paths under a specific directory and stored the paths in a file like this:
/indexes/attachment/CCTBAU/CCTBAU-13/87009
/indexes/attachment/CCTBAU/CCTBAU-19/91961
/indexes/attachment/CCTBAU/CCTBAU-19/thumbs/_thumb_91961.png
/indexes/attachment/CCTBAU/CCTBAU-11/86413
/indexes/attachment/CCTBAU/CCTBAU-11/thumbs/_thumb_86412.png
/indexes/attachment/CCTBAU/CCTBAU-11/thumbs/_thumb_86413.png
/indexes/attachment/CCTBAU/CCTBAU-12/86614
/indexes/attachment/CCTBAU/CCTBAU-16/90240
/indexes/attachment/CCTBAU/CCTBAU-17/90241
/indexes/attachment/ACD/ACD-200/91345
/indexes/attachment/ACD/ACD-200/96305
/indexes/attachment/ACD/ACD-200/99169
/indexes/attachment/ACD/ACD-201/91344
/indexes/attachment/ACD/ACD-202/91346
/indexes/attachment/ACD/ACD-197/88916
/indexes/attachment/ACD/ACD-189/73799
/indexes/attachment/ACD/ACD-38/60709
/indexes/attachment/ACD/ACD-198/88918
Now I want to read all the paths from that file and reconstruct the file hierarchy from them, so that I end up with something like this:
index
|--attachment
|-----ACD
| |---ACD-200
| |---...
|
|-----CCTBAU
|----CCTBAU-13
|----...
Can anyone help me out with this? Thanks in advance!
I use os.listdir, and the code is as below:
import os


def PrintDir(dir, depth, prefix=' '):
    contents = os.listdir(dir)
    paths = [x for x in contents if os.path.isdir(os.path.join(dir, x))]
    files = [x for x in contents if x not in paths]
    if not paths and not files:
        return

    print(depth * prefix + '|----' + os.path.basename(dir) if depth != 0 else os.path.basename(dir))
    for subdir in paths:
        PrintDir(os.path.join(dir, subdir), depth + 1, prefix)
    for filename in files:
        print(depth * prefix + '|----' + filename)


# expand ~ so os.listdir gets a real path
PrintDir(os.path.expanduser('~/testdir'), 0)
You can also use os.walk to get what you want, as os.walk yields a tuple (root, dirs, files) for each directory it visits.
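A minimal sketch of that alternative, assuming the same testdir layout as below:

import os

start = os.path.expanduser('~/testdir')
for root, dirs, files in os.walk(start):
    # depth = how many path separators below the starting directory we are
    depth = root[len(start):].count(os.sep)
    indent = depth * '    '
    print(indent + ('|----' + os.path.basename(root) if depth else os.path.basename(root)))
    for filename in files:
        print(indent + '    |----' + filename)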
The test case is:
testdir/a/aa/aaa
testdir/b/bb/bbb
testdir/b/bb.txt
and aaa, bbb, bb.txt are files.
and the output is:
testdir
 |----a
  |----aa
  |----aaa
 |----b
  |----bb
  |----bbb
 |----bb.txt