Comparing directories in Python - python

I have two directories that I want to compare and I want to find the following using Python (while ignoring the structure of each directory):
files with the same name, but different content
files with the same content, but different name
files with both unique content and name, that exist only in one directory but not the other
Is there a robust Python library to do this? I looked everywhere, but I can't find anything that can do all of the above. If possible, I would rather not write one from scratch, since that is potentially a very complex endeavour.
All I can do so far is make a list of files, but I'm utterly lost as to how to proceed from there.
from pathlib import Path
file_list = []
file_path = Path.cwd()
for file in file_path.rglob('*'):
    if file.is_file():
        file_list.append(file)

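filecmp.dircmp compares two directories, and its report() method prints the result of the comparison. report() covers only the top level of the two directories; report_full_closure() also descends into common subdirectories.
import filecmp
result = filecmp.dircmp('dir1', 'dir2')
result.report()
# Example output:
# diff dir1 dir2
# Only in dir1 : ['newfile.txt']
# Identical files : ['file1.txt']
# Differing files : ['file2.txt']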
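Since the question asks to ignore the directory structure, one possible approach is to index every file by name and by a content hash, then compare the two indexes. The following is only a minimal sketch under that assumption; the function names (file_digest, index_directory, compare_trees) are hypothetical and not part of any library, and SHA-256 is an arbitrary choice of hash.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def index_directory(root):
    """Map name -> digests and digest -> names, ignoring the subfolder layout."""
    by_name = defaultdict(set)
    by_digest = defaultdict(set)
    for path in Path(root).rglob("*"):
        if path.is_file():
            digest = file_digest(path)
            by_name[path.name].add(digest)
            by_digest[digest].add(path.name)
    return by_name, by_digest


def compare_trees(dir1, dir2):
    names1, digests1 = index_directory(dir1)
    names2, digests2 = index_directory(dir2)

    # Same name on both sides, but the sets of contents differ.
    same_name_diff_content = sorted(
        name for name in names1.keys() & names2.keys()
        if names1[name] != names2[name]
    )
    # Same content on both sides, but stored under different names.
    same_content_diff_name = sorted(
        (sorted(digests1[d]), sorted(digests2[d]))
        for d in digests1.keys() & digests2.keys()
        if digests1[d] != digests2[d]
    )

    def unique(names_a, names_b, digests_b):
        # Neither the name nor any of its contents occurs on the other side.
        return sorted(
            name for name, digests in names_a.items()
            if name not in names_b and digests.isdisjoint(digests_b)
        )

    return {
        "same_name_diff_content": same_name_diff_content,
        "same_content_diff_name": same_content_diff_name,
        "only_in_dir1": unique(names1, names2, digests2),
        "only_in_dir2": unique(names2, names1, digests1),
    }
```

Calling compare_trees('dir1', 'dir2') returns a dictionary with one entry per category from the question. Hashing reads every file completely, so this can be slow on large trees.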

Related

Find files with regex and their respective directory

I'm working on the 'C:\Documents' directory.
It has many subdirectories, and I need to find all the files whose names start with the prefix 'A0' and end with the extension '.xls', for example 'A0SSS.xls' or 'A0ASDF.xls'.
Is it possible to fetch all those files and get their directories?
For instance, if the file 'A0SSS.xls' is located in 'C:\Documents\Folder1', I need to know the file name (A0SSS.xls) along with its directory (C:\Documents\Folder1).
To find the paths of the matching files, run a recursive search with a filter. I recommend pathlib, because it makes it easy to get the parent folder of each match. The list of parent folders can contain duplicates if several matching files sit in the same folder. There are many ways to make a list unique in Python; one of them is to convert the list to a set, whose elements are unique by definition, and convert it back to a list.
from pathlib import Path
# Raw string, so the backslash is not treated as an escape character
search_path = Path(r"C:\Documents")
# The question asks for the '.xls' extension
results = list(search_path.rglob("A0*.xls"))
string_results = [str(matching_path) for matching_path in results]
containing_folders = [r.parent for r in results]
unique_folders = list(set(containing_folders))
print("matching files:")
for r in string_results:
    print(r)
print()
print("containing folders:")
for f in unique_folders:
    print(f)
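If each file name should stay paired with its folder, rather than ending up in two separate lists, the matches can be grouped by parent directory. This is a small sketch along the same lines; the function name group_matches_by_folder is hypothetical, and the default pattern is taken from the question.

```python
from collections import defaultdict
from pathlib import Path


def group_matches_by_folder(root, pattern="A0*.xls"):
    """Return {folder: [file names]} for every file under root matching pattern."""
    grouped = defaultdict(list)
    for path in Path(root).rglob(pattern):
        if path.is_file():
            grouped[path.parent].append(path.name)
    # Sort the names so the result does not depend on traversal order.
    return {folder: sorted(names) for folder, names in grouped.items()}
```

Iterating over the returned dictionary with .items() then gives each folder together with the matching file names it contains.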

Ignore all files other than specific type of file, for directory comparison in Python

I want to compare two directories for all ".bin" files in them. There can be some other extension type files such as ".txt", ".tar.bz2" in those directories. I want to get the common files as well as files which are not common.
I tried using filecmp.dircmp(), but I was not able to use its ignore parameter with a wildcard to ignore those files. Is there any solution I can use for this purpose?
Select the common subset of *.bin files in the two folders and remove the first part of the path (the folder name), then pass them to cmpfiles():
import filecmp
from pathlib import Path
dir1_files = [f.relative_to('folder1') for f in Path('folder1').glob('*.bin')]
dir2_files = [f.relative_to('folder2') for f in Path('folder2').glob('*.bin')]
common_files = set(dir1_files).intersection(dir2_files)
match, mismatch, error = filecmp.cmpfiles('folder1', 'folder2', common_files)
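If you want to avoid the preselection of common files, you can instead take the union of the two sets. cmpfiles() then reports the files that exist in only one of the folders in its third return value (the errors list):
all_files = set(dir1_files).union(dir2_files)
match, mismatch, error = filecmp.cmpfiles('folder1', 'folder2', all_files)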
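The pieces above can be combined into one helper that also reports which files exist on only one side. This is a sketch, not a definitive implementation: the function name compare_bin_files is hypothetical, and passing shallow=False is my own choice so that file contents are compared rather than only the os.stat() signatures.

```python
import filecmp
from pathlib import Path


def compare_bin_files(dir1, dir2, pattern="*.bin"):
    """Compare only the files matching pattern at the top level of both folders."""
    names1 = {p.name for p in Path(dir1).glob(pattern) if p.is_file()}
    names2 = {p.name for p in Path(dir2).glob(pattern) if p.is_file()}
    # Only names present on both sides can be compared by content.
    match, mismatch, errors = filecmp.cmpfiles(
        dir1, dir2, sorted(names1 & names2), shallow=False
    )
    return {
        "identical": match,
        "different": mismatch,
        "errors": errors,
        "only_in_dir1": sorted(names1 - names2),
        "only_in_dir2": sorted(names2 - names1),
    }
```

Files with other extensions, such as ".txt" or ".tar.bz2", never enter the name sets, so they are ignored without needing the ignore parameter of dircmp().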

Different File Paths in Python ZipFile Depending on .write() vs .writestr()

I just wanted to ask quickly whether the behavior I'm seeing in Python's zipfile module is expected. I wanted to put together a zip archive. For reasons I don't think I need to get into, I was adding some files using zipfile.writestr() and others using .write(). I was writing some files to a zip subdirectory called /scripts and others to a zip subdirectory called /data.
For /data, I originally did this:
for root, _, filenames in os.walk(tmpdirname):
    for root_name in filenames:
        print(f"Handle zip of {root_name}")
        name = os.path.join(root, root_name)
        name = os.path.normpath(name)
        zipFile.write(name, f'/data/{root_name}')
This worked fine and produced a working archive that I could extract. So far, so good. To write text files to the /script subdirectory, I used:
zipFile.writestr(f'/script/{scriptname}', fileBytes)
Again, so far so good.
Now it gets odd... I wanted to extract files in /data/. So I looked for paths in zipFile.namelist() starting with /data. My code kept missing the files in /data/, however. Doing some more digging, I noticed that the files written using .writestr had a slash at the start of the zipfile path like this: "/scripts/myscript.py". The files written using .write did not have a slash at the start of the path, so the data file paths looked like this: "data/mydata.pickle".
I changed my code to use .writestr() for the data files:
for root, _, filenames in os.walk(tmpdirname):
    for root_name in filenames:
        print(f"Handle zip of {root_name}")
        name = os.path.join(root, root_name)
        name = os.path.normpath(name)
        with open(name, mode='rb') as extracted_file:
            zipFile.writestr(f'/data/{root_name}', extracted_file.read())
Voila, the data files now have slashes at the start of the path. I'm not sure why, however, as I'm providing the same file path either way, and I wouldn't expect using one method versus another would change the paths.
Is this supposed to work this way? Am I missing something obvious here?
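The observation can be reproduced in isolation. The zipfile documentation describes the default arcname of write() as the file name "with leading path separators removed", and in CPython the same stripping appears to be applied to an explicitly passed arcname, while writestr() seems to store the name as given. The sketch below is built on that assumption; the helper names build_archive and members_under are hypothetical, and the file names are taken from the question. Comparing names with the leading slash stripped sidesteps the difference.

```python
import io
import os
import tempfile
import zipfile


def build_archive():
    """Write one member with write() and one with writestr(), then list the names."""
    buffer = io.BytesIO()
    with tempfile.TemporaryDirectory() as tmpdirname:
        source = os.path.join(tmpdirname, "mydata.pickle")
        with open(source, "wb") as handle:
            handle.write(b"payload")
        with zipfile.ZipFile(buffer, "w") as archive:
            archive.write(source, "/data/mydata.pickle")
            archive.writestr("/scripts/myscript.py", b"print('hello')")
    with zipfile.ZipFile(buffer) as archive:
        return archive.namelist()


def members_under(names, prefix):
    """Select members below prefix, ignoring any leading slash on either side."""
    wanted = prefix.strip("/") + "/"
    return [name for name in names if name.lstrip("/").startswith(wanted)]


names = build_archive()
print(names)
print(members_under(names, "/data"))
```

Writing every archive name without a leading slash in the first place avoids the mismatch altogether, which is also what the documentation recommends for archive names.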

Getting the absolute paths of all files in a folder, without traversing the subfolders

Let
my_dir = "/raid/user/my_dir"
be a folder on my filesystem, which is not the current folder (i.e., it's not the result of os.getcwd()). I want to retrieve the absolute paths of all files at the first level of hierarchy in my_dir (i.e., the absolute paths of all files which are in my_dir, but not in a subfolder of my_dir) as a list of strings absolute_paths. I need it, in order to later delete those files with os.remove().
This is nearly the same use case as
Get absolute paths of all files in a directory
but the difference is that I don't want to traverse the folder hierarchy: I only need the files at the first level of hierarchy (at depth 0? not sure about terminology here).
It's easy to adapt that solution: call os.walk() just once, and don't let it continue:
import os
root, dirs, files = next(os.walk(my_dir, topdown=True))
files = [os.path.join(root, f) for f in files]
print(files)
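You can use the os.path module and a list comprehension. Since os.listdir() returns bare names, join each name with my_dir before testing or resolving it; otherwise the names are interpreted relative to the current working directory:
import os
absolute_paths = [os.path.abspath(os.path.join(my_dir, f)) for f in os.listdir(my_dir) if os.path.isfile(os.path.join(my_dir, f))]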
You can use os.scandir(), which yields os.DirEntry objects that offer a variety of methods, including the ability to distinguish files from directories. Note that entry.path is absolute only if the path passed to scandir() is absolute.
import os
with os.scandir(my_dir) as it:
    paths = [entry.path for entry in it if entry.is_file()]
print(paths)
If you want to list directories as well, simply remove the condition from the list comprehension.
The documentation also has this note under os.listdir():
See also: The scandir() function returns directory entries along with file attribute information, giving better performance for many common use cases.
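The same result can also be obtained with pathlib, which the answers above do not use. The following is a minimal sketch; the function name top_level_files is hypothetical. Path.iterdir() does not descend into subfolders, and resolve() makes the base directory absolute even when a relative path is passed in.

```python
from pathlib import Path


def top_level_files(directory):
    """Return absolute paths (as strings) of the files directly inside directory."""
    base = Path(directory).resolve()
    return sorted(str(p) for p in base.iterdir() if p.is_file())
```

The returned strings can be passed straight to os.remove(), as the question intends.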

Getting the Folder Path of the last location I right clicked in Python

I'm using glob.glob to search a folder, and the subfolders therein, for all the invoices I have. To simplify that, I'm going to add the program to the context menu and have it take the path as the first part of the pattern:
import glob
for filename in glob.glob(path + "/**/*.pdf", recursive=True):
print(filename)
I'll have it keep the list and send those files to a Printer, in a later version, but for now just writing the name is a good enough test.
So my question is twofold:
Is there anything fundamentally wrong with the way I'm writing this?
Can anyone point me in the direction of how to actually capture the folder path and provide it as the path variable?
You should have a look at this question: Python script on selected file. It shows how to set up a "Send To" command in the context menu. This command calls a Python script and provides the selected file name via sys.argv[1]. I assume that also works for a directory.
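Assuming the context-menu entry really does pass the selected folder as the first command-line argument (the answer above only assumes this for directories), the script could look roughly like the sketch below. The names find_pdfs and main, and the script name in the usage message, are hypothetical. glob.escape() protects folder names that contain characters such as '[' from being read as wildcards.

```python
import glob
import os
import sys


def find_pdfs(path):
    """Return all PDF files in path and its subfolders, sorted by name."""
    pattern = os.path.join(glob.escape(path), "**", "*.pdf")
    return sorted(glob.glob(pattern, recursive=True))


def main(argv):
    # argv[1] is expected to be the folder handed over by the context menu.
    if len(argv) < 2 or not os.path.isdir(argv[1]):
        print("usage: find_pdfs.py FOLDER")
        return 1
    for filename in find_pdfs(argv[1]):
        print(filename)
    return 0


if __name__ == "__main__":
    main(sys.argv)
```

Building the pattern with os.path.join() instead of string concatenation avoids problems when the path already ends with a separator.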
I do not have Python 3.5, which is required for the recursive=True flag, so I prefer to provide a solution that you can run on any Python version (known to date).
The solution consists of calling os.walk() to explore the directories, together with the built-in set type.
It is better to use a set instead of a list, because with a list you would need more code to check whether the directory you want to add is already listed.
So basically you can keep two sets: one for the names of the files you want to print and the other for the directories and their subfolders.
You can adapt this solution to your class/method:
import os
path = '.'  # Any path you want
exten = '.pdf'
directories_list = set()
files_list = set()
# Loop over directories
for dirpath, dirnames, files in os.walk(path):
    for name in files:
        # Check if extension matches
        if name.lower().endswith(exten):
            files_list.add(name)
            directories_list.add(dirpath)
You can then loop over directories_list and files_list to print them out.
