Rsync filters in a python loop

After reading the man page on filtering rules and looking here: Using Rsync filter to include/exclude files
I don't understand why the code below doesn't work.
import subprocess, os
from ftplib import FTP

ftp_site = 'ftp.ncbi.nlm.nih.gov'
ftp = FTP(ftp_site)
ftp.login()
ftp.cwd('genomes/genbank/bacteria')
dirs = ftp.nlst()

for organism in dirs:
    latest = os.path.join(organism, "latest_assembly_versions")
    for path in ftp.nlst(latest):
        accession = path.split("/")[-1]
        fasta = accession + "_genomic.fna.gz"
        subprocess.call(['rsync',
                         '--recursive',
                         '--copy-links',
                         #'--dry-run',
                         '-vv',
                         '-f=+ ' + accession + '/*',
                         '-f=+ ' + fasta,
                         '-f=- *',
                         'ftp.ncbi.nlm.nih.gov::genomes/genbank/bacteria/' + latest,
                         '--log-file=scratch/test_dir/log.txt',
                         'scratch/' + organism])
I also tried '--exclude=*[^'+fasta+']' in place of '-f=- *', to try to exclude files that don't match fasta.
For each directory path within latest/*, I want the file that matches fasta exactly. There will always be exactly one file fasta in the directory latest/path.
EDIT: I am testing this with rsync version 3.1.0 and have seen incompatibility issues with earlier versions.
Here is a link to working code that you should be able to paste into a Python interpreter to get the results of a "dry run," which won't download anything onto your machine: http://pastebin.com/0reVKMCg. It gets EVERYTHING under 'ftp.ncbi.nlm.nih.gov::genomes/genbank/bacteria/'+latest, which is not what I want, and if I run that script with '-f=- *' uncommented, it doesn't get anything, which seems to contradict the answer here: Using Rsync filter to include/exclude files

This part of the rsync man page contained the info I needed to solve my problem:
Note that, when using the --recursive (-r) option (which is implied by -a), every subcomponent of every path is visited from the top down, so include/exclude patterns get applied recursively to each subcomponent's full name (e.g. to include "/foo/bar/baz" the subcomponents "/foo" and "/foo/bar" must not be excluded). The exclude patterns actually short-circuit the directory traversal stage when rsync finds the files to send. If a pattern excludes a particular parent directory, it can render a deeper include pattern ineffectual because rsync did not descend through that excluded section of the hierarchy. This is particularly important when using a trailing '*' rule. For instance, this won't work:

+ /some/path/this-file-will-not-be-found
+ /file-is-included
- *

This fails because the parent directory "some" is excluded by the '*' rule, so rsync never visits any of the files in the "some" or "some/path" directories. One solution is to ask for all directories in the hierarchy to be included by using a single rule: "+ */" (put it somewhere before the "- *" rule), and perhaps use the --prune-empty-dirs option. Another solution is to add specific include rules for all the parent dirs that need to be visited. For instance, this set of rules works fine:

+ /some/
+ /some/path/
+ /some/path/this-file-is-found
+ /file-also-included
- *
This helped me write the following code:
import os
import subprocess
from ftplib import FTP

def get_fastas(local_mirror="scratch/ncbi", bacteria="Escherichia_coli"):
    ftp_site = 'ftp.ncbi.nlm.nih.gov'
    ftp = FTP(ftp_site)
    ftp.login()
    ftp.cwd('genomes/genbank/bacteria')
    rsync_log = os.path.join(local_mirror, "rsync_log.txt")
    latest = os.path.join(bacteria, 'latest_assembly_versions')
    for parent in ftp.nlst(latest)[0:2]:
        accession = parent.split("/")[-1]
        fasta = accession + "_genomic.fna.gz"
        organism_dir = os.path.join(local_mirror, bacteria)
        subprocess.call(['rsync',
                         '--copy-links',
                         '--recursive',
                         '--itemize-changes',
                         '--prune-empty-dirs',
                         '-f=+ ' + accession,
                         '-f=+ ' + fasta,
                         '--exclude=*',
                         'ftp.ncbi.nlm.nih.gov::genomes/genbank/bacteria/' + parent,
                         organism_dir])
It turns out '-f=+ '+accession doesn't work with a * after a trailing /, although it does work with just a trailing / and no *.
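To illustrate the point from the man page excerpt, here is a sketch only (with a hypothetical accession name; the real names come from ftp.nlst()) of how the three include-rule variants interact with the final "exclude everything" rule:

# Sketch with a hypothetical accession name, only to show how the include rule
# interacts with the final '- *' / '--exclude=*' rule described in the man page above.
accession = "GCA_000005845.2_ASM584v2"

ok_1   = '-f=+ ' + accession          # includes the directory itself, so rsync descends into it
ok_2   = '-f=+ ' + accession + '/'    # a trailing slash also matches the directory; still works
broken = '-f=+ ' + accession + '/*'   # only the directory's contents are included; the directory
                                      # itself is caught by the exclude-all rule, so rsync never
                                      # descends and finds nothing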

Related

Python find broken symlinks caused by Git

I'm currently working on a script that is supposed to go through a cloned Git repo and remove any existing symlinks (broken or not) before recreating them for the user, to ease project initialization.
The solved problems:
Finding symbolic links (broken or unbroken) in Python is very well documented.
GitHub/GitLab breaking symlinks upon downloading/cloning/updating repositories is also very well documented as is how to fix this problem. Tl;dr: Git will download symlinks within the repo as plain text files (with no extension) containing only the symlink path if certain config flags are not set properly.
The unsolved problem:
My problem is that developers may download this repo without realizing the issues with Git, and end up with the symbolic links "checked out as small plain files that contain the link text", which is completely undetectable (as far as I can tell) when parsing the cloned files/directories via the existing base libraries. Running os.stat() on one of these broken symlink files returns information as though it were a normal text file:
os.stat_result(st_mode=33206, st_ino=14073748835637675, st_dev=2149440935, st_nlink=1, st_uid=0, st_gid=0, st_size=42, st_atime=1671662717, st_mtime=1671662699, st_ctime=1671662699)
The st_mode information only indicates that it is a normal text file: 100666 (the first three digits are the file type and the last three are the UNIX-style permissions). It should show up as 120000.
The os.path.islink() function only ever returns False.
THE CONFUSING PART is that when I run git ls-files -s ./microservices/service/symlink_file it gives 120000 as the mode bits, which, according to the documentation, indicates that this file is a symlink. However, I cannot figure out how to see this information from within Python.
I've tried a bunch of things to try and find and delete existing symlinks. Here's the base method that just finds symlink directories and then deletes them:
def clearsymlinks(cwd: str = ""):
    """
    Crawls starting from root directory and deletes all symlinks
    within the directory tree.
    """
    if not cwd:
        cwd = os.getcwd()
    print(f"Clearing symlinks - starting from {cwd}")
    # Create a queue
    cwd_dirs: list[str] = [cwd]
    while len(cwd_dirs) > 0:
        processing_dir: str = cwd_dirs.pop()
        # print(f"Processing {processing_dir}")  # Only when debugging - else it's too much output spam
        for child_dir in os.listdir(processing_dir):
            child_dir_path = os.path.join(processing_dir, child_dir)
            # check if current item is a directory
            if not os.path.isdir(child_dir_path):
                if os.path.islink(child_dir_path):
                    print(f"-- Found symbolic link file {child_dir_path} - removing.\n")
                    os.remove(child_dir_path)
                # skip the dir checking
                continue
            # Check if the child dir is a symlink
            if os.path.islink(child_dir_path):
                print(f"-- Found symlink directory {child_dir_path} - removing.")
                os.remove(child_dir_path)
            else:
                # Add the child dir to the queue
                cwd_dirs.append(child_dir_path)
After deleting symlinks I run os.symlink(symlink_src, symlink_dst) and generally run into the following error:
Traceback (most recent call last):
File "C:\Users\me\my_repo\remakesymlinks.py", line 123, in main
os.symlink(symlink_src, symlink_dst)
FileExistsError: [WinError 183] Cannot create a file when that file already exists: 'C:\\Users\\me\\my_repo\\SharedPy' -> 'C:\\Users\\me\\my_repo\\microservices\\service\\symlink_file'
A workaround specific to this error (inside the create-symlink method) is:
try:
    os.symlink(symlink_src, symlink_dst)
except FileExistsError:
    os.remove(symlink_dst)
    os.symlink(symlink_src, symlink_dst)
But this is not ideal because it doesn't prevent a huge list of defunct/broken symlinks from piling up in the directory. I should be able to find any symlinks (working, broken, non-existent, etc.) and then delete them.
I have a list of the symlinks that should be created by my script, but extracting the list of targets from it is another workaround that still causes a 'symlink leak'. Below is how I'm currently finding the broken symlinks, purely for testing purposes.
if not os.path.isdir(child_dir_path):
    if os.path.basename(child_dir_path) in [s.symlink_install_to for s in dirs_to_process]:
        print(f"-- Found symlink file {child_dir_path} - removing.")
        os.remove(child_dir_path)
    # skip the dir checking
    continue
A rudimentary solution is to filter for only 'text/plain' files with exactly one line (since checking anything else is pointless) and to try to determine whether that single line is just a file path (this seems excessive, though):
# Check if Git downloaded the symlink as a plain text file (undetectable broken symlink)
if not os.path.isdir(child_dir_path):
    try:
        if magic.Magic(mime=True, uncompress=True).from_file(child_dir_path) == 'text/plain':
            with open(child_dir_path, 'r') as file:
                for i, line in enumerate(file):
                    if i >= 1:
                        raise StopIteration
                    else:
                        # this function is directly copied from https://stackoverflow.com/a/34102855/8705841
                        if is_path_exists_or_creatable(line):
                            print(f"-- Found broken Git link file '{child_dir_path}' - removing.")
                            print(f"\tContents: \"{line}\"")
                            # os.remove(child_dir_path)
                            raise StopIteration
    except (StopIteration, UnicodeDecodeError, magic.MagicException):
        file.close()
        continue
Clearly this solution would require a lot of refactoring (9 levels of indentation is pretty ridiculous) if it's the only viable option. The problem with this solution (currently) is that it also tries to delete any single-line file whose string does not break pathname requirements, e.g. import _virtualenv, some random test string, project-name. Filtering out files with spaces, or files without slashes, could potentially work, but this still feels like chasing edge cases instead of solving the actual problem.
I could potentially rewrite this script in Bash wherein I could, in addition to existing symlink search/destroy code, parse the results from git ls-files -s ... and then delete any files with the 120000 tag. But is this method feasible and/or reliable? There should be a way to do this from within Python since Bash isn't going to run on every OS.
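For what it's worth, a minimal sketch of that idea in Python (assuming git is on PATH and the script runs inside the clone) would be to parse git ls-files -s with subprocess and remove anything recorded with mode 120000:

# Minimal sketch (not the final script): ask git itself which paths it recorded as
# symlinks (mode 120000) and remove the plain-text stand-ins it checked out.
import os
import subprocess

def find_git_symlinks(repo_root: str = ".") -> list[str]:
    # Each line of `git ls-files -s` looks like: "<mode> <object> <stage>\t<path>"
    output = subprocess.run(
        ["git", "ls-files", "-s"],
        cwd=repo_root, capture_output=True, text=True, check=True,
    ).stdout
    paths = []
    for line in output.splitlines():
        meta, _, path = line.partition("\t")
        mode = meta.split()[0]
        if mode == "120000":          # recorded as a symlink in the index
            paths.append(os.path.join(repo_root, path))
    return paths

for link in find_git_symlinks():
    if os.path.lexists(link):         # covers real symlinks and the plain-file stand-ins
        os.remove(link)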
Note: file names have been redacted/changed after copy-paste for privacy; they shouldn't matter anyway, since they are generated dynamically by the path-searching functions.

Os.path gives unexpected output

Lately I started working with the os module in Python, and I finally arrived at this os.path method. So here is my question: I ran this method in one of my Kivy projects just for testing, and it didn't return the correct output. The method is supposed to check whether a directory exists and, if it does, return a list of the image files in that directory; otherwise it prints "Invalid Path" and returns -1. I passed in an existing directory and it returned -1, but the weird part is that when I run a similar program outside of my Kivy project, using the same path and with the directory in the same folder as my Python file, it returns the desired output. Here is an image of the Python file and the directory I have tested, which returns Invalid Path.
And here is my code snippet:
def get_imgs(self, img_path):
    if not os.path.exists(img_path):
        print("Invalid Path...")
        return -1
    else:
        all_files = os.listdir(img_path)
        imgs = []
        for f in all_files:
            if (
                f.endswith(".png")
                or f.endswith(".PNG")
                or f.endswith(".jpg")
                or f.endswith(".JPG")
                or f.endswith(".jpeg")
                or f.endswith(".JPEG")
            ):
                imgs.append("/".join([img_path, f]))
        return imgs
It's tough to tell without seeing the code with your function call. Whatever argument you're passing must not be a valid path. I use the os module regularly and have slowly learned a lot of useful methods. I always print out the paths I'm reading from or writing to before doing it, so in case anything unexpected happens I can see that img_path variable, for example. Copy and paste the path into your file explorer up to the directory and make sure that's all good.
Some other useful os.path methods you will find useful, based on your code:
os.path.join(<directory>, <file_name.ext>) is much more intuitive than imgs.append("/".join([img_path, f])) (see the short sketch after this list)
os.getcwd() gets your working directory (which I print at the start of scripts in dev to quickly address issues before debugging). I typically use full paths to play it safe because Python pathing can cause differences/issues when running from cmd vs. PyCharm
os.path.basename(f) gives you the file, while os.path.dirname(f) gives you the directory.
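For example, a minimal sketch of those helpers in action (the directory and file names here are made up for illustration):

import os

# Hypothetical names, just to illustrate the helpers mentioned above
img_path = "assets/images"
f = "photo.png"

print(os.getcwd())                       # where the script is actually running from
full_path = os.path.join(img_path, f)    # joins the pieces with the OS's own separator
print(os.path.dirname(full_path))        # the directory part
print(os.path.basename(full_path))       # the file part, "photo.png"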
It seems like a better approach would be to use pathlib and glob. You can iterate over directories and use wildcards.
Look at these:
iterating over directories: How can I iterate over files in a given directory?
different file types: Python glob multiple filetypes
Then you don't even need to check whether os.path.exists(img_path), because this reads the files directly from your file system. There are also more wildcards in the glob library, such as * for anything of any length, ? for any single character, and [0-9] for any number, described here: https://docs.python.org/3/library/glob.html
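A minimal sketch of that approach (the directory name is hypothetical), matching several image extensions case-insensitively with pathlib:

from pathlib import Path

img_path = Path("assets/images")   # hypothetical directory

# Globbing is case-sensitive on most filesystems, so normalize the suffix
# instead of spelling out every upper/lower-case variant
extensions = {".png", ".jpg", ".jpeg"}
imgs = [p for p in img_path.glob("*") if p.is_file() and p.suffix.lower() in extensions]

for p in imgs:
    print(p)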

os.path.basename() is inconsistent and I'm not sure why

While creating a program that backs up my files, I found that os.path.basename() was not working consistently. For example:
import os

folder = '\\\\server\\studies\\backup\\backup_files'
os.path.basename(folder)    # returns 'backup_files'

folder = '\\\\server\\studies'
os.path.basename(folder)    # returns ''
I want that second basename call to return 'studies', but it returns an empty string. I ran os.path.split(folder) to see how it splits the string, and it turns out it considers the entire path to be the directory, i.e. ('\\\\server\\studies', '').
I can't figure out how to get around it.. The weirdest thing is I ran the same line earlier and it worked, but it won't anymore! Does it have something to do with the very first part being a shared folder on the network drive?
That looks like a Windows UNC path specificity.
UNC paths can be seen as the equivalent of Unix paths, only with double backslashes at the start.
A workaround would be to use classical rsplit:
>>> r"\\server\studies".rsplit(os.sep,1)[-1]
'studies'
Fun fact: with 3 paths it works properly:
>>> os.path.basename(r"\\a\b\c")
'c'
Now why this? let's check the source code of ntpath on windows:
def basename(p):
    """Returns the final component of a pathname"""
    return split(p)[1]
okay now split
def split(p):
    seps = _get_bothseps(p)
    d, p = splitdrive(p)
now splitdrive
def splitdrive(p):
    """Split a pathname into drive/UNC sharepoint and relative path specifiers.
    Returns a 2-tuple (drive_or_unc, path); either part may be empty.
Just reading the documentation makes us understand what's going on.
A Windows sharepoint has to contain 2 path parts:
\\server\shareroot
So \\server\studies is seen as the drive, and the path is empty. Doesn't happen when there are 3 parts in the path.
Note that it's not a bug, since it's not possible to use \\server like a normal directory, create dirs below, etc...
Note that the official documentation for os.path.basename doesn't mention that (because os.path calls ntpath behind the scenes) but it states:
Return the base name of pathname path. This is the second element of the pair returned by passing path to the function split(). Note that the result of this function is different from the Unix basename program
That last emphasised part at least is true! (And the documentation for os.path.split() doesn't mention that issue or even talk about Windows.)
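To make the behaviour concrete, a short demo (using ntpath directly, which is what os.path resolves to on Windows, so it runs anywhere):

import ntpath   # os.path on Windows

# Two path parts: the whole thing is treated as the drive/UNC share, the path is empty
print(ntpath.splitdrive(r'\\server\studies'))        # ('\\\\server\\studies', '')
print(ntpath.basename(r'\\server\studies'))          # ''

# Three parts: the share is the "drive" and the rest behaves like a normal path
print(ntpath.splitdrive(r'\\server\studies\backup')) # ('\\\\server\\studies', '\\backup')
print(ntpath.basename(r'\\server\studies\backup'))   # 'backup'

# The workaround from above, without UNC semantics
print(r'\\server\studies'.rsplit('\\', 1)[-1])       # 'studies'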

How to set pattern matching in Python's SMBConnection listPath function

SMBConnection has the following function, listPath, which lists the contents of a given directory.
listPath(service_name, path, search=55, pattern='*', timeout=30)
Retrieve a directory listing of files/folders at path
Parameters:
service_name (string/unicode) – the name of the shared folder for the path
path (string/unicode) – path relative to the service_name where we are interested to learn about its files/sub-folders.
search (integer) – integer value made up from a bitwise-OR of SMB_FILE_ATTRIBUTE_xxx bits (see smb_constants.py). The default search value will query for all read-only, hidden, system, archive files and directories.
pattern (string/unicode) – the filter to apply to the results before returning to the client.
Returns:
A list of smb.base.SharedFile instances.
newConn = SMBConnection(arguments.username, password, DEFAULT_CLIENT_NAME, arguments.hostname,
                        domain=arguments.domain, use_ntlm_v2=True, is_direct_tcp=True)
assert newConn.connect(ip_address, 445, timeout=60)

files = newConn.listPath('C$', '/' + 'testing', '*.pdf')
for file in files:
    print(file.filename)
I cannot get the pattern matching to change to anything specific. Above, I want to print out only those filenames that contain ".pdf" in the listing. Instead, when the code executes, I just get ALL the files, with no errors or anything. I have tried with and without the '*' and '.' and get the same results.
So, we got it to work with the re module as a workaround to the SMBConnection listPath pattern, using a variation of the following against the created SMBConnection object. The listPath function is still used, just not its pattern argument. I built an if/else structure to handle the argument input and the regex.
import re

extensions = ['pdf', 'doc']
filenames = ['foobar.pdf', 'bar.doc']

for extension in extensions:
    compiled = re.compile(r'\.{0}$'.format(extension))
    for filename in filenames:
        results = re.search(compiled, filename)
        print(results)
You can use the regular expression .+\.pdf.
.+: matches any character except a newline, one or more times.
\.pdf: escapes the dot, since the . character has a special meaning in regular expressions, and then matches pdf after it.
The original poster wasn't specifying the pattern parameter in the listPath() method correctly.
Instead of this:
files = newConn.listPath('C$', '/' + 'testing', '*.pdf')
It should be like this:
files = newConn.listPath('C$', '/' + 'testing', pattern='*.pdf')
Actually, I asked in another thread whether it is possible to filter on more than one file extension, possibly by way of a regular expression. The author of pysmb responded to say this is not possible.
To filter on more than one file extension, the workaround created by the original poster of this thread could be used.
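For example, a minimal sketch of that combination (server name, share and credentials are placeholders): list the directory once without a pattern, then filter locally with a regex covering several extensions.

import re
from smb.SMBConnection import SMBConnection

# Placeholder credentials/host, just to illustrate the approach
conn = SMBConnection('user', 'password', 'client_name', 'server_name',
                     use_ntlm_v2=True, is_direct_tcp=True)
assert conn.connect('192.168.1.10', 445, timeout=60)

wanted = re.compile(r'\.(pdf|doc)$', re.IGNORECASE)

# listPath's own pattern only takes a single wildcard, so match locally instead
for shared_file in conn.listPath('C$', '/testing'):
    if wanted.search(shared_file.filename):
        print(shared_file.filename)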

Getting the Folder Path of the last location I right clicked in Python

I'm using glob.glob to search a folder, and the sub-folders therein, for all the invoices I have. To simplify that, I'm going to add the program to the context menu and have it take the path as the first part of:
import glob

for filename in glob.glob(path + "/**/*.pdf", recursive=True):
    print(filename)
I'll have it keep the list and send those files to a Printer, in a later version, but for now just writing the name is a good enough test.
So my question is twofold:
Is there anything fundamentally wrong with the way I'm writing this?
Can anyone point me in the direction of how to actually capture folder path and provide it as path-variable?
You should have a look at this question: Python script on selected file. It shows how to set up a "Send To" command in the context menu. This command calls a Python script and provides the selected file name via sys.argv[1]. I assume that also works for a directory.
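A minimal sketch of the receiving script under that assumption (the argument index and the recursive glob are the only moving parts):

import glob
import sys

# Path handed over by the "Send To" / context-menu entry
path = sys.argv[1]

for filename in glob.glob(path + "/**/*.pdf", recursive=True):
    print(filename)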
I do not have Python 3.5, so I cannot set the flag recursive=True; instead, here is a solution you can run on any Python version known to date.
The solution consists of calling os.walk() to explore the directories, together with the built-in set type.
It is better to use a set instead of a list, because with the latter you need more code to check whether the directory you want to add is already listed.
So basically you can keep two sets: one for the names of the files you want to print, and the other for the directories and their sub-folders.
You can adapt this solution to your class/method:
import os

path = '.'  # Any path you want
exten = '.pdf'
directories_list = set()
files_list = set()

# Loop over directories
for dirpath, dirnames, files in os.walk(path):
    for name in files:
        # Check if the extension matches
        if name.lower().endswith(exten):
            files_list.add(name)
            directories_list.add(dirpath)
You can then loop over directories_list and files_list to print them out.
