How can a directory at the same time be a file? - python

I was working with the Python os module and ran into some obstacles regarding symbolic paths.
linkdir = os.path.dirname(filepath)
if not os.path.isdir(linkdir):
    if os.path.exists(linkdir):
        os.unlink(linkdir)
    os.makedirs(linkdir)
This is the code that I had trouble fully understanding. According to the explanation in the book, it means:
If I enter the if clause, this means the directory either does not exist or is a plain file.
In case it is the latter, it will be erased. Finally, the target directory is created.
However, I do not exactly understand how the directory (linkdir) could be a plain file. I tried to google it but just got the answer 'Because it is a symbolic link'. I honestly do not get it from such a short answer... Would you be kind enough to explain it to me in an understandable fashion?

The code tries to clear the way for a directory to be created. The value in filepath is just a string; it isn't actually connected to anything on the filesystem, but you cannot just create a directory without first checking whether anything is in the way.
If you have the value /foo/bar/spam.html in filepath, the code does this:
extract the directory portion of that path, /foo/bar. This is still just a string, nothing really to do with the actual file system.
Test if /foo/bar is an actual directory on your filesystem with os.path.isdir(). If there is an existing directory at that location, you are done, mission accomplished.
If it is not a directory, then test if /foo/bar exists at all. We already discounted that it is a directory, so if /foo/bar exists anyway it must be something else. Usually that means it is a file. The code then will delete whatever is there to make way for the directory.
This doesn't have all that much to do with symbolic links; /foo/bar could be a pre-existing symbolic link too, but that doesn't really matter here. All that matters is that whatever actually exists on your filesystem at /foo/bar better be a directory already, otherwise it needs to be removed before you can create a directory there.
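To make this concrete, here is a tiny demonstration of the "plain file in the way" case (a sketch of mine, assuming an empty scratch directory; the name foo is hypothetical):

import os

open("foo", "w").close()         # a plain file named "foo" is now in the way
print(os.path.isdir("foo"))      # False: not a directory
print(os.path.exists("foo"))     # True: but something exists at that path

os.unlink("foo")                 # remove whatever is there...
os.makedirs("foo")               # ...then create the directory
print(os.path.isdir("foo"))      # True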

Because os.path.dirname(filepath) only splits the string filepath into head and tail at the last slash.
It doesn't check whether the head is an existing directory.
For example, suppose we have a plain file named "a" in the working directory:
os.path.dirname("a/a")  # returns 'a'
os.path.isdir("a")      # False
os.path.isfile("a")     # True

Related

How to check for duplicate files in Airflow

I have incoming files in 'source-bucket'.
After processing, I archive the files into another bucket, 'archive-bucket', under the current date folder, e.g.:
gs://archive-bucket/module1/2021-06-25/source_file_20210622.csv
gs://archive-bucket/module1/2021-06-26/source_file_20210623.csv
gs://archive-bucket/module1/2021-06-27/source_file_20210624.csv
Every time I process a file, I want to check whether it has already been processed by checking if it is present in the archive folder.
duplicate_check = GoogleCloudStoragePrefixSensor(
    task_id=f'detect_duplicate_{task_name}',
    bucket=ARCHIVE_BUCKET,
    prefix=f'module1/{DATE}/{source_file_name}')
This approach only allows checking a particular date folder.
How do I check whether 'source_file_<>.csv' is already present in 'gs://archive-bucket/module1/' across all the date folders?
If the file is present in any date folder in the archive path, I need to fail further processing.
How can that be achieved?
I do not think you can do it easily. You could probably play with the "delimiter" parameter (https://github.com/googleapis/google-cloud-python/issues/920#issuecomment-230214336) to achieve something similar. Maybe you can even try to set the delimiter to be your file name and look for the "module1/" prefix. Not sure about the efficiency of that, though.
The problem is that GCS is NOT a filesystem with folders. The "/" is just a convenience to treat object names as directories, and the UI allows you to "browse" them in a similar way, but in fact GCS objects are not stored in subfolders - the whole name of the object is the only identifier, and there is nothing like "folders" in GCS. So you can only match the file names, and only matching by prefix is efficient. If you have a lot of files, any other kind of matching might be slow.
What I recommend, maybe, is to have a separate path where you create empty objects corresponding to the file names you have processed. For example a "processed/file.name" path, without any structure. Then you can check for the presence of the file name there. This will be rather efficient (but might not be atomic, depending on how your processing looks).
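A minimal sketch of that marker-object idea, assuming the google-cloud-storage client library; the bucket and file names are hypothetical:

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("archive-bucket")

def already_processed(file_name):
    # check for a zero-byte marker object under a flat "processed/" prefix
    return bucket.blob("processed/" + file_name).exists()

def mark_processed(file_name):
    # create the empty marker; note this is not atomic with your processing
    bucket.blob("processed/" + file_name).upload_from_string("")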
From your requirement, what I understand is: you want to move files from the source bucket to another bucket once they are processed, and you want to make sure each file has been moved to the destination bucket successfully.
The best way to do it is:
1) Maintain a small SQL table and insert the path of each processed file into it with the state "Processed"; whenever the state is "Processed", move those files to the destination bucket. From this table you can always check which files have been processed and moved.
2) Another approach: if task1 is the task that processes files, have task2 pass the processed files to a BashOperator; using gsutil you can move the files easily and also check in that script whether the move succeeded.
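A rough sketch of that second approach, assuming Airflow 2.x; the task id, bucket paths, and file name are hypothetical:

from airflow.operators.bash import BashOperator

archive_file = BashOperator(
    task_id='archive_processed_file',
    # gsutil mv copies then deletes, and exits non-zero if the copy fails
    bash_command=(
        'gsutil mv gs://source-bucket/{{ params.file }} '
        'gs://archive-bucket/module1/{{ ds }}/{{ params.file }}'
    ),
    params={'file': 'source_file_20210622.csv'},
)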

Linux checksum of directory

I need to check if a directory and all its sub-directories and files have had any changes to:
file permission
file name
file contents
I'd like to get a checksum out of it.
Using tar doesn't seem to work well, because the archive contains file modification times, and maybe access times too; I'm not sure. Python's checksumdir doesn't include any metadata at all, not even file names.
Anyone know of a way to accomplish this?
Thanks.
You can use tar with the --mtime flag and pin the modification time to a specific value, then checksum the archive. E.g. (GNU tar): tar --mtime='1970-01-01' -cf - mydir | sha256sum
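If you would rather avoid tar altogether, here is a minimal pure-Python sketch of my own that hashes only relative names, permission bits, and regular-file contents, so timestamps never enter the checksum:

import hashlib
import os
import stat

def dir_checksum(root):
    h = hashlib.sha256()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # walk subdirectories in a deterministic order
        for name in sorted(dirnames + filenames):
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            # feed the relative path and permission bits into the hash
            h.update(os.path.relpath(path, root).encode())
            h.update(stat.S_IMODE(st.st_mode).to_bytes(2, "big"))
            if stat.S_ISREG(st.st_mode):
                with open(path, "rb") as f:
                    for chunk in iter(lambda: f.read(1 << 16), b""):
                        h.update(chunk)
    return h.hexdigest()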

Empty 'folder' not removed in GCS

When I delete all files from a "folder" in a bucket through the Console, that folder is gone too, since there is no such thing as directories - the whole path after the bucket is the key.
However, when I move (copy & delete) these files programmatically through the REST API, the folder remains, empty. I must therefore write additional logic to check for these and remove them explicitly.
Isn't that a bug in the REST API handling? I had expected the same behavior regardless of the method used.
Turns out that you can safely remove all objects ending with / if you don't need them once empty. "Content" will not be deleted.
If you are using the Google Console, you must create a folder before uploading to it. That folder is therefore an explicit object that will remain even when empty. The behavior is apparently the same when uploading with tools like Cyberduck.
But if you upload the file using the REST API with its full path, i.e. bucket/folder/file, the folder is implied visually but is never created as an object. So when the file is removed, there is no folder left behind, since it wasn't there in the first place.
Since the expected behavior for my use case is to auto-remove empty folders, I just have a pre-processing routine that deletes all blobs ending with /.
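For reference, a hedged sketch of such a routine, assuming the google-cloud-storage client; the bucket name is hypothetical:

from google.cloud import storage

client = storage.Client()

for blob in client.list_blobs("archive-bucket"):
    # placeholder "folders" are zero-byte objects whose names end with /
    if blob.name.endswith("/"):
        blob.delete()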

How can I determine if a given directory is `symlink`ed to in Python?

I am trying to remove a set of directories, excluding those that are in use and symlinked to from elsewhere.
What is the most effective way to determine whether a given directory is symlinked to?
I've tried os.stat(dir).st_nlink, but it returns 3 even for directories I want to remove.
EDIT:
By symlinked to I mean this directory is a target of some symlink.
There is no easy way to determine if someone else has made a link to a given "hard" directory. You can only check if a given directory is a symlink to another directory.
This means that you need to traverse your entire directory structure, look for symlinks, and then check if they point to the directory in question.
A symlink is a special file which points to another file/directory, somewhere in your directory structure. Symlinks can point to other filesystems as well. Creating a symlink does not change the inode of the destination file/folder (as opposed to hard links), so you can't tell by looking at the target, only at the link itself.
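As an illustration, a rough sketch of that traversal, assuming everything that could link to the directory lives under one search root; the names are hypothetical:

import os

def links_pointing_to(target, search_root):
    target = os.path.realpath(target)
    for dirpath, dirnames, filenames in os.walk(search_root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # a symlink whose resolved path equals the target directory
            if os.path.islink(path) and os.path.realpath(path) == target:
                yield path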
Use os.path.islink(path).
Straight from the docs:
Return True if path refers to a directory entry that is a symbolic link. Always False if symbolic links are not supported.
Use os.path.islink.
You can also test whether a path is a broken link using os.path.lexists() (a path is a broken link iff lexists() returns True and exists() returns False).
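A small illustration of the difference, assuming a POSIX system and a scratch working directory:

import os

os.symlink("/nonexistent/target", "broken_link")
print(os.path.islink("broken_link"))   # True: it is a symlink
print(os.path.exists("broken_link"))   # False: its target is missing
print(os.path.lexists("broken_link"))  # True: the link itself exists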

Ubuntu encrypted home directory | Errno 36 File Name too long

I'm working on a Python scraper/spider and encountered a URL that triggers the IOError in the title. I'm using httplib2, and when I attempt to retrieve the URL I receive a "file name too long" error. I prefer to keep all of my projects within the home directory since I am using Dropbox. Is there any way around this issue, or should I just set up my working directory outside of home?
You are probably hitting a limitation of the encrypted file system, which allows at most 143 characters in a file name.
Here is the bug:
https://bugs.launchpad.net/ecryptfs/+bug/344878
The solution for now is to use a directory outside your encrypted home directory. To double-check this:
mount | grep ecryptfs
and see if your home dir is listed.
If that's the case, either use some other directory outside of home, or create a new home directory without encryption.
The fact that the filename that's too long starts with '.cache/www.example.com' explains the problem.
httplib2 optionally caches requests that you make. You've enabled caching, and you've given it .cache as the cache directory.
The easy solution is to put the cache directory somewhere else.
Without seeing your code, it's impossible to tell you exactly how to fix it, but it should be trivial. The documentation for FileCache shows that it takes a dir_name as its first parameter.
Or, alternatively, you can pass a safe function that lets you generate a filename from the URI, overriding the default. That would allow you to generate filenames that fit within the 143-character limit of the Ubuntu encrypted fs.
Or, alternatively, you can create your own object with the same interface as FileCache and pass that to the Http object to use as a cache. For example, you could use tempfile to create random filenames, and store a mapping of URLs to filenames in an anydbm or sqlite3 database.
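For instance, a rough sketch of the relocated cache directory plus a custom safe function, assuming httplib2's FileCache; the cache path and helper name are hypothetical:

import hashlib
import httplib2

def short_safe(uri):
    # derive a short, fixed-length cache filename from the URI
    return hashlib.md5(uri.encode("utf-8")).hexdigest()

cache = httplib2.FileCache("/tmp/http-cache", safe=short_safe)
http = httplib2.Http(cache=cache)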
A final alternative is to just turn off caching, of course.
As you apparently have passed '.cache' to the httplib2.Http constructor, you should change this to something more appropriate or disable the cache.
