Empty 'folder' not removed in GCS - python

When I delete all files from a "folder" in a bucket through the Console, the folder disappears too, since there is no such thing as directories: the whole path after the bucket name is the object key.
However, when I move these files programmatically through the REST API (copy, then delete), the folder remains, empty. I therefore have to write additional logic to find these leftovers and remove them explicitly.
Isn't that a bug in the REST API handling? I expected the same behavior regardless of the method used.

It turns out that you can safely remove all objects whose names end with / if you don't need them once they are empty. The "content" under them will not be deleted.
If you are using the Google Console, you must create a folder before uploading to it. That folder is therefore an explicit object that remains even when empty. The behavior is apparently the same when uploading with tools like Cyberduck.
But if you upload the file through the REST API using its full path, i.e. bucket/folder/file, the folder is implied visually but is not created as an object. So when the file is removed, no folder is left behind, since it was never there in the first place.
Since the expected behavior for my use case is to auto-remove empty folders, I just have a pre-processing routine that deletes all blobs ending with /
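A minimal sketch of such a pre-processing routine, assuming the google-cloud-storage client library. The function name and the bucket name in the usage comment are hypothetical. The function only relies on each blob having a name attribute and a delete() method, so it can be tried with stand-in objects first.

```python
def delete_folder_placeholders(blobs):
    """Delete every blob whose name ends with '/' and return the deleted names.

    `blobs` is any iterable of objects exposing `.name` and `.delete()`,
    such as the result of google.cloud.storage.Client().list_blobs(...).
    """
    deleted = []
    for blob in blobs:
        if blob.name.endswith("/"):
            blob.delete()
            deleted.append(blob.name)
    return deleted


# Usage with the real client (assumes google-cloud-storage is installed and
# credentials are configured; "my-bucket" is a placeholder name):
#
#   from google.cloud import storage
#   client = storage.Client()
#   delete_folder_placeholders(client.list_blobs("my-bucket"))
```

Note that this removes every placeholder object, whether or not files still exist under it. As the answer says, those files are separate objects and are not affected.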

Related

Using temporary files and folders in Web2py app

I am relatively new to web development and very new to using Web2py. The application I am currently working on is intended to take in a CSV upload from a user, then generate a PDF file based on the contents of the CSV, then allow the user to download that PDF. As part of this process I need to generate and access several intermediate files that are specific to each individual user (these files would be images, other pdfs, and some text files). I don't need to store these files in a database since they can be deleted after the session ends, but I am not sure the best way or place to store these files and keep them separate based on each session. I thought that maybe the subfolders in the sessions folder would make sense, but I do not know how to dynamically get the path to the correct folder for the current session. Any suggestions pointing me in the right direction are appreciated!
I was having this error "TypeError: expected string or Unicode object, NoneType found" and I had to store just a link in the session to the uploaded document in the db or maybe the upload folder in your case. I would store it to upload to proceed normally, and then clear out the values and the file if not 'approved'?
If the information is not confidential in similar circumstances, I directly write the temporary files under /tmp.
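A minimal sketch of the temporary-directory approach using only the standard library. The helper names are hypothetical, and it assumes you have some session identifier without path separators to pass in (in Web2py this might be response.session_id, which is an assumption here).

```python
import shutil
import tempfile


def make_session_workdir(session_id):
    """Create a private temporary directory for one session and return its path."""
    # mkdtemp picks a unique name under the system temp dir (e.g. /tmp).
    return tempfile.mkdtemp(prefix="session_%s_" % session_id)


def remove_session_workdir(path):
    """Delete the directory and everything in it once the session is done."""
    shutil.rmtree(path, ignore_errors=True)
```

The returned path can be stored in the session so later requests find the same directory.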

how to check duplicate files in airflow

I have incoming files in 'source-bucket'
I archive files after processing into another bucket 'archive-bucket'
in the current date-time folder.
eg:
gs://archive-bucket/module1/2021-06-25/source_file_20210622.csv
gs://archive-bucket/module1/2021-06-26/source_file_20210623.csv
gs://archive-bucket/module1/2021-06-27/source_file_20210624.csv
Every time I process a file, I want to check whether it has already been processed by checking if it is present in the archive folder.
    duplicate_check = GoogleCloudStoragePrefixSensor(
        task_id=f'detect_duplicate_{task_name}',
        bucket=ARCHIVE_BUCKET,
        prefix=f'module1/{DATE}/{source_file_name}')
This approach only allows checking one particular date folder.
How can I check whether 'source_file_<>.csv' is already present anywhere under 'gs://archive-bucket/module1/all the date folders/'?
If the file is present in any date folder in the archive path, I need to fail further processing.
How can that be achieved?
I do not think you can do it easily. You could probably play with the "delimiter" parameter https://github.com/googleapis/google-cloud-python/issues/920#issuecomment-230214336 to achieve something similar. Maybe you can even try setting the delimiter to your file name and looking for the "module1/" prefix. I am not sure about the efficiency of that, though.
The problem is that GCS is NOT a filesystem with folders. The "/" is just a convenience that lets you treat names like directories, and the UI allows you to "browse" them in a similar way, but GCS objects are not stored in subfolders. The whole name of the object is the only identifier, and there is nothing like "folders" in GCS. So you can only match on object names, and only matching by prefix is efficient. If you have a lot of files, any other kind of matching might be slow.
What I would recommend is a separate path where you create empty objects corresponding to the file names processed, for example a "processed/file.name" path, without any further structure. Then you can check for the presence of the file name there. This will be rather efficient (but might not be atomic, depending on how your processing works).
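A minimal sketch of the slower alternative: list every object under the "module1/" prefix and compare the last path component with the incoming file name. The function name is hypothetical, and the listing call in the usage comment assumes the google-cloud-storage client.

```python
def find_archived_copies(object_names, file_name):
    """Return the object names whose last path component equals `file_name`."""
    return [name for name in object_names
            if name.rsplit("/", 1)[-1] == file_name]


# Usage with the real client (assumes google-cloud-storage is installed):
#
#   from google.cloud import storage
#   blobs = storage.Client().list_blobs("archive-bucket", prefix="module1/")
#   if find_archived_copies((b.name for b in blobs), source_file_name):
#       raise ValueError("file was already processed")
```

This scans every archived object on each check, so with a large archive the "processed/" marker objects described above are likely the cheaper option.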
From your requirement, what I understand is that you want to move files from the source bucket to another bucket once they are processed, and you want to make sure that each file was moved to the destination bucket successfully.
The best way to do it is:
1) Maintain a small SQL table in which you record the path of each processed file with the state "Processed", and move files to the destination bucket whenever their state is "Processed". From this table you can always check which files have been processed and moved to the destination bucket.
2) Another approach: let task1 be the task that processes the files, and let task2 pass the processed files to a BashOperator. Using gsutil in that script, you can move the files easily and also check whether each file was actually pushed.

Python GCS how to rename file within inner zip file?

Suppose I have a file hosted on GCS in a Python AppEngine project. Unfortunately, its structure is something like:
outer.zip/
  - inner.zip/
    - vid_file
    - png_file
The problem is that the two files inside inner.zip do not have their extensions (.mp4 and .png), and it's causing all sorts of trouble. How do I rename the files so that the structure looks like:
outer.zip/
  - inner.zip/
    - vid_file.mp4
    - png_file.png
so that the files inside inner.zip have their extensions?
I keep running into all sorts of limitations, since GCS doesn't allow file renaming, unarchiving, etc.
The files aren't terribly large.
P.S. I'm not very familiar with Python, so any code examples would be greatly appreciated, thanks!
There is absolutely no way to perform any alteration to GCS objects -- full stop. They are exactly the bunch of bytes you decided at their birth (uninterpreted by GCS itself) and thus they will stay.
The best you can do is create a new object which is almost like the original except it fixes little errors and oopses you did when creating the original. Then you can overwrite (i.e completely replace) the original with the new, improved version.
Hopefully it's a one-off terrible mistake you made just once and now want to fix so it's not worth writing a program for that. Just download that GCS object, use normal tools to unzip it and unzip any further zipfiles it may contain, do the fixes on the filesystem with your favorite local filesystem tools, zip things up again, upload/rewrite the final zip to your desired new GCS object -- phew, you're done.
Alex is right that objects are immutable, i.e., no editing in-place. The only way to accomplish what you're talking about would be to download the current file, unzip it, update the new files, re-zip the files into the same-named file, and upload to GCS. GCS object overwrites are transactional, so the old content will be visible until the instant the upload completes. Doing it this way is obviously not very network efficient but at least it wouldn't leave periods of time when the object is invisible (as deleting and re-uploading would).
Use "import zipfile" and you can unzip the file once it has been downloaded from GCS.
I have code doing exactly this on a nightly basis from a cron job.
I've never tried creating a zip file with GAE, but the docs say you can do it.
https://docs.python.org/2/library/zipfile.html
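A minimal sketch of the download, fix, and re-upload approach described in the answers, covering only the in-memory zip rewriting. It assumes Python 3 and that you already have the bytes of outer.zip. The function names are hypothetical, and the GCS download and upload steps are left out.

```python
import io
import zipfile


def rename_in_zip(zip_bytes, renames):
    """Return a new zip archive (as bytes) with members renamed per `renames`."""
    src = zipfile.ZipFile(io.BytesIO(zip_bytes))
    out_buf = io.BytesIO()
    with zipfile.ZipFile(out_buf, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            new_name = renames.get(info.filename, info.filename)
            dst.writestr(new_name, src.read(info))
    return out_buf.getvalue()


def fix_inner_zip(outer_bytes, inner_name, renames):
    """Rebuild the outer archive, rewriting only the member `inner_name`."""
    src = zipfile.ZipFile(io.BytesIO(outer_bytes))
    out_buf = io.BytesIO()
    with zipfile.ZipFile(out_buf, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = src.read(info)
            if info.filename == inner_name:
                data = rename_in_zip(data, renames)
            dst.writestr(info.filename, data)
    return out_buf.getvalue()
```

For the layout in the question, the call would look like fix_inner_zip(outer_bytes, "inner.zip", {"vid_file": "vid_file.mp4", "png_file": "png_file.png"}), and the result is then uploaded over the original object.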

How can a directory at the same time be a file?

I was working with the python os module and confronted some obstacles regarding the symbolic path thing..
    linkdir = os.path.dirname(filepath)
    if not os.path.isdir(linkdir):
        if os.path.exists(linkdir):
            os.unlink(linkdir)
        os.makedirs(linkdir)
This is the code that I had trouble fully understanding. According to the explanation in the book, it means:
If I enter the if clause, this means the directory either does not exist or is a plain file.
In this case it is the latter, so it will be erased. Finally, the target directory is created.
However, I do not exactly understand how the directory (linkdir) could be a plain file. I tried to google it but just got the answer: 'Because it is the symbolic link'. I honestly do not get it from such a short answer. Would you be kind enough to explain it to me in an understandable fashion?
The code tries to clear the way for a directory being created. The value in filepath is just a string. It isn't actually connected to anything on the filesystem, but you cannot just create a directory without checking if there isn't anything in the way first.
If you have the value /foo/bar/spam.html in filepath, the code does this:
1) Extract the directory portion of that path, /foo/bar. This is still just a string, nothing really to do with the actual file system.
2) Test if /foo/bar is an actual directory on your filesystem with os.path.isdir(). If there is an existing directory at that location, you are done, mission accomplished.
3) If it is not a directory, then test if /foo/bar exists at all. We already discounted that it is a directory, so if /foo/bar exists anyway it must be something else. Usually that means it is a file. The code then will delete whatever is there to make way for the directory.
This doesn't have all that much to do with symbolic links; /foo/bar could be a pre-existing symbolic link too, but that doesn't really matter here. All that matters is that whatever actually exists on your filesystem at /foo/bar better be a directory already, otherwise it needs to be removed before you can create a directory there.
Because os.path.dirname(filepath) only splits the string "filepath" into head and tail at the last slash.
It doesn't check whether the head is an existing directory.
For example, suppose we have a file named "a" in the working directory.
(1) The code
os.path.dirname("a/a")
returns "a".
(2) However, os.path.isdir("a") returns False.
(3) os.path.isfile("a") returns True.
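The example above can be reproduced in a scratch directory. This is a small sketch: the function names are hypothetical, and clear_way_for_directory simply wraps the code from the question.

```python
import os
import tempfile


def clear_way_for_directory(filepath):
    """Ensure the parent of `filepath` is a directory (logic from the question)."""
    linkdir = os.path.dirname(filepath)
    if not os.path.isdir(linkdir):
        if os.path.exists(linkdir):
            os.unlink(linkdir)  # something that is not a directory is in the way
        os.makedirs(linkdir)
    return linkdir


def demo():
    with tempfile.TemporaryDirectory() as workdir:
        blocker = os.path.join(workdir, "a")
        with open(blocker, "w") as fh:
            fh.write("plain file")           # a plain file named "a"
        filepath = os.path.join(blocker, "a")  # the string ".../a/a"
        before = (os.path.dirname(filepath) == blocker,
                  os.path.isdir(blocker), os.path.isfile(blocker))
        clear_way_for_directory(filepath)
        after = (os.path.isdir(blocker), os.path.isfile(blocker))
    return before, after


print(demo())  # prints ((True, False, True), (True, False))
```

Before the call, the parent path names a plain file; afterwards the file is gone and a directory sits in its place.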

Ubuntu encrypted home directory | Errno 36 File Name too long

I am working on a Python scraper/spider and encountered a URL that exceeds the character limit, producing the IOError in the title. I am using httplib2, and when I attempt to retrieve the URL I receive a "file name too long" error. I prefer to have all of my projects within the home directory since I am using Dropbox. Is there any way around this issue, or should I just set up my working directory outside of home?
You are probably hitting a limitation of the encrypted file system, which allows up to 143 chars in a file name.
Here is the bug:
https://bugs.launchpad.net/ecryptfs/+bug/344878
The solution for now is to use any other directory outside your encrypted home directory. To double check this:
mount | grep ecryptfs
and see if your home dir is listed.
If that's the case either use some other dir above home, or create a new home directory without using encryption.
The fact that the filename that's too long starts with '.cache/www.example.com' explains the problem.
httplib2 optionally caches requests that you make. You've enabled caching, and you've given it .cache as the cache directory.
The easy solution is to put the cache directory somewhere else.
Without seeing your code, it's impossible to tell you how to fix it. But it should be trivial. The documentation for FileCache shows that it takes a dir_name as the first parameter.
Or, alternatively, you can pass a safe function that lets you generate a filename from the URI, overriding the default. That would allow you to generate filenames that fit within the 143-character limit of the Ubuntu encrypted filesystem.
Or, alternatively, you can create your own object with the same interface as FileCache and pass that to the Http object to use as a cache. For example, you could use tempfile to create random filenames, and store a mapping of URLs to filenames in an anydbm or sqlite3 database.
A final alternative is to just turn off caching, of course.
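A minimal sketch of the "safe function" alternative, assuming httplib2 is installed. The function names are hypothetical, and hashing the URI is just one way to get short names. The cache directory passed in should live outside the encrypted home if you go with the first suggestion.

```python
import hashlib


def short_safe_name(uri):
    """Map any URI to a fixed-length, filesystem-safe cache file name."""
    # A SHA-256 hex digest is always 64 characters, well under the limit.
    return hashlib.sha256(uri.encode("utf-8")).hexdigest()


def make_http(cache_dir):
    """Build an httplib2 client whose cache files use the short names above."""
    import httplib2  # imported here so short_safe_name works without it
    return httplib2.Http(httplib2.FileCache(cache_dir, safe=short_safe_name))
```

The hashed names are not human-readable, which is the trade-off for a guaranteed length.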
As you apparently have passed '.cache' to the httplib2.Http constructor, you should change this to something more appropriate or disable the cache.
