I need to check if a directory and all its sub-directories and files have had any changes to:
file permission
file name
file contents
I'd like to get a checksum out of it.
Using tar doesn't seem to work well, because it contains file modify time, and maybe access time too. I'm not sure. Python's checksumdir doesn't contain any metadata at all, not even file name.
Anyone know of a way to accomplish this?
Thanks.
You can use tar with the --mtime flag to clamp every entry's modification time to a fixed value, e.g. --mtime='1970-01-01', so timestamps no longer affect the archive's checksum.
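Alternatively, here is a minimal Python sketch that hashes exactly the three things listed in the question (file names, permissions, contents) and nothing else. The function name, the choice of SHA-256, relative paths, and the chunk size are all my own assumptions, not from the original post:

```python
import hashlib
import os

def dir_checksum(root):
    """Hash file names, permissions, and contents under root,
    ignoring modification and access times entirely."""
    h = hashlib.sha256()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make traversal order deterministic
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root)
            mode = os.lstat(path).st_mode
            h.update(rel.encode())        # file name (relative path)
            h.update(str(mode).encode())  # permission bits
            with open(path, "rb") as f:   # file contents, in chunks
                for chunk in iter(lambda: f.read(8192), b""):
                    h.update(chunk)
    return h.hexdigest()
```

Renaming a file, editing its contents, or chmod-ing it changes the digest; touching its mtime does not.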
I have incoming files in 'source-bucket'
I archive files after processing into another bucket 'archive-bucket'
in the current date-time folder.
eg:
gs://archive-bucket/module1/2021-06-25/source_file_20210622.csv
gs://archive-bucket/module1/2021-06-26/source_file_20210623.csv
gs://archive-bucket/module1/2021-06-27/source_file_20210624.csv
Every time I process a file, I want to check whether it has already been processed by checking if it is present in the archive folder.
duplicate_check = GoogleCloudStoragePrefixSensor(
    task_id=f'detect_duplicate_{task_name}',
    bucket=ARCHIVE_BUCKET,
    prefix=f'module1/{DATE}/{source_file_name}')
This approach only allows checking a particular date folder.
How can I check whether 'source_file_<>.csv' is already present under 'gs://archive-bucket/module1/all the date folders/'?
If the file is present in any date folder in the archive path, I need to fail further processing.
How can that be achieved?
I do not think you can do it easily. You could probably play with the "delimiter" parameter https://github.com/googleapis/google-cloud-python/issues/920#issuecomment-230214336 to achieve something similar. Maybe you could even set the delimiter to your file name and search under the "module1/" prefix, though I am not sure how efficient that would be.
The problem is that GCS is NOT a filesystem with folders. The "/" is just a convenience to treat object names as directories, and the UI lets you "browse" them that way, but in fact GCS objects are not stored in subfolders: the full name of the object is its only identifier, and there is nothing like "folders" in GCS. So you can only match object names, and only prefix matching is efficient. If you have a lot of files, any other kind of matching might be slow.
What I recommend instead is to keep a separate path where you create an empty object for each file name you have processed, for example "processed/file.name", without any date structure. Then you can check for the presence of the file name there with a simple prefix match. This will be fairly efficient (though it might not be atomic, depending on what your processing looks like).
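For completeness, here is a hedged sketch of the brute-force alternative: listing everything under module1/ and matching the file name across all date folders (inefficient for large buckets, as noted above). The matching logic is pure Python; check_archive assumes the google-cloud-storage client library and a hypothetical bucket name:

```python
def file_already_archived(blob_names, source_file_name):
    """Pure matching logic: does source_file_name appear under
    any date folder among the listed object names?"""
    return any(name.rsplit("/", 1)[-1] == source_file_name
               for name in blob_names)

def check_archive(bucket_name, source_file_name):
    """List everything under module1/ and apply the match.
    Requires google-cloud-storage and credentials."""
    from google.cloud import storage  # imported lazily; the matcher above has no dependencies
    client = storage.Client()
    names = (b.name for b in client.list_blobs(bucket_name, prefix="module1/"))
    return file_already_archived(names, source_file_name)
```

In a DAG you could call check_archive from a PythonOperator and raise an exception to fail the run when it returns True.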
From your requirement, what I understand is: you want to move files from the source bucket to another bucket once they are processed, and you want to make sure each file was moved to the destination bucket successfully.
Two ways to do it:
1) Maintain a small SQL table and insert the path of each processed file with the state "Processed"; whenever the state is processed, move the file to the destination bucket. From this table you can always check which files have been processed and moved.
2) Alternatively, have task1 process the files and task2 pass the processed files to a BashOperator; with gsutil you can move the files easily and also verify in that script whether the move succeeded.
I have been using Selenium with Python to download a zip file daily, but I am currently facing a few issues after downloading it to my local Downloads folder:
Is it possible to use Python to read those files dynamically? Let's say the date in the name is always different. Can we simply add a wildcard *? I am trying to move the file from the Downloads folder to another folder, but it always requires me to name the file exactly.
How do I unzip the file and look for specific files inside? Let's say those files will always have names starting with "ABC202103xx.csv".
Much appreciated for your help! Any sample code would be truly appreciated!
Not knowing the exact name of a file in a local folder should usually not be a problem. You can just list all filenames in the folder and use a for loop to find the one you need. For example, assume you have downloaded a zip file into a Downloads folder and you know it is named "file-X.zip", with X being any date.
import os

for filename in os.listdir("Downloads"):
    if filename.startswith("file-") and filename.endswith(".zip"):
        filename_you_are_looking_for = filename
        break
To unzip files, I will refer you to this stackoverflow thread. Again, to look for specific files in there, you can use os.listdir.
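To tie both parts together, here is a hedged sketch that finds the newest matching zip, moves it, and extracts only the members you care about. The "file-" and "ABC" prefixes come from the question; the function name and everything else are assumptions:

```python
import os
import shutil
import zipfile

def move_and_extract(download_dir, dest_dir, prefix="file-", wanted_prefix="ABC"):
    """Find the newest zip in download_dir matching prefix, move it to
    dest_dir, and extract only members whose names start with wanted_prefix."""
    candidates = [f for f in os.listdir(download_dir)
                  if f.startswith(prefix) and f.endswith(".zip")]
    if not candidates:
        return []
    # pick the most recently downloaded one
    newest = max(candidates,
                 key=lambda f: os.path.getmtime(os.path.join(download_dir, f)))
    moved = shutil.move(os.path.join(download_dir, newest), dest_dir)
    extracted = []
    with zipfile.ZipFile(moved) as zf:
        for member in zf.namelist():
            if os.path.basename(member).startswith(wanted_prefix):
                zf.extract(member, dest_dir)
                extracted.append(member)
    return extracted
```

shutil.move handles the "I don't know the exact name" problem because the name was discovered by listing, not hard-coded.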
Suppose I have a file hosted on GCS in a Python App Engine project. Unfortunately, this file's structure is something like:
outer.zip/
- inner.zip/
- vid_file
- png_file
The problem is that the two files inside inner.zip do not have their .mp4/.png extensions, which is causing all sorts of trouble. How do I rename the files so that it appears like:
outer.zip/
- inner.zip/
- vid_file.mp4
- png_file.png
so that the files inside inner.zip have their extensions?
I keep running into all sorts of limitations, since GCS doesn't allow file renaming, unarchiving, etc.
The files aren't terribly large.
P.S. I'm not very familiar with Python, so any code examples would be greatly appreciated, thanks!
There is absolutely no way to perform any alteration to GCS objects -- full stop. They are exactly the bunch of bytes you decided at their birth (uninterpreted by GCS itself) and thus they will stay.
The best you can do is create a new object which is almost like the original except it fixes little errors and oopses you did when creating the original. Then you can overwrite (i.e completely replace) the original with the new, improved version.
Hopefully it's a one-off terrible mistake you made just once and now want to fix so it's not worth writing a program for that. Just download that GCS object, use normal tools to unzip it and unzip any further zipfiles it may contain, do the fixes on the filesystem with your favorite local filesystem tools, zip things up again, upload/rewrite the final zip to your desired new GCS object -- phew, you're done.
Alex is right that objects are immutable, i.e., no editing in-place. The only way to accomplish what you're talking about would be to download the current file, unzip it, update the new files, re-zip the files into the same-named file, and upload to GCS. GCS object overwrites are transactional, so the old content will be visible until the instant the upload completes. Doing it this way is obviously not very network efficient but at least it wouldn't leave periods of time when the object is invisible (as deleting and re-uploading would).
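The rewrite step can be done entirely in memory with the standard zipfile module; the download and re-upload around it are whatever GCS client calls you already use. A hedged sketch (the function name and the renames mapping are my own; only the vid_file/png_file names come from the question):

```python
import io
import zipfile

def rename_inner_members(outer_bytes, renames):
    """Rebuild outer.zip so that members of any nested .zip are
    renamed per the renames dict (old name -> new name)."""
    out_buf = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(outer_bytes)) as outer, \
         zipfile.ZipFile(out_buf, "w", zipfile.ZIP_DEFLATED) as new_outer:
        for item in outer.namelist():
            data = outer.read(item)
            if item.endswith(".zip"):  # rewrite the nested archive
                inner_buf = io.BytesIO()
                with zipfile.ZipFile(io.BytesIO(data)) as inner, \
                     zipfile.ZipFile(inner_buf, "w", zipfile.ZIP_DEFLATED) as new_inner:
                    for member in inner.namelist():
                        new_inner.writestr(renames.get(member, member),
                                           inner.read(member))
                data = inner_buf.getvalue()
            new_outer.writestr(item, data)
    return out_buf.getvalue()
```

You would download the GCS object's bytes, call this with {"vid_file": "vid_file.mp4", "png_file": "png_file.png"}, and upload the returned bytes back over the original object.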
"import zipfile" and you can unzip the file once it's downloaded into GCS storage.
I have code doing exactly this on a nightly basis from a cron job.
I've never tried creating a zip file with GAE, but the docs say you can do it.
https://docs.python.org/2/library/zipfile.html
I was working with the Python os module and ran into some obstacles regarding symbolic paths.
linkdir = os.path.dirname(filepath)
if not os.path.isdir(linkdir):
    if os.path.exists(linkdir):
        os.unlink(linkdir)
    os.makedirs(linkdir)
This is the code I had trouble fully understanding. According to the explanation in the book, it means:
If we enter the if clause, the directory either does not exist or is a plain file.
In the latter case, it is erased. Finally, the target directory is created.
However, I do not exactly understand how the directory (linkdir) could be a plain file. I tried to google it but just got the answer: 'Because it is a symbolic link'. I honestly don't get it from such a short answer... Would you be kind enough to explain it to me in an understandable fashion?
The code tries to clear the way for a directory being created. The value in filepath is just a string. It isn't actually connected to anything on the filesystem, but you cannot just create a directory without checking if there isn't anything in the way first.
If you have the value /foo/bar/spam.html in filepath, the code does this:
extract the directory portion of that path, /foo/bar. This is still just a string, nothing really to do with the actual file system.
Test if /foo/bar is an actual directory on your filesystem with os.path.isdir(). If there is an existing directory at that location, you are done, mission accomplished.
If it is not a directory, then test if /foo/bar exists at all. We already discounted that it is a directory, so if /foo/bar exists anyway it must be something else. Usually that means it is a file. The code then will delete whatever is there to make way for the directory.
This doesn't have all that much to do with symbolic links; /foo/bar could be a pre-existing symbolic link too, but that doesn't really matter here. All that matters is that whatever actually exists on your filesystem at /foo/bar better be a directory already, otherwise it needs to be removed before you can create a directory there.
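A quick demonstration of the "plain file in the way" case, using throwaway temp-directory names rather than /foo/bar:

```python
import os
import tempfile

base = tempfile.mkdtemp()
linkdir = os.path.join(base, "bar")  # the path we want to be a directory
with open(linkdir, "w") as f:        # but a plain file is sitting there
    f.write("in the way")

print(os.path.isdir(linkdir))   # False: something exists, but not a directory
print(os.path.exists(linkdir))  # True: a plain file is there

# The snippet's logic: remove whatever is in the way, then create the directory.
if not os.path.isdir(linkdir):
    if os.path.exists(linkdir):
        os.unlink(linkdir)
    os.makedirs(linkdir)

print(os.path.isdir(linkdir))   # True now
```

Without the os.unlink step, os.makedirs would raise FileExistsError because the plain file occupies the path.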
Because os.path.dirname(filepath) only splits the string filepath into head and tail at the last slash.
It doesn't check whether the head is an existing directory.
For example, suppose we have a plain file named "a" in the working directory.
(1) os.path.dirname("a/a") returns "a" — pure string manipulation.
(2) However, os.path.isdir("a") returns False.
(3) And os.path.isfile("a") returns True.
Is there a way to rename a file inside a compressed archive without extracting it, in Python?
No. The python ZipFile module does not provide any way to do that.
You can extract the archive into memory, and then rewrite it with the new filename for the file in question, but that is memory intensive and not what you want to do.
You may be able to edit the zipfile's header and information fields, but the standard Python interface doesn't have an easy way to do so.
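For completeness, a sketch of the "extract and rewrite" workaround mentioned above. It re-creates the whole archive rather than editing the zip in place, and the function name is my own:

```python
import io
import zipfile

def rename_in_zip(zip_path, old_name, new_name):
    """Rewrite the archive at zip_path so that the member old_name
    is stored as new_name. Re-creates the archive; not in-place."""
    buf = io.BytesIO()
    with zipfile.ZipFile(zip_path) as src, \
         zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as dst:
        for member in src.namelist():
            dst.writestr(new_name if member == old_name else member,
                         src.read(member))
    with open(zip_path, "wb") as f:
        f.write(buf.getvalue())
```

As noted, this holds the extracted data in memory, so it is only suitable for modestly sized archives.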