How to check duplicate files in Airflow - Python

I have incoming files in 'source-bucket'.
I archive files after processing into another bucket, 'archive-bucket',
under a folder named for the current date.
e.g.:
gs://archive-bucket/module1/2021-06-25/source_file_20210622.csv
gs://archive-bucket/module1/2021-06-26/source_file_20210623.csv
gs://archive-bucket/module1/2021-06-27/source_file_20210624.csv
Every time I process a file, I want to check whether it has already been processed by checking if it is present in the archive folder.
duplicate_check = GoogleCloudStoragePrefixSensor(
    task_id=f'detect_duplicate_{task_name}',
    bucket=ARCHIVE_BUCKET,
    prefix=f'module1/{DATE}/{source_file_name}')
This approach only allows checking a particular date folder.
How can I check whether 'source_file_<>.csv' is already present anywhere under 'gs://archive-bucket/module1/<any date folder>/'?
If the file is present in any date folder in the archive path, I need to fail further processing.
How can that be achieved?

I do not think you can do it easily. You could probably play with the "delimiter" parameter (https://github.com/googleapis/google-cloud-python/issues/920#issuecomment-230214336) to achieve something similar. Maybe you can even try to set the delimiter to be your file name and look for the "module1/" prefix. I'm not sure about the efficiency of that, though.
The problem is that GCS is NOT a filesystem with folders. The "/" is just a convenience to treat it as a directory, and the UI allows you to "browse" it in a similar way, but in fact GCS objects are not stored in subfolders - the whole name of the object is the only identifier, and there is nothing like "folders" in GCS. So you can only match the file names, and only matching by prefix is efficient. If you have a lot of files, any other kind of matching might be slow.
What I recommend, maybe, is to have a separate path where you create empty objects corresponding to the file names already processed - for example, a "processed/file.name" path, without any structure. Then you could check for the presence of the file name there. This will be rather efficient (but might not be atomic, depending on how your processing looks). A sketch of that idea follows.
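For illustration, here is a minimal sketch of the marker-object idea using the google-cloud-storage client library. The "processed/" prefix and the helper names are assumptions made up for this example; ARCHIVE_BUCKET and source_file_name come from the question.

from google.cloud import storage

def fail_if_already_processed(bucket_name: str, file_name: str) -> None:
    """Raise if a marker object processed/<file_name> already exists."""
    client = storage.Client()
    marker = client.bucket(bucket_name).blob(f"processed/{file_name}")
    # exists() is a single metadata GET, so the check stays cheap no matter
    # how many dated archive folders accumulate.
    if marker.exists():
        raise ValueError(f"{file_name} was already processed")

def mark_processed(bucket_name: str, file_name: str) -> None:
    """Create the empty marker object right after archiving the file."""
    client = storage.Client()
    client.bucket(bucket_name).blob(f"processed/{file_name}").upload_from_string("")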

From your requirement, what I understand is that you want to move files, once they are processed, from the source bucket to another bucket, and you want to make sure each file is moved to the destination bucket successfully.
The best ways to do it are:
1) Maintain a small SQL table and insert the path of each processed file into it with the state "Processed"; whenever the state is "Processed", move those files to the destination bucket. From this table you can always check which files have been processed and moved to the destination bucket.
and
2) Another approach:
task1 = task to process files
task2 = pass the processed files to a BashOperator
Using gsutil you can move files easily and also check in that script whether the move succeeded (see the sketch below).
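A rough sketch of that second approach, assuming Airflow's BashOperator inside your existing DAG and hypothetical bucket and file names:

from airflow.operators.bash_operator import BashOperator

archive_file = BashOperator(
    task_id="archive_file",
    # gsutil mv copies the object and removes the source only on success,
    # and a non-zero exit code makes the task fail if anything goes wrong.
    bash_command=(
        "gsutil mv "
        "gs://{{ params.source_bucket }}/{{ params.file_name }} "
        "gs://{{ params.archive_bucket }}/module1/{{ ds }}/{{ params.file_name }}"
    ),
    params={
        "source_bucket": "source-bucket",
        "archive_bucket": "archive-bucket",
        "file_name": "source_file_20210622.csv",
    },
)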

Related

Using temporary files and folders in Web2py app

I am relatively new to web development and very new to using Web2py. The application I am currently working on is intended to take in a CSV upload from a user, then generate a PDF file based on the contents of the CSV, then allow the user to download that PDF. As part of this process I need to generate and access several intermediate files that are specific to each individual user (these files would be images, other pdfs, and some text files). I don't need to store these files in a database since they can be deleted after the session ends, but I am not sure the best way or place to store these files and keep them separate based on each session. I thought that maybe the subfolders in the sessions folder would make sense, but I do not know how to dynamically get the path to the correct folder for the current session. Any suggestions pointing me in the right direction are appreciated!
I was getting the error "TypeError: expected string or Unicode object, NoneType found", and I had to store just a link in the session to the uploaded document in the db (or, in your case, maybe the upload folder). I would store it in the upload folder so processing proceeds normally, and then clear out the values and the file if it is not 'approved'.
In similar circumstances, if the information is not confidential, I directly write the temporary files under /tmp.
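If you need to keep each session's files separate under /tmp, one possible approach is a per-session scratch directory. This is only a sketch, assuming a web2py controller where the session object is available; the workdir attribute and the prefix are made up for the example.

import os
import tempfile

def get_session_workdir():
    """Create (once) and return a scratch directory tied to this session."""
    if not session.workdir or not os.path.isdir(session.workdir):
        # mkdtemp creates a uniquely named directory under the system temp
        # path (e.g. /tmp), so concurrent sessions never collide.
        session.workdir = tempfile.mkdtemp(prefix="csv2pdf_")
    return session.workdir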

Is there an easy way to handle inconsistent file paths in blob storage?

I have a service that drops a bunch of .gz files into an Azure container on a daily cadence. I'm looking to pick these files up and convert the underlying txt/json into tables. The issue perplexing me is that the service adds two random-string prefix folders and a date folder to the path.
Here's an example file path:
container/service-exports/z633dbc1-3934-4cc3-ad29-e82c6e74f070/2022-07-12/42625mc4-47r6-4bgc-ac72-11092822dd81-9657628860/*.gz
I've thought of three possible solutions:
1) I don't necessarily need the data to persist. I could theoretically loop through each folder looking for .gz files, open and write them to an output file, and then go back through and delete the folders in the path.
2) Create some sort of checkpoint file that keeps track of each gzip's path, and then configure some way of comparing against the checkpoint file at runtime. I'm not sure how efficient this would be over time.
3) Use regex to look for random strings matching the pattern/length of the prefixes, and then look for the current date folder. If the date isn't today, pass.
Am I missing a prebuilt library or function capable of simplifying this? I searched around but couldn't find any discussions on this type of problem.
You can do this using koalas.
import databricks.koalas as ks

# The wildcards span the two random prefix folders and the date folder.
path = "wasbs://container/service-exports/*/*/*.gz"
df = ks.read_csv(path, sep=",", header='infer')
This should work out well if all the .gz files have the same columns; df will then contain the data from all the .gz files concatenated.
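If you would rather implement option 3 from the question directly and filter the blobs yourself, here is a minimal sketch assuming the azure-storage-blob (v12) package and a placeholder connection string; it lists today's .gz blobs regardless of the random prefix folders:

from datetime import date
from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string(
    conn_str="<your-connection-string>", container_name="container"
)

today = date.today().isoformat()  # e.g. "2022-07-12"
todays_blobs = [
    b.name
    for b in container.list_blobs(name_starts_with="service-exports/")
    if f"/{today}/" in b.name and b.name.endswith(".gz")
]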

Empty 'folder' not removed in GCS

When I delete through the Console all files from a "folder" in a bucket, that folder is gone too since there is no such thing as directories - the whole path after the bucket is the key.
However, when I move these files programmatically (the copy & delete method) through the REST API, the folder remains, empty. I must therefore write additional logic to check for these and remove them explicitly.
Isn't that a bug in the REST API handling? I had expected the same behavior regardless of the method used.
It turns out that you can safely remove all objects whose names end with "/" if you don't need them once empty. Their "content" will not be deleted.
If you are using the Google Cloud Console, you must create a folder before uploading to it. That folder is therefore an explicit object that will remain even when empty. The same behavior apparently applies when uploading with tools like Cyberduck.
But if you upload the file using the REST API with its full path, i.e. bucket/folder/file, the folder is implied visually but is never actually created as an object. So when the file is removed, there is no folder left behind, since it wasn't there in the first place.
Since the expected behavior for my use case is to auto-remove empty folders, I just have a pre-processing routine that deletes all blobs whose names end with "/" (see the sketch below).
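A minimal sketch of such a routine, assuming the google-cloud-storage client library and a hypothetical helper name; it deletes only the placeholder objects whose names end with "/", so real content is untouched:

from google.cloud import storage

def delete_folder_placeholders(bucket_name: str, prefix: str = "") -> None:
    client = storage.Client()
    for blob in client.list_blobs(bucket_name, prefix=prefix):
        # Placeholder "folders" are just objects whose names end with "/".
        if blob.name.endswith("/"):
            blob.delete()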

How can I download all files from Amazon S3 that are in a folder that was created last month?

How can I download a folder from Amazon S3 that was created between last month and the present?
I have this code using boto:
import dateutil.parser

for key in bucket.list():
    if last_month < dateutil.parser.parse(key.last_modified).month:
        key.get_contents_to_filename(local_path + key.name)
The problem is that the loop takes a long time because it compares every file in the folder. I only want to compare the folder's timestamp.
If there's a way to do it using the AWS CLI, even better.
This is not possible.
Amazon S3 does not use directories. It is a flat storage structure.
To give the appearance of directories, paths are prepended to the Key (filename) of an Object (file).
For example, an object called cat.jpg stored in a directory called animals would actually have a Key (filename) of animals/cat.jpg.
Since the directories do not exist, there is no way to retrieve properties from a directory.
(BTW, there is a concept of "common prefixes" that can work like directories when referring to multiple files, but it still isn't an actual directory.)
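For illustration, here is a short sketch of listing those common prefixes with boto3, using hypothetical bucket and prefix names; note that a common prefix carries no timestamp or other properties, which is why a folder's "creation date" cannot be queried:

import boto3

s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="my-bucket", Prefix="exports/", Delimiter="/")
for cp in resp.get("CommonPrefixes", []):
    print(cp["Prefix"])  # e.g. exports/2021-06-01/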

Python GCS how to rename file within inner zip file?

Suppose I have a file hosted on GCS in a Python App Engine project. Unfortunately, this file's structure is something like:
outer.zip/
  - inner.zip/
    - vid_file
    - png_file
The problem is that the two files inside inner.zip do not have their extensions, and it's causing all sorts of trouble. How do I rename the files so that the structure looks like:
outer.zip/
  - inner.zip/
    - vid_file.mp4
    - png_file.png
so that the files inside inner.zip have their extensions?
I keep running into all sorts of limitations, since GCS doesn't allow file renaming, unarchiving, etc.
The files aren't terribly large.
P.S. I'm not very familiar with Python, so any code examples would be greatly appreciated, thanks!
There is absolutely no way to perform any alteration to GCS objects -- full stop. They are exactly the bunch of bytes you decided on at their birth (uninterpreted by GCS itself), and that is how they will stay.
The best you can do is create a new object that is almost like the original, except that it fixes the little errors and oopses you made when creating the original. Then you can overwrite (i.e. completely replace) the original with the new, improved version.
Hopefully it's a one-off terrible mistake you made just once and now want to fix, so it's not worth writing a program for it. Just download that GCS object, use normal tools to unzip it and unzip any further zipfiles it may contain, do the fixes on the filesystem with your favorite local tools, zip things up again, and upload/rewrite the final zip to your desired new GCS object -- phew, you're done.
Alex is right that objects are immutable, i.e., there is no editing in place. The only way to accomplish what you're talking about would be to download the current file, unzip it, update the files, re-zip them into a file with the same name, and upload it to GCS. GCS object overwrites are transactional, so the old content will be visible until the instant the upload completes. Doing it this way is obviously not very network-efficient, but at least it wouldn't leave periods of time when the object is invisible (as deleting and re-uploading would).
"Import zipfile" and you can unzip the file once it's downloaded into gcs storage.
I have code doing exactly this on a nightly basis from a cron job.
Ive never tried creating a zip file with GAE but the docs say you can do it.
https://docs.python.org/2/library/zipfile.html
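Putting the answers above together, here is a hedged sketch of the download / fix / re-upload flow, assuming the google-cloud-storage client library and hypothetical bucket and object names; it rewrites inner.zip so its entries get proper extensions, then overwrites the original outer.zip object:

import io
import zipfile
from google.cloud import storage

RENAMES = {"vid_file": "vid_file.mp4", "png_file": "png_file.png"}

client = storage.Client()
blob = client.bucket("my-bucket").blob("outer.zip")

outer_bytes = io.BytesIO(blob.download_as_bytes())
new_outer = io.BytesIO()

with zipfile.ZipFile(outer_bytes) as outer_in, \
     zipfile.ZipFile(new_outer, "w", zipfile.ZIP_DEFLATED) as outer_out:
    for name in outer_in.namelist():
        data = outer_in.read(name)
        if name == "inner.zip":
            # Rebuild inner.zip with renamed entries.
            new_inner = io.BytesIO()
            with zipfile.ZipFile(io.BytesIO(data)) as inner_in, \
                 zipfile.ZipFile(new_inner, "w", zipfile.ZIP_DEFLATED) as inner_out:
                for entry in inner_in.namelist():
                    inner_out.writestr(RENAMES.get(entry, entry), inner_in.read(entry))
            data = new_inner.getvalue()
        outer_out.writestr(name, data)

# The overwrite is transactional: readers see the old object until the upload completes.
blob.upload_from_string(new_outer.getvalue(), content_type="application/zip")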
