Read specific files with Spark (exact match only) - python

I need to read a specific set of text files using Spark:
df = spark.read.text(paths)
The problem is that Spark treats each path as either a directory or a file, which causes two issues:
If a file does not exist, it is silently skipped with just a warning in the log – "directory does not exist; was it deleted recently?"
Trying to list each path as a directory takes additional processing time.
From the documentation, Spark uses partition discovery by default (which means it must try to list the path as a directory), but it can also operate in a different mode – enabled by the recursiveFileLookup option.
What I want is a third mode where each of the provided paths is read as an exact-match file. Is there a way to achieve this without changing Spark's read method?
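One possible workaround (not a built-in Spark mode, just a sketch assuming that failing fast on missing files is the main concern): validate every path as an exact file through the Hadoop FileSystem API before handing the list to spark.read.text. The helper name read_exact_files is illustrative only, and paths is the same list as above.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def read_exact_files(spark, paths):
    # Resolve each path against the filesystem it belongs to and require
    # it to be an existing regular file, not a directory.
    hadoop_conf = spark._jsc.hadoopConfiguration()
    jvm = spark._jvm
    for p in paths:
        hpath = jvm.org.apache.hadoop.fs.Path(p)
        fs = hpath.getFileSystem(hadoop_conf)
        if not fs.exists(hpath) or not fs.getFileStatus(hpath).isFile():
            raise FileNotFoundError(f"Expected an existing file: {p}")
    # Spark still performs its own listing afterwards, but missing or
    # misspelled paths now fail loudly instead of being skipped.
    return spark.read.text(paths)

df = read_exact_files(spark, paths)

This does not remove the extra listing Spark performs, but it turns the silent skip into a hard error.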

Related

Is there an easy way to handle inconsistent file paths in blob storage?

I have a service that drops a bunch of .gz files into an Azure container on a daily cadence. I'm looking to pick these files up and convert the underlying txt/json into tables. The issue perplexing me is that the service adds two random-string prefix folders and a date folder to the path.
Here's an example file path:
container/service-exports/z633dbc1-3934-4cc3-ad29-e82c6e74f070/2022-07-12/42625mc4-47r6-4bgc-ac72-11092822dd81-9657628860/*.gz
I've thought of 3 possible solutions:
1) I don't necessarily need the data to persist. I could theoretically loop through each folder looking for .gz files, open and write them to an output file, and then go back through and delete the folders in the path.
2) Create some sort of checkpoint file that keeps track of each path per gzip, and then configure some way of comparing against the checkpoint file at runtime. Not sure how efficient this would be over time.
3) Use RegEx to look for random strings matching the pattern/length of the prefixes and then look for the current date folder. If the date isn't today, pass.
Am I missing a prebuilt library or function capable of simplifying this? I searched around but couldn't find any discussions on this type of problem.
You can do this using koalas.
import databricks.koalas as ks

path = "wasbs://container/service-exports/*/*/*.gz"
df = ks.read_csv(path, sep=",", header='infer')
This should work well if all the .gz files have the same columns; df will then contain the data from all the .gz files concatenated.
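If you only want the current day's exports (option 3 in the question, minus the RegEx), one hedged variation is to interpolate today's date into the glob and leave a wildcard for each random prefix folder; the container path here is the same placeholder as above.

import datetime
import databricks.koalas as ks

today = datetime.date.today().isoformat()  # matches the date folder, e.g. "2022-07-12"
path = f"wasbs://container/service-exports/*/{today}/*/*.gz"
df = ks.read_csv(path, sep=",", header='infer')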

Search for all Python files in PyCharm, excluding test files [duplicate]

I have a large number of auto-generated code files that are identifiable by having _pb2 in the file name.
When I search using PyCharm's Ctrl+Shift+F I can use a file mask. I would like, for instance, to find all Python files (*.py) that do not have _pb2 in their name. Is there a way to achieve that?
You can include and exclude files and directories by creating a Custom Scope that filters using a combination of filename wildcards.
Press Ctrl+Shift+F to open "Find in Path".
Create a new Custom Scope from the Scope option in that dialog.
Enter the pattern; for your specification it would be file[Project_Name]:*.py&&!file:*_pb2*
Afterwards, the search results are restricted to the Custom Scope.
Source at the JetBrains official site: "Scope configuration controls"

How to check duplicate files in Airflow

I have incoming files in 'source-bucket'.
After processing, I archive the files into another bucket, 'archive-bucket', under the current date-time folder.
eg:
gs://archive-bucket/module1/2021-06-25/source_file_20210622.csv
gs://archive-bucket/module1/2021-06-26/source_file_20210623.csv
gs://archive-bucket/module1/2021-06-27/source_file_20210624.csv
Every time I process a file, I want to check whether it has already been processed by checking if it is present in the archive folder.
duplicate_check = GoogleCloudStoragePrefixSensor(
    task_id=f'detect_duplicate_{task_name}',
    bucket=ARCHIVE_BUCKET,
    prefix=f'module1/{DATE}/{source_file_name}')
This approach only allows checking a particular date folder.
How can I check whether 'source_file_<>.csv' is already present under 'gs://archive-bucket/module1/<any date folder>/'?
If the file is present in any date folder in the archive path, I need to fail further processing.
How can that be achieved?
I do not think you can do it easily. You could probably play with the "delimiter" parameter https://github.com/googleapis/google-cloud-python/issues/920#issuecomment-230214336 to achieve something similar. Maybe you can even try to set the delimiter to be your file name and look for the "module1/" prefix. I am not sure about the efficiency of that, though.
The problem is that GCS is NOT a filesystem with folders. The "/" is just a convenience to treat object names as if they were directories, and the UI allows you to "browse" them that way, but GCS objects are not actually stored in subfolders - the whole name of the object is the only identifier, and there is nothing like "folders" in GCS. So you can only match the object names, and only matching by prefix is efficient. If you have a lot of files, any other kind of matching might be slow.
What I recommend instead is to keep a separate path where you create empty marker objects corresponding to the file names already processed, for example "processed/file.name", without any further structure. Then you can check for the presence of the file name there. This will be rather efficient (but might not be atomic, depending on how your processing looks).
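A minimal sketch of that marker-object idea, assuming the current Google provider package for Airflow; the "processed/" prefix, ARCHIVE_BUCKET, task_name and source_file_name are placeholders taken from the question, not a prescribed layout.

from airflow.exceptions import AirflowFailException
from airflow.operators.python import PythonOperator
from airflow.providers.google.cloud.hooks.gcs import GCSHook

def fail_if_already_processed(bucket, file_name):
    # Look for a flat marker object written after a successful run.
    hook = GCSHook()
    if hook.exists(bucket_name=bucket, object_name=f"processed/{file_name}"):
        raise AirflowFailException(f"{file_name} was already processed")

duplicate_check = PythonOperator(
    task_id=f'detect_duplicate_{task_name}',
    python_callable=fail_if_already_processed,
    op_kwargs={"bucket": ARCHIVE_BUCKET, "file_name": source_file_name},
)

The processing task would then write an empty "processed/<file_name>" object as its last step.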
From your requirement, what I understand is that you want to move files from the source bucket to another bucket once they are processed, and you want to make sure each file has been moved to the destination bucket successfully.
Two ways to do it:
1) Maintain a small SQL table and insert the path of each processed file with a state of "processed"; whenever the state is processed, move those files to the destination bucket. From this table you can always check which files have been processed and moved to the destination bucket.
2) Alternatively, if task1 is the task that processes the files, have task2 pass the processed files to a BashOperator; using gsutil you can move the files easily and also check in that script whether each one was pushed.
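A rough sketch of option 2, assuming gsutil is available on the worker: gsutil -q stat exits non-zero when nothing matches, and its wildcards can span the date folders, so the task fails (blocking downstream processing) only when a copy already exists somewhere under module1/. The file name below is a placeholder.

from airflow.operators.bash import BashOperator

duplicate_check = BashOperator(
    task_id='detect_duplicate',
    bash_command=(
        'if gsutil -q stat "gs://archive-bucket/module1/*/{{ params.file }}"; '
        'then echo "already archived" && exit 1; fi'
    ),
    params={'file': 'source_file_20210622.csv'},
)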

Empty 'folder' not removed in GCS

When I delete all files from a "folder" in a bucket through the Console, the folder is gone too, since there is no such thing as directories - the whole path after the bucket is the key.
However, when I move (copy & delete) these files programmatically through the REST API, the folder remains, empty. I must therefore write additional logic to check for these folders and remove them explicitly.
Isn't that a bug in the REST API handling? I expected the same behavior regardless of the method used.
It turns out that you can safely remove all objects ending with / if you don't need them once empty; their "contents" will not be deleted.
If you are using the Google Console, you must create a folder before uploading to it. That folder is therefore an explicit object that will remain even when empty. The same behavior apparently occurs when uploading with tools like Cyberduck.
But if you upload a file using the REST API and its full path, i.e. bucket/folder/file, the folder is implicit visually but it is not actually created as an object. So when the file is removed, there is no folder left behind, since it wasn't there in the first place.
Since the expected behavior for my use case is to auto-remove empty folders, I just have a pre-processing routine that deletes all blobs ending with /.
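For reference, a minimal sketch of such a routine using the google-cloud-storage client; the bucket name is a placeholder.

from google.cloud import storage

client = storage.Client()

# Placeholder "folders" are zero-byte objects whose names end with "/".
for blob in client.list_blobs("my-bucket"):
    if blob.name.endswith("/") and blob.size == 0:
        blob.delete()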

Having trouble ignoring files in Rsync?

Alright, I have two folders that need to be kept in sync, but certain files need to be ignored before the first upload.
To make sense of what I mean, let's say for example I have a folder called src and another folder called dest.
src contains settings.properties, some Python code, and a template properties file.
dest contains the same settings.properties and the same Python code, but the template properties file is populated during the sync process (done by a script that wraps the protocol).
Now, if I modify the Python code in dest, that code should be updated in the src folder, but the newly populated template.properties should be ignored.
I tried using excludes and includes, but I read that you can't use both because "includes takes precedence".
I am using Windows, and I currently use a Python script that formats the paths to the default "/cygdrive/C/" form, then populates the properties file, then runs rsync.
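One hedged way to wrap that rsync call from the Python script, assuming cygwin-style paths as described (the concrete paths are placeholders): exclude only the populated template file so everything else in dest, including edited Python code, flows back to src.

import subprocess

cmd = [
    "rsync", "-av",
    "--exclude=template.properties",  # skip the file populated per sync
    "/cygdrive/c/project/dest/",      # trailing slash: sync the contents
    "/cygdrive/c/project/src/",
]
subprocess.run(cmd, check=True)

On the include/exclude question: rsync applies filter rules in order and the first match wins, so combining --include and --exclude does work as long as the more specific --include rules come before the broader --exclude rules.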
