I want to use s3fs (based on fsspec) to access files on S3, mainly because of two neat features:
local caching of files to disk with checking if files change, i.e. a file gets redownloaded if the local and remote file differ
file version id support for versioned S3 buckets, i.e. the ability to open different versions of the same remote file based on their version id
I don't need this for high frequency use and the files don't change often. It is mainly for using unit/integration test data stored on S3, which changes only if tests and related test data get updated (versions!).
I got both of the above working separately just fine, but it seems I can't get the combination of the two working. That is, I want to be able to cache different versions of the same file locally. It seems that as soon as you use a filecache, the version id disambiguation is lost.
import fsspec

fs = fsspec.filesystem("filecache", target_protocol='s3', cache_storage='/tmp/aws', check_files=True, version_aware=True)
with fs.open("s3://my_bucket/my_file.txt", "r", version_id=version_id) as f:
    text = f.read()
No matter what version_id is, I always get the most recent file from S3, which is also the one that gets cached locally.
What I expect is that I always get the correct file version and the local cache either keeps separate files for each version (preferred) or just updates the local file whenever I request a version different from the cached one.
Is there a way I can achieve this with the current state of the libraries, or is this currently not possible? I am using s3fs and fsspec, both version 2022.3.0.
After checking with the developers, this combination does not seem to be possible with the current state of the libraries, since the hash of the target file is based on the filepath alone, disregarding any other kwargs such as version_id.
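As a workaround, one option is to handle the per-version caching manually with s3fs alone. The sketch below is only my own approach under that assumption: the cache directory layout and the open_cached_version helper are placeholders of mine, not part of s3fs or fsspec.

import os
import s3fs

def open_cached_version(path, version_id, cache_dir="/tmp/aws"):
    # Download a specific object version once and reuse the local copy.
    # The cache file name embeds the version id, so different versions of
    # the same key end up as separate local files.
    fs = s3fs.S3FileSystem(version_aware=True)
    local_path = os.path.join(cache_dir, "{}.{}".format(os.path.basename(path), version_id))
    if not os.path.exists(local_path):
        os.makedirs(cache_dir, exist_ok=True)
        with fs.open(path, "rb", version_id=version_id) as remote, open(local_path, "wb") as local:
            local.write(remote.read())
    return open(local_path, "rb")

with open_cached_version("s3://my_bucket/my_file.txt", version_id) as f:
    data = f.read()

This gives the "separate local file per version" behaviour, at the cost of bypassing the filecache machinery (and its check_files logic) entirely.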
I used the following code to read a shapefile from dbfs:
geopandas.read_file("file:/databricks/folderName/fileName.shp")
Unfortunately, I don't have access to do so, and I get the following error:
DriverError: dbfs:/databricks/folderName/fileName.shp: Permission denied
Any idea how to grant the access? The file exists there (I have permission to save a file there using dbutils, and I can also read a file from there using Spark, but I have no idea how to read a file using pyspark).
After adding those lines:
dbutils.fs.cp("/databricks/folderName/fileName.shp", "file:/tmp/fileName.shp", recurse = True)
geopandas.read_file("/tmp/fileName.shp")
...from the suggestion below, I get another error:
org.apache.spark.api.python.PythonSecurityException: Path 'file:/tmp/fileName.shp' uses an untrusted filesystem 'org.apache.hadoop.fs.LocalFileSystem', but your administrator has configured Spark to only allow trusted filesystems: (com.databricks.s3a.S3AFileSystem, shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystem, shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystem, com.databricks.adl.AdlFileSystem, shaded.databricks.V2_1_4.com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem, shaded.databricks.org.apache.hadoop.fs.azure.NativeAzureFileSystem, shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem)
GeoPandas doesn't know anything about DBFS - it works with local files. So you either need:
to use the DBFS FUSE mount to read the file from DBFS (but there are some limitations):
geopandas.read_file("/dbfs/databricks/folderName/fileName.shp")
or to use the dbutils.fs.cp command to copy the file from DBFS to the local filesystem and read it from there:
dbutils.fs.cp("/databricks/folderName/fileName.shp", "file:/tmp/fileName.shp", recurse = True)
geopandas.read_file("/tmp/fileName.shp")
P.S. But if the file is already copied to the driver node, then you just need to remove file: from the name.
Update after the updated question:
There are limitations on what can be done on AAD passthrough clusters, so if you want to copy a file from DBFS to the local filesystem, your administrator needs to change the cluster configuration as described in the troubleshooting documentation.
But the /dbfs way should work for passthrough clusters as well, although it requires at least DBR 7.3 (docs).
Okay, the answer is easier than I thought:
geopandas.read_file("/dbfs/databricks/folderName")
(the folder name, since it is a folder containing all the shapefile components)
Why should it be like that? Easy: enable file browsing on DBFS in the admin control panel ('Advanced' tab), click on the file you need, and you will get two possible paths to the file. One is dedicated to the Spark API and the other to the File API (which is what I needed).
:)
Short Explanation
Some CSV files arrive on a OneDrive server that is synced onto a machine where a script runs to read them and push them to BigQuery. While the script runs fine now, I want to run it only after all files added since the last push are completely synced (i.e. available offline) on that machine...
Long Explanation
So basically I use a local database for the sales history of our organization, which I also want to push to BigQuery to reflect real-time (lagged) info on dashboards and for other analyses, since a lot of data besides sales history resides there. Because the database is strictly on-premises and cannot be accessed outside the organization's network (so there is literally no way to link it to BigQuery!), I have some people there who periodically (every 1-2 hours) export day-to-time sales (sales from the start of the day until the time of export) from the database and upload them to OneDrive. OneDrive is synced on a machine where many other scripts are hosted (it's just convenient!), and I run a Python script there to read all the files, combine them, and push them to BigQuery. There are often duplicates, so it is necessary to read all the files, remove duplicates, and then push them to BigQuery, for which I use:
import os
import pandas as pd

files = [file for file in os.listdir(input_directory) if file.count('-') <= 1]
data = [pd.read_excel(input_directory + file) for file in files if file.endswith('.xlsx')]
all_data = pd.concat(data, ignore_index=True).drop_duplicates()
from google.oauth2 import service_account

def upload():
    all_data.to_gbq(project_id=project_id,
                    destination_table=table,
                    credentials=service_account.Credentials.from_service_account_file(
                        'credentials.json'),
                    progress_bar=True,
                    if_exists='replace')
What I am trying to do is update the BigQuery table only if there are new changes when the script runs, since they don't always have time to do the export.
My current approach is to write the length of the dataframe to a file at the end of the script:
with open("length.txt", "w") as f:
f.write(len(all_data))
and once all files are read into the dataframe, I use:
if len(all_data) > int(open("length.txt", "r").readlines()[0]):
    upload()
But doing this requires all files to be read into RAM. Reading so many files actually makes things a bit congested for other scripts on the machine (RAM-wise), so I would rather not read them all into RAM, as my current approach does.
I also tried accessing file attributes and building logic based on the modified date, but as soon as a new file is added, that changes even when the file is not fully downloaded on the machine. I also searched for a way to access the sync status of files and came across "Determine OneDrive Sync Status From Batch File", but that did not help. Any help improving this situation is appreciated!
We have similar workflows to this, where we load data from files into a database regularly via script. For us, once a file has been processed, we move it to a different directory as part of the Python script. This way, the script can load all data from all files in the directory, as it is definitely new data.
If the files are cumulative (contain old data as well as new data) and you therefore only want to load the rows that are new, this is where it gets tricky. You are definitely on the right track, as we use the modified date to ascertain whether a file has changed since we last processed it. In Python you can get this from the os library: os.path.getmtime(file_path).
This should give you the last date/time the file was changed in any way, for any operating system.
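As a rough sketch of using that to decide whether an upload is needed at all (the state-file name, the .xlsx filter, and the helper names are placeholders of mine, not from the question):

import os
import time

STATE_FILE = "last_run.txt"  # hypothetical file recording when the last upload ran

def newest_mtime(directory):
    # Most recent modification time of any .xlsx file in the input directory.
    paths = [os.path.join(directory, f) for f in os.listdir(directory) if f.endswith(".xlsx")]
    return max((os.path.getmtime(p) for p in paths), default=0.0)

def should_upload(directory):
    # True if any input file changed after the timestamp recorded on the last run.
    try:
        with open(STATE_FILE) as f:
            last_run = float(f.read().strip())
    except (IOError, ValueError):
        last_run = 0.0
    return newest_mtime(directory) > last_run

def record_run():
    # Call this after a successful upload.
    with open(STATE_FILE, "w") as f:
        f.write(str(time.time()))

Only the cheap mtime scan runs every time; the existing read/concat/upload step is reached only when something actually changed, so nothing is held in RAM on uneventful runs.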
I recommend just moving the files out of your folder of new files once they are loaded, to make it easier for your Python script to handle. I do not know much about OneDrive though, so I cannot help with that aspect.
Good luck!
I'm using Python 2.7 and the latest version of Jenkins.
When Jenkins is triggered, the results from a Python script are stored in pickle files for future use.
Whenever a build happens, I want to store that build's results from the Python script in pickle files. I'm able to store the results in pickle files.
How can I save every build's results under different file names so that I can access the files from a Python script?
Is it possible to use the build_number and job_name as the file name, or to append/prepend the build_number to the current file name?
Later, with another Python script, I should be able to access the pickle files of the current and previous successful builds (the last 2 successful builds) to compare the results.
For the solution, changes in the Jenkinsfile, the Python script, or both are welcome.
Thanks.
From a Python script you can use os.rename() to rename your file. Jenkins sets environment variables for a running job that give you the job name and build number.
Combining the two, you can do:
import os

job_name = os.environ.get('JOB_NAME', 'default_name')
build_number = os.environ.get('BUILD_NUMBER', 1)
new_output = "{}_{}.bin".format(job_name, build_number)
os.rename(original_output, new_output)  # original_output is the path of the pickle file your script wrote
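For the later comparison step, here is a minimal sketch of loading the two most recent result files by build number. The file-name pattern follows the rename above; assuming only successful builds produce a pickle file, these are the last two successful builds (that part is an assumption, not something Jenkins enforces):

import glob
import pickle
import re

def last_two_results(job_name):
    # Find files named '<job_name>_<build_number>.bin' and sort them numerically
    # by the build number embedded in the file name.
    paths = glob.glob("{}_*.bin".format(job_name))
    paths.sort(key=lambda p: int(re.search(r"_(\d+)\.bin$", p).group(1)))
    results = []
    for path in paths[-2:]:  # the two highest build numbers
        with open(path, "rb") as f:
            results.append(pickle.load(f))
    return results  # [previous_result, current_result] when two files exist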
I need to check whether a file that I have uploaded to my Google Drive account needs to be updated from the original file I have on my local hard drive. For this purpose, I would like to use a hash such as an md5sum, so the md5Checksum metadata on files stored on Drive seems to be the best fit (barring that, I'd look at modification timestamps, but that is less appealing given the burden of dealing with date/time conversion). So, using the code below, with a known fileId:
fileId = 'whatever'  # obtained from a previous upload using MediaFileUpload()
fi = drive_service.files().get(fileId=fileId).execute()
import pprint
pprint.pprint(fi)
md5Checksum = fi.get('md5Checksum')
print("\n{}\nmd5Checksum\n{}".format('-' * 80, md5Checksum))
it returns:
{u'id': u'whatever',
u'kind': u'drive#file',
u'mimeType': u'application/octet-stream',
u'name': u'the_name_of_that_file_I_uploaded'}
--------------------------------------------------------------------------------
md5Checksum
None
As indicated by https://stackoverflow.com/a/36623919/257924, the md5Checksum value will be available only for binary files. But since the mimeType shown above is application/octet-stream, shouldn't it be available? Isn't application/octet-stream equivalent to it being a "binary" file upload?
What I cannot do is simply download the file and run the md5sum Linux utility (or equivalent Python code) to check whether it is different, because that defeats the whole point of checking to see if the file actually needs to be updated (deleting and re-uploading, uploading a new version, etc.).
Update #1: This is not the same as Google Drive API 'md5Checksum' does not work on Python, as I am not using the list API to obtain the file id. I am just using the id found from a previous upload to retrieve metadata on that file, and I am hoping to see some type of hash value indicating its content, specifically for a file that I think (hope) is considered "binary".
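One thing worth checking, as an aside: with the Drive v3 API, files().get() returns only a small default set of fields (id, name, mimeType, kind, which is exactly what the output above shows), and other metadata such as md5Checksum has to be requested explicitly via the fields parameter. A minimal sketch of that request (whether it resolves the None above for this particular file is not something confirmed here):

# Ask for md5Checksum explicitly; v3 only returns a default subset of fields otherwise.
fi = drive_service.files().get(
    fileId=fileId,
    fields="id, name, mimeType, md5Checksum",
).execute()
print(fi.get('md5Checksum'))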
Suppose I have a file hosted on GCS in a Python App Engine project. Unfortunately, the file structure is something like:
outer.zip/
- inner.zip/
- vid_file
- png_file
The problem is that the two files inside inner.zip do not have their file extensions, and it's causing all sorts of trouble. How do I rename the files so that it looks like:
outer.zip/
- inner.zip/
- vid_file.mp4
- png_file.png
so that the files inside inner.zip have their extensions?
I keep running into all sorts of limitations, since GCS doesn't allow file renaming, unarchiving, etc.
The files aren't terribly large.
P.S. I'm not very familiar with Python, so any code examples would be greatly appreciated, thanks!
There is absolutely no way to perform any alteration to GCS objects -- full stop. They are exactly the bunch of bytes you decided at their birth (uninterpreted by GCS itself) and thus they will stay.
The best you can do is create a new object which is almost like the original, except it fixes the little errors and oopses you made when creating the original. Then you can overwrite (i.e. completely replace) the original with the new, improved version.
Hopefully it's a one-off mistake you made just once and now want to fix, so it's not worth writing a program for. Just download that GCS object, use normal tools to unzip it and unzip any further zipfiles it may contain, do the fixes on the filesystem with your favorite local tools, zip things up again, and upload/rewrite the final zip to your desired new GCS object -- phew, you're done.
Alex is right that objects are immutable, i.e., there is no editing in place. The only way to accomplish what you're talking about is to download the current file, unzip it, rename/fix the inner files, re-zip them into a file with the same name, and upload it to GCS. GCS object overwrites are transactional, so the old content will be visible until the instant the upload completes. Doing it this way is obviously not very network efficient, but at least it wouldn't leave periods of time when the object is invisible (as deleting and re-uploading would).
"Import zipfile" and you can unzip the file once it's downloaded into gcs storage.
I have code doing exactly this on a nightly basis from a cron job.
I've never tried creating a zip file with GAE, but the docs say you can do it.
https://docs.python.org/2/library/zipfile.html
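Putting the answers above together, here is a rough sketch of the download/fix/re-upload flow using the zipfile module and the google-cloud-storage client. The bucket name, object name, and the renaming rule for the inner entries are placeholders of mine, and an App Engine project may be using the legacy cloudstorage library instead, so treat this as a sketch of the approach rather than a drop-in implementation.

import io
import zipfile
from google.cloud import storage

BUCKET_NAME = "my-bucket"   # placeholder
OBJECT_NAME = "outer.zip"   # placeholder

def add_extension(name):
    # Assumed renaming rule: guess the extension from the entry name.
    if "vid" in name:
        return name + ".mp4"
    if "png" in name:
        return name + ".png"
    return name

def rewrite_inner_zip(inner_bytes):
    # Build a new inner.zip with every entry renamed by add_extension().
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(inner_bytes)) as src, \
         zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            dst.writestr(add_extension(info.filename), src.read(info.filename))
    return out.getvalue()

def fix_object():
    client = storage.Client()
    blob = client.bucket(BUCKET_NAME).blob(OBJECT_NAME)
    outer_bytes = blob.download_as_bytes()

    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(outer_bytes)) as src, \
         zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename.endswith("inner.zip"):
                data = rewrite_inner_zip(data)
            dst.writestr(info.filename, data)

    # Overwriting the object is transactional: readers see either the old or the new zip.
    blob.upload_from_string(out.getvalue(), content_type="application/zip")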