Deleting a file from Google Cloud using fsspec - python

I am currently using fsspec to read and write files to my Google Cloud Storage buckets with code like the following:
with fs.open(gcs_file_tmp, "rb") as fp:
    gcs_file_content = fp.read()
Now I want to remove a file from a GCS bucket, but I cannot find the right call for it. Reading the documentation here, there appear to be some rm-based functions as well as some delete functions, but I cannot get them to work by calling fs.rm(...) or similar.

fsspec.open() returns an fsspec.core.OpenFile instance, which has an fs property.
When you open a GCS file, fs is a GCSFileSystem.
When you open a local file, fs is a LocalFileSystem.
fsspec abstracts away the differences between file systems through the AbstractFileSystem class, so we can call its delete() method.
With this approach we have to pass the target file path again (the path property of the OpenFile can be used for that).
import fsspec

# for GCS
f = fsspec.open("gs://bucketHoge/fuga.txt")
f.fs.delete(f.path)

# for local
f = fsspec.open("/tmp/fuga.txt")
f.fs.delete(f.path)
I wonder if there is a better way...
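As a possible alternative (a minimal sketch, not from the answer above): you can also instantiate the filesystem directly with fsspec.filesystem() and call rm() on it, which is part of the AbstractFileSystem interface. The bucket and object names below are placeholders, and "gs" requires the gcsfs package to be installed.

import fsspec

# Sketch: get a GCSFileSystem directly and remove the object by path.
fs = fsspec.filesystem("gs")
fs.rm("bucketHoge/fuga.txt")                          # remove a single object
# fs.rm("bucketHoge/some-prefix/", recursive=True)    # remove everything under a prefix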

Related

Rename filename on Windows environment GitHub workflow

I was trying to process some data in a GitHub action.
However, due to a Japanese file name, I cannot read the file successfully with:
pd.read_csv('C:\\202204_10エリア計.csv')
So I was trying to rename it before reading it with:
for filename in os.listdir(download_path):
    if filename.startswith('202204'):
        filename = filename.encode('utf-8').decode(locale.getpreferredencoding(False))
        print(filename)  # this prints '202204_10エリア計.csv' on GitHub Actions
        os.rename(os.path.join(download_path, filename), os.path.join(download_path, '202204.csv'))
But I get this error:
[WinError 2] The system cannot find the file specified: 'C:\\202204_10エリア計.csv' -> 'C:\\202204.csv'
You should be able to sidestep the encoding/decoding entirely with pathlib:
from pathlib import Path
fn = next(p for p in Path(download_path).glob("*.csv") if p.name.startswith("202204"))
fn.rename(fn.with_stem("202204"))
This is a bit of a workaround to whatever the real issue is, however.
That said, I have never needed to meddle with the encoding when using os.path, and a quick search of the docs doesn't turn up anything, so you may be fine if you simply remove your encoding/decoding step. I would expect the os.path API to use the same internal representation throughout.
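For illustration, a minimal sketch of the original loop with the encode/decode step removed (download_path as in the question); this assumes os.listdir() already returns correctly decoded str names, which it does on a normal Windows runner:

import os

for filename in os.listdir(download_path):
    if filename.startswith('202204'):
        # No manual encoding round-trip: rename using the name exactly as listed.
        os.rename(os.path.join(download_path, filename),
                  os.path.join(download_path, '202204.csv'))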

Patching over local JSON file in unit testing

I have some Python code that loads in a local JSON file:
with open("/path/to/file.json") as f:
    json_str = f.read()

# Now do stuff with this JSON string
In testing, I want to patch that JSON file to be a JSON file located in my repo's test directory ("/path/to/repo/test/fake_file.json").
How can I go about doing that?
One other requirement is I actually have a version of "/path/to/file.json" locally, but I don't want to change it. I want it patched over at test time, and unpatched upon test completion.
Note: I use pytest, and it seems like the plug-in pyfakefs would do this. Sadly, I can't figure out how to get it to patch in another local file (from within my repo's test directory). I am open to solutions using vanilla Python 3.10+ and/or pyfakefs.
With pyfakefs, you can map real files into the fake file system. In your case, you can use add_real_file:
import os

def test_json(fs):
    fs.add_real_file("/path/to/repo/test/fake_file.json",
                     target_path="/path/to/file.json")
    assert os.path.exists("/path/to/file.json")
This will map your existing file into target_path in the fake file system (if target_path is not given, it will map it to the same location as the source file).
It does not matter if there is a real file at the same location, as the real file system will be ignored in the fake file system. If you read "/path/to/file.json" in your test code, it will actually read "/path/to/repo/test/fake_file.json" (mapped files are only read on demand).
Note that by default the file is mapped read only, so if you want to change it in your tested code, you have to set read_only=False in the mapping call. This will make the file in the fake file system writable, though writing to it will not touch the file in the real file system, of course.
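A minimal sketch of a writable mapping, assuming the same paths as above (read_only=False is the relevant add_real_file parameter; the rest is illustrative):

def test_json_writable(fs):
    # Map the repo fixture over the production path, but writable in the fake FS.
    fs.add_real_file("/path/to/repo/test/fake_file.json",
                     target_path="/path/to/file.json",
                     read_only=False)
    with open("/path/to/file.json", "a") as f:
        f.write("\n")  # modifies only the fake file system copy, not the real file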
Disclaimer:
I'm a contributor to pyfakefs.

Create new csv file in Google Cloud Storage from cloud function

First time working with Google Cloud Storage. Below I have a cloud function which is triggered whenever a csv file gets uploaded to my-folder inside my bucket. My goal is to create a new csv file in the same folder, read the contents of the uploaded csv and convert each line to a URL that will go into the newly created csv. Problem is I'm having trouble just creating the new csv in the first place, let alone actually writing to it.
My code:
import os.path
import csv
import sys
import json
from csv import reader, DictReader, DictWriter
from google.cloud import storage
from io import StringIO


def generate_urls(data, context):
    if context.event_type == 'google.storage.object.finalize':
        storage_client = storage.Client()
        bucket_name = data['bucket']
        bucket = storage_client.get_bucket(bucket_name)
        folder_name = 'my-folder'
        file_name = data['name']

        if not file_name.endswith('.csv'):
            return
These next few lines came from an example in GCP's GitHub repo. This is where I would expect the new csv to be created, but nothing happens.
        # Prepend 'URL_' to the uploaded file name for the name of the new csv
        destination = bucket.blob(bucket_name + '/' + file_name[:14] + 'URL_' + file_name[14:])
        destination.content_type = 'text/csv'
        sources = [bucket.get_blob(file_name)]
        destination.compose(sources)

        output = bucket_name + '/' + file_name[:14] + 'URL_' + file_name[14:]

        # Transform uploaded csv to string - this was recommended on a similar SO post,
        # not sure if this works or is the right approach...
        blob = bucket.blob(file_name)
        blob = blob.download_as_string()
        blob = blob.decode('utf-8')
        blob = StringIO(blob)
        input_csv = csv.reader(blob)
The next line is where I get the error: No such file or directory: 'myProjectId/my-folder/URL_my_file.csv'
        with open(output, 'w') as output_csv:
            csv_dict_reader = csv.DictReader(input_csv)
            csv_writer = csv.DictWriter(output_csv, fieldnames=['URL'], delimiter=',',
                                        quotechar='"', quoting=csv.QUOTE_ALL)
            csv_writer.writeheader()
            line_count = 0

            for row in csv_dict_reader:
                line_count += 1
                url = ''
                ...
                # code that converts each line
                ...
                csv_writer.writerow({'URL': url})

            print(f'Total rows: {line_count}')
If anyone has any suggestions on how I could get this to create the new csv and then write to it, it would be a huge help. Thank you!
I have a few questions about the code and the design of the solution:
As I understand it, on the one hand the cloud function is triggered by a finalize event (Google Cloud Storage Triggers); on the other hand, you would like to save the newly created file into the same bucket. If that succeeds, the appearance of a new object in that bucket will trigger another instance of your cloud function. Is that the intended behaviour? Is your cloud function ready for that?
In GCS there is no such thing as a folder. Thus in this code:
folder_name = 'my-folder'
file_name = data['name']
the first line is a bit redundant, unless you would like to use that variable and value for something else... and file_name gets the full object name including all prefixes (which you may think of as "folders"). A small example of this is shown below.
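For illustration only (the object name here is hypothetical): the event payload carries the full object name, and the prefix can be split off with standard string or posixpath helpers.

import posixpath

# Hypothetical payload for an upload of my-folder/report_2022.csv
data = {'bucket': 'my-bucket', 'name': 'my-folder/report_2022.csv'}

prefix, base_name = posixpath.split(data['name'])
print(prefix)      # 'my-folder'
print(base_name)   # 'report_2022.csv'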
The example you refer to - storage_compose_file.py - shows how a few objects in GCS can be composed into one. I am not sure that example is relevant to your case, unless you have some additional requirements.
Now, let's have a look at this snippet:
destination = bucket.blob(bucket_name + '/' + file_name[:14] + 'URL_' + file_name[14:])
destination.content_type = 'text/csv'
sources = [bucket.get_blob(file_name)]
destination.compose(sources)
a. bucket.blob is a factory constructor - see the API buckets description. I am not sure you really want to include the bucket_name as part of its argument...
b. sources becomes a list with only one element - a reference to the existing object in the GCS bucket.
c. destination.compose(sources) - is it an attempt to make a copy of the existing object? If successful, it may trigger another instance of your cloud function. If a renamed copy is really the goal, a simpler sketch is shown below.
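For illustration only (not part of the original answer): if the intention is simply to create a renamed copy of the uploaded object, Bucket.copy_blob can do that directly. The bucket and object names below are placeholders.

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket('my-bucket')                     # placeholder bucket name
source_blob = bucket.get_blob('my-folder/report_2022.csv')  # placeholder object name

# Copy within the same bucket under a new object name.
bucket.copy_blob(source_blob, bucket, new_name='my-folder/URL_report_2022.csv')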
About type changes
blob = bucket.blob(file_name)
blob = blob.download_as_string()
After the first line the blob variable has the type google.cloud.storage.blob.Blob; after the second, bytes. Python allows such rebinding... but do you really want it? Also note that the download_as_string method is deprecated - see the Blobs / Objects API; a sketch using the newer methods is shown below.
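For illustration (file_name as in the question), download_as_bytes and download_as_text are the currently documented replacements:

blob = bucket.blob(file_name)

# Newer, non-deprecated equivalents of download_as_string():
raw_bytes = blob.download_as_bytes()
text = blob.download_as_text()   # decodes using the blob's charset, defaulting to UTF-8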
About the output:
output = bucket_name + '/' + file_name[:14] + 'URL_' + file_name[14:]
with open(output, 'w') as output_csv:
Bear in mind that all of this happens inside the memory of the cloud function; it has nothing to do with GCS buckets or blobs. If you would like to use temporary files within cloud functions, you have to put them in the /tmp directory - see Write temporary files from Google Cloud Function. I would guess that this is why you get the error.
=> Coming to some suggestions.
You probably want to download the object into the cloud function's memory (into the /tmp directory), then process the source file and save the result next to it, and finally upload the result to another (not the source) bucket. If my assumptions are correct, I would suggest implementing those steps one by one and checking that you get the desired result at each step. A rough sketch of that flow is shown below.
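A rough, hedged sketch of that flow (the output bucket name, the process_csv helper, and the output object name are all placeholders, not part of the original answer):

from google.cloud import storage

def generate_urls(data, context):
    client = storage.Client()
    source_bucket = client.get_bucket(data['bucket'])
    object_name = data['name']

    if not object_name.endswith('.csv'):
        return

    # 1. Download the uploaded object into the function's writable /tmp directory.
    local_in = '/tmp/input.csv'
    local_out = '/tmp/output.csv'
    source_bucket.blob(object_name).download_to_filename(local_in)

    # 2. Process it locally (process_csv is a placeholder for your URL conversion).
    process_csv(local_in, local_out)

    # 3. Upload the result to a *different* bucket to avoid re-triggering this function.
    output_bucket = client.get_bucket('my-output-bucket')   # placeholder name
    output_bucket.blob('URL_' + object_name).upload_from_filename(local_out)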
You can save a csv to Google Cloud Storage in two ways.
Either you save it directly to GCS with the gcsfs package in "requirements.txt", or you write it to the container's /tmp folder and push it to the GCS bucket from there afterwards.
Use the power of the Python package "gcsfs"
gcsfs stands for "Google Cloud Storage File System". Add
gcsfs==2021.11.1
or another version to your "requirements.txt". You do not use the package name directly in the code; installing it simply allows you to save to Google Cloud Storage directly, with no interim /tmp step and no separate push to the GCS bucket needed. You can also store the file in a sub-directory.
You can save a dataframe for example with:
df.to_csv('gs://MY_BUCKET_NAME/MY_OUTPUT.csv')
or:
df.to_csv('gs://MY_BUCKET_NAME/MY_DIR/MY_OUTPUT.csv')
or use an environment variable defined in the first menu step when creating the Cloud Function:
from os import environ
df.to_csv(environ["CSV_OUTPUT_FILE_PATH"], index=False)
Not sure whether this is needed, but I saw an example where the gcsfs package is installed together with
fsspec==2021.11.1
and it will not hurt to add it. My tests of saving a small df to csv on GCS did not need that package, though. Since I am not sure about this helper module, here is a quote:
Purpose (of fsspec):
To produce a template or specification for a file-system interface,
that specific implementations should follow, so that applications
making use of them can rely on a common behaviour and not have to
worry about the specific internal implementation decisions with any
given backend. Many such implementations are included in this package,
or in sister projects such as s3fs and gcsfs.
In addition, if this is well-designed, then additional functionality,
such as a key-value store or FUSE mounting of the file-system
implementation may be available for all implementations "for free".
First in container's "/tmp", then push to GCS
Here is an example of how to do what the other answer suggests: store the file first in the container's /tmp (and only there; no other directory is writable) and then move it to a bucket of your choice. You can also save it to the bucket that stores the source code of the cloud function, contrary to the last sentence of the other answer (tested, works):
from os import path

import pandas as pd
from google.cloud import storage


# function `write_to_csv_file()` not used but might be helpful if no df is at hand:
# def write_to_csv_file(file_path, file_content, root):
#     """ Creates a file at runtime. """
#     file_path = path.join(root, file_path)
#
#     # If the file is binary, use 'wb' instead of 'w'
#     with open(file_path, 'w') as file:
#         file.write(file_content)


def push_to_gcs(file, bucket):
    """ Writes to Google Cloud Storage. """
    file_name = file.split('/')[-1]
    print(f"Pushing {file_name} to GCS...")
    blob = bucket.blob(file_name)
    blob.upload_from_filename(file)
    print(f"File pushed to {blob.id} successfully.")


# Root path on a Cloud Function will be /workspace, while on local Windows it is C:\
root = path.dirname(path.abspath(__file__))

file_name = 'test_export.csv'
# This is the main step: you *must* use `/tmp`:
file_path = '/tmp/' + file_name

d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
df.to_csv(path.join(root, file_path), index=False)

# If you have a df anyway, `df.to_csv()` is easier.
# The following file writer should rather be used if you have plain records instead
# (here: dfAsString). Since we do not use `write_to_csv_file()`, it is commented out
# above, but it can be useful if no df is at hand.
# dfAsString = df.to_string(header=True, index=False)
# write_to_csv_file(file_path, dfAsString, root)

# Cloud Storage Client: move the csv file to Cloud Storage
storage_client = storage.Client()
bucket_name = MY_GOOGLE_STORAGE_BUCKET_NAME
bucket = storage_client.get_bucket(bucket_name)
push_to_gcs(path.join(root, file_path), bucket)

How to write file to memory filepath and read from memory filepath in Python?

An existing Python package requires a filepath as an input parameter for a method so it can parse the file at that path. I want to use this specific package in a cloud environment where I can't write files to the hard drive. I don't have direct control over the code of the package, and it is not easy to switch to another environment where I could write to the hard drive. So I'm looking for a way to write a file to an in-memory filepath and let the parser read directly from that path. Is this possible in Python? Or are there other solutions?
Example Python code that works by using the hard drive, which should be changed so that no hard drive is used:
temp_filepath = "./temp.txt"
with open(temp_filepath, "wb") as file:
    file.write(b"some binary data")

model = Model()
model.parse(temp_filepath)
Example Python code that uses a memory filesystem to store the file, but which does not let the parser read the file from the memory filesystem:
from fs import open_fs

temp_filepath = "./temp.txt"
with open_fs('osfs://~/') as home_fs:
    home_fs.writetext(temp_filepath, "some binary data")

model = Model()
model.parse(temp_filepath)
You're probably looking for StringIO or BytesIO from io:
import io

content = b"some binary data"  # whatever you would otherwise have written to disk

with io.BytesIO() as tmp:
    tmp.write(content)
    # to continue working, rewind the file pointer
    tmp.seek(0)
    # work with tmp
pathlib may also be helpful.
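If the parser happens to accept a file-like object instead of a path, a minimal sketch along those lines (Model is the hypothetical class from the question, not a real library):

import io

content = b"some binary data"

buffer = io.BytesIO(content)
buffer.seek(0)

model = Model()       # hypothetical parser from the question
model.parse(buffer)   # only works if parse() accepts file-like objects, not just paths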

Render hg.tar into in-memory tree-structure after getting back from MongoDB

I am working on a project where I have to store .hg directories. The easiest way is to pack the .hg into hg.tar. I save it in MongoDB's GridFS filesystem.
If I go with this plan, I have to read the tar out.
import tarfile, cStringIO as io

repo = get_repo(saved_repo.id)
ios = io.StringIO()
ios.write(repo.hgfile.read())
ios.seek(0)

tar = tarfile.open(mode='r', fileobj=ios)
members = tar.getmembers()

# for info in members:
#     tar.extract(info.name, '/tmp')

for file in members:
    print file.name, file.isdir()
This is working code; I can get all the file and directory names as the loop runs.
My question is: how do I extract this tar into a valid, file-system-like directory structure? I can .extractfile() individual members into memory, but if I want to feed this into the Mercurial API, I probably need the entire directory, i.e. a single .hg directory in memory, laid out as it exists on the filesystem.
Thoughts?
Mercurial has a concept called opener that's used to abstract filesystem access. I first looked at http://hg.intevation.org/mercurial/crew/file/tip/mercurial/revlog.py to see if you can replace the revlog class (which is the base class for changelog, manifest log and filelogs), but recent versions of Mercurial also have a VFS abstraction layer. It can be found in http://hg.intevation.org/mercurial/crew/file/8c64c4af21a4/mercurial/scmutil.py#l202 and is used by the localrepo.localrepository class for all file access.
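As a rough illustration of the "in-memory tree" part (a Python 3 sketch, not tied to any Mercurial API): the tar members can be read into a plain dict keyed by path, which could then back whatever filesystem abstraction you plug into Mercurial.

import io
import tarfile

def tar_to_dict(tar_bytes):
    """Return a {path: bytes} mapping of all regular files in an in-memory tar."""
    tree = {}
    with tarfile.open(mode='r', fileobj=io.BytesIO(tar_bytes)) as tar:
        for member in tar.getmembers():
            if member.isfile():
                tree[member.name] = tar.extractfile(member).read()
    return tree

# e.g. tree = tar_to_dict(repo.hgfile.read())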
