How to deal with a gsutil URI not working all the time - Python

I am facing a small issue here that I can't explain.
On some occasions, I am able to open files from my Cloud Storage bucket using a gsutil URI. For instance, this one works fine:
df = pd.read_csv('gs://poker030120203/ouptut_test.csv')
But on other occasions, this method does not work and returns an error: FileNotFoundError: [Errno 2] No such file or directory
This happens for instance with the following codes
rank_table_filename = 'gs://poker030120203/rank_table.bin'
rank_table_file = open(rank_table_filename, "r")
preflop_table_filename = 'gs://poker030120203/preflop_table.npy'
self.preflop_table = np.load(preflop_table_filename)
I am not sure if this is related to the "open" or "load" method, or maybe the file type, but I can't figure out why this returns an error. I do not know if it matters, but I'm running everything from Vertex AI (i.e. the module that automatically sets up a storage bucket, a VM and a Jupyter notebook).
Thanks a lot for the help

pandas.read_csv understands gs:// URLs because it delegates to gcsfs/fsspec under the hood, but Python's built-in open() and numpy.load() only accept local paths, which is why they raise FileNotFoundError. To read and write files in Google Cloud Storage reliably, use the Google-recommended client libraries; it is easier to read / write anything from / in Google Cloud Storage with them.
Example from the documentation:
from google.cloud import storage

def write_read(bucket_name, blob_name):
    """Write and read a blob from GCS using file-like IO"""
    # The ID of your GCS bucket
    # bucket_name = "your-bucket-name"
    # The ID of your new GCS object
    # blob_name = "storage-object-name"

    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(blob_name)

    # Mode can be specified as wb/rb for bytes mode.
    # See: https://docs.python.org/3/library/io.html
    with blob.open("w") as f:
        f.write("Hello world")
    with blob.open("r") as f:
        print(f.read())
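For the two files from the question, a minimal sketch along the same lines (bucket and object names taken from the question; the bytes are read through the client because built-in open() and np.load() only accept local paths):

import io
import numpy as np
from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.bucket('poker030120203')

# Read the binary rank table through a file-like handle ("rb" = bytes mode).
with bucket.blob('rank_table.bin').open("rb") as rank_table_file:
    rank_table = rank_table_file.read()

# np.load accepts any seekable binary buffer, so wrap the downloaded bytes.
preflop_table = np.load(io.BytesIO(bucket.blob('preflop_table.npy').download_as_bytes()))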

Related

How to successfully use the GCS filesystem to read a JPG file [duplicate]

As the title indicates...
I have tried two ways and neither of them works:
First:
I want to programmatically talk to GCS in Python, such as reading gs://{bucketname}/{blobname} as a path or a file. The only thing I can find is the gsutil tool, but it seems meant for the command line rather than a Python application.
I found some code here: Accessing data in google cloud bucket, but I am still confused about how to retrieve it as the type I need. There is a JPG file in the bucket, and I want to download it for text detection; this will be deployed on a Google Cloud Function.
Second:
The download_as_bytes() method (link to the Blob documentation). I import the google.cloud.storage module and provide the GCP key, but an error is raised saying that Blob has no attribute download_as_bytes().
Is there anything else I haven't tried? Thank you!
For reference:
def text_detected(user_id):
    bucket = storage_client.bucket('img_platecapture')
    blob = bucket.blob({user_id})
    content = blob.download_as_bytes()
    image = vision.Image(content=content)  # insert a content
    response = vision_client.text_detection(image=image)
    if response.error.message:
        raise Exception(
            '{}\nFor more info on error messages, check: '
            'https://cloud.google.com/apis/design/errors'.format(
                response.error.message))

    img = Image.open(input_file)  # insert a path
    draw = ImageDraw.Draw(img)
    font = ImageFont.truetype("simsun.ttc", 18)
    for text in response.text_annotations[1::]:
        ocr = text.description
        draw.text((bound.vertices[0].x - 25, bound.vertices[0].y - 25), ocr, fill=(255, 0, 0), font=font)
        draw.polygon(
            [
                bound.vertices[0].x,
                bound.vertices[0].y,
                bound.vertices[1].x,
                bound.vertices[1].y,
                bound.vertices[2].x,
                bound.vertices[2].y,
                bound.vertices[3].x,
                bound.vertices[3].y,
            ],
            None,
            'yellow',
        )

    texts = response.text_annotations
    a = str(texts[0].description.split())
    b = re.sub(u"([^\u4e00-\u9fa5\u0030-u0039])", "", a)
    b1 = "".join(b)
    print("偵測到的地址為:", b1)  # "The detected address is:"
    return b1


# handler.add(MessageEvent, message=ImageMessage)
def handle_content_message(event):
    message_content = line_bot_api.get_message_content(event.message.id)
    user = line_bot_api.get_profile(event.source.user_id)
    data = b''
    for chunk in message_content.iter_content():
        data += chunk

    global bucket_name
    bucket_name = 'img_platecapture'
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(f'{user.user_id}.jpg')
    blob.upload_from_string(data)
    text_detected1 = text_detected(user.user_id)  #### Here's the problem
    line_bot_api.reply_message(
        event.reply_token,
        messages=TextSendMessage(
            text=text_detected1
        ))
Reference code (gcsfs/fsspec):
gcs = gcsfs.GCSFileSystem()
bucket = storage_client.bucket('img_platecapture')
blob = bucket.blob({user_id})
f = fsspec.open("gs://img_platecapture/{user_id}")
with f.open({user_id}, "rb") as fp:
    content = fp.read()
    image = vision.Image(content=content)
    response = vision_client.text_detection(image=image)
You can do that with the Cloud Storage Python client:
from google.cloud import storage

def download_blob(bucket_name, source_blob_name, destination_file_name):
    """Downloads a blob from the bucket."""
    # The ID of your GCS bucket
    # bucket_name = "your-bucket-name"
    # The ID of your GCS object
    # source_blob_name = "storage-object-name"
    # The path to which the file should be downloaded
    # destination_file_name = "local/path/to/file"

    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)

    # Construct a client side representation of a blob.
    # Note `Bucket.blob` differs from `Bucket.get_blob` as it doesn't retrieve
    # any content from Google Cloud Storage. As we don't need additional data,
    # using `Bucket.blob` is preferred here.
    blob = bucket.blob(source_blob_name)

    # blob.download_to_filename(destination_file_name)
    # blob.download_as_string()
    blob.download_as_bytes()

    print(
        "Downloaded storage object {} from bucket {} to local file {}.".format(
            source_blob_name, bucket_name, destination_file_name
        )
    )
You can use the following methods:
blob.download_to_filename(destination_file_name)
blob.download_as_string()
blob.download_as_bytes()
To be able to correctly use this library, you have to install the expected pip package in your virtual env.
Example of project structure :
my-project
requirements.txt
your_python_script.py
The requirements.txt file :
google-cloud-storage==2.6.0
Run the following command :
pip install -r requirements.txt
In your case, maybe the package was not installed correctly in your virtual env, or an older version is installed; that's why you could not access the download_as_bytes method.
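A quick way to check which version is actually active in the environment (a sketch; download_as_bytes only exists in reasonably recent releases of google-cloud-storage):

import google.cloud.storage as gcs

# Print the installed client library version and confirm the method is available.
print(gcs.__version__)
print(hasattr(gcs.Blob, "download_as_bytes"))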
I'd be using fsspec's GCS filesystem implementation instead.
https://github.com/fsspec/gcsfs/
>>> import gcsfs
>>> fs = gcsfs.GCSFileSystem(project='my-google-project')
>>> fs.ls('my-bucket')
['my-file.txt']
>>> with fs.open('my-bucket/my-file.txt', 'rb') as f:
... print(f.read())
b'Hello, world'
https://gcsfs.readthedocs.io/en/latest/#examples
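Applied to the question's use case, a minimal sketch (bucket and object naming taken from the question's code; the Vision client setup is assumed):

import gcsfs
from google.cloud import vision

fs = gcsfs.GCSFileSystem()
vision_client = vision.ImageAnnotatorClient()

def detect_text(user_id):
    # Read the uploaded JPG straight from the bucket as bytes.
    with fs.open(f"gs://img_platecapture/{user_id}.jpg", "rb") as fp:
        content = fp.read()
    image = vision.Image(content=content)
    response = vision_client.text_detection(image=image)
    return response.text_annotations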

Python - download entire directory from Google Cloud Storage with progress bar

I am downloading an entire directory from Google Cloud Storage using the Python code below:
from google.cloud import storage
from pathlib import Path

def download_blob():
    """Downloads a blob from the bucket."""
    # The ID of your GCS bucket
    bucket_name = "Bucket name"
    # The ID of your GCS object
    blob_name = input("Enter the folder name in " + bucket_name + " : ")

    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blobs = bucket.list_blobs(prefix=blob_name)  # Get list of files
    print('Downloading file')
    for blob in blobs:
        if blob.name.endswith("/"):
            continue
        file_split = blob.name.split("/")
        directory = "/".join(file_split[0:-1])
        Path(directory).mkdir(parents=True, exist_ok=True)
        blob.download_to_filename(blob.name)
    print('Download completed')

download_blob()
How can I show a progress bar after printing the line "Downloading file"?
I will assume that you have a Python library that is capable of showing a progress bar on the console/terminal where you plan to run this program.
What you can do at a coarse level is the following:
You have a list of blobs that are present in the specific Google Cloud bucket + prefix.
Each blob has a property named size, which tells you the number of bytes in that blob.
You can first sum up the total number of bytes across all the blobs, then start the download_to_filename loop, and every time a download completes, update the percentage complete in the progress bar (see the sketch below).
Alternatively, if you really want fine-grained progress, you probably need to use the start and end parameters of the download_to_filename method, which let you fetch a specific range of bytes. Refer to the documentation.
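A minimal sketch of the coarse approach, assuming tqdm as the progress bar library and reusing the bucket/prefix idea from the question's code:

from pathlib import Path
from google.cloud import storage
from tqdm import tqdm

def download_blobs_with_progress(bucket_name, prefix):
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)

    # Materialize the listing first so the total size is known up front.
    blobs = [b for b in bucket.list_blobs(prefix=prefix) if not b.name.endswith("/")]
    total_bytes = sum(b.size for b in blobs)

    with tqdm(total=total_bytes, unit="B", unit_scale=True, desc="Downloading") as bar:
        for blob in blobs:
            directory = "/".join(blob.name.split("/")[:-1])
            if directory:
                Path(directory).mkdir(parents=True, exist_ok=True)
            blob.download_to_filename(blob.name)
            bar.update(blob.size)  # coarse: advance by the whole blob once it finishes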

Can you temporarily copy a Google Cloud image file in Python?

For a project, I'm trying to get an uploaded image file, stored in a bucket. I'm trying to have Python save a copy temporarily, just to perform a few tasks on this file (read, decode and give the decoded file back as JSON). After this is done, the temp file needs to be deleted.
I'm using Python 3.8, if that helps at all.
If you want some snippets of what I tried, I'm happy to provide :)
#edit
So far, I tried just downloading the file from the bucket, which works. But I can't seem to figure out how to temporarily save it to just decode (I got an API that will decode the image and get data from that file). This is the code for downloading
def download_file_from_bucket(blob_name, file_path, bucket_name):
    try:
        bucket = storage_client.get_bucket(bucket_name)
        blob = bucket.blob(blob_name)
        with open(file_path, 'wb') as f:
            storage_client.download_blob_to_file(blob, f)
    except Exception as e:
        print(e)
        return False

bucket_name = 'white-cards-with-qr'
download_file_from_bucket('My first Blob Image', os.path.join(os.getcwd(), 'file2.jpg'), bucket_name)
For object stores in a cloud environment, you can also sign your object to give access to users who don't have an account for that object; you may read about signed URLs for Google Cloud.
You can use the tempfile library. This is a really basic snippet. You can also name the file or read it after writing it.
import tempfile

temp = tempfile.TemporaryFile()
try:
    temp.write(blob)  # here `blob` is assumed to be the downloaded bytes, e.g. from download_as_bytes()
finally:
    temp.close()
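Putting the two pieces together, a sketch that downloads the image into a named temporary file, lets you decode it, and removes it automatically when the with-block exits (bucket and blob names taken from the question; the decode step is whatever your API needs):

import tempfile
from google.cloud import storage

storage_client = storage.Client()

def process_blob_temporarily(bucket_name, blob_name):
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(blob_name)

    # The temporary file is deleted automatically when the with-block exits.
    with tempfile.NamedTemporaryFile(suffix=".jpg") as temp:
        blob.download_to_filename(temp.name)
        # ... open/decode temp.name here and build the JSON response ...

process_blob_temporarily('white-cards-with-qr', 'My first Blob Image')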

How to process a file located in Azure blob Storage using python with pandas read_fwf function

I need to open and work on data coming in a text file with python.
The file will be stored in the Azure Blob storage or Azure file share.
However, my question is: can I use the same modules and functions, like os.chdir() and read_fwf(), that I was using on Windows? The code I wanted to run:
import pandas as pd
import os
os.chdir( file_path)
df=pd.read_fwf(filename)
I want to be able to run this code where file_path is a directory in Azure Blob Storage.
Please let me know if it's possible. If you have a better idea where the file can be stored please share.
Thanks,
As far as I know, os.chdir(path) can only operate on local paths. If you want to copy files from storage to a local path, you can refer to the following code:
from azure.storage.blob import BlobServiceClient

connect_str = "<your-connection-string>"
blob_service_client = BlobServiceClient.from_connection_string(connect_str)
container_name = "<container-name>"
file_name = "<blob-name>"
container_client = blob_service_client.get_container_client(container_name)
blob_client = container_client.get_blob_client(file_name)
download_file_path = "<local-path>"

with open(download_file_path, "wb") as download_file:
    download_file.write(blob_client.download_blob().readall())
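After the download, the original code then works unchanged on the local copy (a sketch; download_file_path is the local path chosen above):

import pandas as pd

# Read the locally downloaded copy with the same function used on Windows.
df = pd.read_fwf(download_file_path)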
pandas.read_fwf can also read a blob directly from storage using its URL (with a SAS token):
For example:
url = "https://<your-account>.blob.core.windows.net/test/test.txt?<sas-token>"
df=pd.read_fwf(url)
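If you do not already have a SAS token, one way to generate it programmatically is with the azure-storage-blob package (a sketch; the account, key, container and blob names are placeholders):

from datetime import datetime, timedelta
import pandas as pd
from azure.storage.blob import BlobSasPermissions, generate_blob_sas

account_name = "<your-account>"
account_key = "<your-account-key>"
container_name = "test"
blob_name = "test.txt"

# Read-only token valid for one hour.
sas_token = generate_blob_sas(
    account_name=account_name,
    container_name=container_name,
    blob_name=blob_name,
    account_key=account_key,
    permission=BlobSasPermissions(read=True),
    expiry=datetime.utcnow() + timedelta(hours=1),
)

url = f"https://{account_name}.blob.core.windows.net/{container_name}/{blob_name}?{sas_token}"
df = pd.read_fwf(url)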

Subprocess CALL or python library for uploading to Google Cloud Storage?

I'm trying to write a script to upload files to Google Cloud Storage. I've noticed that there are two ways of doing this:
a) Using gsutil and calling it from python with subprocess
b) Using from google.cloud import storage with the "native" methods.
What are the advantages/disadvantages of each method?
Method (a) seems to be easier, but I don't know if there is any disadvantage compared to method (b).
Thanks!
Example of (a)
import subprocess

filename = 'myfile.csv'
gs_bucket = 'my/bucket'
parallel_threshold = '150M'  # minimum size for parallel upload; 0 to disable

subprocess.check_call([
    'gsutil',
    '-o', 'GSUtil:parallel_composite_upload_threshold=%s' % (parallel_threshold,),
    'cp', filename, 'gs://%s/%s' % (gs_bucket, filename)
])
Example of (b)
from google.cloud import storage

def upload_blob(bucket_name, source_file_name, destination_blob_name):
    """Uploads a file to the bucket."""
    # bucket_name = "your-bucket-name"
    # source_file_name = "local/path/to/file"
    # destination_blob_name = "storage-object-name"

    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(destination_blob_name)
    blob.upload_from_filename(source_file_name)

    print(
        "File {} uploaded to {}.".format(
            source_file_name, destination_blob_name
        )
    )
The bottom line is that you should just pick the method that best suits your preferences. If it works for you either way, then it's a matter of preference.
However, if you intend to run this code anywhere except for a machine that has gsutil correctly installed and configured, you will have problems. It becomes an external dependency, and you might not enjoy trying to set that up anywhere other than where it already works.
If you want to have an easier time moving this code around, the client library is more predictable and should run anywhere there is an internet connection, assuming you have service account credentials available to your code to initialize the SDK.
