How do you write a .feather file into GCS? - python

Previously I worked with .csv files, which were straightforward to upload to GCS.
For csv I would do the following, which works:
blob = bucket.blob(path)
blob.upload_from_string(dataframe.to_csv(), 'text/csv')
I am trying to do the same, i.e. write the dataframe as a .feather file to the bucket:
blob = bucket.blob(path)
blob.upload_from_string(dataframe.reset_index().to_feather(), 'text/feather')
However, this fails, saying that to_feather() requires a fname. Any suggestions/guidance on where I went wrong would be helpful.

upload_from_string works with the to_csv() method because its path parameter is optional: when no path is provided, the CSV is returned as a string. The to_feather() method, on the other hand, requires a path. So write the DataFrame to a local feather file first and then upload that file to GCS.
Refer to the code below:
from google.cloud import storage

bucket_name = "BUCKET-NAME"
source_file_name = "FILE PATH"            # local path for the intermediate feather file
destination_blob_name = "GCS Object Name"

# Write the DataFrame to a local feather file first
dataFrame.reset_index().to_feather(source_file_name)

# Then upload that file to the bucket
storage_client = storage.Client()
bucket = storage_client.bucket(bucket_name)
blob = bucket.blob(destination_blob_name)
blob.upload_from_filename(source_file_name)
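If you would rather skip the intermediate file, recent pandas versions accept a file-like object in to_feather(), so an in-memory variant is possible. This is a minimal sketch, assuming such a pandas version and the same placeholder bucket/object names as above:

import io
from google.cloud import storage

buffer = io.BytesIO()
dataframe.reset_index().to_feather(buffer)  # assumes a pandas version whose to_feather accepts file-like objects
buffer.seek(0)

storage_client = storage.Client()
bucket = storage_client.bucket("BUCKET-NAME")
blob = bucket.blob("GCS Object Name")
blob.upload_from_file(buffer, content_type="application/octet-stream")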

Related

How to deal with GSutil URI not working all the time

I am facing a little issue here that I can't explain.
On some occasions, I am able to open files from my cloud storage buckets using a GSutil URI. For instance, this one works fine:
df = pd.read_csv('gs://poker030120203/ouptut_test.csv')
But on some other occasions, this method does not work and returns the error FileNotFoundError: [Errno 2] No such file or directory.
This happens, for instance, with the following code:
rank_table_filename = 'gs://poker030120203/rank_table.bin'
rank_table_file = open(rank_table_filename, "r")
preflop_table_filename = 'gs://poker030120203/preflop_table.npy'
self.preflop_table = np.load(preflop_table_filename)
I am not sure if this is related to the open or load method, or maybe the file type, but I can't figure out why this returns an error. I do not know if it has an impact, but I'm running everything from Vertex AI (i.e. the module that automatically sets up a storage bucket, a VM and a Jupyter notebook).
Thanks a lot for the help
pd.read_csv handles gs:// URIs because pandas hands them off to the gcsfs/fsspec library, whereas Python's built-in open() and np.load() only understand local paths, which is why they raise FileNotFoundError. To read and write files in Google Cloud Storage, use the Google-recommended client libraries; they work for anything you read from or write to Cloud Storage.
Example from the docs:
from google.cloud import storage

def write_read(bucket_name, blob_name):
    """Write and read a blob from GCS using file-like IO"""
    # The ID of your GCS bucket
    # bucket_name = "your-bucket-name"
    # The ID of your new GCS object
    # blob_name = "storage-object-name"

    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(blob_name)

    # Mode can be specified as wb/rb for bytes mode.
    # See: https://docs.python.org/3/library/io.html
    with blob.open("w") as f:
        f.write("Hello world")

    with blob.open("r") as f:
        print(f.read())
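Applied to the files from the question (bucket and object names are taken from the question's code, so treat this as a sketch rather than a tested solution), the binary reads look like this:

import numpy as np
from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.bucket("poker030120203")

# Binary content needs "rb"; the built-in open() cannot resolve gs:// URIs,
# which is why open('gs://...') raised FileNotFoundError.
with bucket.blob("rank_table.bin").open("rb") as f:
    rank_table_bytes = f.read()

with bucket.blob("preflop_table.npy").open("rb") as f:
    preflop_table = np.load(f)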

How to successfully use GCS Filesystem to read a JPG File [duplicate]

As the topic indicates...
I have tried two ways and neither of them works:
First:
I want to programmatically talk to GCS in Python, such as reading gs://{bucketname}/{blobname} as a path or a file. The only thing I can find is the gsutil module; however, it seems to be meant for the command line rather than a Python application.
I found some code here (Accessing data in google cloud bucket), but I'm still confused about how to retrieve it as the type I need. There is a jpg file in the bucket, and I want to download it for text detection; this will be deployed on a Google Cloud Function.
Second:
The download_as_bytes() method (link to the Blob documentation): I import the google.cloud.storage module and provide the GCP key; however, an error is raised saying that Blob has no attribute download_as_bytes().
Is there anything else I haven't tried? Thank you!
For reference:
def text_detected(user_id):
    bucket = storage_client.bucket('img_platecapture')
    blob = bucket.blob({user_id})
    content = blob.download_as_bytes()
    image = vision.Image(content=content)  # insert a content
    response = vision_client.text_detection(image=image)
    if response.error.message:
        raise Exception(
            '{}\nFor more info on error messages, check: '
            'https://cloud.google.com/apis/design/errors'.format(
                response.error.message))

    img = Image.open(input_file)  # insert a path
    draw = ImageDraw.Draw(img)
    font = ImageFont.truetype("simsun.ttc", 18)
    for text in response.text_annotations[1::]:
        ocr = text.description
        draw.text((bound.vertices[0].x - 25, bound.vertices[0].y - 25), ocr, fill=(255, 0, 0), font=font)
        draw.polygon(
            [
                bound.vertices[0].x,
                bound.vertices[0].y,
                bound.vertices[1].x,
                bound.vertices[1].y,
                bound.vertices[2].x,
                bound.vertices[2].y,
                bound.vertices[3].x,
                bound.vertices[3].y,
            ],
            None,
            'yellow',
        )

    texts = response.text_annotations
    a = str(texts[0].description.split())
    b = re.sub(u"([^\u4e00-\u9fa5\u0030-\u0039])", "", a)
    b1 = "".join(b)
    print("偵測到的地址為:", b1)  # "The detected address is:"
    return b1

# handler.add(MessageEvent, message=ImageMessage)
def handle_content_message(event):
    message_content = line_bot_api.get_message_content(event.message.id)
    user = line_bot_api.get_profile(event.source.user_id)
    data = b''
    for chunk in message_content.iter_content():
        data += chunk

    global bucket_name
    bucket_name = 'img_platecapture'
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(f'{user.user_id}.jpg')
    blob.upload_from_string(data)
    text_detected1 = text_detected(user.user_id)  #### Here's the problem
    line_bot_api.reply_message(
        event.reply_token,
        messages=TextSendMessage(
            text=text_detected1
        ))
Reference code (gcsfs/fsspec):
gcs = gcsfs.GCSFileSystem()
bucket = storage_client.bucket('img_platecapture')
blob = bucket.blob({user_id})
f = fsspec.open("gs://img_platecapture/{user_id}")
with f.open({user_id}, "rb") as fp:
    content = fp.read()
    image = vision.Image(content=content)
    response = vision_client.text_detection(image=image)
You can do that with the Cloud Storage Python client:
from google.cloud import storage

def download_blob(bucket_name, source_blob_name, destination_file_name):
    """Downloads a blob from the bucket."""
    # The ID of your GCS bucket
    # bucket_name = "your-bucket-name"
    # The ID of your GCS object
    # source_blob_name = "storage-object-name"
    # The path to which the file should be downloaded
    # destination_file_name = "local/path/to/file"

    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)

    # Construct a client side representation of a blob.
    # Note `Bucket.blob` differs from `Bucket.get_blob` as it doesn't retrieve
    # any content from Google Cloud Storage. As we don't need additional data,
    # using `Bucket.blob` is preferred here.
    blob = bucket.blob(source_blob_name)

    # blob.download_to_filename(destination_file_name)
    # blob.download_as_string()
    blob.download_as_bytes()

    print(
        "Downloaded storage object {} from bucket {} to local file {}.".format(
            source_blob_name, bucket_name, destination_file_name
        )
    )
You can use the following methods:
blob.download_to_filename(destination_file_name)
blob.download_as_string()  # deprecated in recent versions in favour of download_as_bytes() / download_as_text()
blob.download_as_bytes()
To be able to correctly use this library, you have to install the expected pip package in your virtual env.
Example of project structure:
my-project
  requirements.txt
  your_python_script.py
The requirements.txt file:
google-cloud-storage==2.6.0
Run the following command:
pip install -r requirements.txt
In your case, maybe the package was not installed correctly in your virtual env, or an outdated version of google-cloud-storage is installed; that's why you could not access the download_as_bytes method.
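As a quick sanity check (a minimal sketch, assuming the standard google-cloud-storage package is what is installed), you can print the version seen by the same environment:

import google.cloud.storage
print(google.cloud.storage.__version__)  # an old version here would explain the missing download_as_bytes attribute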
I'd be using fsspec's GCS filesystem implementation instead.
https://github.com/fsspec/gcsfs/
>>> import gcsfs
>>> fs = gcsfs.GCSFileSystem(project='my-google-project')
>>> fs.ls('my-bucket')
['my-file.txt']
>>> with fs.open('my-bucket/my-file.txt', 'rb') as f:
... print(f.read())
b'Hello, world'
https://gcsfs.readthedocs.io/en/latest/#examples
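Applying the same idea to the question's code (a sketch only; user_id, vision and vision_client are assumed to be defined as in the question, and the bucket/object names are taken from it):

import gcsfs

fs = gcsfs.GCSFileSystem()
# Read the uploaded jpg straight from the bucket as bytes
with fs.open(f"img_platecapture/{user_id}.jpg", "rb") as fp:
    content = fp.read()

image = vision.Image(content=content)
response = vision_client.text_detection(image=image)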

read google bucket files using python

I have to read Google bucket files which are in xlsx format.
The file structure in the bucket looks like:
bucket_name
  folder_name_1
    file_name_1
  folder_name_2
  folder_name_3
    file_name_3
The Python snippet looks like:
def main():
    storage_client = storage.Client.from_service_account_json(
        Constants.GCP_CRENDENTIALS)
    bucket = storage_client.bucket(Constants.GCP_BUCKET_NAME)
    blob = bucket.blob(folder_name_2 + '/' + Constants.GCP_FILE_NAME)
    data_bytes = blob.download_as_bytes()
    df = pd.read_excel(data_bytes, engine='openpyxl')
    print(df)

def function1():
    print("no file in the folder")  # sample error
In the above snippet, I'm trying to open folder_name_2, and it returns an error because there's no file to read.
Instead of throwing an error, I need to use function1 to print the error whenever there's no file in a folder.
Any ideas on how to do this?
I'm not familiar with the GCP API, but you're going to want to do something along the lines of this:
try:
    blob = bucket.blob(folder_name_2 + '/' + Constants.GCP_FILE_NAME)
    data_bytes = blob.download_as_bytes()
except Exception as e:
    print(e)
https://docs.python.org/3/tutorial/errors.html#handling-exceptions
I'm not sure I understand what your final goal is, but another approach is to list the available resources in the bucket and process them.
First, let's define a function that lists the available resources in a bucket. You can add a prefix if you want to limit the listing to a subfolder inside the bucket.
def list_resource(client, bucket_name, prefix=''):
    path_files = []
    for blob in client.list_blobs(bucket_name, prefix=prefix):
        path_files.append(blob.name)
    return path_files
Now you can process your xlsx files:
for resource in list_resource(storage_client, Constants.GCP_BUCKET_NAME):
    if '.xlsx' in resource:
        print(resource)
        # Load the blob and process your xlsx file
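Putting the two answers together for the question's setup (a hedged sketch; Constants, function1 and the folder name come from the question's own code):

import io
import pandas as pd
from google.cloud import storage

def main():
    storage_client = storage.Client.from_service_account_json(Constants.GCP_CRENDENTIALS)

    # List only the xlsx blobs under the folder instead of guessing a file name
    xlsx_blobs = [
        blob for blob in storage_client.list_blobs(Constants.GCP_BUCKET_NAME, prefix='folder_name_2/')
        if blob.name.endswith('.xlsx')
    ]

    if not xlsx_blobs:
        function1()  # prints "no file in the folder" instead of raising
        return

    for blob in xlsx_blobs:
        df = pd.read_excel(io.BytesIO(blob.download_as_bytes()), engine='openpyxl')
        print(df)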

python - read csv from s3 and identify its encoding info for pandas dataframe

I am making a service to download csv files from an S3 bucket.
The bucket contains csv files with various encodings (which I may not know beforehand), since users are uploading these files.
This is what I am trying:
...
obj = s3c.get_object(Bucket=BUCKET_NAME, Key=KEY)
content = io.BytesIO(obj['Body'].read())
df_s3_file = pd.read_csv(content)
...
This works fine for utf-8; however, for other encodings it fails (obviously!).
I have found some independent code which can help me identify the encoding of a csv file on a network drive.
It looks like this:
...
def find_encoding(fname):
    r_file = open(fname, 'rb').read()
    result = chardet.detect(r_file)
    charenc = result['encoding']
    return charenc

my_encoding = find_encoding(content)
print('detected csv encoding: ', my_encoding)
df_s3_file = pd.read_csv(content, encoding=my_encoding)
...
This snippet works absolutely fine for a file on a (local) drive, but how do I do this for a file in an S3 bucket, since I am reading the S3 file as an io.BytesIO object?
I think if I write the file to a drive and then execute the function find_encoding, it's going to work, since that function takes a csv file path as input as opposed to a BytesIO object.
Is there a way to do this in memory, without having to download the file to a drive?
Note: the file sizes are not very big (<10 MB).
According to the docs, s3c.get_object(Bucket=BUCKET_NAME, Key=KEY) returns a dict where one of the keys is ContentEncoding, so I would try:
obj = s3c.get_object(Bucket=BUCKET_NAME, Key=KEY)
print(obj["ContentEncoding"])
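If that metadata is missing or not useful, a hedged alternative, along the lines the question already suggests, is to run chardet directly on the in-memory bytes (assuming s3c, BUCKET_NAME and KEY as in the question):

import io
import chardet
import pandas as pd

obj = s3c.get_object(Bucket=BUCKET_NAME, Key=KEY)
raw_bytes = obj['Body'].read()           # keep the raw bytes in memory

detected = chardet.detect(raw_bytes)     # chardet works on bytes, no local file needed
encoding = detected['encoding'] or 'utf-8'

df_s3_file = pd.read_csv(io.BytesIO(raw_bytes), encoding=encoding)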

how do I write a list of data to S3 in ORC format

I need to write a file in ORC format directly to an S3 bucket. The file will be the result of a query to a database.
I know how to write a CSV file directly to S3, but I couldn't find a way to write directly in ORC. Any recommendations?
Save the ORC content to a file, using default values as per the linked documentation, since there is no code sample to work with:
df = spark.read.load("examples/src/main/resources/users.parquet")
df.select("name", "favorite_color").write.format("orc").save("namesAndFavColors.orc")
Upload the file:
import boto3
# Create an S3 client
s3 = boto3.client('s3')
filename = 'file.txt'
bucket_name = 'my-bucket'
# Uploads the given file using a managed uploader, which will split up large
# files automatically and upload parts in parallel.
s3.upload_file(filename, bucket_name, filename)
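If you want to skip the local file entirely, one option (a sketch under assumptions: pyarrow >= 4.0 for pyarrow.orc.write_table, and hypothetical bucket/key names) is to serialize the ORC bytes into an in-memory buffer with pyarrow and push them to S3 with put_object:

import boto3
import pandas as pd
import pyarrow as pa
import pyarrow.orc as orc

# Hypothetical example data and destination; replace with your query result, bucket and key
df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})
bucket_name = "my-bucket"
key = "exports/result.orc"

# Serialize the DataFrame to ORC in memory instead of on disk
table = pa.Table.from_pandas(df)
buf = pa.BufferOutputStream()
orc.write_table(table, buf)

s3 = boto3.client("s3")
s3.put_object(Bucket=bucket_name, Key=key, Body=buf.getvalue().to_pybytes())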
