Python upload to S3: Stuck on put_object() call

I want to send an image to an S3 storage (MinIO).
Here is some example code:
import cv2
import argparse
import boto3
from botocore.client import Config
import logging

logging.basicConfig(level=logging.INFO)

parser = argparse.ArgumentParser(description='Process some integers.')
parser.add_argument('--s3_endpoint', type=str, default='http://127.0.0.1:9000')
parser.add_argument('--s3_access_key', type=str, default='minioadmin')
parser.add_argument('--s3_secret_key', type=str, default='minioadmin')
parser.add_argument('--image_bucket', type=str, default='my_bucket')
args = parser.parse_args()

s3 = boto3.client(
    "s3",
    use_ssl=False,
    endpoint_url=args.s3_endpoint,
    aws_access_key_id=args.s3_access_key,
    aws_secret_access_key=args.s3_secret_key,
    region_name='us-east-1',
    config=Config(s3={'addressing_style': 'path'})
)

def upload_image_s3(bucket, image_name, image):
    logging.info("Uploading image " + image_name + " to s3 bucket " + bucket)
    body = cv2.imencode('.jpg', image)[1].tobytes()
    s3.put_object(Bucket=bucket, Key=image_name, Body=body)
    logging.info("Image uploaded!")

upload_image_s3(args.image_bucket, 'test/image.jpg', cv2.imread("resources/images/my_image.jpg"))
But when I run it, I get stuck on the put_object() call forever (well, until the timeout, to be exact).
For my test, I run a MinIO server locally on Windows using the default configuration.
Do you have any idea what the problem is in my case?

I was having the same issue; my mistake was incorrect credentials when connecting to MinIO. When I saw that you just changed the host to localhost, I reviewed all my code and realized that it was just a typo in my credentials.
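If you hit the same symptom, a quick fail-fast check against the endpoint can rule out credential or endpoint typos before the upload call. A minimal sketch, assuming the same local MinIO defaults as in the question:

import boto3
from botocore.client import Config

# Assumption: local MinIO with the default credentials used in the question.
s3 = boto3.client(
    "s3",
    endpoint_url="http://127.0.0.1:9000",
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
    region_name="us-east-1",
    config=Config(
        s3={"addressing_style": "path"},
        connect_timeout=5,
        read_timeout=5,
        retries={"max_attempts": 1},
    ),
)

# Fails quickly with an explicit error if the endpoint or credentials are wrong,
# instead of hanging inside put_object().
print(s3.list_buckets()["Buckets"])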

Related

S3 boto3 refuses to overwrite endpoint URL

I'm working on an internal S3 service (not the AWS one). When I provide hard-coded credentials, a region, and an endpoint_url, boto3 seems to ignore them.
I came to that conclusion because it attempts to go out to the internet (using a public AWS endpoint URL instead of the internal one I provided) and fails with the proxy error below. But it should not go out to the internet, since it is an internal S3 service:
botocore.exceptions.ProxyConnectionError: Failed to connect to proxy URL: "http://my_company_proxy"
Here is my code
import io
import os
import boto3
import pandas as pd

# Method 1: Client #########################################
s3_client = boto3.client(
    's3',
    region_name='EU-WEST-1',
    aws_access_key_id='xxx',
    aws_secret_access_key='zzz',
    endpoint_url='https://my_company_enpoint_url'
)
# ==> at this point no error, but I don't know the value of endpoint_url

# Read bucket
bucket = "bkt-udt-arch"
file_name = "banking.csv"

print("debug 1")  # printed OK
obj = s3_client.get_object(Bucket=bucket, Key=file_name)
# program stops here:
# botocore.exceptions.ProxyConnectionError: Failed to connect to proxy URL: "http://my_company_proxy"

print("debug 2")  # not printed
initial_df = pd.read_csv(obj['Body'])  # 'Body' is a key word
print("debug 3")

# Method 2: Resource #########################################
# use third party object storage
s3 = boto3.resource(
    's3',
    endpoint_url='https://my_company_enpoint_url',
    aws_access_key_id='xxx',
    aws_secret_access_key='zzz',
    region_name='EU-WEST-1'
)
print("debug 4")  # printed OK if method 1 is commented out

# Print out bucket names
for bucket in s3.buckets.all():
    print(bucket.name)
Thank you for the review
It was indeed a proxy problem: when the http_proxy env variable is disabled, it works fine.
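A minimal sketch of one way to do that from inside the script, assuming the proxy is only configured through the usual environment variables (the endpoint name is the placeholder from the question):

import os
import boto3

# Assumption: the corporate proxy is set via environment variables only.
# Either clear them for this process...
for var in ("http_proxy", "https_proxy", "HTTP_PROXY", "HTTPS_PROXY"):
    os.environ.pop(var, None)

# ...or exempt just the internal endpoint so other traffic keeps using the proxy.
os.environ["NO_PROXY"] = "my_company_enpoint_url"

s3_client = boto3.client(
    's3',
    region_name='EU-WEST-1',
    aws_access_key_id='xxx',
    aws_secret_access_key='zzz',
    endpoint_url='https://my_company_enpoint_url'
)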

Using Textract for OCR locally

I want to extract text from images using Python. (The Tesseract library does not work for me because it requires installation.)
I have found the boto3 library and Textract, but I'm having trouble working with them. I'm still new to this. Can you tell me what I need to do in order to run my script correctly?
This is my code:
import cv2
import boto3
import textract

#img = cv2.imread('slika2.jpg')  # this is a jpg file
with open('slika2.pdf', 'rb') as document:
    img = bytearray(document.read())

textract = boto3.client('textract', region_name='us-west-2')
response = textract.detect_document_text(Document={'Bytes': img})  # gives me an error
print(response)
When I run this code, I get:
botocore.exceptions.ClientError: An error occurred (InvalidSignatureException) when calling the DetectDocumentText operation: The request signature we calculated does not match the signature you provided. Check your AWS Secret Access Key and signing method. Consult the service documentation for details.
I have also tried this:
# Document
documentName = "slika2.jpg"

# Read document content
with open(documentName, 'rb') as document:
    imageBytes = bytearray(document.read())

# Amazon Textract client
textract = boto3.client('textract', region_name='us-west-2')

# Call Amazon Textract
response = textract.detect_document_text(Document={'Bytes': imageBytes})  # ERROR

#print(response)

# Print detected text
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        print('\033[94m' + item["Text"] + '\033[0m')
But I get this error:
botocore.exceptions.ClientError: An error occurred (InvalidSignatureException) when calling the DetectDocumentText operation: The request signature we calculated does not match the signature you provided. Check your AWS Secret Access Key and signing method. Consult the service documentation for details.
I'm a noob at this, so any help would be good. How can I read the text from my image or PDF file?
I have also added this block of code, but I still get the error 'Unable to locate credentials':
session = boto3.Session(
    aws_access_key_id='xxxxxxxxxxxx',
    aws_secret_access_key='yyyyyyyyyyyyyyyyyyyyy'
)
There is a problem with how the credentials are passed to boto3. You have to pass the credentials when creating the boto3 client.
import boto3

# boto3 client
client = boto3.client(
    'textract',
    region_name='us-west-2',
    aws_access_key_id='xxxxxxx',
    aws_secret_access_key='xxxxxxx'
)

# Read image
with open('slika2.png', 'rb') as document:
    img = bytearray(document.read())

# Call Amazon Textract
response = client.detect_document_text(
    Document={'Bytes': img}
)

# Print detected text
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        print('\033[94m' + item["Text"] + '\033[0m')
Do note that it is not recommended to hardcode AWS keys in code. Please refer to this document:
https://boto3.amazonaws.com/v1/documentation/api/1.9.42/guide/configuration.html
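For example, one common alternative covered by that guide is to let boto3 pick the keys up from environment variables (or the shared ~/.aws/credentials file) instead of the source code. A minimal sketch, with placeholder values:

import boto3

# Assumption: the credentials were exported before running the script, e.g.
#   export AWS_ACCESS_KEY_ID=xxxxxxx
#   export AWS_SECRET_ACCESS_KEY=xxxxxxx
#   export AWS_DEFAULT_REGION=us-west-2
# boto3 then finds them automatically, so nothing is hardcoded here.
client = boto3.client('textract')

with open('slika2.png', 'rb') as document:
    img = bytearray(document.read())

response = client.detect_document_text(Document={'Bytes': img})
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        print(item["Text"])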

How to write parquet file to ECS in Flask python using boto or boto3

I have a Flask Python REST API which is called by another Flask REST API.
The input for my API is one parquet file (a FileStorage object) plus the ECS connection and bucket details.
I want to save the parquet file to ECS in a specific folder using boto or boto3.
The code I have tried:
def uploadFileToGivenBucket(self, inputData, file):
    BucketName = inputData.ecsbucketname
    calling_format = OrdinaryCallingFormat()
    client = S3Connection(inputData.access_key_id, inputData.secret_key, port=inputData.ecsport,
                          host=inputData.ecsEndpoint, debug=2,
                          calling_format=calling_format)
    #client.upload_file(BucketName, inputData.filename, inputData.folderpath)
    bucket = client.get_bucket(BucketName, validate=False)
    key = boto.s3.key.Key(bucket, inputData.filename)
    fileName = NamedTemporaryFile(delete=False, suffix=".parquet")
    file.save(fileName)
    with open(fileName.name) as f:
        key.send_file(f)
But it is not working and gives me an error like this:
signature_host = '%s:%d' % (self.host, port)
TypeError: %d format: a number is required, not str
I tried Google but had no luck. Can anyone help me with this, or share some sample code for the same?
After a lot of trial and error and time, I finally got the solution. I'm posting it for everyone else who is facing the same issue.
You need to use boto3, and here is the code:
import logging
import boto3
from tempfile import NamedTemporaryFile
from botocore.exceptions import ClientError

def uploadFileToGivenBucket(self, inputData, file):
    BucketName = inputData.ecsbucketname
    #bucket = client.get_bucket(BucketName, validate=False)
    f = NamedTemporaryFile(delete=False, suffix=".parquet")
    file.save(f)
    endpointurl = "<your endpoints>"
    s3_client = boto3.client('s3', endpoint_url=endpointurl,
                             aws_access_key_id=inputData.access_key_id,
                             aws_secret_access_key=inputData.secret_key)
    try:
        newkey = 'yourfolderpath/anotherfolder/' + inputData.filename
        response = s3_client.upload_file(f.name, BucketName, newkey)
    except ClientError as e:
        logging.error(e)
        return False
    return True
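If you would rather skip the temporary file, the incoming FileStorage stream can be handed straight to boto3. A hedged sketch under the same assumptions about inputData (upload_fileobj accepts any file-like object):

import logging
import boto3
from botocore.exceptions import ClientError

def uploadFileObjToGivenBucket(inputData, file):
    # Assumption: inputData carries the same fields as in the answer above.
    s3_client = boto3.client('s3', endpoint_url="<your endpoints>",
                             aws_access_key_id=inputData.access_key_id,
                             aws_secret_access_key=inputData.secret_key)
    newkey = 'yourfolderpath/anotherfolder/' + inputData.filename
    try:
        # Stream the uploaded parquet file directly from the request,
        # without writing it to disk first.
        file.stream.seek(0)
        s3_client.upload_fileobj(file.stream, inputData.ecsbucketname, newkey)
    except ClientError as e:
        logging.error(e)
        return False
    return True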

Where does API key go in Google Cloud Vision API?

I want to use Google's Cloud Vision API for OCR. Using the Python sample code here, we have:
def detect_text(path):
    """Detects text in the file."""
    client = vision.ImageAnnotatorClient()
    with io.open(path, 'rb') as image_file:
        content = image_file.read()
    image = types.Image(content=content)
    response = client.text_detection(image=image)
    texts = response.text_annotations
    print('Texts:')
    for text in texts:
        print('\n"{}"'.format(text.description))
        vertices = (['({},{})'.format(vertex.x, vertex.y)
                     for vertex in text.bounding_poly.vertices])
        print('bounds: {}'.format(','.join(vertices)))
Where do I put my API key? I (obviously) can't authenticate without it.
From the docs,
If you plan to use a service account with client library code, you need to set an environment variable.
Provide the credentials to your application code by setting the environment variable GOOGLE_APPLICATION_CREDENTIALS. Replace [PATH] with the location of the JSON file you downloaded in the previous step.
For example:
export GOOGLE_APPLICATION_CREDENTIALS="/home/user/Downloads/service-account-file.json"
So it looks like you should create a service account, download a credentials file, and set an environment variable to point to it.
There are two ways in which you can authenticate.
Exporting the path to the credentials file as an environment variable. Here is a sample:
import os
from google.cloud import vision

def get_text_from_image(image_file):
    os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = "./creds/xxxx-xxxxx.json"
    try:
        # process_image is a method to convert numpy array to bytestream
        # (not of interest in this context hence not including it here)
        byte_img = process_image_to_bytes(image_file)
        client = vision.ImageAnnotatorClient()
        image = vision.Image(content=byte_img)
        response = client.text_detection(image=image)
        texts = response.text_annotations
        return texts
    except BaseException as e:
        print(str(e))
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] is doing the work here
Using oauth2:
from google.cloud import vision
from google.oauth2 import service_account

def get_text_from_image(image_file):
    creds = service_account.Credentials.from_service_account_file("./creds/xxx-xxxxx.json")
    try:
        # process_image is a method to convert numpy array to bytestream
        # (not of interest in this context hence not including it here)
        byte_img = process_image_to_bytes(image_file)
        client = vision.ImageAnnotatorClient(credentials=creds)
        image = vision.Image(content=byte_img)
        response = client.text_detection(image=image)
        texts = response.text_annotations
        return texts
    except BaseException as e:
        print(str(e))
Here we use the google-auth library to create a credentials object from the JSON credentials file and pass that object to ImageAnnotatorClient for authentication.
Hope these sample snippets helped you.
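If you specifically want to use an API key rather than a service account, newer versions of the client library can also take one through client_options. This is a hedged sketch, not the official sample; check that your installed google-cloud-vision version actually supports the api_key option:

import io
from google.cloud import vision

def detect_text_with_api_key(path, api_key):
    # Assumption: a recent google-cloud-vision release that accepts an API key
    # via client_options instead of service-account credentials.
    client = vision.ImageAnnotatorClient(client_options={"api_key": api_key})
    with io.open(path, 'rb') as image_file:
        content = image_file.read()
    image = vision.Image(content=content)
    response = client.text_detection(image=image)
    return response.text_annotations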

Complete a multipart_upload with boto3?

Tried this:
import boto3
from boto3.s3.transfer import TransferConfig, S3Transfer

path = "/temp/"
fileName = "bigFile.gz"  # this happens to be a 5.9 Gig file
client = boto3.client('s3', region)
config = TransferConfig(
    multipart_threshold=4*1024,  # number of bytes
    max_concurrency=10,
    num_download_attempts=10,
)
transfer = S3Transfer(client, config)
transfer.upload_file(path+fileName, 'bucket', 'key')
Result: 5.9 gig file on s3. Doesn't seem to contain multiple parts.
I found this example, but part is not defined.
import boto3

bucket = 'bucket'
path = "/temp/"
fileName = "bigFile.gz"
key = 'key'
s3 = boto3.client('s3')

# Initiate the multipart upload and send the part(s)
mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
with open(path+fileName, 'rb') as data:
    part1 = s3.upload_part(Bucket=bucket,
                           Key=key,
                           PartNumber=1,
                           UploadId=mpu['UploadId'],
                           Body=data)

# Next, we need to gather information about each part to complete
# the upload. Needed are the part number and ETag.
part_info = {
    'Parts': [
        {
            'PartNumber': 1,
            'ETag': part['ETag']  # NameError: 'part' is not defined
        }
    ]
}

# Now the upload works!
s3.complete_multipart_upload(Bucket=bucket,
                             Key=key,
                             UploadId=mpu['UploadId'],
                             MultipartUpload=part_info)
Question: Does anyone know how to use the multipart upload with boto3?
Your code was already correct. Indeed, a minimal example of a multipart upload just looks like this:
import boto3
s3 = boto3.client('s3')
s3.upload_file('my_big_local_file.txt', 'some_bucket', 'some_key')
You don't need to explicitly ask for a multipart upload, or use any of the lower-level functions in boto3 that relate to multipart uploads. Just call upload_file, and boto3 will automatically use a multipart upload if your file size is above a certain threshold (which defaults to 8MB).
You seem to have been confused by the fact that the end result in S3 wasn't visibly made up of multiple parts:
Result: 5.9 gig file on s3. Doesn't seem to contain multiple parts.
... but this is the expected outcome. The whole point of the multipart upload API is to let you upload a single file over multiple HTTP requests and end up with a single object in S3.
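If you want evidence that a multipart upload actually took place, one common check is the object's ETag: for multipart uploads it ends in "-N", where N is the number of parts. A small sketch, assuming the placeholder bucket and key names from the example above:

import boto3

s3 = boto3.client('s3')

# Assumption: same placeholder bucket/key as in the upload example above.
head = s3.head_object(Bucket='some_bucket', Key='some_key')
etag = head['ETag'].strip('"')

# Multipart uploads have an ETag of the form "<hash>-<part count>".
if '-' in etag:
    print("multipart upload, parts:", etag.split('-')[-1])
else:
    print("single-part upload")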
As described in official boto3 documentation:
The AWS SDK for Python automatically manages retries and multipart and
non-multipart transfers.
The management operations are performed by using reasonable default
settings that are well-suited for most scenarios.
So all you need to do is set the desired multipart threshold value, which indicates the minimum file size for which a multipart upload is handled automatically by the Python SDK:
import boto3
from boto3.s3.transfer import TransferConfig
# Set the desired multipart threshold value (5GB)
GB = 1024 ** 3
config = TransferConfig(multipart_threshold=5*GB)
# Perform the transfer
s3 = boto3.client('s3')
s3.upload_file('FILE_NAME', 'BUCKET_NAME', 'OBJECT_NAME', Config=config)
Moreover, you can also use the multithreading mechanism for multipart transfers by setting max_concurrency:
# To consume less downstream bandwidth, decrease the maximum concurrency
config = TransferConfig(max_concurrency=5)
# Download an S3 object
s3 = boto3.client('s3')
s3.download_file('BUCKET_NAME', 'OBJECT_NAME', 'FILE_NAME', Config=config)
And finally, in case you want to perform the multipart upload in a single thread, just set use_threads=False:
# Disable thread use/transfer concurrency
config = TransferConfig(use_threads=False)
s3 = boto3.client('s3')
s3.download_file('BUCKET_NAME', 'OBJECT_NAME', 'FILE_NAME', Config=config)
Complete source code with explanation: Python S3 Multipart File Upload with Metadata and Progress Indicator
I would advise you to use boto3.s3.transfer for this purpose. Here is an example:
import boto3
import boto3.s3.transfer

def upload_file(filename):
    session = boto3.Session()
    s3_client = session.client("s3")
    try:
        print("Uploading file: {}".format(filename))
        tc = boto3.s3.transfer.TransferConfig()
        t = boto3.s3.transfer.S3Transfer(client=s3_client, config=tc)
        t.upload_file(filename, "my-bucket-name", "name-in-s3.dat")
    except Exception as e:
        print("Error uploading: {}".format(e))
In your code snippet, it clearly should be part -> part1 in the dictionary. Typically, you would have several parts (otherwise, why use multipart upload?), and the 'Parts' list would contain an element for each part.
You may also be interested in the new pythonic interface for dealing with S3: http://s3fs.readthedocs.org/en/latest/
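For example, a minimal s3fs sketch, assuming credentials are available through the usual AWS/boto configuration and reusing the placeholder names from the question:

import s3fs

# s3fs picks up credentials the same way boto does (env vars, config files, roles).
fs = s3fs.S3FileSystem()

# Upload a local file; s3fs chunks large files into multipart uploads for you.
fs.put('/temp/bigFile.gz', 'bucket/key')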
Why not use just the copy option in boto3?
s3.copy(
    CopySource={'Bucket': sourceBucket, 'Key': sourceKey},
    Bucket=targetBucket,
    Key=targetKey,
    ExtraArgs={'ACL': 'bucket-owner-full-control'}
)
There are details on how to initialise the s3 object, and further options for the call, available in the boto3 docs.
copy from boto3 is a managed transfer which will perform a multipart copy in multiple threads if necessary.
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Object.copy
This works with objects greater than 5 GB, and I have already tested it.
Change part to part1:
import boto3

bucket = 'bucket'
path = "/temp/"
fileName = "bigFile.gz"
key = 'key'
s3 = boto3.client('s3')

# Initiate the multipart upload and send the part(s)
mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
with open(path+fileName, 'rb') as data:
    part1 = s3.upload_part(Bucket=bucket,
                           Key=key,
                           PartNumber=1,
                           UploadId=mpu['UploadId'],
                           Body=data)

# Next, we need to gather information about each part to complete
# the upload. Needed are the part number and ETag.
part_info = {
    'Parts': [
        {
            'PartNumber': 1,
            'ETag': part1['ETag']
        }
    ]
}

# Now the upload works!
s3.complete_multipart_upload(Bucket=bucket,
                             Key=key,
                             UploadId=mpu['UploadId'],
                             MultipartUpload=part_info)
