I'm trying to use boto3 to run a Textract detect_document_text request.
I'm using the following code:
client = boto3.client('textract')
response = client.detect_document_text(
    Document={
        'Bytes': image_b64['document_b64']
    }
)
Where image_b64['document_b64'] is a base64 image string that I converted using, for example, the https://base64.guru/converter/encode/image website.
But I'm getting the following error:
UnsupportedDocumentException
What am I doing wrong?
Per the documentation:
If you're using an AWS SDK to call Amazon Textract, you might not need to base64-encode image bytes passed using the Bytes field.
Base64 encoding is only required when invoking the REST API directly. When using the Python or Node.js SDKs, pass native binary bytes.
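As a minimal sketch of what passing native bytes looks like in practice (assuming a local image file named document.png; the filename is a placeholder):
import boto3

client = boto3.client('textract')

# Read the raw binary bytes of the image; no base64 step is needed with the SDK
with open('document.png', 'rb') as f:
    document_bytes = f.read()

response = client.detect_document_text(Document={'Bytes': document_bytes})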
For future reference, I solved that problem using:
import base64
import boto3

client = boto3.client('textract')
image_64_decode = base64.b64decode(image_b64['document_b64'])
image_bytes = bytearray(image_64_decode)
response = client.detect_document_text(
    Document={
        'Bytes': image_bytes
    }
)
With boto3, if you are using a Jupyter notebook with an image file (.jpg or .png), you can use:
import boto3

# images_path is the path to your local image file
with open(images_path, "rb") as img_file:
    img_str = bytearray(img_file.read())

textract = boto3.client('textract')
response = textract.detect_document_text(Document={'Bytes': img_str})
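If it helps, a short hedged sketch of reading the detected text back out of the response (this walks the Blocks list that detect_document_text returns):
# Print every detected line of text
for block in response['Blocks']:
    if block['BlockType'] == 'LINE':
        print(block['Text'])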
This worked for me. It assumes you have configured ~/.aws with your AWS credentials.
import boto3
import os

def main():
    client = boto3.client('textract', region_name="ca-central-1")
    for image_file_name in os.listdir('./imgs'):
        image_file = f"./imgs/{image_file_name}"
        with open(image_file, "rb") as f:
            # Pass either Bytes or S3Object, not both; here we send the raw bytes
            response = client.analyze_expense(
                Document={
                    'Bytes': f.read()
                })
            print(response)

if __name__ == "__main__":
    main()
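Note that the Document parameter takes either inline Bytes or an S3Object reference, not both at once. A hedged sketch of the S3 variant (bucket and object names are placeholders):
response = client.analyze_expense(
    Document={
        'S3Object': {
            'Bucket': 'my-bucket',
            'Name': 'receipt.jpg'
        }
    })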
On file upload to S3, I am triggering a Lambda function which generates a presigned S3 URL and creates the file in Frame.io. Whenever I try to upload many files at once to S3, the files are not created properly in Frame.io and it throws a "Preview Unsupported" error (for mp4 files, which are supported by default). To fix this issue, I tried to use an index as a request parameter, which worked only for uploads of 2 or 3 files. If I try to upload more files, the same error arises. Please find the Lambda function code below:
import requests
import boto3
import json
import urllib.parse
import mimetypes
from botocore.config import Config
import os

s3_client = boto3.client('s3', config=Config(signature_version='s3v4'))
client = boto3.client('ssm')

def lambda_handler(event, context):
    print(event)
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')
    if not key.endswith('/'):
        if key.find('/') >= 0:
            temp_key = key.rsplit('/', 1)
            key = temp_key[1]
        print(key)
        size = event['Records'][0]['s3']['object']['size']
        frameioIndex = int(client.get_parameter(Name='/frameio/asset/index_Dev')['Parameter']['Value']) - 1
        print(frameioIndex)
        s3_url = s3_client.generate_presigned_url("get_object", Params={"Bucket": bucket, "Key": key})
        response = requests.post(
            os.environ['FRAMEIO_BASE_API_URL'] + "assets/" + os.environ['FRAMEIO_PROJECT_ID'] + "/children",
            data=json.dumps({
                "type": "file",
                "name": key,
                "filesize": size,
                "filetype": mimetypes.guess_type(key)[0],
                "source": {"url": s3_url},
                "index": frameioIndex
            }),
            headers={
                "Authorization": "Bearer " + os.environ['FRAMEIO_TOKEN'],
                "Content-type": "application/json"
            })
        client.put_parameter(Name='/frameio/asset/index_Dev', Value=str(frameioIndex), Type='String', Overwrite=True)
        print(response)
        return {
            'statusCode': 200,
            'body': json.dumps('Successfully uploaded the asset!')
        }
    return {
        'statusCode': 200,
        'body': json.dumps('Uploaded object is not a file!')
    }
The issue was resolved by renaming the variable 'key' to 'filename' at the line key = temp_key[1] and using filename in the requests API call. The issue occurred because I was overwriting the full object key with just the filename and then passing it to the generate_presigned_url method, which needs the full key to generate a valid S3 URL.
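A hedged sketch of the corrected portion (filename is the new variable; everything else in the handler stays the same):
if key.find('/') >= 0:
    filename = key.rsplit('/', 1)[1]
else:
    filename = key
# key keeps the full object path, so the presigned URL stays valid
s3_url = s3_client.generate_presigned_url("get_object", Params={"Bucket": bucket, "Key": key})
# ...while filename is what gets sent to Frame.io as the asset name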
I am a beginner so I am hoping to get some help here.
I want to create a Lambda function (written in Python) that is able to read an image stored in S3 and then return the image as a binary file (e.g. a byte array). The Lambda function is triggered by an API Gateway.
Right now, I have set up the API Gateway to trigger the Lambda function, and it can return a hello message. I also have a GIF image stored in an S3 bucket.
import base64
import json
import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # TODO implement
    bucket = 'mybucket'
    key = 'myimage.gif'
    s3.get_object(Bucket=bucket, Key=key)['Body'].read()
    return {
        "statusCode": 200,
        "body": json.dumps('Hello from AWS Lambda!!')
    }
I really have no idea how to continue. Can anyone advise? Thanks in advance!
You can return Base64-encoded data from your Lambda function with appropriate headers.
Here is the updated Lambda function:
import base64
import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    bucket = 'mybucket'
    key = 'myimage.gif'
    image_bytes = s3.get_object(Bucket=bucket, Key=key)['Body'].read()
    # Convert the image to a Base64 string (decode so the body is JSON-serializable)
    image_base64 = base64.b64encode(image_bytes).decode('utf-8')
    return {'statusCode': 200,
            # Headers tell API Gateway the content type of the response
            'headers': {'Content-Type': 'image/gif'},
            # The image as a Base64-encoded string
            'body': image_base64,
            # Tells API Gateway to decode the body back to binary before returning it
            'isBase64Encoded': True}
For further details and a step-by-step guide, you can refer to this official blog.
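One caveat worth flagging, as an assumption about your setup since the question doesn't show the API Gateway configuration: for a REST API, isBase64Encoded only takes effect if a matching binary media type is registered on the API. A hedged sketch with the AWS CLI (the API id is a placeholder):
# image~1gif is the JSON-Pointer-escaped form of image/gif
aws apigateway update-rest-api \
    --rest-api-id abc123 \
    --patch-operations op=add,path=/binaryMediaTypes/image~1gif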
I'm trying to make an HTTP POST request with Microsoft's Face API, in order to connect it with photos in my Azure Blob Storage account. When I run the following code, I get multiple errors like handshake errors or SSL-routines-type errors. I appreciate any help! The problem code is:
api_response = requests.post(url, headers=headers, data=blob)
For context, here is what I ran before that. This first chunk sets up the storage account:
%matplotlib inline
import matplotlib.pyplot as plt
import io
from io import StringIO
import numpy as np
import cv2
from PIL import Image
import os
from array import array

azure_storage_account_name = 'musicsurveyphotostorage'
azure_storage_account_key = None  # don't need a key... we will access a public blob
if azure_storage_account_name is None:
    raise Exception("You must provide a name for an Azure Storage account")

from azure.storage.blob import BlockBlobService
blob_service = BlockBlobService(azure_storage_account_name, azure_storage_account_key)

# select container (folder) name where the files reside
container_name = 'musicsurveyphotostorage'
# list files in the selected folder
generator = blob_service.list_blobs(container_name)
blob_prefix = 'https://{0}.blob.core.windows.net/{1}/{2}'

# load image file to process
blob_name = 'shiba.jpg'  # name of image I have stored
blob = blob_service.get_blob_to_bytes(container_name, blob_name)
image_file_in_mem = io.BytesIO(blob.content)
img_bytes = Image.open(image_file_in_mem)
This second chunk calls the API and contains the problematic POST request:
# CALL OUT THE API
import requests
import urllib

url_face_api = 'https://eastus.api.cognitive.microsoft.com/face/v1.0'
api_key = '____'

# WHICH PARAMETERS/ATTRIBUTES DO YOU WANT RETURNED
headers = {'Content-Type': 'application/octet-stream',
           'Ocp-Apim-Subscription-Key': api_key}
params = urllib.parse.urlencode({
    'returnFaceId': 'true',
    'returnFaceLandmarks': 'true',
    'returnFaceAttributes': 'age,gender,smile,facialHair,headPose,glasses',
})
query_string = '?{0}'.format(params)
url = url_face_api + query_string

# THIS IS THE PROBLEM CODE
api_response = requests.post(url, headers=headers, data=blob)

# print the output as json
import json
res_json = json.loads(api_response.content.decode('utf-8'))
print(json.dumps(res_json, indent=2, sort_keys=True))
If I open Fiddler, I can also reproduce the issue you mentioned. In that case, you can use Fiddler to capture the request while it is being sent.
Based on my test, there are two lines in your code that need to be changed. For more information, you can refer to the screenshot.
We can also get some demo code from the official Azure documentation.
url_face_api = 'https://westcentralus.api.cognitive.microsoft.com/face/v1.0/detect'  # your URL was missing the /detect endpoint
api_response = requests.post(url, headers=headers, data=blob.content)  # data should be blob.content, not the blob object
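Putting the two fixes together, a minimal hedged sketch (keeping the eastus region from the original question; headers and query_string are as defined above):
url = 'https://eastus.api.cognitive.microsoft.com/face/v1.0/detect' + query_string
# blob.content holds the raw image bytes fetched from Azure Blob Storage
api_response = requests.post(url, headers=headers, data=blob.content)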
Is there any way I can analyse URLs using Google Cloud Vision? I know how to analyse images that I store locally, but I can't seem to analyse JPGs that exist on the internet:
import argparse
import base64
import httplib2
from googleapiclient.discovery import build
import collections
import time
import datetime
import pyodbc

time_start = datetime.datetime.now()

def main(photo_file):
    '''Run a label request on a single image'''
    API_DISCOVERY_FILE = 'https://vision.googleapis.com/$discovery/rest?version=v1'
    http = httplib2.Http()
    service = build('vision', 'v1', http, discoveryServiceUrl=API_DISCOVERY_FILE, developerKey='INSERT API KEY HERE')
    with open(photo_file, 'rb') as image:
        image_content = base64.b64encode(image.read())
        service_request = service.images().annotate(
            body={
                'requests': [{
                    'image': {
                        'content': image_content
                    },
                    'features': [{
                        'type': 'LOGO_DETECTION',
                        'maxResults': 10,
                    }]
                }]
            })
        response = service_request.execute()
        try:
            logo_description = response['responses'][0]['logoAnnotations'][0]['description']
            logo_description_score = response['responses'][0]['logoAnnotations'][0]['score']
            print logo_description
            print logo_description_score
        except KeyError:
            print "logo nonexistent"
    print time_start

if __name__ == '__main__':
    main(r"C:\Users\KVadher\Desktop\image_file1.jpg")
Is there any way I can analyse a URL and get an answer as to whether there are any logos in it?
I figured out how to do it. I re-wrote my code and used urllib to open the image, then passed it through base64 and the Google Cloud Vision logo recognition API:
import argparse
import base64
import httplib2
from googleapiclient.discovery import build
import collections
import time
import datetime
import pyodbc
import urllib
import urllib2

time_start = datetime.datetime.now()

# API AND DEVELOPER KEY DETAILS
API_DISCOVERY_FILE = 'https://vision.googleapis.com/$discovery/rest?version=v1'
http = httplib2.Http()
service = build('vision', 'v1', http, discoveryServiceUrl=API_DISCOVERY_FILE, developerKey='INSERT DEVELOPER KEY HERE')

url = "http://www.lcbo.com/content/dam/lcbo/products/218040.jpg/jcr:content/renditions/cq5dam.web.1280.1280.jpeg"
opener = urllib.urlopen(url)
image_content = base64.b64encode(opener.read())
service_request = service.images().annotate(
    body={
        'requests': [{
            'image': {
                'content': image_content
            },
            'features': [{
                'type': 'LOGO_DETECTION',
                'maxResults': 10,
            }]
        }]
    })
response = service_request.execute()
try:
    logo_description = response['responses'][0]['logoAnnotations'][0]['description']
    logo_description_score = response['responses'][0]['logoAnnotations'][0]['score']
    print logo_description
    print logo_description_score
except KeyError:
    print "logo nonexistent"
print time_start
The Google Cloud Vision API allows you to specify either the image content in base64 or a link to a file on Google Cloud Storage. See:
https://cloud.google.com/vision/docs/requests-and-responses#json_request_format
This means that you will have to download each image URL in your code (using Python's urllib2 library, perhaps), encode it in base64, and then add it to service_request.
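For the Google Cloud Storage option, a hedged sketch of the request body (field name per the v1 JSON request format; the gs:// URI is a placeholder):
service_request = service.images().annotate(
    body={
        'requests': [{
            'image': {
                # Reference an object already stored in Google Cloud Storage
                'source': {'gcsImageUri': 'gs://my-bucket/my-image.jpg'}
            },
            'features': [{'type': 'LOGO_DETECTION', 'maxResults': 10}]
        }]
    })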
I'm working in a Python web environment and I can simply upload a file from the filesystem to S3 using boto's key.set_contents_from_filename(path/to/file). However, I'd like to upload an image that is already on the web (say https://pbs.twimg.com/media/A9h_htACIAAaCf6.jpg:large).
Should I somehow download the image to the filesystem, and then upload it to S3 using boto as usual, then delete the image?
What would be ideal is if there is a way to get boto's key.set_contents_from_file or some other command that would accept a URL and nicely stream the image to S3 without having to explicitly download a file copy to my server.
import boto
from boto.s3.key import Key

def upload(url):
    try:
        # settings provides the AWS credentials (e.g. a Django settings module)
        conn = boto.connect_s3(settings.AWS_ACCESS_KEY_ID, settings.AWS_SECRET_ACCESS_KEY)
        bucket_name = settings.AWS_STORAGE_BUCKET_NAME
        bucket = conn.get_bucket(bucket_name)
        k = Key(bucket)
        k.key = "test"
        k.set_contents_from_file(url)
        k.make_public()
        return "Success?"
    except Exception, e:
        return e
Using set_contents_from_file, as above, I get a "string object has no attribute 'tell'" error. Using set_contents_from_filename with the URL, I get a "No such file or directory" error. The boto storage documentation leaves off at uploading local files and does not mention uploading files stored remotely.
Here is how I did it with requests, the key being to set stream=True when initially making the request, and to upload to S3 using the upload_fileobj() method:
import requests
import boto3
url = "https://upload.wikimedia.org/wikipedia/en/a/a9/Example.jpg"
r = requests.get(url, stream=True)
session = boto3.Session()
s3 = session.resource('s3')
bucket_name = 'your-bucket-name'
key = 'your-key-name' # key is the name of file on your bucket
bucket = s3.Bucket(bucket_name)
bucket.upload_fileobj(r.raw, key)
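One optional refinement (an assumption, since the answer above doesn't set it): objects uploaded this way get a generic content type, so you can pass the source response's Content-Type through via ExtraArgs if you want browsers to render the file directly:
# Preserve the content type reported by the source server (e.g. image/jpeg)
bucket.upload_fileobj(r.raw, key,
                      ExtraArgs={'ContentType': r.headers.get('Content-Type', 'binary/octet-stream')})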
Ok, from @garnaat, it doesn't sound like S3 currently allows uploads by URL. I managed to upload remote images to S3 by reading them into memory only. This works:
import boto
import urllib2
import StringIO
from boto.s3.key import Key

def upload(url):
    try:
        conn = boto.connect_s3(settings.AWS_ACCESS_KEY_ID, settings.AWS_SECRET_ACCESS_KEY)
        bucket_name = settings.AWS_STORAGE_BUCKET_NAME
        bucket = conn.get_bucket(bucket_name)
        k = Key(bucket)
        k.key = url.split('/')[::-1][0]  # In my situation, ids at the end are unique
        file_object = urllib2.urlopen(url)  # 'Like' a file object
        fp = StringIO.StringIO(file_object.read())  # Wrap object
        k.set_contents_from_file(fp)
        return "Success"
    except Exception, e:
        return e
Also thanks to: How can I create a GzipFile instance from the "file-like object" that urllib.urlopen() returns?
For a 2017-relevant answer to this question which uses the official 'boto3' package (instead of the old 'boto' package from the original answer):
Python 3.5
If you're on a clean Python install, pip install both packages first:
pip install boto3
pip install requests
import boto3
import requests

# Uses the creds in ~/.aws/credentials
s3 = boto3.resource('s3')
bucket_name_to_upload_image_to = 'photos'
s3_image_filename = 'test_s3_image.png'
internet_image_url = 'https://docs.python.org/3.7/_static/py.png'

# Do this as a quick and easy check to make sure your S3 access is OK
good_to_go = False
for bucket in s3.buckets.all():
    if bucket.name == bucket_name_to_upload_image_to:
        print('Good to go. Found the bucket to upload the image into.')
        good_to_go = True

if not good_to_go:
    print('Not seeing your s3 bucket, might want to double check permissions in IAM')

# Given an Internet-accessible URL, download the image and upload it to S3,
# without needing to persist the image to disk locally
req_for_image = requests.get(internet_image_url, stream=True)
file_object_from_req = req_for_image.raw
req_data = file_object_from_req.read()

# Do the actual upload to s3
s3.Bucket(bucket_name_to_upload_image_to).put_object(Key=s3_image_filename, Body=req_data)
Unfortunately, there really isn't any way to do this. At least not at the moment. We could add a method to boto, say set_contents_from_url, but that method would still have to download the file to the local machine and then upload it. It might still be a convenient method but it wouldn't save you anything.
In order to do what you really want to do, we would need to have some capability on the S3 service itself that would allow us to pass it the URL and have it store the URL to a bucket for us. That sounds like a pretty useful feature. You might want to post that to the S3 forums.
A simple three-line implementation that works in a Lambda out of the box:
import boto3
import requests

s3_object = boto3.resource('s3').Object(bucket_name, object_key)
with requests.get(url, stream=True) as r:
    s3_object.put(Body=r.content)
The source for the .get part comes straight from the requests documentation.
import boto3
import requests
from io import BytesIO

def send_image_to_s3(url, name):
    print("sending image")
    bucket_name = 'XXX'
    AWS_SECRET_ACCESS_KEY = "XXX"
    AWS_ACCESS_KEY_ID = "XXX"
    s3 = boto3.client('s3', aws_access_key_id=AWS_ACCESS_KEY_ID,
                      aws_secret_access_key=AWS_SECRET_ACCESS_KEY)
    response = requests.get(url)
    img = BytesIO(response.content)
    file_name = f'path/{name}'
    print('sending {}'.format(file_name))
    r = s3.upload_fileobj(img, bucket_name, file_name)
    s3_path = 'path/' + name
    return s3_path
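A hedged usage example (the URL and name are placeholders):
# Returns 'path/pic.jpg', the key of the uploaded object within the bucket
s3_key = send_image_to_s3('https://example.com/pic.jpg', 'pic.jpg')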
I have tried the following with boto3 and it works for me:
import boto3
import contextlib
import requests
from io import BytesIO

s3 = boto3.resource('s3')
s3Client = boto3.client('s3')
for bucket in s3.buckets.all():
    print(bucket.name)

url = "#resource url"
with contextlib.closing(requests.get(url, stream=True, verify=False)) as response:
    # Set up an in-memory file stream from the response content.
    fp = BytesIO(response.content)
    # Upload data to S3 (note: verify=False above disables TLS certificate checks; drop it unless you need it)
    s3Client.upload_fileobj(fp, 'aws-books', 'reviews_Electronics_5.json.gz')
Using the boto3 upload_fileobj method, you can stream a file to an S3 bucket without saving it to disk. Here is my function:
import boto3
import StringIO
import contextlib
import requests

def upload(url):
    # Get the service client
    s3 = boto3.client('s3')
    # Remember to set stream=True.
    with contextlib.closing(requests.get(url, stream=True, verify=False)) as response:
        # Set up file stream from response content.
        fp = StringIO.StringIO(response.content)
        # Upload data to S3
        s3.upload_fileobj(fp, 'my-bucket', 'my-dir/' + url.split('/')[-1])
S3 doesn't seem to support remote upload as of now. You may use the class below for uploading an image to S3. The upload method here first downloads the image and keeps it in memory until it gets uploaded. To connect to S3, install the AWS CLI with pip install awscli, then enter your credentials with aws configure:
import boto3
import urllib3
import uuid
from pathlib import Path
from io import BytesIO
from errors import custom_exceptions as cex

BUCKET_NAME = "xxx.yyy.zzz"
POSTERS_BASE_PATH = "assets/wallcontent"
CLOUDFRONT_BASE_URL = "https://xxx.cloudfront.net/"

class S3(object):
    def __init__(self):
        self.client = boto3.client('s3')
        self.bucket_name = BUCKET_NAME
        self.posters_base_path = POSTERS_BASE_PATH

    def __download_image(self, url):
        manager = urllib3.PoolManager()
        try:
            res = manager.request('GET', url)
        except Exception:
            print("Could not download the image from URL: ", url)
            raise cex.ImageDownloadFailed
        return BytesIO(res.data)  # any file-like object that implements read()

    def upload_image(self, url):
        try:
            image_file = self.__download_image(url)
        except cex.ImageDownloadFailed:
            raise cex.ImageUploadFailed

        extension = Path(url).suffix
        id = uuid.uuid1().hex + extension
        final_path = self.posters_base_path + "/" + id
        try:
            self.client.upload_fileobj(image_file,
                                       self.bucket_name,
                                       final_path)
        except Exception:
            print("Image Upload Error for URL: ", url)
            raise cex.ImageUploadFailed

        return CLOUDFRONT_BASE_URL + id
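A quick hedged usage example (the URL is a placeholder; errors.custom_exceptions is the author's own module):
uploader = S3()
cdn_url = uploader.upload_image('https://example.com/poster.jpg')
print(cdn_url)  # CloudFront URL of the uploaded object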
import boto
from boto.s3.key import Key
from boto.s3.connection import OrdinaryCallingFormat
from urllib import urlopen

def upload_images_s3(img_url):
    try:
        connection = boto.connect_s3('access_key', 'secret_key', calling_format=OrdinaryCallingFormat())
        bucket = connection.get_bucket('boto-demo-1519388451')
        file_obj = Key(bucket)
        file_obj.key = img_url.split('/')[::-1][0]
        fp = urlopen(img_url)
        result = file_obj.set_contents_from_string(fp.read())
    except Exception, e:
        return e