I want to extract text from images using Python. (The Tesseract library does not work for me because it requires installation.)
I have found the boto3 library and Textract, but I'm having trouble working with them. I'm still new to this. Can you tell me what I need to do in order to run my script correctly?
This is my code:
import cv2
import boto3
import textract
#img = cv2.imread('slika2.jpg') #this is jpg file
with open('slika2.pdf', 'rb') as document:
    img = bytearray(document.read())
textract = boto3.client('textract',region_name='us-west-2')
response = textract.detect_document_text(Document={'Bytes': img}) #gives me error
When I run this code, I get:
botocore.exceptions.ClientError: An error occurred (InvalidSignatureException) when calling the DetectDocumentText operation: The request signature we calculated does not match the signature you provided. Check your AWS Secret Access Key and signing method. Consult the service documentation for details.
I have also tried this:
# Document
documentName = "slika2.jpg"
# Read document content
with open(documentName, 'rb') as document:
    imageBytes = bytearray(document.read())
# Amazon Textract client
textract = boto3.client('textract',region_name='us-west-2')
# Call Amazon Textract
response = textract.detect_document_text(Document={'Bytes': imageBytes}) #ERROR
# Print detected text
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        print ('\033[94m' + item["Text"] + '\033[0m')
But I get this error:
botocore.exceptions.ClientError: An error occurred (InvalidSignatureException) when calling the DetectDocumentText operation: The request signature we calculated does not match the signature you provided. Check your AWS Secret Access Key and signing method. Consult the service documentation for details.
I'm a beginner at this, so any help would be appreciated. How can I read text from my image or PDF file?
I have also added this block of code, but the error is still Unable to locate credentials.
session = boto3.Session(
There is a problem with how credentials are passed to boto3. You have to pass the credentials when creating the boto3 client.
import boto3
# boto3 client
client = boto3.client(
# Read image
with open('slika2.png', 'rb') as document:
    img = bytearray(document.read())
# Call Amazon Textract
response = client.detect_document_text(
    Document={'Bytes': img}
)
# Print detected text
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        print ('\033[94m' + item["Text"] + '\033[0m')
Do note that it is not recommended to hardcode AWS keys in code. Please refer to this document.
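As one way to keep keys out of the source file, the client arguments can be read from environment variables. The sketch below is only an illustration: the helper names `client_kwargs_from_env` and `make_textract_client` are made up for this example, and it assumes the standard `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_DEFAULT_REGION` variables are set. boto3 also picks these variables up on its own, so passing them explicitly is mainly useful for failing early with a clear message.

```python
import os


def client_kwargs_from_env(environ=None):
    """Build keyword arguments for boto3.client from environment variables."""
    environ = os.environ if environ is None else environ
    missing = [name for name in ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
               if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing credentials: " + ", ".join(missing))
    return {
        "aws_access_key_id": environ["AWS_ACCESS_KEY_ID"],
        "aws_secret_access_key": environ["AWS_SECRET_ACCESS_KEY"],
        # The fallback region mirrors the question's code; adjust as needed.
        "region_name": environ.get("AWS_DEFAULT_REGION", "us-west-2"),
    }


def make_textract_client():
    """Create the Textract client (boto3 is imported lazily here)."""
    import boto3
    return boto3.client("textract", **client_kwargs_from_env())
```

Calling `make_textract_client()` then replaces the hard-coded `boto3.client(...)` call, and a missing key shows up as a readable `RuntimeError` instead of a signature error at request time.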
I'm working with an internal S3 service (not the AWS one). When I provide hard-coded credentials, region, and endpoint_url, boto3 seems to ignore them.
I came to that conclusion because it attempts to reach the internet (using a public AWS endpoint URL instead of the internal one I provided), which fails with the following proxy error. It should not go out to the internet at all, since this is an internal S3 service:
botocore.exceptions.ProxyConnectionError: Failed to connect to proxy URL: "http://my_company_proxy"
Here is my code
import io
import os
import boto3
import pandas as pd
# Method 1 : Client #########################################
s3_client = boto3.client(
# ==> at this point no error, but I don't know the value of endpoint_url
# Read bucket
bucket = "bkt-udt-arch"
file_name = "banking.csv"
print("debug 1") # printed OK
obj = s3_client.get_object(Bucket= bucket, Key= file_name)
# program stops here:
# botocore.exceptions.ProxyConnectionError: Failed to connect to proxy URL: "http://my_company_proxy"
print("debug 2") # not printed -
initial_df = pd.read_csv(obj['Body']) # 'Body' is a key word
print("debug 3")
# Method 2 : Resource #########################################
# use third party object storage
s3 = boto3.resource('s3', endpoint_url='https://my_company_enpoint_url',
print("debug 4") # Printed OK if method 1 is commented
# Print out bucket names
for bucket in s3.buckets.all():
Thank you for the review
It was indeed a proxy problem: when the http_proxy environment variable is unset, it works fine.
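Instead of unsetting the proxy variable globally, the internal host can be excluded from proxying. The following is a minimal sketch, assuming the HTTP stack honors `NO_PROXY` (botocore generally does when it reads proxy settings from the environment). The helper name `bypass_proxy_for` and the endpoint used in the usage note are hypothetical.

```python
import os
from urllib.parse import urlparse


def bypass_proxy_for(endpoint_url, environ=None):
    """Add the endpoint's host name to NO_PROXY so requests to it skip the proxy."""
    environ = os.environ if environ is None else environ
    host = urlparse(endpoint_url).hostname
    if host is None:
        raise ValueError("endpoint_url has no host name: %r" % endpoint_url)
    # Keep existing entries and avoid adding the same host twice.
    current = [h.strip() for h in environ.get("NO_PROXY", "").split(",") if h.strip()]
    if host not in current:
        current.append(host)
    environ["NO_PROXY"] = ",".join(current)
    return environ["NO_PROXY"]
```

The call would go before the client is created, for example `bypass_proxy_for("https://s3.internal.example")` followed by the usual `boto3.client('s3', endpoint_url=...)`.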
I have a Flask (Python) REST API which is called by another Flask REST API.
The input for my API is one parquet file (a FileStorage object) plus the ECS connection and bucket details.
I want to save the parquet file to ECS in a specific folder using boto or boto3.
This is the code I have tried:
def uploadFileToGivenBucket(self,inputData,file):
BucketName = inputData.ecsbucketname
calling_format = OrdinaryCallingFormat()
client = S3Connection(inputData.access_key_id, inputData.secret_key, port=inputData.ecsport,
host=inputData.ecsEndpoint, debug=2,
#client.upload_file(BucketName, inputData.filename, inputData.folderpath)
bucket = client.get_bucket(BucketName,validate=False)
key = boto.s3.key.Key(bucket, inputData.filename)
fileName = NamedTemporaryFile(delete=False,suffix=".parquet")
with open(fileName.name) as f:
but it is not working and gives me an error like this:
signature_host = '%s:%d' % (self.host, port)
TypeError: %d format: a number is required, not str
I searched Google but had no luck. Can anyone help me with this, or share some sample code for the same task?
After a lot of trial and error, I finally found the solution. I am posting it for everyone else who is facing the same issue.
You need to use boto3, and here is the code:
def uploadFileToGivenBucket(self,inputData,file):
    BucketName = inputData.ecsbucketname
    try:
        #bucket = client.get_bucket(BucketName,validate=False)
        # NOTE: the incoming `file` must be written to f before uploading
        # (that step is not shown in this snippet)
        f = NamedTemporaryFile(delete=False,suffix=".parquet")
        endpointurl = "<your endpoints>"
        s3_client = boto3.client('s3',endpoint_url=endpointurl, aws_access_key_id=inputData.access_key_id,aws_secret_access_key=inputData.secret_key)
        # the folder path and the file name need a '/' between them
        newkey = 'yourfolderpath/anotherfolder/'+inputData.filename
        response = s3_client.upload_file(f.name, BucketName,newkey)
    except ClientError as e:
        return False
    return True
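As a variation on the answer above, the object key can be built with a small helper and the stream uploaded directly, without a temporary file. This is a sketch under assumptions: `build_object_key` and `upload_stream` are hypothetical names, and `s3_client` is assumed to be a boto3 S3 client, whose `upload_fileobj` method accepts a binary file-like object.

```python
import posixpath


def build_object_key(folder_path, filename):
    """Join an S3 'folder' prefix and a file name with exactly one slash."""
    return posixpath.join(folder_path.strip("/"), filename.lstrip("/"))


def upload_stream(s3_client, stream, bucket_name, folder_path, filename):
    """Upload a binary file-like object and return the key it was stored under."""
    key = build_object_key(folder_path, filename)
    s3_client.upload_fileobj(stream, bucket_name, key)
    return key
```

With a Flask/werkzeug `FileStorage` object, `file.stream` could be passed as `stream`, for example `upload_stream(s3_client, file.stream, BucketName, 'yourfolderpath/anotherfolder', inputData.filename)`.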
Want to use Google's Cloud Vision API for OCR. Using the python sample code here we have:
import io
from google.cloud import vision
from google.cloud.vision import types  # older client library layout, as used below
def detect_text(path):
    """Detects text in the file."""
    client = vision.ImageAnnotatorClient()
    with io.open(path, 'rb') as image_file:
        content = image_file.read()
    image = types.Image(content=content)
    response = client.text_detection(image=image)
    texts = response.text_annotations
    for text in texts:
        vertices = (['({},{})'.format(vertex.x, vertex.y)
                     for vertex in text.bounding_poly.vertices])
        print('bounds: {}'.format(','.join(vertices)))
Where do I put my API key? I (obviously) can't authenticate without it.
From the docs,
If you plan to use a service account with client library code, you need to set an environment variable.
Provide the credentials to your application code by setting the environment variable GOOGLE_APPLICATION_CREDENTIALS. Replace [PATH] with the location of the JSON file you downloaded in the previous step.
For example:
export GOOGLE_APPLICATION_CREDENTIALS="/home/user/Downloads/service-account-file.json"
So it looks like you should create a service account, download a credentials file, and set an environment variable to point to it.
There are two ways to authenticate:
Setting the path to the credential file in an environment variable. Here is some sample code:
import os
from google.cloud import vision
def get_text_from_image(image_file):
    os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = "./creds/xxxx-xxxxx.json"
    try:
        # process_image_to_bytes is a method to convert numpy array to bytestream
        # (not of interest in this context hence not including it here)
        byte_img = process_image_to_bytes(image_file)
        client = vision.ImageAnnotatorClient()
        image = vision.Image(content=byte_img)
        response = client.text_detection(image=image)
        texts = response.text_annotations
        return texts
    except BaseException as e:
        raise  # placeholder: the original handler body is not shown
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] is doing the work here
Using oauth2
from google.cloud import vision
from google.oauth2 import service_account
def get_text_from_image(image_file):
    creds = service_account.Credentials.from_service_account_file("./creds/xxx-xxxxx.json")
    try:
        # process_image_to_bytes is a method to convert numpy array to bytestream
        # (not of interest in this context hence not including it here)
        byte_img = process_image_to_bytes(image_file)
        client = vision.ImageAnnotatorClient(credentials=creds)
        image = vision.Image(content=byte_img)
        response = client.text_detection(image=image)
        texts = response.text_annotations
        return texts
    except BaseException as e:
        raise  # placeholder: the original handler body is not shown
Here we are using the google-auth library to create a credentials object from the JSON credential file and passing that object to ImageAnnotatorClient for authentication.
Hope these sample snippets helped you
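For the environment-variable approach, a quick check before creating the client can make a missing or mistyped path easier to spot. This is only a sketch; `require_google_credentials` is a hypothetical helper name and is not part of any Google library.

```python
import os


def require_google_credentials(environ=None):
    """Return the service-account file path, or raise a descriptive error."""
    environ = os.environ if environ is None else environ
    path = environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        raise RuntimeError("GOOGLE_APPLICATION_CREDENTIALS is not set")
    if not os.path.isfile(path):
        raise FileNotFoundError("Credentials file not found: " + path)
    return path
```

Calling `require_google_credentials()` just before `vision.ImageAnnotatorClient()` would turn a vague authentication failure into a direct message about the variable or the file.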
I have created a S3 bucket, uploaded a video, created a streaming distribution in CloudFront. Tested it with a static HTML player and it works. I have created a keypair through the account settings. I have the private key file sitting on my desktop at the moment. That's where I am.
My aim is to get to a point where my Django/Python site creates secure URLs and people can't access the videos unless they've come from one of my pages. The problem is I'm allergic to the way Amazon have laid things out and I'm just getting more and more confused.
I realise this isn't going to be the best question on StackOverflow but I'm certain I can't be the only fool out here that can't make heads or tails of how to set up a secure CloudFront/S3 situation. I would really appreciate your help and am willing (once two days have passed) to give a 500pt bounty to the best answer.
I have several questions that, once answered, should fit into one explanation of how to accomplish what I'm after:
In the documentation (there's an example in the next point) there's lots of XML lying around telling me I need to POST things to various places. Is there an online console for doing this? Or do I literally have to force this up via cURL (et al)?
How do I create an Origin Access Identity for CloudFront and bind it to my distribution? I've read this document but, per the first point, don't know what to do with it. How does my keypair fit into this?
Once that's done, how do I limit the S3 bucket to only allow people to download things through that identity? If this is another XML jobby rather than clicking around the web UI, please tell me where and how I'm supposed to get this into my account.
In Python, what's the easiest way of generating an expiring URL for a file? I have boto installed but I don't see how to get a file from a streaming distribution.
Are there any applications or scripts that can take the difficulty out of setting this garb up? I use Ubuntu (Linux) but I have XP in a VM if it's Windows-only. I've already looked at CloudBerry S3 Explorer Pro - but it makes about as much sense as the online UI.
You're right, it takes a lot of API work to get this set up. I hope they get it in the AWS Console soon!
UPDATE: I have submitted this code to boto - as of boto v2.1 (released 2011-10-27) this gets much easier. For boto < 2.1, use the instructions here. For boto 2.1 or greater, get the updated instructions on my blog: http://www.secretmike.com/2011/10/aws-cloudfront-secure-streaming.html Once boto v2.1 gets packaged by more distros I'll update the answer here.
To accomplish what you want you need to perform the following steps which I will detail below:
1. Create your s3 bucket and upload some objects (you've already done this)
2. Create a Cloudfront "Origin Access Identity" (basically an AWS account to allow cloudfront to access your s3 bucket)
3. Modify the ACLs on your objects so that only your Cloudfront Origin Access Identity is allowed to read them (this prevents people from bypassing Cloudfront and going direct to s3)
4. Create a cloudfront distribution with basic URLs and one which requires signed URLs
5. Test that you can download objects from basic cloudfront distribution but not from s3 or the signed cloudfront distribution
6. Create a key pair for signing URLs
7. Generate some URLs using Python
8. Test that the signed URLs work
1 - Create Bucket and upload object
The easiest way to do this is through the AWS Console but for completeness I'll show how using boto. Boto code is shown here:
import boto
#credentials stored in environment AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY
s3 = boto.connect_s3()
#bucket name MUST follow dns guidelines
new_bucket_name = "stream.example.com"
bucket = s3.create_bucket(new_bucket_name)
object_name = "video.mp4"
key = bucket.new_key(object_name)
2 - Create a Cloudfront "Origin Access Identity"
For now, this step can only be performed using the API. Boto code is here:
import boto
#credentials stored in environment AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY
cf = boto.connect_cloudfront()
oai = cf.create_origin_access_identity(comment='New identity for secure videos')
#We need the following two values for later steps:
print("Origin Access Identity ID: %s" % oai.id)
print("Origin Access Identity S3CanonicalUserId: %s" % oai.s3_user_id)
3 - Modify the ACLs on your objects
Now that we've got our special S3 user account (the S3CanonicalUserId we created above), we need to give it access to our s3 objects. We can do this easily using the AWS Console by opening the object's (not the bucket's!) Permissions tab, clicking the "Add more permissions" button, and pasting the very long S3CanonicalUserId we got above into the "Grantee" field of the new permission. Make sure you give the new permission "Open/Download" rights.
You can also do this in code using the following boto script:
import boto
#credentials stored in environment AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY
s3 = boto.connect_s3()
bucket_name = "stream.example.com"
bucket = s3.get_bucket(bucket_name)
object_name = "video.mp4"
key = bucket.get_key(object_name)
#Now add read permission to our new s3 account
s3_canonical_user_id = "<your S3CanonicalUserID from above>"
key.add_user_grant("READ", s3_canonical_user_id)
4 - Create a cloudfront distribution
Note that custom origins and private distributions are not fully supported in boto until version 2.0 which has not been formally released at time of writing. The code below pulls out some code from the boto 2.0 branch and hacks it together to get it going but it's not pretty. The 2.0 branch handles this much more elegantly - definitely use that if possible!
import boto
from boto.cloudfront.distribution import DistributionConfig
from boto.cloudfront.exception import CloudFrontServerError
import re
def get_domain_from_xml(xml):
    results = re.findall("<DomainName>([^<]+)</DomainName>", xml)
    return results[0]
#custom class to hack this until boto v2.0 is released
class HackedStreamingDistributionConfig(DistributionConfig):
def __init__(self, connection=None, origin='', enabled=False,
caller_reference='', cnames=None, comment='',
DistributionConfig.__init__(self, connection=connection,
origin=origin, enabled=enabled,
cnames=cnames, comment=comment,
#override the to_xml() function
def to_xml(self):
    s = '<?xml version="1.0" encoding="UTF-8"?>\n'
    s += '<StreamingDistributionConfig xmlns="http://cloudfront.amazonaws.com/doc/2010-07-15/">\n'
    s += ' <S3Origin>\n'
    s += ' <DNSName>%s</DNSName>\n' % self.origin
    if self.origin_access_identity:
        val = self.origin_access_identity
        s += ' <OriginAccessIdentity>origin-access-identity/cloudfront/%s</OriginAccessIdentity>\n' % val
    s += ' </S3Origin>\n'
    s += ' <CallerReference>%s</CallerReference>\n' % self.caller_reference
    for cname in self.cnames:
        s += ' <CNAME>%s</CNAME>\n' % cname
    if self.comment:
        s += ' <Comment>%s</Comment>\n' % self.comment
    s += ' <Enabled>'
    if self.enabled:
        s += 'true'
    else:
        s += 'false'
    s += '</Enabled>\n'
    if self.trusted_signers:
        s += '<TrustedSigners>\n'
        for signer in self.trusted_signers:
            if signer == 'Self':
                s += ' <Self/>\n'
            else:
                s += ' <AwsAccountNumber>%s</AwsAccountNumber>\n' % signer
        s += '</TrustedSigners>\n'
    if self.logging:
        s += '<Logging>\n'
        s += ' <Bucket>%s</Bucket>\n' % self.logging.bucket
        s += ' <Prefix>%s</Prefix>\n' % self.logging.prefix
        s += '</Logging>\n'
    s += '</StreamingDistributionConfig>\n'
    return s
def create(self):
response = self.connection.make_request('POST',
'/%s/%s' % ("2010-11-01", "streaming-distribution"),
{'Content-Type' : 'text/xml'},
body = response.read()
if response.status == 201:
    return body
else:
    raise CloudFrontServerError(response.status, response.reason, body)
cf = boto.connect_cloudfront()
s3_dns_name = "stream.example.com.s3.amazonaws.com"
comment = "example streaming distribution"
oai = "<OAI ID from step 2 above like E23KRHS6GDUF5L>"
#Create a distribution that does NOT need signed URLS
hsd = HackedStreamingDistributionConfig(connection=cf, origin=s3_dns_name, comment=comment, enabled=True)
hsd.origin_access_identity = oai
basic_dist = hsd.create()
print("Distribution with basic URLs: %s" % get_domain_from_xml(basic_dist))
#Create a distribution that DOES need signed URLS
hsd = HackedStreamingDistributionConfig(connection=cf, origin=s3_dns_name, comment=comment, enabled=True)
hsd.origin_access_identity = oai
#Add some required signers (Self means your own account)
hsd.trusted_signers = ['Self']
signed_dist = hsd.create()
print("Distribution with signed URLs: %s" % get_domain_from_xml(signed_dist))
5 - Test that you can download objects from cloudfront but not from s3
You should now be able to verify:
stream.example.com.s3.amazonaws.com/video.mp4 - should give AccessDenied
signed_distribution.cloudfront.net/video.mp4 - should give MissingKey (because the URL is not signed)
basic_distribution.cloudfront.net/video.mp4 - should work fine
The tests will have to be adjusted to work with your stream player, but the basic idea is that only the basic cloudfront url should work.
6 - Create a keypair for CloudFront
I think the only way to do this is through Amazon's web site. Go into your AWS "Account" page and click on the "Security Credentials" link. Click on the "Key Pairs" tab then click "Create a New Key Pair". This will generate a new key pair for you and automatically download a private key file (pk-xxxxxxxxx.pem). Keep the key file safe and private. Also note down the "Key Pair ID" from amazon as we will need it in the next step.
7 - Generate some URLs in Python
As of boto version 2.0 there does not seem to be any support for generating signed CloudFront URLs. Python does not include RSA encryption routines in the standard library so we will have to use an additional library. I've used M2Crypto in this example.
For a non-streaming distribution, you must use the full cloudfront URL as the resource, however for streaming we only use the object name of the video file. See the code below for a full example of generating a URL which only lasts for 5 minutes.
This code is based loosely on the PHP example code provided by Amazon in the CloudFront documentation.
from M2Crypto import EVP
import base64
import time
def aws_url_base64_encode(msg):
    msg_base64 = base64.b64encode(msg)
    msg_base64 = msg_base64.replace('+', '-')
    msg_base64 = msg_base64.replace('=', '_')
    msg_base64 = msg_base64.replace('/', '~')
    return msg_base64
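def sign_string(message, priv_key_string):
    key = EVP.load_key_string(priv_key_string)
    # feed the message into a SHA-1 signing context before finalizing
    key.reset_context(md='sha1')
    key.sign_init()
    key.sign_update(message)
    signature = key.sign_final()
    return signature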
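def create_url(url, encoded_signature, key_pair_id, expires):
    signed_url = "%(url)s?Expires=%(expires)s&Signature=%(encoded_signature)s&Key-Pair-Id=%(key_pair_id)s" % {
        'url': url, 'expires': expires,
        'encoded_signature': encoded_signature, 'key_pair_id': key_pair_id}
    return signed_url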
def get_canned_policy_url(url, priv_key_string, key_pair_id, expires):
    #we manually construct this policy string to ensure formatting matches signature
    canned_policy = '{"Statement":[{"Resource":"%(url)s","Condition":{"DateLessThan":{"AWS:EpochTime":%(expires)s}}}]}' % {'url':url, 'expires':expires}
    #now base64 encode it (must be URL safe)
    encoded_policy = aws_url_base64_encode(canned_policy)
    #sign the non-encoded policy
    signature = sign_string(canned_policy, priv_key_string)
    #now base64 encode the signature (URL safe as well)
    encoded_signature = aws_url_base64_encode(signature)
    #combine these into a full url
    signed_url = create_url(url, encoded_signature, key_pair_id, expires)
    return signed_url
def encode_query_param(resource):
    enc = resource
    enc = enc.replace('?', '%3F')
    enc = enc.replace('=', '%3D')
    enc = enc.replace('&', '%26')
    return enc
#Set parameters for URL
key_pair_id = "APKAIAZCZRKVIO4BQ" #from the AWS accounts page
priv_key_file = "cloudfront-pk.pem" #your private keypair file
resource = 'video.mp4' #your resource (just object name for streaming videos)
expires = int(time.time()) + 300 #5 min
#Create the signed URL
priv_key_string = open(priv_key_file).read()
signed_url = get_canned_policy_url(resource, priv_key_string, key_pair_id, expires)
#Flash player doesn't like query params so encode them
enc_url = encode_query_param(signed_url)
8 - Try out the URLs
Hopefully you should now have a working URL which looks something like this:
Put this into your js and you should have something which looks like this (from the PHP example in Amazon's CloudFront documentation):
var so_canned = new SWFObject('http://location.domname.com/~jvngkhow/player.swf','mpl','640','360','9');
As you can see, not very easy! boto v2 will help a lot setting up the distribution. I will find out if it's possible to get some URL generation code in there as well to improve this great library!
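The code in step 7 was written for Python 2, where base64 output is a plain string. As a hedged sketch, the two signing-independent pieces (the CloudFront-safe base64 step and the canned policy string) could look like this in Python 3. The function name `make_canned_policy` is introduced here for illustration, and the RSA signing itself is deliberately left out.

```python
import base64
import json


def aws_url_base64_encode(data: bytes) -> str:
    """Base64-encode bytes, then swap the three characters that are not URL safe."""
    encoded = base64.b64encode(data).decode("ascii")
    return encoded.replace("+", "-").replace("=", "_").replace("/", "~")


def make_canned_policy(resource: str, expires: int) -> str:
    """Build the canned policy as compact JSON with no extra whitespace."""
    policy = {
        "Statement": [{
            "Resource": resource,
            "Condition": {"DateLessThan": {"AWS:EpochTime": expires}},
        }]
    }
    # Compact separators keep the string identical to the hand-built one above.
    return json.dumps(policy, separators=(",", ":"))
```

The policy string would be encoded to bytes (for example with `.encode("utf-8")`) before being signed, and the resulting signature bytes passed through `aws_url_base64_encode`.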
In Python, what's the easiest way of generating an expiring URL for a file? I have boto installed but I don't see how to get a file from a streaming distribution.
You can generate an expiring signed URL for the resource. The Boto3 documentation has a nice example solution for that:
import datetime
from cryptography.hazmat.backends import default_backend
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric import padding
from botocore.signers import CloudFrontSigner
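def rsa_signer(message):
    with open('path/to/key.pem', 'rb') as key_file:
        private_key = serialization.load_pem_private_key(
            key_file.read(),
            password=None,
            backend=default_backend()
        )
    # private_key.signer() was removed from recent cryptography releases;
    # sign() performs the same PKCS#1 v1.5 / SHA-1 signature in one call.
    return private_key.sign(message, padding.PKCS1v15(), hashes.SHA1())
key_id = '<your CloudFront key pair ID>'  # placeholder, not a real ID
url = 'http://d2949o5mkkp72v.cloudfront.net/hello.txt'
expire_date = datetime.datetime(2017, 1, 1)
cloudfront_signer = CloudFrontSigner(key_id, rsa_signer)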
# Create a signed url that will be valid until the specific expiry date
# provided using a canned policy.
signed_url = cloudfront_signer.generate_presigned_url(
    url, date_less_than=expire_date)
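The fixed `expire_date` in that example is a calendar date that has long since passed, so in practice the expiry is usually computed relative to the current time. A minimal sketch, with `expiry_in` as a hypothetical helper name:

```python
import datetime


def expiry_in(minutes):
    """Return a timezone-aware UTC datetime `minutes` from now."""
    now = datetime.datetime.now(datetime.timezone.utc)
    return now + datetime.timedelta(minutes=minutes)
```

For a link that lasts five minutes, `expire_date = expiry_in(5)` would then be passed as `date_less_than`. Using a timezone-aware UTC value avoids any ambiguity about which time zone the expiry refers to.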
Can a Python script upload a photo to Photobucket and then retrieve the URL for it? If so, how?
I found a script at this link: http://www.democraticunderground.com/discuss/duboard.php?az=view_all&address=240x677
But I just found that confusing.
many thanks,
Yes, you can. Photobucket has a well-documented API, and someone wrote a wrapper around it.
Download it and put it into your Python path, then download httplib2 (you can use easy_install or pip for this one).
Then, you have to request a key for the Photobucket API.
If you did everything right, you can write your script now. The Python wrapper is great, but it is not documented at all, which makes it very difficult to use. I spent hours understanding it (compare the question and response times here). As an example, the script even has form/multipart support, but I had to read the code to find out how to use it: I had to prefix the filename with a #.
This library is a great example of how you should NOT document your code!
I finally got it working, enjoy the script (it even has oAuth handling!):
import pbapi
import webbrowser
import cPickle
import os
import re
import sys
from xml.etree import ElementTree
__author__ = "leoluk"
# File in which the oAuth token will be stored
TOKEN_FILE = "token.txt"
IMAGE_PATH = r"D:\Eigene Dateien\Bilder\SC\foo.png"
"type": 'image',
"uploadfile": '#'+IMAGE_PATH,
"title": "My title", # <---
"description": "My description", # <---
ALBUM_NAME = None # default album if None
API_KEY = "149[..]"
API_SECRET = "528[...]"
## SCRIPT ##
api = pbapi.PbApi(API_KEY, API_SECRET)
api.pb_request.connection.cache = None
# Test if service online
result = api.reset().ping().post().response_string
ET = ElementTree.fromstring(result)
if ET.find('status').text != 'OK':
sys.stderr.write("error: Ping failed \n"+result)
# If there is already a saved oAuth token, no need for a new one
api.username, api.pb_request.oauth_token = cPickle.load(open(TOKEN_FILE))
except (ValueError, KeyError, IOError, TypeError):
# If error, there's no valid oAuth token
# Getting request token
# Requesting user permission (you have to login with your account)
raw_input("Press Enter when you finished access permission. ")
#Getting oAuth token
# This is needed for getting the right subdomain
infos = api.reset().album(api.username).url().get().response_string
ET = ElementTree.fromstring(infos)
if ET.find('status').text != 'OK':
# Remove the invalid oAuth
# This happens if the user deletes the oAuth permission online
sys.stderr.write("error: Permission deleted. Please re-run.")
# Fresh values for username and subdomain
api.username = ET.find('content/username').text
# Default album name
if not ALBUM_NAME:
ALBUM_NAME = api.username
# Debug :-)
print "User: %s" % api.username
# Save the new, valid oAuth token
cPickle.dump((api.username, api.oauth_token), open(TOKEN_FILE, 'w'))
# Posting the image
result = (api.reset().album(ALBUM_NAME).
ET = ElementTree.fromstring(result)
if ET.find('status').text != 'OK':
sys.stderr.write("error: File upload failed \n"+result)
# Now, as an example what you could do now, open the image in the browser
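The script above targets Python 2 (`cPickle`, `raw_input`, the `print` statement). As a rough sketch of how its token-caching part might look in Python 3, the helpers below store and reload the `(username, token)` pair. The names `load_token` and `save_token` are made up for this example and are not part of the Photobucket wrapper; as with any pickle file, only load files your own script wrote.

```python
import pickle


def load_token(path):
    """Return the cached (username, token) tuple, or None if it is unusable."""
    try:
        with open(path, "rb") as fh:
            return pickle.load(fh)
    except (OSError, EOFError, pickle.UnpicklingError):
        return None


def save_token(path, username, token):
    """Store the (username, token) pair; pickle needs binary mode in Python 3."""
    with open(path, "wb") as fh:
        pickle.dump((username, token), fh)
```

A `None` result from `load_token(TOKEN_FILE)` would take the place of the exception branch in the script, triggering a fresh oAuth login.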
Use the python API by Ron White that was written to do just this