AWS Kinesis Consumer Python 3.4 Boto

AWS Kinesis Consumer Python 3.4 Boto - python

I am trying to build a kinesis consumer script using python 3.4 below is an example of my code. I want the records to be saved to a local file that I can later push to S3:
from boto import kinesis
import time
import json
# AWS Connection Credentials
aws_access_key = 'your_key'
aws_access_secret = 'your_secret key'
# Selected Kinesis Stream
stream = 'TwitterTesting'
# Aws Authentication
auth = {"aws_access_key_id": aws_access_key, "aws_secret_access_key": aws_access_secret}
conn = kinesis.connect_to_region('us-east-1',**auth)
# Targeted file to be pushed to S3 bucket
fileName = "KinesisDataTest2.txt"
file = open("C:\\Users\\csanders\\PycharmProjects\\untitled\\KinesisDataTest.txt", "a")
# Describe stream and get shards
tries = 0
while tries < 10:
tries += 1
time.sleep(1)
response = conn.describe_stream(stream)
if response['StreamDescription']['StreamStatus'] == 'ACTIVE':
break
else:
raise TimeoutError('Stream is still not active, aborting...')
# Get Shard Iterator and get records from stream
shard_ids = []
stream_name = None
if response and 'StreamDescription' in response:
stream_name = response['StreamDescription']['StreamName']
for shard_id in response['StreamDescription']['Shards']:
shard_id = shard_id['ShardId']
shard_iterator = conn.get_shard_iterator(stream,
shard_id, 'TRIM_HORIZON')
shard_ids.append({'shard_id': shard_id, 'shard_iterator': shard_iterator['ShardIterator']})
tries = 0
result = []
while tries < 100:
tries += 1
response = conn.get_records(shard_iterator, 100)
shard_iterator = response['NextShardIterator']
if len(response['Records'])> 0:
for res in response['Records']:
result.append(res['Data'])
print(result, shard_iterator)
For some reason when I run this script I get the following error each time:
Traceback (most recent call last):
File "C:/Users/csanders/PycharmProjects/untitled/Get_records_Kinesis.py", line 57, in <module>
response = json.load(conn.get_records(shard_ids, 100))
File "C:\Python34\lib\site-packages\boto-2.38.0-py3.4.egg\boto\kinesis\layer1.py", line 327, in get_records
body=json.dumps(params))
File "C:\Python34\lib\site-packages\boto-2.38.0- py3.4.egg\boto\kinesis\layer1.py", line 874, in make_request
body=json_body)
boto.exception.JSONResponseError: JSONResponseError: 400 Bad Request
{'Message': 'Start of list found where not expected', '__type': 'SerializationException'}
My end goal is to eventually kick this data into an S3 bucket. I just need to get these records to return and print first. The data going into the stream is JSON dump twitter data using the put_record function. I can post that code too if needed.
Updated that one line from response = json.load(conn.get_records(shard_ids, 100)) to response = conn.get_records(shard_iterator, 100)

response = json.load(conn.get_records(shard_ids, 100))
get_records expects a shard_id not an array of shards. when it's trying to get records it fails miserably (you see the 400 from Kinesis saying that the request is bad).
http://boto.readthedocs.org/en/latest/ref/kinesis.html?highlight=get_records#boto.kinesis.layer1.KinesisConnection.get_records

if you replace following will work ( "while" you set up according for how many record you would like to collect, you can make infinite "with == 0" and remove "tries += 1")
shard_iterator = conn.get_shard_iterator(stream,
shard_id, 'TRIM_HORIZON')
shard_ids.append({'shard_id': shard_id, 'shard_iterator': shard_iterator['ShardIterator']})
with following:
shard_iterator = conn.get_shard_iterator(stream,
shard_id, "LATEST")["ShardIterator"]
also to write to a file change("\n" is for new line):
print(result, shard_iterator)
to:
file.write(str(result) + "\n")
Hope it helps.

Related

Python error when trying to execute a script to shutdown GCP VM

The functionality of this script to shutdown GCP VM's based on some logic. Currently we are trying shut it down at night, but the scripts are failing before it could shutdown the VM. We shutdown based on time. At evenings and night the VM are shutdown using these scripts
Script to turn off instances overnight
Environment: Runs in Google Cloud Functions (Python3.7)
import datetime
import json
from pprint import pformat
import pytz
import re
import modules.common.cfcommon as cfcommon
import modules.utilities.dateutilities as dateutilities
from modules.compute.instances import InstanceList, Instance
from modules.compute.compute_service import ComputeServiceContext
from modules.utilities.printutilities import print_message, debug_message
from modules.pubsub.topic import PublishMessage
from modules.common.labels import VMAUTOSHUTDOWN_LABEL, VMAUTOSHUTDOWN_DEFER_LABEL, ShutdownDeferLabelValueIsValid, ShutdownLabelValueIsValid
from templates.renderer import render_template
# Takes a list in the following format and checks if the 'Instance' object is within it
# list must contain dictionaries in the following format:
# {"name": "instancename", "zone": "zonename"}
# Example: {"name": "test-01", "zone": "us-east4-c"}
#
# Parameters:
# inputList - list of dictionary objects
# instance - Instance object
def isInstanceInList(inputList, instance):
if not isinstance(inputList, list):
raise TypeError("Provided inputList is not a list")
if not isinstance(instance, Instance):
raise TypeError("Provided instance is not of type 'Instance'")
# Iterate over every item in inputList and check if the name and zone match
for cItem in inputList:
if cItem["name"].lower() == instance.properties["name"].lower() and cItem["zone"].lower() == instance.GetShortZoneName().lower():
return True
# No match found
return False
# Takes a list of Instance objects and sees if their shutdown timezone is within the graceperiod of the shutdownHour
#
# Example: is shutdown hour is 23 and the gracePeriodMin is 15 then if the function is called at 23:12, the instance will be included in the shutdown list
#
# Parameters:
# instanceList - List of Instance objects
# shutdownHour - number (0-23) 0 = Midnight, 23 = 11PM
# gracePeriodMin - number
def getInstancesToStop(instanceList, gracePeriodMin):
instancesToStop = []
debug_message("Entering getInstancesToStop")
for cInstance in instanceList:
debug_message("Instance: %s (ID: %s, Zone: %s, Project: %s)" % (cInstance.GetName(), cInstance.GetId(), cInstance.GetShortZoneName(), cInstance.project))
labels = cInstance.GetLabels()
if VMAUTOSHUTDOWN_LABEL in labels.keys():
labelValue = labels.get(VMAUTOSHUTDOWN_LABEL, '')
pattern = '\d\d-\d\d-\d\d'
match = re.match(pattern, labelValue)
if not match or not ShutdownLabelValueIsValid(labelValue):
debug_message(f'Label {labelValue} does not match the correct format')
instancesToStop.append(cInstance)
continue
else:
debug_message(f'Label {VMAUTOSHUTDOWN_LABEL} not found. Adding to shutdown list')
instancesToStop.append(cInstance)
continue
shutdown_deferred_utc_datetime = None
if VMAUTOSHUTDOWN_DEFER_LABEL in labels.keys():
labelValue = labels.get(VMAUTOSHUTDOWN_DEFER_LABEL, '')
pattern = '\d\d\d\d-\d\d-\d\dt\d\d-\d\d-\d\d'
match = re.match(pattern, labelValue)
if match and ShutdownDeferLabelValueIsValid(labelValue):
shutdown_deferred_utc_date, shutdown_deferred_utc_time = labelValue.split('t')
year, month, day = shutdown_deferred_utc_date.split('-')
hour, minute, second = shutdown_deferred_utc_time.split('-')
shutdown_deferred_utc_datetime = datetime.datetime.now(pytz.timezone('GMT')).replace(
year=int(year), month=int(month), day=int(day), hour=int(hour), minute=int(minute), second=int(second)
)
else:
debug_message(f'Label {labels[VMAUTOSHUTDOWN_DEFER_LABEL]} does not match the correct format')
instancesToStop.append(cInstance)
continue
current_utc_time = dateutilities.get_current_datetime()
# If defer date is in the future, and not in grace window time, skip shutting down
if shutdown_deferred_utc_datetime is not None and shutdown_deferred_utc_datetime > current_utc_time:
debug_message(f'Instance {cInstance.GetName()} shutdown deferred until after {labels[VMAUTOSHUTDOWN_DEFER_LABEL]}')
continue
# If defer time is in past, continue with the vm hour shutdown
shutdown_utc_hour = labels[VMAUTOSHUTDOWN_LABEL].split('-')[0]
# Convert shutdown UTC hour into datetime object
shutdown_utc_time = datetime.datetime.now(pytz.timezone('GMT')).replace(hour=int(shutdown_utc_hour), minute=0, second=0)
shutdown_utc_grace_time = shutdown_utc_time + datetime.timedelta(minutes=gracePeriodMin)
debug_message(f"Shutdown UTC time {shutdown_utc_time}")
debug_message(f"Shutdown UTC grace time {shutdown_utc_grace_time}")
# Check if shutdown is within time window
if current_utc_time >= shutdown_utc_time and current_utc_time <= shutdown_utc_grace_time:
debug_message("We're in the time window")
instancesToStop.append(cInstance)
else:
debug_message("We're outside the time window. Not adding to stop list")
return instancesToStop
# This is the main entry point that cloud functions calls
def AutoStopVMInstances(config, policy=None, payload=None, generate_local_report=False):
FUNCTION_NAME = "AutoStopVMInstances"
# Populated by config later...
QUERY_PROJECT_IDS = None # List of project IDs
INSTANCE_WHITELIST = None # List of dictionaries in format {"name": "instancename", "zone": "zonename", "project": "projectid"}
PREVIEW_MODE = True
SHUTDOWN_GRACEPERIOD_MIN = 30
# Start
startTime = datetime.datetime.now()
print_message("Started %s within Cloud Function %s [%s]" % (FUNCTION_NAME, cfcommon.CLOUD_FUNCTION_NAME, startTime))
debug_message("")
# For ease of access, assign from config values
debug_message("Processing Configuration...")
QUERY_PROJECT_IDS = config.get("QueryProjectIDs", []) # Required field
INSTANCE_WHITELIST = config.get("InstanceWhiteList", []) # Optional Field
PREVIEW_MODE = config.get("PreviewMode", True) # Required field
SHUTDOWN_GRACEPERIOD_MIN = config.get("ShutdownGracePeriodMin", None) # Required field
SKIP_INSTANCE_GROUPS = config.get("SkipInstanceGroups", False) # Optional
EMAIL_PUB_SUB_PROJECT = config.get("EmailPubSubProject", None) # Optional
EMAIL_PUB_SUB_TOPIC = config.get("EmailPubSubTopic", None) # Optional
EMAIL_TO = config.get("EmailTo", []) # Optional
EMAIL_CC = config.get("EmailCC", None) # Optional
EMAIL_BCC = config.get("EmailBCC", None) # Optional
EMAIL_FROM = config.get("EmailFrom", "noreply-ei-cs-cloudops-resource-administration#ei-cs-cloudops.local") # Optional
EMAIL_SUBJECT = config.get("EmailSubject", "Nightly VM Instance Shutdown Summary") # Optional
cfLogger = cfcommon.CloudFunctionLog()
# Validate whitelist
if INSTANCE_WHITELIST is None:
raise Exception("Unable to get whitelist")
debug_message("Whitelist loaded:")
debug_message(pformat((INSTANCE_WHITELIST)))
# Re-init Compute service - execution environment in cloud functions can be shared among each other. Let's re-init our connection every execution.
ComputeServiceContext.InitComputeService()
# Build the service object.
allRunningInstances = []
for cProjectId in QUERY_PROJECT_IDS:
debug_message("Checking Project: %s" % (cProjectId))
# Main Loop - Let's get and analyze all instances from our project
# Paginated within the 'request' object
runningInstances = []
allInstances = []
debug_message("Building Instance List...", end="")
instances = InstanceList(cProjectId)
instances.PopulateInstances()
debug_message("Done")
for cInstance in instances.GetAllInstances():
debug_message("Found Instance %s in %s [%s - %s]" % (cInstance.GetName(), cInstance.GetZone(), cInstance.GetId(), cInstance.GetStatus()))
# Check if whitelisted. If it is, skip it
if isInstanceInList(INSTANCE_WHITELIST, cInstance):
debug_message(" Instance is whitelisted. Skipping.")
continue
# Check if we should skip instance groups
if SKIP_INSTANCE_GROUPS and cInstance.IsWithinInstanceGroup():
debug_message(" Instance is within an instance group. Skipping.")
continue
debug_message(" Is Running: %s" % (cInstance.IsRunning()))
owner = cInstance.GetOwner()
if owner in ("devops", "ei devops", "eicsdevopseng"):
debug_message("Skipping instance owned by devops")
continue
# # TODO: FOR USE WHEN TESTING
# if VMAUTOSHUTDOWN_LABEL not in labels.keys():
# continue
# Keep track of this instance
allInstances.append(cInstance)
# If it's running, it's a candidate to stop
if cInstance.IsRunning():
runningInstances.append(cInstance)
# Handle no instances found
if len(allInstances) == 0:
debug_message("INFO: No Instances found.")
# Summarize for user
debug_message("")
if len(runningInstances) > 0:
debug_message("Found %s/%s non-whitelisted instances are running (project: %s)" % (len(runningInstances), len(allInstances), cProjectId))
else:
debug_message("All %s non-whitelisted instances are good (project: %s)" % (len(allInstances), cProjectId))
# Main loop to stop
debug_message("")
allRunningInstances = allRunningInstances + runningInstances
instancesToBeStopped = getInstancesToStop(allRunningInstances, SHUTDOWN_GRACEPERIOD_MIN)
stoppedCount = 0
instanceSummary = []
if len(instancesToBeStopped) == 0:
print_message("No instances are due to be stopped")
else:
for cInstance in instancesToBeStopped:
summaryEntry = {
"Name": cInstance.GetName(),
"ID": cInstance.GetId(),
"Zone": cInstance.GetShortZoneName(),
"Project": cInstance.GetProject(),
"Preview": PREVIEW_MODE,
"Stopped": False,
"InstanceLink": cInstance.GetSelfLinkToConsole()
}
logMessage = "Stopping Instance: {name} (ID: {id}, Zone: {zone}, Project: {project})".format(
name=summaryEntry.get("Name"),
id=summaryEntry.get("ID"),
zone=summaryEntry.get("Zone"),
project=summaryEntry.get("Project")
)
if PREVIEW_MODE:
print_message("(PREVIEW) " + logMessage )
else:
print_message(logMessage)
cInstance.Stop()
summaryEntry["Stopped"] = True
stoppedCount += 1
instanceSummary.append(summaryEntry)
if EMAIL_PUB_SUB_PROJECT is not None and EMAIL_PUB_SUB_TOPIC is not None:
debug_message("It looks like we have an email config. Attempting to send email")
emailBody = render_template(
'shutdown_report',
instance_summary=instanceSummary,
config=json.dumps(config, indent=4, sort_keys=True),
preview_mode=PREVIEW_MODE,
generation_time=datetime.datetime.now().astimezone(pytz.utc)
)
emailPayload = {
"To": EMAIL_TO,
"From": EMAIL_FROM,
"Subject": EMAIL_SUBJECT,
"BodyHtml": emailBody
}
if EMAIL_CC is not None:
emailPayload["CC"] = EMAIL_CC
if EMAIL_BCC is not None:
emailPayload["BCC"] = EMAIL_BCC
if not generate_local_report:
print_message("Sending email...", end="")
PublishMessage(EMAIL_PUB_SUB_PROJECT, EMAIL_PUB_SUB_TOPIC, json.dumps(emailPayload))
else:
print_message('Generating local HTML report')
with open('./html_reports/shutdown_report.html', 'w') as r:
r.write(emailBody)
print_message("Done")
# We want to log a nice structured json line to stackdriver for easy reporting.
cfLogger.log({
"StartTime": startTime.isoformat(),
"InstancesStoppedCount": stoppedCount,
"PreviewMode": PREVIEW_MODE,
"Instances": instanceSummary,
"Whitelist": INSTANCE_WHITELIST,
"EndTime": datetime.datetime.now().isoformat(),
"LogLine": "Summary"
})
print_message("DONE [%s]" % (datetime.datetime.now()))
The error I'm getting when trying to run a VM shutdown script:
Caught exception while running VMNightlyShutdown. Exception Text: Traceback (most recent call last):
File "/workspace/main.py", line 248, in StartCloudFunction
policy=policy_config)
File "/workspace/vm_nightly_shutdown.py", line 256, in AutoStopVMInstances
cInstance.Stop()
File "/workspace/modules/compute/instances.py", line 729, in Stop
stop_instance(self.project, self.GetShortZoneName(), self.GetName(), waitForCompletion=waitForCompletion)
File "/workspace/modules/compute/instances.py", line 45, in stop_instance
return wait_for_operation(project, zone, response["name"])
File "/workspace/modules/compute/instances.py", line 27, in wait_for_operation
operation=operation
File "/layers/google.python.pip/pip/lib/python3.7/site-packages/googleapiclient/_helpers.py", line 130, in positional_wrapper
return wrapped(*args, **kwargs)
File "/layers/google.python.pip/pip/lib/python3.7/site-packages/googleapiclient/http.py", line 932, in execute
headers=self.headers,
File "/layers/google.python.pip/pip/lib/python3.7/site-packages/googleapiclient/http.py", line 222, in _retry_request
raise exception
File "/layers/google.python.pip/pip/lib/python3.7/site-packages/googleapiclient/http.py", line 191, in _retry_request
resp, content = http.request(uri, method, *args, **kwargs)
File "/layers/google.python.pip/pip/lib/python3.7/site-packages/google_auth_httplib2.py", line 225, in request
**kwargs
File "/layers/google.python.pip/pip/lib/python3.7/site-packages/httplib2/__init__.py", line 1721, in request
conn, authority, uri, request_uri, method, body, headers, redirections, cachekey,
File "/layers/google.python.pip/pip/lib/python3.7/site-packages/httplib2/__init__.py", line 1440, in _request
(response, content) = self._conn_request(conn, request_uri, method, body, headers)
File "/layers/google.python.pip/pip/lib/python3.7/site-packages/httplib2/__init__.py", line 1392, in _conn_request
response = conn.getresponse()
File "/layers/google.python.runtime/python/lib/python3.7/http/client.py", line 1373, in getresponse
response.begin()
File "/layers/google.python.runtime/python/lib/python3.7/http/client.py", line 319, in begin
version, status, reason = self._read_status()
File "/layers/google.python.runtime/python/lib/python3.7/http/client.py", line 280, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "/layers/google.python.runtime/python/lib/python3.7/socket.py", line 589, in readinto
return self._sock.recv_into(b)
File "/layers/google.python.runtime/python/lib/python3.7/ssl.py", line 1071, in recv_into
return self.read(nbytes, buffer)
File "/layers/google.python.runtime/python/lib/python3.7/ssl.py", line 929, in read
return self._sslobj.read(len, buffer)
ssl.SSLError: [SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record mac (_ssl.c:2570)",
}

Exception: Failed to process waveform

Error:
Traceback (most recent call last):
File "c:\Programming\New_assistant\speech_to_text.py", line 18, in <module>
if rec.AcceptWaveform(data):
File "C:\Users\david\AppData\Local\Programs\Python\Python310\lib\site-packages\vosk\__init__.py", line 84, in AcceptWaveform
raise Exception("Failed to process waveform")
Exception: Failed to process waveform
PS C:\Programming\New_assistant>
I get this error when I try to use AcceptWaveform regardless of the file (wav) or the rest of the code, but the error is removed only when using vosk-model-small-ru-0.22, and does not give errors on vosk-model-ru-0.22, but the processing time is too long.
Code:
from vosk import Model, KaldiRecognizer
import json
import wave
model = Model(r"File\vosk-model-small-ru-0.22")
wf = wave.open(r"File\record1.wav", "rb")
rec = KaldiRecognizer(model, 8000)
result = ''
last_n = False
while True:
data = wf.readframes(8000)
if len(data) == 0:
break
if rec.AcceptWaveform(data):
res = json.loads(rec.Result())
if res['text'] != '':
result += f" {res['text']}"
last_n = False
elif not last_n:
result += '\n'
last_n = True
res = json.loads(rec.FinalResult())
result += f" {res['text']}"
print(result)

Using the poke method, I found a solution and got an assumption about the error occurring, so if I'm wrong or you have a more complete solution, add it and I'll mark it.
If you copied the sample code using vosk, then most likely this is the version with the module vosk-model-ru-0.22 which works with the sampling rate 8000,but vosk-model-small-ru-0.22 works with 44100, so just change 8000 to 44100 (depending on the entry)

Document AI: google.api_core.exceptions.InvalidArgument: 400 Request contains an invalid argument

I am getting this error when trying to implement the Document OCR from google cloud in python as explained here: https://cloud.google.com/document-ai/docs/ocr
When I run
result = client.process_document(request=request)
I get this error
Traceback (most recent call last):
File "/Users/Niolo/Desktop/untitled/Desktop/lib/python3.8/site-packages/google/api_core/grpc_helpers.py", line 73, in error_remapped_callable
return callable_(*args, **kwargs)
File "/Users/Niolo/Desktop/untitled/Desktop/lib/python3.8/site-packages/grpc/_channel.py", line 923, in __call__
return _end_unary_response_blocking(state, call, False, None)
File "/Users/Niolo/Desktop/untitled/Desktop/lib/python3.8/site-packages/grpc/_channel.py", line 826, in _end_unary_response_blocking
raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.INVALID_ARGUMENT
details = "Request contains an invalid argument."
debug_error_string = "{"created":"#1614769280.332675000","description":"Error received from peer ipv4:142.250.180.138:443","file":"src/core/lib/surface/call.cc","file_line":1068,"grpc_message":"Request contains an invalid argument.","grpc_status":3}"
>
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "/Users/Niolo/Desktop/untitled/Desktop/lib/python3.8/site-packages/google/cloud/documentai_v1beta3/services/document_processor_service/client.py", line 327, in process_document
response = rpc(request, retry=retry, timeout=timeout, metadata=metadata,)
File "/Users/Niolo/Desktop/untitled/Desktop/lib/python3.8/site-packages/google/api_core/gapic_v1/method.py", line 145, in __call__
return wrapped_func(*args, **kwargs)
File "/Users/Niolo/Desktop/untitled/Desktop/lib/python3.8/site-packages/google/api_core/retry.py", line 281, in retry_wrapped_func
return retry_target(
File "/Users/Niolo/Desktop/untitled/Desktop/lib/python3.8/site-packages/google/api_core/retry.py", line 184, in retry_target
return target()
File "/Users/Niolo/Desktop/untitled/Desktop/lib/python3.8/site-packages/google/api_core/grpc_helpers.py", line 75, in error_remapped_callable
six.raise_from(exceptions.from_grpc_error(exc), exc)
File "<string>", line 3, in raise_from
google.api_core.exceptions.InvalidArgument: 400 Request contains an invalid argument.
My full code:
import os
# Import the base64 encoding library.
project_id= 'your-project-id'
location = 'eu' # Format is 'us' or 'eu'
processor_id = 'your-processor-id' # Create processor in Cloud Console
file_path = '/file_path/invoice.pdf'
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/full_path/your_credentials.json"
def process_document_sample(
project_id: str, location: str, processor_id: str, file_path: str
):
from google.cloud import documentai_v1beta3 as documentai
# Instantiates a client
client = documentai.DocumentProcessorServiceClient()
# The full resource name of the processor, e.g.:
# projects/project-id/locations/location/processor/processor-id
# You must create new processors in the Cloud Console first
name = f"projects/{project_id}/locations/{location}/processors/{processor_id}"
with open(file_path, "rb") as image:
image_content = image.read()
# Read the file into memory
document = {"content": image_content, "mime_type": "application/pdf"}
# Configure the process request
request = {"name": name, "document": document}
# Recognizes text entities in the PDF document
result = client.process_document(request=request)
document = result.document
print("Document processing complete.")
# For a full list of Document object attributes, please reference this page: https://googleapis.dev/python/documentai/latest/_modules/google/cloud/documentai_v1beta3/types/document.html#Document
document_pages = document.pages
# Read the text recognition output from the processor
print("The document contains the following paragraphs:")
for page in document_pages:
paragraphs = page.paragraphs
for paragraph in paragraphs:
paragraph_text = get_text(paragraph.layout, document)
print(f"Paragraph text: {paragraph_text}")

client = documentai.DocumentProcessorServiceClient() points to US end point by default.
in: client = documentai.DocumentProcessorServiceClient()
in: print(client.DEFAULT_ENDPOINT)
out: us-documentai.googleapis.com
You need to override the api_endpoint to EU for this to work.
from google.api_core.client_options import ClientOptions
# Set endpoint to EU
options = ClientOptions(api_endpoint="eu-documentai.googleapis.com:443")
# Instantiates a client
client = documentai.DocumentProcessorServiceClient(client_options=options)
Here is the full code:
import os
# TODO(developer): Uncomment these variables before running the sample.
project_id= 'your-project-id'
location = 'eu' # Format is 'us' or 'eu'
processor_id = 'your-processor-id' # Create processor in Cloud Console
file_path = '/file_path/invoice.pdf'
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/full_path/your_credentials.json"
def process_document_sample(
project_id: str, location: str, processor_id: str, file_path: str
):
from google.cloud import documentai_v1beta3 as documentai
from google.api_core.client_options import ClientOptions
# Set endpoint to EU
options = ClientOptions(api_endpoint="eu-documentai.googleapis.com:443")
# Instantiates a client
client = documentai.DocumentProcessorServiceClient(client_options=options)
# The full resource name of the processor, e.g.:
# projects/project-id/locations/location/processor/processor-id
# You must create new processors in the Cloud Console first
name = f"projects/{project_id}/locations/{location}/processors/{processor_id}"
with open(file_path, "rb") as image:
image_content = image.read()
# Read the file into memory
document = {"content": image_content, "mime_type": "application/pdf"}
# Configure the process request
request = {"name": name, "document": document}
# Recognizes text entities in the PDF document
result = client.process_document(request=request)
document = result.document
print("Document processing complete.")
# For a full list of Document object attributes, please reference this page: https://googleapis.dev/python/documentai/latest/_modules/google/cloud/documentai_v1beta3/types/document.html#Document
document_pages = document.pages
# Read the text recognition output from the processor
print("The document contains the following paragraphs:")
for page in document_pages:
paragraphs = page.paragraphs
for paragraph in paragraphs:
paragraph_text = get_text(paragraph.layout, document)
print(f"Paragraph text: {paragraph_text}")
Here is a snippet of the output:

KeyError in Python Code even though Key is present

I am trying to create a Channel Separator code to separate the transcribe that is printed in a JSON file.
I have the following code:
import json
import boto3
def lambda_handler(event, context):
if event:
s3 = boto3.client("s3")
s3_object = event["Records"][0]["s3"]
bucket_name = s3_object["bucket"]["name"]
file_name = s3_object["object"]["key"]
file_obj = s3.get_object(Bucket=bucket_name, Key=file_name)
transcript_result = json.loads(file_obj["Body"].read())
segmented_transcript = transcript_result["results"]["channel_labels"]
items = transcript_result["results"]["items"]
channel_text = []
flag = False
channel_json = {}
for no_of_channel in range (segmented_transcript["number_of_channels"]):
for word in items:
for cha in segmented_transcript["channels"]:
if cha["channel_label"] == "ch_"+str(no_of_channel):
end_time = cha["end_time"]
if "start_time" in word:
if cha["items"]:
for cha_item in cha["items"]:
if word["end_time"] == cha_item["end_time"] and word["start_time"] == cha_item["start_time"]:
channel_text.append(word["alternatives"][0]["content"])
flag = True
elif word["type"] == "punctuation":
if flag and channel_text:
temp = channel_text[-1]
temp += word["alternatives"][0]["content"]
channel_text[-1] = temp
flag = False
break
channel_json["ch_"+str(no_of_channel)] = ' '.join(channel_text)
channel_text = []
print(channel_json)
s3.put_object(Bucket="aws-speaker-separation", Key=file_name, Body=json.dumps(channel_json))
return{
'statusCode': 200,
'body': json.dumps('Channel transcript separated successfully!')
}
However, when I run it, I get an error on line 23 saying:
[ERROR] KeyError: 'end_time'
Traceback (most recent call last):
File "/var/task/lambda_function.py", line 23, in lambda_handler
end_time = cha["end_time"]
I am confused as to why this error is happening as in my JSON code, the things to read are as follows:
JSON Code Parameters
Any ideas why this error is appearing?

cha is a channel, the end_time is a layer deeper in the items of your channel. To access the items of your channel do:
for item in cha["items"]:
print(item["end_time"])

Truncated file header while using multiprocessing

When I run the line:
def book_processing(pair, pool_length):
p = Pool(len(pool_length)*3)
temp_parameters = partial(book_call_mprocess, pair)
p.map_async(temp_parameters, pool_length).get(999999)
p.close()
p.join()
return exchange_books
I get the following error:
Traceback (most recent call last):
File "test_code.py", line 214, in <module>
current_books = book_call.book_processing(cp, book_list)
File "/home/user/Desktop/book_call.py", line 155, in book_processing
p.map_async(temp_parameters, pool_length).get(999999)
File "/usr/lib/python2.7/multiprocessing/pool.py", line 567, in get
raise self._value
zipfile.BadZipfile: Truncated file header
I feel as though there is some resource that is being used that didn't close during the last loop, but I am not sure how to close it (still learning about multiprocessing library). This error only occurs when my code repeats this section relatively quickly (within the same minute). This does not happen often, but is clear when it does.
Edit (adding the book_call code):
def book_call_mprocess(currency_pair, ex_list):
polo_error = 0
live_error = 0
kraken_error = 0
gdax_error = 0
ex_list = set([ex_list])
ex_Polo = 'Polo'
ex_Live = 'Live'
ex_GDAX = 'GDAX'
ex_Kraken = 'Kraken'
cp_polo = 'BTC_ETH'
cp_kraken = 'XETHXXBT'
cp_live = 'ETH/BTC'
cp_GDAX = 'ETH-BTC'
# Instances
polo_instance = poloapi.poloniex(polo_key, polo_secret)
fookraken = krakenapi.API(kraken_key, kraken_secret)
publicClient = GDAX.PublicClient()
flag = False
while not flag:
flag = False
err = False
# Polo Book
try:
if ex_Polo in ex_list:
polo_books = polo_instance.returnOrderBook(cp_polo)
exchange_books['Polo'] = polo_books
except:
err = True
polo_error = 1
# Livecoin
try:
if ex_Live in ex_list:
method = "/exchange/order_book"
live_books = OrderedDict([('currencyPair', cp_live)])
encoded_data = urllib.urlencode(live_books)
sign = hmac.new(live_secret, msg=encoded_data, digestmod=hashlib.sha256).hexdigest().upper()
headers = {"Api-key": live_key, "Sign": sign}
conn = httplib.HTTPSConnection(server)
conn.request("GET", method + '?' + encoded_data, '', headers)
response = conn.getresponse()
live_books = json.load(response)
conn.close()
exchange_books['Live'] = live_books
except:
err = True
live_error = 1
# Kraken
try:
if ex_Kraken in ex_list:
kraken_books = fookraken.query_public('Depth', {'pair': cp_kraken})
exchange_books['Kraken'] = kraken_books
except:
err = True
kraken_error = 1
# GDAX books
try:
if ex_GDAX in ex_list:
gdax_books = publicClient.getProductOrderBook(level=2, product=cp_GDAX)
exchange_books['GDAX'] = gdax_books
except:
err = True
gdax_error = 1
flag = True
if err:
flag = False
err = False
error_list = ['Polo', polo_error, 'Live', live_error, 'Kraken', kraken_error, 'GDAX', gdax_error]
print_to_excel('excel/error_handler.xlsx', 'Book Call Errors', error_list)
print "Holding..."
time.sleep(30)
return exchange_books
def print_to_excel(workbook, worksheet, data_list):
ts = str(datetime.datetime.now()).split('.')[0]
data_list = [ts] + data_list
wb = load_workbook(workbook)
if worksheet == 'active':
ws = wb.active
else:
ws = wb[worksheet]
ws.append(data_list)
wb.save(workbook)

The problem lies in the function print_to_excel
And more specifically in here:
wb = load_workbook(workbook)
If two processes are running this function at the same time, you'll run into the following race condition:
Process 1 wants to open error_handler.xlsx, since it doesn't exist it creates an empty file
Process 2 wants to open error_handler.xlsx, it does exist, so it tries to read it, but it is still empty. Since the xlsx format is just a zip file consisting of a bunch of XML files, the process expects a valid ZIP header which it doesn't find and it omits zipfile.BadZipfile: Truncated file header
What looks strange though is your error message as in the call stack I would have expected to see print_to_excel and load_workbook.
Anyway, Since you confirmed that the problem really is in the XLSX handling you can either
generate a new filename via tempfile for every process
use locking to ensure that only one process runs print_to_excel at a time

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

AWS Kinesis Consumer Python 3.4 Boto - python

Related

Python error when trying to execute a script to shutdown GCP VM

Exception: Failed to process waveform

Document AI: google.api_core.exceptions.InvalidArgument: 400 Request contains an invalid argument

KeyError in Python Code even though Key is present

Truncated file header while using multiprocessing

Categories

Resources