I am trying to write a channel separator that splits the transcript stored in a JSON file into one text per channel.
I have the following code:
import json
import boto3

def lambda_handler(event, context):
    if event:
        s3 = boto3.client("s3")
        s3_object = event["Records"][0]["s3"]
        bucket_name = s3_object["bucket"]["name"]
        file_name = s3_object["object"]["key"]
        file_obj = s3.get_object(Bucket=bucket_name, Key=file_name)
        transcript_result = json.loads(file_obj["Body"].read())
        segmented_transcript = transcript_result["results"]["channel_labels"]
        items = transcript_result["results"]["items"]
        channel_text = []
        flag = False
        channel_json = {}
        for no_of_channel in range (segmented_transcript["number_of_channels"]):
            for word in items:
                for cha in segmented_transcript["channels"]:
                    if cha["channel_label"] == "ch_"+str(no_of_channel):
                        end_time = cha["end_time"]
                        if "start_time" in word:
                            if cha["items"]:
                                for cha_item in cha["items"]:
                                    if word["end_time"] == cha_item["end_time"] and word["start_time"] == cha_item["start_time"]:
                                        channel_text.append(word["alternatives"][0]["content"])
                                        flag = True
                        elif word["type"] == "punctuation":
                            if flag and channel_text:
                                temp = channel_text[-1]
                                temp += word["alternatives"][0]["content"]
                                channel_text[-1] = temp
                                flag = False
                                break
            channel_json["ch_"+str(no_of_channel)] = ' '.join(channel_text)
            channel_text = []
        print(channel_json)
        s3.put_object(Bucket="aws-speaker-separation", Key=file_name, Body=json.dumps(channel_json))
    return {
        'statusCode': 200,
        'body': json.dumps('Channel transcript separated successfully!')
    }
However, when I run it, I get an error on line 23 saying:
[ERROR] KeyError: 'end_time'
Traceback (most recent call last):
File "/var/task/lambda_function.py", line 23, in lambda_handler
end_time = cha["end_time"]
I am confused as to why this error is happening as in my JSON code, the things to read are as follows:
JSON Code Parameters
Any ideas why this error is appearing?
cha is a channel; end_time is not a key of the channel itself but sits one level deeper, in the items of your channel. To access it, loop over the items of the channel:
for item in cha["items"]:
    print(item["end_time"])
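As a minimal sketch of the point above, the per-channel text and the last end_time can be read straight from each channel's items. This assumes each channel item has the same fields as the top-level items (type, alternatives, and, for spoken words only, start_time and end_time). The helper names channel_end_time and channel_transcripts are hypothetical, not part of any AWS API.

```python
def channel_end_time(channel):
    """Return the latest end_time among a channel's items, or None.

    Items without timing (assumed here to be punctuation) are skipped.
    """
    times = [float(item["end_time"]) for item in channel["items"] if "end_time" in item]
    return max(times, default=None)


def channel_transcripts(results):
    """Build {channel_label: text} from results["channel_labels"]["channels"]."""
    transcripts = {}
    for channel in results["channel_labels"]["channels"]:
        words = []
        for item in channel["items"]:
            content = item["alternatives"][0]["content"]
            if item.get("type") == "punctuation" and words:
                # Attach punctuation to the previous word instead of adding a token.
                words[-1] += content
            else:
                words.append(content)
        transcripts[channel["channel_label"]] = " ".join(words)
    return transcripts
```

Using item.get("type") and the "end_time" in item check keeps the loop from raising KeyError on items that lack those keys.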
I'm new to AWS Lambda, so please take it easy on me :)
I'm getting a lot of errors in my code, and I'm not sure of the best way to troubleshoot other than looking at the CloudWatch console and adjusting things as necessary. If anyone has any tips for troubleshooting, I'd appreciate it!
Here's my plan for what I want to do; please let me know if this makes sense:
1. Upload a file to an S3 bucket.
2. The upload triggers a Lambda to run (inside this Lambda is a Python script that modifies the data source; the source data is a messy file).
3. Store the output in the same S3 bucket, in a separate folder.
4. (Future state) Perform analysis on the new JSON file.
I have my S3 bucket created and I have set up the Lambda to trigger when a new file is added. That part is working! I have added my Python script (which works on my local drive) to the code section of the Lambda function.
The errors I am getting say that my 6 global variables (df_a1 through df_q1) are not defined. If I move them out of the function then it works; however, when I get to the merge portion of my code I get an error that says "Cannot merge a Series without a name". I gave them a name using the name= argument and I'm still getting this issue.
Here's my code that is in my aws lambda:
try:
    import json
    import boto3
    import pandas as pd
    import time
    import io
    print("All Modules are ok ...")
except Exception as e:
    print("Error in Imports ")

s3_client = boto3.client('s3')

#df_a1 = pd.Series(dtype='object', name='test1')
#df_g1 = pd.Series(dtype='object', name='test2')
#df_j1 = pd.Series(dtype='object', name='test3')
#df_p1 = pd.Series(dtype='object', name='test4')
#df_r1 = pd.Series(dtype='object', name='test5')
#df_q1 = pd.Series(dtype='object', name='test6')
def Add_A1 (xyz, RC, string):
    #DATA TO GRAB FROM STRING
    global df_a1
    IMG = boolCodeReturn(string[68:69].strip())
    roa = string[71:73].strip()
    #xyzName = string[71:73].strip()
    #ADD RECORD TO DATAFRAME
    series = pd.Series (data=[xyz, IMG, roa], index=['XYZ', 'IMG', 'Roa'])
    df_a1 = df_a1.append(series, ignore_index=True)

def Add_G1 (xyz, RC, string):
    global df_g1
    #DATA TO GRAB FROM STRING
    gcode = string[16:30].strip()
    ggname = string[35:95].strip()
    #ADD RECORD TO DATAFRAME
    series = pd.Series (data=[xyz, gcode, ggname], index=['XYZ', 'Gcode', 'Ggname'])
    df_g1 = df_g1.append(series, ignore_index=True)

def Add_J1 (xyz, RC, string):
    #DATA TO GRAB FROM STRING
    global df_j1
    xyzName = string[56:81].strip()
    #ADD RECORD TO DATAFRAME
    series = pd.Series (data=[xyz, xyzName], index=['XYZ', 'XYZName'])
    df_j1 = df_j1.append(series, ignore_index=True)

def Add_P01 (xyz, RC, string):
    global df_p1
    #DATA TO GRAB FROM STRING
    giname = string[50:90].strip()
    #ADD RECORD TO DATAFRAME
    series = pd.Series (data=[xyz, giname], index=['XYZ', 'Giname'])
    df_p1 = df_p1.append(series, ignore_index=True)

def Add_R01 (xyz, RC, string):
    global df_r1
    #DATA TO GRAB FROM STRING
    Awperr = boolCodeReturn(string[16:17].strip())
    #PPP= string[17:27].lstrip("0")
    AUPO = int(string[27:40].lstrip("0"))
    AUPO = AUPO / 100000
    AupoED = string[40:48]
    #ADD RECORD TO DATAFRAME
    series = pd.Series (data=[xyz, AUPO, Awperr, AupoED], index = ['XYZ', 'AUPO', 'Awperr', 'AupoED'])
    df_r1 = df_r1.append(series, ignore_index=True)

def Add_Q01 (xyz, RC, string):
    global df_q1
    #DATA TO GRAB FROM STRING
    #PPP= string[17:27].lstrip("0")
    UPPWA = int(string[27:40].lstrip("0"))
    UPPWA = UPPWA / 100000
    EDWAPPP = string[40:48]
    #ADD RECORD TO DATAFRAME
    series = pd.Series (data=[xyz, UPPWA, EDWAPPP], index = ['XYZ', 'UPPWA', 'EDWAPPPer'])
    df_q1 = df_q1.append(series, ignore_index=True)

def boolCodeReturn (code):
    if code == "X":
        return 1
    else:
        return 0

def errorHandler(xyz, RC, string):
    pass
def lambda_handler(event, context):
    print(event)
    #Get Bucket Name
    bucket = event['Records'][0]['s3']['bucket']['name']
    #get the file/key name
    key = event['Records'][0]['s3']['object']['key']
    response = s3_client.get_object(Bucket=bucket, Key=key)
    print("Got Bucket! - pass")
    print("Got Name! - pass ")
    data = response['Body'].read().decode('utf-8')
    print('reading data')
    buf = io.StringIO(data)
    print(buf.readline())
    #data is the file uploaded
    fileRow = buf.readline()
    print('reading_row')
    while fileRow:
        currentString = fileRow
        xyz = currentString[0:11].strip()
        RC = currentString[12:15].strip() #this grabs the code that indicates what the data type is
        #controls which function to run based on the code
        switcher = {
            "A1": Add_A1,
            "J1": Add_J1,
            "G1": Add_G1,
            "P01": Add_P01,
            "R01": Add_R01,
            "Q01": Add_Q01
        }
        runfunc = switcher.get(RC, errorHandler)
        runfunc (xyz, RC, currentString)
        fileRow = buf.readline()
    print(type(df_a1), "A1 FILE")
    print(type(df_g1), 'G1 FILE')
    buf.close()
    ##########STEP 3: JOIN THE DATA TOGETHER##########
    df_merge = pd.merge(df_a1, df_g1, how="left", on="XYZ")
    df_merge = pd.merge(df_merge, df_j1, how="left", on="XYZ")
    df_merge = pd.merge(df_merge, df_p1, how="left", on="XYZ")
    df_merge = pd.merge(df_merge, df_q1, how="left", on="XYZ")
    df_merge = pd.merge(df_merge, df_r1, how="left", on="XYZ")
    ##########STEP 4: SAVE THE DATASET TO A JSON FILE##########
    filename = 'Export-Records.json'
    json_buffer = io.StringIO()
    df_merge.to_json(json_buffer)
    s3_client.put_object(Bucket='file-etl', Key=filename, Body=json_buffer.getvalue())
    t = time.localtime()
    current_time = time.strftime("%H:%M:%S", t)
    print("Finished processing at " + current_time)
    response = {
        "statusCode": 200,
        'body': json.dumps("Code worked!")
    }
    return response
Here are some of the error messages:
[ERROR] NameError: name 'df_a1' is not defined
Traceback (most recent call last):
File "/var/task/lambda_function.py", line 145, in lambda_handler
runfunc (ndc, recordcode, currentString)
File "/var/task/lambda_function.py", line 26, in Add_A1
df_a1 = df_a1.append(series, ignore_index=True)
[ERROR] NameError: name 'df_g1' is not defined
Traceback (most recent call last):
File "/var/task/lambda_function.py", line 152, in lambda_handler
runfunc (ndc, recordcode, currentString)
File "/var/task/lambda_function.py", line 38, in Add_G1
df_g1 = df_g1.append(series, ignore_index=True)
[ERROR] ValueError: Cannot merge a Series without a name
Traceback (most recent call last):
File "/var/task/lambda_function.py", line 160, in lambda_handler
df_merge = pd.merge(df_a1, df_g1, how="left", on="NDC")
File "/opt/python/pandas/core/reshape/merge.py", line 111, in merge
op = _MergeOperation(
File "/opt/python/pandas/core/reshape/merge.py", line 645, in __init__
_left = _validate_operand(left)
File "/opt/python/pandas/core/reshape/merge.py", line 2425, in _validate_operand
raise ValueError("Cannot merge a Series without a name")
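One possible direction for both errors, as a sketch rather than a definitive fix. A `global` statement only declares a name; the name must still be assigned at module level before a function reads it. Appending a Series to a Series produces a longer Series, not a table, so pd.merge ends up with a Series instead of a DataFrame. The sketch below collects rows as dictionaries in module-level lists and builds the DataFrames once, just before merging. The names a1_rows, g1_rows, add_a1, add_g1 and build_merged are hypothetical, and only two of the six record types are shown.

```python
import pandas as pd

# Module-level containers: they exist before any function reads them,
# so no NameError is raised inside the helpers.
a1_rows = []
g1_rows = []


def add_a1(xyz, img, roa):
    """Collect one A1 record as a plain dict (one future DataFrame row)."""
    a1_rows.append({"XYZ": xyz, "IMG": img, "Roa": roa})


def add_g1(xyz, gcode, ggname):
    """Collect one G1 record as a plain dict."""
    g1_rows.append({"XYZ": xyz, "Gcode": gcode, "Ggname": ggname})


def build_merged():
    """Turn the collected rows into DataFrames and left-join them on XYZ."""
    df_a1 = pd.DataFrame(a1_rows, columns=["XYZ", "IMG", "Roa"])
    df_g1 = pd.DataFrame(g1_rows, columns=["XYZ", "Gcode", "Ggname"])
    return pd.merge(df_a1, df_g1, how="left", on="XYZ")
```

Module-level state can survive between invocations when a Lambda container is reused, so under this approach the lists would need to be cleared at the start of each handler call.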
I know there are loads of answers to this question but I'm still not getting it...
Following is sa_reporting.py
class saReport():
def __init__(self, body, to_table, normalise=False, date_col=None):
global directory
self.body = body
self.to_table = to_table
self.normalise = normalise
self.date_col = date_col if date_col is not None else []
directory = os.path.join('/Users','python', self.to_table)
if not os.path.exists(directory):
os.mkdir(directory)
def download_files(self, ...):
...
def download_reports(self, ...):
...
def get_files(self):
...
def read_file(self, file):
....
def load_to_db(self, sort_by=None): # THIS IS WHAT I THINK IS CAUSING THE ERROR
sort_by = sort_by if sort_by is not None else [] # THIS IS WHAT I TRIED TO FIX IT
def normalise_data(self, data):
dim_data = []
for row in data:
if row not in dim_data:
dim_data.append(row)
return dim_data
def convert_dates(self, data):
if self.date_col:
for row in data:
for index in self.date_col:
if len(row[index]) > 10:
row[index] = row[index][:-5].replace('T',' ')
row[index] = datetime.datetime.strptime(row[index], "%Y-%m-%d %H:%M:%S")
else:
row[index] = datetime.datetime.strptime(row[index], "%Y-%m-%d").date()
return data
print(f'\nWriting data to {self.to_table} table...', end='')
files = self.get_files()
for file in files:
print('Processing ' + file.split("sa360/",1)[1] + '...', end='')
csv_file = self.read_file(file)
csv_headers = ', '.join(csv_file[0])
csv_data = csv_file[1:]
if self.normalise:
csv_data = self.normalise_data(csv_data)
csv_data = self.convert_dates(csv_data)
if sort_by:
csv_data = sorted(csv_data, key=itemgetter(sort_by))
#...some other code that inserts into a database...
Executing the following script (sa_main.py):
import sa_reporting
from sa_body import *
dim_campaign_test = sa_reporting.saReport(
    body=dim_campaign_body,
    to_table='dimsa360CampaignTest',
    normalise=True,
    date_col=[4,5]
)
dim_campaign_test_download = dim_campaign_test.download_reports()
dim_campaign_test_download.load_to_db(sort_by=0) # THIS IS WHERE THE ERROR OCCURS
Output and error message:
Downloading reports...
The report is still generating...restarting
The report is ready
Processing...
Downloading fragment 0 for report AAAnOdc9I_GnxAB0
Files successfully downloaded
Traceback (most recent call last):
File "sa_main.py", line 43, in <module>
dim_campaign_test_download.load_to_db(sort_by=0)
AttributeError: 'NoneType' object has no attribute 'load_to_db'
Why am I getting this error? And how can I fix it?
I just want to make None be the default argument and if a user specifies the sort_by parameter then None will be replaced with whatever the user specifies (which should be an integer index)
This error means that dim_campaign_test_download is None. In the line below, you set it to the return value of dim_campaign_test.download_reports(). A method that finishes without an explicit return statement returns None, and since your output shows the files being downloaded successfully, that is the most likely cause:
dim_campaign_test_download = dim_campaign_test.download_reports()
You probably want to do the following instead, as dim_campaign_test is the saReport object you want to operate on:
dim_campaign_test.load_to_db(sort_by=0)
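A minimal sketch of the behaviour described above, using a hypothetical stand-in class (Report is not the real saReport, and its method bodies are made up for illustration). It shows that a method without a return statement yields None, and that an optional integer argument is safest to test with `is not None`.

```python
class Report:
    """Hypothetical stand-in for saReport, for illustration only."""

    def download_reports(self):
        # No return statement, so the call evaluates to None.
        self.downloaded = True

    def load_to_db(self, sort_by=None):
        rows = [["b", 2], ["a", 1]]
        # 0 is a valid column index but is falsy, so compare against None
        # explicitly instead of writing "if sort_by:".
        if sort_by is not None:
            rows = sorted(rows, key=lambda row: row[sort_by])
        return rows


report = Report()
result = report.download_reports()   # result is None
rows = report.load_to_db(sort_by=0)  # call the method on the object itself
```

Note that with `if sort_by:` the call load_to_db(sort_by=0) would skip the sort entirely, because 0 evaluates as false.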
This code was posted ready-made on a Zapier forum to pull failed responses from another piece of software called iAuditor. When I plug in the code and update the API token and webhook URL, this error pops up:
Traceback (most recent call last):
SyntaxError: invalid syntax (usercode.py, line 42)
Here is the code:
[code]
import json
import requests
auth_header = {'Authorization': 'a4fca847d3f203bd7306ef5d1857ba67a2b3d66aa455e06fac0ad0be87b9d226'}
webhook_url = 'https://hooks.zapier.com/hooks/catch/3950922/efka9n/'
api_url = 'https://api.safetyculture.io/audits/'
audit_id = input['audit_id']
audit_doc = requests.get(api_url + audit_id, headers=auth_header).json()
failed_items = []
audit_author = audit_doc['audit_data']['authorship']['author']
conducted_on = audit_doc['audit_data']['date_completed']
conducted_on = conducted_on[:conducted_on.index('T')]
audit_title = audit_doc['template_data']['metadata']['name']
for item in audit_doc['items']:
    if item.get('responses') and item['responses'].get('failed') == True:
        label = item.get('label')
        if label is None:
            label = 'no_label'
        responses = item['responses']
        response_label = responses['selected'][0]['label']
        notes = responses.get('text')
        if notes is None:
            notes = ''
        failed_items.append({'label': label,
                             'response_label': response_label,
                             'conducted_on': conducted_on,
                             'notes': notes,
                             'author': audit_author
                             })
for item in failed_items:
    r = requests.post(webhook_url, data = item)
return response.json()
[/code]
This looks like an error from the platform. It looks like Zapier uses a script called usercode.py to bootstrap launching your script and the error seems to be coming from that part.
When I run the following function:
def book_processing(pair, pool_length):
    p = Pool(len(pool_length)*3)
    temp_parameters = partial(book_call_mprocess, pair)
    p.map_async(temp_parameters, pool_length).get(999999)
    p.close()
    p.join()
    return exchange_books
I get the following error:
Traceback (most recent call last):
File "test_code.py", line 214, in <module>
current_books = book_call.book_processing(cp, book_list)
File "/home/user/Desktop/book_call.py", line 155, in book_processing
p.map_async(temp_parameters, pool_length).get(999999)
File "/usr/lib/python2.7/multiprocessing/pool.py", line 567, in get
raise self._value
zipfile.BadZipfile: Truncated file header
I feel as though there is some resource that is being used that didn't close during the last loop, but I am not sure how to close it (still learning about multiprocessing library). This error only occurs when my code repeats this section relatively quickly (within the same minute). This does not happen often, but is clear when it does.
Edit (adding the book_call code):
def book_call_mprocess(currency_pair, ex_list):
    polo_error = 0
    live_error = 0
    kraken_error = 0
    gdax_error = 0
    ex_list = set([ex_list])
    ex_Polo = 'Polo'
    ex_Live = 'Live'
    ex_GDAX = 'GDAX'
    ex_Kraken = 'Kraken'
    cp_polo = 'BTC_ETH'
    cp_kraken = 'XETHXXBT'
    cp_live = 'ETH/BTC'
    cp_GDAX = 'ETH-BTC'
    # Instances
    polo_instance = poloapi.poloniex(polo_key, polo_secret)
    fookraken = krakenapi.API(kraken_key, kraken_secret)
    publicClient = GDAX.PublicClient()
    flag = False
    while not flag:
        flag = False
        err = False
        # Polo Book
        try:
            if ex_Polo in ex_list:
                polo_books = polo_instance.returnOrderBook(cp_polo)
                exchange_books['Polo'] = polo_books
        except:
            err = True
            polo_error = 1
        # Livecoin
        try:
            if ex_Live in ex_list:
                method = "/exchange/order_book"
                live_books = OrderedDict([('currencyPair', cp_live)])
                encoded_data = urllib.urlencode(live_books)
                sign = hmac.new(live_secret, msg=encoded_data, digestmod=hashlib.sha256).hexdigest().upper()
                headers = {"Api-key": live_key, "Sign": sign}
                conn = httplib.HTTPSConnection(server)
                conn.request("GET", method + '?' + encoded_data, '', headers)
                response = conn.getresponse()
                live_books = json.load(response)
                conn.close()
                exchange_books['Live'] = live_books
        except:
            err = True
            live_error = 1
        # Kraken
        try:
            if ex_Kraken in ex_list:
                kraken_books = fookraken.query_public('Depth', {'pair': cp_kraken})
                exchange_books['Kraken'] = kraken_books
        except:
            err = True
            kraken_error = 1
        # GDAX books
        try:
            if ex_GDAX in ex_list:
                gdax_books = publicClient.getProductOrderBook(level=2, product=cp_GDAX)
                exchange_books['GDAX'] = gdax_books
        except:
            err = True
            gdax_error = 1
        flag = True
        if err:
            flag = False
            err = False
            error_list = ['Polo', polo_error, 'Live', live_error, 'Kraken', kraken_error, 'GDAX', gdax_error]
            print_to_excel('excel/error_handler.xlsx', 'Book Call Errors', error_list)
            print "Holding..."
            time.sleep(30)
    return exchange_books
def print_to_excel(workbook, worksheet, data_list):
    ts = str(datetime.datetime.now()).split('.')[0]
    data_list = [ts] + data_list
    wb = load_workbook(workbook)
    if worksheet == 'active':
        ws = wb.active
    else:
        ws = wb[worksheet]
    ws.append(data_list)
    wb.save(workbook)
The problem lies in the function print_to_excel, and more specifically here:
wb = load_workbook(workbook)
If two processes are running this function at the same time, you'll run into the following race condition:
Process 1 calls wb.save(workbook), which truncates error_handler.xlsx and starts rewriting it.
Process 2 calls load_workbook on error_handler.xlsx while the file is still empty or only partly written. Since the xlsx format is just a zip file consisting of a bunch of XML files, the process expects a valid ZIP header, which it doesn't find, and it raises zipfile.BadZipfile: Truncated file header.
What looks strange, though, is your error message: in the call stack I would have expected to see print_to_excel and load_workbook.
Anyway, since you confirmed that the problem really is in the XLSX handling, you can either
generate a new filename via tempfile for every process, or
use locking to ensure that only one process runs print_to_excel at a time.
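The locking option could be sketched as follows. This is an illustration under assumptions, not the asker's code: a plain text file stands in for the workbook, and init_worker, append_line_locked, log_error and run_demo are hypothetical names. The lock is created once in the parent and handed to each worker through the Pool initializer, so all workers share the same lock.

```python
import multiprocessing

_lock = None  # set in every worker process by init_worker


def init_worker(lock):
    """Pool initializer: keep the shared lock in a module-level global."""
    global _lock
    _lock = lock


def append_line_locked(path, line, lock=None):
    """Append one line to path while holding the lock.

    In the original code, the body of the inner block would be the
    load_workbook / ws.append / wb.save sequence.
    """
    if lock is None:
        lock = _lock
    with lock:
        with open(path, "a") as handle:
            handle.write(line + "\n")


def log_error(args):
    """Worker task: record an error line for one worker id."""
    path, worker_id = args
    append_line_locked(path, "error from worker %d" % worker_id)


def run_demo(path, workers=4):
    """Start a pool whose workers all serialize their writes through one lock."""
    lock = multiprocessing.Lock()
    pool = multiprocessing.Pool(workers, initializer=init_worker, initargs=(lock,))
    try:
        pool.map(log_error, [(path, i) for i in range(workers)])
    finally:
        pool.close()
        pool.join()
```

Because the read-modify-write sequence runs entirely inside the `with lock:` block, no process can open the file while another is halfway through rewriting it.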
I am trying to build a Kinesis consumer script using Python 3.4. Below is an example of my code. I want the records to be saved to a local file that I can later push to S3:
from boto import kinesis
import time
import json

# AWS Connection Credentials
aws_access_key = 'your_key'
aws_access_secret = 'your_secret key'

# Selected Kinesis Stream
stream = 'TwitterTesting'

# Aws Authentication
auth = {"aws_access_key_id": aws_access_key, "aws_secret_access_key": aws_access_secret}
conn = kinesis.connect_to_region('us-east-1',**auth)

# Targeted file to be pushed to S3 bucket
fileName = "KinesisDataTest2.txt"
file = open("C:\\Users\\csanders\\PycharmProjects\\untitled\\KinesisDataTest.txt", "a")

# Describe stream and get shards
tries = 0
while tries < 10:
    tries += 1
    time.sleep(1)
    response = conn.describe_stream(stream)
    if response['StreamDescription']['StreamStatus'] == 'ACTIVE':
        break
else:
    raise TimeoutError('Stream is still not active, aborting...')

# Get Shard Iterator and get records from stream
shard_ids = []
stream_name = None
if response and 'StreamDescription' in response:
    stream_name = response['StreamDescription']['StreamName']
    for shard_id in response['StreamDescription']['Shards']:
        shard_id = shard_id['ShardId']
        shard_iterator = conn.get_shard_iterator(stream,
                                                 shard_id, 'TRIM_HORIZON')
        shard_ids.append({'shard_id': shard_id, 'shard_iterator': shard_iterator['ShardIterator']})

tries = 0
result = []
while tries < 100:
    tries += 1
    response = conn.get_records(shard_iterator, 100)
    shard_iterator = response['NextShardIterator']
    if len(response['Records'])> 0:
        for res in response['Records']:
            result.append(res['Data'])
            print(result, shard_iterator)
For some reason when I run this script I get the following error each time:
Traceback (most recent call last):
File "C:/Users/csanders/PycharmProjects/untitled/Get_records_Kinesis.py", line 57, in <module>
response = json.load(conn.get_records(shard_ids, 100))
File "C:\Python34\lib\site-packages\boto-2.38.0-py3.4.egg\boto\kinesis\layer1.py", line 327, in get_records
body=json.dumps(params))
File "C:\Python34\lib\site-packages\boto-2.38.0- py3.4.egg\boto\kinesis\layer1.py", line 874, in make_request
body=json_body)
boto.exception.JSONResponseError: JSONResponseError: 400 Bad Request
{'Message': 'Start of list found where not expected', '__type': 'SerializationException'}
My end goal is to eventually kick this data into an S3 bucket. I just need to get these records to return and print first. The data going into the stream is JSON dump twitter data using the put_record function. I can post that code too if needed.
Updated that one line from response = json.load(conn.get_records(shard_ids, 100)) to response = conn.get_records(shard_iterator, 100)
response = json.load(conn.get_records(shard_ids, 100))
get_records expects a single shard iterator (a string), not a list of shards. When you pass it the shard_ids list, the request fails (you see the 400 from Kinesis saying that the request is bad).
http://boto.readthedocs.org/en/latest/ref/kinesis.html?highlight=get_records#boto.kinesis.layer1.KinesisConnection.get_records
If you replace the following, it will work. (Set the while condition according to how many records you would like to collect; you can make the loop infinite by using while tries == 0 and removing tries += 1.)
shard_iterator = conn.get_shard_iterator(stream,
                                         shard_id, 'TRIM_HORIZON')
shard_ids.append({'shard_id': shard_id, 'shard_iterator': shard_iterator['ShardIterator']})
with the following:
shard_iterator = conn.get_shard_iterator(stream,
                                         shard_id, "LATEST")["ShardIterator"]
Also, to write to a file, change the following ("\n" adds a new line):
print(result, shard_iterator)
to:
file.write(str(result) + "\n")
Hope it helps.
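The polling loop described in this answer can be sketched as a small function. This is an illustration under assumptions: conn is any object with a get_records(shard_iterator, limit) method that returns a dict with "Records" and "NextShardIterator" keys, as boto's Kinesis connection does, and collect_records is a hypothetical helper name. The key point is that a single iterator string is passed in and then replaced by NextShardIterator after every call.

```python
import time


def collect_records(conn, shard_iterator, max_polls=100, batch_size=100, pause=0.0):
    """Poll one shard and return the Data field of every record seen.

    shard_iterator must be the iterator string itself, not the dict
    returned by get_shard_iterator and not a list of shards.
    """
    collected = []
    for _ in range(max_polls):
        response = conn.get_records(shard_iterator, batch_size)
        collected.extend(record["Data"] for record in response["Records"])
        # Each response carries the iterator to use for the next call.
        shard_iterator = response.get("NextShardIterator")
        if shard_iterator is None:
            # No further iterator: the shard has been closed.
            break
        if pause:
            time.sleep(pause)
    return collected
```

With a real connection, the first argument would be the value under "ShardIterator" in the get_shard_iterator response, and a non-zero pause helps stay under the per-shard read limits.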