I'm new to AWS Lambda, so please take it easy on me :)
I'm getting a lot of errors in my code, and I'm not sure of the best way to troubleshoot other than looking at the CloudWatch console and adjusting things as necessary. If anyone has any tips for troubleshooting, I'd appreciate it!
Here's my plan for what I want to do; please let me know if this makes sense:
1. Upload a file to an S3 bucket.
2. The upload triggers a Lambda to run (inside this Lambda is a Python script that modifies the source data, which is a messy file).
3. Store the output in the same S3 bucket, in a separate folder.
4. (Future state) Perform analysis on the new JSON file.
I have my S3 bucket created and I have set up the Lambda to trigger when a new file is added. That part is working! I have added my Python script (which works on my local drive) to the Lambda function, within the code section of Lambda.
The errors I am getting say that my 6 global variables (df_a1 through df_q1) are not defined. If I move them out of the function then it works; however, when I get to the merge portion of my code I get an error that says "Cannot merge a Series without a name". I gave them a name using the name= argument and I'm still getting this issue.
Here's my code that is in my aws lambda:
try:
    import json
    import boto3
    import pandas as pd
    import time
    import io
    print("All Modules are ok ...")
except Exception as e:
    print("Error in Imports ")
s3_client = boto3.client('s3')
#df_a1 = pd.Series(dtype='object', name='test1')
#df_g1 = pd.Series(dtype='object', name='test2')
#df_j1 = pd.Series(dtype='object', name='test3')
#df_p1 = pd.Series(dtype='object', name='test4')
#df_r1 = pd.Series(dtype='object', name='test5')
#df_q1 = pd.Series(dtype='object', name='test6')
def Add_A1 (xyz, RC, string):
    #DATA TO GRAB FROM STRING
    global df_a1
    IMG = boolCodeReturn(string[68:69].strip())
    roa = string[71:73].strip()
    #xyzName = string[71:73].strip()
    #ADD RECORD TO DATAFRAME
    series = pd.Series (data=[xyz, IMG, roa], index=['XYZ', 'IMG', 'Roa'])
    df_a1 = df_a1.append(series, ignore_index=True)
def Add_G1 (xyz, RC, string):
    global df_g1
    #DATA TO GRAB FROM STRING
    gcode = string[16:30].strip()
    ggname = string[35:95].strip()
    #ADD RECORD TO DATAFRAME
    series = pd.Series (data=[xyz, gcode, ggname], index=['XYZ', 'Gcode', 'Ggname'])
    df_g1 = df_g1.append(series, ignore_index=True)
def Add_J1 (xyz, RC, string):
    #DATA TO GRAB FROM STRING
    global df_j1
    xyzName = string[56:81].strip()
    #ADD RECORD TO DATAFRAME
    series = pd.Series (data=[xyz, xyzName], index=['XYZ', 'XYZName'])
    df_j1 = df_j1.append(series, ignore_index=True)
def Add_P01 (xyz, RC, string):
    global df_p1
    #DATA TO GRAB FROM STRING
    giname = string[50:90].strip()
    #ADD RECORD TO DATAFRAME
    series = pd.Series (data=[xyz, giname], index=['XYZ', 'Giname'])
    df_p1 = df_p1.append(series, ignore_index=True)
def Add_R01 (xyz, RC, string):
    global df_r1
    #DATA TO GRAB FROM STRING
    Awperr = boolCodeReturn(string[16:17].strip())
    #PPP= string[17:27].lstrip("0")
    AUPO = int(string[27:40].lstrip("0"))
    AUPO = AUPO / 100000
    AupoED = string[40:48]
    #ADD RECORD TO DATAFRAME
    series = pd.Series (data=[xyz, AUPO, Awperr, AupoED], index = ['XYZ', 'AUPO', 'Awperr', 'AupoED'])
    df_r1 = df_r1.append(series, ignore_index=True)
def Add_Q01 (xyz, RC, string):
    global df_q1
    #DATA TO GRAB FROM STRING
    #PPP= string[17:27].lstrip("0")
    UPPWA = int(string[27:40].lstrip("0"))
    UPPWA = UPPWA / 100000
    EDWAPPP = string[40:48]
    #ADD RECORD TO DATAFRAME
    series = pd.Series (data=[xyz, UPPWA, EDWAPPP], index = ['XYZ', 'UPPWA', 'EDWAPPPer'])
    df_q1 = df_q1.append(series, ignore_index=True)
def boolCodeReturn (code):
    if code == "X":
        return 1
    else:
        return 0
def errorHandler(xyz, RC, string):
    pass
def lambda_handler(event, context):
    print(event)
    #Get Bucket Name
    bucket = event['Records'][0]['s3']['bucket']['name']
    #get the file/key name
    key = event['Records'][0]['s3']['object']['key']
    response = s3_client.get_object(Bucket=bucket, Key=key)
    print("Got Bucket! - pass")
    print("Got Name! - pass ")
    data = response['Body'].read().decode('utf-8')
    print('reading data')
    buf = io.StringIO(data)
    print(buf.readline())
    #data is the file uploaded
    fileRow = buf.readline()
    print('reading_row')
    while fileRow:
        currentString = fileRow
        xyz = currentString[0:11].strip()
        RC = currentString[12:15].strip() #this grabs the code that indicates what the data type is
        #controls which function to run based on the code
        switcher = {
            "A1": Add_A1,
            "J1": Add_J1,
            "G1": Add_G1,
            "P01": Add_P01,
            "R01": Add_R01,
            "Q01": Add_Q01
        }
        runfunc = switcher.get(RC, errorHandler)
        runfunc (xyz, RC, currentString)
        fileRow = buf.readline()
    print(type(df_a1), "A1 FILE")
    print(type(df_g1), 'G1 FILE')
    buf.close()
    ##########STEP 3: JOIN THE DATA TOGETHER##########
    df_merge = pd.merge(df_a1, df_g1, how="left", on="XYZ")
    df_merge = pd.merge(df_merge, df_j1, how="left", on="XYZ")
    df_merge = pd.merge(df_merge, df_p1, how="left", on="XYZ")
    df_merge = pd.merge(df_merge, df_q1, how="left", on="XYZ")
    df_merge = pd.merge(df_merge, df_r1, how="left", on="XYZ")
    ##########STEP 4: SAVE THE DATASET TO A JSON FILE##########
    filename = 'Export-Records.json'
    json_buffer = io.StringIO()
    df_merge.to_json(json_buffer)
    # keyword is Bucket (it was misspelled as Buket)
    s3_client.put_object(Bucket='file-etl', Key=filename, Body=json_buffer.getvalue())
    t = time.localtime()
    current_time = time.strftime("%H:%M:%S", t)
    print("Finished processing at " + current_time)
    response = {
        "statusCode": 200,
        'body': json.dumps("Code worked!")
    }
    return response
Here are some of the error messages:
[ERROR] NameError: name 'df_a1' is not defined
Traceback (most recent call last):
File "/var/task/lambda_function.py", line 145, in lambda_handler
runfunc (ndc, recordcode, currentString)
File "/var/task/lambda_function.py", line 26, in Add_A1
df_a1 = df_a1.append(series, ignore_index=True)
[ERROR] NameError: name 'df_g1' is not defined
Traceback (most recent call last):
File "/var/task/lambda_function.py", line 152, in lambda_handler
runfunc (ndc, recordcode, currentString)
File "/var/task/lambda_function.py", line 38, in Add_G1
df_g1 = df_g1.append(series, ignore_index=True)
[ERROR] ValueError: Cannot merge a Series without a name
Traceback (most recent call last):
File "/var/task/lambda_function.py", line 160, in lambda_handler
df_merge = pd.merge(df_a1, df_g1, how="left", on="NDC")
File "/opt/python/pandas/core/reshape/merge.py", line 111, in merge
op = _MergeOperation(
File "/opt/python/pandas/core/reshape/merge.py", line 645, in __init__
_left = _validate_operand(left)
File "/opt/python/pandas/core/reshape/merge.py", line 2425, in _validate_operand
raise ValueError("Cannot merge a Series without a name")
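A possible way around both errors, offered as a sketch rather than a drop-in fix. `global df_a1` only tells Python to look the name up at module level, so if nothing has been assigned there the first read raises NameError. And when the module-level object starts out as an empty pd.Series, appending further Series to it most likely leaves you with a Series rather than a DataFrame, which pd.merge rejects. Collecting each record as a dict in a plain list and building the DataFrames once at the end avoids both. The function name parse_records, the helper make_line, the sample lines, and the simplified payload offsets for A1 are made up for illustration; only the XYZ, record-code, and Gcode slices come from the code above.

```python
import pandas as pd


def parse_records(lines):
    """Collect rows per record code, then build one DataFrame per code."""
    rows = {"A1": [], "G1": []}  # local state, so nothing leaks between invocations
    for line in lines:
        xyz = line[0:11].strip()
        rc = line[12:15].strip()
        if rc == "A1":
            # hypothetical, simplified offset for the A1 payload
            rows["A1"].append({"XYZ": xyz, "Roa": line[16:18].strip()})
        elif rc == "G1":
            rows["G1"].append({"XYZ": xyz, "Gcode": line[16:30].strip()})
    df_a1 = pd.DataFrame(rows["A1"], columns=["XYZ", "Roa"])
    df_g1 = pd.DataFrame(rows["G1"], columns=["XYZ", "Gcode"])
    return df_a1, df_g1


def make_line(xyz, rc, payload):
    """Build a fixed-width sample line: XYZ in 0:11, code in 12:15, payload from 16."""
    return "{:<11} {:<3} {}".format(xyz, rc, payload)


sample = [
    make_line("00000000001", "A1", "PO"),
    make_line("00000000001", "G1", "GCODE123"),
    make_line("00000000002", "A1", "IV"),
]

df_a1, df_g1 = parse_records(sample)
df_merge = pd.merge(df_a1, df_g1, how="left", on="XYZ")
print(df_merge)
```

Because both inputs are real DataFrames with an XYZ column, the merge has a key to join on even when one record type is missing from the file.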
I don't know what I'm doing wrong with the leftJoin() function.
I have a connection to a DB2 database and an Oracle database.
The queries return results in both DB2 and Oracle.
I then read the primary key value and try to pass it as a variable to the leftJoin() function, but that doesn't work.
The key consists of two fields. If I manually put 'ID', 'VER' into the on argument of the merge for df1, it works.
import ibm_db
import ibm_db as db
import ibm_db_dbi
import pandas as pd
import cx_Oracle
import re
def connectDB2():
    arg1 = "DATABASE=...;HOSTNAME=...;PORT=...;UID=...;PWD=...;"
    conn = db.connect(arg1, "", "")
    if conn:
        print ('connection success')
        # Run SQL
        sql_generic = "SELECT distinct 'select ' || LISTAGG(COLNAME,', ') || ' from ' || trim(TABSCHEMA) || '.' || tabname || ' FETCH FIRST 2 ROWS ONLY' FROM SYSCAT.columns WHERE TABSCHEMA = '...' AND TABNAME = '...' AND COLNAME NOT IN ('CDC_STATUS','CDC_ODS_UPD') GROUP BY TABSCHEMA, TABNAME"
        stmt = ibm_db.exec_immediate(conn, sql_generic)
        result = ibm_db.fetch_both(stmt)
        conn1 = ibm_db_dbi.Connection(conn)
        connectDB2.df = pd.read_sql(result['1'], conn1)
        print('df', connectDB2.df)
        sql_PK = "SELECT COLNAMES FROM syscat.INDEXES WHERE TABSCHEMA='...' AND TABNAME = '...' AND UNIQUERULE='P'"
        conn2 = ibm_db_dbi.Connection(conn)
        connectDB2.df1 = pd.read_sql(sql_PK, conn2)
        print('pk', connectDB2.df1)
        d = connectDB2.df1.loc[:, "COLNAMES"]
        print('d', d)
        print('d0', d[0])
        content_new1 = re.sub('$|^', '\'', d[0], flags=re.M)
        content_new2 = re.sub('\'\+', '\'', content_new1, flags=re.M)
        connectDB2.content_new3 = re.sub('\+', '\',\'', content_new2, flags=re.M)
        print('c3', connectDB2.content_new3)  # --> format: 'ID','VER'
    else:
        print ('connection failed')
def connectOracle():
    con = cx_Oracle.connect('...')
    orders_sql = """select ... from ... FETCH FIRST 2 ROWS ONLY""";
    connectOracle.df_orders = pd.read_sql(orders_sql, con)
    print(connectOracle.df_orders)
def leftJoin():
    df1 = pd.merge(connectOracle.df_orders, connectDB2.df, on=connectDB2.content_new3, how='left')
connectDB2()
connectOracle()
leftJoin()
I am adding below what the logs return.
Traceback (most recent call last):
File "C:\Users\PycharmProjects\pythonProject1\testConnection.py", line 68, in <module>
leftJoin()
File "C:\Users\PycharmProjects\pythonProject1\testConnection.py", line 57, in leftJoin
df1 = pd.merge(connectOracle.df_orders, connectDB2.df, on=connectDB2.content_new3, how='left')
File "C:\Users\PycharmProjects\pythonProject1\venv\lib\site-packages\pandas\core\reshape\merge.py", line 106, in merge
op = _MergeOperation(
File "C:\Users\PycharmProjects\pythonProject1\venv\lib\site-packages\pandas\core\reshape\merge.py", line 699, in __init__
) = self._get_merge_keys()
File "C:\Users\PycharmProjects\pythonProject1\venv\lib\site-packages\pandas\core\reshape\merge.py", line 1096, in _get_merge_keys
right_keys.append(right._get_label_or_level_values(rk))
File "C:\Users\PycharmProjects\pythonProject1\venv\lib\site-packages\pandas\core\generic.py", line 1779, in _get_label_or_level_values
raise KeyError(key)
KeyError: "'ID','VER'"
You are using merge incorrectly.
I don't know what your DataFrames actually contain:
connectOracle.df_orders
connectDB2.df
But you are doing a left join, and the key you are passing comes from your second ("right") DataFrame. The on argument has to name a column (or columns) present in both DataFrames:
pd.merge(connectOracle.df_orders, connectDB2.df, on='column name found in both DataFrames', how='left')
If no such shared column exists, specify the key columns for each side with the left_on and right_on parameters instead.
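The KeyError in the question points at something more specific: on received the single string "'ID','VER'", which pandas treats as one column label. For a composite key, on expects a list of column names. A minimal sketch, assuming the COLNAMES value looks like +ID+VER (which is what the regular expressions in the question appear to be handling); the helper name parse_key_columns and the sample DataFrames are invented for illustration:

```python
import pandas as pd


def parse_key_columns(colnames):
    """Turn a string such as '+ID+VER' into ['ID', 'VER']."""
    return [name for name in colnames.split("+") if name]


# Stand-ins for connectOracle.df_orders and connectDB2.df
left = pd.DataFrame({"ID": [1, 2], "VER": [1, 1], "ORDER_NO": ["a", "b"]})
right = pd.DataFrame({"ID": [1, 2], "VER": [1, 2], "STATUS": ["ok", "old"]})

keys = parse_key_columns("+ID+VER")
merged = pd.merge(left, right, on=keys, how="left")
print(keys)
print(merged)
```

Passing the list keeps the key as data instead of a quoted string that only looks like Python source.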
I have been working with the alpha vantage python API for a while now, but I have only needed to pull daily and intraday timeseries data. I am trying to pull extended intraday data, but am not having any luck getting it to work. Trying to run the following code:
from alpha_vantage.timeseries import TimeSeries
apiKey = 'MY API KEY'
ts = TimeSeries(key = apiKey, output_format = 'pandas')
totalData, _ = ts.get_intraday_extended(symbol = 'NIO', interval = '15min', slice = 'year1month1')
print(totalData)
gives me the following error:
Traceback (most recent call last):
File "/home/pi/Desktop/test.py", line 9, in <module>
totalData, _ = ts.get_intraday_extended(symbol = 'NIO', interval = '15min', slice = 'year1month1')
File "/home/pi/.local/lib/python3.7/site-packages/alpha_vantage/alphavantage.py", line 219, in _format_wrapper
self, *args, **kwargs)
File "/home/pi/.local/lib/python3.7/site-packages/alpha_vantage/alphavantage.py", line 160, in _call_wrapper
return self._handle_api_call(url), data_key, meta_data_key
File "/home/pi/.local/lib/python3.7/site-packages/alpha_vantage/alphavantage.py", line 354, in _handle_api_call
json_response = response.json()
File "/usr/lib/python3/dist-packages/requests/models.py", line 889, in json
self.content.decode(encoding), **kwargs
File "/usr/lib/python3/dist-packages/simplejson/__init__.py", line 518, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3/dist-packages/simplejson/decoder.py", line 370, in decode
obj, end = self.raw_decode(s)
File "/usr/lib/python3/dist-packages/simplejson/decoder.py", line 400, in raw_decode
return self.scan_once(s, idx=_w(s, idx).end())
simplejson.errors.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
What is interesting is that if you look at the TimeSeries class, it states that extended intraday is returned as a "time series in one csv_reader object" whereas everything else, which works for me, is returned as "two json objects". I am 99% sure this has something to do with the issue, but I'm not entirely sure because I would think that calling intraday extended function would at least return SOMETHING (despite it being in a different format), but instead just gives me an error.
Another interesting little note is that the function refuses to take "adjusted = True" (or False) as an input despite it being in the documentation... likely unrelated, but maybe it might help diagnose.
Seems like TIME_SERIES_INTRADAY_EXTENDED can return only CSV format, but the alpha_vantage wrapper applies JSON methods, which results in the error.
My workaround:
from alpha_vantage.timeseries import TimeSeries
import pandas as pd
apiKey = 'MY API KEY'
ts = TimeSeries(key = apiKey, output_format = 'csv')
#download the csv
totalData = ts.get_intraday_extended(symbol = 'NIO', interval = '15min', slice = 'year1month1')
#csv --> dataframe
df = pd.DataFrame(list(totalData[0]))
#setup of column and index
header_row=0
df.columns = df.iloc[header_row]
df = df.drop(header_row)
df.set_index('time', inplace=True)
#show output
print(df)
This is an easy way to do it.
import pandas as pd
ticker = 'IBM'
date= 'year1month2'
apiKey = 'MY API KEY'
df = pd.read_csv('https://www.alphavantage.co/query?function=TIME_SERIES_INTRADAY_EXTENDED&symbol='+ticker+'&interval=15min&slice='+date+'&apikey='+apiKey+'&datatype=csv&outputsize=full')
#Show output
print(df)
import pandas as pd
symbol = 'AAPL'
interval = '15min'
slice = 'year1month1'
api_key = ''
# no trailing '&' here, otherwise the URL ends up containing '&&apikey='
adjusted = '&adjusted=true'
csv_url = 'https://www.alphavantage.co/query?function=TIME_SERIES_INTRADAY_EXTENDED&symbol='+symbol+'&interval='+interval+'&slice='+slice+adjusted+'&apikey='+api_key
data = pd.read_csv(csv_url)
# head is a method; without the parentheses this prints the bound method, not the rows
print(data.head())
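The snippets above assemble the query string by concatenation, which makes stray or missing '&' characters easy to introduce. As a sketch, the same URL can be built with urllib.parse.urlencode from the standard library. The function name build_extended_intraday_url and its parameter names are my own; the query parameters themselves are the ones already used in the URLs above.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.alphavantage.co/query"


def build_extended_intraday_url(symbol, interval, month_slice, api_key, adjusted=True):
    """Return the CSV endpoint URL with every parameter properly encoded."""
    params = {
        "function": "TIME_SERIES_INTRADAY_EXTENDED",
        "symbol": symbol,
        "interval": interval,
        "slice": month_slice,
        "adjusted": "true" if adjusted else "false",
        "apikey": api_key,
    }
    return BASE_URL + "?" + urlencode(params)


csv_url = build_extended_intraday_url("AAPL", "15min", "year1month1", "MY API KEY")
print(csv_url)
```

The resulting csv_url can be handed to pd.read_csv exactly as in the answers above.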
I know there are loads of answers to this question but I'm still not getting it...
Following is sa_reporting.py
class saReport():
def __init__(self, body, to_table, normalise=False, date_col=None):
global directory
self.body = body
self.to_table = to_table
self.normalise = normalise
self.date_col = date_col if date_col is not None else []
directory = os.path.join('/Users','python', self.to_table)
if not os.path.exists(directory):
os.mkdir(directory)
def download_files(self, ...):
...
def download_reports(self, ...):
...
def get_files(self):
...
def read_file(self, file):
....
def load_to_db(self, sort_by=None): # THIS IS WHAT I THINK IS CAUSING THE ERROR
sort_by = sort_by if sort_by is not None else [] # THIS IS WHAT I TRIED TO FIX IT
def normalise_data(self, data):
dim_data = []
for row in data:
if row not in dim_data:
dim_data.append(row)
return dim_data
def convert_dates(self, data):
if self.date_col:
for row in data:
for index in self.date_col:
if len(row[index]) > 10:
row[index] = row[index][:-5].replace('T',' ')
row[index] = datetime.datetime.strptime(row[index], "%Y-%m-%d %H:%M:%S")
else:
row[index] = datetime.datetime.strptime(row[index], "%Y-%m-%d").date()
return data
print(f'\nWriting data to {self.to_table} table...', end='')
files = self.get_files()
for file in files:
print('Processing ' + file.split("sa360/",1)[1] + '...', end='')
csv_file = self.read_file(file)
csv_headers = ', '.join(csv_file[0])
csv_data = csv_file[1:]
if self.normalise:
csv_data = self.normalise_data(csv_data)
csv_data = self.convert_dates(csv_data)
if sort_by:
csv_data = sorted(csv_data, key=itemgetter(sort_by))
#...some other code that inserts into a database...
Executing the following script (sa_main.py):
import sa_reporting
from sa_body import *
dim_campaign_test = sa_reporting.saReport(
    body=dim_campaign_body,
    to_table='dimsa360CampaignTest',
    normalise=True,
    date_col=[4,5]
)
dim_campaign_test_download = dim_campaign_test.download_reports()
dim_campaign_test_download.load_to_db(sort_by=0) # THIS IS WHERE THE ERROR OCCURS
Output and error message:
Downloading reports...
The report is still generating...restarting
The report is ready
Processing...
Downloading fragment 0 for report AAAnOdc9I_GnxAB0
Files successfully downloaded
Traceback (most recent call last):
File "sa_main.py", line 43, in <module>
dim_campaign_test_download.load_to_db(sort_by=0)
AttributeError: 'NoneType' object has no attribute 'load_to_db'
Why am I getting this error? And how can I fix it?
I just want to make None be the default argument and if a user specifies the sort_by parameter then None will be replaced with whatever the user specifies (which should be an integer index)
The error means that dim_campaign_test_download is None. You set it to the result of dim_campaign_test.download_reports() in the line below, and a method that finishes without an explicit return statement returns None, so that is most likely what download_reports() is doing.
dim_campaign_test_download = dim_campaign_test.download_reports()
You might want to instead do the following, as dim_campaign_test is the saReport Object on which you probably want to operate:
dim_campaign_test.load_to_db(sort_by=0)
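To see the mechanism in isolation, here is a small sketch with a stand-in class (Report and its attributes are hypothetical, not the real saReport): a method without a return statement evaluates to None, so the follow-up call has to go on the original object.

```python
class Report:
    """Stand-in for saReport, just enough to reproduce the behaviour."""

    def __init__(self):
        self.downloaded = False
        self.sorted_by = None

    def download_reports(self):
        self.downloaded = True  # does its work, but returns nothing

    def load_to_db(self, sort_by=None):
        self.sorted_by = sort_by
        return self.sorted_by


report = Report()
result = report.download_reports()
print(result is None)

# result.load_to_db(sort_by=0) would raise AttributeError here,
# so call the method on the object itself:
loaded = report.load_to_db(sort_by=0)
print(loaded)
```

The sort_by=None default in load_to_db is not involved in the error at all; the call never reaches that method.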
I am trying to write channel-separator code that splits the transcript stored in a JSON file by channel.
I have the following code:
import json
import boto3
def lambda_handler(event, context):
if event:
s3 = boto3.client("s3")
s3_object = event["Records"][0]["s3"]
bucket_name = s3_object["bucket"]["name"]
file_name = s3_object["object"]["key"]
file_obj = s3.get_object(Bucket=bucket_name, Key=file_name)
transcript_result = json.loads(file_obj["Body"].read())
segmented_transcript = transcript_result["results"]["channel_labels"]
items = transcript_result["results"]["items"]
channel_text = []
flag = False
channel_json = {}
for no_of_channel in range (segmented_transcript["number_of_channels"]):
for word in items:
for cha in segmented_transcript["channels"]:
if cha["channel_label"] == "ch_"+str(no_of_channel):
end_time = cha["end_time"]
if "start_time" in word:
if cha["items"]:
for cha_item in cha["items"]:
if word["end_time"] == cha_item["end_time"] and word["start_time"] == cha_item["start_time"]:
channel_text.append(word["alternatives"][0]["content"])
flag = True
elif word["type"] == "punctuation":
if flag and channel_text:
temp = channel_text[-1]
temp += word["alternatives"][0]["content"]
channel_text[-1] = temp
flag = False
break
channel_json["ch_"+str(no_of_channel)] = ' '.join(channel_text)
channel_text = []
print(channel_json)
s3.put_object(Bucket="aws-speaker-separation", Key=file_name, Body=json.dumps(channel_json))
return{
'statusCode': 200,
'body': json.dumps('Channel transcript separated successfully!')
}
However, when I run it, I get an error on line 23 saying:
[ERROR] KeyError: 'end_time'
Traceback (most recent call last):
File "/var/task/lambda_function.py", line 23, in lambda_handler
end_time = cha["end_time"]
I am confused as to why this error is happening, because in my JSON file the fields to read are as follows:
[Image: JSON Code Parameters]
Any ideas why this error is appearing?
cha is a channel; end_time sits one level deeper, on each entry in the channel's items. To access the items of your channel, do:
for item in cha["items"]:
    print(item["end_time"])
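One caveat, as a sketch: the question's own code checks for "start_time" before using it and handles punctuation entries separately, which suggests that not every item carries timing fields. Guarding the lookup avoids a second KeyError. The channel dictionary below is a made-up sample shaped like the structure the question's code walks, and item_end_times is a hypothetical helper.

```python
# Hypothetical channel entry: two spoken words followed by punctuation
channel = {
    "channel_label": "ch_0",
    "items": [
        {"type": "pronunciation", "start_time": "0.04", "end_time": "0.38",
         "alternatives": [{"content": "Hello"}]},
        {"type": "pronunciation", "start_time": "0.38", "end_time": "0.91",
         "alternatives": [{"content": "there"}]},
        {"type": "punctuation", "alternatives": [{"content": "."}]},
    ],
}


def item_end_times(cha):
    """Collect end_time from every item that has one, skipping the rest."""
    return [item["end_time"] for item in cha.get("items", []) if "end_time" in item]


end_times = item_end_times(channel)
# The times are strings in this sample, so compare them numerically
last_end_time = max(end_times, key=float) if end_times else None
print(end_times, last_end_time)
```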
When I run this function:
def book_processing(pair, pool_length):
    p = Pool(len(pool_length)*3)
    temp_parameters = partial(book_call_mprocess, pair)
    p.map_async(temp_parameters, pool_length).get(999999)
    p.close()
    p.join()
    return exchange_books
I get the following error:
Traceback (most recent call last):
File "test_code.py", line 214, in <module>
current_books = book_call.book_processing(cp, book_list)
File "/home/user/Desktop/book_call.py", line 155, in book_processing
p.map_async(temp_parameters, pool_length).get(999999)
File "/usr/lib/python2.7/multiprocessing/pool.py", line 567, in get
raise self._value
zipfile.BadZipfile: Truncated file header
I feel as though there is some resource that is being used that didn't close during the last loop, but I am not sure how to close it (still learning about multiprocessing library). This error only occurs when my code repeats this section relatively quickly (within the same minute). This does not happen often, but is clear when it does.
Edit (adding the book_call code):
def book_call_mprocess(currency_pair, ex_list):
polo_error = 0
live_error = 0
kraken_error = 0
gdax_error = 0
ex_list = set([ex_list])
ex_Polo = 'Polo'
ex_Live = 'Live'
ex_GDAX = 'GDAX'
ex_Kraken = 'Kraken'
cp_polo = 'BTC_ETH'
cp_kraken = 'XETHXXBT'
cp_live = 'ETH/BTC'
cp_GDAX = 'ETH-BTC'
# Instances
polo_instance = poloapi.poloniex(polo_key, polo_secret)
fookraken = krakenapi.API(kraken_key, kraken_secret)
publicClient = GDAX.PublicClient()
flag = False
while not flag:
flag = False
err = False
# Polo Book
try:
if ex_Polo in ex_list:
polo_books = polo_instance.returnOrderBook(cp_polo)
exchange_books['Polo'] = polo_books
except:
err = True
polo_error = 1
# Livecoin
try:
if ex_Live in ex_list:
method = "/exchange/order_book"
live_books = OrderedDict([('currencyPair', cp_live)])
encoded_data = urllib.urlencode(live_books)
sign = hmac.new(live_secret, msg=encoded_data, digestmod=hashlib.sha256).hexdigest().upper()
headers = {"Api-key": live_key, "Sign": sign}
conn = httplib.HTTPSConnection(server)
conn.request("GET", method + '?' + encoded_data, '', headers)
response = conn.getresponse()
live_books = json.load(response)
conn.close()
exchange_books['Live'] = live_books
except:
err = True
live_error = 1
# Kraken
try:
if ex_Kraken in ex_list:
kraken_books = fookraken.query_public('Depth', {'pair': cp_kraken})
exchange_books['Kraken'] = kraken_books
except:
err = True
kraken_error = 1
# GDAX books
try:
if ex_GDAX in ex_list:
gdax_books = publicClient.getProductOrderBook(level=2, product=cp_GDAX)
exchange_books['GDAX'] = gdax_books
except:
err = True
gdax_error = 1
flag = True
if err:
flag = False
err = False
error_list = ['Polo', polo_error, 'Live', live_error, 'Kraken', kraken_error, 'GDAX', gdax_error]
print_to_excel('excel/error_handler.xlsx', 'Book Call Errors', error_list)
print "Holding..."
time.sleep(30)
return exchange_books
def print_to_excel(workbook, worksheet, data_list):
    ts = str(datetime.datetime.now()).split('.')[0]
    data_list = [ts] + data_list
    wb = load_workbook(workbook)
    if worksheet == 'active':
        ws = wb.active
    else:
        ws = wb[worksheet]
    ws.append(data_list)
    wb.save(workbook)
The problem lies in the function print_to_excel, and more specifically here:
wb = load_workbook(workbook)
If two processes are running this function at the same time, you'll run into the following race condition:
1. Process 1 wants to open error_handler.xlsx; since it doesn't exist, it creates an empty file.
2. Process 2 wants to open error_handler.xlsx; it does exist, so it tries to read it, but it is still empty. Since the xlsx format is just a zip file consisting of a bunch of XML files, the process expects a valid ZIP header, doesn't find one, and raises zipfile.BadZipfile: Truncated file header.
What looks strange, though, is your error message: I would have expected to see print_to_excel and load_workbook in the call stack.
Anyway, since you confirmed that the problem really is in the XLSX handling, you can either generate a new filename via tempfile for every process, or use locking to ensure that only one process runs print_to_excel at a time.
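A rough sketch of the first option, written in Python 3 syntax rather than the 2.7 of the question, and with a plain CSV file standing in for the workbook so that it runs without openpyxl. The function name write_error_row and the file prefix are made up for illustration. tempfile.mkstemp creates a file whose name is unique, so two processes never read or write the same path.

```python
import csv
import datetime
import os
import tempfile


def write_error_row(directory, data_list):
    """Write one timestamped row to a file that no other call shares."""
    ts = str(datetime.datetime.now()).split('.')[0]
    # mkstemp creates the file atomically and returns an open descriptor
    fd, path = tempfile.mkstemp(
        prefix="book_call_errors_%d_" % os.getpid(),
        suffix=".csv",
        dir=directory,
    )
    with os.fdopen(fd, "w", newline="") as handle:
        csv.writer(handle).writerow([ts] + list(data_list))
    return path


log_dir = tempfile.mkdtemp()
first = write_error_row(log_dir, ['Polo', 1, 'Live', 0])
second = write_error_row(log_dir, ['Polo', 0, 'Live', 1])
print(first != second)
```

The per-call files can be combined into a single report afterwards by one process, which keeps the workers from ever touching a shared file.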