Error in Prefect flow when passing large dataframe as argument

I am trying to orchestrate an ML pipeline with Prefect. The entire pipeline works fine when I pass a small enough batch of raw data, but once I pass a larger batch, some of the flows suddenly cannot be created.
The first flows in the pipeline, which perform some data cleaning and feature generation, still work fine. But once I have a cleaned dataframe of about 10,000 rows and one column of text data and try to process it in a flow that makes predictions with a pretrained model, I get one of the following errors:
1.
Exception has occurred: ReadTimeout
The operation did not complete (read) (_ssl.c:2633)
ssl.SSLWantReadError: The operation did not complete (read) (_ssl.c:2633)
During handling of the above exception, another exception occurred:
asyncio.exceptions.CancelledError:
During handling of the above exception, another exception occurred:
TimeoutError:
During handling of the above exception, another exception occurred:
httpcore.ReadTimeout:
The above exception was the direct cause of the following exception:
  File "C:\Users\EmilEne\OneDrive - Brand Delta\Documents\GitHub\airflow_ds_test\ds_nomad\model_age\age_prediction.py", line 108, in <module>
    test = new_predictions(df, m['model'], column='cleaned_message', language='italian', market='italy')
httpx.ReadTimeout:
Or 2.
Exception has occurred: LocalProtocolError
1
h2.exceptions.StreamClosedError: 1
During handling of the above exception, another exception occurred:
httpcore.LocalProtocolError: 1
The above exception was the direct cause of the following exception:
  File "C:\Users\EmilEne\OneDrive - Brand Delta\Documents\GitHub\airflow_ds_test\ds_nomad\model_age\age_prediction.py", line 105, in <module>
    test = new_predictions(df, m['model'], column='cleaned_message', language='italian', market='italy')
httpx.LocalProtocolError: 1
I tried using the debugger in VS Code to see where the flow's function breaks, but execution never enters the function; it seems the flow is never even created. I tried replacing the whole body of the flow function with just a print statement, like this:
from prefect import task, flow
import pandas as pd
@flow()
def predictions(df):
    print(df.info())
df_test = pd.read_parquet(".../path")
test = predictions(df_test)
But it still doesn't create the flow and gives me the same error. Here df_test is also a single column, with 7,000 rows of cleaned text data.
I really don't understand the errors; I have no knowledge of SSL or h2. Can anybody point me in the right direction?

df_test = pd.read_parquet(".../path") isn't a valid relative file path.
If you want to go up two levels you would do df_test = pd.read_parquet("../../path")
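To see how the two strings differ, here is a small sketch using only the standard library. It assumes POSIX-style paths, and the base directory /home/user/project/src is made up purely for illustration (it does not come from the question). The helper name resolve is likewise hypothetical. Each ".." climbs one directory level, while "..." is treated as an ordinary directory name.

```python
import posixpath


def resolve(base_dir, relative):
    """Join a relative path onto base_dir and collapse any '..' segments."""
    return posixpath.normpath(posixpath.join(base_dir, relative))


# Hypothetical working directory, used only for illustration.
base = "/home/user/project/src"

# Each ".." moves up one directory level.
two_levels_up = resolve(base, "../../path")

# "..." is not special: it is looked up as a directory literally named "...".
three_dots = resolve(base, ".../path")

print(two_levels_up)
print(three_dots)
```

Printing both results side by side makes it easy to check which location a relative path actually points to before handing it to pd.read_parquet.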

Related

Smartsheet API in Python: how to access response data from client

I am trying to understand / plan for the rate limits, particularly for the get_cell_history method in the python api.
When I run the code below, I see the following response printed to my console:
{"response": {"statusCode": 429, "reason": "Too Many Requests", "content": {"errorCode": 4003, "message": "Rate limit exceeded.", "refId": "SOME_REF_ID"}}}
But I am trying to make my if statement read the response and check for the 429 status code:
import smartsheet
smartsheet_client = smartsheet.Smartsheet('My_API_KEY')
smartsheet_client.errors_as_exceptions(True)
sheet_id = 'My_Sheet_ID'
edit_sheet = smartsheet_client.Sheets.get_sheet(sheet_id)  # type: smartsheet.smartsheet.models.Sheet
for row in edit_sheet.rows:  # type: smartsheet.smartsheet.models.Row
    for cell in row.cells:  # type: smartsheet.smartsheet.models.Cell
        cell_history = smartsheet_client.Cells.get_cell_history(sheet_id, row.id, cell.column_id,
                                                                include_all=True)
        if cell_history.request_response.status_code == 429:
            print(f'Rate limit exceeded.')
        elif cell_history.request_response.status_code == 200:
            print(f'Found Cell History: {len(cell_history.data)} edits')
Where can I access this response? Why does my program run without printing out the "Rate limit exceeded" string?
I am using Python 3.8 and the package smartsheet-python-sdk==2.105.1

The following code shows how to successfully catch an API error when using the Smartsheet Python SDK. For simplicity, I've just used the Get Sheet operation and handled the error with status code 404 in this example, but you should be able to implement the same approach in your code. Note that you need to include the smartsheet_client.errors_as_exceptions() line (as shown in this example) so that Smartsheet errors get raised as exceptions.
# raise Smartsheet errors as exceptions
smartsheet_client.errors_as_exceptions()
try:
    # call the Get Sheet operation
    # if the request is successful, my_sheet contains the Sheet object
    my_sheet = smartsheet_client.Sheets.get_sheet(4183817838716804)
except smartsheet.exceptions.SmartsheetException as e:
    # handle the API error
    if isinstance(e, smartsheet.exceptions.ApiError):
        if (e.error.result.status_code == 404):
            print(f'Sheet not found.')
        else:
            print(f'Another type of error occurred: ' + str(e.error.result.status_code))
Finally, here's a bit of additional info about error handling with the Smartsheet Python SDK that may be helpful (from this other SO thread).
SmartsheetException is the base class for all of the exceptions raised by the SDK. The two most common types of SmartsheetException are ApiError and HttpError. After trapping the exception, first determine the exception type using an isinstance test.
The ApiError exception has an Error class object accessible through the error property, and then that in turn points to an ErrorResult class accessible through the result property.
The details of the API error are stored in that ErrorResult.
Note that to make the Python SDK raise exceptions for API errors, you must call the errors_as_exceptions() method on the client object.
---------------------
UPDATE (retry logic):
To repro the behavior you've described in your first comment below, I used (most of) the code from your original post above, and added some print statements to it. My code is as follows:
# raise Smartsheet errors as exceptions
smartsheet_client.errors_as_exceptions()
sheet_id = 5551639177258884
edit_sheet = smartsheet_client.Sheets.get_sheet(sheet_id)  # type: smartsheet.smartsheet.models.Sheet
# initialize ctr
row_ctr = 0
for row in edit_sheet.rows:  # type: smartsheet.smartsheet.models.Row
    row_ctr += 1
    print('----ROW # ' + str(row_ctr) + '----')
    cell_ctr = 0
    for cell in row.cells:  # type: smartsheet.smartsheet.models.Cell
        cell_ctr += 1
        try:
            cell_history = smartsheet_client.Cells.get_cell_history(sheet_id, row.id, cell.column_id, include_all=True)
            print(f'FOUND CELL HISTORY: {len(cell_history.data)} edits [cell #' + str(cell_ctr) + ' ... rowID|columnID= ' + str(row.id) + '|' + str(cell.column_id) + ']')
        except smartsheet.exceptions.SmartsheetException as e:
            # handle the exception
            if isinstance(e, smartsheet.exceptions.ApiError):
                if (e.error.result.status_code == 429):
                    print(f'RATE LIMIT EXCEEDED! [cell #' + str(cell_ctr) + ']')
Interestingly, it seems that the Smartsheet Python SDK has built-in retry logic (with exponential backoff). The exception handling portion of my code isn't ever reached -- because the SDK is recognizing the Rate Limiting error and automatically retrying (with exponential backoff) any requests that fail with that error, instead of raising an exception for my code to catch.
I can see this happening in my print output. For example, here's the cell history output from Row #2 in my sheet (each row has 8 columns / cells) -- where all Get Cell History calls were processed without any exceptions being thrown.
Things continue without error until the 7th row is being processed, when a Rate Limiting error occurs when it's trying to get cell history for the fifth column/cell in that row. Instead of this being a fatal error though -- I see processing pause for a few seconds (i.e., no additional output is printed to my console for a few seconds)...and then processing resumes. Another Rate Limiting error occurs, followed by a slightly longer pause in processing, after which processing resumes and cell history is successfully retrieved for the fifth through eighth cells in that row.
As my program continues to run, I see this same thing happen several times -- i.e., several API calls are successfully issued, then a Rate Limiting error occurs, processing pauses momentarily, and then processing resumes with API calls successfully issued, until another Rate Limiting error occurs, etc. etc. etc.
To confirm that the SDK is automatically retrying requests that fail with the Rate Limiting error, I dug into the code a bit. Sure enough, I see that the request_with_retry function within /smartsheet/smartsheet.py does indeed include logic to retry failing requests with exponential backoff (if should_retry is true).
This raises the question: how does should_retry get set (i.e., in response to which error codes will the SDK automatically retry a failing request)? Once again, the answer is found within /smartsheet/smartsheet.py (SDK source code). This suggests that you don't need exception handling logic in your code for the Rate Limiting error (or for any of the other 3 errors where should_retry is set to True), as the SDK is built to automatically retry any requests that fail with that error.
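For readers unfamiliar with the pattern, retrying with exponential backoff can be sketched generically as follows. This is not the SDK's implementation: RateLimitError, call_with_backoff, flaky_call and all parameter names are hypothetical stand-ins, and the sleep function is injected so the example runs without real pauses.

```python
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for a 'rate limit exceeded' (HTTP 429) failure."""


def call_with_backoff(func, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call func(); on RateLimitError wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_retries):
        try:
            return func()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries: let the caller see the error
            sleep(base_delay * 2 ** attempt)


# Usage: a fake API call that is rate limited twice before succeeding.
attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] <= 2:
        raise RateLimitError("Rate limit exceeded.")
    return "cell history"


delays = []  # record the requested pauses instead of actually sleeping
result = call_with_backoff(flaky_call, sleep=delays.append)
print(result, delays)
```

The growing delays mirror the behavior described above, where each pause after a Rate Limiting error is slightly longer than the previous one and the calling code never sees an exception unless the retries run out.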

Unable to pass custom values to AWS Lambda function

I am trying to pass custom input to my lambda function (Python 3.7 runtime) in JSON format from the rule set in CloudWatch.
However I am facing difficulty accessing elements from the input correctly.
Here's what the CW rule looks like.
Here is what the lambda function is doing.
from datetime import date
import pandas as pd
import sqlalchemy  # Package for accessing SQL databases via Python
import psycopg2
def lambda_handler(event, context):
    today = date.today()
    engine = sqlalchemy.create_engine("postgresql://some_user:userpassword@som_host/some_db")
    con = engine.connect()
    dest_table = "dest_table"
    print(event)
    s = {'upload_date': today, 'data': 'Daily Ingestion Data', 'status': event["data"]}  # Error points here
    ingestion = pd.DataFrame(data=[s])
    ingestion.to_sql(dest_table, con, schema="some_schema", if_exists="append", index=False, method="multi")
When I test the function with the default test event values, the print(event) statement prints the default test values ("key1":"value1"), but the ingestion.to_sql() call still works, i.e. the payload from the input, "Testing Input Data", is inserted into the database successfully.
However, the Lambda function still reports a KeyError at event["data"] when it runs.
1) Am I accessing the Constant JSON input the right way?
2) If not, why is the data still being ingested as intended despite the error at that line of code?
3) The data is ingested when the function is triggered by the schedule expression, but when I test the event it shows a KeyError. Is the error caused by the test event not matching the actual input?
There is a lot of documentation and many articles on how to pass input, but I could not find anything that shows how to access the input inside the function. I have been stuck at this point for a while, and it frustrates me that this does not seem to be documented anywhere.
I would really appreciate it if someone could give me some clarity on this process.
Thanks in advance
Edit:
Image of the monitoring Logs:
[ERROR] KeyError: 'data' Traceback (most recent call last): File "/var/task/test.py"

I am writing this answer based on the comments.
The syntax that was originally written is valid and I am able to access the data correctly. There was a need to implement a timeout as the function was constantly hitting that threshold followed by some change in the iteration.
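The KeyError in the log is consistent with the handler being invoked with an event that has no "data" key, such as the default test event printed in the question. Below is a minimal sketch of the difference. It assumes the rule's constant JSON input is {"data": "Testing Input Data"} (inferred from the question, since the rule screenshot is not available), and get_status is a hypothetical helper, not part of any AWS library.

```python
def get_status(event, default=None):
    """Return event['data'] if the key is present, otherwise the given default."""
    return event.get("data", default)


# Event as it would arrive from a rule with the assumed constant JSON input.
scheduled_event = {"data": "Testing Input Data"}

# Default console test event, abbreviated to the key shown in the question.
default_test_event = {"key1": "value1"}

print(get_status(scheduled_event))
print(get_status(default_test_event, default="missing"))

# Plain indexing, as in the original handler, fails on the test event.
try:
    default_test_event["data"]
    indexing_failed = False
except KeyError:
    indexing_failed = True
```

Configuring the console test event with the same JSON as the rule input avoids the mismatch without changing the handler.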

How to handle the pandas gbq related exception? [duplicate]

I'm trying to use try/except to query BigQuery tables, sometimes the query may not be correct in which case pandas raises a GenericGBQException error.
My problem is I get name 'GenericGBQException' is not defined when trying to handle this error, example code below:
try:
    df = pd.read_gbq(query, projID)
    query_fail = 0
except GenericGBQException:
    query_fail = 1
if query_fail == 1:
    pass  # do some stuff
I can get by with catching all exceptions though obviously it's not ideal.

I suspect you want to catch pd.GenericGBQException. (Or perhaps gbq.GenericGBQException -- it depends on your imports. Are you importing the module that defines the exception you're trying to catch?)
Also, consider catching PandasError, the base class of all exceptions from the package: https://github.com/pydata/pandas/blob/master/pandas/io/gbq.py#L85
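The "name is not defined" message comes from Python itself rather than from BigQuery: the name in an except clause is only looked up when an exception actually reaches it, so an exception class that was never imported surfaces as a NameError at that moment. The sketch below shows the same effect with the standard library's json module as a stand-in, since the exact module that exposes GenericGBQException depends on the installed versions. Both function names are hypothetical.

```python
import json


def parse_unqualified(text):
    try:
        return json.loads(text)
    except JSONDecodeError:  # never imported under this bare name
        return None


def parse_qualified(text):
    try:
        return json.loads(text)
    except json.JSONDecodeError:  # reachable through the imported module
        return None


# Malformed JSON, so json.loads raises and the except clause is evaluated.
bad_input = "{oops"

qualified_result = parse_qualified(bad_input)

try:
    parse_unqualified(bad_input)
    name_error_raised = False
except NameError:
    name_error_raised = True

print(qualified_result, name_error_raised)
```

The same reasoning applies to GenericGBQException: either import the class explicitly or refer to it through a module that is already imported.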

ignore_invalid_triggers not working

I am using the pytransitions library (documented here) to implement a Finite State Machine. One of the features outlined is the ability to ignore invalid triggers. Here is the example as per the documentation:
# Globally suppress invalid trigger exceptions
m = Machine(lump, states, initial='solid', ignore_invalid_triggers=True)
If this option is set to True, no error should be thrown for triggers that are invalid.
Here is a sample of the code I am trying to construct:
from transitions import Machine
states = ['changes ongoing', 'changes complete', 'changes pushed', 'code reviewed', 'merged']
triggers = ['git commit', 'git push', 'got plus2', 'merged']
# Initialize the state machine
git_user = Machine(states=states, initial=states[0], ignore_invalid_triggers=True, ordered_transitions=True)
# Create the FSM using the data provided
for i in range(len(triggers)):
    git_user.add_transition(trigger=triggers[i], source=states[i], dest=states[i+1])
print(git_user.state)
git_user.trigger('git commit')
print(git_user.state)
git_user.trigger('invalid')  # This line will throw an AttributeError
The produced error:
changes ongoing
changes complete
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/transitions/core.py", line 58, in _get_trigger
    raise AttributeError("Model has no trigger named '%s'" % trigger_name)
AttributeError: Model has no trigger named 'invalid'
Process finished with exit code 1
I am unsure of why an error is being thrown when ignore_invalid_triggers=True.
There is limited information on this library besides the documentation on the official github page. If anyone has any insight on this I would appreciate the help.
Thanks in advance.

To be an invalid trigger under the rules set out in the documentation, the trigger name has to be valid somewhere in the model. For instance, try trigger "merged" from state "changes ongoing". You get an attribute error because "invalid" is not a trigger at all: you have a list of four, and that's not one of them.
To see the effect of establishing "invalid" as a trigger, add an end-to-start transition (the last line below) after your nice linear loop:
# Create the FSM using the data provided
for i in range(len(triggers)):
    git_user.add_transition(trigger=triggers[i], source=states[i], dest=states[i+1])
git_user.add_transition(trigger="invalid", source=states[-1], dest=states[0])
Now your code should run as expected, ignoring that invalid transition.
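The distinction between an unknown trigger name and a known trigger that is not allowed from the current state can be illustrated with a toy machine in plain Python. This is only a sketch of the rule described above, not how the transitions library is implemented: ToyMachine and its internals are hypothetical, and RuntimeError is an arbitrary choice for the non-ignored case.

```python
class ToyMachine:
    """Toy model of the rule; not the transitions library's implementation."""

    def __init__(self, initial, ignore_invalid_triggers=False):
        self.state = initial
        self.ignore_invalid_triggers = ignore_invalid_triggers
        self._transitions = {}  # trigger name -> {source state: destination state}

    def add_transition(self, trigger, source, dest):
        self._transitions.setdefault(trigger, {})[source] = dest

    def trigger(self, name):
        if name not in self._transitions:
            # Unknown name: it was never registered as a trigger at all.
            raise AttributeError("Model has no trigger named '%s'" % name)
        routes = self._transitions[name]
        if self.state not in routes:
            # Known trigger, but not allowed from the current state.
            if self.ignore_invalid_triggers:
                return False
            raise RuntimeError("Can't trigger '%s' from state '%s'" % (name, self.state))
        self.state = routes[self.state]
        return True


states = ['changes ongoing', 'changes complete', 'changes pushed', 'code reviewed', 'merged']
triggers = ['git commit', 'git push', 'got plus2', 'merged']

machine = ToyMachine(states[0], ignore_invalid_triggers=True)
for i, name in enumerate(triggers):
    machine.add_transition(name, states[i], states[i + 1])

# 'merged' is a known trigger but invalid from the first state: ignored.
ignored = machine.trigger('merged')

# 'invalid' was never registered, so the flag does not apply to it.
try:
    machine.trigger('invalid')
    unknown_raised = False
except AttributeError:
    unknown_raised = True

print(ignored, machine.state, unknown_raised)
```

In this toy version the ignore flag is only consulted after the name lookup succeeds, which is why registering "invalid" as a trigger changes the outcome.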
