Using the default session in Placebo to unit test Boto3 code

I have legacy Boto3 code that makes a lot of use of the default Boto3 session, e.g.
import boto3
client = boto3.client('ec2')
client.describe_images(DryRun=False)
...
I wish to write unit tests for this legacy code using placebo.
However, docs there seem to imply that the code-under-test would need to always manage the Boto3 session explicitly, i.e.
import boto3
import placebo
session = boto3.Session()
pill = placebo.attach(session, data_path='/path/to/response/directory')
pill.record()
client = session.client('ec2')
client.describe_images(DryRun=False)
...
My reading of the code is that this is quite a limitation of the Placebo mock framework, although I am no expert Python programmer.
Am I misunderstanding something basic? Is there any way to work around this, or would I have to refactor all my legacy code to explicitly pass around a session?

placebo needs a Session object and the examples all show creating an explicit Session object but I think you could just reference the "built-in" Session object.
import boto3
import placebo
pill = placebo.attach(boto3.session, data_path='/path/to/response/directory')

I figured it out by reading through the Boto3 unit tests (ref).
To attach Placebo to the default session, it is necessary to set up the default session explicitly before calling Placebo:
import boto3
import placebo
boto3.setup_default_session()
session = boto3.DEFAULT_SESSION
pill = placebo.attach(session, data_path='/path/to/response/directory')
pill.record()
client = boto3.client('ec2')
client.describe_images(DryRun=False)
Now, just by adding those four lines, I can record Boto3 calls in my legacy code, without further refactoring.
I will raise a pull request to add these notes in the Placebo README.
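For readers wondering why the explicit setup matters, here is a minimal sketch of the lazy default-session pattern. It is a simplified stand-in written for illustration, not boto3's real code: the Session class and its created list are hypothetical, and only the names DEFAULT_SESSION, setup_default_session and client mirror the ones used above.

```python
"""Simplified stand-in for a lazily created default session (not boto3 itself)."""

DEFAULT_SESSION = None  # unset until someone asks for it


class Session:
    """Hypothetical session that records which clients it was asked to create."""

    def __init__(self):
        self.created = []

    def client(self, service_name):
        self.created.append(service_name)
        return f"<{service_name} client>"


def setup_default_session():
    """Create the module-level default session explicitly."""
    global DEFAULT_SESSION
    DEFAULT_SESSION = Session()


def client(service_name):
    """Module-level helper: builds clients from the default session."""
    if DEFAULT_SESSION is None:
        setup_default_session()
    return DEFAULT_SESSION.client(service_name)


# Before setup there is no session object for a recorder to attach to.
assert DEFAULT_SESSION is None

setup_default_session()
session = DEFAULT_SESSION  # the object a tool such as Placebo would wrap
client("ec2")              # "legacy" call that never mentions a session
print(session.created)
```

Because the module-level helper routes through the same object, anything attached to that object also sees calls made by code that never handles a session itself.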

Related

How can I use SSPI to negotiate requests handled by external libraries?

I'll set expectations with the fact that I've been pushed well outside my area of expertise here. I'm behind a corporate firewall, and it's interfering with a lot of external code I use.
For example, I'm trying to use HuggingFace's from_pretrained method. Behind the scenes, this eventually makes a request similar to this:
import requests
requests.get('https://huggingface.co/distilbert-base-uncased-distilled-squad/resolve/main/tokenizer_config.json')
This request fails with an error from my proxy telling me my credentials are missing, but it can be fixed with the following, using the requests_negotiate_sspi library:
import requests
from requests_negotiate_sspi import HttpNegotiateAuth
s = requests.Session()
s.auth = HttpNegotiateAuth()
s.get('https://huggingface.co/distilbert-base-uncased-distilled-squad/resolve/main/tokenizer_config.json', verify='/path/to/cert.pem')
Unfortunately though, that request is made behind the scenes in the HuggingFace library. I can set env variables to save the path to the cert, but I can't use the SSPI negotiation unless I control that code directly (so far as I can tell). Is there any way around this problem?

list_schemas() method missing on Boto3 Glue client object

So, I think I'm running up against an issue with out-of-date documentation. According to the documentation here I should be able to use list_schemas() to get a list of schemas defined in the Hive Data Catalog: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html#Glue.Client.list_schemas
However, this method doesn't seem to exist:
import boto3
glue = boto3.client('glue')
glue.list_schemas()
AttributeError: 'Glue' object has no attribute 'list_schemas'
Other methods (e.g. list_crawlers()) still appear to be present and work just fine. Has this method been moved? Do I need to install some additional boto3 libraries for this to work?

Based on the comments:
The issue was caused by using an old version of boto3. Upgrading to a newer version solved the issue.
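A quick way to check which versions are installed before comparing against the online documentation is sketched below. It uses only the standard library; the helper name installed_version is made up for this example.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Print the versions of the two packages that define the client methods.
for name in ("boto3", "botocore"):
    print(name, installed_version(name))
```

If the printed versions are much older than the documentation being read, a missing client method is likely explained by the version gap.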

You should make a session first, and use the client method of the session, then it should work:
import boto3
session = boto3.session.Session()
glue_client = session.client('glue')
schemas_name = glue_client.list_schemas()

Tracing a Python gRPC server deployed on Cloud Run with OpenTelemetry

I'm running a Python gRPC server on Cloud Run and attempting to add instrumentation to capture trace information. I have a basic setup currently, however I'm having trouble making use of propagation as shown in the OpenTelemetry docs.
Inbound requests have the x-cloud-trace-context header, and I can log the header value in the gRPC method I've been working with, however the traces created by the OpenTelemetry library always have a different ID than the trace ID from the request header.
This is the simple tracing.py module I've created to provide configuration and access to the current Tracer instance:
"""Utility functions for tracing."""
import opentelemetry.exporter.cloud_trace as cloud_trace
import opentelemetry.propagate as propagate
import opentelemetry.propagators.cloud_trace_propagator as cloud_trace_propagator
import opentelemetry.trace as trace
from opentelemetry.sdk import trace as sdk_trace
from opentelemetry.sdk.trace import export
import app_instance
def get_tracer() -> trace.Tracer:
    """Function that provides an object for tracing.
    Returns:
        trace.Tracer instance.
    """
    return trace.get_tracer(__name__)
def configure_tracing() -> None:
    trace.set_tracer_provider(sdk_trace.TracerProvider())
    if app_instance.IS_LOCAL:
        print("Configuring local tracing.")
        span_exporter: export.SpanExporter = export.ConsoleSpanExporter()
    else:
        print(f"Configuring cloud tracing in environment {app_instance.ENVIRONMENT}.")
        span_exporter = cloud_trace.CloudTraceSpanExporter()
        propagate.set_global_textmap(cloud_trace_propagator.CloudTraceFormatPropagator())
    trace.get_tracer_provider().add_span_processor(export.SimpleSpanProcessor(span_exporter))
This configure_tracing function is called by the entrypoint script run on container start, so it executes before any requests are handled. When running in Google Cloud, the CloudTraceFormatPropagator should be what's required to ensure trace propagation, however it doesn't seem to be working for me.
This is the simple gRPC method I've been implementing with:
import grpc
from opentelemetry import trace
import stripe
from common import cloud_logging, datastore_utils, proto_helpers, tracing
from services.payment_service import payment_service_pb2
from third_party import stripe_client
def GetStripeInvoice(
    self, request: payment_service_pb2.GetStripeInvoiceRequest, context: grpc.ServicerContext
) -> payment_service_pb2.StripeInvoiceResponse:
    tracer: trace.Tracer = tracing.get_tracer()
    with tracer.start_as_current_span('GetStripeInvoice'):
        print(f"trace ID from header: {dict(context.invocation_metadata()).get('x-cloud-trace-context')}")
        cloud_logging.info(f"Getting Stripe invoice.")
        order = datastore_utils.get_pb_with_pb_key(request.order)
        try:
            invoice: stripe.Invoice = stripe_client.get_invoice(
                invoice_id=order.stripe_invoice_id
            )
            cloud_logging.info(f"Retrieved Stripe invoice. Amount due: {invoice['amount_due']}")
        except stripe.error.StripeError as e:
            cloud_logging.error(
                f"Failed to retrieve invoice: {e}"
            )
            context.abort(code=grpc.StatusCode.INTERNAL, details=str(e))
        return payment_service_pb2.StripeInvoiceResponse(
            invoice=proto_helpers.create_struct(invoice)
        )
I've even gone as far as adding the x-cloud-trace-context header to local client requests, to no avail - the included value isn't used when starting traces.
I'm not sure what I'm missing here - I can see traces in the Cloud Trace dashboard so I believe the basic instrumentation is correct, however there's obviously something going on with the configuration/usage of the CloudTraceFormatPropagator.

It turns out that my configuration wasn't correct - or, I should say, it wasn't complete. I'd followed this basic example from the docs for the Google Cloud OpenTelemetry library, but I didn't realize that manually instrumenting wasn't needed.
I removed the call to tracer.start_as_current_span in my gRPC method, installed the gRPC instrumentation package (opentelemetry-instrumentation-grpc), and added it to the tracing configuration step during startup of my gRPC server, which now looks something like this:
from concurrent import futures
import grpc
from opentelemetry.instrumentation import grpc as grpc_instrumentation
from common import tracing  # from my original question
def main():
    """Starts up GRPC server."""
    # Set up tracing
    tracing.configure_tracing()
    grpc_instrumentation.GrpcInstrumentorServer().instrument()
    # Set up the gRPC server
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=100))
    # set up services & start
This approach has solved the issue described in my question - my log messages are now threaded in the expected manner.
As someone new to telemetry & instrumentation, I didn't realize that I'd need to take an extra step since I'm tracing gRPC requests, but it makes sense now.
I ended up finding some helpful examples in a different set of docs - I'm not sure why these are separate from the docs linked earlier in this answer.
EDIT: Ah, I believe the gRPC instrumentation, and thus the related docs, are part of a separate but related project wherein contributors can add packages that instrument libraries of interest (i.e. gRPC, redis, etc). It'd be helpful if it was unified, which is the topic of this issue in the main OpenTelemetry Python repo.

While reviewing Google Documentation of OpenTelemetry using Python, I found some configurations that could help with the issue of tracing the correct ID. Additionally, there is a troubleshooting document to view traces in your Google Cloud Project when you expect trace data to be present.
Python-OpenTelemetry - https://cloud.google.com/trace/docs/setup/python-ot
Google Cloud Trace Troubleshooting - https://cloud.google.com/trace/docs/troubleshooting
For secure channels, you need to pass in channel_type='secure'. It is explained in the following link: https://github.com/open-telemetry/opentelemetry-python-contrib/issues/365
You need to use the x-cloud-trace-context header to ensure your traces use the same trace ID as the load balancer and AppServer on Google Cloud Run, and all link up in Google Trace.
The code below works to see your logs alongside traces in Google Trace's Trace List view:
from opentelemetry import trace
from opentelemetry.trace.span import get_hexadecimal_trace_id, get_hexadecimal_span_id
current_span = trace.get_current_span()
if current_span:
    trace_id = current_span.get_span_context().trace_id
    span_id = current_span.get_span_context().span_id
    if trace_id and span_id:
        logging_fields['logging.googleapis.com/trace'] = f"projects/{self.gce_project}/traces/{get_hexadecimal_trace_id(trace_id)}"
        logging_fields['logging.googleapis.com/spanId'] = f"{get_hexadecimal_span_id(span_id)}"
        logging_fields['logging.googleapis.com/trace_sampled'] = True
The documentation and code above were tested using Flask Framework.
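To check whether a span really carries the trace ID from the incoming request, the header value can be compared with the span's integer trace ID. The sketch below uses only the standard library and assumes the documented header layout TRACE_ID/SPAN_ID;o=OPTIONS, where TRACE_ID is 32 hex characters and SPAN_ID is a decimal number. The function names and the header value are made up for the example.

```python
def parse_cloud_trace_header(value):
    """Split 'TRACE_ID/SPAN_ID;o=OPTIONS' into (trace_id_hex, span_id, sampled)."""
    main, _, options = value.partition(";")
    trace_part, _, span_part = main.partition("/")
    span_id = int(span_part) if span_part else None
    sampled = options.strip() == "o=1"
    return trace_part.lower(), span_id, sampled


def same_trace(header_value, otel_trace_id):
    """Compare the header's trace ID with an integer OpenTelemetry trace ID."""
    header_trace_id, _, _ = parse_cloud_trace_header(header_value)
    # OpenTelemetry stores trace IDs as 128-bit integers; render as 32 hex chars.
    return header_trace_id == format(otel_trace_id, "032x")


header = "105445aa7843bc8bf206b12000100000/1;o=1"  # made-up example value
print(parse_cloud_trace_header(header))
print(same_trace(header, int("105445aa7843bc8bf206b12000100000", 16)))
```

In a gRPC method, the header value could come from context.invocation_metadata() as in the question, and the integer from trace.get_current_span().get_span_context().trace_id.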

SQLAlchemy used in Flask, Session management implementation

I cannot use Flask-SQLAlchemy, because of how my models are defined and because the database part of the app is also used in contexts other than Flask. I found several ways to manage sessions and I am not sure which one to use.
One thing that everyone seems to agree on (including me) is that a new session should be created at the beginning of each request and be committed and closed when the request has been processed and the response is ready to be sent back to the client.
Currently, I implement session management this way:
I have a database initialization Python script which creates the engine (engine = create_engine(app.config["MYSQL_DATABASE_URI"])) and defines the session maker Session = sessionmaker(bind=engine, expire_on_commit=False).
In another file I defined two functions decorated with Flask's before_request and teardown_request application decorators.
@app.before_request
def create_db_session():
    g.db_session = Session()
@app.teardown_request
def close_db_session(exception):
    try:
        g.db_session.commit()
    except:
        g.db_session.rollback()
    finally:
        g.db_session.close()
I then use the g.db_session when I need to perform queries: g.db_session.query(models.User.user_id).filter_by(username=username)
Is this a correct way to manage sessions?
I also took a look at the scoped sessions proposed by SQLAlchemy, and this might be another way of doing things, but I am not sure how to change my system to use scoped sessions.
If I understood it well, I would not use the g variable, but I would instead always refer to the Session definition declared by Session = scoped_session(sessionmaker(bind=engine, expire_on_commit=False)) and I would not need to initialize a new session explicitly when a request arrives.
I could just perform my queries as usual with Session.query(models.User.user_id).filter_by(username=username) and I would just need to remove the session when the request ends:
@app.teardown_request
def close_db_session(exception):
    Session.commit()
    Session.remove()
I am a bit lost with this session management topic and I would need help to understand how to manage sessions. Is there a real difference between the two approaches above?

Your approach of managing the session via flask.g is completely acceptable from my point of view. Whatever we are trying to do with SQLAlchemy, we must remember the basic principles:
Always clean up after yourself. At web application runtime, if you spawn a lot of sessions without calling .close() on them, you will eventually exhaust the connections available at your DB instance. You are handling this by calling session.close() in the finally: clause.
Maintain session independence. It's not good if various application contexts (requests, threads, etc.) share the same session instance, because the outcome is not deterministic. You are doing this by ensuring that each request gets its own session.
The scoped_session can be considered just an alternative to flask.g - it ensures that within one thread, each call to the Session() constructor returns the same object - https://docs.sqlalchemy.org/en/13/orm/contextual.html#unitofwork-contextual
It's SQLAlchemy's batteries-included version of your session management code.
As long as you are using Flask, which is a synchronous framework, I don't think you will have any issues with this setup.
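The "same object within one thread" behaviour can be illustrated with a small thread-local registry. This is a simplified sketch of the idea only, not SQLAlchemy's implementation: ScopedRegistry and FakeSession are hypothetical names, and the real remove() also closes the session before discarding it.

```python
import threading


class ScopedRegistry:
    """Hands out one object per thread, created on first use."""

    def __init__(self, factory):
        self._factory = factory
        self._local = threading.local()

    def __call__(self):
        if not hasattr(self._local, "value"):
            self._local.value = self._factory()
        return self._local.value

    def remove(self):
        """Discard the current thread's object, if any."""
        if hasattr(self._local, "value"):
            del self._local.value


class FakeSession:
    """Hypothetical stand-in for a database session."""


Session = ScopedRegistry(FakeSession)

first = Session()
assert Session() is first      # same thread: same object every time

seen = []
worker = threading.Thread(target=lambda: seen.append(Session()))
worker.start()
worker.join()
assert seen[0] is not first    # another thread gets its own object

Session.remove()
assert Session() is not first  # after remove(): a fresh object
```

This is why, with a scoped session, the teardown handler only needs to call remove(): the next request handled by that thread starts with a new session.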

Difference between AWS boto3.session.Session() and boto3.Session()

I am trying to use the AWS Python library boto3 to create a session. I found out we can do that with either
session = boto3.Session(profile_name='profile1')
or
session2 = boto3.session.Session(profile_name='profile2')
I have checked the docs, and they say to use boto3.session.Session().
Why do both ways work? What is the conceptual difference between them?

It is just for convenience; they both refer to the same class. What is happening here is that the __init__.py for the Python boto3 package includes the following:
from boto3.session import Session
This just allows you to refer to the Session class in your Python code as boto3.Session rather than boto3.session.Session.
This article provides more information about this python idiom:
One common thing to do in your __init__.py is to import selected Classes, functions, etc into the package level so they can be conveniently imported from the package.
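The same idiom can be observed in the standard library, which makes it easy to verify without installing anything. The sketch below uses the json package, whose __init__.py imports JSONDecoder from its json.decoder submodule; the analogous check for boto3 would be boto3.Session is boto3.session.Session.

```python
import json
import json.decoder

# Both dotted paths are bound to the very same class object, because the
# package's __init__.py re-exports the name from the submodule.
print(json.JSONDecoder is json.decoder.JSONDecoder)  # prints True
```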
