Passing pandas dataframe to fastapi - python

I wish to create an API through which I can take a Pandas dataframe as input and store it in my DB.
I am able to do so with a CSV file. However, the problem is that my datatype information is lost (column datatypes like int, array, float and so on), which is important for what I am trying to do.
I have already read this: Passing a pandas dataframe to FastAPI for NLP ML
I cannot create a class like this:
class Data(BaseModel):
    # id: str
    project: str
    messages: str
The reason being I don't have any fixed schema. The dataframe could be of any shape with varying data types. I have created a dynamic query to create a table as per the incoming dataframe and to insert into it as well.
However, being new to FastAPI, I am not able to figure out if there is an efficient way of sending this changing (dynamic) dataframe and storing it via the queries that I have created.
If the information is not sufficient, I can try to provide more examples.
Is there a way I can send a pandas dataframe from my Jupyter notebook itself?
Any guidance on this would be greatly appreciated.
@router.post("/send-df")
async def push_df_funct(
    target_name: Optional[str] = Form(...),
    join_key: str = Form(...),
    local_csv_file: UploadFile = File(None),
    db: Session = Depends(pg.get_db)
):
    """
    API to upload dataframe to database
    """
    return upload_dataframe(db, target_name, local_csv_file, join_key)
def registration_cassandra(self, feature_registation_dict):
    '''
    Table creation in cassandra as per the given feature registration JSON
    Takes:
        1. feature_registration_dict: Feature registration JSON
    Returns:
        - Response stating that the table has been created in cassandra
    '''
    logging.info(feature_registation_dict)
    target_table_name = feature_registation_dict.get('featureset_name')
    join_key = feature_registation_dict.get('join_key')
    metadata_list = feature_registation_dict.get('metadata_list')
    table_name_delimiter = "__"
    logging.info(metadata_list)
    column_names = [sub['name'] for sub in metadata_list]
    data_types = [DataType.to_cass_datatype(eval(sub['data_type']).value) for sub in metadata_list]
    logging.info(f"Column names: {column_names}")
    logging.info(f"Data types: {data_types}")
    ls = list(zip(column_names, data_types))
    target_table_name = target_table_name + table_name_delimiter + join_key
    base_query = f"CREATE TABLE {self.keyspace}.{target_table_name} ("
    # CREATE TABLE images_by_month5(tid object PRIMARY KEY , cc_num object,amount object,fraud_label object,activity_time object,month object);
    # create_query_new = "CREATE TABLE vpinference_dev.images_by_month4 (month int,activity_time timestamp,amount double,cc_num varint,fraud_label varint,
    #     tid text,PRIMARY KEY (month, activity_time, tid)) WITH CLUSTERING ORDER BY (activity_time DESC, tid ASC)"
    # CREATE TABLE group_join_dates ( groupname text, joined timeuuid, username text, email text, age int, PRIMARY KEY (groupname, joined) )
    flag = True
    for name, data_type in ls:
        base_query += " " + name
        base_query += " " + data_type
        #if flag:
        #    base_query += " PRIMARY KEY "
        #    flag = False
        base_query += ','
    create_query = base_query.strip(',').rstrip(' ') + ', month varchar, activity_time timestamp,' + ' PRIMARY KEY (' + f'month, activity_time, {join_key}) )' + f' WITH CLUSTERING ORDER BY (activity_time DESC, {join_key} ASC' + ');'
    logging.info(f"Query to create table in cassandra: {create_query}")
    try:
        session = self.get_session()
        session.execute(create_query)
    except Exception as e:
        logging.exception(f"Some error occurred while doing the registration in cassandra. Details :: {str(e)}")
        raise AppException(f"Some error occurred while doing the registration in cassandra. Details :: {str(e)}")
    response = f"Table created successfully in cassandra at: vpinference_dev.{target_table_name}__{join_key};"
    return response
This is the dictionary that I am passing:
feature_registation_dict = {
    'featureSetName': 'data_type_testing_29',
    'teamName': 'Harsh',
    'frequency': 'DAILY',
    'joinKey': 'tid',
    'model_version': 'v1',
    'model_name': 'data type testing',
    'metadata_list': [
        {'name': 'tid',
         'data_type': 'text',
         'definition': 'Credit Card Number (Unique)'},
        {'name': 'cc_num',
         'data_type': 'bigint',
         'definition': 'Aggregated Metric: Average number of transactions for the card aggregated by past 10 minutes'},
        {'name': 'amount',
         'data_type': 'double',
         'definition': 'Aggregated Metric: Average transaction amount for the card aggregated by past 10 minutes'},
        {'name': 'datetime',
         'data_type': 'text',
         'definition': 'Required feature for event timestamp'}]}

Not sure I understood exactly what you need, but I'll give it a try. To send any dataframe to FastAPI, you could do something like:
# fastapi
@app.post("/receive_df")
def receive_df(df_in: str):
    df = pd.read_json(df_in)
# jupyter
payload = {"df_in": df.to_json()}
requests.post("http://localhost:8000/receive_df", data=payload)
Can't really test this right now, and there are probably some mistakes in there, but the gist is just serializing the DataFrame to JSON and then deserializing it in the endpoint. If you need (JSON) validation, you can also use the pydantic.Json data type. If there is no fixed schema then you can't use BaseModel in any useful way. But just sending a plain JSON string should be all you need, if your data comes only from reliable sources (your Jupyter notebook).
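Since the core concern is that a CSV round trip drops dtype information, a format that carries its own schema avoids the problem. Below is a minimal sketch, not the poster's actual API: the /upload-parquet endpoint name and the response fields are made up, and it assumes pyarrow is installed so pandas can read and write parquet. A pure-JSON alternative is df.to_json(orient='table') on the client and pd.read_json(..., orient='table') in the endpoint, since the 'table' orient embeds a schema.
import io
import pandas as pd
from fastapi import FastAPI, File, UploadFile

app = FastAPI()

@app.post("/upload-parquet")
async def upload_parquet(file: UploadFile = File(...)):
    # Parquet keeps column dtypes, so nothing is lost in transit.
    content = await file.read()
    df = pd.read_parquet(io.BytesIO(content))
    # ... hand df to the dynamic CREATE TABLE / INSERT logic here ...
    return {"rows": len(df), "dtypes": {c: str(t) for c, t in df.dtypes.items()}}
And from the Jupyter notebook side:
import io, requests
buf = io.BytesIO()
df.to_parquet(buf)
buf.seek(0)
requests.post("http://localhost:8000/upload-parquet", files={"file": ("df.parquet", buf)})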

How to query for a list of all Mailchimp campaigns using Python?

There are campaigns; however, none of them are being returned from this sample script:
nicholas@mordor:~/python$
nicholas@mordor:~/python$ python3 chimp.py
key jfkdljfkl_key
user fdjkslafjs_user
password dkljfdkl_pword
server fjkdls_server
nicholas@mordor:~/python$
nicholas@mordor:~/python$ cat chimp.py
import os
from mailchimp3 import MailChimp
key=(os.environ['chimp_key'])
user=(os.environ['chimp_user'])
password=(os.environ['chimp_password'])
server=(os.environ['chimp_server'])
print ("key\t\t", key)
print ("user\t\t", user)
print ("password\t", password)
print ("server\t\t", server)
client = MailChimp(mc_api=key, mc_user=user)
client.lists.all(get_all=True, fields="lists.name,lists.id")
client.campaigns.all(get_all=True)
nicholas@mordor:~/python$
Do I need to send additional information to get back a list of campaigns? Just looking to log some basic responses from Mailchimp.
(Obviously, I've not posted my API key, nor other sensitive info.)
This is what I use and it works for me. Just call the get_campaigns function with the MailChimp client. I added n_days for my specific needs, but you can delete that part of the code if you do not need it. You can also customize the renames and dropped columns as per your needs.
from typing import Optional, Union, List, Tuple
from datetime import timedelta, date
import pandas as pd # type: ignore
from mailchimp3 import MailChimp # type: ignore
default_campaign_fields = [
    'id',
    'send_time',
    'emails_sent',
    'recipients.recipient_count',
    'settings.title',
    'settings.from_name',
    'settings.reply_to',
    'report_summary'
]

def get_campaigns(client: MailChimp, n_days: int = 7, fields: Optional[Union[str, List[str]]] = None) -> pd.DataFrame:
    """
    Gets the statistics for all sent campaigns in the last 'n_days'
    client: (Required) MailChimp client object
    n_days: (int) Get campaigns for the last n_days
    fields: Specific fields to return. Default is None which gets some predefined columns.
    """
    keyword = 'campaigns'
    if fields is None:
        fields = default_campaign_fields
    # If it is a string (single field), convert to List so that the join operation works properly
    if isinstance(fields, str):
        fields = [fields]
    fields = [keyword + '.' + field for field in fields]
    fields = ",".join(fields)

    now = date.today()
    last_ndays = now - timedelta(days=n_days)

    rvDataFrame = pd.json_normalize(
        client.campaigns.all(
            get_all=True,
            since_send_time=last_ndays,
            fields=fields).get(keyword))

    if 'send_time' in rvDataFrame.columns:
        rvDataFrame.sort_values('send_time', ascending=False, inplace=True)

    mapper = {
        "id": "ID",
        "emails_sent": "Emails Sent",
        "settings.title": "Campaign Name",
        "settings.from_name": "From",
        "settings.reply_to": "Email",
        "report_summary.unique_opens": "Opens",
        "report_summary.open_rate": "Open Rate (%)",
        "report_summary.subscriber_clicks": "Unique Clicks",
        "report_summary.click_rate": "Click Rate (%)"
    }

    drops = [
        "recipients.recipient_count",
        "report_summary.opens",
        "report_summary.clicks",
        "report_summary.ecommerce.total_orders",
        "report_summary.ecommerce.total_spent",
        "report_summary.ecommerce.total_revenue"]

    rvDataFrame.drop(columns=drops, inplace=True)
    rvDataFrame.rename(columns=mapper, inplace=True)

    rvDataFrame.loc[:, "Open Rate (%)"] = round(rvDataFrame.loc[:, "Open Rate (%)"] * 100, 2)
    rvDataFrame.loc[:, "Click Rate (%)"] = round(rvDataFrame.loc[:, "Click Rate (%)"] * 100, 2)

    return rvDataFrame
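For completeness, a short usage sketch, reusing the environment-variable setup from the question (the variable names are the question's, not anything required by the function):
import os
from mailchimp3 import MailChimp
client = MailChimp(mc_api=os.environ['chimp_key'], mc_user=os.environ['chimp_user'])
# Campaign stats for the last 30 days, as a DataFrame
df = get_campaigns(client, n_days=30)
print(df.head())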

Azure Cosmos DB, Delete IDS (definitely exist)

This is probably a very simple and silly mistake, but I am unsure of how this is failing. I have followed the https://github.com/Azure/azure-cosmos-python#insert-data tutorial. How can I query the database, use those ids to delete the documents, and then be told they don't exist?
Can anyone help before the weekend sets in? Thanks, struggling to see how this fails!
Error:
azure.cosmos.errors.HTTPFailure: Status code: 404
{"code":"NotFound","message":"Entity with the specified id does not exist in the system., \r\nRequestStartTime: 2020-02-07T17:08:48.1413131Z,
RequestEndTime: 2020-02-07T17:08:48.1413131Z,
Number of regions attempted:1\r\nResponseTime: 2020-02-07T17:08:48.1413131Z,
StoreResult: StorePhysicalAddress: rntbd://cdb-ms-prod-northeurope1-fd24.documents.azure.com:14363/apps/dedf1644-3129-4bd1-9eaa-8efc450341c4/services/956a2aa9-0cad-451f-a172-3f3c7d8353ef/partitions/bac75b40-384a-4019-a973-d2e85ada9c87/replicas/132248272332111641p/,
LSN: 79, GlobalCommittedLsn: 79, PartitionKeyRangeId: 0, IsValid: True, StatusCode: 404,
SubStatusCode: 0, RequestCharge: 1.24, ItemLSN: -1, SessionToken: 0#79#13=-1,
UsingLocalLSN: False, TransportException: null, ResourceType: Document,
OperationType: Delete\r\n, Microsoft.Azure.Documents.Common/2.9.2"}
This is my code...
def get_images_to_check(database_id, container_id):
    images = client.QueryItems("dbs/" + database_id + "/colls/" + container_id,
        {
            'query': 'SELECT * FROM c r WHERE r.manually_reviewed=@manrev',
            'parameters': [
                {'name': '@manrev', 'value': False}
            ]
        },
        {'enableCrossPartitionQuery': True})
    return list(images)

def delete_data(database_id, container_id, data):
    for item in data:
        print(item['id'])
        client.DeleteItem("dbs/" + database_id + "/colls/" + container_id + "/docs/" + item['id'], {'partitionKey': 'class'})

database_id = 'ModelData'
container_id = 'ImagePredictions'
container_id = 'IncorrectPredictions'

images_to_check = get_images_to_check(database_id, container_id)
delete_data(database_id, container_id, images_to_check)
When specifying the partition key in the call to client.DeleteItem(), you chose:
{'partitionKey': 'class'}
The extra parameter to DeleteItem() should be specifying the value of your partition key.
Per your comments, /class is your partition key. So I believe that, if you change your parameter to something like:
{'partitionKey': 'value-of-partition-key'}
This should hopefully work.
More than likely the issue is a mismatched id and PartitionKey value for the document. A document is uniquely identified in a collection by the combination of its id and its PartitionKey value.
Thus, in order to delete a document, you need to specify correct values for both the document's id and its PartitionKey value.
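Putting the two answers together, a minimal sketch of the corrected delete loop, assuming (per the comments) that the partition key path is /class, so the value to pass is each document's own class field:
def delete_data(database_id, container_id, data):
    for item in data:
        doc_link = "dbs/" + database_id + "/colls/" + container_id + "/docs/" + item['id']
        # Pass the document's partition key VALUE, not the name of the key
        client.DeleteItem(doc_link, {'partitionKey': item['class']})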

Send Rekognition response to DynamoDB table using Lambda (Python)

I am using Lambda to detect faces and would like to send the response to a DynamoDB table.
This is the code I am using:
rekognition = boto3.client('rekognition', region_name='us-east-1')
dynamodb = boto3.client('dynamodb', region_name='us-east-1')

# --------------- Helper Functions to call Rekognition APIs ------------------
def detect_faces(bucket, key):
    response = rekognition.detect_faces(Image={"S3Object": {"Bucket": bucket,
                                                            "Name": key}}, Attributes=['ALL'])
    TableName = 'table_test'
    for face in response['FaceDetails']:
        table_response = dynamodb.put_item(TableName=TableName, Item='{0} - {1}%')
    return response
My problem is in this line:
for face in response['FaceDetails']:
    table_response = dynamodb.put_item(TableName=TableName, Item= {'key:{'S':'value'}, {'S':'Value')
I am able to see the result in the console.
I don't want to add specific item(s) to the table; I need the whole response to be transferred to the table.
To do this:
1. What should I add as the key and partition key in the table?
2. How do I transfer the whole response to the table?
I have been stuck on this for three days now and can't figure out any result. Please help!
******************* EDIT *******************
I tried this code:
rekognition = boto3.client('rekognition', region_name='us-east-1')
# --------------- Helper Functions to call Rekognition APIs ------------------
def detect_faces(bucket, key):
response = rekognition.detect_faces(Image={"S3Object": {"Bucket": bucket,
"Name": key}}, Attributes=['ALL'])
TableName = 'table_test'
for face in response['FaceDetails']:
face_id = str(uuid.uuid4())
Age = face["AgeRange"]
Gender = face["Gender"]
print('Generating new DynamoDB record, with ID: ' + face_id)
print('Input Age: ' + Age)
print('Input Gender: ' + Gender)
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table(os.environ['test_table'])
table.put_item(
Item={
'id' : face_id,
'Age' : Age,
'Gender' : Gender
}
)
return response
It gave me two errors:
1. Error processing object xxx.jpg
2. cannot concatenate 'str' and 'dict' objects
Can you please help!
When you create a Table in DynamoDB, you must specify, at least, a Partition Key. Go to your DynamoDB table and grab your partition key. Once you have it, you can create a new object that contains this partition key with some value on it and the object you want to pass itself. The partition key is always a MUST upon creating a new Item in a DynamoDB table.
Your JSON object should look like this:
{
    "myPartitionKey": "myValue",
    "attr1": "val1",
    "attr2": "val2"
}
EDIT: After the OP updated his question, here's some new information:
For problem 1)
Are you sure the image you are trying to process is a valid one? If it is a corrupted file Rekognition will fail and throw that error.
For problem 2)
You cannot concatenate a String with a Dictionary in Python. Your Age and Gender variables are dictionaries, not Strings. So you need to access an inner attribute within these dictionaries. They have a 'Value' attribute. I am not a Python developer, but you need to access the Value attribute inside your Gender object. The Age object, however, has 'Low' and 'High' as attributes.
You can see the complete list of attributes in the docs
Hope this helps!
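To make that concrete, here is a sketch of the loop from the edited question with the inner attributes pulled out as the answer describes. Here response is the rekognition.detect_faces() result from the question, and the AgeLow/AgeHigh item attribute names are just illustrative, not anything the OP defined:
import os
import uuid
import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table(os.environ['test_table'])

for face in response['FaceDetails']:
    face_id = str(uuid.uuid4())
    age_low = face['AgeRange']['Low']      # AgeRange is a dict with 'Low' and 'High'
    age_high = face['AgeRange']['High']
    gender = face['Gender']['Value']       # Gender is a dict with 'Value' and 'Confidence'
    print('Generating new DynamoDB record, with ID: ' + face_id)
    print('Input Age: {}-{}'.format(age_low, age_high))
    print('Input Gender: ' + gender)
    table.put_item(
        Item={
            'id': face_id,
            'AgeLow': age_low,
            'AgeHigh': age_high,
            'Gender': gender
        }
    )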

Empty a DynamoDB table with boto

How can I optimally (in terms of financial cost) empty a DynamoDB table with boto? (As we can do in SQL with a truncate statement.)
boto.dynamodb2.table.delete() or boto.dynamodb2.layer1.DynamoDBConnection.delete_table() deletes the entire table, while boto.dynamodb2.table.delete_item() and boto.dynamodb2.table.BatchTable.delete_item() only delete the specified items.
While I agree with Johnny Wu that dropping the table and recreating it is much more efficient, there may be cases, such as when many GSIs or trigger events are associated with a table, where you don't want to have to re-associate those. The script below should work: it recursively scans the table and uses the batch function to delete all items in the table. For massively large tables, though, this may not work, as it requires all the item keys in the table to be loaded into your machine's memory.
import boto3

dynamo = boto3.resource('dynamodb')

def truncateTable(tableName):
    table = dynamo.Table(tableName)

    # get the table keys
    tableKeyNames = [key.get("AttributeName") for key in table.key_schema]

    """
    NOTE: there are reserved attributes for key names, please see https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/ReservedWords.html
    if a hash or range key is in the reserved word list, you will need to use the ExpressionAttributeNames parameter
    described at https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/dynamodb.html#DynamoDB.Table.scan
    """

    # Only retrieve the keys for each item in the table (minimize data transfer)
    ProjectionExpression = ", ".join(tableKeyNames)

    response = table.scan(ProjectionExpression=ProjectionExpression)
    data = response.get('Items')

    while 'LastEvaluatedKey' in response:
        response = table.scan(
            ProjectionExpression=ProjectionExpression,
            ExclusiveStartKey=response['LastEvaluatedKey'])
        data.extend(response['Items'])

    with table.batch_writer() as batch:
        for each in data:
            batch.delete_item(
                Key={key: each[key] for key in tableKeyNames}
            )

truncateTable("YOUR_TABLE_NAME")
As Johnny Wu mentioned, deleting a table and re-creating it is more efficient than deleting individual items. You should make sure your code doesn't try to create a new table before it is completely deleted.
import boto3

# client and resource handles used below (not shown in the original snippet)
client = boto3.client('dynamodb')
dynamodb = boto3.resource('dynamodb')

def deleteTable(table_name):
    print('deleting table')
    return client.delete_table(TableName=table_name)

def createTable(table_name):
    waiter = client.get_waiter('table_not_exists')
    waiter.wait(TableName=table_name)
    print('creating table')
    table = dynamodb.create_table(
        TableName=table_name,
        KeySchema=[
            {
                'AttributeName': 'YOURATTRIBUTENAME',
                'KeyType': 'HASH'
            }
        ],
        AttributeDefinitions=[
            {
                'AttributeName': 'YOURATTRIBUTENAME',
                'AttributeType': 'S'
            }
        ],
        ProvisionedThroughput={
            'ReadCapacityUnits': 1,
            'WriteCapacityUnits': 1
        },
        StreamSpecification={
            'StreamEnabled': False
        }
    )

def emptyTable(table_name):
    deleteTable(table_name)
    createTable(table_name)
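A quick usage sketch; the waiter call on the last line is an optional addition (not in the original answer) that uses boto3's table_exists waiter so writes don't race the re-creation:
# Recreate (i.e. "truncate") the table; adjust YOURATTRIBUTENAME in createTable first
emptyTable('YOUR_TABLE_NAME')
# Optionally block until the new table is ACTIVE before writing to it
client.get_waiter('table_exists').wait(TableName='YOUR_TABLE_NAME')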
Deleting a table is much more efficient than deleting items one-by-one. If you are able to control your truncation points, then you can do something similar to rotating tables as suggested in the docs for time series data.
This builds on the answer given by Persistent Plants. If the table already exists, you can extract the table definitions and use that to recreate the table.
import boto3

dynamodb = boto3.resource('dynamodb', region_name='us-east-2')

def delete_table_ddb(table_name):
    table = dynamodb.Table(table_name)
    return table.delete()

def create_table_ddb(table_name, key_schema, attribute_definitions,
                     provisioned_throughput, stream_enabled, billing_mode):
    settings = dict(
        TableName=table_name,
        KeySchema=key_schema,
        AttributeDefinitions=attribute_definitions,
        StreamSpecification={'StreamEnabled': stream_enabled},
        BillingMode=billing_mode
    )
    if billing_mode == 'PROVISIONED':
        settings['ProvisionedThroughput'] = provisioned_throughput
    return dynamodb.create_table(**settings)

def truncate_table_ddb(table_name):
    table = dynamodb.Table(table_name)
    key_schema = table.key_schema
    attribute_definitions = table.attribute_definitions
    if table.billing_mode_summary:
        billing_mode = 'PAY_PER_REQUEST'
    else:
        billing_mode = 'PROVISIONED'
    if table.stream_specification:
        stream_enabled = True
    else:
        stream_enabled = False
    capacity = ['ReadCapacityUnits', 'WriteCapacityUnits']
    provisioned_throughput = {k: v for k, v in table.provisioned_throughput.items() if k in capacity}

    delete_table_ddb(table_name)
    table.wait_until_not_exists()

    return create_table_ddb(
        table_name,
        key_schema=key_schema,
        attribute_definitions=attribute_definitions,
        provisioned_throughput=provisioned_throughput,
        stream_enabled=stream_enabled,
        billing_mode=billing_mode
    )
Now call the function:
table_name = 'test_ddb'
truncate_table_ddb(table_name)

psycopg2 execute returns datetime instead of a string

cur.execute("SELECT \
title, \
body, \
date \ # This pgsql type is date
FROM \
table \
WHERE id = '%s';", id)
response = cur.fetchall()
print response
As an example this gives me: -
[('sample title', 'sample body', datetime.date(2012, 8, 5))]
Which can't be passed to things like json.dumps so I'm having to do this: -
processed = []
for row in response:
    processed.append({'title': row[0],
                      'body': row[1],
                      'date': str(row[2])
                      })
Which feels like poor form, does anyone know of a better way of handling this?
First of all, what did you expect to be returned from a field with a "date" data type? A date, obviously, and the driver performs as expected here.
So your task is actually to find out how to tell the JSON encoder to encode instances of the datetime.date class. Simple answer: improve the encoder by subclassing the built-in one:
from datetime import date
import json

class DateEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, date):
            return str(obj)
        return json.JSONEncoder.default(self, obj)
Usage (you need to explicitly say you're using the custom encoder):
json.dumps(_your_dict, cls=DateEncoder)
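For instance, with the rows from the question (the values shown are the question's own sample data):
from datetime import date
processed = [{'title': 'sample title', 'body': 'sample body', 'date': date(2012, 8, 5)}]
print(json.dumps(processed, cls=DateEncoder))
# [{"title": "sample title", "body": "sample body", "date": "2012-08-05"}]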
As the previous answer suggests, this is the expected result from a query on a date field. However, one can simplify things a lot more in the query itself.
If you go through the Postgres docs you can find the to_char() function.
This leads to a simple change in your query:
cur.execute("SELECT \
title, \
body, \
to_char(date, 'YYY-MM-DD') \ # This pgsql type is date
FROM \
table \
WHERE id = '%s';", id)
Instead of adding a custom encoder, you can customise the psycopg2 mapping between SQL and Python data types, as explained here:
https://www.psycopg.org/docs/advanced.html#type-casting-of-sql-types-into-python-objects
The code template is:
import psycopg2.extensions

date_oid = 1082  # OID of the date type; see the docs for how to get it from the db

def casting_fn(val, cur):
    # process as you like, e.g. string formatting; returning val unchanged
    # keeps the raw 'YYYY-MM-DD' string coming from the database
    return val

# register custom mapping
datetype_casted = psycopg2.extensions.new_type((date_oid,), "date", casting_fn)
psycopg2.extensions.register_type(datetype_casted)
Once this is done, instead of
[('sample title', 'sample body', datetime.date(2012, 8, 5))]
you will receive
[('sample title', 'sample body', '2012-08-05')]
