How to write a dataframe to DynamoDB using AWS Lambda - Python

I have a Lambda function set up via AWS CloudFormation. The runtime is Python 3.8.
The purpose is to pull some weather data from an API and write it to DynamoDB once a day.
So far the Lambda test on AWS checks out, all green... but the function doesn't write any values to DynamoDB.
Is there an error in indenting maybe?
Here is the code:
import boto3
import pyowm
import time
import json
import requests
from datetime import datetime, date, timedelta, timezone
import pandas as pd
from geopy.geocoders import Nominatim

def lambda_handler(event, context):
    api_key = "xxxxxxx"  # Enter your own API key
    owm = pyowm.OWM(api_key)

    city = 'Berlin, DE'
    geolocator = Nominatim(user_agent='aerieous#myserver.com')
    location = geolocator.geocode(city)
    lat = location.latitude
    lon = location.longitude

    # set the date to pull the data from to yesterday
    # format = '2021-09-09 00:00:00'
    x = (datetime.now() - timedelta(days=1))
    d = x.isoformat(' ', 'seconds')

    # convert time to epoch
    p = '%Y-%m-%d %H:%M:%S'
    dt = int(time.mktime(time.strptime(d, p)))

    url = "https://api.openweathermap.org/data/2.5/onecall/timemachine?lat=%s&lon=%s&dt=%s&appid=%s&units=metric" % (lat, lon, dt, api_key)
    response = requests.get(url)
    data_history = json.loads(response.text)

    # here we flatten only the nested list "hourly"
    df_history2 = pd.json_normalize(data_history, record_path='hourly',
                                    meta=['lat', 'lon', 'timezone'],
                                    errors='ignore')

    # convert epoch to timestamp
    df_history2['dt'] = pd.to_datetime(df_history2['dt'], unit='s').dt.strftime("%m/%d/%Y %H:%M:%S")

    # replace the column header
    df_history2 = df_history2.rename(columns={'dt': 'timestamp'})
    df_history2['uuid'] = df_history2[['timestamp', 'timezone']].agg('-'.join, axis=1)

    df_select_hist2 = df_history2[['uuid', 'lat', 'lon', 'timezone', 'timestamp', 'temp', 'feels_like', 'humidity', 'pressure']]
    df_select_hist2 = df_select_hist2.astype(str)

    content = df_select_hist2.to_dict('records')
    return content

    dynamodb = boto3.resource(
        'dynamodb',
        aws_access_key_id='xx',
        aws_secret_access_key='xx',
        region_name='eu-west-1')
    table = dynamodb.Table("Dev_Weather")

    for item in content:
        uuid = item['uuid']
        timezone = item['timezone']
        timestamp = item['timestamp']
        lat = item['lat']
        lon = item['lon']
        temp = item['temp']
        feels_like = item['feels_like']
        humidity = item['humidity']
        pressure = item['pressure']

        table.put_item(
            Item={
                'pk_id': uuid,
                'sk': timestamp,
                'gsi_1_pk': lat,
                'gsi_1_sk': lon,
                'gsi_2_pk': temp,
                'gsi_2_sk': feels_like,
                'humidity': humidity,
                'pressure': pressure,
                'timezone': timezone
            }
        )
Thank you for any help in advance.

The line return content ends your Lambda function. It basically tells the script: I'm done and this is the result. Nothing after it is executed. Remove that line to be able to execute the code after it. Also, the indentation in your code example seems off (a space too little when starting the DynamoDB part), so I'm a bit confused as to why this doesn't give syntax errors.
Also: there is no need to specify an access key, region, etc. when creating the DynamoDB resource. It is fetched by Lambda automatically. Just make sure the Lambda role has the right permissions to call DynamoDB.
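A minimal sketch of the corrected structure, assuming the same "Dev_Weather" table and the content list built earlier in the handler (the write loop moves above the return, and the Lambda execution role supplies credentials and region):
import boto3

def lambda_handler(event, context):
    # ... build df_select_hist2 and content exactly as in the question ...
    content = df_select_hist2.to_dict('records')

    # No keys or region needed here: Lambda's execution role and region are used.
    dynamodb = boto3.resource('dynamodb')
    table = dynamodb.Table("Dev_Weather")

    for item in content:
        table.put_item(
            Item={
                'pk_id': item['uuid'],
                'sk': item['timestamp'],
                'gsi_1_pk': item['lat'],
                'gsi_1_sk': item['lon'],
                'gsi_2_pk': item['temp'],
                'gsi_2_sk': item['feels_like'],
                'humidity': item['humidity'],
                'pressure': item['pressure'],
                'timezone': item['timezone']
            }
        )

    # Return only after all items have been written.
    return content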

Related

Pyspark - Converting a StringType nested JSON to columns in a dataframe

I am working on processing CDC data received via Kafka topics and loading it into Databricks Delta tables. I am able to get it all working, except for a nested JSON string which does not get loaded when using from_json or spark.read.json.
When I try to fetch the schema of the JSON from level 1 using "spark.read.json(df.rdd.map(lambda row: row.value)).schema", the column INPUT_DATA is treated as a string and loaded as a string object. I am giving a sample JSON string, the code that I tried, and the expected results.
I have many topics to process, and each topic will have a different schema, so I would like to process them dynamically and do not want to store the schemas, since the schemas may change over time, and I would like my code to handle the changes automatically.
Appreciate any help as I have spent the whole day trying to figure this out. Thanks in advance.
Sample JSON with nested tree:
after = {
    "id_transaction": "121",
    "product_id": 25,
    "transaction_dt": 1662076800000000,
    "creation_date": 1662112153959000,
    "product_account": "40012",
    "input_data": "{\"amount\":[{\"type\":\"CASH\",\"amount\":1000.00}],\"currency\":\"USD\",\"coreData\":{\"CustId\":11021,\"Cust_Currency\":\"USD\",\"custCategory\":\"Premium\"},\"context\":{\"authRequired\":false,\"waitForConfirmation\":false,\"productAccount\":\"CA12001\"},\"brandId\":\"TOYO-2201\",\"dealerId\":\"1\",\"operationInfo\":{\"trans_Id\":\"3ED23-89DKS-001AA-2321\",\"transactionDate\":1613420860087},\"ip_address\":null,\"last_executed_step\":\"PURCHASE_ORDER_CREATED\",\"last_result\":\"OK\",\"output_dataholder\":\"{\"DISCOUNT_AMOUNT\":\"0\",\"BONUS_AMOUNT_APPLIED\":\"10000\"}",
    "dealer_id": 1,
    "dealer_currency": "USD",
    "Cust_id": 11021,
    "process_status": "IN_PROGRESS",
    "tot_amount": 10000,
    "validation_result_code": "OK_SAVE_AND_PROCESS",
    "operation": "Create",
    "timestamp_ms": 1675673484042
}
I have created the following script to get all the columns of the JSON structure:
import json

# table_column_schema = {}
json_keys = {}
child_members = []
table_column_schema = {}
column_schema = []
dbname = "mydb"
tbl_name = "tbl_name"

def get_table_keys(dbname):
    table_values_extracted = f"select value from {dbname}.{tbl_name} limit 1"
    cmd_key_pair_data = spark.sql(table_values_extracted)
    jsonkeys = cmd_key_pair_data.collect()[0][0]
    json_keys = json.loads(jsonkeys)
    column_names_as_keys = json_keys["after"].keys()
    value_column_data = json_keys["after"].values()
    column_schema = list(column_names_as_keys)
    for i in value_column_data:
        if ("{" in str(i) and "}" in str(i)):
            a = json.loads(i)
            for i2 in a.values():
                if (str(i2).startswith("{") and str(i2).endswith('}')):
                    column_schema = column_schema + list(i2.keys())
    table_column_schema['temp_table1'] = column_schema
    return 0

get_table_keys(dbname)
The following code is used to process the JSON and create a dataframe with all the nested JSONs as columns:
from pyspark.sql.functions import from_json, to_json, col
from pyspark.sql.types import StructType, StructField, StringType, LongType, MapType
import time

dbname = "mydb"
tbl_name = "tbl_name"

start = time.time()
df = spark.sql(f'select value from {dbname}.{tbl_name} limit 2')
tbl_columns = table_column_schema[tbl_name]

data = []
for i in tbl_columns:
    if i == 'input_data':
        # print('FOUND !!!!')
        data.append(StructField(f'{i}', MapType(StringType(), StringType()), True))
    else:
        data.append(StructField(f'{i}', StringType(), True))

schema2 = spark.read.json(df.rdd.map(lambda row: row.value)).schema
print(type(schema2))
df2 = df.withColumn("value", from_json("value", schema2)).select(col('value.after.*'), col('value.op'))
Note: value is a column in my Delta table (bronze layer).
Current dataframe output:
Expected dataframe output:
You can use the RDD to infer the schema and from_json to read the value as JSON.
from pyspark.sql import functions as f

# Infer the schema of the nested JSON string, then parse it into a struct.
schema = spark.read.json(df.rdd.map(lambda r: r.input_data)).schema
df = df.withColumn('input_data', f.from_json('input_data', schema))

# Flatten the struct into top-level columns and drop the original column.
new_cols = df.columns + df.select('input_data.*').columns
df = df.select('*', 'input_data.*').toDF(*new_cols).drop('input_data')
df.show(truncate=False)
+-------+----------------+---------------+---------+--------------+---------+--------------+---------------+----------+-------------+----------+----------------+----------------------+----------------+---------+-----------------------+---------------------+--------+--------+----------+----------------------+-----------+---------------------------------------+-----------------+
|Cust_id|creation_date |dealer_currency|dealer_id|id_transaction|operation|process_status|product_account|product_id|timestamp_ms |tot_amount|transaction_dt |validation_result_code|amount |brandId |context |coreData |currency|dealerId|ip_address|last_executed_step |last_result|operationInfo |output_dataholder|
+-------+----------------+---------------+---------+--------------+---------+--------------+---------------+----------+-------------+----------+----------------+----------------------+----------------+---------+-----------------------+---------------------+--------+--------+----------+----------------------+-----------+---------------------------------------+-----------------+
|11021 |1662112153959000|USD |1 |121 |Create |IN_PROGRESS |40012 |25 |1675673484042|10000 |1662076800000000|OK_SAVE_AND_PROCESS |[{1000.0, CASH}]|TOYO-2201|{false, CA12001, false}|{11021, USD, Premium}|USD |1 |null |PURCHASE_ORDER_CREATED|OK |{3ED23-89DKS-001AA-2321, 1613420860087}|{10000, 0} |
+-------+----------------+---------------+---------+--------------+---------+--------------+---------------+----------+-------------+----------+----------------+----------------------+----------------+---------+-----------------------+---------------------+--------+--------+----------+----------------------+-----------+---------------------------------------+-----------------+
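The snippet above assumes input_data is already a top-level string column. Since the question's dataframe only has the raw value column, a rough end-to-end sketch (reusing the database, table, and column names from the question; an assumption, not a tested pipeline) might look like:
from pyspark.sql import functions as f

dbname = "mydb"
tbl_name = "tbl_name"
raw = spark.sql(f"select value from {dbname}.{tbl_name} limit 2")

# 1) Infer the outer schema and flatten value.after.* into columns.
outer_schema = spark.read.json(raw.rdd.map(lambda r: r.value)).schema
df = raw.withColumn("value", f.from_json("value", outer_schema)).select(f.col("value.after.*"))

# 2) Infer the schema of the still-string input_data column and flatten it too.
inner_schema = spark.read.json(df.rdd.map(lambda r: r.input_data)).schema
df = df.withColumn("input_data", f.from_json("input_data", inner_schema))
new_cols = df.columns + df.select("input_data.*").columns
df = df.select("*", "input_data.*").toDF(*new_cols).drop("input_data")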

Is there a way to add dates to queries in the QuickBooks API using Python?

I want to query accounts for a specific time, for example, their values on 2022-01-31. I have the following code based on the documentation (specified here: https://pypi.org/project/python-quickbooks/):
from intuitlib.client import AuthClient
from quickbooks import QuickBooks

# Instantiating the client
auth_client = AuthClient(
    client_id=client_id_texto,
    client_secret=client_secret_texto,
    redirect_uri=redirect_uri_texto,
    environment='sandbox',
)
auth_client.refresh(refresh_token=refresh_token_texto)

client = QuickBooks(
    auth_client=auth_client,
    refresh_token=refresh_token_texto,
    company_id=company_id_texto,
)

from quickbooks.objects import Account
from datetime import datetime

account = Account.all(str(datetime(2022, 1, 31, 0, 0, 0)), qb=client)
However, the last line of the code returns an error. What I want is to get the values of the different accounting accounts for a specific date.
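One hedged observation: in python-quickbooks, Account.all() appears to accept only ordering and paging arguments, which would explain the error when a datetime string is passed. Date restrictions are normally expressed as a query filter via the where() classmethod; whether filtering on MetaData.LastUpdatedTime (an assumed field here) gives the point-in-time account values you want should be verified against Intuit's query documentation.
from quickbooks.objects import Account

# Assumed filter field: accounts last updated on or after 2022-01-31.
accounts = Account.where("Metadata.LastUpdatedTime >= '2022-01-31'", qb=client)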

Read json data from covid19 api using python

I was trying to import time series data from the link Covid_data to get the daily historical and 7-day moving average data, but my code doesn't work. I am new to this, so maybe my key-value pair is not correct. The structure of the file is given here: json_structure_link.
My code:
import requests
import pandas as pd

response = requests.get("https://api.covid19india.org/v4/min/timeseries.min.json")

if response.status_code == 200:
    historical_day_numbers = response.json()

    DATE = []
    STATE = []
    TOTAL_CASES = []
    RECOVERED = []
    DECEASED = []
    TESTED = []
    VACCINATED = []

    for state in historical_day_numbers.keys():
        STATE.append(state)
        DATE.append(historical_day_numbers[state]["dates"])
        TOTAL_CASES.append(historical_day_numbers[state]["dates"]["delta"]["confirmed"])
        RECOVERED.append(historical_day_numbers[state]["dates"]["delta"]["recovered"])
        DECEASED.append(historical_day_numbers[state]["dates"]["delta"]["deceased"])
        TESTED.append(historical_day_numbers[state]["dates"]["delta"]["tested"])
        VACCINATED.append(historical_day_numbers[state]["dates"]["delta"]["vaccinated"])

    Covid19_historical_data = pd.DataFrame(
        {
            "STATE/UT": STATE,
            "DATE": DATE,
            "TOTAL_CASES": TOTAL_CASES,
            "RECOVERED": RECOVERED,
            "DECEASED": DECEASED,
            "TESTED": TESTED,
            "VACCINATED": VACCINATED,
        }
    )
    # print(data.head())
else:
    print("Error while calling API: {} {}".format(response.status_code, response.reason))
The error I am getting:
KeyError: 'delta'
But I can see that 'delta' is present.
historical_day_numbers[state]['dates'].keys()
Output: dict_keys(['2020-04-06', '2020-04-07', '2020-04-08', '2020-04-09', '2020-04-10', '2020-04-11', '2020-04-12', '2020-04-13', '2020-04-14', '2020-04-15', '2020-04-16', '2020-04-17', '2020-04-18', '2020-04-19', '2020-04-20', '2020-04-21',...])
When you inspect the keys, you will realize that there is a key for each date and there is no key called 'delta' at this level.
If you edit your code as follows, you will not get this error:
historical_day_numbers[state]['dates']['2021-07-25']['delta']
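A sketch of how the loop could be restructured to walk every date under 'dates' (field names follow the question; .get() is used so entries missing a field don't raise):
rows = []
for state, state_data in historical_day_numbers.items():
    for day, day_data in state_data["dates"].items():
        delta = day_data.get("delta", {})
        rows.append({
            "STATE/UT": state,
            "DATE": day,
            "TOTAL_CASES": delta.get("confirmed"),
            "RECOVERED": delta.get("recovered"),
            "DECEASED": delta.get("deceased"),
            "TESTED": delta.get("tested"),
            "VACCINATED": delta.get("vaccinated"),
        })

Covid19_historical_data = pd.DataFrame(rows)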

Filter aws ec2 snapshots by current date

How to filter AWS EC2 snapshots by current day?
I'm filtering snapshots by tag:Disaster_Recovery with value:Full using the Python code below, and I also need to filter them by the current day.
import boto3

region_source = 'us-east-1'
client_source = boto3.client('ec2', region_name=region_source)

# Getting all snapshots as per specified filter
def get_snapshots():
    response = client_source.describe_snapshots(
        Filters=[{'Name': 'tag:Disaster_Recovery', 'Values': ['Full']}]
    )
    return response["Snapshots"]

print(*get_snapshots(), sep="\n")
Solved it with the code below:
import boto3
from datetime import date

region_source = 'us-east-1'
client_source = boto3.client('ec2', region_name=region_source)
date_today = date.isoformat(date.today())

# Getting all snapshots as per specified filter
def get_snapshots():
    response = client_source.describe_snapshots(
        Filters=[{'Name': 'tag:Disaster_Recovery', 'Values': ['Full']}]
    )
    return response["Snapshots"]

# Getting snapshots that were created today
snapshots = [s for s in get_snapshots() if s["StartTime"].strftime('%Y-%m-%d') == date_today]

print(*snapshots, sep="\n")
This could do the trick:
import boto3
from datetime import date

region_source = 'us-east-1'
client_source = boto3.client('ec2', region_name=region_source)

# Getting all snapshots as per specified filter
def get_snapshots():
    response = client_source.describe_snapshots(
        Filters=[{'Name': 'tag:Disaster_Recovery', 'Values': ['Full']}]
    )
    snapshotsInDay = []
    for snapshots in response["Snapshots"]:
        if snapshots["StartTime"].strftime('%Y-%m-%d') == date.isoformat(date.today()):
            snapshotsInDay.append(snapshots)
    return snapshotsInDay

print(*get_snapshots(), sep="\n")
After reading the docs, the rest is a simple date comparison.
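One possible refinement, assuming an account with many snapshots: describe_snapshots returns paginated results, so a boto3 paginator can collect every page before applying the same date comparison.
paginator = client_source.get_paginator('describe_snapshots')
pages = paginator.paginate(
    Filters=[{'Name': 'tag:Disaster_Recovery', 'Values': ['Full']}]
)

# Keep only snapshots whose StartTime falls on today's date.
snapshots_today = [
    s for page in pages for s in page['Snapshots']
    if s['StartTime'].strftime('%Y-%m-%d') == date.isoformat(date.today())
]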

Need help in fetching a particular value from the json output

I need to obtain the Tag values from the code below. It initially fetches the Id and then passes it to describe_cluster; the value is then in JSON format. I am trying to fetch a particular value from this "Cluster" JSON using get(). However, it returns the error message "'str' object has no attribute 'get'". Please suggest.
Here is a reference link of boto3 which I'm referring:
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/emr.html#EMR.Client.describe_cluster
import boto3
import json
from datetime import timedelta

REGION = 'us-east-1'
emrclient = boto3.client('emr', region_name=REGION)
snsclient = boto3.client('sns', region_name=REGION)

def lambda_handler(event, context):
    EMRS = emrclient.list_clusters(
        ClusterStates=['STARTING', 'RUNNING', 'WAITING']
    )
    clusters = EMRS["Clusters"]
    for cluster_details in clusters:
        id = cluster_details.get("Id")
        describe_cluster = emrclient.describe_cluster(
            ClusterId=id
        )
        cluster_values = describe_cluster["Cluster"]
        for details in cluster_values:
            tag_values = details.get("Tags")
            print(tag_values)
The error is in the last part of the code.
describe_cluster = emrclient.describe_cluster(
    ClusterId=id
)
cluster_values = describe_cluster["Cluster"]
for details in cluster_values:  # ERROR HERE
    tag_values = details.get("Tags")
    print(tag_values)
The value returned from describe_cluster is a dictionary, and "Cluster" is also a dictionary, so you don't need to iterate over it. You can access cluster_values.get("Tags") directly.
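A corrected version of that section might look like this (same variable names as the question; iterating over a dict yields its keys, which are strings, hence the original error):
describe_cluster = emrclient.describe_cluster(
    ClusterId=id
)

# "Cluster" is a dict, so read "Tags" from it directly instead of iterating.
cluster_values = describe_cluster["Cluster"]
tag_values = cluster_values.get("Tags")
print(tag_values)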
