How do I create a CSV in Lambda using Python?

I would like to create a report in Lambda using Python that is saved as a CSV file. Below is the code of the function:
import boto3
import datetime
import re

def lambda_handler(event, context):
    client = boto3.client('ce')
    now = datetime.datetime.utcnow()
    end = datetime.datetime(year=now.year, month=now.month, day=1)
    start = end - datetime.timedelta(days=1)
    start = datetime.datetime(year=start.year, month=start.month, day=1)
    start = start.strftime('%Y-%m-%d')
    end = end.strftime('%Y-%m-%d')
    response = client.get_cost_and_usage(
        TimePeriod={
            'Start': "2019-02-01",
            'End': "2019-08-01"
        },
        Granularity='MONTHLY',
        Metrics=['BlendedCost'],
        GroupBy=[
            {
                'Type': 'TAG',
                'Key': 'Project'
            },
        ]
    )
How can I create a CSV file from it?

Here is a sample function to create a CSV file in Lambda using Python:
Assuming that the variable 'response' has the required data for creating the report for you, the following piece of code will help you create a temporary CSV file in the /tmp folder of the lambda function:
import csv

temp_csv_file = csv.writer(open("/tmp/csv_file.csv", "w+"))

# writing the column names
temp_csv_file.writerow(["Account Name", "Month", "Cost"])

# writing rows in to the CSV file
for detail in response:
    temp_csv_file.writerow([detail['account_name'],
                            detail['month'],
                            detail['cost']
                            ])
Once you have created the CSV file, you can upload it to S3 and send it as an email or share it as a link using the following piece of code:
client = boto3.client('s3')
client.upload_file('/tmp/csv_file.csv', BUCKET_NAME,'final_report.csv')
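If you want to share it as a link rather than attaching the file to an email, a minimal sketch using a presigned URL (reusing BUCKET_NAME and the 'final_report.csv' key from above; the one-hour expiry is just an example):
client = boto3.client('s3')
# Generate a temporary, shareable link for the uploaded report
url = client.generate_presigned_url(
    'get_object',
    Params={'Bucket': BUCKET_NAME, 'Key': 'final_report.csv'},
    ExpiresIn=3600  # link validity in seconds
)
print(url)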
Points to remember:
/tmp is ephemeral storage of 512 MB that can be used for temporary files.
You should not rely on this storage to maintain state across subsequent Lambda invocations.

The above answer by Repakula Srushith is correct but will create an empty CSV, as the file is never closed. You can change the code to:
f = open("/tmp/csv_file.csv", "w+")
temp_csv_file = csv.writer(f)
temp_csv_file.writerow(["Account Name", "Month", "Cost"])

# writing rows in to the CSV file
for detail in response:
    temp_csv_file.writerow([detail['account_name'],
                            detail['month'],
                            detail['cost']
                            ])
f.close()
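Alternatively, a with block closes the file automatically, even if writing raises an exception. A minimal sketch of the same logic (newline='' is the usual csv-module setting in Python 3):
with open("/tmp/csv_file.csv", "w+", newline="") as f:
    temp_csv_file = csv.writer(f)
    temp_csv_file.writerow(["Account Name", "Month", "Cost"])
    for detail in response:
        temp_csv_file.writerow([detail['account_name'],
                                detail['month'],
                                detail['cost']])
# the file is closed here, so the S3 upload sees the complete contents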

Related

AWS Glue create_partition using boto3 successful, but Athena not showing results for query

I have a glue script to create new partitions using create_partition(). The glue script is running successfully, and I can see the partitions in the Athena console when using SHOW PARTITIONS. For the glue script create_partitions, I referred to this sample code: https://medium.com/#bv_subhash/demystifying-the-ways-of-creating-partitions-in-glue-catalog-on-partitioned-s3-data-for-faster-e25671e65574
When I try to run an Athena query for a given partition which was newly added, I am getting no results.
Do I need to trigger the MSCK command even if I add the partitions using create_partitions? Appreciate any suggestions.
I have got the solution myself and wanted to share it with the SO community, so it may be useful to someone. The following code, when run as a glue job, creates partitions that can also be queried in Athena for the new partition columns. Please change/add the parameter values (db name, table name, partition columns) as needed.
import boto3
import urllib.parse
import os
import copy
import sys

# Configure database / table name and emp_id, file_id from workflow params?
DATABASE_NAME = 'my_db'
TABLE_NAME = 'enter_table_name'
emp_id_tmp = ''
file_id_tmp = ''

# Initialise the Glue client using Boto 3
glue_client = boto3.client('glue')

# get current table schema for the given database name & table name
def get_current_schema(database_name, table_name):
    try:
        response = glue_client.get_table(
            DatabaseName=DATABASE_NAME,
            Name=TABLE_NAME
        )
    except Exception as error:
        print("Exception while fetching table info")
        sys.exit(-1)

    # Parsing table info required to create partitions from table
    table_data = {}
    table_data['input_format'] = response['Table']['StorageDescriptor']['InputFormat']
    table_data['output_format'] = response['Table']['StorageDescriptor']['OutputFormat']
    table_data['table_location'] = response['Table']['StorageDescriptor']['Location']
    table_data['serde_info'] = response['Table']['StorageDescriptor']['SerdeInfo']
    table_data['partition_keys'] = response['Table']['PartitionKeys']

    return table_data

# prepare partition input list using table_data
def generate_partition_input_list(table_data):
    input_list = []  # Initializing empty list
    part_location = "{}/emp_id={}/file_id={}/".format(table_data['table_location'], emp_id_tmp, file_id_tmp)
    input_dict = {
        'Values': [
            emp_id_tmp, file_id_tmp
        ],
        'StorageDescriptor': {
            'Location': part_location,
            'InputFormat': table_data['input_format'],
            'OutputFormat': table_data['output_format'],
            'SerdeInfo': table_data['serde_info']
        }
    }
    input_list.append(input_dict.copy())
    return input_list

# create partition dynamically using the partition input list
table_data = get_current_schema(DATABASE_NAME, TABLE_NAME)
input_list = generate_partition_input_list(table_data)
try:
    create_partition_response = glue_client.batch_create_partition(
        DatabaseName=DATABASE_NAME,
        TableName=TABLE_NAME,
        PartitionInputList=input_list
    )
    print('Glue partition created successfully.')
    print(create_partition_response)
except Exception as e:
    # Handle exception as per your business requirements
    print(e)
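As a quick sanity check after the job runs, you can list what the Glue Data Catalog now holds for the table (partitions added through the Glue API are visible to Athena directly, so MSCK REPAIR TABLE should not be required). A minimal sketch, assuming the same DATABASE_NAME / TABLE_NAME values as above:
import boto3

glue_client = boto3.client('glue')

# Print every partition currently registered for the table
partitions = glue_client.get_partitions(DatabaseName=DATABASE_NAME, TableName=TABLE_NAME)
for partition in partitions['Partitions']:
    print(partition['Values'], partition.get('StorageDescriptor', {}).get('Location'))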

(python) Iterating through a list of Salesforce tables to extract and load into AWS S3

Good Morning All!
I'm trying to have a routine iterate through a table list. The code below works on a single table, 'contact'. I want to iterate through all of the tables listed in my tablelist.csv. I bolded the selections below which would need to be dynamically modified in the code. My brain is pretty fried at this point from working through two nights, and I'm fully prepared for the internet to tell me that this is in chapter two of Intro to Python, but I could use the help just to get over this hurdle.
import pandas as pd
import boto3
from simple_salesforce import Salesforce

li = pd.read_csv('tablelist.csv', header=None)

desc = sf.**Contact**.describe()
field_names = [field['name'] for field in desc['fields']]
soql = "SELECT {} FROM **Contact**".format(','.join(field_names))
results = sf.query_all(soql)
sf_df = pd.DataFrame(results['records']).drop(columns='attributes')
sf_df.to_csv('**contact**.csv')

s3 = boto3.client('s3')
s3.upload_file('contact.csv', 'mybucket', 'Ops/20201027/contact.csv')
It would help if you could provide a sample of the tablelist file, but here's a stab at it... you really just need to get the list of tables and loop through it.
# assuming table is a column somewhere in the file (so the file needs a header row)
df_tablelist = pd.read_csv('tablelist.csv')
for Contact in df_tablelist['yourtablecolumttoiterateon'].tolist():
    desc = getattr(sf, Contact).describe()
    field_names = [field['name'] for field in desc['fields']]
    soql = "SELECT {} FROM {}".format(','.join(field_names), Contact)
    results = sf.query_all(soql)
    sf_df = pd.DataFrame(results['records']).drop(columns='attributes')
    sf_df.to_csv(Contact + '.csv')
    s3 = boto3.client('s3')
    s3.upload_file(Contact + '.csv', 'mybucket', 'Ops/20201027/' + Contact + '.csv')
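For completeness, a minimal sketch of the pieces the snippet above takes for granted: the Salesforce connection (sf) and reading tablelist.csv by an explicit column name. The credentials and the 'table_name' column are hypothetical placeholders; use whatever your org and file actually contain:
import pandas as pd
import boto3
from simple_salesforce import Salesforce

# placeholder credentials - replace with your own
sf = Salesforce(username='user@example.com', password='password', security_token='token')
s3 = boto3.client('s3')  # create the client once, outside the loop

df_tablelist = pd.read_csv('tablelist.csv')  # expects a header row
tables = df_tablelist['table_name'].tolist()  # 'table_name' is a hypothetical column name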

Add a date loaded field when uploading csv to big query

Using Python.
Is there any way to add an extra field while loading a CSV file into BigQuery?
I'd like to add a date_loaded field with the current date.
Google code example I have used:
# from google.cloud import bigquery
# client = bigquery.Client()
# dataset_id = 'my_dataset'

dataset_ref = client.dataset(dataset_id)
job_config = bigquery.LoadJobConfig()
job_config.schema = [
    bigquery.SchemaField('name', 'STRING'),
    bigquery.SchemaField('post_abbr', 'STRING')
]
job_config.skip_leading_rows = 1
# The source format defaults to CSV, so the line below is optional.
job_config.source_format = bigquery.SourceFormat.CSV
uri = 'gs://cloud-samples-data/bigquery/us-states/us-states.csv'

load_job = client.load_table_from_uri(
    uri,
    dataset_ref.table('us_states'),
    job_config=job_config)  # API request

print('Starting job {}'.format(load_job.job_id))

load_job.result()  # Waits for table load to complete.
print('Job finished.')

destination_table = client.get_table(dataset_ref.table('us_states'))
print('Loaded {} rows.'.format(destination_table.num_rows))
By modifying this Python example to fit your issue, you open and read the original CSV file from your local machine, add the new column, and append a timestamp at the end of each line so that the column is not empty (Python's datetime module gives you the timestamp in whatever format you need).
Then you write the resulting data to an output file, copy it to Google Cloud Storage (here by running gsutil through subprocess), and load it into BigQuery.
I hope this helps.
# Import the dependencies
import csv, datetime, subprocess
from google.cloud import bigquery

# Replace the values for the variables with the appropriate ones
# Name of the input csv file
csv_in_name = 'us-states.csv'
# Name of the output csv file to avoid messing up the original
csv_out_name = 'out_file_us-states.csv'
# Name of the NEW COLUMN to be added
new_col_name = 'date_loaded'
# Type of the new column
col_type = 'DATETIME'
# Name of your bucket
bucket_id = 'YOUR BUCKET ID'
# Your dataset name
ds_id = 'YOUR DATASET ID'
# The destination table name
destination_table_name = 'TABLE NAME'

# read and write csv files
with open(csv_in_name, 'r') as r_csvfile:
    with open(csv_out_name, 'w') as w_csvfile:
        dict_reader = csv.DictReader(r_csvfile, delimiter=',')
        # add new column to the existing ones
        fieldnames = dict_reader.fieldnames + [new_col_name]
        writer_csv = csv.DictWriter(w_csvfile, fieldnames, delimiter=',')
        writer_csv.writeheader()

        for row in dict_reader:
            # Put the timestamp after the last comma so that the column is not empty
            row[new_col_name] = datetime.datetime.now()
            writer_csv.writerow(row)

# Copy the file to your Google Storage bucket
subprocess.call('gsutil cp ' + csv_out_name + ' gs://' + bucket_id, shell=True)

client = bigquery.Client()
dataset_ref = client.dataset(ds_id)
job_config = bigquery.LoadJobConfig()

# Add the new column to the schema!
job_config.schema = [
    bigquery.SchemaField('name', 'STRING'),
    bigquery.SchemaField('post_abbr', 'STRING'),
    bigquery.SchemaField(new_col_name, col_type)
]
job_config.skip_leading_rows = 1
# The source format defaults to CSV, so the line below is optional.
job_config.source_format = bigquery.SourceFormat.CSV

# Address string of the output csv file
uri = 'gs://' + bucket_id + '/' + csv_out_name

load_job = client.load_table_from_uri(uri, dataset_ref.table(destination_table_name), job_config=job_config)  # API request
print('Starting job {}'.format(load_job.job_id))

load_job.result()  # Waits for table load to complete.
print('Job finished.')

destination_table = client.get_table(dataset_ref.table(destination_table_name))
print('Loaded {} rows.'.format(destination_table.num_rows))
You can keep loading your data as you are now, but into a table called old_table.
Once loaded, you can run something like:
bq --location=US query --destination_table mydataset.newtable --use_legacy_sql=false --replace=true 'select *, current_date() as date_loaded from mydataset.old_table'
This basically writes the contents of the old table, with a new date_loaded column appended at the end, into the new table. This way you get the new column without downloading the data locally or any of that mess.
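If you prefer to stay in Python, the same idea can be expressed with the BigQuery client by running the query with a destination table. A minimal sketch, assuming the dataset and table names from the bq command above:
from google.cloud import bigquery

client = bigquery.Client()

# Write the query result into mydataset.newtable, replacing it if it already exists
job_config = bigquery.QueryJobConfig()
job_config.destination = client.dataset('mydataset').table('newtable')
job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE

sql = 'SELECT *, CURRENT_DATE() AS date_loaded FROM mydataset.old_table'
query_job = client.query(sql, job_config=job_config)
query_job.result()  # waits for the query job to finish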

Using Boto3 to create a loop on a specific folder

I am testing new data feeds that come in as XML. Those data will be stored in S3 in the following format:
2018\1\2\1.xml
2018\1\3\1.xml
2018\1\3\2.xml
etc. So, multiple .xml files are possible on one day. Also, important to note that there are folders in this bucket that I do NOT want to pull. So I have to target a very specific directory.
There is no date time stamp within the file, so I need to use created, modified, something to go off of. To do this I think of using a dictionary of key, values with folder+xml file as the key, created/modified timestamp as the value. Then, use that dict to essentially re-pull all the objects.
Here's what I've tried...
import boto3
from pprint import pprint

client = boto3.client('s3')
paginator = client.get_paginator('list_objects_v2')
result = paginator.paginate(
    Bucket='bucket',
    Prefix='folder/folder1/folder2')

bucket_object_list = []

for page in result:
    pprint(page)
    if "Contents" in page:
        for key in page["Contents"]:
            keyString = key["Key"]
            pprint(keyString)
            bucket_object_list.append(keyString)

s3 = boto3.resource('s3')
obj = s3.Object('bucket', 'bucket_object_list')
obj.get()["Contents"].read().decode('utf-8')
pprint(obj.get())
sys.exit()
This is throwing an error from the key within the obj = s3.Object('cluster','key') line.
Traceback (most recent call last):
File "s3test2.py", line 25, in <module>
obj = s3.Object('cluster', key)
NameError: name 'key' is not defined
The MaxItems is purely for testing purposes, although it's interesting that this translates to 1000 when run.
NameError: name 'key' is not defined
As far as the error is concerned, it's because key is not defined.
From this documentation:
Object(bucket_name, key)
Creates an Object resource:
object = s3.Object('bucket_name', 'key')
Parameters:
bucket_name (string) -- The Object's bucket_name identifier. This must be set.
key (string) -- The Object's key identifier. This must be set.
You need to assign an object key name to the 'key' you're using in the code.
The key name is the "name" (i.e. the unique identifier) under which your file is stored in the S3 bucket.
Code based on what you posted:
import boto3

client = boto3.client('s3')
paginator = client.get_paginator('list_objects_v2')
result = paginator.paginate(Bucket='bucket_name', Prefix='folder/folder1/folder2')

bucket_object_list = []
for page in result:
    if "Contents" in page:
        for key in page["Contents"]:
            keyString = key["Key"]
            print(keyString)
            bucket_object_list.append(keyString)

print(bucket_object_list)

s3 = boto3.resource('s3')
for file_name in bucket_object_list:
    obj = s3.Object('bucket_name', file_name)
    print(obj.get())
    print(obj.get()["Body"].read().decode('utf-8'))
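Since the original plan was a dictionary of key -> created/modified timestamp, note that each entry under page["Contents"] already carries a LastModified value, so you can collect it while paginating. A minimal sketch under that assumption (the bucket name, prefix and cutoff date are placeholders):
import datetime
import boto3

client = boto3.client('s3')
paginator = client.get_paginator('list_objects_v2')

# key name -> last-modified timestamp, restricted to the target prefix
modified_by_key = {}
for page in paginator.paginate(Bucket='bucket_name', Prefix='folder/folder1/folder2'):
    for entry in page.get("Contents", []):
        modified_by_key[entry["Key"]] = entry["LastModified"]

# example: keep only objects modified after a cutoff (LastModified is timezone-aware UTC)
cutoff = datetime.datetime(2018, 1, 1, tzinfo=datetime.timezone.utc)
new_keys = [k for k, ts in modified_by_key.items() if ts > cutoff]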

Why influx performance is so slow

I am storing some data in InfluxDB, and it is quite confusing that InfluxDB is 4-5 times slower than MySQL. I tried to test this by inserting 10000 rows in MySQL and then in InfluxDB, and the stats are below.
MySQL
real: 6m 39sec
user: 2.956sec
sys: 0.504sec
InfluxDB
real: 6m 17.193sec
user: 11.860sec
sys: 0.328sec
My code for Influx is given below; I used the same pattern to store in MySQL.
#!/usr/bin/env python
# coding: utf-8
import time
import csv
import sys
import datetime
import calendar
import pytz
from influxdb import client as influxdb
from datetime import datetime

host = 'localhost'
port = 8086
user = "admin"
password = "admin"
db_name = "testdatabase"
db = influxdb.InfluxDBClient(database=db_name)

def read_data():
    with open(file) as f:
        reader = f.readlines()[4:]
        for line in reader:
            yield (line.strip().split(','))

fmt = '%Y-%m-%d %H:%M:%S'
file = '/home/rob/mycsvfile.csv'

csvToInflux = read_data()
body = []
for metric in csvToInflux:
    timestamp = datetime.strptime(metric[0][1: len(metric[0]) - 1], fmt)
    new_value = float(metric[1])
    body.append({
        'measurement': 'mytable1',
        'time': timestamp,
        'fields': {
            'col1': metric[1],
            'col2': metric[2],
            'col3': metric[3],
            'col4': metric[4],
            'col5': metric[5],
            'col6': metric[6],
            'col7': metric[7],
            'col8': metric[8],
            'col9': metric[9]
        }
    })
    db.write_points(body)
Can someone give me an idea of how I can improve it? I think it might be due to caching. Is the cache option off by default in InfluxDB? And can someone guide me on batch processing in Influx? I tried looking on SO and Google but couldn't solve my problem. I am a newbie to InfluxDB and am trying to make it faster.
Thanks for any help or tips.
Inserting one by one into InfluxDB is slow; you should do it in batches. For example, trying with a CSV of 10000 lines (one by one):
with open('/tmp/blah.csv') as f:
    lines = f.readlines()

import influxdb

inf = influxdb.InfluxDBClient('localhost', 8086, 'root', 'root', 'example1')

for line in lines:
    parts = line.split(',')
    json_body = [{
        'measurement': 'one_by_one',
        'time': parts[0],
        'fields': {
            'my_value': int(parts[1].strip())
        }
    }]
    inf.write_points(json_body)
This gives me a result of
└─ $ ▶ time python influx_one.py
real 1m43.655s
user 0m19.547s
sys 0m3.266s
And doing a small change to insert all the lines of the CSV in one go:
json_body = []
for line in lines:
parts = line.split(',')
json_body.append({
'measurement': 'one_batch',
'time': parts[0],
'fields':{
'my_value': int(parts[1].strip())
}
})
inf.write_points(json_body)
The result is much much better:
└─ $ ▶ time python influx_good.py
real 0m2.693s
user 0m1.797s
sys 0m0.734s
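If building the whole list in memory is a concern for bigger files, write_points in the influxdb Python client also accepts a batch_size argument, so you can pass all the points and let the client split the write into chunks. For example (5000 is an arbitrary chunk size):
inf.write_points(json_body, batch_size=5000)  # sends the points in chunks of 5000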
