I have a script that gathers data from an API; when I run it manually on my local machine, I can save the data to a CSV or SQLite .db file.
If I put this on AWS Lambda, how can I store and retrieve data?
TL;DR
You can save data inside an instance of a Lambda function, but you don't want to rely on it as permanent storage. Instead, use a cloud service that specializes in storing data; which one depends on your use case.
Some background info
When using Lambda, think of it as an ephemeral instance in which you only have write access to the /tmp directory and can store up to 512 MB by default (see the Lambda limits). Data stored in /tmp is only guaranteed to be available during the current invocation; there is no guarantee that anything you save there will still be present in future executions.
Considerations
That is why you should consider using other cloud services to store data, e.g. Simple Storage Service (S3) for storing files, RDS for relational databases, or DynamoDB as a NoSQL database solution.
There are many other options and it will all depend on the use case.
Working solution
With Python, it is very simple to store files in S3 using boto3. The code below uses the requests library to perform a GET request to google.com and saves the output to an S3 bucket. As an additional step, it also creates a presigned URL that you can use to download the file.
# lambda_function.py
import os

import boto3
from botocore.client import Config
import requests

s3 = boto3.resource('s3')
client = boto3.client('s3', config=Config(signature_version='s3v4'))

# This environment variable is set via the serverless.yml configuration
bucket = os.environ['FILES_BUCKET']

def lambda_handler(event, context):
    # Make the API call
    response = requests.get('https://google.com')

    # Get the data you care about and transform it into the desired format
    body = response.text

    # Save it to local storage
    tmp_file_path = "/tmp/website.html"
    with open(tmp_file_path, "w") as file:
        file.write(body)

    # Upload the file to S3
    s3.Bucket(bucket).upload_file(tmp_file_path, 'website.html')

    # OPTIONAL: generate a signed URL to download the file
    url = client.generate_presigned_url(
        ClientMethod='get_object',
        Params={
            'Bucket': bucket,
            'Key': 'website.html'
        },
        ExpiresIn=604800  # 7 days
    )

    return url
Deployment
To deploy the Lambda function I highly recommend using a deployment tool like Serverless or LambdaSharp. The following serverless.yml file tells the Serverless Framework how to package and deploy the code; it also creates the S3 bucket and grants the permissions needed to put objects and generate the signed URL:
# serverless.yml
service: s3upload

provider:
  name: aws
  runtime: python3.7
  versionFunctions: false
  memorySize: 128
  timeout: 30

  # you can add statements to the Lambda function's IAM Role here
  iamRoleStatements:
    - Effect: "Allow"
      Action:
        - s3:PutObject
        - s3:GetObject
      Resource:
        - Fn::Join: ["/", [Fn::GetAtt: [FilesBucket, Arn], "*"]]
        - Fn::GetAtt: [FilesBucket, Arn]

# Package information
package:
  artifact: package.zip

functions:
  s3upload-function:
    handler: lambda_function.lambda_handler
    environment:
      FILES_BUCKET:
        Ref: FilesBucket
    events:
      # THIS LAMBDA FUNCTION WILL BE TRIGGERED EVERY 10 MINUTES
      # CHECK OUT THE SERVERLESS DOCS FOR ALTERNATIVE WAYS TO
      # TRIGGER THE FUNCTION
      - schedule:
          rate: rate(10 minutes)

# you can add CloudFormation resource templates here
resources:
  Resources:
    FilesBucket:
      Type: AWS::S3::Bucket
      Properties:
        PublicAccessBlockConfiguration:
          BlockPublicAcls: true
          BlockPublicPolicy: true
          IgnorePublicAcls: true
          RestrictPublicBuckets: true
Now package and deploy:
#!/usr/bin/env bash
# deploy.sh
mkdir package
pip install -r requirements.txt --target=./package
cp lambda_function.py package/
(cd package; zip -r ../package.zip .)
serverless deploy --verbose
Conclusion
When you run Lambda functions, you must think of them as stateless. If you want to save the state of your application, it is better to use other cloud services that are well suited to your use case. For storing CSVs, S3 is an ideal solution: it is a highly available storage system that is very easy to get started with from Python.
With AWS Lambda you can use a database like DynamoDB, which is a NoSQL database, and from there you can export a CSV file.
Lambda-to-DynamoDB integration is easy: Lambda is serverless and DynamoDB is a managed NoSQL database.
So you can save the data into DynamoDB; you could also use RDS (MySQL) or many other services, but DynamoDB is often the best fit. A rough sketch follows below.
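As a minimal sketch of that approach (the table name api-data and its id partition key are placeholders, not anything defined here), writing one item from a Lambda handler could look like this:

# Minimal sketch: save API data to DynamoDB from a Lambda handler.
# The table name "api-data" and its "id" partition key are assumptions.
import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('api-data')

def lambda_handler(event, context):
    # Pretend this record came back from the API call
    record = {'id': 'row-1', 'value': '42'}
    table.put_item(Item=record)  # write one item to the table
    return {'status': 'saved'}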
It really depends on what you want to do with the information afterwards.
If you want to keep it in a file, then simply copy it to Amazon S3. It can store as much data as you like.
If you intend to query the information, you might choose to put it into a database instead. There are a number of different database options available, depending on your needs.
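If the file route fits, a minimal boto3 sketch (the bucket name my-data-bucket and the local CSV path are placeholders) is a single upload call:

# Sketch: copy a locally generated CSV to Amazon S3.
# "my-data-bucket" and the local path are placeholders.
import boto3

s3 = boto3.client('s3')
s3.upload_file('/tmp/data.csv', 'my-data-bucket', 'data.csv')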
Related
The data source is a SaaS server's API endpoints; the aim is to use Python to move the data into an AWS S3 bucket (with Python's boto3 library).
The API is accessed via an authorized username/password combination and a unique api-key.
Then, on each initial API call, a token must be fetched before further info can be retrieved.
I have 2 questions:
How should I manage those secrets: save them to a config file (*.ini, *.json, *.yaml) or store them via AWS Secrets Manager?
The token part is a bit challenging: the basic way is, for each endpoint, to fetch a new token and then make the API call.
That ends up being far too many pipelines (e.g. if info from 100 endpoints is needed per downstream business needs),
so I would need to craft a universal template pipeline and repeat it 100 times.
I am new to the Python programming world; feel free to comment and share any use case.
Much appreciated!
I searched and read these show-cases:
saving from api to s3 bucket [saving-from-api-to-s3-bucket/74648533]
and
How to write a file or data to an S3 object using boto3 [how-to-write-a-file-or-data-to-an-s3-object-using-boto3]
I found this has been helpful:
Python-decouple summary: store parameters in .ini or .env files.
A few options for managing (hiding) sensitive info:
a. IAM role
b. Store secrets using Parameter Store
c. Store secrets using Secrets Manager - the method currently recommended by AWS
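As a hedged sketch of option c (the secret name saas/api-credentials and the JSON field names are placeholders, not anything from the question), fetching the credentials with boto3 looks like this:

# Sketch: read API credentials from AWS Secrets Manager.
# The secret name and JSON field names are placeholders.
import json
import boto3

secrets = boto3.client('secretsmanager')
response = secrets.get_secret_value(SecretId='saas/api-credentials')
credentials = json.loads(response['SecretString'])

username = credentials['username']
api_key = credentials['api_key']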
I am looking for a way to perform the equivalent of the AWS CLI's method aws configure get varname [--profile profile-name] using boto3 in Python. Does anyone know if this is possible without either:
Parsing the AWS config file myself
Somehow interacting with the AWS CLI itself from my python script
For more context, I am writing a python cli tool that will interact with AWS APIs using boto3. The python tool uses an AWS session token stored in a profile in the ~/.aws/credentials file. I am using the saml2aws cli to fetch AWS credentials from my company's identity provider, which writes the aws_access_key_id, aws_secret_access_key, aws_session_token, aws_security_token, x_principal_arn, and x_security_token_expires parameters to the ~/.aws/credentials file like so:
[saml]
aws_access_key_id = #REMOVED#
aws_secret_access_key = #REMOVED#
aws_session_token = #REMOVED#
aws_security_token = #REMOVED#
x_principal_arn = arn:aws:sts::000000000123:assumed-role/MyAssumedRole
x_security_token_expires = 2019-08-19T15:00:56-06:00
By the nature of my python cli tool, sometimes the tool will execute past the expiration time of the AWS session token, which is enforced to be quite short by my company. I want the python cli tool to check the expiration time before it starts its critical task, to verify that it has enough time to complete that task, and if not, alert the user to refresh their session token.
Using the AWS CLI, I can fetch the expiration time of the AWS session token from the ~/.aws/credentials file like this:
$ aws configure get x_security_token_expires --profile saml
2019-08-19T15:00:56-06:00
and I am curious if boto3 has a mechanism I was unable to find to do something similar.
As an alternate solution, given an already generated AWS session token, is it possible to fetch the expiration time of it? However, given the lack of answers on questions such as Ways to find out how soon the AWS session expires?, I would guess not.
Since the official AWS CLI is powered by boto3, I was able to dig into the source to find out how aws configure get is implemented. It's possible to read the profile configuration through the botocore Session object. Here is some code to get the config profile and value used in your example:
import botocore.session
# Create an empty botocore session directly
session = botocore.session.Session()
# Get config of desired profile. full_config is a standard python dictionary.
profiles_config = session.full_config.get("profiles", {})
saml_config = profiles_config.get("saml", {})
# Get config value. This will be None if the setting doesn't exist.
saml_security_token_expires = saml_config.get("x_security_token_expires")
I'm using code similar to the above as part of a transparent session cache. It checks for a profile's role_arn so I can identify a cached session to load if one exists and hasn't expired.
As far as the alternate question of knowing how long a given session has before expiring, you are correct in that there is currently no API call that can tell you this. Session expiration is only given when the session is created, either through STS get_session_token or assume_role API calls. You have to hold onto the expiration info yourself after that.
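For example, if you were creating the session yourself rather than via saml2aws, the expiration comes back in the STS response and you would need to store it yourself (this is a sketch with placeholder names, not what saml2aws does internally):

# Sketch: capture the expiration returned by STS when assuming a role.
# The role ARN and session name are placeholders.
import boto3

sts = boto3.client('sts')
response = sts.assume_role(
    RoleArn='arn:aws:iam::000000000123:role/MyAssumedRole',
    RoleSessionName='my-cli-tool'
)

expiration = response['Credentials']['Expiration']  # timezone-aware datetime
print('Session expires at:', expiration.isoformat())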
I have a very big folder in Google Cloud Storage and I am currently deleting the folder with the following Django/Python code, while using Google App Engine with a 30-second default HTTP timeout.
import logging

def deleteStorageFolder(bucketName, folder):
    from google.cloud import storage
    cloudStorageClient = storage.Client()
    bucket = cloudStorageClient.bucket(bucketName)
    logging.info("Deleting : " + folder)
    try:
        bucket.delete_blobs(blobs=bucket.list_blobs(prefix=folder))
    except Exception as e:
        logging.info(str(e))
It is really unbelievable that Google Cloud is expecting the application to request the information for the objects inside the folder one by one and then delete them one by one.
Obviously, this fails due to the timeout. What would be the best strategy here ?
(There should be a way to delete the parent object in the bucket so that it deletes all the associated child objects somewhere in the background, and we just remove the associated data from our model; Google Storage would then be free to delete the data whenever it wants. Yet, per my understanding, this is not how things are implemented.)
2 simple options in my mind until the client library supports deleting in batch - see https://issuetracker.google.com/issues/142641783 :
if the GAE image includes the gsutil cli, you could execute gsutil -m rm ... in a subprocess
my favorite: use the gcsfs library instead of the Google client library. It supports batch-deleting by default - see https://gcsfs.readthedocs.io/en/latest/_modules/gcsfs/core.html#GCSFileSystem.rm (a rough sketch follows below)
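A minimal sketch of the gcsfs option, assuming the bucket name and folder prefix are plain placeholder strings:

# Sketch: batch-delete a "folder" (prefix) with gcsfs.
# "my-bucket/my-folder" is a placeholder path.
import gcsfs

fs = gcsfs.GCSFileSystem()
# recursive=True deletes everything under the prefix using batched calls
fs.rm('my-bucket/my-folder', recursive=True)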
There is a workaround. You can do this in 2 steps:
"Move" the files you want to delete into another bucket with Storage Transfer Service.
Create a transfer from your bucket, with the filters that you want, to another bucket (create a temporary one if needed). Check the "delete from source after transfer" checkbox.
After the successful transfer, delete the temporary bucket. If that takes too long, there is another workaround:
Go to the bucket page.
Click on Lifecycle.
Set up a lifecycle rule that deletes files with age > 0 days.
In both cases, you rely on Google Cloud's batch features, because doing it yourself is far, far too slow!
For weather processing purposes, I am looking to automatically retrieve daily weather forecast data into Google Cloud Storage.
The files are available at a public HTTP URL (http://dcpc-nwp.meteo.fr/openwis-user-portal/srv/en/main.home), but they are very large (between 30 and 300 megabytes). The size of the files is the main issue.
After looking at previous stackoverflow topics, I have tried two unsuccessful methods:
1/ First attempt via urlfetch in Google App Engine
from google.appengine.api import urlfetch
url = "http://dcpc-nwp.meteo.fr/servic..."
result = urlfetch.fetch(url)
[...] # Code to save in a Google Cloud Storage bucket
But I get the following error message on the urlfetch line:
DeadlineExceededError: Deadline exceeded while waiting for HTTP response from URL
2/ Second attempt via the Cloud Storage Transfer Service
According to the documentation, it is possible to retrieve HTTP data into Cloud Storage directly via the Cloud Storage Transfer Service:
https://cloud.google.com/storage/transfer/reference/rest/v1/TransferSpec#httpdata
But it requires the size and MD5 of the files before the download. This option cannot work in my case because the website does not provide that information.
3/ Any ideas?
Do you see any solution to retrieve automatically large file on HTTP into my Cloud Storage bucket?
3/ Workaround with a Compute Engine instance
Since it was not possible to retrieve large files from external HTTP with App Engine or directly with Cloud Storage, I have used a workaround with an always-running Compute Engine instance.
This instance regularly checks if new weather files are available, downloads them and uploads them to a Cloud Storage bucket.
For scalability, maintenance and cost reasons, I would have preferred to use only serverless services, but fortunately:
It works well on a fresh f1-micro Compute Engine instance (no extra package required and only $4/month if running 24/7).
The network traffic from Compute Engine to Google Cloud Storage is free if the instance and the bucket are in the same region ($0/month).
The MD5 and size of the file can be retrieved easily and quickly using the curl -I command, as mentioned in this link: https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests.
The Storage Transfer Service can then be configured to use that information.
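As a hedged sketch of the same idea in Python (whether the server actually returns Content-Length or an MD5-style header depends entirely on the site; the URL and header names here are assumptions):

# Sketch: read size (and, if the server exposes it, an MD5 header)
# from a HEAD request. Header availability depends on the server.
import requests

url = 'http://dcpc-nwp.meteo.fr/...'  # placeholder URL
head = requests.head(url, allow_redirects=True)

size = head.headers.get('Content-Length')
md5 = head.headers.get('Content-MD5')  # many servers do not send this
print('size:', size, 'md5:', md5)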
Another option would be to use a serverless Cloud Function. It could look something like the code below in Python.
import requests

def download_url_file(url):
    output_filename = None
    try:
        print('[ INFO ] Downloading {}'.format(url))
        req = requests.get(url)
        if req.status_code == 200:
            # Download and save to /tmp
            output_filepath = '/tmp/{}'.format(url.split('/')[-1])
            output_filename = '{}'.format(url.split('/')[-1])
            open(output_filepath, 'wb').write(req.content)
            print('[ INFO ] Successfully downloaded to output_filepath: {} & output_filename: {}'.format(output_filepath, output_filename))
            return output_filename
        else:
            print('[ ERROR ] Status Code: {}'.format(req.status_code))
    except Exception as e:
        print('[ ERROR ] {}'.format(e))
    return output_filename
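The function above only lands the file in /tmp; to finish the job inside the Cloud Function you would still need to push it to a bucket, roughly like this (the bucket name is a placeholder, and this assumes the google-cloud-storage client library is available):

# Sketch: upload the downloaded file from /tmp to a Cloud Storage bucket.
# "my-weather-bucket" is a placeholder bucket name.
from google.cloud import storage

def upload_to_gcs(local_path, blob_name):
    client = storage.Client()
    bucket = client.bucket('my-weather-bucket')
    blob = bucket.blob(blob_name)
    blob.upload_from_filename(local_path)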
Currently, the MD5 and size are required for Google's Transfer Service; we understand that in cases like yours, this can be difficult to work with, but unfortunately we don't have a great solution today.
Unless you're able to get the size and MD5 by downloading the files yourself (temporarily), I think that's the best you can do.
I am using Redshift and have to write some custom scripts to generate reports. I am using AWS Data Pipeline's CustomShellActivity to run my custom logic. I am using Python and boto3. I am wondering what is the safest way, and in fact the best practice, to provide a database password in a Python script. I am sure that hardcoding the password in the script is not good practice. What other options do I have or should I explore?
A pretty standard approach is to store credentials in a secure S3 bucket and download them as part of the deployment/launch process using an IAM role with access to the secure bucket. For limited runtime cases like Lambda or Data Pipeline you could download from S3 to an in-memory buffer using boto.Key.get_contents_as_string() at startup, parse the file and set up your credentials.
For increased security you can incorporate KMS secret management. Here is an example that combines the two.
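Translated to boto3 (the bucket and key names are placeholders, and the JSON layout of the credentials file is an assumption), reading the credentials into memory at startup might look like:

# Sketch: load credentials from a secured S3 object into memory at startup.
# Bucket/key names and the JSON structure are placeholders.
import json
import boto3

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='my-secure-config-bucket', Key='redshift/credentials.json')
credentials = json.loads(obj['Body'].read())

db_password = credentials['password']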
I usually store them as environment variables. I am not sure about the AWS Data Pipeline deployment, but on a standard Linux box (EC2), you could do:
# ~/.profile or /etc/profile
export MY_VAR="my_value"
And then you can access them in Python like this:
# python script
import os
my_var_value = os.environ.get('MY_VAR', 'default')