Combining many log files in Amazon S3 and read in locally - python

I have a log file stored in Amazon S3 every 10 minutes. I am trying to access weeks' and months' worth of these log files and read them into Python.
I have used boto to open and read every key and append all the logs together, but it's far too slow. I am looking for an alternative solution. Do you have any suggestions?

There is no functionality on Amazon S3 to combine or manipulate files.
I would recommend using the AWS Command-Line Interface (CLI) to synchronize files to a local directory using the aws s3 sync command. This can copy files in parallel and supports multi-part transfer for large files.
Running that command regularly can bring down a copy of the files, then your app can combine the files rather quickly.
If you do this from an Amazon EC2 instance, there is no charge for data transfer. If you download to a computer via the Internet, then Data Transfer charges apply.
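For example, a rough sketch of that approach (bucket name, prefix, and local paths below are placeholders) that shells out to aws s3 sync and then concatenates the downloaded files:

```python
import glob
import subprocess

# Hypothetical bucket/prefix and local paths -- adjust to your layout.
BUCKET_PREFIX = "s3://my-log-bucket/logs/"
LOCAL_DIR = "./logs"

# Pull down only new/changed objects; the CLI parallelizes transfers for us.
subprocess.run(["aws", "s3", "sync", BUCKET_PREFIX, LOCAL_DIR], check=True)

# Concatenate all synced log files into a single local file.
with open("combined.log", "w") as combined:
    for path in sorted(glob.glob(f"{LOCAL_DIR}/**/*.log", recursive=True)):
        with open(path) as f:
            combined.write(f.read())
```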

Your first problem is that your naive solution is probably only using a single connection and isn't making full use of your network bandwidth. You can try to roll your own multi-threading support, but it's probably better to experiment with existing clients that already do this (s4cmd, aws-cli, s3gof3r).
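If you do want to roll your own concurrency with boto3, a minimal sketch using a thread pool might look like the following (bucket and prefix are placeholders; low-level boto3 clients are safe to share across threads):

```python
from concurrent.futures import ThreadPoolExecutor

import boto3

s3 = boto3.client("s3")
BUCKET = "my-log-bucket"   # placeholder
PREFIX = "logs/2017/"      # placeholder

def fetch(key):
    # Each worker thread pulls one object body into memory.
    return s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()

# List all keys under the prefix (paginated).
keys = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    keys.extend(obj["Key"] for obj in page.get("Contents", []))

# Download many objects concurrently and join them into one blob.
with ThreadPoolExecutor(max_workers=32) as pool:
    combined = b"".join(pool.map(fetch, keys))
```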
Once you're making full use of your bandwidth, there are then some further tricks you can use to boost your transfer speed to S3.
Tip 1 of this SumoLogic article has some good info on these first two areas of optimization.
Also, note that you'll need to modify your key layout if you hope to consistently get above 100 requests per second.
Given a year's worth of this log file is only ~50k objects, a multi-connection client on a fast ec2 instance should be workable. However, if that's not cutting it, the next step up is to use EMR. For instance, you can use S3DistCP to concatenate your log chunks into larger objects that should be faster to pull down. (Or see this AWS Big Data blog post for some crazy overengineering) Alternatively, you can do your log processing in EMR with something like mrjob.
Finally, there's also Amazon's new Athena product that allows you to query data stored in S3 and may be appropriate for your needs.

Related

Google Cloud Storage JSONs to Pandas Dataframe to Warehouse

I am a newbie in ETL. I just managed to extract a lot of information in the form of JSONs to GCS. Each JSON file includes identical key-value pairs, and now I would like to transform them into dataframes on the basis of certain key values.
The next step would be loading this into a data warehouse like Clickhouse, I guess? I was not able to find any tutorials on this process.
TLDR 1) Is there a way to transform JSON data on GCS in Python without downloading the whole data?
TLDR 2) How can I set this up to run periodically or in real time?
TLDR 3) How can I go about loading the data into a warehouse?
If these are too much, I would love it if you could point me to resources on this. Appreciate the help.
There are some ways to do this.
You can add files to storage, then a Cloud Function is triggered every time a new file is added (https://cloud.google.com/functions/docs/calling/storage) and calls an endpoint in Cloud Run (the container service - https://cloud.google.com/run/docs/building/containers) running a Python application that transforms these JSONs into a dataframe. Note that the container image will be stored in Container Registry. The Python application running on Cloud Run then saves the rows incrementally to BigQuery (the warehouse). After that you can build analytics on top with Looker Studio.
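As a rough illustration of that transform-and-load step, assuming the google-cloud-storage, google-cloud-bigquery, and pandas libraries (bucket, prefix, table, and key names are placeholders):

```python
import json

import pandas as pd
from google.cloud import bigquery, storage

# Placeholder names -- replace with your actual bucket, prefix, and table.
BUCKET = "my-json-bucket"
PREFIX = "exports/"
TABLE = "my_project.my_dataset.events"

storage_client = storage.Client()
bq_client = bigquery.Client()

# Read each JSON blob straight from GCS into memory (no local files)
# and keep only the keys we care about.
rows = []
for blob in storage_client.list_blobs(BUCKET, prefix=PREFIX):
    record = json.loads(blob.download_as_text())
    rows.append({"id": record["id"], "value": record["value"]})  # example keys

df = pd.DataFrame(rows)

# Append the dataframe to the warehouse table.
bq_client.load_table_from_dataframe(df, TABLE).result()
```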
If you need to scale the solution to millions/billions of rows, you can add files to storage, have a Cloud Function triggered, and have it call Dataproc, a service where you can run Python, Anaconda, etc. (How to call google dataproc job from google cloud function). This Dataproc cluster will then structure the JSONs as a dataframe and save them to the warehouse (BigQuery).

AWS MediaConvert Docs are obnoxious and unclear

I have been messing with AWS MediaConvert via boto3, the Python library, and I find the docs incredibly confusing.
There are so many settings.
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/mediaconvert.html
and Amazon does an absolutely terrible job of labeling what is necessary to do a basic job.
What would be the correct JSON for a simple job:
taking a video-with-audio file and turning it into a CMAF file, and
taking an audio-only file and turning it into a CMAF file?
I am trying to establish the baseline use of this technology, and there is so much extra that I don't know what I absolutely need and which settings are only for specific use cases.
The fix for this is to build the job in the MediaConvert UI in the AWS console, use the "Copy JSON" button to save it, and then use the JSON it created.
Never mind trying to write it yourself, unless you like pain.
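Once you have that JSON, submitting it with boto3 is comparatively simple. A minimal sketch; the role ARN is a placeholder, and it assumes the exported file contains a top-level "Settings" key:

```python
import json

import boto3

# MediaConvert requires an account-specific endpoint, discovered at runtime.
mc = boto3.client("mediaconvert", region_name="us-east-1")
endpoint = mc.describe_endpoints()["Endpoints"][0]["Url"]
mc = boto3.client("mediaconvert", region_name="us-east-1", endpoint_url=endpoint)

# Job settings exported from the console via the "Copy JSON" button.
# Assumes the exported file has a top-level "Settings" key.
with open("job.json") as f:
    settings = json.load(f)["Settings"]

# Placeholder role ARN -- it must allow MediaConvert to read/write your buckets.
response = mc.create_job(
    Role="arn:aws:iam::123456789012:role/MediaConvertRole",
    Settings=settings,
)
print(response["Job"]["Id"])
```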

SageMaker experiments store

I just started using AWS SageMaker for running and maintaining models and experiments. I just wanted to know: is there any persistence layer for SageMaker from which I can get data about my experiments/models instead of looking in SageMaker Studio? Does SageMaker save the experiments or their data (like an S3 location) in any table, something like ModelDB?
SageMaker Studio uses the SageMaker API to pull all of the data it's displaying; essentially, there's no secret API getting invoked here.
Quite a bit of what's displayed with respect to experiments comes from search results, with the rest coming from either List* or Describe* calls. Studio takes the results from the search request and displays them in the table format that you're seeing. Search results when searching over the resource ExperimentTrialComponent that have a source (such as a training job) will be enhanced with the original source's data ([result]::SourceDetail::TrainingJob) where supported (work is ongoing to add additional source-detail resource types).
All of the metadata that is related to resources in SageMaker is available via the APIs; there is no other location (in the cloud) like s3 for that data.
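For example, a rough sketch of pulling experiment data yourself with the boto3 Search API (the experiment name is a placeholder, and the exact filter name is an assumption):

```python
import boto3

sm = boto3.client("sagemaker")

# Search trial components belonging to a given experiment.
# "my-experiment" and the "Parents.ExperimentName" filter are assumptions.
response = sm.search(
    Resource="ExperimentTrialComponent",
    SearchExpression={
        "Filters": [
            {
                "Name": "Parents.ExperimentName",
                "Operator": "Equals",
                "Value": "my-experiment",
            }
        ]
    },
)

# Each result wraps a TrialComponent record with its metrics and parameters.
for result in response["Results"]:
    tc = result["TrialComponent"]
    print(tc["TrialComponentName"], tc.get("Metrics", []))
```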
As of this time there is no effort, that I'm aware of, to determine whether it's possible to add SageMaker support to ModelDB. Given that ModelDB appears to assume it's talking to a relational database, that seems unlikely to be doable. (I only read the overview very quickly, so this might be inaccurate.)

Create a zip with files from an AWS S3 path

Is there a way to provide a single URL for a user to download all the content from an S3 path?
Otherwise, is there a way to create a zip with all files found on an S3 path recursively?
ie. my-bucket/media/123/*
Each path usually has 1K+ images and 10+ videos.
There's no built-in way. You have to download all the files, compress them "locally", re-upload the archive, and then you'll have a single URL for download.
As mentioned before, there's no built-in way to do it. On the other hand, you don't need to download and re-upload your files yourself; you could build a serverless solution in the same AWS region/location.
You could implement it in different ways:
API Gateway + Lambda Function
In this case, you trigger your Lambda function via API Gateway. The Lambda function creates an archive from your bucket's files, uploads the result back to S3, and returns a URL to this archive***.
Drawbacks of this approach: Lambda can't execute for more than 5 minutes, and if you have too many files it will not have enough time to process them. Also be aware that the S3 max object size is 5 terabytes; the largest object that can be uploaded in a single PUT is 5 gigabytes, and for objects larger than 100 megabytes you should consider using the multipart upload capability.
Example: Full guide to developing REST API’s with AWS API Gateway and AWS Lambda
Step Function (API Gateway + Lambda Function that calls Step Function)
5 minutes should be enough to create an archive, but if you are going to do some preprocessing I recommend using Step Functions. Step Functions have limits on the maximum number of registered activities/states and on request size (you can't pass your archive in a request), but these are easy to work around if you take them into consideration during design. Check out more there.
Personally, I am using both ways for different cases.
*** It is bad practice to give the user a path to your real file on S3. It is better to use the CloudFront CDN: CloudFront allows you to control the lifetime of the URL and provides different kinds of security and restrictions.
There is no single call you can make to S3 to download a path as a .zip. You would have to create a service that recursively downloads all of the objects and compresses them. It is important to keep in mind the size limit of S3 objects, though: the limit is 5 TB per object. You will want to add a check to verify the size of the .zip before re-uploading.
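A minimal in-memory sketch of that list-zip-reupload flow (names are placeholders; a real Lambda would need to watch memory and the execution time limit, and stream or split very large prefixes):

```python
import io
import zipfile

import boto3

s3 = boto3.client("s3")

def zip_prefix(bucket, prefix, zip_key):
    """Zip every object under `prefix` and upload the archive back to `bucket`."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
            for obj in page.get("Contents", []):
                body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
                archive.writestr(obj["Key"], body)

    buffer.seek(0)
    s3.put_object(Bucket=bucket, Key=zip_key, Body=buffer.getvalue())

    # Hand back a time-limited link instead of a public path.
    return s3.generate_presigned_url(
        "get_object", Params={"Bucket": bucket, "Key": zip_key}, ExpiresIn=3600
    )

# Example: url = zip_prefix("my-bucket", "media/123/", "media/123.zip")
```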

Using EC2 with auto scaling group for a batch image-processing application on AWS

I am new to AWS and am trying to port a python-based image processing application to the cloud. Our application scenario is similar to the Batch Processing scenario described here
[media.amazonwebservices.com/architecturecenter/AWS_ac_ra_batch_03.pdf]
Specifically, the steps involved are:
1. Receive a large number of images (>1000) and one CSV file containing image metadata.
2. Parse the CSV file and create a database (using DynamoDB).
3. Push images to the cloud (using S3), and push messages of the form (bucketname, keyname) to an input queue (using SQS).
4. "Pop" messages from the input queue.
5. Fetch the appropriate image data from S3, and metadata from DynamoDB.
6. Do the processing.
7. Update the corresponding entry for that image in DynamoDB.
8. Save results to S3.
9. Save a message in an output queue (SQS) which feeds the next part of the pipeline.
Steps 4-9 would involve the use of EC2 instances.
From the boto documentation and tutorials online, I have understood how to incorporate S3, SQS, and DynamoDB into the pipeline. However, I am unclear on how exactly to proceed with the EC2 part. I tried looking at some example implementations online, but couldn't figure out what the EC2 machine should do to make our batch image-processing application work. The two approaches I have found are:
1. Use a BOOTSTRAP_SCRIPT with an infinite loop that constantly polls the input queue and processes messages if available. This is what I think is being done in the Django-PDF example on the AWS blog: http://aws.amazon.com/articles/Python/3998
2. Use boto.services to take care of all the details of reading messages, retrieving and storing files in S3, writing messages, etc. This is what is used in the Monster Muck Mash-up example: http://aws.amazon.com/articles/Python/691
Which of the above methods is preferred for batch-processing applications, or is there a better way? Also, for each of the above, how do I incorporate the use of an Auto Scaling group to manage EC2 machines based on the load in the input queue?
Any help in this regards would be really appreciated. Thank you.
You should write an application (using Python and boto, for example) that will do the SQS polling and interact with S3 and DynamoDB.
This application must be installed at boot time on the EC2 instance. Several options are available (CloudFormation, Chef, cloud-init with user data, or a custom AMI), but I would suggest you start with user data as described here: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/user-data.html
You also must ensure your instances have the proper privileges to talk to S3, SQS, and DynamoDB. You must create IAM permissions for this, then attach the permissions to a role and the role to your instance. The detailed procedure is available in the docs at http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html
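A minimal sketch of such a polling worker, written with boto3 rather than the older boto the question mentions (queue, bucket, and table names are placeholders, and the processing step is whatever your application does):

```python
import boto3

sqs = boto3.resource("sqs")
s3 = boto3.client("s3")
dynamodb = boto3.resource("dynamodb")

# Placeholder resource names.
queue = sqs.get_queue_by_name(QueueName="input-queue")
table = dynamodb.Table("image-metadata")

while True:
    # Long-poll the input queue for work.
    for message in queue.receive_messages(WaitTimeSeconds=20, MaxNumberOfMessages=10):
        bucket, key = message.body.split(",")

        # Fetch the image from S3 and its metadata from DynamoDB.
        image_bytes = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        metadata = table.get_item(Key={"image_key": key}).get("Item", {})

        # ... do the actual image processing here ...

        # Record completion and remove the message from the queue.
        table.update_item(
            Key={"image_key": key},
            UpdateExpression="SET #s = :done",
            ExpressionAttributeNames={"#s": "status"},
            ExpressionAttributeValues={":done": "processed"},
        )
        message.delete()
```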
