AWS S3 continuous byte stream upload? - python

I am doing some batch processing and occasionally end up with a corrupt line of (string) data. I would like to upload these lines to a file in S3.
Now, I would really like to add all the lines to a single file and upload it in one go after my script finishes executing, but my client asked me to use a socket connection instead and add each line one by one as it comes up, simulating a single slow upload.
It sounds like he's done this before, but I couldn't find any reference to anything like it (I'm not talking about multipart uploads). Has anyone done something like this before?
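For reference, the simpler batch approach described first (collect the corrupt lines and upload them in one go at the end of the run) is a single put_object call. A minimal sketch, with placeholder bucket and key names:

```python
import boto3

# Placeholder bucket/key names -- adjust for your environment.
BUCKET = "my-batch-processing-bucket"
KEY = "corrupt-lines/run-001.txt"

corrupt_lines = []  # filled in as the batch job finds bad records

def record_corrupt_line(line: str) -> None:
    corrupt_lines.append(line)

def upload_corrupt_lines() -> None:
    """Upload all collected lines as a single S3 object at the end of the run."""
    body = "\n".join(corrupt_lines).encode("utf-8")
    s3 = boto3.client("s3")
    s3.put_object(Bucket=BUCKET, Key=KEY, Body=body)
```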

Related

Dataflow streaming job processes the same element many times at the same time

Short description:
Dataflow is processing the same input element many times, even at the same time in parallel (so this is not Dataflow's built-in fail-and-retry mechanism, because the previous attempt didn't fail).
Long description:
The pipeline receives a Pub/Sub message that contains the path to a GCS file.
In the next step (a DoFn) this file is opened and read line by line, so for very big files this is a long process and takes up to an hour per file.
Very often those big files are being processed several times at once.
I can see from the log messages that the first process has already loaded 500k rows, another 300k rows, and a third has just started; all of them relate to the same file and all of them are based on the same Pub/Sub message (the same message_id).
The Pub/Sub charts also look bad: those messages are never acked, so the unacked-message chart does not decrease.
Any idea what is going on? Have you experienced something similar?
I want to underline that this is not an issue related to the fail-and-retry process.
If the first process fails and a second one starts for the same file, that is fine and expected.
What is unexpected is that those two processes run at the same time.
When a file is added to Cloud Storage and fires an automatic notification to Pub/Sub, multiple notifications can be sent:
- OBJECT_FINALIZE Sent when a new object (or a new generation of an existing object) is successfully created in the bucket. This includes copying or rewriting an existing object. A failed upload does not trigger this event.
- OBJECT_METADATA_UPDATE Sent when the metadata of an existing object changes.
...
pubsub-notifications doc
You can access the attributes of the PubsubMessage in Beam and filter for messages whose eventType attribute has the value OBJECT_FINALIZE.
In that case, only one message per file will be processed by your Dataflow job, and the DoFn will open the file and process its elements only once.
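A rough sketch of that filter in the Beam Python SDK, assuming the standard GCS notification attributes (eventType, bucketId, objectId) and a placeholder subscription name:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder subscription name -- replace with your own.
SUBSCRIPTION = "projects/my-project/subscriptions/gcs-notifications"

def run():
    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadPubSub" >> beam.io.ReadFromPubSub(
                subscription=SUBSCRIPTION, with_attributes=True)
            # Keep only OBJECT_FINALIZE notifications, i.e. one per new object.
            | "OnlyFinalize" >> beam.Filter(
                lambda msg: msg.attributes.get("eventType") == "OBJECT_FINALIZE")
            # Extract the GCS path of the new file from the notification attributes.
            | "ToGcsPath" >> beam.Map(
                lambda msg: "gs://{}/{}".format(
                    msg.attributes["bucketId"], msg.attributes["objectId"]))
            # ... downstream processing of the file goes here.
        )
```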
Here is a likely possibility:
- reading the file is being "fused" with reading the message from Cloud Pub/Sub, so the hour of processing happens before the result is saved to Dataflow's internal storage and the message can be ACKed
- since your processing is so long, Cloud Pub/Sub will deliver the message again
- Dataflow has no way to cancel your DoFn processing, so you will see both of them processing at the same time, even though one of them has expired and will be rejected when processing is complete
What you really want is for the large file reads to be split and parallelized. Beam can do this easily (and is currently, I believe, the only framework that can). You pass the filenames to the TextIO.readFiles() transform, and the reading of each large file will be split and performed in parallel, with enough checkpointing that the Pub/Sub message will be ACKed before it expires.
One thing you might try is to put a Reshuffle in between the PubsubIO.read() and your processing.
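The transform names above are from the Java SDK. A rough Python SDK equivalent of both suggestions, assuming the Pub/Sub messages have already been mapped to gs:// paths as in the earlier snippet, might look like this:

```python
import apache_beam as beam

# `paths` is assumed to be a PCollection of gs://bucket/file paths extracted
# from the Pub/Sub messages.
def expand_reading(paths):
    return (
        paths
        # Break fusion so the Pub/Sub read can be checkpointed (and ACKed)
        # independently of the long file-reading step.
        | "BreakFusion" >> beam.Reshuffle()
        # Let Beam split and parallelize the reading of each large file.
        | "ReadLines" >> beam.io.ReadAllFromText()
    )
```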

Buffer for continuously generated CSV files to upload to MongoDB

I'm trying to figure out a way, in my Flask application, to store the multiple CSVs that are continuously produced by each thread in a buffer before uploading them to a MongoDB database. The reason I would like to use a buffer is to guarantee some level of persistence and proper error handling (in case of a network failure, I want to retry uploading the CSV to Mongo).
I thought about using a task queue such as Celery with a message broker (RabbitMQ), but I wasn't sure if that was the right way to go. Sorry if this isn't a question suitable for SO -- I just wanted clarification on how to go about doing this. Thank you in advance.
Sounds like you want something like the Linux tail command. Tail prints each new line of a file as soon as it is written. I'm assuming this CSV file is generated by a separate program that is running at the same time. See How can I tail a log file in Python? for how to implement tail in Python.
Note: you might be better off dumping the CSVs in batches; it won't be real-time, but if that's not important, it will be more efficient.
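A minimal tail-style follower in Python, assuming the CSV is appended to by a separate process; the path and poll interval are placeholders:

```python
import time

def follow(path, poll_seconds=1.0):
    """Yield new lines appended to `path`, similar to `tail -f`."""
    with open(path, "r") as f:
        f.seek(0, 2)  # start at the current end of the file
        while True:
            line = f.readline()
            if not line:
                time.sleep(poll_seconds)  # nothing new yet, wait and retry
                continue
            yield line

# Example usage: push each new CSV row into a buffer or straight to MongoDB.
# for row in follow("/tmp/output.csv"):
#     handle_row(row)
```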

How do I transfer files from s3 to my ec2 instance whenever I add a new file to s3?

I have a Python script on my EC2 instance that requires a video file from an S3 bucket as input. How do I automate the process so the EC2 instance starts running every time a new file is added to that bucket? I want the EC2 instance to recognize the new file, copy it to a local directory where the script can process it and create the output file, and then send this output file back to the bucket and store it there.
I know the boto3 library is used to connect S3 and EC2; however, I am unclear how to trigger this automatically and watch for new files without having to manually start my instance and copy everything.
Edit:
I have a Python program which takes a video file (mp4), breaks it into frames, and stitches them together to create a number of small panorama images, which it stores in a folder named 'Output'. As the program needs a video as input, it refers to a particular directory from which it is supposed to pick up the mp4 file and read it. So what I now want is this: there is going to be an S3 bucket that receives a video file from elsewhere, inside a folder within that bucket. I want any new mp4 file entering that bucket to be copied to the input directory on my instance. When this happens, I want the Python program stored on that instance to be executed automatically, find the new video file in the input directory, process it into the small panoramas, and store them in the output directory -- or, even better, send them to an output folder in the same S3 bucket.
There are many ways you could design a solution for this. They will vary depending on how often you receive videos, whether the solution should be scalable and fault tolerant, how many videos you want to process in parallel, and more. I will just provide one, on the assumption that new videos are uploaded occasionally and no auto-scaling groups are needed to process a large number of videos at the same time.
On the above assumption, one way could be as follows:
- Upload of a new video triggers a Lambda function via S3 event notifications.
- The Lambda gets the video details (e.g. the S3 path) from the S3 event, submits them to an SQS queue, and starts your instance (sketched below).
- Your application on the instance, once started, polls the SQS queue for the details of the video file to process. This requires your application to be designed to start at instance start, which can be done using modified user data, systemd unit files, and more.
It's a very basic solution, and as I mentioned, many other ways are possible, involving auto-scaling groups, scaling policies based on SQS queue size, SSM Run Command, and more.
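A minimal sketch of the Lambda step in that design, assuming the standard S3 event notification payload; the queue URL and instance ID are placeholders:

```python
import json
import boto3

sqs = boto3.client("sqs")
ec2 = boto3.client("ec2")

# Placeholders -- replace with your own resources.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/video-jobs"
INSTANCE_ID = "i-0123456789abcdef0"

def handler(event, context):
    """Triggered by an S3 event notification for new .mp4 uploads."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Queue the video details for the instance to pick up.
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps({"bucket": bucket, "key": key}),
        )
    # Start the processing instance (a no-op if it is already running).
    ec2.start_instances(InstanceIds=[INSTANCE_ID])
```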

Is there a way to stream very large amounts of upload data directly from /dev/urandom to AWS S3?

I'm trying to run some tests for upload speeds on AWS S3 for very large files (500GB-5TB). I'm currently using boto3, the AWS SDK for Python. Rather than creating and storing massive files on my own hard drive, I'd prefer to stream directly from /dev/urandom (or at least /dev/zero). boto3's put_object() can upload data from a stream, but it seems to have a hard limit of 5GB, which is far less than I need to test.
I tried boto3's upload_fileobj(), which handles larger objects by using multipart uploads automatically. It works just fine on actual files, but I can't seem to figure out a way to get it to upload data directly from a stream. I also looked at using the AWS S3 Command Line Interface (CLI) instead of the boto3 SDK, but again couldn't figure out a way to upload data directly from a stream.
Is there a comparatively easy way to upload a large amount of data to AWS S3 directly from /dev/urandom?
You don't want to stream directly from /dev/urandom, because it is actually CPU-limited rather than IO-limited (you can see this by running top while using dd to stream random data into a file, or by comparing times to copy an existing 1GB file that's not already in disk cache).
Using boto3, the calls you want are create_multipart_upload to initiate the upload, upload_part to send each part, and complete_multipart_upload to finish the upload. You can pass either a file or a byte array to upload_part, so you can either generate a byte array using the built-in random number generator (which will be sufficiently random to defeat gzip compression) or repeatedly read the same file (in similar tests I use a 1GB file containing data from /dev/urandom; gzip isn't going to give you any compression over that large an input space).
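A minimal sketch of that multipart flow, generating each part with the built-in random number generator instead of reading /dev/urandom; the bucket, key, and sizes are placeholders:

```python
import random
import boto3

# Placeholders -- adjust bucket/key and sizes for your test.
BUCKET = "my-test-bucket"
KEY = "upload-speed-test"
PART_SIZE = 100 * 1024 * 1024   # 100 MiB per part (parts must be >= 5 MiB)
NUM_PARTS = 50                  # total object size = NUM_PARTS * PART_SIZE

s3 = boto3.client("s3")
upload = s3.create_multipart_upload(Bucket=BUCKET, Key=KEY)

parts = []
for part_number in range(1, NUM_PARTS + 1):
    # Built-in RNG (Python 3.9+): incompressible data without touching /dev/urandom.
    body = random.randbytes(PART_SIZE)
    response = s3.upload_part(
        Bucket=BUCKET, Key=KEY, UploadId=upload["UploadId"],
        PartNumber=part_number, Body=body,
    )
    parts.append({"ETag": response["ETag"], "PartNumber": part_number})

s3.complete_multipart_upload(
    Bucket=BUCKET, Key=KEY, UploadId=upload["UploadId"],
    MultipartUpload={"Parts": parts},
)
```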
However, the entire exercise is pointless. Unless you have a gigabit pipe directly into the Internet backbone, AWS is going to be faster than your network. So all you're really testing is how fast your network can push bytes into the Internet, and there are a bunch of "speed test" sites that will tell you that throughput. Plus, you won't learn much more sending 1 TB than sending 1 GB: the entire point of S3 is that it can handle anything.

Python: file-based thread-safe queue

I am creating an application (app A) in Python that listens on a port, receives NetFlow records, encapsulates them, and securely sends them to another application (app B). App A also checks whether a record was successfully sent. If not, it has to be saved; app A waits a few seconds and then tries to send it again, and so on. This is the important part: if sending was unsuccessful, the records must be stored, but meanwhile many more records can arrive and they need to be stored too. The ideal structure for this is a queue, but I need the queue to live in a file (on disk). I found, for example, this code http://code.activestate.com/recipes/576642/ but "On open, [it] loads full file into memory", which is exactly what I want to avoid. I must assume that this file of records could grow to a couple of GBs.
So my question is: what would you recommend for storing these records? It needs to handle a lot of data; on the other hand, it would be great if it weren't too slow, because during normal activity only one record is saved at a time and it is read and removed immediately, so the basic state is an empty queue. It should also be thread safe.
Should I use a database (dbm, sqlite3..) or something like pickle, shelve or something else?
I am a little confused about this... thank you.
You can use Redis as the store for this. It is very fast, does queuing amazingly well, and it can persist its state to disk in several ways, depending on the level of fault tolerance you want. Being an external process, it probably doesn't need a very strict saving policy, since if your program crashes, everything is already saved externally.
See http://redis.io/documentation, and if you want more detail on how to do this in Redis, I'd be glad to elaborate.
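A minimal sketch of such a queue with the redis-py client, assuming a local Redis server configured to persist to disk; the key name and send function are placeholders:

```python
import redis

r = redis.Redis(host="localhost", port=6379)
QUEUE_KEY = "netflow:outbox"  # placeholder key name

def enqueue(record: bytes) -> None:
    # LPUSH is atomic, so it is safe to call from multiple threads.
    r.lpush(QUEUE_KEY, record)

def dequeue(timeout: int = 5):
    # BRPOP blocks until a record is available (or the timeout expires).
    item = r.brpop(QUEUE_KEY, timeout=timeout)
    return item[1] if item else None

# Sketch of the send loop: put the record back if sending fails,
# so it is retried on a later pass.
def send_loop(send):
    while True:
        record = dequeue()
        if record is None:
            continue
        if not send(record):
            r.rpush(QUEUE_KEY, record)
```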
