tl;dr: I wanted your feedback on whether the Push/Pull Pipeline pattern is the correct software design pattern to use here.
Details:
Let's say I have several software algorithms/blocks which process data coming into a software system:
[Download Data] --> [Pre Process Data] --> [ML Classification] --> [Post Results]
The download data block simply loiters until midnight, when new data is available, and then downloads it. The pre-process block loiters until newly downloaded data is present, and then preprocesses it. The Machine Learning (ML) Classification block loiters until new data is available to classify, and so on.
The entire system seems to be event driven, and I think it fits the push/pull paradigm perfectly?
The [Download Data] block would be the producer? The consumers would be all the subsequent blocks, with the exception of [Post Results], which would be a results collector?
Producer = push
Consumer = pull then push
result collector = pull
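Roughly what I have in mind, sketched with pyzmq (the socket addresses and message payload below are placeholders, just to illustrate the three roles):

import zmq

def producer():                    # [Download Data]: push only
    ctx = zmq.Context()
    push = ctx.socket(zmq.PUSH)
    push.bind("tcp://127.0.0.1:5557")
    push.send_json({"path": "new_download.csv"})      # hand work downstream

def worker():                      # [Pre Process Data] / [ML Classification]: pull, then push
    ctx = zmq.Context()
    pull = ctx.socket(zmq.PULL)
    pull.connect("tcp://127.0.0.1:5557")
    push = ctx.socket(zmq.PUSH)
    push.connect("tcp://127.0.0.1:5558")
    msg = pull.recv_json()
    push.send_json({"path": msg["path"], "stage": "preprocessed"})

def collector():                   # [Post Results]: pull only
    ctx = zmq.Context()
    pull = ctx.socket(zmq.PULL)
    pull.bind("tcp://127.0.0.1:5558")
    print(pull.recv_json())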
I'm working within a python framework. This implementation looked ideal:
https://learning-0mq-with-pyzmq.readthedocs.io/en/latest/pyzmq/patterns/pushpull.html
https://github.com/ashishrv/pyzmqnotes
Push/Pull Pipeline Pattern
I'm totally open to using another software paradigm other than push/pull if I've missed the mark here. I'm also open to using another repo as well.
Thanks in advance for your help with the above!
I've done similar pipelines many times and very much like breaking them into blocks like that. Why? Mainly for automatic recovery from errors. If something gets delayed, it auto-recovers the next hour. If something needs to be fixed mid-pipeline, fix it and rename it so it gets picked up next cycle. (That, and the fact that smaller blocks are easier to design, build, and test.)
For example, your [Download Data] step should run every hour to look for waiting data: if none, go back to sleep; if some, download it to a file whose name contains a timestamp and a state, e.g. 2020-0103T2153.downloaded.json. [Pre Process Data] should run every hour to look for files named *.downloaded.json: if none, go back to sleep; if one or more, pre-process each in increasing timestamp order, writing output to <same-timestamp>.pre-processed.json. And so on for each step.
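As a rough sketch (the directory, file layout, and pre_process() body are placeholders, not a drop-in script), the [Pre Process Data] step run hourly from cron might look like:

#!/usr/bin/env python
# Runs hourly from cron: find *.downloaded.json, process each in timestamp
# order, write <timestamp>.pre-processed.json, and mark the input as consumed.
import glob
import json
import os

def pre_process(record):
    return record                        # placeholder for the real cleaning work

def main():
    for path in sorted(glob.glob("/data/pipeline/*.downloaded.json")):
        timestamp = os.path.basename(path).split(".")[0]
        out_path = "/data/pipeline/%s.pre-processed.json" % timestamp
        with open(path) as f:
            data = json.load(f)
        with open(out_path, "w") as f:
            json.dump(pre_process(data), f)
        os.rename(path, path + ".done")  # so next hour's run doesn't redo it

if __name__ == "__main__":
    main()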
Doing it this way meant many unplanned events auto-recovered and nobody would know unless they looked in the log files (you should log each step so you know what happened). Easy to sleep at night :)
In these scenarios, the event driving things is just time-of-day via crontab. When "awoken", each step in the pipeline simply looks to see whether it has any work waiting for it. Trying to make a file-creation event initiate things was not simple, especially if you need to re-initiate a step (you would have to re-create the file).
I wouldn't use a message queue, as that's more complicated and better suited to handling incoming messages as they arrive. Your case is simple batch file processing, so keep it simple and sleep at night.
Related
I'm having a problem handling the following scenario:
I have one publisher which wants to send a lot of binary information (like images), so instead I want it to save the image to disk and publish a path or some other reference to that file.
I have multiple different consumers which are reading from this MQ and do different things.
To do that, I simply send the information to a fan-out exchange and define a separate queue for each consumer.
This could work just fine, except that it trashes the FS, since no one is responsible for deleting the saved images. I need some way of hooking into the moment every consumer is done consuming a message from an exchange? Maybe setting some callback for the cleanup of the message in the exchange?
Few notes:
Everything happens locally, we can assume that everything is on the same FS for simplicity.
I know that I can simply let the publisher save the image and give FS links for the different consumers, but this solution is problematic, since I want the publisher to be oblivious to the consumers. I don't want to update the publisher's code every time a new consumer may be used (or one can be removed).
I am working with python. (pika module)
I am new to Message Queues, so if you have a better suggestion to get things done, I would love to learn about it.
Once the image is processed by a consumer, publish a FileProcessed message with the information related to the file. That message can be picked up by another consumer which is in charge of cleanup, and that consumer will remove the file.
Additionally, make sure that your messages are re-queued in case of failure, so they will be picked up later and their processing retried. Make sure the retry count is limited; when the limit is reached, route the message to a dead-letter exchange.
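A rough pika (1.x-style API) sketch of such a cleanup consumer; the queue name, message format, and the fixed number of consumers are my own assumptions for illustration:

import json
import os
import pika

EXPECTED_CONSUMERS = 3          # assumption: how many workers must report before deleting

done_counts = {}

def on_file_processed(channel, method, properties, body):
    msg = json.loads(body)                      # e.g. {"path": "/tmp/img123.png"}
    path = msg["path"]
    done_counts[path] = done_counts.get(path, 0) + 1
    if done_counts[path] >= EXPECTED_CONSUMERS:
        os.remove(path)                         # every consumer reported: delete the file
        del done_counts[path]
    channel.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="file_processed", durable=True)
channel.basic_consume(queue="file_processed", on_message_callback=on_file_processed)
channel.start_consuming()

Each worker would publish its FileProcessed message with something like channel.basic_publish(exchange="", routing_key="file_processed", body=json.dumps({"path": path})) once it has finished with the file.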
Some useful links below:
pika.BasicProperties for handling retries.
RabbitMQ tutorial
Pika DLX Implementation
I have a continuous stream of data. I want to do a small amount of processing on the data in real time (mostly just compression, rolling some data off the end, whatever needs doing) and then store it. Presumably no problem. The HDF5 file format should do great! OOC data, no problem. PyTables.
Now the trouble. Occasionally, as a completely separate process (so that data is still being gathered), I would like to perform a time-consuming calculation involving the data (on the order of minutes). This involves reading the same file I'm writing.
How do people do this?
Of course, reading a file that you're currently writing is bound to be challenging, but it seems it must have come up often enough in the past that people have considered some sort of slick solution, or at least a natural workaround.
Partial solutions:
It seems that HDF5 1.10.0 has a capability called SWMR (Single Writer, Multiple Reader). This seems like exactly what I want. I can't find a Python wrapper for this recent version, or if it exists I can't get Python to talk to the right version of HDF5 (see the rough sketch below of what I expect the usage to look like). Any tips here would be welcomed. I'm using the Conda package manager.
I could imagine writing to a buffer, which is occasionally flushed and added to the large database. How do I ensure that I'm not missing data going by while doing this?
This also seems like it might be computationally expensive, but perhaps there's no getting around that.
Collect less data. What's the fun in that?
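For reference, here is the SWMR sketch mentioned above; from what I can tell, once h5py is built against HDF5 >= 1.10 the usage would look roughly like this (file/dataset names and the processing functions are placeholders):

import h5py

# Writer process: create the file and dataset, then enable SWMR so readers can attach.
f = h5py.File("stream.h5", "w", libver="latest")
dset = f.create_dataset("data", shape=(0,), maxshape=(None,), dtype="f8")
f.swmr_mode = True
for chunk in incoming_chunks():          # placeholder for the real-time stream
    n = dset.shape[0]
    dset.resize((n + len(chunk),))
    dset[n:] = chunk
    f.flush()                            # make the new rows visible to readers

# Reader process (separate, the long-running calculation):
r = h5py.File("stream.h5", "r", libver="latest", swmr=True)
d = r["data"]
d.refresh()                              # pick up anything written since opening
result = expensive_calculation(d[...])   # placeholder, order of minutes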
I suggest you take a look at adding Apache Kafka to your pipeline; it can act as a data buffer and help you separate the different tasks performed on the data you collect.
pipeline example:
raw data ===> kafka topic (raw_data) ===> small processing ===> kafka topic (light_processing) ===> a process reads from the light_processing topic and writes to a db or file
At the same time, another process can read the same data from the light_processing topic (or any other topic) and do your heavy processing, and so on.
If the light-processing and heavy-processing consumers connect to the topic with different group ids, the data is effectively replicated and both processes get the full stream (consumers sharing a group id would instead split the messages between them).
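A minimal kafka-python sketch of that layout (the broker address, topic names, group ids, and processing functions are placeholders):

from kafka import KafkaConsumer, KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

def light_processing():
    consumer = KafkaConsumer("raw_data", bootstrap_servers="localhost:9092",
                             group_id="light")
    for msg in consumer:
        producer.send("light_processing", compress(msg.value))   # placeholder light step

def write_to_store():
    consumer = KafkaConsumer("light_processing", bootstrap_servers="localhost:9092",
                             group_id="writer")
    for msg in consumer:
        append_to_store(msg.value)                                # placeholder db/file append

def heavy_processing():
    # Different group id => receives the same light_processing stream independently.
    consumer = KafkaConsumer("light_processing", bootstrap_servers="localhost:9092",
                             group_id="heavy")
    for msg in consumer:
        expensive_calculation(msg.value)                          # placeholder, order of minutes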
hope it helped.
I'm working on my project, which is to make a streaming client on top of libtorrent.
I'm using the Python client (Python bindings).
I searched a lot about the functions set_sequential_download() and set_piece_deadline(), and I couldn't find a good answer on how to force pieces to download in order, meaning first piece 1, then 2, 3, 4, etc.
I saw people asking this in forums, but none of them got a good answer about the changes needed to make it work.
I understood that set_sequential_download() just asks for the pieces in order, but in fact they are downloaded out of order. I tried changing the deadline of the pieces using set_piece_deadline(), incrementing the deadline for each piece, but it doesn't work for me at all.
** UPDATE
The goal I'm trying to accomplish is downloading one piece at a time so I can stream through torrents.
I hope some of you can help me.
Thanks, Ben.
set_sequential_download() will request pieces in order. However:
not all peers have all pieces. If the next piece you want to download is 3, and one of your peers doesn't have 3 but the next piece it does have is 5, libtorrent will start requesting blocks of piece 5 from that peer.
peers provide varying upload rates, which means that some peers will satisfy your request sooner than others.
This makes it possible for the pieces to complete out-of-order.
set_piece_deadline() is a more flexible way to specify piece priority. It supports arbitrary range requests (as described by Jacob Zelek). Its main feature, though, is that it uses a different approach to requesting blocks. Instead of considering a peer at a time, and asking "what should I request from this peer", it considers a piece at a time, asking "which peer should I request this block from".
This means it deliberately attempts to make pieces complete in the order of their deadlines. It is still an estimate based on historical download rates from peers, and if the bottleneck is your own download capacity, it may be very difficult to predict future download rates for peers. A few important things to keep in mind when using the set_piece_deadline() API:
It's not important that the deadline is in the future. If the deadline cannot be met given the current download or upload capacity, the pieces will be prioritized in the order they were asked to be completed.
If a deadline is far out in the future, libtorrent may wait to prioritize it until it believes it needs to request it to make the deadline. If you're streaming a large file and you know the bit-rate, you can set up deadlines for every piece; if your capacity is higher than the bit-rate, you'll still request some pieces in rarest-first order, which improves swarm quality.
When streaming data, it's absolutely critical to read-ahead. If you don't set the deadline until you want the piece, you'll always fall behind. There's typically a pretty long round-trip between requesting a piece and completing it. If you don't keep the request pipe full of deadline-pieces, libtorrent will start requesting other pieces again, and you'll get non-prioritized pieces interleaved with your high-priority pieces. You should probably keep a few seconds and at least a few pieces as read-ahead. For video, I would imagine tens of megabytes is appropriate (but experimentation and measurement is the best way to tweak it).
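Here is a rough sketch of that read-ahead idea with the python bindings (the torrent file, window size, and deadline spacing are made-up values; tune them against your bit-rate, and note some calls differ between libtorrent versions):

import time
import libtorrent as lt

ses = lt.session()
info = lt.torrent_info("movie.torrent")            # placeholder torrent file
handle = ses.add_torrent({"ti": info, "save_path": "."})

READ_AHEAD = 20        # made-up window: pieces to keep requested ahead of playback
cursor = 0             # next piece the player needs

while cursor < info.num_pieces():
    # Keep the request pipe full; deadlines grow with distance from the cursor.
    for offset in range(min(READ_AHEAD, info.num_pieces() - cursor)):
        handle.set_piece_deadline(cursor + offset, 1000 * (offset + 1))
    if handle.have_piece(cursor):
        cursor += 1    # hand the piece to the player and slide the window forward
    else:
        time.sleep(0.1)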
If you are in fact looking to stream video to a player or web browser over HTTP, you may want to take a look at (or use and submit pull requests to):
https://github.com/arvidn/libtorrent-webui/blob/master/src/file_downloader.cpp
That's a file-downloader provider that fits into the simple HTTP framework in that repository.
UPDATE:
If all you want is to guarantee that piece 1 completes before piece 2 (at any cost, specifically very poor performance), you can set the priority of all pieces to 0, except for the one piece you want to download. Once it completes, you'll be notified by an alert and you can set the priority of the next piece you want to 1. And so on.
This will be incredibly slow, since you'll pause the download constantly, and be in constant end-game mode (where you may download the same block from multiple peers, if one is slow). For instance, if you have more peers than there are blocks in one piece, you will leave download bandwidth unused, by not being able to request from all peers.
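A sketch of that priority-only approach (the torrent file is a placeholder, the alert mask must include progress notifications for piece_finished_alert to be posted, and the exact alert calls vary by libtorrent version):

import time
import libtorrent as lt

ses = lt.session()
ses.set_alert_mask(lt.alert.category_t.progress_notification)
info = lt.torrent_info("movie.torrent")             # placeholder torrent file
handle = ses.add_torrent({"ti": info, "save_path": "."})

handle.prioritize_pieces([0] * info.num_pieces())   # disable everything...
current = 0
handle.piece_priority(current, 1)                   # ...except the first piece

while current < info.num_pieces():
    for alert in ses.pop_alerts():
        if isinstance(alert, lt.piece_finished_alert) and alert.piece_index == current:
            current += 1
            if current < info.num_pieces():
                handle.piece_priority(current, 1)   # now allow the next piece
    time.sleep(0.1)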
I've run into the same problem as you. Setting a torrent to sequential download means the pieces will be downloaded in a somewhat ordered fashion. This may seem like the intuitive solution for streaming. However, streaming video is more complicated than just downloading all the pieces in order.
Video files come in different containers (e.g. mkv, mp4, avi) and different codecs (h264, theora, etc.). Some codecs/containers store metadata/headers in different locations in the file. I can't remember which off the top of my head, but a certain container/codec stores all header information at the end of the file. Such a file may not stream well if downloaded sequentially.
Unless you write the code for determining which pieces are needed to start streaming, you will have to rely on an existing mechanism. Take for example Peerflix, which spawns a browser video player, VLC, or MPlayer. These applications have a good idea of what byte ranges they need for various containers/codecs. When Peerflix launches VLC to play, let's say, an AVI file, VLC will attempt to read the first several bytes and the last several bytes (headers).
The genius behind Peerflix is that it serves the video file through its own web server and therefore knows what byte ranges of the file VLC is seeking. It then determines which pieces those byte ranges fall into and prioritizes those pieces. Peerflix uses some Node.js BitTorrent library, whose exact piece prioritization mechanisms are unknown to me. However, in the case of libtorrent-rasterbar, the set_piece_deadline() function allows you to signal to the library which pieces you need. In my experience, once I determined the pieces needed, I would call set_piece_deadline() with a short deadline (50 ms or so) and wait for the arrival. Please note that using set_piece_deadline() is incompatible with sequential download (just set it to false).
One thing to note: libtorrent-rasterbar will not necessarily write the piece to the hard drive as soon as it gets it. This is a trap I fell into, because I tried to read that byte range from the file as soon as the piece arrived. Instead you will need to run a thread to catch the alerts that libtorrent-rasterbar passes to your application; more specifically, you will receive the raw binary data for that piece in a read_piece_alert.
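For example, something along these lines (assuming ses and handle are already set up, the alert mask includes storage notifications so read_piece_alert is posted, and keeping in mind the alert API differs a bit between libtorrent versions):

import time
import libtorrent as lt

def fetch_piece_bytes(ses, handle, piece):
    # Wait for a piece, then get its raw bytes back via read_piece_alert.
    handle.set_piece_deadline(piece, 50)       # ask for it urgently (50 ms)
    while not handle.have_piece(piece):
        time.sleep(0.05)
    handle.read_piece(piece)                   # ask libtorrent to hand the bytes back
    while True:
        for alert in ses.pop_alerts():
            if isinstance(alert, lt.read_piece_alert) and alert.piece == piece:
                return bytes(alert.buffer)     # raw binary data of the piece
        time.sleep(0.05)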
I'm sorry if this question has in fact been asked before. I've searched around quite a bit and found pieces of information here and there but nothing that completely helps me.
I am building an app on Google App Engine in Python that lets a user upload a file, which is then processed by a piece of Python code, and the resulting processed file gets sent back to the user in an email.
At first I used a deferred task for this, which worked great. Over time I've come to realize that since the processing can take more than the 10 minutes I have before I hit the DeadlineExceededError, I need to be more clever.
I therefore started to look into task queues, wanting to make a queue that processes the file in chunks, and then piece everything together at the end.
My present code for making the single deferred task looks like this:
_ = deferred.defer(transform_function, filename, from_, to, email)  # ('from' is a Python keyword, hence from_)
so that the transform_function code gets the values of filename, from_, to, and email and sets off to do the processing.
Could someone please enlighten me as to how I turn this into a linear chain of tasks that get acted on one after the other? I have read all the Google App Engine documentation I can find, but it unfortunately isn't written in enough detail in terms of actual pieces of code.
I see references to things like:
taskqueue.add(url='/worker', params={'key': key})
but since I don't have a url for my task, but rather a transform_function() implemented elsewhere, I don't see how this applies to me…
Many thanks!
You can just keep calling deferred to run your task when you get to the end of each phase.
Other queues just allow you to control the scheduling and rate, but work the same.
I track the elapsed time in the task, and when I get near the end of the processing window the code stops what it is doing and calls defer for the next task in the chain, or continues where it left off, depending on whether it's a discrete set of steps or a continuous chunk of work. This was all written back when tasks could only run for 60 seconds.
However the problem you will face (it doesn't matter if it's a normal task queue or deferred) is that each stage could fail for some reason, and then be re-run so each phase must be idempotent.
For long-running chained tasks, I construct an entity in the datastore that holds the description of the work to be done and tracks the processing state for the job; then you can just keep rerunning the same task until completion. On completion it marks the job as complete.
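A rough sketch of that pattern, combining deferred with a datastore entity that tracks progress (the entity kind, fields, time budget, and helper functions are made up for illustration):

import time
from google.appengine.ext import deferred, ndb

class TransformJob(ndb.Model):
    # Assumed entity: remembers which chunk we're on so a retried task is idempotent.
    filename = ndb.StringProperty()
    email = ndb.StringProperty()
    next_chunk = ndb.IntegerProperty(default=0)
    total_chunks = ndb.IntegerProperty()

TIME_BUDGET = 8 * 60        # stop well before the 10-minute deadline

def transform_chunks(job_id):
    job = TransformJob.get_by_id(job_id)
    started = time.time()
    while job.next_chunk < job.total_chunks:
        process_chunk(job.filename, job.next_chunk)      # placeholder per-chunk work
        job.next_chunk += 1
        job.put()                                        # persist progress after each chunk
        if time.time() - started > TIME_BUDGET:
            deferred.defer(transform_chunks, job_id)     # continue in a fresh task
            return
    deferred.defer(send_result_email, job.email, job.filename)   # placeholder final step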
To avoid the 10-minute timeout you can direct the request to a backend or a B-type module using the "_target" param.
BTW, any reason you need to process the chunks sequentially? If all you need is a notification upon completion of all chunks (so you can "piece everything together at the end"), you can implement it in various ways. For example, each deferred task for a chunk can decrement a shared datastore counter (read the state, decrement, and update, all in the same transaction) that was initialized with the number of chunks; if the datastore update was successful and the counter has reached zero, you can proceed with combining all the pieces together. An alternative to using deferred that would simplify the suggested workflow is pipelines (https://code.google.com/p/appengine-pipeline/wiki/GettingStarted).
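A sketch of that counter idea (the entity kind and helper function names are placeholders):

from google.appengine.ext import deferred, ndb

class ChunkCounter(ndb.Model):
    remaining = ndb.IntegerProperty()     # initialized to the number of chunks

@ndb.transactional
def decrement(counter_key):
    # Read, decrement, and update in one transaction.
    counter = counter_key.get()
    counter.remaining -= 1
    counter.put()
    return counter.remaining

def process_chunk(counter_id, chunk_no, filename, email):
    transform_chunk(filename, chunk_no)                       # placeholder chunk work
    if decrement(ndb.Key(ChunkCounter, counter_id)) == 0:
        deferred.defer(combine_results, filename, email)      # last chunk done; combine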
After talking with a friend of mine from Google, I'd like to implement some kind of Job/Worker model for updating my dataset.
This dataset mirrors a 3rd party service's data, so, to do the update, I need to make several remote calls to their API. I think a lot of time will be spent waiting for responses from this 3rd party service. I'd like to speed things up, and make better use of my compute hours, by parallelizing these requests and keeping many of them open at once, as they wait for their individual responses.
Before I explain my specific dataset and get into the problem, I'd like to clarify what answers I'm looking for:
Is this a flow that would be well suited to parallelizing with MapReduce?
If yes, would it be cost-effective to run on Amazon's MapReduce module, which bills by the hour and rounds hours up when the job is complete? (I'm not sure exactly what counts as a "Job", so I don't know exactly how I'll be billed.)
If no, is there another system/pattern I should use? And is there a library that will help me do this in Python (on AWS, using EC2 + EBS)?
Are there any problems you see with the way I've designed this job flow?
Ok, now onto the details:
The dataset consists of users who have favorite items and who follow other users. The aim is to be able to update each user's queue -- the list of items the user will see when they load the page, based on the favorite items of the users she follows. But, before I can crunch the data and update a user's queue, I need to make sure I have the most up-to-date data, which is where the API calls come in.
There are two calls I can make:
Get Followed Users -- Which returns all the users being followed by the requested user, and
Get Favorite Items -- Which returns all the favorite items of the requested user.
After I call get followed users for the user being updated, I need to update the favorite items for each user being followed. Only when all of the favorites are returned for all the users being followed can I start processing the queue for that original user. This flow looks like:
Jobs in this flow include:
Start Updating Queue for user -- kicks off the process by fetching the users followed by the user being updated, storing them, and then creating Get Favorites jobs for each user.
Get Favorites for user -- Requests, and stores, a list of favorites for the specified user, from the 3rd party service.
Calculate New Queue for user -- Processes a new queue, now that all the data has been fetched, and then stores the results in a cache which is used by the application layer.
So, again, my questions are:
Is this a flow that would be well suited to parallelizing with MapReduce? I don't know if it would let me start the process for UserX, fetch all the related data, and come back to processing UserX's queue only after that's all done.
If yes, would it be cost-effective to run on Amazon's MapReduce module, which bills by the hour and rounds hours up when the job is complete? Is there a limit on how many "threads" I can have waiting on open API requests if I use their module?
If no, is there another system/pattern I should use? And is there a library that will help me do this in Python (on AWS, using EC2 + EBS)?
Are there any problems you see with the way I've designed this job flow?
Thanks for reading, I'm looking forward to some discussion with you all.
Edit, in response to JimR:
Thanks for a solid reply. In my reading since I wrote the original question, I've leaned away from using MapReduce. I haven't decided for sure yet how I want to build this, but I'm beginning to feel MapReduce is better for distributing / parallelizing computing load when I'm really just looking to parallelize HTTP requests.
What would have been my "reduce" task, the part that takes all the fetched data and crunches it into results, isn't that computationally intensive. I'm pretty sure it's going to wind up being one big SQL query that executes for a second or two per user.
So, what I'm leaning towards is:
A non-MapReduce Job/Worker model, written in Python. A Google friend of mine turned me on to Python for this, since it's low-overhead and scales well.
Using Amazon EC2 as a compute layer. I think this means I also need an EBS slice to store my database.
Possibly using Amazon's Simple Queue Service (SQS). It sounds like this third Amazon widget is designed to keep track of job queues, move results from one task into the inputs of another, and gracefully handle failed tasks. It's very cheap, and may be worth using instead of a custom job-queue system.
The work you describe is probably a good fit for either a queue, or a combination of a queue and job server. It certainly could work as a set of MapReduce steps as well.
For a job server, I recommend looking at Gearman. The documentation isn't awesome, but the presentations do a great job documenting it, and the Python module is fairly self-explanatory too.
Basically, you create functions in the job server, and these functions get called by clients via an API. The functions can be called either synchronously or asynchronously. In your example, you probably want to asynchronously add the "Start update" job. That will do whatever preparatory tasks are needed, and then asynchronously call the "Get followed users" job. That job will fetch the users and then call the "Update followed users" job. That will submit all the "Get Favourites for UserA" (and friends) jobs together in one go, and synchronously wait for the result of all of them. When they have all returned, it will call the "Calculate new queue" job.
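Roughly, with the python-gearman module, the fan-out-and-wait part might look like this (the server address, task names, and helper functions are placeholders):

import gearman

# Worker side: register a function under a task name and serve forever.
worker = gearman.GearmanWorker(["localhost:4730"])

def get_favourites(gearman_worker, gearman_job):
    return fetch_favourites_from_api(gearman_job.data)     # placeholder 3rd-party API call

worker.register_task("get_favourites", get_favourites)
# worker.work()    # blocks; run this in its own process

# Client side ("Update followed users"): fan out one job per followed user and wait.
client = gearman.GearmanClient(["localhost:4730"])
followed_users = ["UserA", "UserB", "UserC"]                # from the "get followed users" step
requests = [dict(task="get_favourites", data=user) for user in followed_users]
completed = client.submit_multiple_jobs(requests, wait_until_complete=True)
favourites = [req.result for req in completed]
client.submit_job("calculate_new_queue", "UserX", background=True)   # kick off asynchronously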
This job-server-only approach will initially be a bit less robust, since ensuring that you handle errors and any down servers and persistence properly is going to be fun.
For a queue, SQS is an obvious choice. It is rock-solid, and very quick to access from EC2, and cheap. And way easier to set up and maintain than other queues when you're just getting started.
Basically, you will put a message onto the queue, much like you would submit a job to the job server above, except you probably won't do anything synchronously. Instead of making the "Get Favourites For UserA" and so forth calls synchronously, you will make them asynchronously, and then have a message that says to check whether all of them are finished. You'll need some sort of persistence (a SQL database you're familiar with, or Amazon's SimpleDB if you want to go fully AWS) to track whether the work is done - you can't check on the progress of a job in SQS (although you can in other queues). The message that checks whether they are all finished will do the check - if they're not all finished, don't do anything, and then the message will be retried in a few minutes (based on the visibility_timeout). Otherwise, you can put the next message on the queue.
This queue-only approach should be robust, assuming you don't consume queue messages by mistake without doing the work. Making a mistake like that is hard to do with SQS - you really have to try. Don't use auto-consuming queues or protocols - if you error out, you might not be able to ensure that you put a replacement message back on the queue.
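A rough sketch of that "check whether everything is done" message handling with SQS (shown with boto3; the queue URL, message format, and the persistence lookup are assumptions):

import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/update-queue"   # placeholder

def poll_once():
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1,
                               WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        body = json.loads(msg["Body"])
        if body["type"] == "check_favourites_done":
            if all_favourites_fetched(body["user"]):          # placeholder: your SQL/SimpleDB check
                sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(
                    {"type": "calculate_queue", "user": body["user"]}))
                sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
            # else: leave it alone; the message reappears after the visibility timeout
        else:
            handle_other_message(body)                        # placeholder for the other job types
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])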
A combination of queue and job server may be useful in this case. You can get away with not having a persistence store to check job progress - the job server will allow you to track job progress. Your "get favourites for users" message could place all the "get favourites for UserA/B/C" jobs into the job server. Then, put a "check all favourites fetching done" message on the queue with a list of tasks that need to be complete (and enough information to restart any jobs that mysteriously disappear).
For bonus points:
Doing this as a MapReduce should be fairly easy.
Your first job's input will be a list of all your users. The map will take each user, get the followed users, and output lines for each user and their followed user:
"UserX" "UserA"
"UserX" "UserB"
"UserX" "UserC"
An identity reduce step will leave this unchanged, and it will form the second job's input. The map for the second job will get the favourites for each line (you may want to use memcached so that the UserX/UserA and UserY/UserA combos don't both hit the API for UserA's favourites), and output a line for each favourite:
"UserX" "UserA" "Favourite1"
"UserX" "UserA" "Favourite2"
"UserX" "UserA" "Favourite3"
"UserX" "UserB" "Favourite4"
The reduce step for this job will convert this to:
"UserX" [("UserA", "Favourite1"), ("UserA", "Favourite2"), ("UserA", "Favourite3"), ("UserB", "Favourite4")]
At this point, you might have another MapReduce job to update your database for each user with these values, or you might be able to use some of the Hadoop-related tools like Pig, Hive, and HBase to manage your database for you.
I'd recommend using the EC2 management commands in Cloudera's Distribution for Hadoop to create and tear down your Hadoop cluster on EC2 (their AMIs have Python set up on them), and using something like Dumbo (on PyPI) to create your MapReduce jobs, since it allows you to test them on your local/dev machine without access to Hadoop.
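As a sketch, the second job in Hadoop-streaming style might look roughly like this (Dumbo's API differs slightly; get_favourites() is a placeholder for the 3rd-party API call):

import sys
from itertools import groupby

def mapper():
    # Input lines: "<follower>\t<followed>"; output: one line per favourite.
    for line in sys.stdin:
        follower, followed = line.split()
        for favourite in get_favourites(followed):       # placeholder API call (memcache it)
            print("%s\t%s\t%s" % (follower, followed, favourite))

def reducer():
    # Hadoop sorts by key (the follower), so group consecutive lines.
    rows = (line.rstrip("\n").split("\t") for line in sys.stdin)
    for follower, group in groupby(rows, key=lambda r: r[0]):
        favourites = [(followed, fav) for _, followed, fav in group]
        print("%s\t%s" % (follower, favourites))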
Good luck!
Seems that we're going with Node.js and the Seq flow-control library. It was very easy to move from my map/flowchart of the process to a stub of the code, and now it's just a matter of filling out the code to hook into the right APIs.
Thanks for the answers, they were a lot of help finding the solution I was looking for.
I am working on a similar problem that I need to solve. I was also looking at MapReduce and at using Amazon's Elastic MapReduce service.
I'm pretty convinced MapReduce will work for this problem. The implementation is where I'm getting hung up, because I'm not sure my reducer even needs to do anything.
I'll answer your questions as I understand your (and my) problem, and hopefully it helps.
Yes, I think it'll be well suited. You could look at leveraging the Elastic MapReduce service's multiple-steps option. You could use one step to fetch the people a user is following, and another step to compile a list of tracks for each of those followers, and the reducer for that second step would probably be the one to build the cache.
That depends on how big your dataset is and how often you'll be running the job; it's hard to say whether it will be cost-effective without knowing how big the dataset is (or is going to get). Initially it'll probably be quite cost-effective, as you won't have to manage your own Hadoop cluster or pay for EC2 instances (assuming that's what you use) to be up all the time. Once you reach the point where you're crunching this data for long periods, it will probably make less and less sense to use Amazon's MapReduce service, because you'll have nodes online constantly anyway.
A Job is basically your MapReduce task. It can consist of multiple steps (each MapReduce task is a step). Once your data has been processed and all steps have been completed, your Job is done. So you're effectively paying for CPU time for each node in the Hadoop cluster: roughly T * n, where T is the time (in hours) it takes to process your data and n is the number of nodes you tell Amazon to spin up.
I hope this helps, good luck. I'd like to hear how you end up implementing your Mappers and Reducers, as I'm solving a very similar problem and I'm not sure my approach is really the best.