set_sequential_download() and set_piece_deadline() in libtorrent

set_sequential_download() and set_piece_deadline() in libtorrent - python

i'm working on my project which is to make a streaming client over libtorrent.
i'm using the python client (python binding).
i searched a lot about these functions set_sequential_download() and set_piece_deadline() and i couldn't find a good answer on how to force download pieces in order, which means first piece 1 and then 2,3,4 etc..
i saw people are asking this in forums, but none of them got a good answer on the changes need to be done in order it to succeed.
i understood that the set_sequential_download() just asks for the pieces in order but in fact they are randomly downloaded. i tried to change the deadline of the pieces using set_piece_deadline() , increment each piece but it doesn't work for me at all.
** UPDATE
the goal i'm trying to acomplish , it's downloading one piece at a time so i can make a streaming throgh torrents.
i hope some of you can help me,
thanks Ben.

set_sequential_download() will request pieces in order. However:
all peers may not have all pieces. If the next piece you want to download is 3 and one of your peers doesn't have 3 but the next it has is 5, libtorrent will start requesting blocks from piece 5 from that peer.
peers provide varying upload rates, which means that some peers will satisfy your request sooner than others.
This makes it possible for the pieces to complete out-of-order.
set_piece_deadline() is a more flexible way to specify piece priority. It supports arbitrary range requests (as described by Jacob Zelek). Its main feature, though, is that it uses a different approach to requesting blocks. Instead of considering a peer at a time, and asking "what should I request from this peer", it considers a piece at a time, asking "which peer should I request this block from".
This makes it deliberately attempt to make pieces complete in the order of their deadlines. It is still an estimate based on historical download rates from peers, and if the bottleneck for download rates is your own download capacity, it may be very difficult to make predictions of future download rates for peers. A few important things to keep in mind when using the `set_piece_deadline()`` API are:
It's not important that the deadline is in the future. If the deadline cannot be met given the current download or upload capacity, the pieces will be prioritized in the order they were asked to be completed.
If a deadline is far out in the future, libtorrent may wait to prioritize it until it believe it needs to request it to make the deadline. If you're streaming a large file, and you know the bit-rate, you can set up deadlines for every piece, and if your capacity is higher than the bitrate, you'll still request some pieces in rarest-first order. Improves swarm quality.
When streaming data, it's absolutely critical to read-ahead. If you don't set the deadline until you want the piece, you'll always fall behind. There's typically a pretty long round-trip between requesting a piece and completing it. If you don't keep the request pipe full of deadline-pieces, libtorrent will start requesting other pieces again, and you'll get non-prioritized pieces interleaved with your high-priority pieces. You should probably keep a few seconds and at least a few pieces as read-ahead. For video, I would imagine tens of megabytes is appropriate (but experimentation and measurement is the best way to tweak it).
If you are in fact looking to stream video to a player or web browser over HTTP, you may want to take a look at (or use and submit pull requests to):
https://github.com/arvidn/libtorrent-webui/blob/master/src/file_downloader.cpp
that's a file-downloader provider that fits into simple http framework in that repository.
UPDATE:
If all you want is to guarantee that piece 1 completes before piece 2 (at any cost, specifically very poor performance), you can set the priority of all pieces to 0, except for the one piece you want to download. Once it completes, you'll be notified by an alert and you can set the priority of the next piece you want to 1. And so on.
This will be incredibly slow, since you'll pause the download constantly, and be in constant end-game mode (where you may download the same block from multiple peers, if one is slow). For instance, if you have more peers than there are blocks in one piece, you will leave download bandwidth unused, by not being able to request from all peers.

I've ran into the same problem as you. Setting a torrent to sequential download means the pieces will be downloaded in a somewhat ordered fashion. This may be the intuitive solution for streaming. However, streaming video is more complicated then just downloading all the pieces in order.
Video files come in different containers (e.g. mkv, mp4, avi) and different codes (h264, theora, etc). Some codecs/containers store metadata/headers in different locations in a file. I can't remember off the top of my head but a certain container/codec stores all header information at the end of the file. Such a file may not stream well if downloaded sequentially.
Unless you write the code for determining which pieces are needed to start streaming, you will have to rely on an existing mechanisms. Take for example Peerflix which spawns a browser video player, VLC, of Mplayer. These applications have a good idea of what byte ranges they need for various containers/codecs. When Peerflix launches VLC to play, lets say, an AVI file, VLC will attempt to read the first several bytes and last several bytes (headers).
The genius behind Peerflix is that it tries to serve the video file through it's own web server and therefore knows what byte ranges of the file VLC is seeking. It then determines which pieces the byte ranges fall into and prioritizes those pieces. Peerflix uses some Node.js BitTorrent library, whose exact piece prioritization mechanisms are unknown to me. However, in the case of libtorrent-rasterbar, the set_piece_deadline() function allows you to signal the library to what pieces you need. In my experience, once I determined the pieces needed, I would call set_piece_deadline() with a short deadline (50ms or so) and wait for the arrival. Please note that using set_piece_dealine() is incompatible with sequential downloads (just set them to false).
One thing to note, libtorrent-rasterbar will not write the piece to the hard drive as soon as it gets it. This is a trap I fell into because I tried to read that byte range from the file when the piece arrived. For this you will need to run a thread to catch the alerts that libtorrent-rasterbar passes to your application. More specifically you will receive the raw binary data for that piece in a read_piece_alert.

Related

Document AI - Improving batch process time for a single document?

I'm working on a GCP Document AI project. First, let me say this - the OCR works fine :-). I'm curios to know about possibilities of improvement if possible.
What happens now
I have a python module written, which will get the OCR done for a tiff file which is uploaded via a portal or collected by an agent in the system. The module is written in a way to avoid local usage of original file content, as the file is readily available in a cloud bucket. But, the price I have to pay is to use the batch_process_documents() API instead of process_document().
An observation
This is an obvious one, as the document if submitted via inline API gets OCR back in less than 5 seconds most time. But, the batch (with a single document :-|) takes more than 45 seconds almost every time. Sometimes it goes beyond a minute or more.
I'm searching for a solution, to reduce the OCR call time. The inline API does not support gcs uris as much as I'm aware, so either I need to download the content, upload it back via inline api and then do an OCR - or I need to live with the performance reduction.
Is there any one who has handled a similar case ? Or if there are any ways to tackle this without using batch api or downloading the content ? Any help is appreciated.

As per your requirement, since your concern is related to the latency when comparing the response time between the process and batchProcess method calls for the Document API, using a single document with results of 5 and 45 seconds respectively.
The process_documents() method has limits on the number of pages and file size that can be sent and it only allows for one document file per API call.
The batch_process_documents() method allows asynchronous processing of larger files and batch processing of multiple files.
Single requests are oriented to smaller amounts of data that usually takes a very small amount of time to process but may have low performance when dealing with a big amount of data, on the other hand batch requests are oriented to handle bigger amounts of data which would have better performance over the single request but may have lower performance when processing a small amount of data.
Regarding your concerns about the latency on these two method calls, looking into the documentation,I am able to find that for the single request or synchronous ("online") operations ( i.e immediate response) the document data is processed in memory and not persisted to disk. Following this in asynchronous offline batch operations the documents are processed in disk, due that the file could be significatively bigger that could not fit in memory. That's why the asynchronous operations take around 10x time vs the synchronous operations.
Each of these method calls has a particular use case, in this case the choice of which one to use would rely on the trade off that's better for you. If the time response is critical and you would like to have the response as soon as possible, you could split the files to fit the size and make the requests as synchronous operations, keeping in mind the quotas and limitations of the API.
This issue has been raised in this issue tracker. We cannot provide an ETA at this moment but you can follow the progress in the issue tracker and you can ‘STAR’ the issue to receive automatic updates and give it traction by referring to this Link.

Since this was originally posted, the Document AI API added a feature to specify a field_mask in the processing request, which limits the fields returned in the Document object output. This can reduce the latency in some requests since the response will be a smaller size.
https://cloud.google.com/document-ai/docs/send-request#online-processor

Software Paradigm for Pushing Data Through a System

tl-dr: I wanted you feedback if the correct software design pattern to use would be a Push/Pull Pipeline pattern.
Details:
Let's say I have several software algorithms/blocks which process data coming into a software system:
[Download Data] --> [Pre Process Data] --> [ML Classification] --> [Post Results]
The download data block simply loiters until midnight when new data is available and then downloads new data. The pre-process data simply loiters until newly available downloaded data is present, and then preprocesses the data. The Machine Learning (ML) Classification block simply loiters until new data is available to classify, etc.
The entire system seems to be event driven and I think fits the push/pull paradigm perfectly?
The [Download Data] block would be a producer? The consumers would be all the subsequent blocks with the exception of the [Plot Results] which would be a results collector?
Producer = pull
Consumer = pull then push
result collector = pull
I'm working within a python framework. This implementation looked ideal:
https://learning-0mq-with-pyzmq.readthedocs.io/en/latest/pyzmq/patterns/pushpull.html
https://github.com/ashishrv/pyzmqnotes
Push/Pull Pipeline Pattern
I'm totally open to using another software paradigm other than push/pull if I've missed the mark here. I'm also open to using another repo as well.
Thanks in advance for your help with the above!

I've done similar pipelines many many times and very much like to break it into blocks like that. Why? Mainly for automatic recovery from any errors. If something gets delayed, it will auto recover next hour. If something needs to be fixed mid-pipeline, fix it and name it so it gets picked up next cycle. (That and the fact smaller blocks are easier to design, build, and test).
For example, your [Download Data] should run every hour to look for waiting data: if none, go back to sleep; if some, download it to a file with a name containing a timestamp and state: 2020-0103T2153.downloaded.json. [Pre Process Data] should run every hour to look for files named *.downloaded.json: if none, go back to sleep; if one or more, pre-processes each in increasing timestamp order with output to <same-timestamp>.pre-processed.json. Etc, etc for each step.
Doing it this way meant may unplanned events auto recovered and nobody would know unless they looked in the log files (you should log each so you know what happened). Easy to sleep at night :)
In these scenarios, the event driving this is just time-of-day via crontab. When "awoken", each step in the pipeline just looks to see if it has any work waiting for it. Trying to make the file-creation event initiate things was non-simple especially if you need to re-initiate things (would need to re-create the file).
I wouldn't use a message queue as that's more complicated and more suited when you have to handle incoming messages as they arrive. Your case is more simple batch file processing so keep it simple and sleep at night.

mpi4py recv data cap?

I am working on a program that is communication intensive with a group of people. I'm not particularly good at debugging distributed programs, but I have a strong suspicion that I am sending too many messages at once to a process. I have reimplemented the actor model in mpi4py. Each process has a "mailbox" of jobs and when they finish with their mailbox they decide to go into CHECK_FOR_UPDATES mode, where they see if there is any new messages they can receive.
I had issues with the program that a group of students and I have been working on. When the load became too big it would start to crash, but we couldn't figure out where the issue was because we're all pretty bad at debugging stuff.
I asked some people at my school if he had any ideas and suggested that, as we are reimplementing the actor system, we should consider using Akka. A student this year said that there may still be a problem, that one actor may get inundated with messages and crash. I asked about it here. The stream model seems not to be what we want (see my comment for more details) and I have since then looked back at the mpi4py program as I had not accounted for this problem before.
In the plain C or Fortran implementation, it appears that there is a count parameter for MPI_Recv. I noticed that comm.recv has no count parameter and suspect that when a process goes into CHECK_FOR_UPDATES mode it just consume a ton of messages from a variety of sources and dies. (Technically, I don't know for sure, but we suspect it might be the case.) Is there a way to cap the amount of data comm.recv accepts?
(Note: I want to avoid using comm.Recv variant as it restricts the user to using numpy arrays.)

Found the answer:
The recv() and irecv() methods may be passed a buffer object that can be repeatedly used to receive messages avoiding internal memory allocation. The buffer must be sufficiently large to accomodate the transmitted messages.
Emphasis mine. Therefore, I have to use Send and Recv.

Outgoing load balancer

I have a big threaded feed retrieval script in python.
My question is, how can I load balance outgoing requests so that I don't hit any one host too often?
This is a big problem for feedburner, since a large percentage of sites proxy their RSS through feedburner and to further complicate matters many sites will alias a subdomain on their domain to feedburner to obscure the fact that they're using it (e.g. "mysite" sets its RSS url to feeds.mysite.com/mysite, where feeds.mysite.com bounces to feedburner). Sometimes it blocks me for awhile and redirects to their "automated requests" error page.

You should probably do a one-time request (per week/month, whatever fits). for each feed and follow redirects to get the "true" address. Regardless of your throttling situation at the time, you should be able to resolve all feeds, save that data and then just do it once for every new feed you add to the list. You can look at urllib's geturl() as it returns the final url from the URL you put into it. When you do ping the feeds, be sure to use the original (keep the "real" simply for load-balancing) to make sure it redirects properly if the user has moved it or similar.
Once that is done, you can simply devise a load mechanism such as only X requests per hour for a given domain, going through each feed and skipping feeds whose hosts have hit the limit. If feedburner keeps their limits public (not likely) you can use that for X, but otherwise you will just have to estimate it and make a rough estimate that you know to be below the limit. Knowing google however, their limits might measure patterns and not have a specific hard limit.
Edit: Added suggestion from comment.

If your problem is related to Feedburner "throttling you", it most certainly does this because of the source IP of your bot. The way to "load balance to Feedburner" would be to have multiple different source IPs to start from.
Now there are numerous ways to achieving this, 2 of them being:
Multi-homed server: multiple IPs on the same machine
Multiple discrete machines
Of course, don't you go a put a NAT box in front of them now ;-)
The above takes care of the possible "throttling problems", now for the "scheduling part". You should maintain a "virtual scheduler" per "destination" and make sure not to exceed the parameters of the Web Service (e.g. Feedburner) in question. Now, the tricky part is to get hold of these "limits"... sometimes they are advertised and sometimes you need to figure them out experimentally.
I understand this is "high level architectural guidelines" but I am not ready to be coding this for you... I hope you forgive me ;-)

"how can I load balance outgoing requests so that I don't hit any one host too often?"
Generally, you do this by designing a better algorithm.
For example, randomly scramble your requests.
Or shuffle them 'fairly' so so that you round-robin through the sources. That would be a simple list of queues where you dequeue one request from each host.

How should I stress test / load test a client server application? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 5 years ago.
Improve this question
I develop a client-server style, database based system and I need to devise a way to stress / load test the system. Customers inevitably want to know such things as:
• How many clients can a server support?
• How many concurrent searches can a server support?
• How much data can we store in the database?
• Etc.
Key to all these questions is response time. We need to be able to measure how response time and performance degrades as new load is introduced so that we could for example, produce some kind of pretty graph we could throw at clients to give them an idea what kind of performance to expect with a given hardware configuration.
Right now we just put out fingers in the air and make educated guesses based on what we already know about the system from experience. As the product is put under more demanding conditions, this is proving to be inadequate for our needs going forward though.
I've been given the task of devising a method to get such answers in a meaningful way. I realise that this is not a question that anyone can answer definitively but I'm looking for suggestions about how people have gone about doing such work on their own systems.
One thing to note is that we have full access to our client API via the Python language (courtesy of SWIG) which is a lot easier to work with than C++ for this kind of work.
So there we go, I throw this to the floor: really interested to see what ideas you guys can come up with!

Test 1: Connect and Disconnect clients like mad, to see how well you handle the init and end of sessions, and just how much your server will survive under spikes, also while doing this measure how many clients fail to connect. That is very important
Test 2: Connect clients and keep them logged on for say a week, doing random actions (FuzzTest). Time the round-trip of each action. Also keep record of the order of actions, because this way your "clients" will find loopholes in your usecases (very important, and VERY hard to test rationally).
Test 3 & 4: Determine major use cases for your system, and write up scripts that do these tasks. Then run several clients doing same task(test 3), and also several clients doing different tasks(test 4).
Series:
Now the other dimension you need here is amount of clients.
A nice series would be:
5,10,50,100,500,1000,5000,10000,...
This way you can get data for each series of tests with different work loads.
Also congrats on SWIGing your clients api to Python! That is a great way to get things ready.
Note: IBM has a sample of fuzz testing on Java, which is unrealted to your case, but will help you design a good fuzztest for your system

If you are comfortable coding tests in Python, I've found funkload to be very capable. You don't say your server is http-based, so you may have to adapt their test facilities to your own client/server style.
Once you have a test in Python, funkload can run it on many threads, monitoring response times, and summarizing for you at the end of the test.

For performance you are looking at two things: latency (the responsiveness of the application) and throughput (how many ops per interval). For latency you need to have an acceptable benchmark. For throughput you need to have a minimum acceptable throughput.
These are you starting points. For telling a client how many xyz's you can do per interval then you are going to need to know the hardware and software configuration. Knowing the production hardware is important to getting accurate figures. If you do not know the hardware configuration then you need to devise a way to map your figures from the test hardware to the eventual production hardware.
Without knowledge of hardware then you can really only observe trends in performance over time rather than absolutes.
Knowing the software configuration is equally important. Do you have a clustered server configuration, is it load balanced, is there anything else running on the server? Can you scale your software or do you have to scale the hardware to meet demand.
To know how many clients you can support you need to understand what is a standard set of operations. A quick test is to remove the client and write a stub client and the spin up as many of these as you can. Have each one connect to the server. You will eventually reach the server connection resource limit. Without connection pooling or better hardware you can't get higher than this. Often you will hit a architectural issue before here but in either case you have an upper bounds.
Take this information and design a script that your client can enact. You need to map how long your script takes to perform the action with respect to how long it will take the expected user to do it. Start increasing your numbers as mentioned above to you hit the point where the increase in clients causes a greater decrease in performance.
There are many ways to stress test but the key is understanding expected load. Ask your client about their expectations. What is the expected demand per interval? From there you can work out upper loads.
You can do a soak test with many clients operating continously for many hours or days. You can try to connect as many clients as you can as fast you can to see how well your server handles high demand (also a DOS attack).
Concurrent searches should be done through your standard behaviour searches acting on behalf of the client or, write a script to establish a semaphore that waits on many threads, then you can release them all at once. This is fun and punishes your database. When performing searches you need to take into account any caching layers that may exist. You need to test both caching and without caching (in scenarios where everyone makes unique search requests).
Database storage is based on physical space; you can determine row size from the field lengths and expected data population. Extrapolate this out statistically or create a data generation script (useful for your load testing scenarios and should be an asset to your organisation) and then map the generated data to business objects. Your clients will care about how many "business objects" they can store while you will care about how much raw data can be stored.
Other things to consider: What is the expected availability? What about how long it takes to bring a server online. 99.9% availability is not good if it takes two days to bring back online the one time it does go down. On the flip side a lower availablility is more acceptable if it takes 5 seconds to reboot and you have a fall over.

If you have the budget, LoadRunner would be perfect for this.

On a related note: Twitter recently OpenSourced their load-testing framework. Could be worth a go :)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.