How should I stress test / load test a client server application? [closed]

How should I stress test / load test a client server application? [closed] - python

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 5 years ago.
Improve this question
I develop a client-server style, database based system and I need to devise a way to stress / load test the system. Customers inevitably want to know such things as:
• How many clients can a server support?
• How many concurrent searches can a server support?
• How much data can we store in the database?
• Etc.
Key to all these questions is response time. We need to be able to measure how response time and performance degrades as new load is introduced so that we could for example, produce some kind of pretty graph we could throw at clients to give them an idea what kind of performance to expect with a given hardware configuration.
Right now we just put out fingers in the air and make educated guesses based on what we already know about the system from experience. As the product is put under more demanding conditions, this is proving to be inadequate for our needs going forward though.
I've been given the task of devising a method to get such answers in a meaningful way. I realise that this is not a question that anyone can answer definitively but I'm looking for suggestions about how people have gone about doing such work on their own systems.
One thing to note is that we have full access to our client API via the Python language (courtesy of SWIG) which is a lot easier to work with than C++ for this kind of work.
So there we go, I throw this to the floor: really interested to see what ideas you guys can come up with!

Test 1: Connect and Disconnect clients like mad, to see how well you handle the init and end of sessions, and just how much your server will survive under spikes, also while doing this measure how many clients fail to connect. That is very important
Test 2: Connect clients and keep them logged on for say a week, doing random actions (FuzzTest). Time the round-trip of each action. Also keep record of the order of actions, because this way your "clients" will find loopholes in your usecases (very important, and VERY hard to test rationally).
Test 3 & 4: Determine major use cases for your system, and write up scripts that do these tasks. Then run several clients doing same task(test 3), and also several clients doing different tasks(test 4).
Series:
Now the other dimension you need here is amount of clients.
A nice series would be:
5,10,50,100,500,1000,5000,10000,...
This way you can get data for each series of tests with different work loads.
Also congrats on SWIGing your clients api to Python! That is a great way to get things ready.
Note: IBM has a sample of fuzz testing on Java, which is unrealted to your case, but will help you design a good fuzztest for your system

If you are comfortable coding tests in Python, I've found funkload to be very capable. You don't say your server is http-based, so you may have to adapt their test facilities to your own client/server style.
Once you have a test in Python, funkload can run it on many threads, monitoring response times, and summarizing for you at the end of the test.

For performance you are looking at two things: latency (the responsiveness of the application) and throughput (how many ops per interval). For latency you need to have an acceptable benchmark. For throughput you need to have a minimum acceptable throughput.
These are you starting points. For telling a client how many xyz's you can do per interval then you are going to need to know the hardware and software configuration. Knowing the production hardware is important to getting accurate figures. If you do not know the hardware configuration then you need to devise a way to map your figures from the test hardware to the eventual production hardware.
Without knowledge of hardware then you can really only observe trends in performance over time rather than absolutes.
Knowing the software configuration is equally important. Do you have a clustered server configuration, is it load balanced, is there anything else running on the server? Can you scale your software or do you have to scale the hardware to meet demand.
To know how many clients you can support you need to understand what is a standard set of operations. A quick test is to remove the client and write a stub client and the spin up as many of these as you can. Have each one connect to the server. You will eventually reach the server connection resource limit. Without connection pooling or better hardware you can't get higher than this. Often you will hit a architectural issue before here but in either case you have an upper bounds.
Take this information and design a script that your client can enact. You need to map how long your script takes to perform the action with respect to how long it will take the expected user to do it. Start increasing your numbers as mentioned above to you hit the point where the increase in clients causes a greater decrease in performance.
There are many ways to stress test but the key is understanding expected load. Ask your client about their expectations. What is the expected demand per interval? From there you can work out upper loads.
You can do a soak test with many clients operating continously for many hours or days. You can try to connect as many clients as you can as fast you can to see how well your server handles high demand (also a DOS attack).
Concurrent searches should be done through your standard behaviour searches acting on behalf of the client or, write a script to establish a semaphore that waits on many threads, then you can release them all at once. This is fun and punishes your database. When performing searches you need to take into account any caching layers that may exist. You need to test both caching and without caching (in scenarios where everyone makes unique search requests).
Database storage is based on physical space; you can determine row size from the field lengths and expected data population. Extrapolate this out statistically or create a data generation script (useful for your load testing scenarios and should be an asset to your organisation) and then map the generated data to business objects. Your clients will care about how many "business objects" they can store while you will care about how much raw data can be stored.
Other things to consider: What is the expected availability? What about how long it takes to bring a server online. 99.9% availability is not good if it takes two days to bring back online the one time it does go down. On the flip side a lower availablility is more acceptable if it takes 5 seconds to reboot and you have a fall over.

If you have the budget, LoadRunner would be perfect for this.

On a related note: Twitter recently OpenSourced their load-testing framework. Could be worth a go :)

Related

Middleware to optimize postgres

In my company, we have an ingestion service written in Go whose job is to take messages from a HTTP end point and store them in Postgres. It receives a peak throughput of 50,000 messages/second. However, our database can handle a maximum of 30,000 messages/second.
Is it possible to write a middleware in Python to optimize this? If so please explain.

It seems to be pretty unrelated to Python or any particular programming language.
These are typical questions to be asked and answers to be given:
Are there duplicates? If yes, don't save every message immediately but rather wait for duplicates (for what some kind of RAM-originated cache is required, the simplest one is <thread-safe?> hashtable).
Batch your message into large enough packs and then dump them into PostgreSQL all-at-once. You have to determine what is "large enough" based on load tests.
Can you drop some of those messages? If your data is not of critical importance, or at least not all of it, then you may detect overload by tracking number of pending messages and start to throw incoming stuff away until load becomes acceptable.

Clarification of use-cases for Hadoop versus RabbitMQ+Celery

I know that there are similar questions to this, such as:
https://stackoverflow.com/questions/8232194/pros-and-cons-of-celery-vs-disco-vs-hadoop-vs-other-distributed-computing-packag
Differentiate celery, kombu, PyAMQP and RabbitMQ/ironMQ
but I'm asking this because I'm looking for a more particular distinction backed by a couple of use-case examples, please.
So, I'm a python user who wants to make programs that either/both:
Are too large to
Take too long to
do on a single machine, and process them on multiple machines. I am familiar with the (single-machine) multiprocessing package in python, and I write mapreduce style code right now. I know that my function, for example, is easily parallelizable.
In asking my usual smart CS advice-givers, I have phrased my question as:
"I want to take a task, split it into a bunch of subtasks that are executed simultaneously on a bunch of machines, then those results to be aggregated and dealt with according to some other function, which may be a reduce, or may be instructions to serially add to a database, for example."
According to this break-down of my use-case, I think I could equally well use Hadoop or a set of Celery workers + RabbitMQ broker. However, when I ask the sage advice-givers, they respond to me as if I'm totally crazy to look at Hadoop and Celery as comparable solutions. I've read quite a bit about Hadoop, and also about Celery---I think I have a pretty good grasp on what both do---what I do not seem to understand is:
Why are they considered so separate, so different?
Given that they seem to be received as totally different technologies---in what ways? What are the use cases that distinguish one from the other or are better for one than another?
What problems could be solved with both, and what areas would it be particularly foolish to use one or the other for?
Are there possibly better, simpler ways to achieve multiprocessing-like Pool.map()-functionality to multiple machines? Let's imagine my problem is not constrained by storage, but by CPU and RAM required for calculation, so there isn't an issue in having too little space to hold the results returned from the workers. (ie, I'm doing something like simulation where I need to generate a lot of things on the smaller machines seeded by a value from a database, but these are reduced before they return to the source machine/database.)
I understand Hadoop is the big data standard, but Celery also looks well supported; I appreciate that it isn't java (the streaming API python has to use for hadoop looked uncomfortable to me), so I'd be inclined to use the Celery option.

They are the same in that both can solve the problem that you describe (map-reduce). They are different in that Hadoop is entirely build to solve only that usecase and Celey/RabbitMQ is build to facilitate Task execution on different nodes using message passing. Celery also supports different usecases.
Hadoop is solving the map-reduce problem by having a large and special filesystem from which the mapper takes its data, sends it to a bunch of map nodes and reduces it to that filesystem. This has the advantage that it is really fast in doing this. The downsides are that it only operates on text based data input, Python is not really supported and that if you can't do (slightly) different usecases.
Celery is a message based task executor. In it you define tasks and group them together in a workflow (which can be a map-reduce workflow). Its advantages are that it is python based, that you can stitch tasks together in a custom workflow. Disadvantages are its reliance on single broker/result backend and its setup time.
So if you have a couple of Gb's worth of logfiles and don't care to write in Java and have some servers to spare that are exclusively used to run Hadoop, use that. If you want flexibility in running workflowed tasks use Celery. Or.....
Yes! There is a new project from one of the companies that helped create the messaging protocol AMQP that is used by RabbitMQ (and others). It is called ZeroMQ and it takes distributed messaging/execution to the next level by strangely going down a level in abstraction compared to Celery. It defines sockets that you can link together in various ways to create messaging links between nodes. Anything you want to do with these messages is up to you to write. Although this might sounds like "what good is a thin wrapper around a socket" it is actually at the right level of abstraction. Right now at our company we are factoring out all our celery messaging and rebuilding it with ZeroMQ. We found that Celery is just too opinionated about how tasks should be executed and that the setup/config in general is a pain. Also that broker in the middle that has to handle all traffic was becoming to much of a bottleneck.
Resume:
Count the occurrences of "the" in a book with as less programming as possible and lots of setup/config time: Hadoop
Create atomic Tasks and be able to have them work together with not to much programming and a lot of setup/config time: Celery
Have complete control over what to do with your messages and how to program them with almost no setup/config time: ZeroMQ
Have pain with no setup/config time: Sockets

Writing a Python data analysis server for a Java interface

I want to write data analysis plugins for a Java interface. This interface is potentially run on different computers. The interface will send commands and the Python program can return large data. The interface is distributed by a Java Webstart system. Both access the main data from a MySQL server.
What are the different ways and advantages to implement the communication? Of course, I've done some research on the internet. While there are many suggestions I still don't know what the differences are and how to decide for one. (I have no knowledge about them)
I've found a suggestion to use sockets, which seems fine. Is it simple to write a server that dedicates a Python analysis process for each connection (temporary data might be kept after one communication request for that particular client)?
I was thinking to learn how to use sockets and pass YAML strings.
Maybe my main question is: What is the relation to and advantage of systems like RabbitMQ, ZeroMQ, CORBA, SOAP, XMLRPC?
There were also suggestions to use pipes or shared memory. But that wouldn't fit to my requirements?
Does any of the methods have advantages for debugging or other pecularities?
I hope someone can help me understand the technology and help me decide on a solution, as it is hard to judge from technical descriptions.
(I do not consider solutions like Jython, JEPP, ...)

Offering an opinion on the merits you described, it sounds like you are dealing with potentially large data/queries that may take a lot of time to fetch and serialize, in which case you definitely want to go with something that can handle concurrent connections without stacking up threads. Thereby, in the Python domain, I can't recommend any networking library other than Twisted.
http://twistedmatrix.com/documents/current/core/examples/
Whether you decide to use vanilla HTTP or your own protocol, twisted is pretty much the one stop shop for concurrent networking. Sure, the name gets thrown around alot, and the documentation is Atlantean, but if you take the time to learn it there is very little in the networking domain you cannot accomplish. You can extend the base protocols and factories to make one server that can handle your data in a reactor-based event loop and respond to deferred request when ready.
The serialization format really depends on the nature of the data. Will there be any binary in what is output as a response? Complex types? That rules out JSON if so, though that is becoming the most common serialization format. YAML sometimes seems to enjoy a position of privilege among the python community - I haven't used it extensively as most of the kind of work I've done with serials was data to be rendered in a frontend with javascript.
Message queues are really the most important tool in the toolbox when you need to defer background tasks without hanging response. They are commonly employed in web apps where the HTTP request should not hang until whatever complex processing needs to take place completes, so the UI can render early and count on an implicit "promise" the processing will take place. They have two important traits: they rely on eventual consistency, in that the process can finish long after the response in the protocol is sent, and they also have fail-safe and try-again directives should a task fail. They are where you turn in the "do this really hard task as soon as you can and I trust you to get it done" problem domain.
If we are not talking about potentially HUGE response bodies, and relatively simple data types within the serialized output, there is nothing wrong with rolling a simple HTTP deferred server in Twisted.

Python parallel processing libraries

Python seems to have many different packages available to assist one in parallel processing on an SMP based system or across a cluster. I'm interested in building a client server system in which a server maintains a queue of jobs and clients (local or remote) connect and run jobs until the queue is empty. Of the packages listed above, which is recommended and why?
Edit: In particular, I have written a simulator which takes in a few inputs and processes things for awhile. I need to collect enough samples from the simulation to estimate a mean within a user specified confidence interval. To speed things up, I want to be able to run simulations on many different systems, each of which report back to the server at some interval with the samples that they have collected. The server then calculates the confidence interval and determines whether the client process needs to continue. After enough samples have been gathered, the server terminates all client simulations, reconfigures the simulation based on past results, and repeats the processes.
With this need for intercommunication between the client and server processes, I question whether batch-scheduling is a viable solution. Sorry I should have been more clear to begin with.

Have a go with ParallelPython. Seems easy to use, and should provide the jobs and queues interface that you want.

There are also now two different Python wrappers around the map/reduce framework Hadoop:
http://code.google.com/p/happy/
http://wiki.github.com/klbostee/dumbo
Map/Reduce is a nice development pattern with lots of recipes for solving common patterns of problems.
If you don't already have a cluster, Hadoop itself is nice because it has full job scheduling, automatic data distribution of data across the cluster (i.e. HDFS), etc.

Given that you tagged your question "scientific-computing", and mention a cluster, some kind of MPI wrapper seems the obvious choice, if the goal is to develop parallel applications as one might guess from the title. Then again, the text in your question suggests you want to develop a batch scheduler. So I don't really know which question you're asking.

The simplest way to do this would probably just to output the intermediate samples to separate files (or a database) as they finish, and have a process occasionally poll these output files to see if they're sufficient or if more jobs need to be submitted.

Outgoing load balancer

I have a big threaded feed retrieval script in python.
My question is, how can I load balance outgoing requests so that I don't hit any one host too often?
This is a big problem for feedburner, since a large percentage of sites proxy their RSS through feedburner and to further complicate matters many sites will alias a subdomain on their domain to feedburner to obscure the fact that they're using it (e.g. "mysite" sets its RSS url to feeds.mysite.com/mysite, where feeds.mysite.com bounces to feedburner). Sometimes it blocks me for awhile and redirects to their "automated requests" error page.

You should probably do a one-time request (per week/month, whatever fits). for each feed and follow redirects to get the "true" address. Regardless of your throttling situation at the time, you should be able to resolve all feeds, save that data and then just do it once for every new feed you add to the list. You can look at urllib's geturl() as it returns the final url from the URL you put into it. When you do ping the feeds, be sure to use the original (keep the "real" simply for load-balancing) to make sure it redirects properly if the user has moved it or similar.
Once that is done, you can simply devise a load mechanism such as only X requests per hour for a given domain, going through each feed and skipping feeds whose hosts have hit the limit. If feedburner keeps their limits public (not likely) you can use that for X, but otherwise you will just have to estimate it and make a rough estimate that you know to be below the limit. Knowing google however, their limits might measure patterns and not have a specific hard limit.
Edit: Added suggestion from comment.

If your problem is related to Feedburner "throttling you", it most certainly does this because of the source IP of your bot. The way to "load balance to Feedburner" would be to have multiple different source IPs to start from.
Now there are numerous ways to achieving this, 2 of them being:
Multi-homed server: multiple IPs on the same machine
Multiple discrete machines
Of course, don't you go a put a NAT box in front of them now ;-)
The above takes care of the possible "throttling problems", now for the "scheduling part". You should maintain a "virtual scheduler" per "destination" and make sure not to exceed the parameters of the Web Service (e.g. Feedburner) in question. Now, the tricky part is to get hold of these "limits"... sometimes they are advertised and sometimes you need to figure them out experimentally.
I understand this is "high level architectural guidelines" but I am not ready to be coding this for you... I hope you forgive me ;-)

"how can I load balance outgoing requests so that I don't hit any one host too often?"
Generally, you do this by designing a better algorithm.
For example, randomly scramble your requests.
Or shuffle them 'fairly' so so that you round-robin through the sources. That would be a simple list of queues where you dequeue one request from each host.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.