Middleware to optimize postgres

Middleware to optimize postgres - python

In my company, we have an ingestion service written in Go whose job is to take messages from a HTTP end point and store them in Postgres. It receives a peak throughput of 50,000 messages/second. However, our database can handle a maximum of 30,000 messages/second.
Is it possible to write a middleware in Python to optimize this? If so please explain.

It seems to be pretty unrelated to Python or any particular programming language.
These are typical questions to be asked and answers to be given:
Are there duplicates? If yes, don't save every message immediately but rather wait for duplicates (for what some kind of RAM-originated cache is required, the simplest one is <thread-safe?> hashtable).
Batch your message into large enough packs and then dump them into PostgreSQL all-at-once. You have to determine what is "large enough" based on load tests.
Can you drop some of those messages? If your data is not of critical importance, or at least not all of it, then you may detect overload by tracking number of pending messages and start to throw incoming stuff away until load becomes acceptable.

Related

gRPC response streaming with repeated field in Python

I am currently designing an API that is supposed to handle relatively small messages but many data entries. There are operations to add, delete and list all items stored in a database.
Now to my question: I want to return all entries (up to 5 Million) in a short amount of time. I figured response streaming would be the way to go.
Does it make sense to stream messages with a repeated field to be able to return multiple entries in one message. So far i haven't seen any indication whether that is faster or not.
Example:
rpc ListDataSet (ListDataSetRequest) returns (stream ListDataSetResponse);
message ListDataSetResponse {
string transaction_id = 1;
repeated Entries entries = 2;
}
And in the server i would append a certain amount of entries to each message and yield the messages while looping over the list of entries to use a generator.
Any recommendations or tips would be appreciated

Yes, it makes sense to stream messages containing repeated fields.
From a performance perspective, you may want to consider benchmarking your alternatives to prove this to yourself.
gRPC lacks comprehensive best practices but one reads that smaller messages are better and 4MiB is often given as a good, notional upper bound.
One other thing to consider is that it's not just the performance of your servers but also of your clients that you need to consider.
A more common pattern (?) is to page large results and give control to the client to ask for next|other pages. This may be worth evaluating too.
For exceptionally "huge" (unspecified) results, you'd likely be better placed returning a reference in your gRPC message to an out-of-band (e.g. object storage) object.

What is the best way to handle multiple user requests when lot of back end calculation is involved?

Hi I am quite new to web application development. I have been designing an application where a user uploads a file, some calculation is done and an output table will be shown. This process takes approximately 5-6 seconds.
I am saving my data in sessions like this:
request.session ['data']=resultDATA.
And loading the data whenever I need from sessions like this:
resultDATA = request.session['data']
I dont need DATA once the user is signed out. So is approach correct to save user data (not involving passwords)?
My biggest problem is if n number of users upload their files at exact moment do the last user have to wait for n*6 seconds for his calculation to complete? If yes is there any solution for this?
Right now I am using django built-in web server.
Do I have to use a different server to solve this problem?

There are quiet some questions in this question, however i think they are related enough and concise enough to deserve an answer:
So is approach correct to save user data (not involving passwords)?
I don't see any problem with this approach, since it's volatile data and it's not sensitive.
My biggest problem is if n number of users upload their files at exact moment do the last user have to wait for n*6 seconds for his calculation to complete?
This shouldn't be an issue as you put it. obviously if your server is handling huge ammounts of traffic it will slow down, and it will take a bit longer than your usual 5-6 seconds. However it won't be n*6, the server should be able to handle multiple requests at once.
Do I have to use a different server to solve this problem?
No, but kind of yes... what i mean is that in development the built-in server is great. It does everything you need it to do, however when you decide to push the app into production, you'll need a proper server for it.
As a side note, try to see if you can improve the data collection time, because right now everything is running on your own PC, which means it will probably be faster than when you push it to production. When you "upload" a file to localhost it takes a lot less time than when you upload it to an actual server over the internet, so that's a thing to keep in mind.

Python Strategy for Large Scale Analysis (on-the-fly or deferred)

To analyze a large number of websites or financial data and pull out parametric data, what are the optimal strategies?
I'm classifying the following strategies as either "on-the-fly" or "deferred". Which is best?
On-the-fly: Process data on-the-fly and store parametric data into a database
Deferred: Store all the source data as ASCII into a file system and post process later, or with a processing-data-daemon
Deferred: Store all pages as a BLOB in a database to post-process later, or with a processing-data-daemon
Number 1 is simplest, especially if you only have a single server. Can #2 or #3 be more efficient with a single server, or do you only see the power with multiple servers?
Are there any python projects that are already geared toward this kind of analysis?
Edit: by best, I mean fastest execution to prevent user from waiting with ease of programming as secondary

I'd use celery either on a single or on multiple machines, with the "on-the-fly" strategy. You can have an aggregation Task, that fetches data, and a process Task that analyzes them and stores them in a db. This is a highly scalable approach, and you can tune it according to your computing power.
The "on-the-fly" strategy is more efficient in a sense that you process your data in a single pass. The other two involve an extra step, re-retrieve the data from where you saved them and process them after that.
Of course, everything depends on the nature of your data and the way you process them. If the process phase is slower than the aggregation, the "on-the-fly" strategy will hang and wait until completion of the processing. But again, you can configure celery to be asynchronous, and continue to aggregate while there are data yet unprocessed.

First: "fastest execution to prevent user from waiting" means some kind of deferred processing. Once you decide to defer the processing -- so the user doesn't see it -- the choice between flat-file and database is essentially irrelevant with respect to end-user-wait time.
Second: databases are slow. Flat files are fast. Since you're going to use celery and avoid end-user-wait time, however, the distinction between flat file and database becomes irrelevant.
Store all the source data as ASCII into a file system and post process later, or with a processing-data-daemon
This is fastest. Celery to load flat files.

Should I learn/use MapReduce, or some other type of parallelization for this task?

After talking with a friend of mine from Google, I'd like to implement some kind of Job/Worker model for updating my dataset.
This dataset mirrors a 3rd party service's data, so, to do the update, I need to make several remote calls to their API. I think a lot of time will be spent waiting for responses from this 3rd party service. I'd like to speed things up, and make better use of my compute hours, by parallelizing these requests and keeping many of them open at once, as they wait for their individual responses.
Before I explain my specific dataset and get into the problem, I'd like to clarify what answers I'm looking for:
Is this a flow that would be well suited to parallelizing with MapReduce?
If yes, would this be cost effective to run on Amazon's mapreduce module, which bills by the hour, and rounds hour's up when the job is complete? (I'm not sure exactly what counts as a "Job", so I don't know exactly how I'll be billed)
If no, Is there another system/pattern I should use? and Is there a library that will help me do this in python (On AWS, usign EC2 + EBS)?
Are there any problems you see with the way I've designed this job flow?
Ok, now onto the details:
The dataset consists of users who have favorite items and who follow other users. The aim is to be able to update each user's queue -- the list of items the user will see when they load the page, based on the favorite items of the users she follows. But, before I can crunch the data and update a user's queue, I need to make sure I have the most up-to-date data, which is where the API calls come in.
There are two calls I can make:
Get Followed Users -- Which returns all the users being followed by the requested user, and
Get Favorite Items -- Which returns all the favorite items of the requested user.
After I call get followed users for the user being updated, I need to update the favorite items for each user being followed. Only when all of the favorites are returned for all the users being followed can I start processing the queue for that original user. This flow looks like:
Jobs in this flow include:
Start Updating Queue for user -- kicks off the process by fetching the users followed by the user being updated, storing them, and then creating Get Favorites jobs for each user.
Get Favorites for user -- Requests, and stores, a list of favorites for the specified user, from the 3rd party service.
Calculate New Queue for user -- Processes a new queue, now that all the data has been fetched, and then stores the results in a cache which is used by the application layer.
So, again, my questions are:
Is this a flow that would be well suited to parallelizing with MapReduce? I don't know if it would let me start the process for UserX, fetch all the related data, and come back to processing UserX's queue only after that's all done.
If yes, would this be cost effective to run on Amazon's mapreduce module, which bills by the hour, and rounds hour's up when the job is complete? Is there a limit on how many "threads" I can have waiting on open API requests if I use their module?
If no, Is there another system/pattern I should use? and Is there a library that will help me do this in python (On AWS, usign EC2 + EBS?)?
Are there any problems you see with the way I've designed this job flow?
Thanks for reading, I'm looking forward to some discussion with you all.
Edit, in response to JimR:
Thanks for a solid reply. In my reading since I wrote the original question, I've leaned away from using MapReduce. I haven't decided for sure yet how I want to build this, but I'm beginning to feel MapReduce is better for distributing / parallelizing computing load when I'm really just looking to parallelize HTTP requests.
What would have been my "reduce" task, the part that takes all the fetched data and crunches it into results, isn't that computationally intensive. I'm pretty sure it's going to wind up being one big SQL query that executes for a second or two per user.
So, what I'm leaning towards is:
A non-MapReduce Job/Worker model, written in Python. A google friend of mine turned me onto learning Python for this, since it's low overhead and scales well.
Using Amazon EC2 as a compute layer. I think this means I also need an EBS slice to store my database.
Possibly using Amazon's Simple Message queue thingy. It sounds like this 3rd amazon widget is designed to keep track of job queues, move results from one task into the inputs of another and gracefully handle failed tasks. It's very cheap. May be worth implementing instead of a custom job-queue system.

The work you describe is probably a good fit for either a queue, or a combination of a queue and job server. It certainly could work as a set of MapReduce steps as well.
For a job server, I recommend looking at Gearman. The documentation isn't awesome, but the presentations do a great job documenting it, and the Python module is fairly self-explanatory too.
Basically, you create functions in the job server, and these functions get called by clients via an API. The functions can be called either synchronously or asynchronously. In your example, you probably want to asynchronously add the "Start update" job. That will do whatever preparatory tasks, and then asynchronously call the "Get followed users" job. That job will fetch the users, and then call the "Update followed users" job. That will submit all the "Get Favourites for UserA" and friend jobs together in one go, and synchronously wait for the result of all of them. When they have all returned, it will call the "Calculate new queue" job.
This job-server-only approach will initially be a bit less robust, since ensuring that you handle errors and any down servers and persistence properly is going to be fun.
For a queue, SQS is an obvious choice. It is rock-solid, and very quick to access from EC2, and cheap. And way easier to set up and maintain than other queues when you're just getting started.
Basically, you will put a message onto the queue, much like you would submit a job to the job server above, except you probably won't do anything synchronously. Instead of making the "Get Favourites For UserA" and so forth calls synchronously, you will make them asynchronously, and then have a message that says to check whether all of them are finished. You'll need some sort of persistence (a SQL database you're familiar with, or Amazon's SimpleDB if you want to go fully AWS) to track whether the work is done - you can't check on the progress of a job in SQS (although you can in other queues). The message that checks whether they are all finished will do the check - if they're not all finished, don't do anything, and then the message will be retried in a few minutes (based on the visibility_timeout). Otherwise, you can put the next message on the queue.
This queue-only approach should be robust, assuming you don't consume queue messages by mistake without doing the work. Making a mistake like that is hard to do with SQS - you really have to try. Don't use auto-consuming queues or protocols - if you error out, you might not be able to ensure that you put a replacement message back on the queue.
A combination of queue and job server may be useful in this case. You can get away with not having a persistence store to check job progress - the job server will allow you to track job progress. Your "get favourites for users" message could place all the "get favourites for UserA/B/C" jobs into the job server. Then, put a "check all favourites fetching done" message on the queue with a list of tasks that need to be complete (and enough information to restart any jobs that mysteriously disappear).
For bonus points:
Doing this as a MapReduce should be fairly easy.
Your first job's input will be a list of all your users. The map will take each user, get the followed users, and output lines for each user and their followed user:
"UserX" "UserA"
"UserX" "UserB"
"UserX" "UserC"
An identity reduce step will leave this unchanged. This will form the second job's input. The map for the second job will get the favourites for each line (you may want to use memcached to prevent fetching favourites for UserX/UserA combo and UserY/UserA via the API), and output a line for each favourite:
"UserX" "UserA" "Favourite1"
"UserX" "UserA" "Favourite2"
"UserX" "UserA" "Favourite3"
"UserX" "UserB" "Favourite4"
The reduce step for this job will convert this to:
"UserX" [("UserA", "Favourite1"), ("UserA", "Favourite2"), ("UserA", "Favourite3"), ("UserB", "Favourite4")]
At this point, you might have another MapReduce job to update your database for each user with these values, or you might be able to use some of the Hadoop-related tools like Pig, Hive, and HBase to manage your database for you.
I'd recommend using Cloudera's Distribution for Hadoop's ec2 management commands to create and tear down your Hadoop cluster on EC2 (their AMIs have Python set up on them), and use something like Dumbo (on PyPI) to create your MapReduce jobs, since it allows you to test your MapReduce jobs on your local/dev machine without access to Hadoop.
Good luck!

Seems that we're going with Node.js and the Seq flow control library. It was very easy to move from my map/flowchart of the process to a stubb of the code, and now it's just a matter of filling out the code to hook into the right APIs.
Thanks for the answers, they were a lot of help finding the solution I was looking for.

I am working with a similar problem that i need to solve. I was also looking at MapReduce and using the Elastic MapReduce service from Amazon.
I'm pretty convinced MapReduce will work for this problem. The implementation is where I'm getting hung up, becauase I'm not sure my reducer even needs to do anything.
I'll answer your questions as I understand your (and my) problem, and hopefully it helps.
Yes I think it'll be suited well. You could look at leveraging the Elastic MapReduce service's multiple steps option. You could use 1 Step to fetch a the people a user is following, and another step to compile a list of tracks for each of those followers, and the reducer for that 2nd step would probably be the one to build the cache.
Depends on how big your data-set is and how often you'll be running it. It's hard to say without knowing how big the data-set is (or is going to get) if it'll be cost effective or not. Initially, it'll probably be quite cost-effective, as you won't have to manage your own hadoop cluster, nor have to pay for EC2 instances (assuming that's what you use) to be up all the time. Once you reach the point where you're actually crunching this data for a long period of time, it probably will make less and less sense to use Amazon's MapReduce service, because you'll constantly have nodes online all the time.
A job is basically your MapReduce task. It can consist of multiple steps (each MapReduce task is a step). Once your data has been processed and all steps have been completed, your Job is done. So you're effectively paying for CPU time for each node in the Hadoop cluster. so, T*n where T is the Time (in hours) it takes to process your data, and n is the number of nodes you tell Amazon to spin up.
I hope this helps, good luck. I'd like to hear how you end up implementing your Mappers and Reducers, as I'm solving a very similar problem and I'm not sure my approach is really the best.

How should I stress test / load test a client server application? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 5 years ago.
Improve this question
I develop a client-server style, database based system and I need to devise a way to stress / load test the system. Customers inevitably want to know such things as:
• How many clients can a server support?
• How many concurrent searches can a server support?
• How much data can we store in the database?
• Etc.
Key to all these questions is response time. We need to be able to measure how response time and performance degrades as new load is introduced so that we could for example, produce some kind of pretty graph we could throw at clients to give them an idea what kind of performance to expect with a given hardware configuration.
Right now we just put out fingers in the air and make educated guesses based on what we already know about the system from experience. As the product is put under more demanding conditions, this is proving to be inadequate for our needs going forward though.
I've been given the task of devising a method to get such answers in a meaningful way. I realise that this is not a question that anyone can answer definitively but I'm looking for suggestions about how people have gone about doing such work on their own systems.
One thing to note is that we have full access to our client API via the Python language (courtesy of SWIG) which is a lot easier to work with than C++ for this kind of work.
So there we go, I throw this to the floor: really interested to see what ideas you guys can come up with!

Test 1: Connect and Disconnect clients like mad, to see how well you handle the init and end of sessions, and just how much your server will survive under spikes, also while doing this measure how many clients fail to connect. That is very important
Test 2: Connect clients and keep them logged on for say a week, doing random actions (FuzzTest). Time the round-trip of each action. Also keep record of the order of actions, because this way your "clients" will find loopholes in your usecases (very important, and VERY hard to test rationally).
Test 3 & 4: Determine major use cases for your system, and write up scripts that do these tasks. Then run several clients doing same task(test 3), and also several clients doing different tasks(test 4).
Series:
Now the other dimension you need here is amount of clients.
A nice series would be:
5,10,50,100,500,1000,5000,10000,...
This way you can get data for each series of tests with different work loads.
Also congrats on SWIGing your clients api to Python! That is a great way to get things ready.
Note: IBM has a sample of fuzz testing on Java, which is unrealted to your case, but will help you design a good fuzztest for your system

If you are comfortable coding tests in Python, I've found funkload to be very capable. You don't say your server is http-based, so you may have to adapt their test facilities to your own client/server style.
Once you have a test in Python, funkload can run it on many threads, monitoring response times, and summarizing for you at the end of the test.

For performance you are looking at two things: latency (the responsiveness of the application) and throughput (how many ops per interval). For latency you need to have an acceptable benchmark. For throughput you need to have a minimum acceptable throughput.
These are you starting points. For telling a client how many xyz's you can do per interval then you are going to need to know the hardware and software configuration. Knowing the production hardware is important to getting accurate figures. If you do not know the hardware configuration then you need to devise a way to map your figures from the test hardware to the eventual production hardware.
Without knowledge of hardware then you can really only observe trends in performance over time rather than absolutes.
Knowing the software configuration is equally important. Do you have a clustered server configuration, is it load balanced, is there anything else running on the server? Can you scale your software or do you have to scale the hardware to meet demand.
To know how many clients you can support you need to understand what is a standard set of operations. A quick test is to remove the client and write a stub client and the spin up as many of these as you can. Have each one connect to the server. You will eventually reach the server connection resource limit. Without connection pooling or better hardware you can't get higher than this. Often you will hit a architectural issue before here but in either case you have an upper bounds.
Take this information and design a script that your client can enact. You need to map how long your script takes to perform the action with respect to how long it will take the expected user to do it. Start increasing your numbers as mentioned above to you hit the point where the increase in clients causes a greater decrease in performance.
There are many ways to stress test but the key is understanding expected load. Ask your client about their expectations. What is the expected demand per interval? From there you can work out upper loads.
You can do a soak test with many clients operating continously for many hours or days. You can try to connect as many clients as you can as fast you can to see how well your server handles high demand (also a DOS attack).
Concurrent searches should be done through your standard behaviour searches acting on behalf of the client or, write a script to establish a semaphore that waits on many threads, then you can release them all at once. This is fun and punishes your database. When performing searches you need to take into account any caching layers that may exist. You need to test both caching and without caching (in scenarios where everyone makes unique search requests).
Database storage is based on physical space; you can determine row size from the field lengths and expected data population. Extrapolate this out statistically or create a data generation script (useful for your load testing scenarios and should be an asset to your organisation) and then map the generated data to business objects. Your clients will care about how many "business objects" they can store while you will care about how much raw data can be stored.
Other things to consider: What is the expected availability? What about how long it takes to bring a server online. 99.9% availability is not good if it takes two days to bring back online the one time it does go down. On the flip side a lower availablility is more acceptable if it takes 5 seconds to reboot and you have a fall over.

If you have the budget, LoadRunner would be perfect for this.

On a related note: Twitter recently OpenSourced their load-testing framework. Could be worth a go :)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.