Recommendation engine for a simple but data-heavy web app - Python

Problem:
I'm currently using python-recsys and its SVD algorithm to compute recommendations for my users. Computation is fairly quick (for now), but I'm wondering how it will behave once we go live. We have around 1 million products stored in MongoDB and are expecting around 100 users at the start. I've simulated situations like that, but randomly generated data doesn't really reflect real cases.
We use Redis to store the recommendations; they're computed every 2 hours in Celery tasks and are currently really memory-heavy, although I've done my best to optimize them.
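For concreteness, here is a minimal sketch of that two-hour pipeline, assuming python-recsys, Celery and redis-py; the ratings.csv export and the all_user_ids() helper are hypothetical stand-ins for data pulled from MongoDB:

    import json

    import redis
    from celery import Celery
    from recsys.algorithm.factorize import SVD

    app = Celery('recs', broker='redis://localhost:6379/0')
    store = redis.StrictRedis(host='localhost', port=6379, db=1)

    @app.task  # scheduled every 2 hours via celery beat
    def recompute_recommendations():
        svd = SVD()
        # ratings.csv (user,item,rating) is a hypothetical export from
        # MongoDB; rows are items and columns are users, hence is_row=False
        svd.load_data(filename='ratings.csv', sep=',',
                      format={'col': 0, 'row': 1, 'value': 2, 'ids': int})
        svd.compute(k=100, min_values=2, mean_center=True)
        for user_id in all_user_ids():  # hypothetical helper
            recs = svd.recommend(user_id, n=10, only_unknowns=True,
                                 is_row=False)
            # store a plain JSON list so the web tier does O(1) key lookups
            store.set('recs:%s' % user_id, json.dumps(recs))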
Worrying about the future, I'm planning to use Neo4j for that task, although it's pretty hard to find any real-life stories of developers using this DB for recommendations.
Generally, what I'd like to achieve is a reasonably well-working recommendation engine (Mahout would be overkill in this case, I guess) that is not too memory-hungry, because we cannot afford many servers.
How would Neo4j play with that problem? Are there any good Python drivers for that DB? Maybe it'd be better to keep the current MongoDB/Redis solution and tune it a little rather than add another DB to the stack? I was also considering using a separate machine just for the pure computation of recommendations - but is that a good choice?

Worrying about the future, I'm planning to use Neo4j for that task, although it's pretty hard to find any real-life stories of developers using this DB for recommendations.
http://seenickcode.com/switching-from-mongodb-to-neo4j/
How would Neo4j play with that problem? Are there any good Python drivers for that DB?
http://neo4j.com/developer/python/
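As a sketch of what the query side might look like with one of those drivers (py2neo is assumed here, and the User/Product labels and the BOUGHT relationship are hypothetical), a co-purchase recommendation is a short Cypher query:

    from py2neo import Graph

    graph = Graph()  # defaults to the local Neo4j server

    # rank products bought by users with overlapping purchase histories;
    # the schema (User, Product, BOUGHT) is a hypothetical example
    QUERY = """
    MATCH (u:User {id: {uid}})-[:BOUGHT]->(:Product)<-[:BOUGHT]-(other:User),
          (other)-[:BOUGHT]->(rec:Product)
    WHERE NOT (u)-[:BOUGHT]->(rec)
    RETURN rec.id AS product, count(*) AS score
    ORDER BY score DESC
    LIMIT 10
    """

    def recommendations_for(user_id):
        return [(r["product"], r["score"])
                for r in graph.run(QUERY, uid=user_id)]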

Related

I am interested in disproving some graph theory conjectures in Python; what is the most efficient library/server setup to use?

I am interested in implementing and running some heavy graph-theory algorithms in the hope of finding counterexamples to some conjectures.
What are the most efficient libraries and server setups you would recommend?
I am thinking of using Python's Graph API.
For running the algorithms I was thinking of using Hadoop, but from my research I get the feeling it is more appropriate for analysing databases than for enumeration problems.
If my thinking about Hadoop is correct, what is the best server setup you would recommend for running such a process?
Any leads on how to run an algorithm in a remote distributed environment that won't require a lot of code rewriting or cost a lot of money would be helpful.
Many thanks!
You could look at CUDA as another option if it is a highly computational task.
You could have a look at Neo4j, which is a NoSQL graph database. If your scalability constraints are strong, it could be a good choice.
The interface is REST-based, but some Python bindings exist too (see here).
You can have a look here for a blog with some graph-theory applications (a small study on scalability can be found here).
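To make the enumeration part of the question concrete, here is a hedged sketch using networkx (one common choice for a graph library in Python); conjecture_holds() is a trivial placeholder for a real conjecture:

    import itertools

    import networkx as nx

    def conjecture_holds(g):
        # placeholder predicate standing in for the real conjecture
        n = g.number_of_nodes()
        return g.number_of_edges() <= n * (n - 1) // 2

    def find_counterexample(n):
        # enumerate every labeled graph on n vertices (isomorphic
        # duplicates included) and test the conjecture on each
        all_edges = list(itertools.combinations(range(n), 2))
        for k in range(len(all_edges) + 1):
            for edges in itertools.combinations(all_edges, k):
                g = nx.Graph()
                g.add_nodes_from(range(n))
                g.add_edges_from(edges)
                if not conjecture_holds(g):
                    return g
        return None

    g = find_counterexample(5)
    if g is not None:
        print('counterexample edges:', sorted(g.edges()))
    else:
        print('no counterexample among graphs on 5 vertices')

The search space explodes combinatorially with n, which is exactly where a GPU (CUDA) or a distributed setup starts to matter.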

What's a good starting point to design an architecture with scalability in mind?

I'm currently about to start designing a new application.
The application will allow a user to insert some data and will provide data analysis (with reports as well). I know it's not much help, but the data processing will be done in post-processing, so it's not really interesting for the front-end.
I'd like to start on the right path, to help myself later when there's a need to scale to handle more users.
I'm thinking about PostgreSQL to store the data, because I've already used it and I like it. Even if a NoSQL store could be a good choice (since not all the data needs to be relational), I like the Postgres support and community, and I feel better knowing that there's a big community out there to help me. MySQL (InnoDB) is also a good choice; to be honest I have no real reason to choose it over PostgreSQL or vice versa (is MySQL maybe easier to shard?).
I know several programming languages, but my strengths are Python, C/C++, and JavaScript.
I'm not sure whether I should choose a sync or an async approach for this task (I could scale out by running more sync applications behind a load balancer).
I've already developed another big project that taught me a lot about concurrency, but there each choice was influenced by the (whole rest of the team, but mostly the) sysadmin's skills, so we used Python (Django) + uWSGI + nginx.
For this project (since it's totally different from the other - that was an e-commerce site, this is more of a SaaS) I was also considering making use of Node.js; it would be a good opportunity to try it out in a serious project.
The heaviest data processing will be done in post-processing, so the whole front-end (user website) will be mostly I/O-bound (+1 for an async environment).
What would you suggest?
PS. I must also keep in mind that first of all the project has to start, so I can't just think about every possible design; I should start writing code ASAP :-)
My current thoughts are:
- start with something you know
- keep it as simple as possible
- track everything to find bottlenecks
- scale out
So it wouldn't really matter whether I deploy sync or async, but I know async has much better performance, and anything that could help me get better results (ergo, lower costs) is worth evaluating.
I'm curious to hear what your experiences are (also with other technologies)...
I'm becoming paranoid about this scalability thing, and I fear it could lead me to a wrong design (it's also the first time I'm designing alone for a commercial purpose = FUD).
If you need some more info, please let me know and I'll try to give you an answer.
Thanks.
A good resource for all of this is http://highscalability.com/. Lots of interesting case studies about handling big web loads.
You didn't mention it, but you might want to think about hosting it in the cloud (Azure, Amazon, etc.). That makes scaling the hardware a little easier, and it's especially nice if your demand fluctuates.
Here are some basic guidelines:
Use as many async processes as possible, or at least design things so that they can be converted to async later.
Design processes so that they can be segregated onto different servers; this also ties into the point above. Say you have a web app with an intensive process: if that process is async, the main web server can queue the job and be done with it, and a separate server can pick the job up and process it (see the sketch after these guidelines). That way your main web servers are not affected. If you are resource-constrained, you can still run the background process on the same server until you have enough clients, and then spin it off to a different server.
Design for load balancing. If your app uses sessions, you should factor in how you will (or won't) replicate sessions. You don't have to - you could pin the user to a given server and forward all subsequent requests there - but you still have to design for it.
Have the ability to route load to different servers based on some predefined criteria. For example, since your app is a SaaS app, you could decide that certain clients go to Environment1 and certain other clients go to Environment2. A lot of the SaaS players do this - Salesforce, for example.
You don't necessarily have to do all this from the get-go, but having the ability will go a long way toward scaling your app when the time comes.
Also, remember that these approaches are not exclusive. You should design your app for all of them, but only implement each when required.
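Here is a minimal sketch of that "queue the job and be done with it" pattern, assuming RQ (Redis Queue) as the job queue; reports.generate_report is a hypothetical task module:

    import redis
    from rq import Queue

    q = Queue(connection=redis.StrictRedis())

    def handle_report_request(user_id):
        # the web process only enqueues and returns immediately; a worker
        # process (possibly on another server) picks the job up by running:
        #   rq worker
        q.enqueue('reports.generate_report', user_id)  # hypothetical task
        return 'report queued'

The same shape works with Celery or any other queue; the point is that the web tier never blocks on the heavy work.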
Take a look at the book The Art of Scalability; it was written by people who worked at eBay and PayPal.
Also take a look at this excellent presentation on scalability patterns and approaches.

Python/Django for an enterprise large-scale web-based system?

My company is highly dependent on Java and JSF; all projects since I was hired have been implemented using them. But most of those projects face problems related to performance and availability, so I am finally considering a shift to other technologies. I have tried to research this on the net, and I'm about to decide to try Python. But before I start, I would like to hear your opinion on whether Python would solve the performance problems we are facing.
To make things clear, the performance problems we mostly face are related to the GlassFish server and page loading. We are currently using ICEfaces, and we tried Woodstock back then. Additionally, I can't use .NET for policy-related reasons, and PHP is also out of the question due to some security leaks experienced in earlier projects.
So I am hoping to read about the pros and cons related to performance and availability, to help convince my boss and customers to move to Python.
I have some doubts that you will gain performance by using Django or a Python-based solution. I don't know the GlassFish server or how it scales up, but unless it is badly designed I don't see why it should perform badly.
From the explanation of your performance issues, it doesn't seem to be a problem of language speed but rather of server configuration and availability.
Assuming that your Java code is reasonably optimal (i.e. efficient and acceptably fast), you won't solve the problem by switching to some Python solution. Instead, you should invest some time in studying caching mechanisms and/or proxy solutions.
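For what it's worth, if the rewrite does end up on Django, its low-level cache API shows the shape of that caching advice (a sketch; compute_report() is a placeholder for whatever is expensive):

    from django.core.cache import cache

    def cached_report(key):
        result = cache.get(key)
        if result is None:
            result = compute_report(key)     # placeholder for the heavy work
            cache.set(key, result, 60 * 15)  # keep the result for 15 minutes
        return result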
Depending on how your server is set up, an additional piece of advice would be to let all the static content be served by a dedicated server such as Apache, nginx, or similar, and only leave the dynamic content to be interpreted by your GlassFish server.
Since your projects are written in Java, you are in theory using a language that can potentially be faster than Python; I don't see why a Python solution would perform better unless there is something wrong with the framework you are using.
If you want to talk about prototyping or faster development, then that's a different subject, discussed multiple times on Stack Overflow.

Optimization Techniques in Python

I recently developed a billing application for my company with Python/Django. For a few months everything was fine, but now I am observing that performance is dropping as more and more users use the application. The application is now very critical for the finance team, and they are after my life to sort out the performance issue. I have no option but to find a way to increase the performance of the billing application.
So, do you know any performance optimization techniques in Python that will really help me with this scalability issue?
We are using a MySQL database, and the app is hosted behind an Apache web server on a Linux box. What I have noticed, moreover, is that the overall application is slow, not just the database transactional part. For example, once the application is loaded it works fine, but if you navigate to another link in the application, it takes a whole lot of time.
And yes, we are using HTML, CSS and JavaScript.
As I said in a comment, you must start by finding out what part of your code is slow.
Nobody can help you without this information.
You can profile your code with the Python profilers, then come back to us with the results.
If it's a web app, the first suspect is generally the database. If it's a calculation-intensive GUI app, then look at the calculation algorithms first.
But remember that performance issues can be highly unintuitive, and therefore an objective assessment is the only way to go.
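A minimal sketch of what "profile it first" looks like with the standard library's cProfile; main() is a placeholder entry point for the slow code path:

    import cProfile
    import pstats

    cProfile.run('main()', 'profile.out')  # main() is a placeholder
    stats = pstats.Stats('profile.out')
    # top 20 entries by cumulative time: where the wall clock actually goes
    stats.sort_stats('cumulative').print_stats(20)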
OK, not entirely to the point, but before you go and start fixing things, make sure everyone understands the situation. It seems to me that they're putting some pressure on you to fix the "problem".
Well, first of all: when you wrote the application, did they specify the performance requirements? Did they tell you that operation X must take less than Y seconds to complete? Did they specify how many concurrent users must be supported without a performance penalty? If not, then tell them to back off, and that in iteration (phase, stage, whatever) one of the deployment the main goals were functionality and testing; phase two is performance improvements. Let them (with your help, obviously) come up with some non-functional requirements for the performance of your system.
By doing all this: a) you'll remove the pressure applied by the finance team (and I know they can be a real pain in the bum); b) both you and your clients will have a clear idea of what you mean by "performance"; c) you'll have a baseline against which you can measure your progress; and most importantly, d) you'll have some agreed time to implement/fix the performance issues.
PS. That aside, look at the indexing... :)
A surprising feature of Python is that Pythonic code is quite efficient. So, a few general hints (a short demonstration follows the list):
- Use built-ins and standard functions whenever possible; they're already quite well optimized.
- Try to use lazy generators instead of one-off temporary lists.
- Use numpy for vector arithmetic.
- Use Psyco if running on 32-bit x86.
- Write performance-critical loops in a lower-level language (C, Pyrex, Cython, etc.).
- When calling the same method on a collection of objects, get a reference to the class function and use it; it saves lookups in the objects' dictionaries (this one is a micro-optimization, and I'm not sure it's worth it).
And of course, if scalability is what matters:
- Use O(n) (or better) algorithms! Otherwise your system cannot scale linearly.
- Write multiprocessor-aware code. At some point you'll need to throw more computing power at it, and your software must be ready to use it!
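Two of those hints are easy to measure with the standard library's timeit (a hedged micro-benchmark sketch; absolute numbers will vary by machine and interpreter):

    import timeit

    # lazy generator vs. one-off temporary list
    t_list = timeit.timeit('sum([i * i for i in range(1000)])', number=2000)
    t_gen = timeit.timeit('sum(i * i for i in range(1000))', number=2000)

    # hoisting the bound-method lookup out of a hot loop
    t_slow = timeit.timeit('out = []\nfor i in items: out.append(i)',
                           setup='items = range(1000)', number=2000)
    t_fast = timeit.timeit(
        'out = []\nappend = out.append\nfor i in items: append(i)',
        setup='items = range(1000)', number=2000)

    print('list comp %.3fs, genexpr %.3fs' % (t_list, t_gen))
    print('attr lookup %.3fs, cached method %.3fs' % (t_slow, t_fast))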
Before you can "fix" something you need to know what is "broken". In software development that means profiling, profiling, profiling. Did I mention profiling? Without profiling you don't know where CPU cycles and wall-clock time are going. As others have said, to get more useful information you need to post the details of your entire stack: Python version, what you are using to store the data (MySQL, Postgres, flat files, etc.), which web server interface (CGI, FastCGI, WSGI, Passenger, etc.), and how you are generating the HTML, CSS and, presumably, JavaScript. Then you can get more specific answers for those tiers.
You may be interested in this document I found some time ago.
As personal advice, be as Pythonic as you can: lazy evaluation is the key, so learn to use iterators and generators.
For the type of application you are describing (a web application, probably backed by a database), your performance problems are unlikely to be language-specific. They are far more likely to stem from design or architecture issues, though they could be simple coding problems too.
To sort this out you need to figure out where the bottlenecks are in your application and for that you need some sort of profiler.
Once you have found your bottlenecks you will be in a much better position. You can then evaluate the problem areas for common issues, including:
- Design and architecture issues
- SQL anti-patterns
- Incorrect usage of your framework (perhaps relying on inappropriate defaults)
- Badly structured algorithms
The specifics of any solution are going to depend on the specifics of the bottlenecks you find.
http://wiki.python.org/moin/PythonSpeed/PerformanceTips
I optimized some Python code a while back; the most surprising thing to me was how much each function call costs. If you minimize function calls or replace loops with built-ins, you'll run much faster.
There are some great suggestions here… so let me suggest an implementation detail: I have found the runprofileserver command from django-command-extensions very convenient for profiling my Django code.
I am not sure if this would solve the problem, but you should have a look at Psyco.

Is Google App Engine a worthy platform for a Lifestreaming app? [closed]

I'm building a lifestreaming app that will involve pulling down lots of feeds for lots of users and running data-mining and machine-learning algorithms on the results. GAE's load-balanced and scalable hosting sounds like a good fit for a system that could eventually be moving around a LOT of data, but its lack of cron jobs is a nuisance. Would I be better off using Django on a co-lo and dealing with my own DB scaling?
While I cannot answer your question directly, my experience of building Microupdater (a news aggregator collecting a few hundred feeds on App Engine) may give you a little insight.
Fetching feeds. Fetching lots of feeds via cron jobs (the only solution until SDK 1.2.5) is neither efficient nor scalable, because there is a lower limit on job frequency (say 1 minute, so you could fetch at most 60 feeds hourly). With the latest SDK 1.2.5 there is the XMPP API, which I have not implemented yet. The most promising approach would be PubSubHubbub: you offer a callback URL and the hub notifies you of new entries in real time. There is a demo implementation on App Engine which you can play around with.
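A hedged sketch of such a callback on the old App Engine webapp framework; the /hubbub route is a hypothetical choice:

    from google.appengine.ext import webapp
    from google.appengine.ext.webapp.util import run_wsgi_app

    class HubCallback(webapp.RequestHandler):
        def get(self):
            # subscription verification: echo hub.challenge back to the hub
            self.response.out.write(self.request.get('hub.challenge'))

        def post(self):
            feed_xml = self.request.body  # new entries pushed by the hub
            # parse feed_xml (e.g. with Universal Feed Parser) and store it
            self.response.set_status(200)

    application = webapp.WSGIApplication([('/hubbub', HubCallback)])

    if __name__ == '__main__':
        run_wsgi_app(application)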
Parsing feeds. You may already know that parsing feeds is CPU-intensive. I use the Universal Feed Parser by Mark Pilgrim; when parsing a large feed (say, a public Google Reader topic), App Engine may fail to process all the entries, and my dashboard shows a lot of these CPU-limit warnings. Though that may just mean I haven't managed to optimize the code yet.
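For reference, the parsing side is only a few lines with the Universal Feed Parser; capping the batch per request (the entry slice below is an assumed workaround, not an official recipe) is one way to stay under the CPU quota:

    import feedparser

    d = feedparser.parse('http://example.com/feed.atom')  # placeholder URL
    for entry in d.entries[:25]:  # process in small batches per request
        print(entry.get('title', ''), entry.get('link', ''))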
All told, App Engine is not yet an ideal platform for a lifestreaming app, but that may change in the future.
It might change when they offer paid plans, but as it stands, App Engine is not good for CPU-intensive apps. It is designed to scale to handle a large number of requests, not necessarily a large amount of calculation per request. I am running into this issue with fairly minor calculations, and I fear I may have to start looking elsewhere as my data set grows.
(This is obviously pretty old; I'm responding just because it still comes up really high in related Google queries...)
I just started using App Engine and haven't been using it for tons of external requests. But I do know that the info above is probably a lot less valid now, and might not even still stand. They have relaxed the limits quite a bit since September '08 - check Aral Balkan's blog for his initial complaint about the above, and the later developments.
If your app relies solely on Django, then App Engine is a good bet. However, if you ever need to add C-enhanced libraries, you're up a creek. App Engine doesn't support things like PIL or ReportLab, which use C to speed up processing times. I'm only mentioning this because you may want to use C to speed up some of your routines in the long run.
If you decide to use a co-lo, check out WebFaction.com. They have great Django/Python support and they have no issue with you using the aforementioned libraries.
Take a look at Slicehost: they sell Xen-based virtualized server instances starting at $20.00/month:
We're just like you. Sick of oversold, underperforming, ancient hosting companies. We took matters into our own hands. We built a hosting company for people who know their stuff. Give us a box, give us bandwidth, give us performance and we get to work. Fast machines, RAID-10 drives, Tier-1 bandwidth and root access. Managed with a customized Xen VPS backend to ensure that your resources are protected and guaranteed.
It's great for starting a project on and scaling it out WITHOUT incurring the costs of a managed provider or colo.
No. If you need to pull lots of things down, App Engine isn't going to work so well. You can use it as a front end by putting your data in its store after doing your offline preprocessing, but you can't do much in the ~1 second you have per request without doing some really crazy things.
Your app would likely be better off on your own hosting.
Pulling feeds or doing calculations won't be a problem, but you'll soon have to pay for your account. App Engine includes Django, except you'll need to work with some adapters for the model part. It will surely save you from maintenance headaches.
