Capturing Implicit Signals of Interest in Django

Capturing Implicit Signals of Interest in Django - python

To set the background: I'm interested in:
Capturing implicit signals of interest in books as users browse around a site. The site is written in django (python) using mysql, memcached, ngnix, and apache
Let's say, for instance, my site sells books. As a user browses around my site I'd like to keep track of which books they've viewed, and how many times they've viewed them.
Not that I'd store the data this way, but ideally I could have on-the-fly access to a structure like:
{user_id : {book_id: number_of_views, book_id_2: number_of_views}}
I realize there are a few approaches here:
Some flat-file log
Writing an object to a database every time
Writing to an object in memcached
I don't really know the performance implications, but I'd rather not be writing to a database on every single page view, and the lag writing to a log and computing the structure later seems not quick enough to give good recommendations on-the-fly as you use the site, and the memcached appraoch seems fine, but there's a cost in keeping this obj in memory: you might lose it, and it never gets written somewhere 'permanent'.
What approach would you suggest? (doesn't have to be one of the above) Thanks!

If this data is not an unimportant statistic that might or might not be available I'd suggest taking the simple approach and using a model. It will surely hit the database everytime.
Unless you are absolutely positively sure these queries are actually degrading overall experience there is no need to worry about it. Even if you optimize this one, there's a good chance other unexpected queries are wasting more CPU time. I assume you wouldn't be asking this question if you were testing all other queries. So why risk premature optimization on this one?
An advantage of the model approach would be having an API in place. When you have tested and decided to optimize you can keep this API and change the underlying model with something else (which will most probably be more complex than a model).
I'd definitely go with a model first and see how it performs. (and also how other parts of the project perform)

What approach would you suggest? (doesn't have to be one of the above) Thanks!
hmmmm ...this like been in a four walled room with only one door and saying i want to get out of room but not through the only door...
There was an article i was reading sometime back (can't get the link now) that says memcache can handle huge (facebook uses it) sets of data in memory with very little degradation in performance...my advice is you will need to explore more on memcache, i think it will do the trick.

Either a document datastore (mongo/couchdb), or a persistent key value store (tokyodb, memcachedb etc) may be explored.
No definite recommendations from me as the final solution depends on multiple factors - load, your willingness to learn/deploy a new technology, size of the data...

Seems to me that one approach could be to use memcached to keep the counter, but have a cron running regularly to store the value from memcached to the db or disk. That way you'd get all the performance of memcached, but in the case of a crash you wouldn't lose more than a couple of minutes' data.

Related

In practice, how eventual is the "eventual consistency" in HRD?

I am in the process of migrating an application from Master/Slave to HRD. I would like to hear some comments from who already went through the migration.
I tried a simple example to just post a new entity without ancestor and redirecting to a page to list all entities from that model. I tried it several times and it was always consistent. Them I put 500 indexed properties and again, always consistent...
I was also worried about some claims of a limit of one 1 put() per entity group per second. I put() 30 entities with same ancestor (same HTTP request but put() one by one) and it was basically no difference from puting 30 entities without ancestor. (I am using NDB, could it be doing some kind of optimization?)
I tested this with an empty app without any traffic and I am wondering how much a real traffic would affect the "eventual consistency".
I am aware I can test "eventual consistency" on local development. My question is:
Do I really need to restructure my app to handle eventual consistency?
Or it would be acceptable to leave it the way it is because the eventual consistency is actually consistent in practice for 99%?

If you have a small app then your data probably live on the same part of the same disk and you have one instance. You probably won't notice eventual consistency. As your app grows, you notice it more. Usually it takes milliseconds to reach consistency, but I've seen cases where it takes an hour or more.
Generally, queries is where you notice it most. One way to reduce the impact is to query by keys only and then use ndb.get_multi() to load the entities. Fetching entities by keys ensures that you get the latest version of that entity. It doesn't guarantee that the keys list is strongly consistent, though. So you might get entities that don't match the query conditions, so loop through the entities and skip the ones that don't match.
From what I've noticed, the pain of eventual consistency grows gradually as your app grows. At some point you do need to take it seriously and update the critical areas of your code to handle it.

What's the worst case if you get inconsistent results?
Does a user see some unimportant info that's out of date? That's probably ok.
Will you miscalculate something important, like the price of something? Or the number of items in stock in a store? In that case, you would want to avoid that chance occurence.
From observation only, it seems like eventually consistent results show up more as your dataset gets larger, I suspect as your data is split across more tablets.
Also, if you're reading your entities back with get() requests by key/id, it'll always be consistent. Make sure you're doing a query to get eventually consistent results.

The replication speed is going to be primarily server-workload-dependent. Typically on an unloaded system the replication delay is going to be milliseconds.
But the idea of "eventually consistent" is that you need to write your app so that you don't rely on that; any replication delay needs to be allowable within the constraints of your application.

Preferred (or recommended) way to store large amounts of simulation configurations, runs values and final results

I am working with some network simulator. After making some extensions to it, I need to make a lot of different simulations and tests. I need to record:
simulation scenario configurations
values of some parameters (e.g. buffer sizes, signal qualities, position) per devices per time unit t
final results computed from those recorded values
Second data is needed to perform some visualization after simulation was performed (simple animation, showing some statistics over time).
I am using Python with matplotlib etc. for post-processing the data and for writing a proper app (now considering pyQt or Django, but this is not the topic of the question). Now I am wondering what would be the best way to store this data?
My first guess was to use XML files, but it can be too much overhead from the XML syntax (I mean, files can grow up to very big sizes, especially for the second part of the data type). So I tried to design a database... But this also seems to me to be not the proper way... Maybe a mix of both?
I have tried to find some clues in Google, but found nothing special. Have you ever had a need for storing such data? How have you done that? Is there any "design pattern" for that?

Separate concerns:
Apart from pondering on the technology to use for storing data (DBMS, CSV, or maybe one of the specific formats for scientific data), note that you have three very different kinds of data to manage:
Simulation scenario configurations: these are (typically) rather small, but they need to be simple to edit, simple to re-use, and should allow to reproduce a simulation run. Here, text or code files seem to be a good choice (these should also be version-controlled).
Raw simulation data: this is where you should be really careful if you are concerned with simulation performance, because writing 3 GB of data during a run can take a huge amount of time if implemented badly. One way to proceed would be to use existing file formats for this purpose (see below) and see if they work for you. If not, you can still use a DBMS. Also, it is usually a good idea to include a description of the scenario that generated the data (or at least a reference), as this helps you managing the results.
Data for post-processing: how to store this mostly depends on the post-processing tools. For example, if you already have a class structure for your visualization application, you could define a file format that makes it easy to read in the required data.
Look for existing solutions:
The problem you face (How to manage simulation data?) is fundamental and there are many potential solutions, each coming with certain trade-offs. As you are working in network simulation, check out what capabilities other tools used in your community provide. It could be that their developers ran into problems you are not even anticipating yet (regarding reproducibility etc.), and already found a good solution. For example, you could check out how OMNeT++ is handling simulation output: the simulation configurations are defined in a separate file, results are written to vec and sca files (depending on their nature). As far as I understood your problems with hierarchical data, this is supported as well (vectors get unique IDs and are associated with an attribute of some model entity).
Additional tools already work with these file formats, e.g. to convert them to other formats like CSV/MATLAB files, so you could even think of creating files in the same format (documented here) and to use existing tools/converters for post-processing.
Many other simulation tools will have similar features, so take a look at what would work best for you.

It sounds like you need to record more or less the same kinds of information for each case, so a relational database sounds like a good fit-- why do you think it's "not the proper way"?
If your data fits in a collection of CSV files, you're most of the way to a relational database already! Just store in database tables instead, and you have support for foreign keys and queries. If you go on to implement an object-oriented solution, you can initialize your objects from the database.

If your data structures are well-known and stable AND you need some of the SQL querying / computation features then a light-weight relational DB like SQLite might be the way to go (just make sure it can handle your eventual 3+GB data).
Else - ie, each simulation scenario might need a dedicated data structure to store the results -, and you don't need any SQL feature, then you might be better using a more free-form solution (document-oriented database, OO database, filesystem + csv, whatever).
Note that you can still use a SQL db in the second case, but you'll have to dynamically create tables for each resultset, and of course dynamically create the relevant SQL queries too.

Which is quicker? Memcache or file query? (using maxmind geoip.dat file)

I'm using Python on Appengine and am looking up the geolocation of an IP address like this:
import pygeoip
gi = pygeoip.GeoIP('GeoIP.dat')
Location = gi.country_code_by_addr(self.request.remote_addr)
(pygeoip can be found here: http://code.google.com/p/pygeoip/)
I want to geolocate each page of my app for a user so currently I lookup the IP address once then store it in memcache.
My question - which is quicker? Looking up the IP address each time from the .dat file or fetching it from memcache? Are there any other pros/cons I need to be aware of?
For general queries like this, is there a good guide to teach me how to optimise my code and run speed tests myself? I'm new to python and coding in general so apologies if this is a basic concept.
Thanks!
Tom
EDIT: Thanks for the responses, memcache seems to be the right answer. I think that Nick and Lennart are suggesting that I add the whole gi variable to memcache. I think this is possible. FYI - the whole GeoIP.dat file is just over 1MB so not that large.

What takes time there is rather loading the database from the dat file. Once you have that in memory, the lookup time is not significant. So if you can keep the gi variable in memory that seems the best solution.
If you can't you probably can't use memcached either.

If you need to do lookups across multiple processes (which you almost certainly do on AppEngine), and you are likely to encounter the same ip address lots of times in a short time span (which you probably are), then using memcache is probably a good idea for speed.
More details, since you said you were relatively new to coding:
As Lennart Regebro correctly says, the slow thing is reading the geoip file from disk and parsing it. Individual queries will then be fast. However, if any given process is only serving one request (which, from your perspective, on AppEngine, it is), then this price will get paid on each request. Caching recently used lookups in memcache will let you share this information across processes...but only for recently encountered data points. However, since any given ip is likely to show up in bursts (because it is one user interacting with your site), this is exactly what you want.
Other alternatives are to pre-load all the data points into memcache. You probably don't want to do this, since you have a limited amount of memory available, and you won't end up using most of it. (Also, memcache will throw parts of it away if you hit your memory limit, which means you'd need to write backup code to read from the geoip database live anyway.) In general, doing lazy caching -- look up a value the slow way when you first need it and then keep it around for re-use -- is a very effective mechanism. Memcache is specifically geared for this, since it throws away data that hasn't been used recently when it encounters memory pressure.
Another alternative in general (although not in AppEngine) is to run a separate process that handles just location queries, and having all your front-end processes talk to it (e.g. via thrift). Then you could use the suggestion of just loading up the geoip database in that process and querying it live for each request.
Hope that helps some.

For individual IP addresses that you already have gotten out of the database, I would put them in memcache for sure. I am assuming the database file is relatively large, and you don't want to load that from memcache every time you need to look up one address.
One tool I know people use to help track speed of API calls is AppStats. It can help you see how long various calls to the APIs are taking.
Since you are new to programming in general, I will mention that appstats is a very App Engine specific tool. If you were just writing a basic python application that was going to run on your own computer, you could do timing of things by simply subtracting two timestamps:
import time
t1 = time.time()
#do whatever it is you want to time here.
t2 = time.time()
elapsed_time = t2-t1

Comparing persistent storage solutions in python

I'm starting on a new scientific project which has a lot of data (millions of entries) I'd like to store in an easily and quickly accessible format. I've come across a number of different potential options, but I'm not sure how to pick amongst them. My data can probably just be stored as a dictionary, or potentially a dictionary of dictionaries. Some potential considerations:
Speed. I can't load all the data off disk every time I start a new script, and I'd like as quick access to random entries as possible.
Ease-of-use. This is python. The storage should feel like python.
Stability/maturity. I'd like something that's currently supported, although something that works well but is still in development would be fine.
Ease of installation. My sysadmin should be able to get this running on our cluster.
I don't really care that much about the size of the storage, but it could be a consideration if an option is really terrible on this front. Also, if it matters, I'll most likely be creating the database once, and thereafter only reading from it.
Some potential options that I've started looking at (see this post):
pyTables
ZopeDB
shove
shelve
redis
durus
Any suggestions on which of these might be better for my purposes? Any better ideas? Some of these have a back-end; any suggestions on which file-system back-end would be best?

Might want to give mongodb a shot - the PyMongo library works with dictionaries and supports most Python types. Easy to install, very performant + scalable. MongoDB (and PyMongo) is also used in production at some big names.

A RDBMS.
Nothing is more realiable than using tables on a well known RDBMS. Postgresql comes to mind.
That automatically gives you some choices for the future like clustering. Also you automatically have a lot of tools to administer your database, and you can use it from other software written in virtually any language.
It is really fast.
In the "feel like python" point, I might add that you can use an ORM. A strong name is sqlalchemy. Maybe with the elixir "extension".
Using sqlalchemy you can leave your user/sysadmin choose which database backend he wants to use. Maybe they already have MySql installed - no problem.
RDBMSs are still the best choice for data storage.

I'm working on such a project and I'm using SQLite.
SQLite stores everything in one file and is part of Python's standard library. Hence, installation and configuration is virtually for free (ease of installation).
You can easily manage the database file with small Python scripts or via various tools. There is also a Firefox plugin (ease of installation / ease-of-use).
I find it very convenient to use SQL to filter/sort/manipulate/... the data. Although, I'm not an SQL expert. (ease-of-use)
I'm not sure if SQLite is the fastes DB system for this work and it lacks some features you might need e.g. stored procedures.
Anyway, SQLite works for me.

if you really just need dictionary-like storage, some of the new key/value or column stores like Cassandra or MongoDB might provide a lot more speed than you'd get with a relational database. Of course if you decide to go with RDBMS, SQLAlchemy is the way to go (disclaimer: I am its creator), but your desired featurelist seems to lean in the direction of "I just want a dictionary that feels like Python" - if you aren't interested in relational queries or strong ACIDity, those facets of RDBMS will probably feel cumbersome.

Sqlite -- it comes with python, fast, widely availible and easy to maintain

If you only need simple (dict like) access mechanisms and need efficiency for processing a lot of data, then HDF5 might be a good option. If you are going to be using numpy then it is really worth considering.

Go with a RDBMS is reliable scalable and fast.
If you need a more scalabre solution and don't need the features of RDBMS, you can go with a key-value store like couchdb that has a good python api.

The NEMO collaboration (building a cosmic neutrino detector underwater) had much of the same problems, and they used mysql and postgresql without major problems.

It really depends on what you're trying to do. An RDBMS is designed for relational data, so if your data is relational, then use one of the various SQL options. But it sounds like your data is more oriented towards a key-value store with very fast random GET operations. If that's the case, compare the benchmarks of the various key-stores, focusing on the GET speed. The ideal key-value store will keep or cache requests in memory, and be able to handle many GET requests concurrently. You may actually want to create your own benchmark suite so you can effectively compare random concurrent GET operations.
Why do you need a cluster? Is the size of each value very large? If not, you shouldn't need a cluster to handle storage of a million entries. But if you're storing large blobs of data, that matters, and you may need something easily supports read slaves and/or transparent partitioning. Some of the key-value stores are document oriented and/or optimized for storing larger values. Redis is technically more storage efficient for larger values due to the indexing overhead required for fast GETs, but that doesn't necessarily mean it's slower. In fact, the extra indexing makes lookups faster.
You're the only one that can truly answer this question, and I strongly recommend putting together a custom benchmark suite to test available options with actual usage scenarios. The data you get from that will give you more insight than anything else.

How would one make Python objects persistent in a web-app?

I'm writing a reasonably complex web application. The Python backend runs an algorithm whose state depends on data stored in several interrelated database tables which does not change often, plus user specific data which does change often. The algorithm's per-user state undergoes many small changes as a user works with the application. This algorithm is used often during each user's work to make certain important decisions.
For performance reasons, re-initializing the state on every request from the (semi-normalized) database data quickly becomes non-feasible. It would be highly preferable, for example, to cache the state's Python object in some way so that it can simply be used and/or updated whenever necessary. However, since this is a web application, there several processes serving requests, so using a global variable is out of the question.
I've tried serializing the relevant object (via pickle) and saving the serialized data to the DB, and am now experimenting with caching the serialized data via memcached. However, this still has the significant overhead of serializing and deserializing the object often.
I've looked at shared memory solutions but the only relevant thing I've found is POSH. However POSH doesn't seem to be widely used and I don't feel easy integrating such an experimental component into my application.
I need some advice! This is my first shot at developing a web application, so I'm hoping this is a common enough issue that there are well-known solutions to such problems. At this point solutions which assume the Python back-end is running on a single server would be sufficient, but extra points for solutions which scale to multiple servers as well :)
Notes:
I have this application working, currently live and with active users. I started out without doing any premature optimization, and then optimized as needed. I've done the measuring and testing to make sure the above mentioned issue is the actual bottleneck. I'm sure pretty sure I could squeeze more performance out of the current setup, but I wanted to ask if there's a better way.
The setup itself is still a work in progress; assume that the system's architecture can be whatever suites your solution.

Be cautious of premature optimization.
Addition: The "Python backend runs an algorithm whose state..." is the session in the web framework. That's it. Let the Django framework maintain session state in cache. Period.
"The algorithm's per-user state undergoes many small changes as a user works with the application." Most web frameworks offer a cached session object. Often it is very high performance. See Django's session documentation for this.
Advice. [Revised]
It appears you have something that works. Leverage to learn your framework, learn the tools, and learn what knobs you can turn without breaking a sweat. Specifically, using session state.
Second, fiddle with caching, session management, and things that are easy to adjust, and see if you have enough speed. Find out whether MySQL socket or named pipe is faster by trying them out. These are the no-programming optimizations.
Third, measure performance to find your actual bottleneck. Be prepared to provide (and defend) the measurements as fine-grained enough to be useful and stable enough to providing meaningful comparison of alternatives.
For example, show the performance difference between persistent sessions and cached sessions.

I think that the multiprocessing framework has what might be applicable here - namely the shared ctypes module.
Multiprocessing is fairly new to Python, so it might have some oddities. I am not quite sure whether the solution works with processes not spawned via multiprocessing.

I think you can give ZODB a shot.
"A major feature of ZODB is transparency. You do not need to write any code to explicitly read or write your objects to or from a database. You just put your persistent objects into a container that works just like a Python dictionary. Everything inside this dictionary is saved in the database. This dictionary is said to be the "root" of the database. It's like a magic bag; any Python object that you put inside it becomes persistent."
Initailly it was a integral part of Zope, but lately a standalone package is also available.
It has the following limitation:
"Actually there are a few restrictions on what you can store in the ZODB. You can store any objects that can be "pickled" into a standard, cross-platform serial format. Objects like lists, dictionaries, and numbers can be pickled. Objects like files, sockets, and Python code objects, cannot be stored in the database because they cannot be pickled."
I have read it but haven't given it a shot myself though.
Other possible thing could be a in-memory sqlite db, that may speed up the process a bit - being an in-memory db, but still you would have to do the serialization stuff and all.
Note: In memory db is expensive on resources.
Here is a link: http://www.zope.org/Documentation/Articles/ZODB1

First of all your approach is not a common web development practice. Even multi threading is being used, web applications are designed to be able to run multi-processing environments, for both scalability and easier deployment .
If you need to just initialize a large object, and do not need to change later, you can do it easily by using a global variable that is initialized while your WSGI application is being created, or the module contains the object is being loaded etc, multi processing will do fine for you.
If you need to change the object and access it from every thread, you need to be sure your object is thread safe, use locks to ensure that. And use a single server context, a process. Any multi threading python server will serve you well, also FCGI is a good choice for this kind of design.
But, if multiple threads are accessing and changing your object the locks may have a really bad effect on your performance gain, which is likely to make all the benefits go away.

This is Durus, a persistent object system for applications written in the Python
programming language. Durus offers an easy way to use and maintain a consistent
collection of object instances used by one or more processes. Access and change of a
persistent instances is managed through a cached Connection instance which includes
commit() and abort() methods so that changes are transactional.
http://www.mems-exchange.org/software/durus/
I've used it before in some research code, where I wanted to persist the results of certain computations. I eventually switched to pytables as it met my needs better.

Another option is to review the requirement for state, it sounds like if the serialisation is the bottle neck then the object is very large. Do you really need an object that large?
I know in the Stackoverflow podcast 27 the reddit guys discuss what they use for state, so that maybe useful to listen to.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.