I'm currently running into an issue integrating ElasticSearch and MongoDB. Essentially I need to convert a number of Mongo documents into searchable documents matching my ElasticSearch query. That part is luckily trivial and taken care of. My problem is that I need this to be fast: faster than network round-trips allow. I would really like to index around 100 docs/second, which simply isn't possible with a separate network call to Mongo for each document.
I was able to speed this up a lot by using ElasticSearch's bulk indexing, but that's only half of the problem. Is there any way to either bundle reads or cache a collection (a manageable part of a collection, as this collection is larger than I would like to keep in memory) to help speed this up? I was unable to really find any documentation about this, so if you can point me towards relevant documentation I consider that a perfectly acceptable answer.
I would prefer a solution that uses Pymongo, but I would be more than happy to use something that directly talks to MongoDB over requests or something similar. Any thoughts on how to alleviate this?
PyMongo is thread-safe, so you can run multiple queries in parallel. (I assume that you can somehow partition your document space.)
Feed the results to a local Queue if processing the result needs to happen in a single thread.
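A minimal sketch of that pattern, assuming numeric _id values you can range-partition on (the database name, collection name, and ranges here are invented):

import threading
from queue import Queue

from pymongo import MongoClient

client = MongoClient()          # MongoClient is thread-safe and pools connections
coll = client.mydb.docs         # hypothetical database and collection
results = Queue()

def read_partition(lo, hi):
    # each thread pulls its own slice of the _id space
    for doc in coll.find({"_id": {"$gte": lo, "$lt": hi}}):
        results.put(doc)

threads = [threading.Thread(target=read_partition, args=(i, i + 25000))
           for i in range(0, 100000, 25000)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# a single consumer can now drain `results` and feed Elasticsearch's bulk API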
I'm currently working on something where data is produced, with a fair amount of processing, within SQL Server, and that I ultimately need to manipulate within Python to complete the task. It seems to me I have a couple of different options:
(1) Run the SQL code from within Python, manipulate output
(2) Create an SP in SSMS, run the SP from within Python, manipulate output
(3) ?
The second seems cleanest, but I wonder if there's a better way to achieve my objective without needing to create a stored procedure every time I need SQL data in Python. Copying the entirety of the SQL code into Python seems similarly kludgy, particularly for larger or complex queries.
For those with more experience working between the two: can you offer any suggestions on workflow?
There is no silver bullet.
It really depends on the specifics of what you're doing. What amount of data are we talking? Is it even feasible to stream it all over the network, through Python, and back? How much more load can the database server handle? How complex are the manipulations you consider doing in Python? How proficient are you and your team in SQL, and in Python?
I've seen both approaches in production, and one slight advantage that sometimes gets overlooked is that when you have all the SQL nicely formatted inside your Python program, it's automatically under version control, and you can check who edited what last and is thus to blame for the latest SNAFU ;-)
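For what it's worth, option (1) can stay reasonably tidy if you keep each query in its own .sql file (or module-level constant) under version control and parameterize it. A sketch using pyodbc, where the connection string, table, and column names are all placeholders:

import pyodbc

# hypothetical connection string; adjust driver/server/database to your setup
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes;"
)

# keep the SQL in one place, e.g. read from a versioned .sql file
SQL = "SELECT id, value FROM dbo.Results WHERE run_date = ?"

cursor = conn.cursor()
rows = cursor.execute(SQL, "2015-01-01").fetchall()  # parameterized, not string-pasted
for row in rows:
    print(row.id, row.value)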
I am finding Neo4j slow to add nodes and relationships/arcs/edges when using the REST API via py2neo for Python. I understand that this is due to each REST API call executing as a single self-contained transaction.
Specifically, adding a few hundred pairs of nodes with relationships between them takes a number of seconds, running on localhost.
What is the best approach to significantly improve performance whilst staying with Python?
Would using bulbflow and Gremlin be a way of constructing a bulk insert transaction?
Thanks!
There are several ways to do a bulk create with py2neo, each making only a single call to the server.
(1) Use the create method to build a number of nodes and relationships in a single batch (a sketch of this option follows below).
(2) Use a Cypher CREATE statement.
(3) Use the new WriteBatch class (just released this week) to manually make a batch of nodes and relationships (this is really just a manual version of (1)).
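A sketch of option (1), against the py2neo 1.x API of that era (call signatures differ in later releases); everything here goes to the server in a single call:

from py2neo import neo4j

graph_db = neo4j.GraphDatabaseService("http://localhost:7474/db/data/")

# abstract nodes are dicts; abstract relationships are (start, type, end)
# tuples, where integers refer to positions earlier in the same batch
alice, bob, knows = graph_db.create(
    {"name": "Alice"},
    {"name": "Bob"},
    (0, "KNOWS", 1),
)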
If you have some code, I'm happy to look at it and make suggestions on performance tweaks. There are also quite a few tests you may be able to get inspiration from.
Cheers,
Nige
Neo4j's write performance is slow unless you are doing a batch insert.
The Neo4j batch importer (https://github.com/jexp/batch-import) is the fastest way to load data into Neo4j. It's a Java utility, but you don't need to know any Java because you're just running the executable. It handles typed data and indexes, and it imports from a CSV file.
To use it with Bulbs (http://bulbflow.com/) Models, use the model's get_bundle() method to get the data, index name, and index keys, which come prepared for insert, and then output the data to a CSV file. Or if you don't want to model your data, just output your data from Python to the CSV file.
Will that work for you?
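If you go the CSV route without Bulbs, the standard library is enough; a sketch (the file name, columns, and delimiter are made up, so match them to how you configure the importer):

import csv

nodes = [
    {"id": 1, "name": "Alice"},
    {"id": 2, "name": "Bob"},
]

with open("nodes.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "name"])          # header row the importer will read
    for node in nodes:
        writer.writerow([node["id"], node["name"]])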
There are so many old answers to this question online that it took me forever to realize there's an import tool that comes with Neo4j. It's very fast and the best tool I was able to find.
Here's a simple example if we want to import student nodes:
bin/neo4j-import --into [path-to-your-neo4j-directory]/data/graph.db --nodes students
The students file contains data that looks like this, for example:
studentID:ID(Student),name,year:int,:LABEL
1111,Amy,2000,Student
2222,Jane,2012,Student
3333,John,2013,Student
Explanation:
The header explains how the data below it should be interpreted.
studentID:ID(Student) marks studentID as the unique node identifier within the Student ID space (it is also stored as a property).
name is of type string, which is the default.
year is an integer.
:LABEL is the label you want for these nodes, in this case "Student".
Here's the documentation for it: http://neo4j.com/docs/stable/import-tool-usage.html
Note: I realize the question specifically mentions Python, but another useful answer here also suggests a non-Python solution, so this seemed worth adding.
Well, I myself needed massive performance from Neo4j. I ended up doing the following things to improve graph performance.
(1) Ditched py2neo, since there were a lot of issues with it. Besides, it is very convenient to use the REST endpoint provided by Neo4j directly; just make sure to use request sessions.
(2) Used raw Cypher queries for bulk insert, instead of any OGM (Object-Graph Mapper). That is crucial if you need a high-performance system.
(3) Performance was still not enough for my needs, so I ended up writing a custom system that merges 6-10 queries together using WITH and UNION clauses. That improved performance by a factor of 3 to 5.
(4) Used a larger transaction size, with at least 1000 queries per transaction. A sketch of (1), (2) and (4) combined follows below.
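This is a rough sketch assuming a Neo4j 2.x server and its transactional HTTP endpoint; the label, property names, and batch size are illustrative:

import requests

session = requests.Session()  # reuses the TCP connection across calls
url = "http://localhost:7474/db/data/transaction/commit"

people = [{"name": "p%d" % i} for i in range(1000)]

# one parameterized statement per node, all committed in a single transaction
payload = {"statements": [
    {"statement": "CREATE (n:Person {props})", "parameters": {"props": p}}
    for p in people
]}
resp = session.post(url, json=payload)
resp.raise_for_status()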
To insert a bulk of nodes at very high speed into Neo4j, use the Batch Inserter:
http://neo4j.com/docs/stable/batchinsert-examples.html
In my case I'm working in Java.
I'm using Python on Appengine and am looking up the geolocation of an IP address like this:
import pygeoip
gi = pygeoip.GeoIP('GeoIP.dat')
Location = gi.country_code_by_addr(self.request.remote_addr)
(pygeoip can be found here: http://code.google.com/p/pygeoip/)
I want to geolocate each page of my app for a user so currently I lookup the IP address once then store it in memcache.
My question - which is quicker? Looking up the IP address each time from the .dat file or fetching it from memcache? Are there any other pros/cons I need to be aware of?
For general queries like this, is there a good guide to teach me how to optimise my code and run speed tests myself? I'm new to python and coding in general so apologies if this is a basic concept.
Thanks!
Tom
EDIT: Thanks for the responses, memcache seems to be the right answer. I think that Nick and Lennart are suggesting that I add the whole gi variable to memcache. I think this is possible. FYI - the whole GeoIP.dat file is just over 1MB so not that large.
What takes time there is loading the database from the .dat file. Once you have that in memory, the lookup time is insignificant. So if you can keep the gi variable in memory, that seems the best solution.
If you can't, you probably can't keep the whole database in memcached either.
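On App Engine a module-level global survives across requests served by the same instance, so a lazy singleton is the usual trick; a sketch:

import pygeoip

_gi = None  # loaded at most once per instance

def get_geoip():
    global _gi
    if _gi is None:
        # pay the .dat parsing cost only on this instance's first lookup
        _gi = pygeoip.GeoIP('GeoIP.dat')
    return _gi

# in a request handler:
# country = get_geoip().country_code_by_addr(self.request.remote_addr)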
If you need to do lookups across multiple processes (which you almost certainly do on AppEngine), and you are likely to encounter the same ip address lots of times in a short time span (which you probably are), then using memcache is probably a good idea for speed.
More details, since you said you were relatively new to coding:
As Lennart Regebro correctly says, the slow thing is reading the geoip file from disk and parsing it. Individual queries will then be fast. However, if any given process is only serving one request (which, from your perspective, on AppEngine, it is), then this price will get paid on each request. Caching recently used lookups in memcache will let you share this information across processes...but only for recently encountered data points. However, since any given ip is likely to show up in bursts (because it is one user interacting with your site), this is exactly what you want.
Other alternatives are to pre-load all the data points into memcache. You probably don't want to do this, since you have a limited amount of memory available, and you won't end up using most of it. (Also, memcache will throw parts of it away if you hit your memory limit, which means you'd need to write backup code to read from the geoip database live anyway.) In general, doing lazy caching -- look up a value the slow way when you first need it and then keep it around for re-use -- is a very effective mechanism. Memcache is specifically geared for this, since it throws away data that hasn't been used recently when it encounters memory pressure.
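A sketch of that lazy pattern with the App Engine memcache API (the key prefix and one-hour expiry are arbitrary choices):

from google.appengine.api import memcache

def country_for_ip(ip, gi):
    key = 'geo:' + ip
    country = memcache.get(key)
    if country is None:
        # slow path: hit the GeoIP database, then cache the answer
        country = gi.country_code_by_addr(ip)
        memcache.set(key, country, time=3600)
    return country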
Another alternative in general (although not on AppEngine) is to run a separate process that handles just location queries, and have all your front-end processes talk to it (e.g. via thrift). Then you could use the suggestion of just loading up the geoip database in that process and querying it live for each request.
Hope that helps some.
For individual IP addresses that you already have gotten out of the database, I would put them in memcache for sure. I am assuming the database file is relatively large, and you don't want to load that from memcache every time you need to look up one address.
One tool I know people use to help track speed of API calls is AppStats. It can help you see how long various calls to the APIs are taking.
Since you are new to programming in general, I will mention that AppStats is a very App Engine-specific tool. If you were just writing a basic Python application to run on your own computer, you could time things by simply subtracting two timestamps:
import time

t1 = time.time()
# do whatever it is you want to time here
t2 = time.time()
elapsed_time = t2 - t1
I'm starting on a new scientific project which has a lot of data (millions of entries) I'd like to store in an easily and quickly accessible format. I've come across a number of different potential options, but I'm not sure how to pick amongst them. My data can probably just be stored as a dictionary, or potentially a dictionary of dictionaries. Some potential considerations:
Speed. I can't load all the data off disk every time I start a new script, and I'd like as quick access to random entries as possible.
Ease-of-use. This is Python. The storage should feel like Python.
Stability/maturity. I'd like something that's currently supported, although something that works well but is still in development would be fine.
Ease of installation. My sysadmin should be able to get this running on our cluster.
I don't really care that much about the size of the storage, but it could be a consideration if an option is really terrible on this front. Also, if it matters, I'll most likely be creating the database once, and thereafter only reading from it.
Some potential options that I've started looking at (see this post):
pyTables
ZopeDB
shove
shelve
redis
durus
Any suggestions on which of these might be better for my purposes? Any better ideas? Some of these have a back-end; any suggestions on which file-system back-end would be best?
Might want to give mongodb a shot - the PyMongo library works with dictionaries and supports most Python types. Easy to install, very performant + scalable. MongoDB (and PyMongo) is also used in production at some big names.
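A quick taste of how dictionary-like it feels with a current PyMongo (database, collection, and field names invented):

from pymongo import MongoClient

db = MongoClient().science  # hypothetical database name

# plain dicts go in...
db.entries.insert_one({"run": 42, "params": {"alpha": 0.1}, "result": 3.14})

# ...and come back out as dicts; an index keeps random lookups fast
db.entries.create_index("run")
entry = db.entries.find_one({"run": 42})
print(entry["result"])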
A RDBMS.
Nothing is more reliable than using tables on a well-known RDBMS. PostgreSQL comes to mind.
That automatically gives you some choices for the future like clustering. Also you automatically have a lot of tools to administer your database, and you can use it from other software written in virtually any language.
It is really fast.
In the "feel like python" point, I might add that you can use an ORM. A strong name is sqlalchemy. Maybe with the elixir "extension".
Using sqlalchemy you can leave your user/sysadmin choose which database backend he wants to use. Maybe they already have MySql installed - no problem.
RDBMSs are still the best choice for data storage.
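To make the ORM suggestion concrete, here is a minimal SQLAlchemy sketch (table and field names invented); swapping the connection URL is all it takes to change backends:

from sqlalchemy import create_engine, Column, Integer, Float, String
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class Entry(Base):
    __tablename__ = "entries"
    id = Column(Integer, primary_key=True)
    name = Column(String)
    value = Column(Float)

# "sqlite:///data.db" could just as well be a PostgreSQL or MySQL URL
engine = create_engine("sqlite:///data.db")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(Entry(name="run-1", value=3.14))
    session.commit()
    print(session.query(Entry).filter_by(name="run-1").one().value)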
I'm working on such a project and I'm using SQLite.
SQLite stores everything in one file and is part of Python's standard library. Hence, installation and configuration is virtually for free (ease of installation).
You can easily manage the database file with small Python scripts or via various tools. There is also a Firefox plugin (ease of installation / ease-of-use).
I find it very convenient to use SQL to filter/sort/manipulate/... the data. Although, I'm not an SQL expert. (ease-of-use)
I'm not sure if SQLite is the fastest DB system for this work, and it lacks some features you might need, e.g. stored procedures.
Anyway, SQLite works for me.
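A sketch of the zero-install part, using nothing but the standard library (the schema is invented):

import sqlite3

conn = sqlite3.connect("data.db")  # a single file, nothing to administer
conn.execute("CREATE TABLE IF NOT EXISTS entries (key TEXT PRIMARY KEY, value REAL)")

# create the database once...
conn.executemany("INSERT OR REPLACE INTO entries VALUES (?, ?)",
                 [("a", 1.0), ("b", 2.0)])
conn.commit()

# ...then read from it; the PRIMARY KEY index makes random lookups fast
row = conn.execute("SELECT value FROM entries WHERE key = ?", ("a",)).fetchone()
print(row[0])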
if you really just need dictionary-like storage, some of the new key/value or column stores like Cassandra or MongoDB might provide a lot more speed than you'd get with a relational database. Of course if you decide to go with RDBMS, SQLAlchemy is the way to go (disclaimer: I am its creator), but your desired featurelist seems to lean in the direction of "I just want a dictionary that feels like Python" - if you aren't interested in relational queries or strong ACIDity, those facets of RDBMS will probably feel cumbersome.
SQLite -- it comes with Python, is fast, widely available and easy to maintain.
If you only need simple (dict like) access mechanisms and need efficiency for processing a lot of data, then HDF5 might be a good option. If you are going to be using numpy then it is really worth considering.
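A sketch with h5py, one common Python binding for HDF5 (file and dataset names invented):

import h5py
import numpy as np

# create once...
with h5py.File("data.h5", "w") as f:
    f.create_dataset("measurements", data=np.random.rand(1000000))

# ...then read back arbitrary slices without loading the whole array
with h5py.File("data.h5", "r") as f:
    chunk = f["measurements"][5000:5010]  # only these entries are read from disk
print(chunk)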
Going with an RDBMS is reliable, scalable and fast.
If you need a more scalable solution and don't need the features of an RDBMS, you can go with a key-value or document store like CouchDB, which has a good Python API.
The NEMO collaboration (building a cosmic neutrino detector underwater) had many of the same problems, and they used MySQL and PostgreSQL without major issues.
It really depends on what you're trying to do. An RDBMS is designed for relational data, so if your data is relational, then use one of the various SQL options. But it sounds like your data is more oriented towards a key-value store with very fast random GET operations. If that's the case, compare the benchmarks of the various key-stores, focusing on the GET speed. The ideal key-value store will keep or cache requests in memory, and be able to handle many GET requests concurrently. You may actually want to create your own benchmark suite so you can effectively compare random concurrent GET operations.
Why do you need a cluster? Is the size of each value very large? If not, you shouldn't need a cluster to handle storage of a million entries. But if you're storing large blobs of data, that matters, and you may need something that easily supports read slaves and/or transparent partitioning. Some of the key-value stores are document oriented and/or optimized for storing larger values. Redis is technically less storage efficient for larger values due to the indexing overhead required for fast GETs, but that doesn't necessarily mean it's slower. In fact, the extra indexing makes lookups faster.
You're the only one that can truly answer this question, and I strongly recommend putting together a custom benchmark suite to test available options with actual usage scenarios. The data you get from that will give you more insight than anything else.
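A trivial skeleton for such a suite, timing random concurrent GETs against whatever get(key) callable wraps the store under test (thread count and op count are arbitrary; threads suffice here because the work is network-bound):

import random
import threading
import time

def benchmark(get, keys, n_threads=8, n_ops=10000):
    def worker():
        # each worker hammers the store with random GETs
        for _ in range(n_ops):
            get(random.choice(keys))
    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    start = time.time()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = time.time() - start
    total = n_threads * n_ops
    print("%d ops in %.2fs (%.0f ops/s)" % (total, elapsed, total / elapsed))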
To set the background: I'm interested in:
Capturing implicit signals of interest in books as users browse around a site. The site is written in Django (Python) using MySQL, memcached, nginx, and Apache.
Let's say, for instance, my site sells books. As a user browses around my site I'd like to keep track of which books they've viewed, and how many times they've viewed them.
Not that I'd store the data this way, but ideally I could have on-the-fly access to a structure like:
{user_id : {book_id: number_of_views, book_id_2: number_of_views}}
I realize there are a few approaches here:
Some flat-file log
Writing an object to a database every time
Writing to an object in memcached
I don't really know the performance implications, but I'd rather not write to a database on every single page view; the lag of writing to a log and computing the structure later seems too slow to give good recommendations on-the-fly as you use the site; and the memcached approach seems fine, but there's a cost to keeping this object only in memory: you might lose it, and it never gets written somewhere permanent.
What approach would you suggest? (doesn't have to be one of the above) Thanks!
Unless this data is an unimportant statistic that might or might not be available, I'd suggest taking the simple approach and using a model. Yes, it will hit the database every time.
Unless you are absolutely positively sure these queries are actually degrading overall experience there is no need to worry about it. Even if you optimize this one, there's a good chance other unexpected queries are wasting more CPU time. I assume you wouldn't be asking this question if you were testing all other queries. So why risk premature optimization on this one?
An advantage of the model approach would be having an API in place. When you have tested and decided to optimize you can keep this API and change the underlying model with something else (which will most probably be more complex than a model).
I'd definitely go with a model first and see how it performs. (and also how other parts of the project perform)
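A sketch of such a model (model and field names invented; the F() expression turns the increment into a single atomic UPDATE so concurrent views don't race):

from django.db import models
from django.db.models import F

class BookView(models.Model):
    user = models.ForeignKey('auth.User', on_delete=models.CASCADE)
    book = models.ForeignKey('books.Book', on_delete=models.CASCADE)
    views = models.PositiveIntegerField(default=0)

    class Meta:
        unique_together = ('user', 'book')

def record_view(user, book):
    obj, created = BookView.objects.get_or_create(
        user=user, book=book, defaults={'views': 1})
    if not created:
        # one UPDATE statement, no read-modify-write race
        BookView.objects.filter(pk=obj.pk).update(views=F('views') + 1)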
What approach would you suggest? (doesn't have to be one of the above) Thanks!
Hmmm... this is like being in a four-walled room with only one door and saying "I want to get out of the room, but not through the only door"...
There was an article I read some time back (can't find the link now) that said memcache can handle huge sets of data in memory with very little degradation in performance (Facebook uses it). My advice is to explore memcache further; I think it will do the trick.
Either a document datastore (Mongo/CouchDB) or a persistent key-value store (tokyodb, memcachedb, etc.) may be explored.
No definite recommendations from me as the final solution depends on multiple factors - load, your willingness to learn/deploy a new technology, size of the data...
Seems to me that one approach could be to use memcached to keep the counter, but have a cron running regularly to store the value from memcached to the db or disk. That way you'd get all the performance of memcached, but in the case of a crash you wouldn't lose more than a couple of minutes' data.
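A sketch of that using Django's cache API; the key scheme, the dirty-key bookkeeping, and persist_view_count are all improvised (and the key list itself is race-prone), so treat this as an illustration rather than a hardened design:

from django.core.cache import cache

def record_view(user_id, book_id):
    key = 'views:%s:%s' % (user_id, book_id)
    if cache.add(key, 1):                 # first view: create the counter
        dirty = cache.get('dirty_keys') or set()
        dirty.add(key)
        cache.set('dirty_keys', dirty)    # remember which counters exist
    else:
        cache.incr(key)                   # later views: atomic increment

def flush_to_db():
    # run this from cron every few minutes
    for key in cache.get('dirty_keys') or set():
        count = cache.get(key)
        if count:
            _, user_id, book_id = key.split(':')
            persist_view_count(user_id, book_id, count)  # hypothetical DB writer
    cache.delete('dirty_keys')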