Correct way to bulk insert / merge nodes and edges


I've been using neo4j with py2neo for a couple of weeks now, and up to now it was fine to just do single node transactions, so I would have different node types
    class NodeA(GraphObject):
        ...

    class NodeB(GraphObject):
        ...

    # create some nodes from data and simply save them one by one
    for data in dataset:
        node_a = NodeA(data)
        node_b = NodeB(data)
        if x:
            node_a.related_to_b.add(node_b)
        g.merge(node_b)
        g.merge(node_a)
Nothing fancy. However, I'm starting to get more nodes and connections, and single transactions don't really work anymore, as expected. I've been looking for ways to do bulk inserts, but can't find any good resources. The best I've managed to accomplish is using unwind_merge_nodes_query, which has three issues:
isn't that fast (~5 seconds for 700 very basic nodes on my laptop)
edges need to be handled separately
it requires keeping track of all the node ids to be able to handle edge connections
I've been writing functions to handle the above mentioned points, but I feel like I'm missing something and that there's a simpler way to handle batches of data

The unwind_merge_nodes_query function isn't generally intended to be used directly, although you can do so. Usually, you'd want to use the functions from the py2neo.bulk module instead, which wrap these functions.
Either way though, that nuance is unlikely to help much with your specific problems. As a client-side library, py2neo can only carry out operations exposed by the Neo4j server and, unfortunately, there exists no good (low level) way to import non-trivial bulk data from the client. Py2neo can't fix that.
If performance is your goal, your best bet might be to instead use a LOAD CSV Cypher statement. Note though that to do this, your input data file will need to be on, or directly visible to, the server.
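If the data has to come from the client anyway, a common pattern is to pass rows in chunks as a parameter to one UNWIND ... MERGE statement, and to match edge endpoints on a business key rather than on tracked node ids. This is a minimal sketch, not py2neo's own method. The labels, the `id` key property, the relationship type and the chunk size of 1000 are all assumptions for illustration, and the commented-out `graph.run` call assumes a py2neo `Graph` object connected to a live server.

```python
from itertools import islice

# Hypothetical labels, key property and relationship type; adjust to your model.
MERGE_NODES_QUERY = (
    "UNWIND $rows AS row "
    "MERGE (n:NodeA {id: row.id}) "
    "SET n += row"
)
MERGE_EDGES_QUERY = (
    "UNWIND $rows AS row "
    "MATCH (a:NodeA {id: row.a_id}) "
    "MATCH (b:NodeB {id: row.b_id}) "
    "MERGE (a)-[:RELATED_TO_B]->(b)"
)


def chunked(rows, size):
    """Yield successive lists of at most `size` items from `rows`."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


rows = [{"id": i, "name": f"node-{i}"} for i in range(2500)]
batches = list(chunked(rows, 1000))

for batch in batches:
    # With a live server, each batch would be a single round trip, e.g.:
    # graph.run(MERGE_NODES_QUERY, rows=batch)
    pass
```

Because the edge query looks nodes up by their key property, the client only needs to keep the keys it already has in its input data.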

Related

How to efficiently insert bulk data into Cassandra using Python?

I have a Python application, built with Flask, that allows importing of many data records (anywhere from 10k-250k+ records at one time). Right now it inserts into a Cassandra database, by inserting one record at a time like this:
    for transaction in transactions:
        self.transaction_table.insert_record(transaction)
This process is incredibly slow. Is there a best-practice approach I could use to more efficiently insert this bulk data?

You can use batch statements for this; an example and documentation are available in the DataStax documentation. You can also use some child workers and/or async queries on top of this.
In terms of best practices, it is more efficient if each batch only contains one partition key. This is because you do not want a node to be used as a coordinator for many different partition keys; it would be faster to contact each individual node directly.
If each record has a different partition key, a single prepared statement with some child workers may work out to be better.
You may also want to consider using a TokenAware load balancing policy allowing the relevant node to be contacted directly, instead of being coordinated through another node.
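The "one partition key per batch" advice can be sketched as a grouping step done before any statements are sent. This is plain Python with no driver calls. The `account_id` partition key and the batch size are hypothetical, and each resulting group is what you would hand to a single batch statement in your driver.

```python
from collections import defaultdict


def group_by_partition(records, partition_key, max_batch_size=50):
    """Group records by partition key, splitting large groups into batches."""
    groups = defaultdict(list)
    for record in records:
        groups[record[partition_key]].append(record)

    batches = []
    for key, items in groups.items():
        for start in range(0, len(items), max_batch_size):
            batches.append((key, items[start:start + max_batch_size]))
    return batches


# Hypothetical records; `account_id` stands in for the table's partition key.
transactions = [
    {"account_id": "a1", "amount": 10},
    {"account_id": "a2", "amount": 5},
    {"account_id": "a1", "amount": 7},
    {"account_id": "a1", "amount": 3},
]
batches = group_by_partition(transactions, "account_id", max_batch_size=2)
# Each (key, items) pair would become one batch sent to the cluster.
```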

The easiest solution is to generate CSV files from your data and import them with the COPY command. That should work well for up to a few million rows. For more complicated scenarios you could use the sstableloader command.
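Generating the CSV side of this is straightforward with the standard library. A minimal sketch follows, assuming hypothetical column names. It writes to an in-memory buffer so it can be run as-is, whereas in practice you would write to a file that cqlsh can read.

```python
import csv
import io


def records_to_csv(records, columns):
    """Serialize dict records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for record in records:
        writer.writerow({column: record[column] for column in columns})
    return buffer.getvalue()


# Hypothetical records and column names.
transactions = [
    {"id": 1, "account_id": "a1", "amount": 10},
    {"id": 2, "account_id": "a2", "amount": 5},
]
csv_text = records_to_csv(transactions, ["id", "account_id", "amount"])
```

The resulting file could then be loaded from cqlsh with something along the lines of `COPY my_keyspace.transactions (id, account_id, amount) FROM 'transactions.csv' WITH HEADER = TRUE;`, where the keyspace, table and file names are placeholders.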

Bundling reads or caching collections with Pymongo

I'm currently running into an issue in integrating ElasticSearch and MongoDB. Essentially I need to convert a number of Mongo Documents into searchable documents matching my ElasticSearch query. That part is luckily trivial and taken care of. My problem though is that I need this to be fast, faster than network time: I would really like to be able to index around 100 docs/second, which simply isn't possible with network calls to Mongo.
I was able to speed this up a lot by using ElasticSearch's bulk indexing, but that's only half of the problem. Is there any way to either bundle reads or cache a collection (a manageable part of a collection, as this collection is larger than I would like to keep in memory) to help speed this up? I was unable to really find any documentation about this, so if you can point me towards relevant documentation I consider that a perfectly acceptable answer.
I would prefer a solution that uses Pymongo, but I would be more than happy to use something that directly talks to MongoDB over requests or something similar. Any thoughts on how to alleviate this?

pymongo is thread safe, so you can run multiple queries in parallel. (I assume that you can somehow partition your document space.)
Feed the results to a local Queue if processing the result needs to happen in a single thread.
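A rough sketch of that pattern is below. `fetch_partition` is a hypothetical stand-in for a real query such as `collection.find(...)` restricted to one slice of the document space, so the example runs without a database. The partition boundaries and worker count are arbitrary.

```python
import queue
from concurrent.futures import ThreadPoolExecutor


def fetch_partition(partition):
    """Stand-in for a MongoDB query over one partition of the documents."""
    start, stop = partition
    return [{"_id": i} for i in range(start, stop)]


def worker(partition, results):
    """Fetch one partition and push each document onto the shared queue."""
    for document in fetch_partition(partition):
        results.put(document)


results = queue.Queue()
partitions = [(0, 100), (100, 200), (200, 300)]

with ThreadPoolExecutor(max_workers=3) as pool:
    for partition in partitions:
        pool.submit(worker, partition, results)
# Leaving the `with` block waits for all workers to finish.

# Single-threaded consumer: drain the queue and process the documents.
documents = []
while not results.empty():
    documents.append(results.get())
```

In a long-running indexer the consumer would normally run concurrently with the workers and feed documents to the bulk indexer as they arrive, rather than draining the queue at the end.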

Fastest way to perform bulk add/insert in Neo4j with Python?

I am finding Neo4j slow to add nodes and relationships/arcs/edges when using the REST API via py2neo for Python. I understand that this is due to each REST API call executing as a single self-contained transaction.
Specifically, adding a few hundred pairs of nodes with relationships between them takes a number of seconds, running on localhost.
What is the best approach to significantly improve performance whilst staying with Python?
Would using bulbflow and Gremlin be a way of constructing a bulk insert transaction?
Thanks!

There are several ways to do a bulk create with py2neo, each making only a single call to the server.
1. Use the create method to build a number of nodes and relationships in a single batch.
2. Use a Cypher CREATE statement.
3. Use the new WriteBatch class (just released this week) to manually make a batch of nodes and relationships (this is really just a manual version of 1).
If you have some code, I'm happy to look at it and make suggestions on performance tweaks. There are also quite a few tests you may be able to get inspiration from.
Cheers,
Nige

Neo4j's write performance is slow unless you are doing a batch insert.
The Neo4j batch importer (https://github.com/jexp/batch-import) is the fastest way to load data into Neo4j. It's a Java utility, but you don't need to know any Java because you're just running the executable. It handles typed data and indexes, and it imports from a CSV file.
To use it with Bulbs (http://bulbflow.com/) Models, use the model get_bundle() method to get the data, index name, and index keys, which are prepared for insert, and then output the data to a CSV file. Or if you don't want to model your data, just output your data from Python to the CSV file.
Will that work for you?

There are so many old answers to this question online that it took me forever to realize there's an import tool that comes with neo4j. It's very fast and the best tool I was able to find.
Here's a simple example if we want to import student nodes:
bin/neo4j-import --into [path-to-your-neo4j-directory]/data/graph.db --nodes students
The students file contains data that looks like this, for example:
studentID:Id(Student),name,year:int,:LABEL
1111,Amy,2000,Student
2222,Jane,2012,Student
3333,John,2013,Student
Explanation:
The header explains how the data below it should be interpreted.
studentID is a property with type Id(Student).
name is of type string which is the default.
year is an integer
:LABEL is the label you want for these nodes, in this case it is "Student"
Here's the documentation for it: http://neo4j.com/docs/stable/import-tool-usage.html
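Since the question is about Python, here is a small sketch of producing such a file from Python data. The header row is copied verbatim from the example above, and the student records and the helper name are illustrative. The output goes to an in-memory buffer here, while for a real import you would write it to the `students` file.

```python
import csv
import io

# Header copied from the example above; the label is fixed to "Student".
HEADER = ["studentID:Id(Student)", "name", "year:int", ":LABEL"]


def write_students(students, out):
    """Write student dicts in the import tool's CSV layout to `out`."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(HEADER)
    for student in students:
        writer.writerow(
            [student["id"], student["name"], student["year"], "Student"]
        )


students = [
    {"id": 1111, "name": "Amy", "year": 2000},
    {"id": 2222, "name": "Jane", "year": 2012},
]
buffer = io.StringIO()
write_students(students, buffer)
students_csv = buffer.getvalue()
```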
Note: I realize the question specifically mentions python, but another useful answer mentions a non-python solution.

Well, I myself had need for massive performance from neo4j. I ended up doing the following things to improve graph performance.
Ditched py2neo, since there were a lot of issues with it. Besides, it is very convenient to use the REST endpoint provided by neo4j; just make sure to use request sessions.
Use raw Cypher queries for bulk inserts instead of any OGM (Object-Graph Mapper). That is crucial if you need a high-performance system.
Performance was still not enough for my needs, so I ended up writing a custom system that merges 6-10 queries together using WITH * and UNION clauses. That improved performance by a factor of 3 to 5.
Use a larger transaction size, with at least 1000 queries.
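As a rough illustration of the last point, the sketch below packs many parameterised Cypher statements into one request body per transaction. It assumes the server's transactional HTTP endpoint accepts a JSON body of the form `{"statements": [...]}`. The endpoint URL, the `Person` label and the `$name` parameter syntax depend on the Neo4j version and are assumptions here, which is why the network call is only shown as a comment.

```python
import json


def build_payloads(statement, parameter_sets, statements_per_request=1000):
    """Pack many parameterised statements into one request body per chunk."""
    payloads = []
    for start in range(0, len(parameter_sets), statements_per_request):
        chunk = parameter_sets[start:start + statements_per_request]
        payloads.append({
            "statements": [
                {"statement": statement, "parameters": params}
                for params in chunk
            ]
        })
    return payloads


# Hypothetical statement and data.
statement = "CREATE (n:Person {name: $name})"
parameter_sets = [{"name": f"person-{i}"} for i in range(2500)]
payloads = build_payloads(statement, parameter_sets)

for payload in payloads:
    body = json.dumps(payload)
    # With a requests.Session and a live server, something like:
    # session.post(transaction_commit_url, json=payload)
```

Reusing one session keeps the HTTP connection open between requests, which is the point of the "use request sessions" advice above.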

To insert nodes in bulk at very high speed into Neo4j, use the Batch Inserter:
http://neo4j.com/docs/stable/batchinsert-examples.html
In my case I'm working in Java.

Preferred (or recommended) way to store large amounts of simulation configurations, runs values and final results

I am working with some network simulator. After making some extensions to it, I need to make a lot of different simulations and tests. I need to record:
simulation scenario configurations
values of some parameters (e.g. buffer sizes, signal qualities, position) per devices per time unit t
final results computed from those recorded values
The second kind of data is needed for visualization after the simulation has run (simple animation, showing some statistics over time).
I am using Python with matplotlib etc. for post-processing the data and for writing a proper app (now considering pyQt or Django, but this is not the topic of the question). Now I am wondering what would be the best way to store this data?
My first guess was to use XML files, but the XML syntax can add too much overhead (I mean, files can grow very large, especially for the second kind of data). So I tried to design a database... But this also seems to me to be not the proper way... Maybe a mix of both?
I have tried to find some clues in Google, but found nothing special. Have you ever had a need for storing such data? How have you done that? Is there any "design pattern" for that?

Separate concerns:
Apart from pondering on the technology to use for storing data (DBMS, CSV, or maybe one of the specific formats for scientific data), note that you have three very different kinds of data to manage:
Simulation scenario configurations: these are (typically) rather small, but they need to be simple to edit, simple to re-use, and should make it possible to reproduce a simulation run. Here, text or code files seem to be a good choice (these should also be version-controlled).
Raw simulation data: this is where you should be really careful if you are concerned with simulation performance, because writing 3 GB of data during a run can take a huge amount of time if implemented badly. One way to proceed would be to use existing file formats for this purpose (see below) and see if they work for you. If not, you can still use a DBMS. Also, it is usually a good idea to include a description of the scenario that generated the data (or at least a reference), as this helps you manage the results.
Data for post-processing: how to store this mostly depends on the post-processing tools. For example, if you already have a class structure for your visualization application, you could define a file format that makes it easy to read in the required data.
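One possible way to keep these concerns separate while still linking raw data back to its scenario is sketched below. The configuration is plain JSON-compatible data, and the raw output starts with a comment line carrying a short hash of that configuration. The field names, the 12-character hash and the CSV layout are all illustrative choices, not an established format.

```python
import csv
import hashlib
import io
import json


def config_id(config):
    """Return a stable short identifier for a scenario configuration."""
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def write_raw_samples(config, samples, out):
    """Write samples as CSV, preceded by a line referencing the scenario."""
    out.write(f"# scenario={config_id(config)}\n")
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["t", "device", "buffer_size"])
    writer.writerows(samples)


# Hypothetical scenario and recorded values.
config = {"nodes": 10, "seed": 1}
samples = [(0, "node1", 10), (1, "node1", 14)]

buffer = io.StringIO()
write_raw_samples(config, samples, buffer)
raw_text = buffer.getvalue()
```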
Look for existing solutions:
The problem you face (How to manage simulation data?) is fundamental and there are many potential solutions, each coming with certain trade-offs. As you are working in network simulation, check out what capabilities other tools used in your community provide. It could be that their developers ran into problems you are not even anticipating yet (regarding reproducibility etc.), and already found a good solution. For example, you could check out how OMNeT++ is handling simulation output: the simulation configurations are defined in a separate file, results are written to vec and sca files (depending on their nature). As far as I understood your problems with hierarchical data, this is supported as well (vectors get unique IDs and are associated with an attribute of some model entity).
Additional tools already work with these file formats, e.g. to convert them to other formats like CSV/MATLAB files, so you could even think of creating files in the same format (documented here) and to use existing tools/converters for post-processing.
Many other simulation tools will have similar features, so take a look at what would work best for you.

It sounds like you need to record more or less the same kinds of information for each case, so a relational database sounds like a good fit. Why do you think it's "not the proper way"?
If your data fits in a collection of CSV files, you're most of the way to a relational database already! Just store in database tables instead, and you have support for foreign keys and queries. If you go on to implement an object-oriented solution, you can initialize your objects from the database.
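To make that concrete, here is a small sketch using SQLite from Python's standard library. The table and column names are made up for illustration: one table holds runs, another holds per-device samples that point back to their run through a foreign key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE run (
    id INTEGER PRIMARY KEY,
    scenario TEXT NOT NULL
);
CREATE TABLE sample (
    run_id INTEGER NOT NULL REFERENCES run(id),
    t INTEGER NOT NULL,
    device TEXT NOT NULL,
    buffer_size INTEGER NOT NULL
);
""")

# Record one run and a few samples belonging to it.
cursor = conn.execute("INSERT INTO run (scenario) VALUES (?)", ("baseline",))
run_id = cursor.lastrowid
conn.executemany(
    "INSERT INTO sample (run_id, t, device, buffer_size) VALUES (?, ?, ?, ?)",
    [(run_id, 0, "node1", 10), (run_id, 1, "node1", 14), (run_id, 0, "node2", 3)],
)
conn.commit()

# Final results can then be computed with a query instead of custom code.
rows = conn.execute(
    "SELECT device, AVG(buffer_size) FROM sample "
    "WHERE run_id = ? GROUP BY device ORDER BY device",
    (run_id,),
).fetchall()
```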

If your data structures are well-known and stable AND you need some of the SQL querying / computation features, then a light-weight relational DB like SQLite might be the way to go (just make sure it can handle your eventual 3+GB data).
Otherwise (i.e., each simulation scenario might need a dedicated data structure to store the results, and you don't need any SQL features), you might be better off using a more free-form solution (document-oriented database, OO database, filesystem + csv, whatever).
Note that you can still use a SQL db in the second case, but you'll have to dynamically create tables for each resultset, and of course dynamically create the relevant SQL queries too.

Comparing persistent storage solutions in python

I'm starting on a new scientific project which has a lot of data (millions of entries) I'd like to store in an easily and quickly accessible format. I've come across a number of different potential options, but I'm not sure how to pick amongst them. My data can probably just be stored as a dictionary, or potentially a dictionary of dictionaries. Some potential considerations:
Speed. I can't load all the data off disk every time I start a new script, and I'd like as quick access to random entries as possible.
Ease-of-use. This is python. The storage should feel like python.
Stability/maturity. I'd like something that's currently supported, although something that works well but is still in development would be fine.
Ease of installation. My sysadmin should be able to get this running on our cluster.
I don't really care that much about the size of the storage, but it could be a consideration if an option is really terrible on this front. Also, if it matters, I'll most likely be creating the database once, and thereafter only reading from it.
Some potential options that I've started looking at (see this post):
pyTables
ZopeDB
shove
shelve
redis
durus
Any suggestions on which of these might be better for my purposes? Any better ideas? Some of these have a back-end; any suggestions on which file-system back-end would be best?

Might want to give mongodb a shot - the PyMongo library works with dictionaries and supports most Python types. Easy to install, very performant + scalable. MongoDB (and PyMongo) is also used in production at some big names.

A RDBMS.
Nothing is more reliable than using tables on a well known RDBMS. PostgreSQL comes to mind.
That automatically gives you some choices for the future like clustering. Also you automatically have a lot of tools to administer your database, and you can use it from other software written in virtually any language.
It is really fast.
On the "feel like python" point, I might add that you can use an ORM. A strong name is sqlalchemy. Maybe with the elixir "extension".
Using sqlalchemy you can let your user/sysadmin choose which database backend they want to use. Maybe they already have MySQL installed - no problem.
RDBMSs are still the best choice for data storage.

I'm working on such a project and I'm using SQLite.
SQLite stores everything in one file and is part of Python's standard library. Hence, installation and configuration are virtually free (ease of installation).
You can easily manage the database file with small Python scripts or via various tools. There is also a Firefox plugin (ease of installation / ease-of-use).
I find it very convenient to use SQL to filter/sort/manipulate/... the data, although I'm not an SQL expert (ease-of-use).
I'm not sure if SQLite is the fastest DB system for this work, and it lacks some features you might need, e.g. stored procedures.
Anyway, SQLite works for me.
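For the "create once, then read random entries" use case in the question, a thin dict-like wrapper over one SQLite table may be enough. This is only a sketch: the class name, the table layout and the choice of JSON for values are assumptions, and `":memory:"` is used so the example runs without touching disk (a file path would make it persistent).

```python
import json
import sqlite3


class SqliteDict:
    """Read-mostly, dict-like view over a single SQLite table."""

    def __init__(self, path):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS entries "
            "(key TEXT PRIMARY KEY, value TEXT)"
        )

    def update(self, mapping):
        # Values are stored as JSON text; the whole batch is one transaction.
        with self.conn:
            self.conn.executemany(
                "INSERT OR REPLACE INTO entries (key, value) VALUES (?, ?)",
                ((key, json.dumps(value)) for key, value in mapping.items()),
            )

    def __getitem__(self, key):
        row = self.conn.execute(
            "SELECT value FROM entries WHERE key = ?", (key,)
        ).fetchone()
        if row is None:
            raise KeyError(key)
        return json.loads(row[0])


store = SqliteDict(":memory:")
store.update({"a": {"x": 1}, "b": [1, 2, 3]})
```

The primary key gives indexed lookups, so a single entry can be read without loading the rest of the data.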

If you really just need dictionary-like storage, some of the new key/value or column stores like Cassandra or MongoDB might provide a lot more speed than you'd get with a relational database. Of course if you decide to go with an RDBMS, SQLAlchemy is the way to go (disclaimer: I am its creator), but your desired feature list seems to lean in the direction of "I just want a dictionary that feels like Python" - if you aren't interested in relational queries or strong ACIDity, those facets of an RDBMS will probably feel cumbersome.

SQLite: it comes with Python, and is fast, widely available and easy to maintain.

If you only need simple (dict like) access mechanisms and need efficiency for processing a lot of data, then HDF5 might be a good option. If you are going to be using numpy then it is really worth considering.

Go with an RDBMS: it is reliable, scalable and fast.
If you need a more scalable solution and don't need the features of an RDBMS, you can go with a key-value store like CouchDB, which has a good Python API.

The NEMO collaboration (building a cosmic neutrino detector underwater) had many of the same problems, and they used MySQL and PostgreSQL without major problems.

It really depends on what you're trying to do. An RDBMS is designed for relational data, so if your data is relational, then use one of the various SQL options. But it sounds like your data is more oriented towards a key-value store with very fast random GET operations. If that's the case, compare the benchmarks of the various key-stores, focusing on the GET speed. The ideal key-value store will keep or cache requests in memory, and be able to handle many GET requests concurrently. You may actually want to create your own benchmark suite so you can effectively compare random concurrent GET operations.
Why do you need a cluster? Is the size of each value very large? If not, you shouldn't need a cluster to handle storage of a million entries. But if you're storing large blobs of data, that matters, and you may need something that easily supports read slaves and/or transparent partitioning. Some of the key-value stores are document oriented and/or optimized for storing larger values. Redis is technically less storage efficient for larger values due to the indexing overhead required for fast GETs, but that doesn't necessarily mean it's slower. In fact, the extra indexing makes lookups faster.
You're the only one that can truly answer this question, and I strongly recommend putting together a custom benchmark suite to test available options with actual usage scenarios. The data you get from that will give you more insight than anything else.
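A starting point for such a benchmark might look like the sketch below. It times sequential random lookups through any `get` callable, so the same harness can be pointed at a dict, a database wrapper, or a client for a key-value store. It does not cover the concurrent case mentioned above, and the key count and request count are arbitrary.

```python
import random
import time


def benchmark_random_gets(get, keys, n_requests=10_000, seed=0):
    """Time `n_requests` lookups of randomly chosen keys via `get`."""
    rng = random.Random(seed)
    # Choose the keys up front so key selection is not part of the timing.
    sample = [rng.choice(keys) for _ in range(n_requests)]
    start = time.perf_counter()
    for key in sample:
        get(key)
    return time.perf_counter() - start


# Baseline: an in-memory dict. Swap in any other store's lookup function.
data = {f"key-{i}": i for i in range(100_000)}
keys = list(data)
elapsed = benchmark_random_gets(data.__getitem__, keys)
print(f"dict: {elapsed:.4f}s for 10000 random GETs")
```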
