I'm quite new to Neo4j and already lost among the out-of-date documentation and commands whose effects and performance are very unclear.
I am looking for a way to import some very large data fast.
The data is on the scale of billions of rows for one kind of data, split across multiple CSV files, but I don't mind merging them into one.
Doing a very simple import (LOAD CSV ... CREATE (n:XXX {id: row.id}))
is taking ages; with a unique index it takes days.
I stopped the operation, dropped the unique index and restarted; it was about 2x faster, but still too slow.
I know about neo4j-import (although it is deprecated, and there is no documentation on the Neo4j website about "neo4j-admin import"). It's already extremely unclear how to do simple things with it, like anything conditional.
The biggest bummer is that it doesn't seem to work with an existing database.
The main question is: is there any way to accelerate the import of very large CSV files into Neo4j?
First with simple statements like CREATE, but hopefully with MATCH as well.
Right now, running a Cypher command such as match (n:X {id: "Y"}) return n limit 1 takes multiple minutes on the 1B nodes.
(I'm running this on a server, with 200GB+ of RAM and 48CPUs, so probably not a limitation from hardware point of view).
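For reference, the kind of batched alternative I'd consider from the Python driver side looks roughly like this, if that's even the right direction (connection details, the XXX label and the file path are placeholders, not my real setup):

# Sketch: batched CREATE via the official neo4j Python driver.
# URI, credentials, label and CSV path are placeholders.
import csv
from neo4j import GraphDatabase

URI = "bolt://localhost:7687"
AUTH = ("neo4j", "password")

def flush(session, rows):
    # one transaction per batch; UNWIND avoids a round trip per row
    session.run("UNWIND $rows AS row CREATE (n:XXX {id: row.id})", rows=rows)

def import_nodes(csv_path, batch_size=10000):
    driver = GraphDatabase.driver(URI, auth=AUTH)
    with driver.session() as session, open(csv_path, newline="") as f:
        batch = []
        for row in csv.DictReader(f):
            batch.append({"id": row["id"]})
            if len(batch) == batch_size:
                flush(session, batch)
                batch = []
        if batch:
            flush(session, batch)
    driver.close()

import_nodes("nodes.csv")

The point being that each transaction carries 10k rows instead of one round trip per row.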
I am writing my bachelor thesis on a project with a massive database that tracks around 8000 animals, three times a second. After a few months we now have approx. 127 million entries, and each row includes a column with an array of 1000-3000 entries holding the coordinates of every animal that was tracked in that square at that moment. All of that sits in an SQL database that now easily exceeds 2 TB in size.
To export the data and analyse the moving patterns of the animals, they did it online over phpMyAdmin as a CSV export that would take hours to finish and break down almost every time.
I wrote them a Python script (they wanted me to use Python) with mysql-connector-python that will fetch the data for them automatically. The problem is that, since the database is so massive, one query can take minutes or technically even hours to complete. (Downloading a day of tracking data would be 3*60*60*24 = 259,200 entries.)
The moment anything goes wrong (the connection fails, the computer is overloaded, etc.) the whole query is closed and it has to start all over again, because the result isn't cached anywhere.
I then rewrote the whole thing as a class that fetches the data using smaller multithreaded queries.
I start about 5-7 threads that each take a connection out of a connection pool, run their query, write the results to a CSV file as they arrive, and put the connection back in the pool once the query is done.
My solution works perfectly; the queries are about 5-6 times faster, depending on the number of threads I use and the size of the chunks that I download. The data gets written into the file, and if the connection breaks or anything else happens, the CSV file still holds all the data that has been downloaded up to that point.
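For illustration, a stripped-down sketch of roughly what the class does (the table/column names, credentials and chunk ranges here are placeholders, not my real schema):

# Rough sketch of the chunked, multithreaded fetch described above.
import csv
import threading
from concurrent.futures import ThreadPoolExecutor
import mysql.connector.pooling

pool = mysql.connector.pooling.MySQLConnectionPool(
    pool_name="fetch_pool", pool_size=6,
    host="localhost", user="user", password="pw", database="tracking")
write_lock = threading.Lock()

def fetch_chunk(id_from, id_to, csv_path):
    conn = pool.get_connection()            # borrow a connection from the pool
    try:
        cur = conn.cursor()
        cur.execute(
            "SELECT id, recorded_at, positions FROM tracking "
            "WHERE id BETWEEN %s AND %s", (id_from, id_to))
        rows = cur.fetchall()
        cur.close()
    finally:
        conn.close()                        # returns the connection to the pool
    with write_lock:                        # one writer at a time
        with open(csv_path, "a", newline="") as f:
            csv.writer(f).writerows(rows)

chunks = [(i, i + 9999) for i in range(1, 100000, 10000)]
with ThreadPoolExecutor(max_workers=6) as ex:
    futures = [ex.submit(fetch_chunk, lo, hi, "export.csv") for lo, hi in chunks]
    for f in futures:
        f.result()                          # surface any worker exception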
But when looking for ways to improve my method, I can find absolutely nothing about a similar approach, and no one seems to do it that way for large datasets.
What am I missing? Why does it seem like everyone uses a single-query approach to fetch their massive datasets, instead of splitting it across threads and avoiding these annoying issues with connection breaks and whatnot?
Is my solution even usable and good in a commercial environment, or are there things that I just don't see right now that would make my approach useless or even much worse?
Or maybe it is a matter of the programming language, and if I had used C# to do the same thing it would have been faster anyway?
EDIT:
To clear some things up: I am not responsible for the database. While I can tinker with it since I also have admin rights, someone else who (hopefully) actually knows what he is doing has set it up and writes the data. My job is only to fetch it as simply and effectively as possible. And since exporting from phpMyAdmin is too slow, and so is a single query in Python for 100k rows (I do it using pd.read_sql), I switched to multithreading. So my question is only about SELECTing the data effectively, not about changing the DB.
I hope this is not becoming too long of a question...
There are many issues in a database of that size. We need to do the processing fast enough so that it never gets behind. (Once it lags, it will keel over, as you see.)
Ingestion. It sounds like a single client is receiving 8000 lat/lng values every 3 seconds, then INSERTing a single, quite wide row. Is that correct?
When you "process" the data, are you looking at each of the 8000 animals? Or looking at a selected animal? Fetching one animal's lat/lng out of a wide row is messy and slow.
If the primary way things are SELECTed is one animal at a time, then your matrix needs to be transposed. That will make selecting all the data for one animal much faster, and we can mostly avoid the impact that Inserting and Selecting have on each other.
Are you inserting while you are reading?
What is the value of innodb_buffer_pool_size? You must plan carefully with the 2TB versus the much smaller RAM size. Depending on the queries, you may be terribly I/O-bound and maybe the data structure can be changed to avoid that.
"...csv file and put it back..." -- Huh? Are you deleting data, then re-inserting it? That seems 'wrong'. And very inefficient.
Do minimize the size of every column in the table. How big is the range for the animals? Your backyard? The Pacific Ocean? How much precision is needed in the location? Meters for whales; millimeters for ants. Maybe the coordinates can be scaled to a pair of SMALLINTs (2 bytes, 16-bit precision) or MEDIUMINTs (3 bytes each)?
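To illustrate the scaling idea (the 1 km range and centimetre resolution below are assumptions; plug in your real numbers):

# Assumed: a 1 km x 1 km area and centimetre precision.
def to_fixed(metres, resolution_m=0.01):
    # 0..1000 m at 1 cm resolution -> 0..100000, fits an unsigned MEDIUMINT (3 bytes)
    return round(metres / resolution_m)

def from_fixed(value, resolution_m=0.01):
    return value * resolution_m

x = to_fixed(123.456)    # -> 12346
print(from_fixed(x))     # ~123.46 (centimetre precision preserved)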
I haven't dwelled on threading; I would like to wait until the rest of the issues are ironed out. Threads interfere with each other to some extent.
I find this topic interesting. Let's continue the discussion.
I am looking for a solution to build an application with the following features:
A database composed of (potentially) millions of rows in a table, which might be related to a few small ones.
Fast single queries, such as SELECT * FROM table WHERE field LIKE '%value'
It will run on a Linux Server: Single node, but maybe multiple nodes in the future.
Do you think Python and Hadoop are a good choice?
Where could I find a quick example written in Python to add/retrieve information to Hadoop, in order to see a proof of concept running with my own eyes and take a decision?
Thanks in advance!
Not sure whether these questions are on topic here, but fortunately the answer is simple enough:
These days a million rows is simply not that large anymore; even Excel can hold more than a million.
If you have a few million rows in a large table, and want to run quick small select statements, the answer is that you are probably better off without Hadoop.
Hadoop is great for sets of 100 million rows, but does not scale down too well (in performance and required maintenance).
Therefore, I would recommend you try a 'normal' database solution, like MySQL, at least until your data starts growing significantly.
You can use python for advanced analytical processing, but for simple queries I would recommend using SQL.
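For example, the kind of query you describe is a one-liner against a plain MySQL table from Python (connection settings and the table/column names here are made up):

# Minimal sketch: a simple LIKE query against MySQL from Python.
import mysql.connector

conn = mysql.connector.connect(
    host="localhost", user="user", password="pw", database="mydb")
cur = conn.cursor()
cur.execute("SELECT * FROM items WHERE field LIKE %s", ("%value",))
for row in cur.fetchall():
    print(row)
cur.close()
conn.close()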
I've been poring over everywhere I can to find an answer to this, but can't seem to find anything:
I've got a batch update to a MySQL database that happens every few minutes, with Python handling the ETL work (I'm pulling data from web APIs into the MySQL system).
I'm trying to get a sense of what kinds of potential impact (be it positive or negative) I'd see by using either multithreading or multiprocessing to do multiple connections & inserts of the data simultaneously. Each worker (be it thread or process) would be updating a different table from any other worker.
At the moment I'm only updating a half-dozen tables with a few thousand records each, but this needs to be scalable to dozens of tables and hundreds of thousands of records each.
Every other resource I can find out there addresses doing multithreading/processing to the same table, not a distinct table per worker. I get the impression I would definitely want to use multithreading/processing, but it seems everyone's addressing the one-table use case.
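For concreteness, the shape I have in mind is roughly this, one worker per table (table names, sample rows and connection settings are placeholders):

# Sketch: one worker process per table, each with its own connection.
from concurrent.futures import ProcessPoolExecutor
import mysql.connector

TABLES = {"prices": [(1, "a")], "orders": [(2, "b")], "users": [(3, "c")]}

def load_table(table, rows):
    conn = mysql.connector.connect(
        host="localhost", user="user", password="pw", database="etl")
    cur = conn.cursor()
    cur.executemany(
        f"INSERT INTO {table} (col_a, col_b) VALUES (%s, %s)", rows)
    conn.commit()
    cur.close()
    conn.close()
    return table, len(rows)

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=len(TABLES)) as ex:
        futures = [ex.submit(load_table, t, r) for t, r in TABLES.items()]
        for fut in futures:
            print("loaded", fut.result())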
Thoughts?
I think your question is too broad to answer concisely. It seems you're asking about two separate subjects - will writing to separate MySQL tables speed it up, and is python multithreading the way to go. For the python part, since you're probably doing mostly IO, you should look at gevent, and ultramysql. As for the MySQL part, you'll have to wait for more answers.
For an ETL process I wrote in C#, I decided the best work partitioning was each "source" having a thread for extraction, one thread for each transform "type", and one to load the transformed data to each target.
In my case, I found multiple threads per source just ended up saturating the source server too much; it became less responsive overall (to even non-ETL queries) and the extractions didn't really finish any faster since they ended up competing with each other on the source. Since retrieving the remote extract was more time consuming than the local (in memory) transform, I was able to pipeline the extract results from all sources through one transformer thread/queue (per transform "type"). Similarly, I only had a single target to load the data to, so having multiple threads there would have just monopolized the target.
(Some details omitted/simplified for brevity, and due to poor memory.)
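As a rough Python illustration of that pipeline shape (the extract/transform/load bodies here are trivial stand-ins, not my actual C# code):

# One extractor thread per source feeding a single transform queue,
# then a single loader thread for the one target.
import queue
import threading

SOURCES = ["source_a", "source_b"]
raw_q, ready_q = queue.Queue(maxsize=100), queue.Queue(maxsize=100)

def extract(source):
    for i in range(3):                       # pretend rows from this source
        raw_q.put((source, i))

def transformer():
    while True:
        item = raw_q.get()
        if item is None:
            break
        ready_q.put(("transformed", item))   # the per-type transform step

def loader():
    while True:
        item = ready_q.get()
        if item is None:
            break
        print("load ->", item)               # single target load

extractors = [threading.Thread(target=extract, args=(s,)) for s in SOURCES]
t, l = threading.Thread(target=transformer), threading.Thread(target=loader)
for th in extractors + [t, l]:
    th.start()
for th in extractors:
    th.join()
raw_q.put(None)                              # signal: extraction finished
t.join()
ready_q.put(None)                            # signal: transforms finished
l.join()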
...but I'd think we'd need more details about what your ETL process does.
I have an SQLite database containing several tables, two of which have over a million rows. I'm using Python 2.7 on OSX and the SQLite3 module.
I tried to read all the data from the tables into memory with a SELECT statement (without WHERE or anything of that nature) using the cursor's execute() method, followed by its fetchall() method. An hour later I decided to interrupt the process because I couldn't tell whether it had crashed. So I tried again using the cursor as an iterator and telling it to print the number of seconds elapsed every 10000 rows retrieved.
I found that while this method could retrieve the whole of a 150,000-row table in about 6 seconds, with a 2,500,000-row table it takes about 40 seconds per 10,000 rows to begin with (60 seconds for some of the earliest batches), improving to about 20 seconds per 10,000 rows from the millionth row onwards. So my question is: why is this happening, and what's a good solution or workaround? An obvious workaround would be to break up large tables into smaller ones (or to give up on SQLite entirely and use something else), but that's not really very elegant.
Many other questions on Stack Overflow relate to slow read performance with relatively complex queries in SQLite (e.g. SQLite - select expression is very slow). But this question relates to the simplest possible kind of query, e.g. SELECT a,b,c,d FROM x.
I've given the OS, programming language, and wrapper that I'm using above, but I'm not sure how relevant they are to the problem. For example, if I try to inspect these large tables using the Firefox extension, SQLite Manager, Firefox just seems to hang. I've also tried switching to the apsw wrapper (https://github.com/rogerbinns/apsw) but there was no improvement.
I tried to read all the data from the tables into memory…
Don't do that.
Even if mediated by a cursor.
There are few cases where you actually need the entire database in an in-core representation, and yours is probably not one of them.
Dragging all that data into memory is expensive, as your tests have shown you. The cost of doing a fetchall is one of the major issues that DBMSs were designed to cope with. Any database for which you could build a system that's "good enough" in core today will grow to exceed core tomorrow.
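Instead, stream the rows in batches, e.g. with fetchmany; a sketch using the simple SELECT a,b,c,d FROM x from your question (the database path is a placeholder):

# Stream rows in fixed-size batches instead of fetchall().
import sqlite3

conn = sqlite3.connect("data.db")
cur = conn.cursor()
cur.execute("SELECT a, b, c, d FROM x")

while True:
    batch = cur.fetchmany(10000)         # keeps at most 10k rows in memory
    if not batch:
        break
    for a, b, c, d in batch:
        pass                             # process each row here

conn.close()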
added in reply to comment:
It appears that PyTables is a commonly used tool to handle memory busting data. I have no experience with PyTables.
I have a ridiculously simple Python script that uses the arcpy module. I turned it into a script tool in ArcMap and am running it that way. It works just fine; I've tested it multiple times on small datasets. The problem is that I have a very large amount of data. I need to run the script/tool on a .dbf table with 4 columns and 490,481,440 rows, and so far it has taken days. Does anyone have any suggestions on how to speed it up? To save time I've already created the columns that will be populated in the table before I run the script. "back" represents the second comma-separated value in the "back_pres_dist" column and "dist" represents the fourth. All I want is for them to be in their own separate columns. The table and script look something like this:
back_pres_dist back dist
1,1,1,2345.6
1,1,2,3533.8
1,1,3,4440.5
1,1,4,3892.6
1,1,5,1292.0
import arcpy
from arcpy import env
inputTable = arcpy.GetParameterAsText(0)
back1 = arcpy.GetParameterAsText(1) #the empty back column to be populated
dist3 = arcpy.GetParameterAsText(2) #the empty dist column to be populated
# populate "back" with the second comma-separated value of back_pres_dist
arcpy.CalculateField_management(inputTable, back1, '!back_pres_dist!.split(",")[1]', "PYTHON_9.3")
arcpy.AddMessage("back column updated.")
# populate "dist" with the fourth comma-separated value of back_pres_dist
arcpy.CalculateField_management(inputTable, dist3, '!back_pres_dist!.split(",")[3]', "PYTHON_9.3")
arcpy.AddMessage("dist column updated.")
arcpy.AddMessage("All columns updated.")
Any suggestions would be greatly appreciated. I know that reading some parts of the data into memory might speed things up, but I'm not sure how to do that with python (when using R it took forever to read into memory and was a nightmare trying to write to a .csv).
This is a ton of data. I'm guessing that your main bottleneck is read/write operations on the disk and not CPU or memory.
Your process appears to modify each row independently according to constant input values in what's essentially a tabular operation that doesn't really require GIS functionality. As a result, I would definitely look at doing this outside of the arcpy environment to avoid that overhead. While you could dump this stuff to numpy with the new arcpy.da functionality, I think that even this might be a bottleneck. Seems you should be able to more directly read your *.dbf file with a different library.
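For instance, a single pass with an arcpy.da.UpdateCursor would look roughly like this (field names follow your question; whether it actually beats CalculateField on half a billion rows is something you'd have to test):

# Sketch: one UpdateCursor pass that splits back_pres_dist once per row.
import arcpy

table = arcpy.GetParameterAsText(0)
fields = ["back_pres_dist", "back", "dist"]

with arcpy.da.UpdateCursor(table, fields) as cursor:
    for row in cursor:
        parts = row[0].split(",")
        row[1] = parts[1]            # second value -> back
        row[2] = parts[3]            # fourth value -> dist
        cursor.updateRow(row)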
In fact, this operation is not really tabular; it's really about iteration. You'll probably want to exploit things like the "WITH"/"AS" keywords (PEP 343, Raymond Hettinger has a good video on youtube, too) or iterators in general (see PEPs 234, 255), which only load a record at a time.
Beyond those general programming approaches, I'm thinking that your best bet would be to break this data into chunks, parallelize, and then reassemble the results. Part of engineering the parallelization could be to spread your data across different disk platters to avoid competing I/O requests. IPython is an add-on for Python that has a pretty easy-to-use, high-level package, "parallel", if you want an easy place to start. There are lots of pretty good videos on YouTube from PyCon 2012; there's a 3-hour one where the parallel material starts at around 2:13:00.
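As a rough sketch of the chunk-and-reassemble idea using only the standard library (it assumes you've first dumped the back_pres_dist values to a plain text file, one value per line; all file names are made up):

# Process the dump in 100k-row chunks across worker processes,
# writing the split-out columns to a CSV as results come back.
import csv
from itertools import islice
from multiprocessing import Pool

def process_chunk(lines):
    # pull the 2nd and 4th comma-separated values, as in the script above
    out = []
    for line in lines:
        parts = line.strip().split(",")
        out.append((line.strip(), parts[1], parts[3]))
    return out

def chunks(lines, size=100000):
    while True:
        block = list(islice(lines, size))
        if not block:
            return
        yield block

if __name__ == "__main__":
    with open("back_pres_dist.txt") as fin, \
         open("back_dist_split.csv", "w", newline="") as fout:
        writer = csv.writer(fout)
        with Pool() as pool:
            for rows in pool.imap(process_chunk, chunks(fin)):
                writer.writerows(rows)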