I am trying to import a CSV file with 15 columns totaling 64 GB. I am using Stata MP on a
computer with 512 GB of RAM, which I would presume is sufficient for a task like this. Yet I started the import with an import delimited command over 3 days ago, and it still has not loaded into Stata (it still shows "importing"). Has anyone else run into an issue like this, and if so, is this the kind of situation where I just need to wait longer and it will eventually import, or will I end up waiting forever?
Would anyone have recommendations on how best to tackle situations like this? I've heard that SAS tends to be much more efficient for big-data tasks since it doesn't need to read an entire dataset into memory at once, but I have no coding knowledge in SAS and am not even sure I'd have a way to access it. I do have an understanding of Python and R, but I am unsure whether either would help, since I believe they also read an entire dataset into memory by default.
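If chunked reading in Python is a workable route, I imagine it would look something like this (just a sketch; the file name, chunk size, and filtering step are placeholders):
import pandas as pd

# Placeholders: file name, chunk size, and the filtering step.
chunks = pd.read_csv("bigfile.csv", chunksize=1000000)

pieces = []
for chunk in chunks:
    # keep only the rows/columns actually needed for the analysis
    pieces.append(chunk[chunk["some_column"] > 0])

subset = pd.concat(pieces, ignore_index=True)
subset.to_csv("smaller_file.csv", index=False)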
I am writing my bachelor thesis on a project with a massive database that tracks around 8000 animals, three times a second. After a few months we now have approx. 127 million entries, and each row includes a column with an array of 1000-3000 entries holding the coordinates of every animal that was tracked in that square at that moment. All of that lies in an SQL database that now easily exceeds 2 TB in size.
To export the data and analyse the movement patterns of the animals, they used to do it online via phpMyAdmin as a CSV export, which would take hours to finish and break down almost every time.
I wrote them a Python script (they wanted me to use Python) with mysql-connector-python that fetches the data for them automatically. The problem is that, since the database is so massive, one query can take minutes or technically even hours to complete (downloading a day of tracking data is 3*60*60*24 entries).
The moment anything goes wrong (the connection fails, the computer is overloaded, etc.) the whole query is closed and has to start all over again, because the results aren't cached anywhere.
I then rewrote the whole thing as a class that fetches the data using smaller, multithreaded queries.
I start about 5-7 threads that each take a connection out of a connection pool, run their query, write the results to a CSV file successively, and put the connection back in the pool once the query is done.
My solution works perfectly: the queries are about 5-6 times faster, depending on the number of threads I use and the size of the chunks that I download. The data gets written into the file, and when the connection breaks or anything else happens, the CSV file still holds all the data that has been downloaded up to that point.
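The approach looks roughly like this (a simplified sketch, not my actual class; the table, column, and file names are made up):
import csv
import threading
from concurrent.futures import ThreadPoolExecutor
from mysql.connector import pooling

# Connection settings and the CSV path are placeholders for illustration.
pool = pooling.MySQLConnectionPool(pool_name="fetch", pool_size=7,
                                   host="localhost", user="user",
                                   password="***", database="tracking")
write_lock = threading.Lock()

def fetch_chunk(start_id, end_id):
    conn = pool.get_connection()
    try:
        cur = conn.cursor()
        cur.execute("SELECT * FROM positions WHERE id BETWEEN %s AND %s",
                    (start_id, end_id))
        rows = cur.fetchall()
        # Append each finished chunk immediately, so a broken connection
        # later cannot lose what has already been downloaded.
        with write_lock, open("day.csv", "a", newline="") as f:
            csv.writer(f).writerows(rows)
    finally:
        conn.close()  # hands the connection back to the pool

with ThreadPoolExecutor(max_workers=6) as ex:
    for start in range(0, 259200, 10000):
        ex.submit(fetch_chunk, start, start + 9999)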
But when looking for ways to improve my method, I can find absolutely nothing about a similar approach, and no-one seems to do it that way for large datasets.
What am I missing? Why does it seem like everyone is using a single-query approach to fetch their massive datasets, instead of splitting it into threads and avoiding these annoying issues with connection breaks and whatnot?
Is my solution even usable and good in a commercial environment, or are there things that I just don't see right now that would make my approach useless or even much worse?
Or maybe it is a matter of the programming language, and if I had used C# to do the same thing it would have been faster anyway?
EDIT:
To clear some things up: I am not responsible for the database. While I can tinker with it, since I also have admin rights, someone else who (hopefully) actually knows what he is doing has set it up and writes the data. My job is only to fetch it as simply and efficiently as possible. And since exporting from phpMyAdmin is too slow, and so is a single query in Python for 100k rows (I do it using pd.read_sql), I switched to multithreading. So my question is only about SELECTing the data efficiently, not about changing the DB.
I hope this is not becoming too long of a question...
There are many issues in a database of that size. We need to do the processing fast enough so that it never gets behind. (Once it lags, it will keel over, as you see.)
Ingestion. It sounds like a single client is receiving 8000 lat/lng values every 3 seconds, then INSERTing a single, quite wide row. Is that correct?
When you "process" the data, are you looking at each of the 8000 animals, or at one selected animal? Fetching a single animal's lat/lng out of a wide row is messy and slow.
If the primary way things are SELECTed is one animal at a time, then your matrix needs to be transposed. That will make selecting all the data for one animal much faster, and we can mostly avoid the impact that Inserting and Selecting have on each other.
Are you inserting while you are reading?
What is the value of innodb_buffer_pool_size? You must plan carefully with the 2TB versus the much smaller RAM size. Depending on the queries, you may be terribly I/O-bound and maybe the data structure can be changed to avoid that.
"...csv file and put it back..." -- Huh? Are you deleting data, then re-inserting it? That sees 'wrong'. And very inefficient.
Do minimize the size of every column in the table. How big is the range for the animals? Your backyard? The Pacific Ocean? How much precision is needed in the location? Meters for whales; millimeters for ants. Maybe the coordinates can be scaled to a pair of SMALLINTs (2 bytes, 16-bit precision) or MEDIUMINTs (3 bytes each)?
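For example, scaling on the Python side before INSERTing could look like this (just a sketch; the scale factor is an illustration, not a recommendation for your particular range):
# Scale lat/lng into integers that fit a MySQL MEDIUMINT (3 bytes each).
# A factor of 10000 keeps roughly 11 m of precision; pick the factor
# to match the precision your animals actually need.
SCALE = 10000

def pack_coord(lat, lng):
    return int(round(lat * SCALE)), int(round(lng * SCALE))

def unpack_coord(lat_i, lng_i):
    return lat_i / float(SCALE), lng_i / float(SCALE)

print(pack_coord(52.520008, 13.404954))   # (525200, 134050)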
I haven't dwelled on threading; I would like to wait until the rest of the issues are ironed out. Threads interfere with each other to some extent.
I find this topic interesting. Let's continue the discussion.
I'm quite new to Neo4j and already lost in all the out-of-date documentation and the very unclear commands, their effects, and their speed.
I am looking for a way to import some very large data fast.
The data is on the scale of billions of rows for one kind of data, split across multiple CSV files, but I don't mind merging them into one.
A very simple import (load csv ... create (n:XXX {id: row.id})) takes ages; with a unique index in place it takes days.
I stopped the operation, dropped the unique index and restarted; it was about 2x faster, but still too slow.
I know about neo4j-import (although it is deprecated, and there is no documentation on the neo4j website about "neo4j-admin import"), and it's already extremely unclear how to do simple things like anything conditional.
The biggest bummer is that it doesn't seem to work with an existing database.
The main question is: is there any way to accelerate the import of very large CSV files into neo4j?
First with simple statements like create, but hopefully with match as well.
Right now, running a cypher command such as "match (n:X {id: "Y"}) return n limit 1" takes multiple minutes on the 1B nodes.
(I'm running this on a server with 200 GB+ of RAM and 48 CPUs, so it is probably not a limitation from the hardware point of view.)
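To make it concrete, the kind of import I am running is essentially this (a simplified sketch through the official Python driver; the URI, credentials, label, and file name are placeholders):
from neo4j import GraphDatabase

# Placeholders: adjust the URI, credentials, label, and file name.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # The unique constraint is what I created first and later dropped,
    # because it slowed the import down even further:
    # session.run("CREATE CONSTRAINT ON (n:XXX) ASSERT n.id IS UNIQUE")
    session.run(
        "LOAD CSV WITH HEADERS FROM 'file:///nodes.csv' AS row "
        "CREATE (n:XXX {id: row.id})"
    )

driver.close()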
I have a rather complex database which I deliver in CSV format to my client. The logic to arrive at that database is an intricate mix of Python processing and SQL joins done in sqlite3.
There are ~15 source datasets ranging from a few hundred records to several million (but fairly short) records.
Instead of having a mix of Python / sqlite3 logic, for clarity, maintainability and several other reasons I would love to move ALL logic to an efficient set of Python scripts and circumvent sqlite3 altogether.
I understand that the answer, and the path to take, would be Pandas, but could you please advise whether this is the right track for a rather large database like the one described above?
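To make that concrete, I imagine replacing one of the sqlite3 joins with something like this (a sketch with invented file and column names):
import pandas as pd

# Invented file and column names, purely to illustrate the idea.
orders = pd.read_csv("orders.csv")
customers = pd.read_csv("customers.csv")

# Equivalent of: SELECT ... FROM orders JOIN customers USING (customer_id)
merged = orders.merge(customers, on="customer_id", how="inner")

merged.to_csv("deliverable.csv", index=False)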
I have been using Pandas with datasets > 20 GB in size (on a Mac with 8 GB RAM).
My main problem has been that there is a known bug in Python that makes it impossible to write files larger than 2 GB on OS X. However, using HDF5 circumvents that.
I found the tips in this and this article enough to make everything run without problems. The main lesson is to check the memory usage of your data frame and cast the column types to the smallest possible data types.
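A minimal sketch of that workflow (the file, column names, and HDF5 key are placeholders):
import pandas as pd

df = pd.read_csv("big_input.csv")   # placeholder file name

# See how much memory each column really uses.
print(df.memory_usage(deep=True))

# Downcast numeric columns to the smallest type that can hold them,
# and turn repetitive strings into categoricals.
df["count"] = pd.to_numeric(df["count"], downcast="unsigned")
df["price"] = pd.to_numeric(df["price"], downcast="float")
df["category"] = df["category"].astype("category")

# Writing to HDF5 sidesteps the 2 GB text-write problem mentioned above.
df.to_hdf("output.h5", key="data", mode="w")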
What is the fastest way of converting records holding only numeric data into fixed-width format strings and writing them to a file in Python? For example, suppose records is a huge list consisting of objects with attributes id, x, y, and wt, and we frequently need to flush them to an external file. The flushing can be done with the following snippet:
with open(serial_fname(), "w") as f:
    for r in records:
        f.write("%07d %11.5e %11.5e %7.5f\n" % (r.id, r.x, r.y, r.wt))
However, my code is spending too much time generating external files, leaving too little time for doing what it is supposed to do between the flushes.
Amendment to the original question:
I ran into this problem while writing server software that keeps track of a global record set by pulling information from several "producer" systems and relays any changes to the record set to "consumer" systems, in real time or near real time, in preprocessed form. Many of the consumer systems are Matlab applications.
I have listed below some suggestions I have received so far (thanks) with some comments:
Dump only the changes, not the whole data set: I'm actually doing this already. The resulting change sets are still huge.
Use binary (or some other more efficient) file format: I'm pretty much constrained by what Matlab can read reasonably efficiently and in addition to that the format should be platform independent.
Use database: I am actually trying to bypass the current database solution that is deemed both too slow and cumbersome, especially on Matlab's side.
Dividing the task into separate processes: at the moment the dumping code is running in its own thread. However, because of the GIL it is still competing for the same interpreter, so I guess I could move it to a completely separate process (see the sketch below).
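A rough sketch of what that separate dumper process could look like (using the multiprocessing module, which needs Python 2.6+, and made-up example records; I have not tested this):
import multiprocessing as mp

def dumper(queue, path):
    # Runs in its own process, so formatting and writing no longer
    # compete with the main work for the GIL.
    with open(path, "w") as f:
        for batch in iter(queue.get, None):   # None is the stop signal
            for r in batch:
                f.write("%07d %11.5e %11.5e %7.5f\n" % r)

if __name__ == "__main__":
    q = mp.Queue()
    p = mp.Process(target=dumper, args=(q, "snapshot.txt"))
    p.start()

    # The main process keeps computing and hands off (id, x, y, wt)
    # tuples whenever a flush is due.
    q.put([(1, 0.5, 1.5, 0.25), (2, 0.7, 2.5, 0.75)])

    q.put(None)   # tell the dumper to finish
    p.join()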
I was trying to check if numpy.savetxt could speed things up a bit so I wrote the following simulation:
import sys
import numpy as np

fmt = '%7.0f %11.5e %11.5e %7.5f'
records = 10000

np.random.seed(1234)
aray = np.random.rand(records, 4)

def writ(f, aray=aray, fmt=fmt):
    # plain loop calling f.write (note: no newline is appended here)
    fw = f.write
    for row in aray:
        fw(fmt % tuple(row))

def prin(f, aray=aray, fmt=fmt):
    # same loop, but using the print statement (which appends a newline)
    for row in aray:
        print >>f, fmt % tuple(row)

def stxt(f, aray=aray, fmt=fmt):
    np.savetxt(f, aray, fmt)

nul = open('/dev/null', 'w')

def tonul(func, nul=nul):
    func(nul)

def main():
    print 'looping:'
    writ(sys.stdout)
    print 'print:'
    prin(sys.stdout)
    print 'savetxt:'
    stxt(sys.stdout)
I found the results (on my 2.4 GHz Core Duo Macbook Pro, with Mac OS X 10.5.8, Python 2.5.4 from the DMG on python.org, numpy 1.4 rc1 built from sources) slightly surprising, but they're quite repeatable so I thought they may be of interest:
$ py25 -mtimeit -s'import ft' 'ft.tonul(ft.writ)'
10 loops, best of 3: 101 msec per loop
$ py25 -mtimeit -s'import ft' 'ft.tonul(ft.prin)'
10 loops, best of 3: 98.3 msec per loop
$ py25 -mtimeit -s'import ft' 'ft.tonul(ft.stxt)'
10 loops, best of 3: 104 msec per loop
so, savetxt seems to be a few percent slower than a loop calling write... but good old print (also in a loop) seems to be a few percent faster than write (I guess it's avoiding some kind of call overhead). I realize that a difference of 2.5% or so isn't very important, but it's not in the direction I intuitively expected it to be, so I thought I'd report it. (BTW, using a real file instead of /dev/null only uniformly adds 6 or 7 milliseconds, so it doesn't change things much, one way or another.)
I don't see anything about your snippet of code that I could really optimize. So, I think we need to do something completely different to solve your problem.
Your problem seems to be that you are chewing large amounts of data, and it's slow to format the data into strings and write the strings to a file. You said "flush" which implies you need to save the data regularly.
Are you saving all the data regularly, or just the changed data? If you are dealing with a very large data set, changing just some data, and writing all of the data... that's an angle we could attack to solve your problem.
If you have a large data set, and you want to update it from time to time... you are a candidate for a database. A real database, written in C for speed, will let you throw lots of data updates at it, and will keep all the records in a consistent state. Then you can, at intervals, run a "report" which will pull the records and write your fixed-width text file from them.
In other words, I'm proposing you divide the problem into two parts: updating the data set piecemeal as you compute or receive more data, and dumping the entire data set into your fixed-width text format, for your further processing.
Note that you could actually generate the text file from the database without stopping the Python process that is updating it. You would get an incomplete snapshot, but if the records are independent, that should be okay.
If your further processing is in Python also, you could just leave the data in the database forever. Don't bother round-tripping the data through a fixed-width text file. I'm assuming you are using a fixed-width text file because it's easy to extract the data again for future processing.
If you use the database idea, try to use PostgreSQL. It's free and it's a real database. For using a database with Python, you should use an ORM. One of the best is SQLAlchemy.
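A minimal sketch of what that might look like with SQLAlchemy (the table, columns, and connection string are placeholders, not your actual record layout):
from sqlalchemy import create_engine, Column, Integer, Float
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class Record(Base):
    __tablename__ = 'records'
    id = Column(Integer, primary_key=True)
    x = Column(Float)
    y = Column(Float)
    wt = Column(Float)

# Placeholder connection string; point it at your PostgreSQL instance.
engine = create_engine('postgresql://user:password@localhost/mydb')
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)

session = Session()
# merge() inserts new records and updates existing ones by primary key.
session.merge(Record(id=1, x=0.5, y=1.5, wt=0.25))
session.commit()

# The "report" step: dump everything as fixed-width text.
with open('snapshot.txt', 'w') as f:
    for r in session.query(Record).order_by(Record.id):
        f.write("%07d %11.5e %11.5e %7.5f\n" % (r.id, r.x, r.y, r.wt))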
Another thing to consider: if you are saving the data in a fixed-width text file format for future parsing and use of the data in another application, and if that application can read JSON as well as fixed-width, maybe you could use a C module that writes JSON. It might not be any faster, but it might; you could benchmark it and see.
Other than the above, my only other idea is to split your program into a "worker" part and an "updater" part, where the worker generates updated records and the updater part saves the records to disk. Perhaps have them communicate by having the worker put the updated records, in text format, to the standard output; and have the updater read from standard input and update its record of the data. Instead of an SQL database, the updater could use a dictionary to store the text records; as new ones arrived, it could simply update the dictionary. Something like this:
for line in sys.stdin:
    id = line[:7]        # fixed width: id is 7 wide
    records[id] = line   # will insert or update as needed
You could actually have the updater keep two dictionaries, and keep updating one while the other one is written out to disk.
Dividing into a worker and an updater is a good way to make sure the worker doesn't spend all its time updating, and a great way to balance the work across multiple CPU cores.
I'm out of ideas for now.
You can try to build all the output in memory first, e.g. as one long string, and then write that long string to the file in a single call.
Even faster: you may want to use binary files rather than text files for logging information. But then you need to write another tool to view the binary files.
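For example (a sketch; records is assumed to be the same list of objects as in the question):
import struct

# One big string, one write() call.
text = "".join("%07d %11.5e %11.5e %7.5f\n" % (r.id, r.x, r.y, r.wt)
               for r in records)
with open("out.txt", "w") as f:
    f.write(text)

# Binary alternative: fixed-size packed records mean far less formatting
# work, but the reader must know the layout (an int and three doubles).
with open("out.bin", "wb") as f:
    for r in records:
        f.write(struct.pack("<iddd", r.id, r.x, r.y, r.wt))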
Now that you updated your question, I have a slightly better idea of what you are facing.
I don't know what the "current database solution that is deemed both too slow and cumbersome" is, but I still think a database would help if used correctly.
Run the Python code to collect data, and use an ORM module to insert/update the data into the database. Then run a separate process to make a "report", which would be the fixed-width text files. The database would be doing all the work of generating your text file. If necessary, put the database on its own server, since hardware is pretty cheap these days.
You could try to push your loop to C using ctypes.