Pandas for large(r) datasets - python

I have a rather complex database which I deliver in CSV format to my client. The logic to arrive at that database is an intricate mix of Python processing and SQL joins done in sqlite3.
There are ~15 source datasets, ranging from a few hundred records to several million (but fairly short) records.
Instead of having a mix of Python / sqlite3 logic, for clarity, maintainability and several other reasons I would love to move ALL logic to an efficient set of Python scripts and circumvent sqlite3 altogether.
I understand that the answer, and the path forward, would be Pandas, but could you please advise whether this is the right track for a rather large database like the one described above?

I have been using Pandas with datasets > 20 GB in size (on a Mac with 8 GB RAM).
My main problem has been that there is a known bug in Python that makes it impossible to write files larger than 2 GB on OS X. However, using HDF5 circumvents that.
I found the tips in this and this article enough to make everything run without problems. The main lesson is to check the memory usage of your data frame and cast the columns to the smallest possible data types.
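Not part of the original answer, but a minimal sketch of that lesson (inspect memory usage, then downcast); the CSV path, HDF5 path, and the 50% cardinality threshold are arbitrary placeholders:
import pandas as pd

df = pd.read_csv("source.csv")                 # placeholder path
print(df.memory_usage(deep=True).sum())        # bytes actually used, summed over columns

for col in df.columns:
    if pd.api.types.is_integer_dtype(df[col]):
        # Shrink int64 columns to int8/int16/int32 where the values allow it.
        df[col] = pd.to_numeric(df[col], downcast="integer")
    elif pd.api.types.is_float_dtype(df[col]):
        df[col] = pd.to_numeric(df[col], downcast="float")
    elif df[col].dtype == object and df[col].nunique() < 0.5 * len(df):
        # Low-cardinality strings take far less memory as categoricals.
        df[col] = df[col].astype("category")

print(df.memory_usage(deep=True).sum())        # should be noticeably smaller

# Writing the frame to HDF5 (requires PyTables) also sidesteps the 2 GB
# write issue mentioned above.
df.to_hdf("store.h5", key="data", mode="w")    # placeholder path/key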

Related

Best Method for Importing Very Large CSV into Stata and/or Other Statistical Software

I am trying to import a CSV file containing 15 columns and totaling 64 GB. I am using Stata MP on a computer with 512 GB of RAM, which I would presume is sufficient for a task such as this. Yet I started the import using an import delimited command over 3 days ago, and it still has not loaded into Stata (it still shows importing). Has anyone else run into an issue like this, and if so, is this the type of situation where I just need to wait longer and it will eventually import, or will I end up waiting forever?
Would anyone have recommendations on how to best tackle situations such as this? I've heard that SAS tends to be much more efficient for big data tasks since it doesn't need to read an entire dataset into memory at once, but I do not have any coding knowledge in SAS and am not even sure if I'd have a way to access it. I do have an understanding of Python and R, but I am unsure if either would be of benefit since I believe they also read an entire dataset into memory.
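For what it's worth (this is not from the original post), Python with pandas does not have to read the entire file into memory at once; a rough sketch of chunked processing, where the file name, chunk size, and the filtering step are all placeholders:
import pandas as pd

pieces = []
# Stream the 64 GB file in slices of one million rows instead of loading it whole.
for chunk in pd.read_csv("bigfile.csv", chunksize=1_000_000):
    # Reduce each chunk (filter, aggregate, select columns) before keeping it.
    pieces.append(chunk[chunk["some_column"] > 0])     # hypothetical filter

result = pd.concat(pieces, ignore_index=True)
result.to_csv("filtered.csv", index=False)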

Is the choice of Python and Hadoop a good one for this scenario?

I am looking for a solution to build an application with the following features:
A database composed of potentially millions of rows in one table, which might be related to a few small ones.
Fast single queries, such as SELECT * FROM table WHERE field LIKE '%value'
It will run on a Linux server: a single node for now, but maybe multiple nodes in the future.
Do you think Python and Hadoop is a good choice?
Where could I find a quick example written in Python to add/retrieve information to/from Hadoop, so I can see a proof of concept running with my own eyes and make a decision?
Thanks in advance!
Not sure whether these questions are on topic here, but fortunately the answer is simple enough:
These days a million rows is simply not that large anymore; even Excel can hold more than a million.
If you have a few million rows in a large table and want to run quick, small select statements, the answer is that you are probably better off without Hadoop.
Hadoop is great for datasets of 100 million rows and up, but it does not scale down too well (in performance and required maintenance).
Therefore, I would recommend trying a 'normal' database solution, like MySQL, at least until your data starts growing significantly.
You can use Python for advanced analytical processing, but for simple queries I would recommend using SQL.
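To illustrate the 'normal database plus SQL' route, a minimal sketch using sqlite3 from the standard library as a stand-in for MySQL; the table, column, and file names are made up:
import sqlite3

conn = sqlite3.connect("app.db")    # placeholder database file
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS items (id INTEGER PRIMARY KEY, field TEXT)")
cur.execute("CREATE INDEX IF NOT EXISTS idx_items_field ON items (field)")
conn.commit()

# In most databases a prefix search (LIKE 'value%') can use the index,
# while a leading wildcard (LIKE '%value') forces a full table scan.
cur.execute("SELECT * FROM items WHERE field LIKE ?", ("value%",))
rows = cur.fetchall()
conn.close()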

Neo4j - Import very large CSV into existing Database

I'm quite new to Neo4j and already lost among all the out-of-date documentation and the very unclear commands, their effects, and their speed.
I am looking for a way to import some very large data fast.
The data is on the scale of billions of rows for one kind of data, split into multiple CSV files, but I don't mind fusing them into one.
A very simple import (load csv ... create (n:XXX {id: row.id})) takes ages; with a unique index it takes days.
I stopped the operation, dropped the unique index and restarted; it was about 2x faster, but still too slow.
I know about neo4j-import (although it is deprecated, and there is no documentation on the Neo4j website about "neo4j-admin import"). It is already extremely unclear how to do simple things like anything conditional.
The biggest bummer is that it doesn't seem to work with an existing database.
The main question is: is there any way to accelerate the import of very large CSV files with Neo4j?
First with simple statements like create, but hopefully with match as well.
Right now, running a Cypher command such as "match (n:X {id: "Y"}) return n limit 1" takes multiple minutes on the 1B nodes.
(I'm running this on a server with 200 GB+ of RAM and 48 CPUs, so it is probably not a limitation from the hardware point of view.)
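Not an answer from the original thread, but one common workaround is to batch the rows yourself and send them through the official Neo4j Python driver with UNWIND, one transaction per batch, instead of one CREATE per row. The connection details, label, and batch size below are placeholder assumptions:
from csv import DictReader
from neo4j import GraphDatabase    # official Neo4j Python driver

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))  # placeholders
BATCH_SIZE = 10_000                # arbitrary; tune to your transaction limits

def flush(session, rows):
    # One transaction per batch; UNWIND creates all nodes in a single statement.
    session.run("UNWIND $rows AS row CREATE (n:XXX {id: row.id})", rows=rows)

def load(path):
    with driver.session() as session, open(path, newline="") as f:
        batch = []
        for row in DictReader(f):
            batch.append({"id": row["id"]})
            if len(batch) >= BATCH_SIZE:
                flush(session, batch)
                batch = []
        if batch:
            flush(session, batch)

load("nodes.csv")                  # placeholder CSV path
driver.close()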

Speeding up an ArcPy Python script, big data

I have a ridiculously simple Python script that uses the arcpy module. I turned it into a script tool in ArcMap and am running it that way. It works just fine; I've tested it multiple times on small datasets. The problem is that I have a very large amount of data. I need to run the script/tool on a .dbf table with 4 columns and 490,481,440 rows, and so far it has taken days. Does anyone have any suggestions on how to speed it up? To save time, I've already created the columns that will be populated in the table before I run the script. "back" represents the second comma-separated value in the "back_pres_dist" column and "dist" represents the fourth. All I want is for them to be in their own separate columns. The table and script look something like this:
back_pres_dist back dist
1,1,1,2345.6
1,1,2,3533.8
1,1,3,4440.5
1,1,4,3892.6
1,1,5,1292.0
import arcpy
from arcpy import env
inputTable = arcpy.GetParameterAsText(0)
back1 = arcpy.GetParameterAsText(1) #the empty back column to be populated
dist3 = arcpy.GetParameterAsText(2) #the empty dist column to be populated
arcpy.CalculateField_management(inputTable, back1, '!back_pres_dist!.split(",")[1]', "PYTHON_9.3")
updateMess = arcpy.AddMessage("back column updated.")
arcpy.CalculateField_management(inputTable, dist3, '!back_pres_dist!.split(",")[3]', "PYTHON_9.3")
updateMess = arcpy.AddMessage("dist column updated.")
updateMess = arcpy.AddMessage("All columns updated.")
Any suggestions would be greatly appreciated. I know that reading some parts of the data into memory might speed things up, but I'm not sure how to do that with Python (when using R it took forever to read the data into memory, and it was a nightmare trying to write it back out to a .csv).
This is a ton of data. I'm guessing that your main bottleneck is read/write operations on the disk and not CPU or memory.
Your process appears to modify each row independently according to constant input values, in what's essentially a tabular operation that doesn't really require GIS functionality. As a result, I would definitely look at doing this outside of the arcpy environment to avoid that overhead. While you could dump this stuff to numpy with the new arcpy.da functionality, I think even that might be a bottleneck. It seems you should be able to read your *.dbf file more directly with a different library.
In fact, this operation is not really tabular; it's really about iteration. You'll probably want to exploit things like the "with"/"as" keywords (PEP 343; Raymond Hettinger has a good video on YouTube, too) or iterators in general (see PEPs 234 and 255), which only load one record at a time.
Beyond those general programming approaches, I'm thinking that your best bet would be to break this data into chunks, parallelize, and then reassemble the results. Part of engineering the parallelization could be to spread your data across different disk platters to avoid competition between I/O requests. IPython is an add-on for Python that has a pretty easy-to-use, high-level package, "parallel", if you want an easy place to start. There are lots of pretty good videos on YouTube from PyCon 2012; there's a 3-hour one where the parallel material starts at 2:13:00 or so.
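As a concrete starting point (a sketch only, assuming ArcGIS 10.1+ where arcpy.da is available; the table path is a placeholder), an arcpy.da.UpdateCursor streams one row at a time rather than loading the whole table, and is typically much faster than CalculateField for this kind of string splitting:
import arcpy

table = r"C:\data\pressures.dbf"                     # placeholder path
fields = ["back_pres_dist", "back", "dist"]

with arcpy.da.UpdateCursor(table, fields) as cursor:
    for back_pres_dist, back, dist in cursor:
        parts = back_pres_dist.split(",")
        # Cast parts[1]/parts[3] with int()/float() if the target columns are numeric.
        cursor.updateRow([back_pres_dist, parts[1], parts[3]])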

Creating big Excel sheets programmatically

We are using OpenPyxl to export MySQL content to Microsoft Excel in XLSX format:
https://bitbucket.org/ericgazoni/openpyxl/overview
However, the amount of data we are dealing with is big, and we are running into out-of-memory situations. Tables may contain up to 400 columns across 50,000+ rows. Even though the files are big, they are not so big that Microsoft Excel or OpenOffice should have problems with them.
We are assuming our issues mainly stem from the fact that Python does not keep the XML DOM structure in memory in an efficient enough manner.
EDIT: Eric, the author of OpenPyxl, pointed out that there is an option to make OpenPyxl write with fixed memory usage. However, this didn't solve our problem completely, as we still have issues with raw speed and something else taking up too much memory in Python.
Now we are looking for more efficient ways to create Excel files. With Python preferably, but if we cannot find a good solution we might want to look other programming languages as well.
Options, not in any specific order, include
1) Using OpenOffice and PyUNO, and hoping their memory structures are more efficient than OpenPyxl's and that the TCP/IP call bridge is efficient enough
2) OpenPyxl uses xml.etree. Would Python lxml (the libxml2 native extension) be more efficient with XML memory structures, and is it possible to replace xml.etree directly with an lxml drop-in, e.g. with monkey-patching? (Later the changes could be contributed back to OpenPyxl if there is a clear benefit.)
3) Export from MySQL to CSV and then post-process the CSV files directly to XLSX using Python and file iteration
4) Use other programming languages and libraries (Java)
Pointers:
http://dev.lethain.com/handling-very-large-csv-and-xml-files-in-python/
http://enginoz.wordpress.com/2010/03/31/writing-xlsx-with-java/
If you're going to use Java, you will want to use Apache POI, but likely not the regular usermodel, as you want to keep your memory footprint down.
Instead, take a look at BigGridDemo, which shows you how to write a very large xlsx file using POI, with most of the work not happening in memory.
You might also find that the technique used in the BigGridDemo could equally be used in Python?
Have you tried the optimized writer for openpyxl? It's a recent feature (2 months old), but it's quite robust (it's used in production in several corporate projects) and can handle an almost indefinite amount of data with steady memory consumption (around 7 MB):
http://packages.python.org/openpyxl/optimized.html#optimized-writer
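For reference (not from the original answer), this is roughly what the streaming writer looks like; current openpyxl versions expose it as write-only mode (Workbook(write_only=True)), while the release discussed above called it the optimized writer. The row generator and file name are placeholders:
from openpyxl import Workbook

def fetch_rows():
    # Placeholder for your MySQL cursor: yields one row at a time.
    for i in range(50000):
        yield [i] + ["value"] * 399

wb = Workbook(write_only=True)     # rows are streamed to disk, memory stays flat
ws = wb.create_sheet()
for row in fetch_rows():
    ws.append(row)                 # append one row at a time; nothing is retained
wb.save("export.xlsx")             # placeholder filename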
