I'm trying to figure out the most efficient way to store time-value pairs in PyTables. I'm using PyTables since I'm dealing with huge amounts of data. I will need to perform calculations on the data (average, interpolate, etc.). I don't know the number of rows ahead of time.
I know that an EArray can be appended to, much like a Table. Is there a reason to choose one over the other?
Given my simple data structure (homogeneous time-value pairs) I figured an EArray would be faster/more efficient, but the following quote from the PyTables creator himself threw me off:
"...PyTables is specially tuned for, well, tables.
And these entities wear special I/O buffers and query engines that are
fined tuned for maximum speed. *Array objects do not wear the same
machinery."quote location
If the columns have some particular meaning or name, then you should definitely use a Table.
The efficiency largely depends on what kinds of operations you are doing on the data. Most of the time there won't be much of a difference: an EArray might be faster for row access, Tables are probably slightly better at column access, and they should be very similar for whole Table/EArray access.
Of course, the moment you want to do something more than simply access elements and instead want to query or transform the data, you should use a Table. Tables are really built up around this idea of querying, via where() methods, and indexing, which makes such operations very fast. EArrays lack this infrastructure and are therefore slower.
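To make that concrete, here is a minimal sketch of the Table approach, assuming a simple two-column description (the file, node, and column names are mine, not from the question); the where() call is the indexed query machinery that an EArray does not have:

import numpy as np
import tables as tb

class TimeValue(tb.IsDescription):
    time = tb.Float64Col()   # e.g. a POSIX timestamp
    value = tb.Float64Col()

with tb.open_file("timeseries.h5", mode="w") as h5:
    table = h5.create_table("/", "readings", TimeValue)
    # Rows can be appended as they arrive; the total count need not be known up front.
    table.append([(1.0, 10.5), (2.0, 11.0), (3.0, 9.8)])
    table.cols.time.create_index()  # speeds up where() queries on `time`

    # In-kernel query: only matching rows are materialized in memory.
    recent = [row["value"] for row in table.where("time > 1.5")]
    print(np.mean(recent))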
Related
I am currently trying to develop macros/programs to help me edit a big database in Excel.
Just recently I successfully wrote a custom macro in VBA which stores two big arrays in memory and compares them by a single column in each (for example, by names). The common items that appear in both arrays are then copied into another temporary array together with the other entries in the same row. So if row(11)'s name was "Tom", it was common to both arrays, and next to Tom was his salary of 10,000 and his phone number, then the entire row would be copied.
This was not easy, but I got to it somehow.
Now, this works like a charm for arrays as big as 10,000 rows x 5 columns + another array of the same size 10,000 rows x 5 columns. It compares and writes back to a new sheet in a few seconds. Great!
But now I tried a much bigger array with this method, say 200,000 rows x 10 columns, plus a second array to be compared of 10,000 rows x 10 columns... and it took a lot of time.
The problem is that Excel only runs at 25% CPU - I checked online and apparently that is normal.
Thus, I am assuming that to get a better performance I would need to use another 'tool', in this case another programming language.
I've heard that Python is great, Python is easy, etc., but I am no programmer; I just learned a few dozen object names and I know some logic, so I got by in VBA.
Is it Python? Or perhaps changing the programming language won't help? It is really important to me that the language is not too complicated - I've seen C++ and it stings my eyes; I literally have no idea what is going on in that code.
If it is indeed Python, what libraries should I start with? Perhaps learn some easy things first and then go into those arrays etc.?
Thanks!
I have no intention of being condescending, but anything I say would sound condescending, so be it.
The operation you are doing is called join. It's a common operation in any kind of database. Unfortunately, Excel is not a database.
I suspect that you are doing an N x M operation in Excel. A 200,000-row x 10,000-row operation quickly explodes: pick a key in N, search for a row in M, and produce a result. When you do this, regardless of programming language, the order of computation becomes so large that there is no way to finish the task in a reasonable amount of time.
In this case, 200,000 rows x 10,000 rows requires about 5,000 lookups on average for each of the 200,000 rows. That's 1,000,000,000 lookups in total.
So, how do real databases do this in a reasonable amount of time? They use an index. When you look into this 10,000-row table, what you are looking for is indexed, so searching for a row becomes log2(10,000). The total order of computation becomes N * log2(M), which is far more manageable. If you hash the key, the search cost is almost O(1) - meaning it's constant - so the order of computation becomes N.
What you are doing is, in real database terms, probably a full table scan. It is something to avoid in a real database because it is slow.
If you use any real (SQL) database, or a programming language that provides key-based search over a dataset, your join will become really fast. It has nothing to do with any particular programming language; it is really Computer Science 101.
I do not know anything about what Excel can do. If Excel provides some facility to look up a row based on indexing or hashing, you may be able to speed it up drastically.
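To illustrate what the index buys you, here is a minimal Python sketch (the row layout is invented for the example) that builds a dictionary on the smaller array's key column, so the whole join costs roughly N + M operations instead of N x M:

# Column 0 is the key (e.g. a name); the real arrays would be far larger.
big = [("Tom", 10000, "555-0101"), ("Ann", 12000, "555-0102")]   # ~200,000 rows
small = [("Tom", "Sales"), ("Bob", "IT")]                        # ~10,000 rows

# Build a hash index on the smaller array once: O(M).
index = {row[0]: row for row in small}

# One pass over the big array with O(1) lookups per row: O(N).
joined = [row + index[row[0]][1:] for row in big if row[0] in index]

print(joined)  # [('Tom', 10000, '555-0101', 'Sales')]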
Ideally you want to design a database (there are many, such as SQLite, PostgreSQL, MySQL, etc.) and stick your data into it. SQL is the language for talking to a database (DML, data manipulation language) and for creating/editing the structure of the database (DDL, data definition language).
Why a database? You'll get data validation and the ability to query data with many relationships (such as one-to-many, e.g. one author can have many books; you'll have an Author table and a Book table and will need to join them).
Pandas works not just with databases but also with CSV and text files, Microsoft Excel, and HDF5, and it is great for reading these into in-memory structures and writing them back out, as well as merging, joining, and slicing the data. The quickest way to what you want is likely to read the data you have into pandas DataFrames and manipulate it from there. This makes a database optional, though recommended. See Pandas Merging 101 for an idea of what you can do with pandas.
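As a rough sketch of that workflow (the file and column names are placeholders, not from your data), reading two workbooks and joining them on a shared name column might look like this:

import pandas as pd

big = pd.read_excel("employees.xlsx")      # e.g. 200,000 rows: Name, Salary, Phone, ...
small = pd.read_excel("departments.xlsx")  # e.g. 10,000 rows: Name, Department, ...

# Inner join keeps only the names present in both frames,
# carrying the rest of each row along.
merged = pd.merge(big, small, on="Name", how="inner")

merged.to_excel("merged.xlsx", index=False)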
Another Python tool you could use is SQLAlchemy, which is an ORM (object-relational mapper): it converts, say, a row in an Author table to an Author class object in Python. Whilst it's important to know SQL and database principles, you don't need to write SQL statements directly when using SQLAlchemy.
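For a feel of what that looks like, here is a minimal SQLAlchemy sketch (the Author table is just an illustration, reusing the author/book example above):

from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class Author(Base):
    __tablename__ = "author"
    id = Column(Integer, primary_key=True)
    name = Column(String)

engine = create_engine("sqlite:///library.db")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(Author(name="Ursula K. Le Guin"))
    session.commit()
    # Query without hand-writing SQL.
    authors = session.query(Author).filter(Author.name.like("Ursula%")).all()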
Each of these areas is huge, like the ocean. You can dip your toes into each, but if you wade in too deep you'll want to know how to swim. I have fist-sized books on each (that I've not finished) to give you a rough idea of what I mean by this.
A possible roadmap may look like:
Database (optional but recommended):
Learn about relational data
Learn database design
Learn SQL
Pandas (highly recommended):
Learn to read and write data (to Excel / a database)
Learn to merge, join, concatenate and update a DataFrame
I have a bunch of code that deals with document clustering. One step involves calculating the similarity (for some unimportant definition of "similar") of every document to every other document in a given corpus, and storing the similarities for later use. The similarities are bucketed, and I don't care what the specific similarity is for purposes of my analysis, just what bucket it's in. For example, if documents 15378 and 3278 are 52% similar, the ordered pair (3278, 15378) gets stored in the [0.5,0.6) bucket. Documents sometimes get added to or removed from the corpus after initial analysis, so corresponding pairs get added to or removed from the buckets as needed.
I'm looking at strategies for storing these lists of ID pairs. We found a SQL database (where most of our other data for this project lives) to be too slow and too large disk-space-wise for our purposes, so at the moment we store each bucket as a compressed list of integers on disk (originally zlib-compressed, but now using lz4 instead for speed). Things I like about this:
Reading and writing are both quite fast
After-the-fact additions to the corpus are fairly straightforward to add (a bit less so for lz4 than for zlib because lz4 doesn't have a framing mechanism built in, but doable)
At both write and read time, data can be streamed so it doesn't need to be held in memory all at once, which would be prohibitive given the size of our corpora
Things that kind of suck:
Deletes are a huge pain, and basically involve streaming through all the buckets and writing out new ones that omit any pairs that contain the ID of a document that's been deleted
I suspect I could still do better both in terms of speed and compactness with a more special-purpose data structure and/or compression strategy
So: what kinds of data structures should I be looking at? I suspect that the right answer is some kind of exotic succinct data structure, but this isn't a space I know very well. Also, if it matters: all of the document IDs are unsigned 32-bit ints, and the current code that handles this data is written in C, as Python extensions, so that's probably the general technology family we'll stick with if possible.
How about using one hash table or B-tree per bucket?
On-disk hash tables are standard. Maybe the BerkeleyDB libraries (available in stock Python) will work for you; but be advised that since they come with transactions they can be slow, and may require some tuning. There are a number of choices - gdbm, tdb - that you should all give a try. Just make sure you check out the API and initialize them with an appropriate size. Some will not resize automatically, and if you feed them too much data their performance just drops a lot.
Anyway, you may want to use something even more low-level, without transactions, if you have a lot of changes.
A pair of ints is a long - and most databases should accept a long as a key; in fact many will accept arbitrary byte sequences as keys.
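For example, a sketch of that idea using Python's standard dbm module (rather than BerkeleyDB specifically), packing each pair of unsigned 32-bit document IDs into an 8-byte key:

import dbm
import struct

def pair_key(doc_a, doc_b):
    # Pack two unsigned 32-bit ints into one 8-byte key, smaller ID first.
    a, b = sorted((doc_a, doc_b))
    return struct.pack("<II", a, b)

with dbm.open("bucket_0.5", "c") as bucket:
    bucket[pair_key(3278, 15378)] = b""      # membership only; the value is unused
    print(pair_key(3278, 15378) in bucket)   # True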
Why not just store a table containing stuff that was deleted since the last re-write?
This table could be the same structure as your main bucket, maybe with a Bloom filter for quick membership checks.
You can re-write the main bucket data without the deleted items either when you were going to re-write it anyway for some other modification, or when the ratio of deleted items:bucket size exceeds some threshold.
This scheme could work either by storing each deleted pair alongside each bucket, or by storing a single table for all deleted documents: I'm not sure which is a better fit for your requirements.
Keeping a single table, it's hard to know when you can remove an item unless you know how many buckets it affects, short of re-writing all the buckets whenever the deletion table gets too large. This could work, but it's a bit stop-the-world.
You also have to do two checks for each pair you stream in (i.e. for (3278, 15378), you'd check whether either 3278 or 15378 has been deleted, instead of just checking whether the pair (3278, 15378) has been deleted).
Conversely, the per-bucket table of each deleted pair would take longer to build, but be slightly faster to check, and easier to collapse when re-writing the bucket.
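A rough sketch of the single-table variant, with a plain Python set standing in for the deletion table / Bloom filter:

deleted_docs = {15378}  # document IDs removed since the last bucket rewrite

def live_pairs(bucket_pairs, deleted):
    """Stream a bucket, skipping any pair that mentions a deleted document."""
    for a, b in bucket_pairs:
        if a not in deleted and b not in deleted:
            yield (a, b)

bucket = [(3278, 15378), (101, 202)]
print(list(live_pairs(bucket, deleted_docs)))  # [(101, 202)]

# Rewrite the bucket for real (dropping the dead pairs) once the size of
# deleted_docs relative to the bucket crosses some threshold.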
You are trying to reinvent what already exists in new-age NoSQL data stores.
There are two very good candidates for your requirements:
Redis.
MongoDB
Both support data structures like dictionaries, lists, and queues. Operations like append, modify, and delete are also available in both, and are very fast.
The performance of both of them is driven by the amount of data that can reside in RAM.
Since most of your data is integer based, that should not be a problem.
My personal suggestion is to go with Redis, with a good persistence configuration (i.e. the data should periodically be saved from RAM to disk).
Here is a brief overview of the Redis data structures:
http://redis.io/topics/data-types-intro
The Redis database is a lightweight installation, and a client is available for Python.
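As a sketch of how the buckets could map onto Redis with the redis-py client (the key name is invented for the example), each bucket becomes a set of packed ID pairs:

import struct
import redis

r = redis.Redis()  # assumes a local server with persistence (RDB/AOF) configured

def pair_member(doc_a, doc_b):
    a, b = sorted((doc_a, doc_b))
    return struct.pack("<II", a, b)

# Add a pair to the [0.5, 0.6) bucket.
r.sadd("bucket:0.5", pair_member(3278, 15378))

# Membership checks and deletes are O(1) per element.
print(r.sismember("bucket:0.5", pair_member(3278, 15378)))
r.srem("bucket:0.5", pair_member(3278, 15378))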
I am creating a Python desktop application that allows users to select different distributional forms to model agricultural yield data. I have the time series agricultural data - close to a million rows - saved in a SQLite database (although this is not set in stone if someone knows of a better choice). Once the user selects some data, say corn yields from 1990-2010 in Illinois, I want them to select a distributional form from a drop-down. Next, my function fits the distribution to the data and outputs 10,000 points drawn from that fitted distributional form in a Numpy array. I would like this data to be temporary during the execution of the program.
In an attempt to be efficient, I would only like to make this fit and the subsequent drawing of numbers one time for a specified region and distribution. I have been researching temporary files in Python, but I am not sure that is the best approach for saving many different Numpy arrays. PyTables also looks like an interesting approach and seems to be compatible with Numpy, but I am not sure it is good for handling temporary data. NoSQL solutions, like MongoDB, seem to be very popular these days as well, which also interests me from a resume-building perspective.
Edit: After reading the comment below and researching it, I am going to go with PyTables, but I am trying to find the best way to tackle this. Is it possible to create a table like the one below where, instead of Float32Col, I can use createTimeSeriesTable() from the scikits.timeseries class, or do I need to create a datetime column for the date and a boolean column for the mask, in addition to the Float32Col below to hold the data? Or is there a better way to go about this problem?
class Yield(IsDescription):
    geography_id = UInt16Col()
    data = Float32Col(shape=(50, 1))  # for 50 years of data
Any help on the matter would be greatly appreciated.
What's your use case for the temporary data? Are you just going to be reading it all in at once (and never wanting to just read in a subset)?
If so, just save the array to a temporary file (e.g. with numpy.save, or equivalently, pickle with a binary protocol). There's no need for fancier solutions in that case.
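A minimal sketch of that, caching one temporary .npy file per (region, distribution) pair so the fit and the 10,000 draws happen only once (fit_and_sample is a hypothetical stand-in for your fitting code):

import tempfile
import numpy as np

_cache = {}  # (region, distribution) -> path of the saved draws

def get_draws(region, distribution, fit_and_sample):
    """Fit and sample once per (region, distribution); reuse the temp file afterwards."""
    key = (region, distribution)
    if key not in _cache:
        draws = fit_and_sample(region, distribution)  # hypothetical: returns 10,000 draws
        tmp = tempfile.NamedTemporaryFile(suffix=".npy", delete=False)
        np.save(tmp, draws)
        tmp.close()
        _cache[key] = tmp.name
    return np.load(_cache[key])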
On a side note, I'd highly recommend PyTables over SQLite for storing your original time series data.
Based on what it sounds like you're doing, you're not going to need the "relational" parts of a relational database (e.g. joins). If you don't need to join or relate tables, just need fast simple queries, and want the data in memory as a numpy array, PyTables is an excellent option. PyTables uses HDF to store your data, which can be much more compact on disk than a SQLite database. PyTables is also considerably faster for loading large chunks of data into memory as numpy arrays.
I have a large dataset of events in a Postgres database that is too large to analyze in memory. Therefore I would like to quantize the datetimes to a regular interval and perform group by operations within the database prior to returning results. I thought I would use SqlSoup to iterate through the records in the appropriate table and make the necessary transformations. Unfortunately I can't figure out how to perform the iteration in such a way that I'm not loading references to every record into memory at once. Is there some way of getting one record reference at a time in order to access the data and update each record as needed?
Any suggestions would be most appreciated!
Chris
After talking with some folks, it's pretty clear the better answer is to use Pig to process and aggregate my data locally. At the scale I'm operating at, it wasn't clear that Hadoop was the appropriate tool to be reaching for. One person I talked to about this suggested Pig will be orders of magnitude faster than in-DB operations at the scale I'm operating at, which is about 10^7 records.
I'm developing an app that handles sets of financial series data (input as CSV or OpenDocument); one set could be, say, 10s x 1000s of up-to-double-precision numbers (simplifying, but that's what matters).
I plan to do operations on that data (e.g. sum, difference, averages, etc.), as well as generating, say, another column based on computations on the input. This will be between columns (row-level operations) within one set and also between columns on many (potentially all) sets at the row level. I plan to write it in Python, and it will eventually need an intranet-facing interface to display the results/graphs, etc.; for now, CSV output based on some input parameters will suffice.
What is the best way to store and manipulate the data? So far I see my choices as being either (1) write CSV files to disk and trawl through them to do the math, or (2) put them into a database and rely on the database to handle the math. My main concern is speed/performance as the number of datasets grows, as there will be inter-dataset row-level math that needs to be done.
-Has anyone had experience going down either path and what are the pitfalls/gotchas that I should be aware of?
-What are the reasons why one should be chosen over another?
-Are there any potential speed/performance pitfalls/boosts that I need to be aware of before I start that could influence the design?
-Is there any project or framework out there to help with this type of task?
-Edit-
More info:
The rows will all be read in order, BUT I may need to do some resampling/interpolation to match the differing input lengths as well as the differing timestamps of each row. Since each dataset will always have a differing length that is not fixed, I'll have some scratch table/memory somewhere to hold the interpolated/resampled versions. I'm not sure whether it makes more sense to try to store this (and try to upsample/interpolate to a common higher length) or just regenerate it each time it's needed.
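For the resampling step, a sketch of lining up two differently-timestamped series on a common grid with numpy.interp (the timestamps and values are made up):

import numpy as np

# Two series with different lengths and timestamps.
t_a = np.array([0.0, 1.0, 2.5, 4.0]); v_a = np.array([10.0, 11.0, 9.5, 12.0])
t_b = np.array([0.0, 2.0, 4.0]);      v_b = np.array([100.0, 98.0, 103.0])

# A common grid spanning both series.
grid = np.linspace(0.0, 4.0, num=9)

a_on_grid = np.interp(grid, t_a, v_a)
b_on_grid = np.interp(grid, t_b, v_b)

# Row-level math across datasets now lines up element-wise.
diff = a_on_grid - b_on_grid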
"I plan to do operations on that data (eg. sum, difference, averages etc.) as well including generation of say another column based on computations on the input."
This is the standard use case for a data warehouse star-schema design. Buy Kimball's The Data Warehouse Toolkit. Read (and understand) the star schema before doing anything else.
"What is the best way to store the data and manipulate?"
A Star Schema.
You can implement this as flat files (CSV is fine) or RDBMS. If you use flat files, you write simple loops to do the math. If you use an RDBMS you write simple SQL and simple loops.
"My main concern is speed/performance as the number of datasets grows"
Nothing is as fast as a flat file. Period. RDBMS is slower.
The RDBMS value proposition stems from SQL being a relatively simple way to specify SELECT SUM(), COUNT() FROM fact JOIN dimension WHERE filter GROUP BY dimension attribute. Python isn't as terse as SQL, but it's just as fast and just as flexible. Python competes against SQL.
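As a sketch of the flat-file-plus-simple-loops equivalent in Python (the fact and dimension file layouts are invented for the example):

import csv
from collections import defaultdict

# dimension.csv: dim_id,region    fact.csv: dim_id,amount
with open("dimension.csv") as f:
    region_by_id = {row["dim_id"]: row["region"] for row in csv.DictReader(f)}

totals = defaultdict(float)
counts = defaultdict(int)
with open("fact.csv") as f:
    for row in csv.DictReader(f):
        region = region_by_id[row["dim_id"]]    # the "join" to the dimension
        totals[region] += float(row["amount"])  # SUM()
        counts[region] += 1                     # COUNT()

for region in sorted(totals):
    print(region, totals[region], counts[region])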
"pitfalls/gotchas that I should be aware of?"
DB design. If you don't get the star schema and how to separate facts from dimensions, all approaches are doomed. Once you separate facts from dimensions, all approaches are approximately equal.
"What are the reasons why one should be chosen over another?"
RDBMS slow and flexible. Flat files fast and (sometimes) less flexible. Python levels the playing field.
"Are there any potential speed/performance pitfalls/boosts that I need to be aware of before I start that could influence the design?"
Star Schema: central fact table surrounded by dimension tables. Nothing beats it.
"Is there any project or framework out there to help with this type of task?"
Not really.
For speed optimization, I would suggest two other avenues of investigation beyond changing your underlying storage mechanism:
1) Use an intermediate data structure.
If maximizing speed is more important than minimizing memory usage, you may get good results from using a different data structure as the basis of your calculations, rather than focusing on the underlying storage mechanism. This is a strategy that, in practice, has dramatically reduced runtime in projects I've worked on, regardless of whether the data was stored in a database or in text (in my case, XML).
While sums and averages require only O(n) runtime, more complex calculations could easily push that to O(n^2) without applying this strategy. O(n^2) would be a performance hit that would likely have far more of a perceived speed impact than whether you're reading from CSV or a database. An example case would be if your data rows reference other data rows, and there's a need to aggregate data based on those references.
So if you find yourself doing calculations more complex than a sum or an average, you might explore data structures that can be created in O(n) and would keep your calculation operations in O(n) or better. As Martin pointed out, it sounds like your whole data sets can be held in memory comfortably, so this may yield some big wins. What kind of data structure you'd create would be dependent on the nature of the calculation you're doing.
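As a sketch of that idea (the row layout is hypothetical): index the rows by ID once in O(n), so that each cross-row reference resolves in O(1) during aggregation instead of triggering another O(n) scan:

# Each row references a parent row; we want values aggregated per parent.
rows = [
    {"id": 1, "parent_id": None, "value": 5.0},
    {"id": 2, "parent_id": 1,    "value": 2.5},
    {"id": 3, "parent_id": 1,    "value": 1.5},
]

# O(n): build the lookup structure once.
by_id = {row["id"]: row for row in rows}

# O(n): each reference is an O(1) dictionary lookup, not a scan.
totals = {}
for row in rows:
    parent = by_id.get(row["parent_id"])
    if parent is not None:
        totals[parent["id"]] = totals.get(parent["id"], 0.0) + row["value"]

print(totals)  # {1: 4.0}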
2) Pre-cache.
Depending on how the data is to be used, you could store the calculated values ahead of time. As soon as the data is produced/loaded, perform your sums, averages, etc., and store those aggregations alongside your original data, or hold them in memory as long as your program runs. If this strategy is applicable to your project (i.e. if the users aren't coming up with unforeseen calculation requests on the fly), reading the data shouldn't be prohibitively long-running, whether the data comes from text or a database.
What matters most is whether all the data will fit simultaneously into memory. From the size that you give, it seems that this is easily the case (a few megabytes at worst).
If so, I would discourage using a relational database and would do all operations directly in Python. Depending on what other processing you need, I would probably rather use binary pickles than CSV.
Are you likely to need all rows in order or will you want only specific known rows?
If you need to read all the data there isn't much advantage to having it in a database.
edit: If the data fits in memory then a simple CSV is fine. Plain-text data formats are always easier to deal with than opaque ones if you can use them.