Best structure for on-disk retrieval of large data using Python?

Best structure for on-disk retrieval of large data using Python? - python

I basically have a large (multi-terabyte) dataset of text (it's in JSON but I could change it to dict or dataframe). It has multiple keys, such as "group" and "user".
Right now I'm filtering the data by reading through the entire text for these keys. It would be far more efficient to have a structure where I filter and read only the key.
Doing the above would be trivial if it fit in memory, and I could use standard dict/pandas methods and hash tables. But it doesn't fit in memory.
There must be an off the shelf system for this. Can anyone recommend one?
There are discussions about this, but some of the better ones are old. I'm looking for the simplest off the shelf solution.

I suggest you to split your large file to multiple small files with method readlines(CHUNK) and then you can process it one by one.
I worked with large Json and at beginning, the process was 45sec by file and my program ran while 2 days but when I splintered it, the program finished only for 4h

Related

HDF5 with Python, Pandas: Data Corruption and Read Errors

So I'm trying to store Pandas DataFrames in HDF5 and getting strange errors, rather inconsistently. At least half the time, some part of the read-process-move-write cycle fails, often with no clearer explanation than "HDF5 Read Error". Even worse, sometimes the table ends up with nonsense/corrupted data that doesn't stop things until downstream -- either values that are off by orders of magnitude (and not even correlated with the correct ones) or dates that don't make sense (recent data mismarked as being dated in the 1750s...etc).
I thought I'd go through the current process and then the things that I suspect might be causing problems of that might help. Here's what it looks like:
Read some of the tables (call them "QUERY1" and "QUERY2") to see if they're up to date, and if they arent,
Take the table that had been in the HDF5 store as "QUERY1" and store it as QUERY1_YYYY_MM_DD" in the HDF5 store instead
Run the associated query on external database for that table. Each one is between 100 and 1500 columns of daily data back to 1980.
Store the result of query 1 as the new "QUERY1" in the HDF5 store
Compute several transformations of one or more of QUERY1, QUERY2,...QUERYn which will have hierarchical (Pandas MultiIndex) columns. Overwrite item "Derived_Frame1"...etc with its update/replacement in the HDF5 store
Multiple people with access to the relevant .h5 file on a Windows network drive run this routine -- potentially sometimes, but not usually, at the same time.
Some things I suspect could be part of the problem:
using default format (df.to_hdf(store, key)) instead of insisting on "Table" format with df.to_hdf(store, key, format='table')). I do this because default format is between 2 and 5x faster on both the read and the write according to %timeit
Using a network drive to allow several users to run this routine and access at least the derived frames. Not much I can do about this requirement, especially for read access to the derived dataframes at any time.
From the docs, it sounds like repeatedly deleting and re-writing an item in the HDF5 store can do weird things (at least gradually increasing the file size, not sure what else). Maybe I should be storing query archives in another file? Maybe I should be nuking and replacing the whole main file upon update?
Storing dataframes with MultiIndex columns in HDF5 in the first place -- this seems to be what gets me a "warning" under the default format, although it seems like the warning goes away if I use format='table'.
Edit: it is also possible/likely that different users running the routine above are using different versions of Pandas and different versions of PyTables.
Any ideas?

Loading Large Files as Dictionary

My first question on stackoverflow :)
I am trying to load a pertained vectors from Glove vectors and create a dictionary with words as keys and corresponding vectors as values. I did the usual naive method to this:
fp=open(wordEmbdfile)
self.wordVectors={}
# Create wordVector dictionary
for aline in fp:
w=aline.rstrip().split()
self.wordVectors[w[0]]=w[1:]
fp.close()
I see a huge memory pressure on Activity Monitor, and eventually after trying for an hour or two it crashes.
I am going to try splitting in multiple smaller files and create multiple dictionaries.
In the meantime I have following questions:
To read the word2vec file, is it better if I read the gzipped file using gzip.open or uncompress it and then read it with plain open.
The word vector file has text in first column and float in the rest, would it be more optimal to use genfromtext or loadtext from numpy?
I intend save this dictionary using chicle, I know loading it is going to be hard too. I read the suggestion to use shelve, how does it compare to cPickle in loading time and access time. May be its better to spend some more time loading with cPickle if improve future accesses (if cPickle does not crash, with 8G RAM), Does anyone have some suggestion on this?
Thanks!

Sorting .csv file by column title

Is there a way to sort a csv file by column header name (sort vertically) without loading the whole thing into memory? I tagged this as python because it is the language I am most familiar with, but any other way would be fine also. I am limited to doing this via commandline on a remote machine due to data protection rules.

Any on-disk sorting algorithm is going to require more disk operations than just reading and writing once, and that I/O is likely to be your bottleneck. And it's going to more complicated as well. So, unless you really can't fit the file into memory, it will be a lot faster to do so, and a whole lot simpler.
But if you have to do this…
The standard on-disk sorting algorithm is a merge sort, similar to the familiar in-memory merge sort. It works like this:
Split the file into chunks that are big enough to fit into memory. You can do this iteratively/lazily, and easily: just read, say, 100MB at a time. Just make sure to rfind the last newline and hold everything after it over for the next chunk.
For each chunk, sort it in memory, and write the result to a temporary file. You can use the csv module, and the sort function with key=itemgetter(colnum).
If you have, say, 10 or fewer chunks, just open all of the temporary files and merge them. Again, you can use the csv module, and min with the same key or heapq.merge with equivalent decorate-sort-undecorate.
If you have 10-100 chunks, merge groups of 10 into larger temp files, then merge the larger ones in exactly the same way. With 100-1000, or 1000-10000, etc., just keep doing the same thing recursively.
If you have a simple CSV file with no quoting/escaping, and you have either ASCII data, ASCII-superset data that you want to sort asciibetically, or ASCII-superset data that you want to sort according to LC_COLLATE, the POSIX sort command does exactly what you're looking for, in the same way you'd probably build it yourself. Something like this:
sort -t, -k ${colnum},${colnum} -i infile.csv -o outfile.csv
If your data don't meet those requirements, you might be able to do a "decorate-sort-undecorate" three-pass solution. But at that point, it might be easier to switch to Python. Trying to figure out how to sed an arbitrary Excel CSV into something sort can handle and that can be reversed sounds like you'd waste more time debugging edge cases than you would writing the Python.

Data structure options for efficiently storing sets of integer pairs on disk?

I have a bunch of code that deals with document clustering. One step involves calculating the similarity (for some unimportant definition of "similar") of every document to every other document in a given corpus, and storing the similarities for later use. The similarities are bucketed, and I don't care what the specific similarity is for purposes of my analysis, just what bucket it's in. For example, if documents 15378 and 3278 are 52% similar, the ordered pair (3278, 15378) gets stored in the [0.5,0.6) bucket. Documents sometimes get either added or removed from the corpus after initial analysis, so corresponding pairs get added to or removed from the buckets as needed.
I'm looking at strategies for storing these lists of ID pairs. We found a SQL database (where most of our other data for this project lives) to be too slow and too large disk-space-wise for our purposes, so at the moment we store each bucket as a compressed list of integers on disk (originally zlib-compressed, but now using lz4 instead for speed). Things I like about this:
Reading and writing are both quite fast
After-the-fact additions to the corpus are fairly straightforward to add (a bit less so for lz4 than for zlib because lz4 doesn't have a framing mechanism built in, but doable)
At both write and read time, data can be streamed so it doesn't need to be held in memory all at once, which would be prohibitive given the size of our corpora
Things that kind of suck:
Deletes are a huge pain, and basically involve streaming through all the buckets and writing out new ones that omit any pairs that contain the ID of a document that's been deleted
I suspect I could still do better both in terms of speed and compactness with a more special-purpose data structure and/or compression strategy
So: what kinds of data structures should I be looking at? I suspect that the right answer is some kind of exotic succinct data structure, but this isn't a space I know very well. Also, if it matters: all of the document IDs are unsigned 32-bit ints, and the current code that handles this data is written in C, as Python extensions, so that's probably the general technology family we'll stick with if possible.

How about using one hash table or B-tree per bucket?
On-disk hashtables are standard. Maybe the BerkeleyDB libraries (availabe in stock python) will work for you; but be advised that they since they come with transactions they can be slow, and may require some tuning. There are a number of choices: gdbm, tdb that you should all give a try. Just make sure you check out the API and initialize them with appropriate size. Some will not resize automatically, and if you feed them too much data their performance just drops a lot.
Anyway, you may want to use something even more low-level, without transactions, if you have a lot of changes.
A pair of ints is a long - and most databases should accept a long as a key; in fact many will accept arbitrary byte sequences as keys.

Why not just store a table containing stuff that was deleted since the last re-write?
This table could be the same structure as your main bucket, maybe with a Bloom filter for quick membership checks.
You can re-write the main bucket data without the deleted items either when you were going to re-write it anyway for some other modification, or when the ratio of deleted items:bucket size exceeds some threshold.
This scheme could work either by storing each deleted pair alongside each bucket, or by storing a single table for all deleted documents: I'm not sure which is a better fit for your requirements.
Keeping a single table, it's hard to know when you can remove an item unless you know how many buckets it affects, without just re-writing all buckets whenever the deletion table gets too large. This could work, but it's a bit stop-the-world.
You also have to do two checks for each pair you stream in (ie, for (3278, 15378), you'd check whether either 3278 or 15378 has been deleted, instead of just checking whether pair (3278, 15378) has been deleted.
Conversely, the per-bucket table of each deleted pair would take longer to build, but be slightly faster to check, and easier to collapse when re-writing the bucket.

You are trying to reinvent what already exists in new age NoSQL data stores.
There are 2 very good candidates for your requirements.
Redis.
MongoDb
Both support data structures like dictionaries,lists,queues. The operations like append, modify or delete are also available in both , and very fast.
The performance of both of them is driven by amount of data that can reside in the RAM.
Since most of your data is integer based, that should not be a problem.
My personal suggestion is to go with Redis, with a good persistence configuration (i.e. the data should periodically be saved from RAM to disk ).
Here is a brief of redis data structures :
http://redis.io/topics/data-types-intro
The redis database is a lightweight installation, and client is available in Python.

Python synchronised reading of sorted files

I have two groups of files that contain data in CSV format with a common key (Timestamp) - I need to walk through all the records chronologically.
Group A: 'Environmental Data'
Filenames are in format A_0001.csv, A_0002.csv, etc.
Pre-sorted ascending
Key is Timestamp, i.e.YYYY-MM-DD HH:MM:SS
Contains environmental data in CSV/column format
Very large, several GBs worth of data
Group B: 'Event Data'
Filenames are in format B_0001.csv, B_0002.csv
Pre-sorted ascending
Key is Timestamp, i.e.YYYY-MM-DD HH:MM:SS
Contains event based data in CSV/column format
Relatively small compared to Group A files, < 100 MB
What is best approach?
Pre-merge: Use one of the various recipes out there to merge the files into a single sorted output and then read it for processing
Real-time merge: Implement code to 'merge' the files in real-time
I will be running lots of iterations of the post-processing side of things. Any thoughts or suggestions? I am using Python.

im thinking importing it into a db (mysql, sqlite, etc) will give better performance than merging it in script. the db typically has optimized routines for loading csv and the join will be probably be as fast or much faster than merging 2 dicts (one being very large) in python.

"YYYY-MM-DD HH:MM:SS" can be sorted with a simple ascii compare.
How about reusing external merge logic? If the first field is the key then:
for entry in os.popen("sort -m -t, -k1,1 file1 file2"):
process(entry)

This is a similar to a relational join. Since your timestamps don't have to match, it's called a non-equijoin.
Sort-Merge is one of several popular algorithms. For non-equijoins, it works well. I think this would be what you're called "pre-merge". I don't know what you mean by "merge in real time", but I suspect it's still a simple sort-merge, which is a fine technique, heavily used by real databases.
Nested Loops can also work. In this case, you read the smaller table in the outer loop. In the inner loop you find all of the "matching" rows from the larger table. This is effectively a sort-merge, but with an assumption that there will be multiple rows from the big table that will match the small table.
This, BTW, will allow you to more properly assign meaning to the relationship between Event Data and Environmental Data. Rather than reading the result of a massive sort merge and trying to determine which kind of record you've got, the nested loops handle that well.
Also, you can do "lookups" into the smaller table while reading the larger table.
This is hard when you're doing non-equal comparisons because you don't have a proper key to do a simple retrieval from a simple dict. However, you can easily extend dict (override __contains__ and __getitem__) to do range comparisons on a key instead of simple equality tests.

I would suggest pre-merge.
Reading a file takes a lot of processor time. Reading two files, twice as much. Since your program will be dealing with a large input (lots of files, esp in Group A), I think it would be better to get it over with in one file read, and have all your relevant data in that one file. It would also reduce the number of variables and read statements you will need.
This will improve the runtime of your algorithm, and I think that's a good enough reason in this scenario to decide to use this approach
Hope this helps

You could read from the files in chunks of, say, 10000 records (or whatever number further profiling tells you to be optimal) and merge on the fly. Possibly using a custom class to encapsulate the IO; the actual records could then be accessed through the generator protocol (__iter__ + next).
This would be memory friendly, probably very good in terms of total time to complete the operation and would enable you to produce output incrementally.
A sketch:
class Foo(object):
def __init__(self, env_filenames=[], event_filenames=[]):
# open the files etc.
def next(self):
if self._cache = []:
# take care of reading more records
else:
# return the first record and pop it from the cache
# ... other stuff you need ...

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.