I have two groups of files that contain data in CSV format with a common key (Timestamp) - I need to walk through all the records chronologically.
Group A: 'Environmental Data'
Filenames are in format A_0001.csv, A_0002.csv, etc.
Pre-sorted ascending
Key is Timestamp, i.e. YYYY-MM-DD HH:MM:SS
Contains environmental data in CSV/column format
Very large, several GBs worth of data
Group B: 'Event Data'
Filenames are in format B_0001.csv, B_0002.csv
Pre-sorted ascending
Key is Timestamp, i.e. YYYY-MM-DD HH:MM:SS
Contains event based data in CSV/column format
Relatively small compared to Group A files, < 100 MB
What is the best approach?
Pre-merge: Use one of the various recipes out there to merge the files into a single sorted output and then read it for processing
Real-time merge: Implement code to 'merge' the files in real-time
I will be running lots of iterations of the post-processing side of things. Any thoughts or suggestions? I am using Python.
I'm thinking that importing it into a db (MySQL, SQLite, etc.) will give better performance than merging it in a script. The db typically has optimized routines for loading CSV, and the join will probably be as fast as or much faster than merging two dicts (one being very large) in Python.
"YYYY-MM-DD HH:MM:SS" can be sorted with a simple ASCII compare.
How about reusing external merge logic? If the first field is the key then:
import os

# file1 and file2 stand for the pre-sorted inputs; process() is whatever you do per record
for entry in os.popen("sort -m -t, -k1,1 file1 file2"):
    process(entry)
This is similar to a relational join. Since your timestamps don't have to match, it's called a non-equijoin.
Sort-Merge is one of several popular algorithms. For non-equijoins, it works well. I think this is what you're calling "pre-merge". I don't know what you mean by "merge in real time", but I suspect it's still a simple sort-merge, which is a fine technique, heavily used by real databases.
Nested Loops can also work. In this case, you read the smaller table in the outer loop. In the inner loop you find all of the "matching" rows from the larger table. This is effectively a sort-merge, but with an assumption that there will be multiple rows from the big table that will match the small table.
This, BTW, will allow you to more properly assign meaning to the relationship between Event Data and Environmental Data. Rather than reading the result of a massive sort merge and trying to determine which kind of record you've got, the nested loops handle that well.
Also, you can do "lookups" into the smaller table while reading the larger table.
This is hard when you're doing non-equal comparisons because you don't have a proper key to do a simple retrieval from a simple dict. However, you can easily extend dict (override __contains__ and __getitem__) to do range comparisons on a key instead of simple equality tests.
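For example, a minimal sketch of such a range-keyed dict (my own illustration, not from the question): keys are the event timestamps, and a lookup returns the entry for the latest key not after the probed timestamp. Re-sorting the keys on every probe is acceptable for a small Event table; cache the sorted list if it matters.
import bisect

class RangeDict(dict):
    def _floor_key(self, key):
        keys = sorted(self)                      # lexicographic == chronological here
        i = bisect.bisect_right(keys, key) - 1
        return keys[i] if i >= 0 else None
    def __contains__(self, key):
        return self._floor_key(key) is not None
    def __getitem__(self, key):
        k = self._floor_key(key)
        if k is None:
            raise KeyError(key)
        return dict.__getitem__(self, k)

events = RangeDict({"2011-01-01 00:00:00": "calibration",
                    "2011-01-02 12:00:00": "power cut"})
print(events["2011-01-01 08:30:00"])             # -> calibration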
I would suggest pre-merge.
Reading a file takes a lot of time; reading two files takes twice as much. Since your program will be dealing with a large input (lots of files, especially in Group A), I think it would be better to get it over with in one file read and have all your relevant data in that one file. It would also reduce the number of variables and read statements you will need.
This will improve the runtime of your algorithm, and I think that's a good enough reason in this scenario to decide on this approach.
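If you go that way, a possible pre-merge step (my own sketch, assuming the first CSV column is the timestamp and every file is already sorted) is to let heapq.merge stream the inputs so nothing has to fit in memory:
import csv
import glob
import heapq

# the handles stay open for the duration of the merge
readers = [csv.reader(open(fn, newline="")) for fn in
           sorted(glob.glob("A_*.csv")) + sorted(glob.glob("B_*.csv"))]

with open("merged.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for row in heapq.merge(*readers, key=lambda r: r[0]):
        writer.writerow(row)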
Hope this helps
You could read from the files in chunks of, say, 10000 records (or whatever number further profiling tells you to be optimal) and merge on the fly. Possibly using a custom class to encapsulate the IO; the actual records could then be accessed through the generator protocol (__iter__ + next).
This would be memory friendly, probably very good in terms of total time to complete the operation and would enable you to produce output incrementally.
A sketch:
class Foo(object):
    def __init__(self, env_filenames=(), event_filenames=()):
        # open the files etc.
        self._cache = []
    def __iter__(self):
        return self
    def __next__(self):              # spell it next() on Python 2
        if not self._cache:
            # take care of reading (and merging) more records into the cache
            ...
        # return the first record and pop it from the cache
        return self._cache.pop(0)
    # ... other stuff you need ...
Related
I basically have a large (multi-terabyte) dataset of text (it's in JSON but I could change it to dict or dataframe). It has multiple keys, such as "group" and "user".
Right now I'm filtering the data by reading through the entire text for these keys. It would be far more efficient to have a structure where I filter and read only the key.
Doing the above would be trivial if it fit in memory, and I could use standard dict/pandas methods and hash tables. But it doesn't fit in memory.
There must be an off-the-shelf system for this. Can anyone recommend one?
There are discussions about this, but some of the better ones are old. I'm looking for the simplest off-the-shelf solution.
I suggest you split your large file into multiple smaller files with the readlines(CHUNK) method and then process them one by one.
I worked with large JSON files; at first, processing took 45 seconds per file and my program ran for 2 days, but when I split the files up, it finished in only 4 hours.
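A rough sketch of that splitting step, assuming a line-oriented file (the chunk size, file names and .txt suffix are illustrative; readlines(hint) reads roughly that many bytes of complete lines):
CHUNK = 64 * 1024 * 1024          # about 64 MB of lines per piece

def split_file(path, prefix="part"):
    part = 0
    with open(path) as src:
        while True:
            lines = src.readlines(CHUNK)
            if not lines:
                break
            with open("%s_%04d.txt" % (prefix, part), "w") as dst:
                dst.writelines(lines)
            part += 1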
So I'm trying to store Pandas DataFrames in HDF5 and getting strange errors, rather inconsistently. At least half the time, some part of the read-process-move-write cycle fails, often with no clearer explanation than "HDF5 Read Error". Even worse, sometimes the table ends up with nonsense/corrupted data that doesn't stop things until downstream -- either values that are off by orders of magnitude (and not even correlated with the correct ones) or dates that don't make sense (recent data mismarked as being dated in the 1750s...etc).
I thought I'd go through the current process and then the things that I suspect might be causing problems, in case that helps. Here's what it looks like:
Read some of the tables (call them "QUERY1" and "QUERY2") to see if they're up to date, and if they aren't:
Take the table that had been in the HDF5 store as "QUERY1" and store it as "QUERY1_YYYY_MM_DD" in the HDF5 store instead
Run the associated query on the external database for that table. Each one is between 100 and 1500 columns of daily data back to 1980.
Store the result of query 1 as the new "QUERY1" in the HDF5 store
Compute several transformations of one or more of QUERY1, QUERY2,...QUERYn which will have hierarchical (Pandas MultiIndex) columns. Overwrite item "Derived_Frame1"...etc with its update/replacement in the HDF5 store
Multiple people with access to the relevant .h5 file on a Windows network drive run this routine -- potentially sometimes, but not usually, at the same time.
Some things I suspect could be part of the problem:
Using the default format (df.to_hdf(store, key)) instead of insisting on "table" format with df.to_hdf(store, key, format='table'). I do this because the default format is between 2 and 5x faster on both the read and the write according to %timeit (see the sketch after this list).
Using a network drive to allow several users to run this routine and access at least the derived frames. Not much I can do about this requirement, especially for read access to the derived dataframes at any time.
From the docs, it sounds like repeatedly deleting and re-writing an item in the HDF5 store can do weird things (at least gradually increasing the file size, not sure what else). Maybe I should be storing query archives in another file? Maybe I should be nuking and replacing the whole main file upon update?
Storing dataframes with MultiIndex columns in HDF5 in the first place -- this seems to be what gets me a "warning" under the default format, although it seems like the warning goes away if I use format='table'.
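For reference, a small, hypothetical illustration of the two write paths being compared (the file name and keys are made up, and df stands in for one of the query results); only the 'table' format supports on-disk selection and appends, and it is the one that avoids the warning mentioned above:
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0]},
                  index=pd.date_range("2015-01-01", periods=2))   # stand-in data

df.to_hdf("demo.h5", key="fixed_copy")                    # default 'fixed' format
df.to_hdf("demo.h5", key="table_copy", format="table")    # PyTables 'table' format

# only the table-format item can be queried on disk, e.g.:
print(pd.read_hdf("demo.h5", "table_copy", where="index >= '2015-01-02'"))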
Edit: it is also possible/likely that different users running the routine above are using different versions of Pandas and different versions of PyTables.
Any ideas?
I have a bunch of code that deals with document clustering. One step involves calculating the similarity (for some unimportant definition of "similar") of every document to every other document in a given corpus, and storing the similarities for later use. The similarities are bucketed, and I don't care what the specific similarity is for purposes of my analysis, just what bucket it's in. For example, if documents 15378 and 3278 are 52% similar, the ordered pair (3278, 15378) gets stored in the [0.5,0.6) bucket. Documents sometimes get added to or removed from the corpus after initial analysis, so corresponding pairs get added to or removed from the buckets as needed.
I'm looking at strategies for storing these lists of ID pairs. We found a SQL database (where most of our other data for this project lives) to be too slow and too large disk-space-wise for our purposes, so at the moment we store each bucket as a compressed list of integers on disk (originally zlib-compressed, but now using lz4 instead for speed). Things I like about this:
Reading and writing are both quite fast
After-the-fact additions to the corpus are fairly straightforward to add (a bit less so for lz4 than for zlib because lz4 doesn't have a framing mechanism built in, but doable)
At both write and read time, data can be streamed so it doesn't need to be held in memory all at once, which would be prohibitive given the size of our corpora
Things that kind of suck:
Deletes are a huge pain, and basically involve streaming through all the buckets and writing out new ones that omit any pairs that contain the ID of a document that's been deleted
I suspect I could still do better both in terms of speed and compactness with a more special-purpose data structure and/or compression strategy
So: what kinds of data structures should I be looking at? I suspect that the right answer is some kind of exotic succinct data structure, but this isn't a space I know very well. Also, if it matters: all of the document IDs are unsigned 32-bit ints, and the current code that handles this data is written in C, as Python extensions, so that's probably the general technology family we'll stick with if possible.
How about using one hash table or B-tree per bucket?
On-disk hash tables are standard. Maybe the BerkeleyDB libraries (available in stock Python) will work for you; but be advised that since they come with transactions, they can be slow and may require some tuning. There are a number of other choices (gdbm, tdb) that you should all give a try. Just make sure you check out the API and initialize them with an appropriate size. Some will not resize automatically, and if you feed them too much data their performance just drops a lot.
Anyway, you may want to use something even more low-level, without transactions, if you have a lot of changes.
A pair of 32-bit ints packs into a single 64-bit long, and most databases should accept a long as a key; in fact, many will accept arbitrary byte sequences as keys.
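A minimal sketch of packing an ordered pair of unsigned 32-bit document IDs into a single 8-byte key along those lines:
import struct

def pair_key(a, b):
    lo, hi = (a, b) if a <= b else (b, a)       # store the pair in canonical order
    return struct.pack(">II", lo, hi)           # 8 bytes, usable directly as a key

def unpack_key(key):
    return struct.unpack(">II", key)

key = pair_key(15378, 3278)                     # the example pair from the question
assert unpack_key(key) == (3278, 15378)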
Why not just store a table containing stuff that was deleted since the last re-write?
This table could be the same structure as your main bucket, maybe with a Bloom filter for quick membership checks.
You can re-write the main bucket data without the deleted items either when you were going to re-write it anyway for some other modification, or when the ratio of deleted items:bucket size exceeds some threshold.
This scheme could work either by storing each deleted pair alongside each bucket, or by storing a single table for all deleted documents: I'm not sure which is a better fit for your requirements.
With a single table, it's hard to know when you can remove an item from it (unless you know how many buckets it affects), short of re-writing all buckets whenever the deletion table gets too large. This could work, but it's a bit stop-the-world.
You also have to do two checks for each pair you stream in (i.e., for (3278, 15378), you'd check whether either 3278 or 15378 has been deleted, instead of just checking whether the pair (3278, 15378) has been deleted).
Conversely, the per-bucket table of each deleted pair would take longer to build, but be slightly faster to check, and easier to collapse when re-writing the bucket.
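A minimal sketch of the single-table variant, assuming each bucket can be streamed as (id, id) pairs; a plain set stands in for the Bloom filter here:
deleted_docs = set()                  # doc IDs removed since the last rewrite

def mark_deleted(doc_id):
    deleted_docs.add(doc_id)

def live_pairs(bucket_pairs):
    # stream a bucket, skipping pairs that touch any deleted document
    # (this is the "two checks per pair" cost mentioned above)
    for a, b in bucket_pairs:
        if a not in deleted_docs and b not in deleted_docs:
            yield a, b

# rewrite buckets through live_pairs() when the deleted:bucket-size ratio crosses
# your threshold; with one global table you can only clear deleted_docs once
# every bucket has been rewritten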
You are trying to reinvent what already exists in new age NoSQL data stores.
There are 2 very good candidates for your requirements.
Redis.
MongoDb
Both support data structures like dictionaries, lists, and queues. Operations like append, modify, and delete are also available in both, and are very fast.
The performance of both of them is driven by the amount of data that can reside in RAM.
Since most of your data is integer-based, that should not be a problem.
My personal suggestion is to go with Redis, with a good persistence configuration (i.e. the data should periodically be saved from RAM to disk).
Here is a brief overview of the Redis data structures:
http://redis.io/topics/data-types-intro
Redis is a lightweight installation, and a Python client is available.
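A hedged sketch of that idea using the redis-py client: one Redis set per bucket, whose members are packed 8-byte ID pairs (connection details and key names are assumptions on my part):
import struct
import redis

r = redis.Redis(host="localhost", port=6379)

def add_pair(bucket, a, b):
    lo, hi = sorted((a, b))
    r.sadd("bucket:%s" % bucket, struct.pack(">II", lo, hi))

def stream_bucket(bucket):
    for member in r.sscan_iter("bucket:%s" % bucket):
        yield struct.unpack(">II", member)

add_pair("0.5", 15378, 3278)
print(list(stream_bucket("0.5")))     # [(3278, 15378)]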
Sorry for the very general title but I'll try to be as specific as possible.
I am working on a text mining application. I have a large number of key-value pairs of the form ((word, corpus) -> occurrence_count) (everything is an integer) which I am storing in multiple Python dictionaries (tuple -> int). These values are spread across multiple files on disk (I pickled them). To make any sense of the data, I need to aggregate these dictionaries. Basically, I need to figure out a way to find all the occurrences of a particular key in all the dictionaries, and add them up to get a total count.
If I load more than one dictionary at a time, I run out of memory, which is the reason I had to split them in the first place. When I tried, I ran into performance issues. I am currently trying to store the values in a DB (MySQL), processing multiple dictionaries at a time, since MySQL provides row-level locking, which is both good (since it means I can parallelize this operation) and bad (since it slows down the insert queries).
What are my options here? Is it a good idea to write a partially disk based dictionary so I can process the dicts one at a time? With an LRU replacement strategy? Is there something that I am completely oblivious to?
Thanks!
A disk-based dictionary-like object exists: see the shelve module. Keys into a shelf must be strings, but you could simply use str on your tuples to obtain equivalent string keys; plus, I read your Q as meaning that you want only word as the key, so that's even easier (either str, or, for vocabularies < 4 GB, struct.pack, will be fine).
A good relational engine (especially PostgreSQL) would serve you well, but processing one dictionary at a time to aggregate each word's occurrences over all corpora into a shelf object should also be OK (not quite as fast, but simpler to code, since a shelf is so similar to a dict except for the type constraint on keys, and a caveat about mutable values, but as your values are ints that need not concern you).
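Roughly, the shelf-based aggregation could look like this (file and shelf names are illustrative, and I'm assuming the pickled dicts map (word, corpus) tuples to int counts as described):
import pickle
import shelve

def aggregate(filenames, shelf_path="totals.shelf"):
    with shelve.open(shelf_path) as totals:
        for fn in filenames:
            with open(fn, "rb") as f:
                data = pickle.load(f)            # {(word, corpus): count}
            for (word, corpus), count in data.items():
                key = str(word)                  # shelf keys must be strings
                totals[key] = totals.get(key, 0) + count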
Something like this, if I understand your question correctly:
from collections import defaultdict
import pickle

result = defaultdict(int)
for fn in filenames:
    with open(fn, "rb") as f:        # pickles need to be opened in binary mode
        data_dict = pickle.load(f)
    for (word, corpus), count in data_dict.items():
        result[(word, corpus)] += count
If I understood your question correctly and you have integer ids for the words and corpora, then you can gain some performance by switching from a dict to a list, or even better, a numpy array. This may be annoying!
Basically, you need to replace the tuple with a single integer, which we can call the newid. You want every newid to correspond to a (word, corpus) pair, so I would count the words in each corpus and then have, for each corpus, a starting newid. The newid of (word, corpus) will then be word + start_newid[corpus].
If I misunderstood you and you don't have such ids, then I think this advice might still be useful, but you will have to manipulate your data to get it into the tuple of ints format.
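A small sketch of that renumbering, assuming words within each corpus are already numbered from 0 (the per-corpus word counts are made up):
import numpy as np

n_words = {0: 50000, 1: 42000}                  # illustrative words-per-corpus

start_newid, offset = {}, 0
for corpus in sorted(n_words):
    start_newid[corpus] = offset
    offset += n_words[corpus]

counts = np.zeros(offset, dtype=np.int64)       # one flat array instead of a dict

def newid(word, corpus):
    return word + start_newid[corpus]

counts[newid(123, 1)] += 7                      # was: result[(123, 1)] += 7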
Another thing you could try is rechunking the data.
Let's say that you can only hold 1.1 of these monsters in memory. Then, you can load one, and create a smaller dict or array that only corresponds to the first 10% of (word,corpus) pairs. You can scan through the loaded dict, and deal with any of the ones that are in the first 10%. When you are done, you can write the result back to disk, and do another pass for the second 10%. This will require 10 passes, but that might be OK for you.
If you chose your previous chunking based on what would fit in memory, then you will have to arbitrarily break your old dicts in half so that you can hold one in memory while also holding the result dict/array.
I'm developing an app that handles sets of financial series data (input as CSV or open document); one set could be, say, 10s x 1000s of double-precision numbers (simplifying, but that's what matters).
I plan to do operations on that data (e.g. sum, difference, averages etc.) as well as generating, say, another column based on computations on the input. This will be between columns (row-level operations) on one set, and also between columns on many (potentially all) sets at the row level. I plan to write it in Python and it will eventually need an intranet-facing interface to display the results/graphs etc.; for now, CSV output based on some input parameters will suffice.
What is the best way to store and manipulate the data? So far I see my choices as either (1) writing CSV files to disk and trawling through them to do the math, or (2) putting them into a database and relying on the database to handle the math. My main concern is speed/performance as the number of datasets grows, since there will be inter-dataset row-level math that needs to be done.
-Has anyone had experience going down either path and what are the pitfalls/gotchas that I should be aware of?
-What are the reasons why one should be chosen over another?
-Are there any potential speed/performance pitfalls/boosts that I need to be aware of before I start that could influence the design?
-Is there any project or framework out there to help with this type of task?
-Edit-
More info:
The rows will all be read in order, BUT I may need to do some resampling/interpolation to match the differing input lengths as well as differing timestamps for each row. Since each dataset will always have a differing length that is not fixed, I'll have some scratch table/memory somewhere to hold the interpolated/resampled versions. I'm not sure if it makes more sense to try to store this (and try to upsample/interpolate to a common higher length) or just regenerate it each time it's needed.
"I plan to do operations on that data (eg. sum, difference, averages etc.) as well including generation of say another column based on computations on the input."
This is the standard use case for a data warehouse star-schema design. Buy Kimball's The Data Warehouse Toolkit. Read (and understand) the star schema before doing anything else.
"What is the best way to store the data and manipulate?"
A Star Schema.
You can implement this as flat files (CSV is fine) or RDBMS. If you use flat files, you write simple loops to do the math. If you use an RDBMS you write simple SQL and simple loops.
"My main concern is speed/performance as the number of datasets grows"
Nothing is as fast as a flat file. Period. RDBMS is slower.
The RDBMS value proposition stems from SQL being a relatively simple way to specify SELECT SUM(), COUNT() FROM fact JOIN dimension WHERE filter GROUP BY dimension attribute. Python isn't as terse as SQL, but it's just as fast and just as flexible. Python competes against SQL.
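To make that concrete, here is a hedged illustration of the "simple loops" side (file and column names are invented): the SQL aggregate above, expressed as one Python pass over a fact CSV with the small dimension table held in a dict:
import csv
from collections import defaultdict

dimension = {row["dim_id"]: row                            # small dimension table in memory
             for row in csv.DictReader(open("dimension.csv"))}
totals = defaultdict(float)

for fact in csv.DictReader(open("fact.csv")):              # one pass over the big fact file
    dim = dimension[fact["dim_id"]]
    if dim["region"] == "EU":                              # the WHERE filter
        totals[dim["attribute"]] += float(fact["amount"])  # SUM ... GROUP BY attribute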
"pitfalls/gotchas that I should be aware of?"
DB design. If you don't get the star schema and how to separate facts from dimensions, all approaches are doomed. Once you separate facts from dimensions, all approaches are approximately equal.
"What are the reasons why one should be chosen over another?"
RDBMS slow and flexible. Flat files fast and (sometimes) less flexible. Python levels the playing field.
"Are there any potential speed/performance pitfalls/boosts that I need to be aware of before I start that could influence the design?"
Star Schema: central fact table surrounded by dimension tables. Nothing beats it.
"Is there any project or framework out there to help with this type of task?"
Not really.
For speed optimization, I would suggest two other avenues of investigation beyond changing your underlying storage mechanism:
1) Use an intermediate data structure.
If maximizing speed is more important than minimizing memory usage, you may get good results out of using a different data structure as the basis of your calculations, rather than focusing on the underlying storage mechanism. This is a strategy that, in practice, has reduced runtime in projects I've worked on dramatically, regardless of whether the data was stored in a database or text (in my case, XML).
While sums and averages will require runtime in only O(n), more complex calculations could easily push that into O(n^2) without applying this strategy. O(n^2) would be a performance hit that would likely have far more of a perceived speed impact than whether you're reading from CSV or a database. An example case would be if your data rows reference other data rows, and there's a need to aggregate data based on those references.
So if you find yourself doing calculations more complex than a sum or an average, you might explore data structures that can be created in O(n) and would keep your calculation operations in O(n) or better. As Martin pointed out, it sounds like your whole data sets can be held in memory comfortably, so this may yield some big wins. What kind of data structure you'd create would be dependent on the nature of the calculation you're doing.
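As a tiny illustration of that point (rows and field names invented): indexing the rows by key once, in O(n), keeps later reference lookups O(1) instead of rescanning the whole set for each reference:
rows = [
    {"id": 1, "parent": None, "value": 10.0},
    {"id": 2, "parent": 1, "value": 4.5},
]
by_id = {row["id"]: row for row in rows}        # built in one O(n) pass

for row in rows:                                # aggregate child values onto parents
    parent = by_id.get(row["parent"])
    if parent is not None:
        parent["child_total"] = parent.get("child_total", 0.0) + row["value"]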
2) Pre-cache.
Depending on how the data is to be used, you could store the calculated values ahead of time. As soon as the data is produced/loaded, perform your sums, averages, etc., and store those aggregations alongside your original data, or hold them in memory as long as your program runs. If this strategy is applicable to your project (i.e. if the users aren't coming up with unforeseen calculation requests on the fly), reading the data shouldn't be prohibitively long-running, whether the data comes from text or a database.
What matters most is whether all the data will fit simultaneously into memory. From the size that you give, it seems that this is easily the case (a few megabytes at worst).
If so, I would discourage using a relational database and do all operations directly in Python. Depending on what other processing you need, I would probably use binary pickles rather than CSV.
Are you likely to need all rows in order or will you want only specific known rows?
If you need to read all the data there isn't much advantage to having it in a database.
Edit: If the data fits in memory then a simple CSV is fine. Plain text data formats are always easier to deal with than opaque ones if you can use them.