Save periodically gathered data with Python

I receive data periodically (every 15 minutes) and hold it in a NumPy array in Python; it has roughly 50 columns, and the number of rows varies, usually somewhere around 100-200.
Before, I only analyzed this data and tossed it, but now I'd like to start saving it so that I can create statistics later.
I have considered saving it in a CSV file, but it did not seem right to keep dumping so many large 2D arrays into a CSV file.
I've looked at serialization options, particularly pickle and NumPy's .tobytes(), but in both cases I run into an issue: I have to track how many arrays are stored. I've seen people write that number at the start of the file, but I don't know how I would keep incrementing it while the file is still open (the program that gathers the data runs practically non-stop). Constantly opening the file, reading the number, rewriting it, seeking to the end to write the new data, and closing the file again doesn't seem very efficient.
I feel like I'm missing some vital piece of information and have not been able to find it. I'd love it if someone could point out what I'm not seeing and help me solve the problem.
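For what it's worth, the pickle route does not actually need a stored counter. Here is a minimal sketch assuming a single append-only file (the file name is made up): each array is appended with pickle.dump, and reading back simply calls pickle.load until EOFError.

import pickle
import numpy as np

DATA_FILE = "arrays.pkl"  # hypothetical file name

def append_array(arr):
    # Append-binary mode: no counter, no seeking, existing data untouched.
    with open(DATA_FILE, "ab") as f:
        pickle.dump(arr, f)

def load_all_arrays():
    arrays = []
    with open(DATA_FILE, "rb") as f:
        while True:
            try:
                arrays.append(pickle.load(f))  # one pickled array per call
            except EOFError:
                break  # end of file reached; no stored count needed
    return arrays

append_array(np.random.rand(150, 50))  # e.g. one 15-minute batch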

Saving to a CSV file might not be a good idea in this case; think about the accessibility and availability of your data. Using a database would be better: you can easily update your data and control how much of it you store.
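Along the lines of the database suggestion, here is a minimal sketch using the standard-library sqlite3 module; the file, table, and column names are made up, and storing each batch as a float64 blob is just one possible convention.

import sqlite3
import numpy as np

conn = sqlite3.connect("measurements.db")  # hypothetical database file
conn.execute(
    "CREATE TABLE IF NOT EXISTS batches ("
    "batch_id INTEGER PRIMARY KEY, received_at TEXT, n_rows INTEGER, data BLOB)"
)

def save_batch(batch_id, received_at, arr):
    # Store the whole 2D array as one float64 blob plus its row count.
    conn.execute(
        "INSERT INTO batches VALUES (?, ?, ?, ?)",
        (batch_id, received_at, arr.shape[0], arr.astype(np.float64).tobytes()),
    )
    conn.commit()

def load_batch(batch_id):
    n_rows, blob = conn.execute(
        "SELECT n_rows, data FROM batches WHERE batch_id = ?", (batch_id,)
    ).fetchone()
    return np.frombuffer(blob, dtype=np.float64).reshape(n_rows, -1)

save_batch(1, "2024-01-01T00:15", np.random.rand(150, 50))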

Related

Most efficient way to store pandas Dataframe on disk for frequent access?

I am working on an application which generates a couple of hundred datasets every ten minutes. These datasets consist of a timestamp, and some corresponding values from an ongoing measurement.
(Almost) Naturally, I use pandas dataframes to manage the data in memory.
Now I need to do some work with history data (e.g. averaging or summation over days/weeks/months, but not limited to that), and I need to update those accumulated values rather frequently (ideally also every ten minutes), so I am wondering which would be the most access-efficient way to store the data on disk.
So far I have been storing the data for every ten-minute interval in a separate csv-file and then reading the relevant files into a new dataframe as needed. But I feel that there must be a more efficient way, especially when it comes to working with a larger number of datasets. Computation cost and memory are not the central issue, as I am running the code on a comparatively powerful machine, but I still don't want to (and most likely can't afford to) read all the data into memory every time.
It seems to me that the answer should lie within the built-in serialization functions of pandas, but from the docs and my google findings I honestly can't really tell which would fit my needs best.
Any ideas how I could manage my data better?
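One possibility along those lines is a single HDF5 store that is appended to every ten minutes via pandas' HDFStore; a minimal sketch, where the file name and key are placeholders and a datetime index is assumed so that subsets can be selected without loading the whole history:

import pandas as pd

def append_interval(df, path="measurements.h5"):
    # format='table' makes the node appendable and queryable on disk.
    with pd.HDFStore(path) as store:
        store.append("readings", df, format="table")

def load_window(start, end, path="measurements.h5"):
    # Read only the rows in [start, end] instead of the whole history.
    with pd.HDFStore(path) as store:
        return store.select(
            "readings", where="index >= '{}' & index <= '{}'".format(start, end)
        )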

Out-of-core custom binary file processing

The problem at hand is a batch of large (500GB-1TB) binary files representing continuous time-series data, each of which has a header and a footer. I've seen plenty of solutions, like Dask, or Python core mmap, or NumPy memmap, or the like, but none of them seem to deal with the footer-inclusion problem. At least not that I have noticed; maybe there is some trivial thing like a "file-size mask" that allows me to "ghost" the last N bytes of each file.
My restrictions in file interaction are limited to read-only access, and I have them all available locally.
What I currently have written uses io.open(filename, 'rb', buffering=0), iterates over a list of files, manually moves the pointer after each read, and keeps track of the pointer's location relative to the file. When it reaches the point where the next read would run into the footer, i.e. into the next file, there is a really ugly bit that splits the chunk into two smaller, often asymmetric, chunks: the first is read, then the file switch occurs, and then the second one is read.
I feel like I am reinventing the wheel here, and I would greatly appreciate any brainstorm-level suggestions. I'll gladly work out the technical details on my own once I know the direction I should head.
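For what it's worth, numpy.memmap can already "ghost" both the header and the footer if you compute the payload size yourself; a minimal sketch, where the header/footer sizes and dtype are assumptions about the file format:

import os
import numpy as np

HEADER_BYTES = 1024          # assumed header size for this file format
FOOTER_BYTES = 512           # assumed footer size
DTYPE = np.dtype("float32")  # assumed sample dtype

def map_payload(path):
    # Skip the header via `offset` and cut off the footer by computing how
    # many complete items fit between header and footer.
    payload = os.path.getsize(path) - HEADER_BYTES - FOOTER_BYTES
    n_items = payload // DTYPE.itemsize
    return np.memmap(path, dtype=DTYPE, mode="r",
                     offset=HEADER_BYTES, shape=(n_items,))

# e.g. iterate lazily over the batch: (map_payload(p) for p in file_list)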

modify and write large file in python

Say I have a data file of size 5GB on disk, and I want to append another set of data of size 100MB at the end of the file -- just simply append, I don't want to modify or move the original data in the file. I know I can read the whole file into memory as a long list and append my small new data to it, but that's too slow. How can I do this more efficiently?
I mean, without reading the whole file into memory?
I have a script that generates a large stream of data, say 5GB, as a long list, and I need to save this data to a file. I tried to generate the whole list first and then output it all at once, but as the list grew, the computer slowed down very severely. So I decided to output it in several passes: each time I accumulate a list of about 100MB, I write it out and clear the list. (This is why I have the first problem.)
I have no idea how to do this. Is there any lib or function that can do this?
Let's start from the second point: if the list you store in memory is larger than the available RAM, the computer starts using the disk as RAM and this severely slows down everything. The optimal way of outputting in your situation is to fill the RAM as much as possible (always keeping enough space for the rest of the software running on your PC) and then write it to a file all at once.
The fastest way to store a list in a file would be using pickle, so that you store binary data that takes much less space than formatted data (so even the read/write process is much faster).
When you write to a file, you should keep the file open the whole time, using something like with open('namefile', 'ab') as f (append-binary mode, since pickle writes bytes and you don't want to truncate the existing data). In this way, you save the time to open/close the file and the cursor will always be at the end. If you decide to do that, use f.flush() after each write to avoid losing data if something bad happens. Reopening the file in append mode for each chunk is a reasonable alternative anyway.
If you provide some code it would be easier to help you...
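To illustrate the suggestion above, here is a minimal sketch that keeps a file open in append-binary mode, pickles each chunk as it is produced, and flushes after every write; the names are illustrative only.

import pickle

def write_chunks(chunk_iterator, path="output.dat"):
    # Append-binary: the existing 5GB of data is never read or moved.
    with open(path, "ab") as f:
        for chunk in chunk_iterator:   # e.g. lists of roughly 100MB each
            pickle.dump(chunk, f)
            f.flush()                  # limit what a crash could lose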

HDF5 with Python, Pandas: Data Corruption and Read Errors

So I'm trying to store Pandas DataFrames in HDF5 and getting strange errors, rather inconsistently. At least half the time, some part of the read-process-move-write cycle fails, often with no clearer explanation than "HDF5 Read Error". Even worse, sometimes the table ends up with nonsense/corrupted data that doesn't cause a failure until somewhere downstream -- either values that are off by orders of magnitude (and not even correlated with the correct ones) or dates that don't make sense (recent data mislabeled as being dated in the 1750s, etc.).
I thought I'd go through the current process and then the things that I suspect might be causing problems, in case that helps. Here's what it looks like:
Read some of the tables (call them "QUERY1" and "QUERY2") to see if they're up to date, and if they aren't:
Take the table that had been in the HDF5 store as "QUERY1" and store it as "QUERY1_YYYY_MM_DD" in the HDF5 store instead.
Run the associated query on the external database for that table. Each one is between 100 and 1500 columns of daily data going back to 1980.
Store the result of query 1 as the new "QUERY1" in the HDF5 store
Compute several transformations of one or more of QUERY1, QUERY2,...QUERYn which will have hierarchical (Pandas MultiIndex) columns. Overwrite item "Derived_Frame1"...etc with its update/replacement in the HDF5 store
Multiple people with access to the relevant .h5 file on a Windows network drive run this routine -- potentially sometimes, but not usually, at the same time.
Some things I suspect could be part of the problem:
using the default format (df.to_hdf(store, key)) instead of insisting on "table" format with df.to_hdf(store, key, format='table'). I do this because the default format is between 2 and 5x faster on both the read and the write according to %timeit (see the sketch after this list)
Using a network drive to allow several users to run this routine and access at least the derived frames. Not much I can do about this requirement, especially for read access to the derived dataframes at any time.
From the docs, it sounds like repeatedly deleting and re-writing an item in the HDF5 store can do weird things (at least gradually increasing the file size, not sure what else). Maybe I should be storing query archives in another file? Maybe I should be nuking and replacing the whole main file upon update?
Storing dataframes with MultiIndex columns in HDF5 in the first place -- this seems to be what gets me a "warning" under the default format, although it seems like the warning goes away if I use format='table'.
Edit: it is also possible/likely that different users running the routine above are using different versions of Pandas and different versions of PyTables.
Any ideas?
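For reference, a minimal sketch of the two write paths mentioned in the first suspect above; the path, keys, and frame are placeholders, with format='table' being the queryable PyTables format discussed in the question.

import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})  # placeholder frame

df.to_hdf("store.h5", key="QUERY1")                       # default (fixed) format
df.to_hdf("store.h5", key="QUERY1_tbl", format="table")   # PyTables 'table' format

restored = pd.read_hdf("store.h5", key="QUERY1_tbl")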

Fast saving and retrieving of python data structures for an autocorrect program?

So, I have written an autocomplete and autocorrect program in Python 2. I have written the autocorrect program using the approach mentioned in Peter Norvig's blog on how to write a spell checker, link.
Now, I am using a trie data structure implemented using nested lists. I am using a trie as it can give me all words starting with a particular prefix. At the leaf there is a tuple with the word and a value denoting the frequency of the word. For example, the words bad, bat, cat would be saved as:
['b', ['a', ['d', ('bad', 4), 't', ('bat', 3)]], 'c', ['a', ['t', ('cat', 4)]]]
where 4, 3, 4 are the number of times the words have been used, i.e. their frequency values. Similarly, I have made a trie of about 130,000 words of the English dictionary and stored it using cPickle.
Now, it takes about 3-4 seconds for the entire trie to be read each time. The problem is that each time a word is encountered, its frequency value has to be incremented, and then the updated trie needs to be saved again. As you can imagine, it would be a big problem to wait 3-4 seconds to read, and then again that much time to save the updated trie, every single time. I will need to perform a lot of update operations each time the program is run and save them.
Is there a faster or efficient way to store a large data structure which repeatedly will be updated? How are the data structures of the autocorrect programs in IDEs and mobile devices saved & retrieved so fast? I am open to different approaches as well.
A few things come to mind.
1) Split the data. Say, use 26 files, each storing the tries for words starting with a certain character. You can improve on this by using longer prefixes. This way the amount of data you need to write is smaller (see the sketch after this list).
2) Don't reflect everything to disk. If you need to perform a lot of operations, do them in RAM (memory) and write them out at the end. If you're afraid of data loss, you can checkpoint your computation after some time X or after a number of operations.
3) Multi-threading. Unless your program only does spellchecking, it's likely there are other things it needs to do. Have a separate thread that does the loading and writing so that it doesn't block everything while it does disk I/O. Multi-threading in Python is a bit tricky, but it can be done.
4) Custom structure. Part of the time spent in serialization is invoking serialization functions. Since you have a nested structure for everything, that's a lot of function calls. In the ideal case you should have a memory representation that matches the disk representation exactly. You would then simply read a large string and put it into your custom class (and write that string to disk when you need to). This is a bit more advanced and the benefits will likely not be that huge, especially since Python is not so efficient at playing with bits, but if you need to squeeze the last bit of speed out of it, this is the way to go.
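As a rough illustration of point 1, the trie could be kept as one pickle per first letter, so updating words that start with 'b' only rewrites the (much smaller) 'b' file; the directory layout and naming below are made up.

import os
import pickle

TRIE_DIR = "tries"  # illustrative directory for the per-letter pickles

def save_subtrie(letter, subtrie):
    if not os.path.isdir(TRIE_DIR):
        os.makedirs(TRIE_DIR)
    with open(os.path.join(TRIE_DIR, "trie_{}.pkl".format(letter)), "wb") as f:
        pickle.dump(subtrie, f)

def load_subtrie(letter):
    with open(os.path.join(TRIE_DIR, "trie_{}.pkl".format(letter)), "rb") as f:
        return pickle.load(f)

# After incrementing the count for "bat", only save_subtrie('b', ...) is needed.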
I would suggest you move serialization to a separate thread and run it periodically. You don't need to re-read your data each time because you already have the latest version in memory. This way your program stays responsive to the user while the data is being saved to disk. The saved version on disk may lag behind and the latest updates may get lost in case of a program crash, but this shouldn't be a big issue for your use case, I think.
It depends on the particular use case and environment but, I think, most programs with local data sets sync them using multi-threading.
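A minimal sketch of that idea: the trie stays in memory and a daemon thread snapshots it to disk every few seconds; the interval and file name are arbitrary, and some locking or copying may be needed if updates happen concurrently.

import pickle
import threading
import time

def start_autosave(trie, path="trie.pkl", interval=10.0):
    # Snapshot the in-memory trie to disk every `interval` seconds.
    def worker():
        while True:
            time.sleep(interval)
            with open(path, "wb") as f:
                pickle.dump(trie, f)
    t = threading.Thread(target=worker)
    t.daemon = True  # don't keep the process alive just to save
    t.start()
    return t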
