Cassandra as a Time Series store - python

I am considering the use of Cassandra as a time-series store. I have millions of series, and each series has around 10K sequential points at uniform intervals. Some series, though, have a few thousand points or fewer. They may start and end at different points, but all share the same timestamps. I access the data series:
Vertically: predefined partitions (e.g. all days in a year), where I need all the rows.
Horizontally: all values of a specific series (random access).
I am considering two options. First, I could have one column per timestamp, as is recommended for monitoring systems, for example (though I have a different access pattern). Second, I could use list columns, one per partition.
I am worried about read performance (the second use case is more critical) and storage overhead. I did find the following formula:
total_column_size = column_name_size + column_value_size + 15
I think that would make the first option quite expensive in terms of storage. I could not find any documentation on the storage layout of lists. Do you know of any? Do you have other recommendations?
BTW, I am using Python as a client for Cassandra, if that makes any difference.

"Storage is cheap" is generally the philosophy here. If you have 2 query patterns, which you seem to, then store everything twice: once partitioned by your desired verticals (days by the looks), and once again by your chosen series. If you don't know how to partition your series in advance (it wasn't clear from the question) then it becomes more complicated. Cassandra reads are sequential when reading in order - and this is the only way you should be using it anyway.
You have in the region of X0bn points, which is larger than your average DB but not bordering on ridiculous, particularly when distributed over a cluster. It's hard to put an exact figure on it given that I don't know the width of your data points, but if these are just scalar values then this is only going to be 2TB or so of data.
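As a quick back-of-envelope in Python, using the overhead formula from your question and hypothetically assuming 5 million series of 10K scalar points each:
# total_column_size = column_name_size + column_value_size + 15 (from the question)
n_series = 5_000_000    # assumption standing in for "millions of series"
n_points = 10_000       # ~10K points per series
per_point = 8 + 8 + 15  # 8-byte timestamp name + 8-byte double value + overhead
total_tb = n_series * n_points * per_point / 1e12
print(f"{total_tb:.2f} TB")  # 1.55 TB raw, before replication and compaction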

Related

Most efficient way to store pandas DataFrame on disk for frequent access?

I am working on an application which generates a couple of hundred datasets every ten minutes. These datasets consist of a timestamp, and some corresponding values from an ongoing measurement.
(Almost) Naturally, I use pandas dataframes to manage the data in memory.
Now I need to do some work with historical data (e.g. averaging or summation over days/weeks/months etc., but not limited to that), and I need to update those accumulated values rather frequently (ideally also every ten minutes), so I am wondering which would be the most access-efficient way to store the data on disk?
So far I have been storing the data for every ten-minute interval in a separate CSV file and then reading the relevant files into a new dataframe as needed. But I feel that there must be a more efficient way, especially when it comes to working with a larger number of datasets. Computation cost and memory are not the central issue, as I am running the code on a comparatively powerful machine, but I still don't want to (and most likely can't afford to) read all the data into memory every time.
It seems to me that the answer should lie within the built-in serialization functions of pandas, but from the docs and my Google findings I honestly can't tell which would fit my needs best.
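To make this concrete, here is the kind of append-based workflow I imagine; an untested sketch, where the file and key names are placeholders (and which, I believe, needs PyTables installed):
import pandas as pd
# Append each ten-minute batch to one on-disk table instead of a new CSV.
def append_batch(df):
    with pd.HDFStore("store.h5") as store:
        store.append("measurements", df, format="table",
                     data_columns=["timestamp"])
# Read back only the slice needed, without loading everything into memory.
def read_window(start, stop):
    with pd.HDFStore("store.h5") as store:
        return store.select(
            "measurements",
            where=f"timestamp >= '{start}' & timestamp <= '{stop}'")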
Any ideas how I could manage my data better?

How to optimally access the latest datapoint in MongoDB?

I am storing multiple time-series in a MongoDB with sub-second granularity. The DB is updated by a bunch of Python scripts, and the data stored serve two main purposes:
(1) It's a central information source for the latest data from all series. Multiple scripts access it every second or so to read the latest datapoint in each collection.
(2) It's a long-term data store. I often load the whole DB into Python to analyse trends in the data.
To keep the DB as efficient as possible, I want to bucket my data (ideally holding one document per day in each collection). Because of (1), however, the bigger the buckets, the more expensive the sorting required to access the last datapoint.
I can think of two solutions here, but I'm not sure what alternatives there are, or which is the best way:
a) Store the latest timestamp in a one-line document in a separate db/collection. No sorting required on read, but an additional write is required every time any series gets a new datapoint.
b) Keep the buckets smaller (say 1-hour each) and sort.
With a) you write smallish documents to a separate collection, which is performance-wise preferable to updating large documents. You could write all new datapoints to this collection and aggregate them for the hour or day, depending on your preference. But as you said, this requires an additional write operation.
With b) you need to keep the index size for the sort field in mind. Does the index fit in memory? That's crucial for the performance of the sort, as you do not want to do any in-memory sorting of a large collection.
I recommend exploring the hybrid approach of storing individual datapoints for a limited time in an 'incoming' collection. Once your bucketing interval of an hour or a day approaches, you can aggregate the datapoints into buckets and store them in a different collection. Of course, there is now some additional complexity in the application, which needs to be able to read both the bucketed and the datapoint collections and merge them.
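A rough pymongo sketch of that hybrid approach; the database and collection names are placeholders:
from datetime import datetime, timezone
from pymongo import MongoClient, DESCENDING
db = MongoClient()["tsdb"]
def write_point(series, value):
    # Small insert into 'incoming': cheap compared to growing a big document.
    db.incoming.insert_one({"series": series,
                            "ts": datetime.now(timezone.utc),
                            "value": value})
def latest_point(series):
    # With an index on (series, ts) this touches one document; no big sort.
    return db.incoming.find_one({"series": series}, sort=[("ts", DESCENDING)])
def roll_up(series, start, end):
    # Periodically fold the accumulated points into one bucket document.
    points = list(db.incoming.find(
        {"series": series, "ts": {"$gte": start, "$lt": end}}))
    if points:
        db.buckets.insert_one({
            "series": series, "start": start,
            "points": [[p["ts"], p["value"]] for p in points]})
        db.incoming.delete_many({"_id": {"$in": [p["_id"] for p in points]}})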

Python+Postgresql: Ideal way to call data for computation (rolling / expanding window) + multithreading?

I have a huge database (~100 variables with a few million rows) consisting of stock data. I managed to connect Python to the database via SQLAlchemy (postgresql+psycopg2). I am running it all in the cloud.
In principle I want to do a few things:
1) Regression of all possible combinations: I am running a simple regression for each pair of stocks, i.e. ABC on XYZ and also XYZ on ABC, across the n=100 stocks, resulting in n(n-1)/2 pairs (with two regressions each).
-> I am thinking of a function that pulls in a pair of stocks, runs the two regressions, compares the results, and picks one based on some criteria.
My question: Is there an efficient way to generate and iterate over all these combinations (the "factorial")?
2) Rolling windows: To avoid an overload of data, I thought to only load the window under investigation, i.e. 30 days, and then roll it forward one day at a time, meaning my periods are:
1: 1D-30D
2: 2D-31D and so on
Meaning I always drop the first day and append the next day at the end of my dataframe. So I have two steps: drop the first day and read in the next day's rows from my database.
My question: Is this a meaningful way, or does Python have something better up its sleeve? How would you do it?
3) Expanding windows: Instead of dropping the first day and adding another one, I keep the 30 days, add another 30 days, and then run my regression. The problem here: at some point the window would encompass all the data, which will probably be too big for memory.
My question: What would be a workaround here?
4) As I am running my analysis in the cloud (with a few more cores than my own PC), I could in fact use multithreading, sending "batch" jobs and letting Python do things in parallel. I thought of splitting my dataset into 4×25 stocks and letting them run in parallel (a vertical split), or would a horizontal split be better?
Additionally, I am using Jupyter; I am wondering how best to approach this. Usually I have a shell script calling my_program.py. Is it the same here?
Let me try to give answers categorically and also note my observations.
From your description, I suppose you have taken each stock scrip as one variable and you are trying to perform pairwise linear regression amongst them. The good news about this: it's highly parallelizable. All you need to do is generate the unique combinations of all possible pairings, perform your regressions, and then keep only those models which fit your criteria.
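As a sketch of that pair generation, where linregress and the r-squared cutoff are only illustrative stand-ins for your regressions and criteria:
from itertools import combinations
from scipy.stats import linregress
def best_fits(prices, min_r2=0.8):
    # 'prices' is assumed to be a mapping of stock name -> price series.
    kept = {}
    for a, b in combinations(prices, 2):        # n(n-1)/2 unique pairs
        fwd = linregress(prices[a], prices[b])  # regress b on a
        rev = linregress(prices[b], prices[a])  # regress a on b
        best = max(fwd, rev, key=lambda r: r.rvalue ** 2)
        if best.rvalue ** 2 >= min_r2:
            kept[(a, b)] = best
    return kept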
Now, as stocks are your variables, I am assuming the rows are their prices or similar values, but definitely some time series data. If my assumption is correct, then there is a problem with the rolling window approach. In creating these rolling windows, what you are implicitly doing is using a data sampling method called 'bootstrapping', which uses random sampling with replacement. But because you are just rolling your data, you are not using random sampling, which might create problems for your regression results. At best the model may simply be overtrained; at worst, I cannot imagine. Hence, drop this approach. Plus, if it's time series data, then the entire concept of windowing would be questionable anyway.
Expanding windows are no good for the same reasons stated above.
About memory and processability: I think this is an excellent scenario for Spark. It is built exactly for this purpose and has excellent support for Python. Millions of data points are no big deal for Spark. Plus, you would be able to massively parallelize your operations. Being on cloud infrastructure also gives you an advantage in configurability and expandability without headaches. I don't know why people like to use Jupyter even for batch tasks like these, but if you are hell-bent on using it, then a PySpark kernel is also supported by Jupyter. A vertical split would probably be the right approach here.
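A minimal PySpark sketch of that vertical split; the connection string and table name are placeholders, and it assumes the price table fits in driver memory:
import pandas as pd
from itertools import combinations
from scipy.stats import linregress
from sqlalchemy import create_engine
from pyspark.sql import SparkSession
engine = create_engine("postgresql+psycopg2://user:pass@host/db")
prices = pd.read_sql("SELECT * FROM prices", engine)  # one column per stock
spark = SparkSession.builder.appName("pair-regressions").getOrCreate()
bc = spark.sparkContext.broadcast(prices)  # ship the table to workers once
def r_squared(pair):
    a, b = pair
    df = bc.value
    return pair, linregress(df[a].values, df[b].values).rvalue ** 2
pairs = list(combinations(prices.columns, 2))
results = spark.sparkContext.parallelize(pairs).map(r_squared).collect()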
Hope these answer your questions.

Pandas and the best method for representing variable-length time-series

Here's the scenario. Let's say I have data from a visual psychophysics experiment, in which a subject indicates whether the net direction of motion in a noisy visual stimulus is to the left or to the right. The atomic unit here is a single trial and a typical daily session might have between 1000 and 2000 trials. With each trial are associated various parameters: the difficulty of that trial, where stimuli were positioned on the computer monitor, the speed of motion, the distance of the subject from the display, whether the subject answered correctly, etc. For now, let's assume that each trial has only one value for each parameter (e.g., each trial has only one speed of motion, etc.). So far, so easy: trial ids are the Index and the different parameters correspond to columns.
Here's the wrinkle. With each trial are also associated variable length time series. For instance, each trial will have eye movement data that's sampled at 1 kHz (so we get time of acquisition, the x data at that time point, and y data at that time point). Because each trial has a different total duration, the length of these time series will differ across trials.
So... what's the best means of representing this type of data in a pandas DataFrame? Is this something that pandas can even be expected to deal with? Should I go to multiple DataFrames, one for the single-valued parameters and one for the time-series-like parameters?
I've considered adopting a MultiIndex approach where level 0 corresponds to trial number and level 1 corresponds to the time of continuous data acquisition. Then all I'd need to do is repeat the single-valued columns to match the length of the time series on that trial. But I immediately foresee two problems. First, the number of single-valued columns is large enough that extending each one of them to match the length of the time series seems very wasteful, if not impractical. Second, and more importantly, if I want to do basic groupby-type analyses (e.g. getting the proportion of correct responses at a given difficulty level), this will give biased (incorrect) results, because whether each trial was correct or wrong will be repeated as many times as necessary for its length to match the length of the time series on that trial (which is irrelevant to the computation of the mean across trials).
I hope my question makes sense and thanks for suggestions.
I've also just been dealing with this type of issue. I have a bunch of motion-capture data that I've recorded, containing x-, y-, and z-locations of several motion-capture markers at time intervals of 10 ms, but there are also a couple of single-valued fields per trial (e.g., which task the subject is doing).
I've been using this project as a motivation for learning about pandas, so I'm certainly not yet "fluent" with it. But I have found it incredibly convenient to be able to concatenate the data frames for each trial into a single larger frame for, e.g., one subject:
import pandas as pd
# One DataFrame per trial, concatenated with the trial number as the
# outer level of the resulting MultiIndex.
subject_df = pd.concat(
    [pd.read_csv(t) for t in subject_trials],
    keys=range(len(subject_trials)))
Anyway, my suggestion for how to combine single-valued trial data with continuous time recordings is to duplicate the single-valued columns down the entire index of your time recordings, like you mention toward the end of your question.
The only thing you lose by denormalizing your data in this way is that your data will consume more memory; however, provided you have sufficient memory, I think the benefits are worth it, because then you can do things like group individual time frames of data by the per-trial values. This can be especially useful with a stacked data frame!
As for removing the duplicates for doing, e.g., trial outcome analysis, it's really straightforward to do this:
df.outcome.unique()
assuming your data frame has an "outcome" column.
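And if you need one row per trial for the kind of groupby analysis the question mentions (proportion correct per difficulty level), a sketch along these lines should work, assuming level 0 of the MultiIndex is the trial number and the columns are named as in the question:
# Single-valued columns are constant within a trial, so taking the first
# row of each trial collapses the stacked frame back to one row per trial.
per_trial = subject_df.groupby(level=0)[["outcome", "difficulty"]].first()
# Groupby analyses are then unbiased by trial duration, e.g. the
# proportion of correct responses at each difficulty (outcome coded 0/1).
p_correct = per_trial.groupby("difficulty")["outcome"].mean()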

Sorting using Map-Reduce - Possible approach

I have a large dataset with 500 million rows and 58 variables. I need to sort the dataset by a 59th variable, which is calculated from the other 58. The variable happens to be a floating-point number with four places after the decimal.
There are two possible approaches:
The normal merge sort
While calculating the 59th variable, I start sending rows in particular ranges to particular nodes, sort the ranges in those nodes, and then combine them in the reducer. Once I have perfectly sorted data, I also know where to merge which set of data; it basically becomes appending.
Which is a better approach and why?
I'll assume that you are looking for a total sort order, without a secondary sort, for all your rows. I should also mention that 'better' is never a good question, since there is typically a trade-off between time and space, and in Hadoop we tend to think in terms of space rather than time, unless you use products that are optimized for time (Teradata has the capability of putting databases in memory for Hadoop use).
Of the two approaches you mention, I think only one would work within the Hadoop infrastructure: number 2. Since Hadoop leverages many nodes to do one job, sorting becomes a little trickier to implement, and we typically want the 'shuffle and sort' phase of MR to take care of the sorting, since distributed sorting is at the heart of the programming model.
At the point when the 59th variable is generated, you would want to sample the distribution of that variable so that you can send it through the framework and then merge like you mentioned. Consider the case where one range of the variable's distribution contains 80% of your values. What this might do is send 80% of your data to one reducer, which would do most of the work. This assumes, of course, that some keys will be grouped in the sort and shuffle phase, which would be the case unless you programmed them to be unique. It's up to the programmer to set up partitioners that evenly distribute the load by sampling the key distribution.
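A small sketch of that sampling step; the distribution and reducer count here are hypothetical, and a real job would sample the actual computed keys:
import numpy as np
# Stand-in for a sample of the 59th variable gathered while computing it.
sample = np.random.normal(size=100_000)
n_reducers = 32  # illustrative
# Quantile cut points give each reducer a roughly equal share of rows,
# even when the distribution is heavily skewed.
cuts = np.percentile(sample, np.linspace(0, 100, n_reducers + 1)[1:-1])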
If, on the other hand, we were to sort in memory, then we could accomplish the same thing during the reduce, but there are inherent scalability issues: the sort is only as good as the amount of memory available in the node currently running it, and it dies off quickly once it starts using HDFS to look for the rest of the data that did not fit into memory. And if you ignored the sampling issue, you would likely run out of memory unless all your key-value pairs are evenly distributed and you understand the memory capacity of your data.
Check out the Hadoop Comparator Class section of the HadoopStreaming wiki page.
You can move the datasets to HDFS, use Python to write a mapper, and run a Hadoop Streaming mapper-only job. Hadoop Streaming will automatically sort the records for you.
Then you can use hdfs dfs -getmerge and -copyToLocal to move the sorted records back to the local filesystem if you want.
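A minimal mapper sketch; compute_score is a hypothetical stand-in for your 58-variable formula, and the zero-padded key assumes the scores are non-negative:
#!/usr/bin/env python
# Hadoop Streaming mapper: emit the computed 59th variable as the key so
# the framework's shuffle-and-sort orders the rows by it.
import sys
def compute_score(fields):
    return sum(float(f) for f in fields) / len(fields)  # placeholder formula
for line in sys.stdin:
    row = line.rstrip("\n")
    score = compute_score(row.split("\t"))
    # Fixed-width, zero-padded key so the default byte-wise sort agrees
    # with numeric order (four decimal places, as in the question).
    print("%015.4f\t%s" % (score, row))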
