I haven't been able to find any direct answers, so I thought I'd ask here.
Can an ETL service, for example AWS Glue, be used to perform aggregations that lower the resolution of data to AVG, MIN, MAX, etc. over arbitrary time ranges?
e.g. - Given 2000+ data points of outside temperature in the past month, use an ETL job to lower that resolution to 30 data points of daily averages over the past month. (actual use case of such data aside, just an example).
The idea is to aggregate the data down to a lower resolution so that charts, graphs, etc. can display long time ranges of large data sets more quickly, since we don't need every individual data point only to aggregate it dynamically on the fly for these charts and graphs.
My research so far only turns up ETL being used for 1-to-1 transformations of data, not 1000-to-1. It seems ETL is used more for transforming data into an appropriate structure to store in a database, not for aggregating over large data sets.
Could I use ETL to solve my aggregation needs? This will be on a very large scale, implemented with AWS and Python.
The 'T' in ETL stands for 'Transform', and aggregation is one of the most common transformations performed. Briefly: yes, ETL can do this for you. The rest depends on your specific needs. Do you need any drill-down? Increasing resolution on zoom, perhaps? That would affect the whole design, but in general, preparing your data for the presentation layer is exactly what ETL is used for.
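For illustration, here is a minimal PySpark sketch of that kind of resolution-lowering aggregation (Glue ETL jobs typically run PySpark under the hood); the S3 paths and column names are made up:

```python
# Minimal PySpark sketch of downsampling raw readings to daily AVG/MIN/MAX.
# Paths and column names are placeholders, not Glue-specific APIs.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("downsample-temperatures").getOrCreate()

# Raw readings: one row per measurement (timestamp, temperature)
raw = spark.read.parquet("s3://my-bucket/raw/temperature/")

daily = (
    raw.withColumn("day", F.to_date("timestamp"))
       .groupBy("day")
       .agg(
           F.avg("temperature").alias("avg_temp"),
           F.min("temperature").alias("min_temp"),
           F.max("temperature").alias("max_temp"),
       )
)

# ~30 rows for a month of data, ready for the presentation layer
daily.write.mode("overwrite").parquet("s3://my-bucket/aggregated/temperature_daily/")
```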
I am working on an application which generates a couple of hundred datasets every ten minutes. These datasets consist of a timestamp, and some corresponding values from an ongoing measurement.
(Almost) Naturally, I use pandas dataframes to manage the data in memory.
Now I need to do some work with historical data (e.g. averaging or summation over days/weeks/months etc., but not limited to that), and I need to update those accumulated values rather frequently (ideally also every ten minutes), so I am wondering what the most access-efficient way to store the data on disk would be.
So far I have been storing the data for every ten-minute interval in a separate CSV file and then reading the relevant files into a new DataFrame as needed. But I feel there must be a more efficient way, especially when working with a larger number of datasets. Computation cost and memory are not the central issue, as I am running the code on a comparatively powerful machine, but I still don't want to (and most likely can't afford to) read all the data into memory every time.
It seems to me that the answer should lie within the built-in serialization functions of pandas, but from the docs and my google findings I honestly can't really tell which would fit my needs best.
Any ideas how I could manage my data better?
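For concreteness, a small sketch of the workflow described above, assuming each batch is a DataFrame with a 'timestamp' column and numeric value columns (file layout and names are made up); pandas' built-in serializers such as to_parquet or to_hdf could slot into the same store/load functions:

```python
# Sketch of the current approach: one CSV per ten-minute interval, then
# read back only the relevant files and downsample with resample().
import glob
import pandas as pd

def store_batch(batch: pd.DataFrame, ts: pd.Timestamp) -> None:
    # One file per ten-minute interval, named by its start time.
    batch.to_csv(f"data/{ts:%Y%m%d_%H%M}.csv", index=False)

def load_day(day: str) -> pd.DataFrame:
    # Read only the files belonging to one day and concatenate them.
    files = sorted(glob.glob(f"data/{day}_*.csv"))
    frames = [pd.read_csv(f, parse_dates=["timestamp"]) for f in files]
    return pd.concat(frames, ignore_index=True)

def daily_average(day: str) -> pd.DataFrame:
    # Example accumulation: daily averages; resample also handles
    # weekly or monthly frequencies.
    return load_day(day).set_index("timestamp").resample("1D").mean()
```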
I am storing multiple time-series in a MongoDB with sub-second granularity. The DB is updated by a bunch of Python scripts, and the data stored serve two main purposes:
(1) It's a central information source for the latest data from all series. Multiple scripts access it every second or so to read the latest datapoint in each collection.
(2) It's a long-term data store. I often load the whole DB into Python to analyse trends in the data.
To keep the DB as efficient as possible, I want to bucket my data (ideally holding one document per day in each collection). Because of (1), however, the bigger the buckets, the more expensive the sorting required to access the last datapoint.
I can think of two solutions here, but I'm not sure what alternatives there are, or which is the best way:
a) Store the latest timestamp in a one-line document in a separate db/collection. No sorting required on read, but an additional write is required every time any series gets a new datapoint.
b) Keep the buckets smaller (say 1-hour each) and sort.
With a) you write smallish documents to a separate collection, which is preferable performance-wise to updating large documents. You could write all new datapoints to this collection and aggregate them for the hour or day, depending on your preference. But as you said, this requires an additional write operation.
With b) you need to keep the index size for the sort field in mind. Does the index fit in memory? That's crucial for the performance of the sort, as you do not want to do any in-memory sorting of a large collection.
I recommend exploring the hybrid approach of storing individual datapoints for a limited time in an 'incoming' collection. Once your bucketing interval of an hour or a day approaches, you can aggregate the datapoints into buckets and store them in a different collection. Of course, there is now some additional complexity in the application, which needs to be able to read both the bucketed and the datapoint collections and merge them.
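A rough pymongo sketch of that hybrid layout, with made-up collection names and a 1-hour bucket chosen purely for illustration:

```python
# Hybrid layout: small per-point writes go to 'incoming'; finished hours
# get rolled up into single bucket documents in 'buckets'.
from datetime import datetime, timedelta
from pymongo import MongoClient, DESCENDING

client = MongoClient()
db = client["timeseries"]

def write_point(series: str, ts: datetime, value: float) -> None:
    # Small writes; reading the latest point stays a cheap indexed sort.
    db.incoming.insert_one({"series": series, "ts": ts, "value": value})

def latest_point(series: str) -> dict:
    return db.incoming.find_one({"series": series}, sort=[("ts", DESCENDING)])

def roll_up(series: str, hour_start: datetime) -> None:
    # Move one finished hour from 'incoming' into a single bucket document.
    hour_end = hour_start + timedelta(hours=1)
    query = {"series": series, "ts": {"$gte": hour_start, "$lt": hour_end}}
    points = list(db.incoming.find(query).sort("ts", 1))
    if points:
        db.buckets.insert_one({
            "series": series,
            "bucket_start": hour_start,
            "points": [{"ts": p["ts"], "value": p["value"]} for p in points],
        })
        db.incoming.delete_many(query)
```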
I have a huge database (~100 variables with a few million rows) consisting of stock data. I managed to connect Python with the database via SQLAlchemy (postgresql+psycopg2). I am running it all in the cloud.
In principle I want to do a few things:
1) Regression of all possible combinations: I am running a simple regression of each stock on each other stock, i.e. ABC on XYZ and also XYZ on ABC, across the n=100 stocks, resulting in n(n+1)/2 combinations.
-> I am thinking of a function that pulls in a pair of stocks, runs the two regressions, compares the results, and picks one based on some criteria.
My question: Is there an efficient way to pull in all these "factorial" pairings?
2) Rolling windows: To avoid an overload of data, I thought I would only load the DataFrame under investigation, i.e. 30 days, and then roll it forward one day at a time, meaning my periods are:
1: 1D-30D
2: 2D-31D and so on
So I always drop the first day and append another row at the end of my DataFrame, meaning I have two steps: drop the first day and read the next row from my database.
My question: Is this a meaningful way to do it, or does Python have something better up its sleeve? How would you do it?
3) Expanding windows: Instead of dropping the first row and adding another one, I keep the 30 days, add another 30 days, and then run my regression. The problem here is that at some point I would be including all the data, which will probably be too big for memory.
My question: What would be a workaround here?
4) As I am running my analysis in the cloud (with a few more cores than my own PC), I could in fact use multithreading, sending "batch" jobs and letting Python do things in parallel. I thought of splitting my dataset into 4 x 25 stocks and letting them run in parallel (a vertical split), or would a horizontal split be better?
Additionally, I am using Jupyter; I am wondering how best to approach this. Usually I have a shell script calling my_program.py. Is that the same here?
Let me try to give answers categorically and also note my observations.
From your description, I suppose you have taken each stock scrip as one variable and you are trying to perform pairwise linear regression amongst them. The good news is that this is highly parallelizable. All you need to do is generate the unique combinations of all possible pairings, perform your regressions, and then keep only those models which fit your criteria.
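As a sketch of that step (statsmodels is used here only as an example regression backend, and the selection criterion is a placeholder):

```python
# Generate each unordered pair once, run both regressions, keep the better fit.
from itertools import combinations
import statsmodels.api as sm

def best_of_pair(df, a, b):
    # Regress a on b and b on a, keep whichever fits better (placeholder criterion).
    m1 = sm.OLS(df[a], sm.add_constant(df[b])).fit()
    m2 = sm.OLS(df[b], sm.add_constant(df[a])).fit()
    return m1 if m1.rsquared >= m2.rsquared else m2

def all_pairs(df):
    # combinations() yields each unordered pair exactly once: n(n-1)/2 pairs.
    return {(a, b): best_of_pair(df, a, b) for a, b in combinations(df.columns, 2)}
```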
Now, as stocks are your variables, I am assuming the rows are their prices or similar values, but definitely some time series data. If my assumption is correct, then there is a problem with the rolling-window approach. In creating these rolling windows, what you are implicitly doing is using a data sampling method called 'bootstrapping', which uses random but repetitive sampling. But because you are just rolling your data, you are not using random sampling, which might create problems for your regression results. At best the model may simply be overtrained; at worst, I cannot imagine. Hence, drop this approach. Plus, if it is time series data, then the entire concept of windowing would be questionable anyway.
Expanding windows are no good for the same reasons stated above.
About memory and processing power - I think this is an excellent scenario for Spark. It is built exactly for this purpose and has excellent support for Python. Millions of data points are no big deal for Spark. Plus, you would be able to massively parallelize your operations. Being on cloud infrastructure also gives you an advantage in configurability and expandability without headaches. I don't know why people like to use Jupyter even for batch tasks like these, but if you are hell-bent on using it, a PySpark kernel is also supported by Jupyter. A vertical split would probably be the right approach here.
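If you do go the Spark route, a rough PySpark sketch of distributing the pairwise regressions could look like this (the stock names, the loading step, and the regression itself are placeholders):

```python
# Each unique pair becomes an independent task distributed across the cluster.
from itertools import combinations
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pairwise-regressions").getOrCreate()
sc = spark.sparkContext

stocks = [f"stock_{i}" for i in range(100)]   # placeholder names
pairs = list(combinations(stocks, 2))         # n(n-1)/2 unique pairs

def regress_pair(pair):
    a, b = pair
    # Load just the two series (e.g. via your PostgreSQL connection),
    # regress a on b and b on a, and return whatever summary you need.
    return (a, b, "model summary placeholder")

results = sc.parallelize(pairs, numSlices=32).map(regress_pair).collect()
```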
Hope these answer your questions.
I am looking for a method/data structure to implement an evaluation system for a binary matcher used for verification.
This system will be distributed over several PCs.
Basic idea is described in many places over the internet, for example, in this document: https://precisebiometrics.com/wp-content/uploads/2014/11/White-Paper-Understanding-Biometric-Performance-Evaluation.pdf
The matcher I am testing takes two data items as input and calculates a matching score that reflects their similarity (a threshold will then be chosen, depending on the false match/false non-match rates).
Currently I store the matching scores along with labels in a CSV file, like the following:
label1, label2, genuine, 0.1
label1, label4, genuine, 0.2
...
label_2, label_n+1, impostor, 0.8
label_2, label_n+3, impostor, 0.9
...
label_m, label_m+k, genuine, 0.3
...
(I've got a labeled database.)
Then I run a Python script that loads this table into a pandas DataFrame and calculates the FMR/FNMR curve, similar to the one shown in Figure 2 in the link above. The processing is rather simple: just sorting the DataFrame, scanning the rows from top to bottom, and counting the impostors/genuines in the rows above and below each row.
The system should also support finding outliers in order to support matching algorithm improvement (labels of pairs of data items that produced abnormally large genuine scores or abnormally small impostor scores). This is also pretty easy with DataFrames (just sort and take the head rows).
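For reference, a small pandas sketch of that scan, assuming column names for the CSV snippet above and that a lower score means a closer match (as the genuine/impostor values in the snippet suggest):

```python
# FMR/FNMR scan over the score table; column names are assumptions.
import pandas as pd

df = pd.read_csv("scores.csv", names=["label_a", "label_b", "kind", "score"],
                 skipinitialspace=True)
df = df.sort_values("score").reset_index(drop=True)

n_genuine = (df["kind"] == "genuine").sum()
n_impostor = (df["kind"] == "impostor").sum()

# Treating each row's score as the acceptance threshold:
#   FMR  = impostors at or below the threshold / all impostors
#   FNMR = genuines above the threshold / all genuines
df["FMR"] = (df["kind"] == "impostor").cumsum() / n_impostor
df["FNMR"] = 1 - (df["kind"] == "genuine").cumsum() / n_genuine

# Outlier candidates: genuine pairs with abnormally large scores,
# impostor pairs with abnormally small scores.
outlier_genuines = df[df["kind"] == "genuine"].tail(10)
outlier_impostors = df[df["kind"] == "impostor"].head(10)
```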
Now I'm thinking about how to store the comparison data in RAM instead of CSV files on HDD.
I am considering Redis in this regard: the amount of data is large, several PCs are involved in the computations, and Redis has master-slave replication that lets it quickly sync data over the network, so that several PCs have exact clones of the data.
It is also free.
However, Redis does not seem to me to be well suited for storing such tabular data.
Therefore, I would need to change the data structures and the algorithms that process them.
However, it is not obvious to me how to translate this table into Redis data structures.
Another option would be using some other data storage system instead of Redis. However, I am unaware of such systems and will be grateful for suggestions.
You need to learn more about Redis to solve your challenges. I recommend you give https://try.redis.io a try and then think about your questions.
TL;DR - Redis isn't a "tabular data" store, it is a store for data structures. It is up to you to use the data structure(s) that serves your query(ies) in the most optimal way.
IMO what you want to do is actually keep the large data (how big is it anyway?) on slower storage and just store the model (FMR curve computations? Outliers?) in Redis. This can almost certainly be done with the existing core data structures (probably Hashes and Sorted Sets in this case), but perhaps even more optimally with the new Modules API. See the redis-ml module as an example of serving machine learning models off Redis (and perhaps your use case would be a nice addition to it ;))
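To make that concrete, a hypothetical redis-py sketch using a Sorted Set per class and a Hash per pair (key names are made up):

```python
# Sorted Sets keep pairs ordered by score, so range reads replace sorting.
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def add_score(label_a: str, label_b: str, kind: str, score: float) -> None:
    member = f"{label_a}:{label_b}"
    # scores:genuine / scores:impostor are Sorted Sets ordered by score
    r.zadd(f"scores:{kind}", {member: score})
    r.hset(f"pair:{member}", mapping={"kind": kind, "score": score})

# Outlier queries become simple range reads, no client-side sorting
# (assuming, as in the CSV above, that a lower score means a better match):
suspicious_impostors = r.zrange("scores:impostor", 0, 9, withscores=True)
suspicious_genuines = r.zrevrange("scores:genuine", 0, 9, withscores=True)
```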
Disclaimer: I work at Redis Labs, home of the open source Redis and provider of commercial solutions that leverage it, including the above-mentioned module (open source, AGPL licensed).
I am considering using Cassandra as a time-series store. I have millions of series, and each series has around 10K sequential points at uniform intervals. Some series, though, have a few thousand points or fewer. They may start and end at different points, but all share the same times. I access the data series:
Vertically: predefined partitions (e.g. all days in a year) and I need all the rows.
Horizontally: All values of a specific series (random)
I am considering two options. First, I could just have a column per time, as is recommended for monitoring systems, for example (I have a different access pattern, though). Second, I could use list columns, one per partition.
I am worried about read performance (the second use case is more critical) and storage overhead. I did find the following formula:
total_column_size = column_name_size + column_value_size + 15 here
I think that would make the first option quite expensive in terms of storage. I could not find any documentation on the storage layout for lists. Do you know of any? Do you have other recommendations?
BTW, I am using Python as a client for Cassandra, if that makes any difference.
"Storage is cheap" is generally the philosophy here. If you have 2 query patterns, which you seem to, then store everything twice: once partitioned by your desired verticals (days by the looks), and once again by your chosen series. If you don't know how to partition your series in advance (it wasn't clear from the question) then it becomes more complicated. Cassandra reads are sequential when reading in order - and this is the only way you should be using it anyway.
You have in the region of X0bn points, which is larger than your average DB but not bordering on ridiculous, particularly when distributed over a cluster. It's hard to put an exact figure on it given that I don't know the width of your data points, but if these are just scalar values then this is only going to be 2TB or so of data.
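As a purely illustrative back-of-the-envelope check: 5 million series x 10,000 points would be about 50 billion points, and at roughly 40 bytes per point (per the column-size formula above, with a short column name, an 8-byte value and the 15-byte overhead) that comes out around 2 TB before replication and compression.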