I'm going through the boto3 AWS documentation and cannot find straightforward information on fetching raw values from my custom metrics.
For example, I'm trying to log the user ID on each access to a particular website path. But according to the documentation I only have access to aggregated values, which makes this impossible. The available statistics are 'SampleCount' | 'Average' | 'Sum' | 'Minimum' | 'Maximum', none of which make sense for my particular case of a user ID.
UPD
In simple words: there is no support for fetching raw values from CloudWatch.
The statistics you are referring to are obtained using get_metric_statistics. However, to get the actual data points you should be looking at get_metric_data:
You can use the GetMetricData API to retrieve as many as 500 different metrics in a single request, with a total of as many as 100,800 data points.
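For example, a minimal boto3 sketch (the namespace, metric name and dimension below are hypothetical placeholders):

from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client('cloudwatch')

response = cloudwatch.get_metric_data(
    MetricDataQueries=[{
        'Id': 'm1',
        'MetricStat': {
            'Metric': {
                'Namespace': 'MyApp',        # hypothetical namespace
                'MetricName': 'PathAccess',  # hypothetical metric
                'Dimensions': [{'Name': 'Path', 'Value': '/some/path'}],
            },
            'Period': 60,           # one data point per minute
            'Stat': 'SampleCount',  # still an aggregation, not raw values
        },
        'ReturnData': True,
    }],
    StartTime=datetime.utcnow() - timedelta(hours=3),
    EndTime=datetime.utcnow(),
)

for ts, value in zip(response['MetricDataResults'][0]['Timestamps'],
                     response['MetricDataResults'][0]['Values']):
    print(ts, value)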
But it should be noted that the older the data points, the lower their resolution. AWS does not keep all points at full resolution; only "new" data is stored with the original resolution.
Data points that are initially published with a shorter period are aggregated together for long-term storage. For example, if you collect data using a period of 1 minute, the data remains available for 15 days with 1-minute resolution. After 15 days, this data is still available, but is aggregated and retrievable only with a resolution of 5 minutes.
Update on aggregation
Also from the docs:
Although you can publish data points with time stamps as granular as one-thousandth of a second, CloudWatch aggregates the data to a minimum granularity of 1 second.
CloudWatch does not store the raw values published with PutMetricData; it only stores aggregations. The smallest granularity you can get is 1 second, and only for the latest 3 hours.
If you need access to raw values, you could use CloudWatch Logs with the Embedded Metric Format to publish your metrics: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Embedded_Metric_Format.html
That way CloudWatch Logs creates your custom metrics, which will still only have the aggregations. You can use them for alarming and dashboarding.
But you will also have the raw CloudWatch Logs entries, so you can look up exactly what was published.
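As a rough illustration, a single Embedded Metric Format log line could look like the sketch below (namespace and field names are hypothetical; in Lambda you can simply print the line, elsewhere it has to reach CloudWatch Logs via the agent or PutLogEvents):

import json
import time

print(json.dumps({
    "_aws": {
        "Timestamp": int(time.time() * 1000),
        "CloudWatchMetrics": [{
            "Namespace": "MyApp",                                  # hypothetical namespace
            "Dimensions": [["Path"]],
            "Metrics": [{"Name": "PathAccess", "Unit": "Count"}],  # aggregated metric
        }],
    },
    "Path": "/some/path",
    "PathAccess": 1,
    "userId": "user-1234",  # raw value: not a metric, but queryable in CloudWatch Logs Insights
}))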
Related
I'm writing an application in Python to add data to Google Fit. I am successful in creating data sources and adding datasets with points for heart rate, cadence, speed, steps, etc. as well as sessions for those datasets.
I am now trying to add location data so that activities in Google Fit show maps of the activity, but am not having any luck with that. Something that is unclear to me is that while all of the above items have a single value per data point, a location sample has 4 values per data point according to https://developers.google.com/fit/datatypes/location#location_sample.
Do these 4 different items in the data point need to be named in any way, or do I just add them as 4 fpVals one after another, in the same order as described in the above reference? I.e., when building my array of points for the dataset's patch operation, do I just add them to the value array like so:
gfit_loc.append(dict(
    dataTypeName='com.google.location.sample',
    endTimeNanos=p.time.timestamp() * 1e9,
    startTimeNanos=p.time.timestamp() * 1e9,
    value=[dict(fpVal=p.latitude),    # latitude (degrees)
           dict(fpVal=p.longitude),   # longitude (degrees)
           dict(fpVal=10),            # accuracy (meters)
           dict(fpVal=p.elevation)]   # altitude (meters)
))
where the dataset is added with:
data = service.users().dataSources().datasets().patch(
    userId='me',
    dataSourceId='raw:com.google.location.sample:718486793782',
    datasetId='%s-%s' % (min_log_ns, max_log_ns),
    body=dict(
        dataSourceId='raw:com.google.location.sample:718486793782',
        maxEndTimeNs=max_log_ns,
        minStartTimeNs=min_log_ns,
        point=gfit_loc
    )
).execute()
So it turns out that I was doing everything correctly, with one small exception: I was setting the activity type to 95, which is defined as Walking (treadmill), for all activities in my prototype. I had not yet gotten around to letting the user specify the activity type, and since 95 is an indoor treadmill activity, Google Fit was simply not showing any location data for the activity in the form of a map.
Once I started using non-treadmill activity types, maps started showing up in my Google Fit activities.
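For illustration, a sketch of what writing the session with a non-treadmill activity type might look like, reusing the service, min_log_ns and max_log_ns names from the question (the session id and application name are hypothetical):

session_id = 'my-run-20200101'                    # hypothetical id
session = dict(
    id=session_id,
    name='Morning run',
    description='Prototype upload',
    startTimeMillis=int(min_log_ns) // 1_000_000,  # nanoseconds -> milliseconds
    endTimeMillis=int(max_log_ns) // 1_000_000,
    activityType=8,                                # 8 = running, i.e. not 95 (treadmill walking)
    application=dict(name='my-prototype-app'),
)

service.users().sessions().update(
    userId='me', sessionId=session_id, body=session).execute()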
I have a bit of code that downloads minute-by-minute historical data from Binance and combines it into one CSV per pair, e.g. BCHUSDT-1m-data.csv, BTCUSDT-1m-data.csv, etc., for whatever pairs I want. However, I keep getting a
requests.exceptions.ChunkedEncodingError (connection reset error 10054, connection closed by remote host).
Is there a better way to go about getting this information than using the client.get_historical_klines(interval) method? Ideally I would want even more granular data (30s, 15s, or even 1s if at all possible historically). Thanks in advance!
Link to API: Python-Binance API
For data more granular than 1 minute you need to use trades:
trades = client.get_historical_trades(symbol='BNBBTC')
or
trades = client.get_aggregate_trades(symbol='BNBBTC')
The latter is better: it costs less request weight and contains more information.
Then, if you want to combine the trades into candles/klines, you can use the pandas resample or ohlc functions.
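A minimal sketch of that step, assuming trades is the list returned by get_aggregate_trades above ('p', 'q' and 'T' are the aggTrades price, quantity and trade-time fields):

import pandas as pd

df = pd.DataFrame(trades)
df['price'] = df['p'].astype(float)
df['qty'] = df['q'].astype(float)
df.index = pd.to_datetime(df['T'], unit='ms')

# resample to 15-second candles (any pandas offset alias works: '1S', '30S', ...)
ohlc = df['price'].resample('15S').ohlc()
ohlc['volume'] = df['qty'].resample('15S').sum()
print(ohlc.head())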
I am using Redis with Python to store my per-second ticker data (price and volume of an instrument). I am performing r.hget(instrument, key) and facing the following issue.
My key (a string) looks like 01/01/2020-09:32:01 and keeps incrementing per second up to the user-specified interval.
For example: 01/01/2020-09:32:01, 01/01/2020-09:32:02, 01/01/2020-09:32:03, ...
My r.hget(instrument, key) result looks like b'672.2,432' (price and volume separated by a comma).
The issue I am facing is that a user can specify a long time interval, like 2 years, e.g. they want the data from 01/01/2020 to 31/12/2020 (d/m/y format). To perform the get operations I first have to generate every timestamp in that period and then fetch each one to build a pandas DataFrame. Generating these timestamps to use as keys slows down my process terribly (although it does guarantee strict ordering, e.g. 01/01/2020-09:32:01 definitely comes before 01/01/2020-09:32:02). Is there another way to achieve the same?
If I simply do r.hgetall(...) I won't be able to satisfy the user's time-interval condition.
Redis sorted sets are a good fit for such range queries. A sorted set is made up of unique members, each with a score; in your case the timestamp (in epoch seconds) can be the score, and the price and volume the member. However, members in a sorted set must be unique, so you may consider appending the timestamp to the member to make it unique.
zadd instrument 1577883600 672.2,432,1577883600
zadd instrument 1577883610 672.2,412,1577883610
After adding members to the set you can do range queries using zrangebyscore, as below:
zrangebyscore instrument 1577883600 1577883610
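The same flow from Python with redis-py would look roughly like this (values follow the question's price,volume format):

import redis

r = redis.Redis()

# score = epoch seconds; the timestamp is appended to the member to keep it unique
r.zadd('instrument', {
    '672.2,432,1577883600': 1577883600,
    '672.2,412,1577883610': 1577883610,
})

# range query: every entry between the two timestamps, already ordered by score
for raw in r.zrangebyscore('instrument', 1577883600, 1577883610):
    price, volume, ts = raw.decode().split(',')
    print(ts, price, volume)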
If your instrument contains many values then consider sharding it into multiple sets, for example one per month, like instrument:202001, instrument:202002, and so on.
The following are good reads on this topic:
Sorted Set Time Series
Sharding Structure
So to perform the get operation I have to first generate timestamps for that period and then perform the get operation...
No. This is the problem.
Make a function that calculates the timestamps and yields a smaller set of values, covering a smaller time span (one week or one month).
So the new workflow will run in batches; see this loop:
generate a small set of timestamps
fetch items from redis
Pros:
minimizes memory usage
easy to change your current code to this new algorithm
I don't know the Redis-specific functions, so more specific solutions may be better. My idea is a general approach that I have used successfully for other problems.
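A minimal sketch of this batched workflow, assuming redis-py, HMGET on the hash from the question and its d/m/Y-H:M:S key format (the batch size is arbitrary; smaller batches mean less memory per round trip):

from datetime import datetime, timedelta

import pandas as pd
import redis

r = redis.Redis()

def timestamp_batches(start, end, batch=timedelta(hours=1)):
    """Yield lists of per-second keys covering [start, end], one batch at a time."""
    cur = start
    while cur <= end:
        stop = min(cur + batch, end)
        keys, t = [], cur
        while t <= stop:
            keys.append(t.strftime('%d/%m/%Y-%H:%M:%S'))
            t += timedelta(seconds=1)
        yield keys
        cur = stop + timedelta(seconds=1)

frames = []
for keys in timestamp_batches(datetime(2020, 1, 1, 9, 32, 1),
                              datetime(2020, 1, 1, 18, 0, 0)):
    values = r.hmget('instrument', keys)   # one round trip per batch
    rows = [(k, *v.decode().split(',')) for k, v in zip(keys, values) if v]
    frames.append(pd.DataFrame(rows, columns=['time', 'price', 'volume']))

df = pd.concat(frames, ignore_index=True)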
Have you considered using RedisTimeSeries for this task? It is a Redis module tailored exactly for the sort of task you are describing.
You can keep two timeseries per instrument, holding price and volume.
With RedisTimeSeries it is easy to query over different ranges, and you can use the filtering mechanism to group different series (instrument families, for example) and query all of them at once.
// create your timeseries
TS.CREATE instrument:price LABELS instrument violin type string
TS.CREATE instrument:volume LABELS instrument violin type string
// add values
TS.ADD instrument:price 123456 9.99
TS.ADD instrument:volume 123456 42
// query timeseries
TS.RANGE instrument:price - +
TS.RANGE instrument:volume - +
// query multiple timeseries by filtering according to labels
TS.MRANGE - + FILTER instrument=violin
TS.MRANGE - + FILTER type=string
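From Python, assuming redis-py 4.x or later against a Redis server with the RedisTimeSeries module loaded, the equivalent would be roughly:

import redis

r = redis.Redis()

# create the two series with labels used for filtering (same names as above)
r.ts().create('instrument:price', labels={'instrument': 'violin', 'type': 'string'})
r.ts().create('instrument:volume', labels={'instrument': 'violin', 'type': 'string'})

# add values: (key, timestamp, value)
r.ts().add('instrument:price', 123456, 9.99)
r.ts().add('instrument:volume', 123456, 42)

# query a single series over its whole range
print(r.ts().range('instrument:price', '-', '+'))

# query several series at once by label filter
print(r.ts().mrange('-', '+', filters=['instrument=violin']))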
RedisTimeSeries allows running queries with aggregations such as average and standard deviation, and it uses double-delta compression, which can reduce your memory usage by over 90%.
You can check out a benchmark here.
I am storing multiple time-series in a MongoDB with sub-second granularity. The DB is updated by a bunch of Python scripts, and the data stored serve two main purposes:
(1) It's a central information source for the latest data from all series. Multiple scripts access it every second or so to read the latest datapoint in each collection.
(2) It's a long-term data store. I often load the whole DB into Python to analyse trends in the data.
To keep the DB as efficient as possible, I want to bucket my data (ideally holding one document per day in each collection). Because of (1), however, the bigger the buckets, the more expensive the sorting required to access the last datapoint.
I can think of two solutions here, but I'm not sure what alternatives there are, or which is the best way:
a) Store the latest timestamp in a one-line document in a separate db/collection. No sorting required on read, but an additional write is required every time any series gets a new datapoint.
b) Keep the buckets smaller (say 1-hour each) and sort.
With a) you write smallish documents to a separate collection, which is preferable performance-wise to updating large documents. You could write all new datapoints in this collection and aggregate them per hour or day, depending on your preference. But as you said, this requires an additional write operation.
With b) you need to keep the index size of the sort field in mind. Does the index fit in memory? That is crucial for the performance of the sort, as you do not want to do any in-memory sorting of a large collection.
I recommend exploring the hybrid approach, of storing individual datapoints for a limited time in an 'incoming' collection. Once your bucketing interval of hour or day approaches, you can aggregate the datapoints into buckets and store them in a different collection. Of course there is now some additional complexity in the application, that needs to be able to read bucketed and datapoint collections and merge them.
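A rough pymongo sketch of that hybrid approach (collection and field names are hypothetical):

from datetime import datetime, timezone

from pymongo import MongoClient, DESCENDING

db = MongoClient()['metrics']

def write_point(series, value):
    # every new datapoint lands in a small 'incoming' document (cheap write)
    db.incoming.insert_one({'series': series,
                            'ts': datetime.now(timezone.utc),
                            'value': value})

def latest_point(series):
    # (1) latest value: read one small document, no large-bucket sort involved
    return db.incoming.find_one({'series': series}, sort=[('ts', DESCENDING)])

def roll_up_day(series, day_start, day_end):
    # (2) long-term store: fold a finished day of points into one bucket document
    points = list(db.incoming.find({'series': series,
                                    'ts': {'$gte': day_start, '$lt': day_end}})
                  .sort('ts', 1))
    if points:
        db.buckets.insert_one({'series': series, 'day': day_start,
                               'points': [{'ts': p['ts'], 'value': p['value']}
                                          for p in points]})
        db.incoming.delete_many({'_id': {'$in': [p['_id'] for p in points]}})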
I haven't been able to find any direct answers, so I thought I'd ask here.
Can ETL, say for example AWS Glue, be used to perform aggregations to lower the resolution of data to AVG, MIN, MAX, etc over arbitrary time ranges?
e.g. - Given 2000+ data points of outside temperature in the past month, use an ETL job to lower that resolution to 30 data points of daily averages over the past month. (actual use case of such data aside, just an example).
The idea is to perform aggregations to lower the resolution of data to make charts, graphs, etc display long time ranges of large data sets more quickly, as we don't need every individual data point that we must then dynamically aggregate on the fly for these charts and graphs.
My research so far only suggests that ETL is used for 1-to-1 transformations of data, not 1000-to-1. It seems ETL is used more for transforming data into an appropriate structure to store in a DB, not for aggregating over large data sets.
Could I use ETL to solve my aggregation needs? This will be on a very large scale, implemented with AWS and Python.
The 'T' in ETL stands for 'Transform', and aggregation is one of the most common transforms performed. Briefly speaking: yes, ETL can do this for you. The rest depends on your specific needs. Do you need any drill-down? Increasing resolution on zoom, perhaps? That would affect the whole design, but in general preparing your data for the presentation layer is exactly what ETL is used for.
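For the temperature example, a minimal sketch of what the aggregation step inside a Glue (PySpark) job script could look like (the database, table, column names and S3 path are hypothetical):

from awsglue.context import GlueContext
from pyspark.context import SparkContext
from pyspark.sql import functions as F

glue = GlueContext(SparkContext.getOrCreate())

# hypothetical catalog table holding the raw readings ('timestamp', 'temperature' columns)
df = glue.create_dynamic_frame.from_catalog(
    database='weather', table_name='raw_readings').toDF()

# 2000+ points per month collapse into one row per day: a 1000-to-1 style transform
daily = (df.groupBy(F.to_date('timestamp').alias('day'))
           .agg(F.avg('temperature').alias('avg_temp'),
                F.min('temperature').alias('min_temp'),
                F.max('temperature').alias('max_temp')))

daily.write.mode('overwrite').parquet('s3://my-bucket/weather/daily/')  # hypothetical bucket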