I have a bit of code that downloads historical minute-by-minute data from Binance and combines it into one CSV per pair, e.g. BCHUSDT-1m-data.csv, BTCUSDT-1m-data.csv, and so on for whatever pairs I want. However, I keep getting a
requests.exceptions.ChunkedEncodingError caused by a connection reset, error 10054 (connection closed by remote host).
Is there a better way to get this information than the client.get_historical_klines(interval) method? Ideally I would want even more granular data (30s, 15s, or even 1s if that is available historically). Thanks in advance!
Link to API: Python-Binance API
For trades at a finer granularity than 1m you need to use
trades = client.get_historical_trades(symbol='BNBBTC')
or
trades = client.get_aggregate_trades(symbol='BNBBTC')
The latter is better: it costs less request weight and contains more information.
Then, if you want to combine the trades into candles/klines, you can use the pandas resample() or ohlc() functions.
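For example, here is a minimal sketch (the symbol and credentials are placeholders) that pulls aggregate trades and resamples them into 15-second candles with pandas:
import pandas as pd
from binance.client import Client

client = Client(api_key='...', api_secret='...')  # placeholder credentials

# aggTrades fields: p = price, q = quantity, T = trade time (ms), m = buyer is maker
trades = client.get_aggregate_trades(symbol='BNBBTC')
df = pd.DataFrame(trades)
df['T'] = pd.to_datetime(df['T'], unit='ms')
df[['p', 'q']] = df[['p', 'q']].astype(float)
df = df.set_index('T')

# build 15-second candles: OHLC from price, volume from summed quantity
candles = df['p'].resample('15s').ohlc()
candles['volume'] = df['q'].resample('15s').sum()
print(candles.head())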
I am a beginner with pandas; I picked it up because it seemed to be the most popular library and the easiest to work with based on reviews. My intention is fast data processing using async processes (pandas doesn't really support async, but I haven't hit that problem yet). If you believe a better library would fit my needs based on the scenarios below, please let me know.
My code runs websockets with asyncio, constantly fetching activity data and storing it in a pandas DataFrame like so:
# append one stream event as a new row: local time, event time (E), symbol (s), price (p), quantity (q), buyer-is-maker flag (m)
data_set.loc[len(data_set)] = [datetime.now(), res['data']['E'], res['data']['s'], res['data']['p'], res['data']['q'], res['data']['m']]
That seems to work when printing out the results. The DataFrame grows quickly, so I have a clean-up function that checks the len() of the DataFrame and drop()s rows.
My intention is to take the full set in data_set, create a summary view based on a group value, and calculate additional analytics using the grouped data and data points at different date_time snapshots. These calculations would run multiple times per second.
What I mean is this (it is all made up, not working code, just the principle of what is needed):
grouped_data = data_set.groupby('name')
stats_data['name'] = grouped_data['name'].drop_duplicates()
stats_data['latest'] = grouped_data['column_name'].tail(1)
stats_data['change_over_1_day'] = ? (need to get the oldest record that falls within the 1-day window (out of multiple days of data), take the value from a specific column and compare it against ['latest'])
stats_data['change_over_2_day'] = ?
stats_data['change_over_3_day'] = ?
stats_data['total_over_1_day'] = grouped_data.filter(data > 1 day ago).sum(column_name)
I have googled a million things, but every time the examples are quite basic and don't really help with my scenarios.
Any help appreciated.
The question was a bit vague, I guess, but after some more research (googling) and hours of trial and error I managed to accomplish everything I mentioned here.
Hopefully this can save some time for someone who is new to this:
# latest row per name
stats_data = data.loc[data.groupby('name')['date_time'].idxmax()].reset_index(drop=True)
# oldest value per name within the last day (day_1 is the cut-off timestamp)
value_1_day_ago = data.loc[data[data.date_time > day_1].groupby('name')['date_time'].idxmin()].drop(labels=['date_time', 'id', 'volume', 'flag'], axis=1).set_index('name')['value']
# percentage change between the latest value and the value one day ago
stats_data['change_over_1_day'] = stats_data['value'].astype('float') / stats_data['name'].map(value_1_day_ago).astype('float') * 100 - 100
The same principle applies to the other columns.
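For reference, here is a self-contained sketch of the same approach with a made-up toy frame (column names follow the snippet above, the values are invented):
import pandas as pd
from datetime import datetime, timedelta

now = datetime.now()
day_1 = now - timedelta(days=1)  # cut-off for the 1-day window

# toy data: two names, two rows each
data = pd.DataFrame({
    'name': ['AAA', 'AAA', 'BBB', 'BBB'],
    'date_time': [now - timedelta(hours=30), now, now - timedelta(hours=20), now],
    'value': [10.0, 12.0, 100.0, 90.0],
})

# latest row per name
stats_data = data.loc[data.groupby('name')['date_time'].idxmax()].reset_index(drop=True)

# oldest value per name that still falls inside the 1-day window
recent = data[data.date_time > day_1]
value_1_day_ago = recent.loc[recent.groupby('name')['date_time'].idxmin()].set_index('name')['value']

# percentage change between the latest value and the value one day ago
stats_data['change_over_1_day'] = (
    stats_data['value'].astype(float)
    / stats_data['name'].map(value_1_day_ago).astype(float) * 100 - 100
)
print(stats_data)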
If anyone has a much more efficient/faster way to do this, please post your answer.
I'm going through the documentation for boto3 AWS and cannot find simple information on fetching raw values from my custom metrics.
For example, I'm trying to log the user id on each access to a particular website path. But from the documentation I only have access to aggregated values, which makes this impossible. The available statistics are 'SampleCount'|'Average'|'Sum'|'Minimum'|'Maximum', which make no sense for my particular case of a user id.
Update
In simple words: there is no support for fetching raw values from CloudWatch.
The statistics you are referring to are obtained using get_metric_statistics. However, to get the actual data points you should be looking at get_metric_data:
You can use the GetMetricData API to retrieve as many as 500 different metrics in a single request, with a total of as many as 100,800 data points.
But note that the older the data points are, the lower their resolution: AWS does not store all points, and only "new" data is kept at its original resolution.
Data points that are initially published with a shorter period are aggregated together for long-term storage. For example, if you collect data using a period of 1 minute, the data remains available for 15 days with 1-minute resolution. After 15 days, this data is still available, but is aggregated and retrievable only with a resolution of 5 minutes.
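For example, a rough boto3 sketch of pulling data points with get_metric_data (the namespace, metric name, and dimension below are hypothetical placeholders):
import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client('cloudwatch')

response = cloudwatch.get_metric_data(
    MetricDataQueries=[{
        'Id': 'pathAccess',
        'MetricStat': {
            'Metric': {
                'Namespace': 'MyApp',        # hypothetical namespace
                'MetricName': 'PathAccess',  # hypothetical metric
                'Dimensions': [{'Name': 'Path', 'Value': '/login'}],
            },
            'Period': 60,           # 1-minute buckets
            'Stat': 'SampleCount',  # still an aggregation, not raw samples
        },
        'ReturnData': True,
    }],
    StartTime=datetime.utcnow() - timedelta(hours=3),
    EndTime=datetime.utcnow(),
)

result = response['MetricDataResults'][0]
for ts, value in zip(result['Timestamps'], result['Values']):
    print(ts, value)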
Update on aggregation
Also from docs:
Although you can publish data points with time stamps as granular as one-thousandth of a second, CloudWatch aggregates the data to a minimum granularity of 1 second.
CloudWatch does not store the raw values published with PutMetricData; it only stores aggregations. The smallest granularity you can get is 1 second, and only for the latest 3 hours.
If you need access to raw values, you could use CloudWatch Logs with Embedded Metric Format to publish your metrics: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Embedded_Metric_Format.html
That way CloudWatch Logs creates your custom metrics, which will still only contain the aggregations; you can use them for alarming and dashboarding.
But you will also have the raw CloudWatch Logs entries, which you can look up to see exactly what was published.
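For example, a rough sketch of an EMF entry (the namespace, metric, dimension, and UserId fields are made up for illustration); in a Lambda function, printing this JSON to stdout is enough for CloudWatch Logs to extract the metric while keeping the full entry, including the user id, searchable:
import json
import time

emf_entry = {
    "_aws": {
        "Timestamp": int(time.time() * 1000),  # milliseconds since epoch
        "CloudWatchMetrics": [{
            "Namespace": "MyApp",                                  # made-up namespace
            "Dimensions": [["Path"]],
            "Metrics": [{"Name": "PathAccess", "Unit": "Count"}],  # made-up metric
        }],
    },
    "Path": "/login",
    "PathAccess": 1,
    "UserId": "user-1234",  # the raw value you actually care about
}
print(json.dumps(emf_entry))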
I am using Redis with Python to store my per-second ticker data (price and volume of an instrument). I am performing r.hget(instrument, key) and facing the following issue.
My key (a string) looks like 01/01/2020-09:32:01 and keeps incrementing per second until the end of the user-specified interval.
For example: 01/01/2020-09:32:01, 01/01/2020-09:32:02, 01/01/2020-09:32:03, ...
My r.hget(instrument, key) result looks like b'672.2,432' (price and volume separated by a comma).
The issue I am facing is that a user can specify a long time interval, for example all data from 01/01/2020 to 31/12/2020 (d/m/y format). To perform the get operation I first have to generate every timestamp in that period and then do one get per key to build a pandas DataFrame. Generating these timestamps to use as keys slows down my process terribly (although it also guarantees strict ordering, e.g. 01/01/2020-09:32:01 will definitely come before 01/01/2020-09:32:02). Is there another way to achieve the same?
If I simply do r.hgetall(...) I won't be able to satisfy the user's time-interval condition.
Redis sorted sets are a good fit for such range queries. Sorted sets are made up of unique members, each with a score; in your case the timestamp (in epoch seconds) can be the score and the price and volume can be the member. However, since members of a sorted set must be unique, you may consider appending the timestamp to the member to make it unique.
zadd instrument 1577883600 672.2,432,1577883600
zadd instrument 1577883610 672.2,412,1577883610
After adding members to the set you can do range queries using ZRANGEBYSCORE, as below:
zrangebyscore instrument 1577883600 1577883610
If your instrument accumulates many values, consider sharding it into multiple sets, for example one per month: instrument:202001, instrument:202002, and so on.
The following are good reads on this topic:
Sorted Set Time Series
Sharding Structure
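If you are doing this from Python, here is a rough redis-py sketch of the same commands (key and values copied from the examples above):
import redis

r = redis.Redis()

# score = epoch seconds; member = "price,volume,timestamp" (the timestamp suffix keeps members unique)
r.zadd('instrument', {'672.2,432,1577883600': 1577883600})
r.zadd('instrument', {'672.2,412,1577883610': 1577883610})

# all entries between the two timestamps (inclusive)
for member in r.zrangebyscore('instrument', 1577883600, 1577883610):
    price, volume, ts = member.decode().split(',')
    print(ts, price, volume)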
So to perform the get operation I have to first generate timestamps for that period and then perform the get operation...
No. This is the problem.
Make a function that calculates the timestamps and yields a smaller set of values for a smaller time span (one week or one month).
So the new workflow runs in batches; see this loop:
generate a small set of timestamps
fetch items from redis
Pros:
minimizes memory usage
easy to change your current code to this new algorithm
I don't know the Redis-specific functions, so more specific solutions may well be better. My idea is a general approach that I have used with success for other problems.
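A rough sketch of the batching idea (the function name, batch size, and column names are made up): generate keys one day at a time and fetch each day's values with a single HMGET round trip, instead of building the full multi-year key list up front.
import pandas as pd
import redis
from datetime import timedelta

r = redis.Redis()

def fetch_range(instrument, start, end, batch=timedelta(days=1)):
    frames = []
    current = start
    while current < end:
        batch_end = min(current + batch, end)
        # one key per second in this batch (end of the batch treated as exclusive)
        keys = [ts.strftime('%d/%m/%Y-%H:%M:%S')
                for ts in pd.date_range(current, batch_end, freq='s')[:-1]]
        if keys:
            values = r.hmget(instrument, keys)  # one round trip per batch
            rows = [(k, *v.decode().split(',')) for k, v in zip(keys, values) if v is not None]
            frames.append(pd.DataFrame(rows, columns=['key', 'price', 'volume']))
        current = batch_end
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame(columns=['key', 'price', 'volume'])
Call it as fetch_range('instrument', start_dt, end_dt) with two datetime objects; the DataFrame is built up batch by batch instead of from one huge key list.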
Have you considered using RedisTimeSeries for this task? It is a redis module that is tailored exactly for the sort of task you are describing.
You can keep two time series per instrument, holding price and volume.
With RedisTimeSeries it is easy to query over different ranges, and you can use the filtering mechanism to group different series (instrument families, for example) and query all of them at once.
// create your timeseries
TS.CREATE instrument:price LABELS instrument violin type string
TS.CREATE instrument:volume LABELS instrument violin type string
// add values
TS.ADD instrument:price 123456 9.99
TS.ADD instrument:volume 123456 42
// query timeseries
TS.RANGE instrument:price - +
TS.RANGE instrument:volume - +
// query multiple timeseries by filtering according to labels
TS.MRANGE - + FILTER instrument=violin
TS.MRANGE - + FILTER type=string
RedisTimeSeries allows running queries with aggregations such as average and standard deviation, and uses double-delta compression, which can reduce your memory usage by over 90%.
You can check out a benchmark here.
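From Python you can issue the same commands through redis-py's generic execute_command (a rough sketch that mirrors the keys and labels above; it assumes the RedisTimeSeries module is loaded on the server):
import redis

r = redis.Redis()

# create the two series with labels for filtering
r.execute_command('TS.CREATE', 'instrument:price', 'LABELS', 'instrument', 'violin', 'type', 'string')
r.execute_command('TS.CREATE', 'instrument:volume', 'LABELS', 'instrument', 'violin', 'type', 'string')

# add samples (timestamp, value)
r.execute_command('TS.ADD', 'instrument:price', 123456, 9.99)
r.execute_command('TS.ADD', 'instrument:volume', 123456, 42)

# full range of one series, and every series labelled instrument=violin
print(r.execute_command('TS.RANGE', 'instrument:price', '-', '+'))
print(r.execute_command('TS.MRANGE', '-', '+', 'FILTER', 'instrument=violin'))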
I would like to retrieve the following (historical) information while using the
ek.get_data()
function: ISIN, MSNR, MSNP, MSPI, NR, PI, NT
for some equity indices, take ".STOXX" as an example. How do I do that? I want to point out that I am using the get_data function rather than the timeseries function because I need daily data and would otherwise exceed the 3k-row limit of get_timeseries.
In general: how do I get to know the right names for the fields that I have to use inside the
ek.get_data()
function? I tried both the codes that the Excel Eikon add-in uses and the names shown in the Eikon browser, but they differ quite a lot from the examples I saw in sample code on the web (e.g. TR.TotalReturnYTD vs TR.PCTCHG_YTD). How do I figure out the right names for the data types I need?
Considering the codes in your function (ISIN, MSNR, MSNP, MSPI, NR, PI, NT), I'd guess you are interested in the Datastream dataset. You are probably better off using the Datastream Web Services (DSWS) API instead of the Eikon API. This will also relieve you of the 3k-row limit.
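For illustration, a rough sketch using the DatastreamDSWS Python package (pip install DatastreamDSWS); the credentials are placeholders, and the exact index mnemonic and datatype codes should be checked in Datastream Navigator:
import DatastreamDSWS as DSWS

ds = DSWS.Datastream(username='your_dsws_user', password='your_dsws_password')  # placeholders

# daily history for a Datastream index mnemonic (placeholder) and the datatypes from the question
df = ds.get_data(
    tickers='<index mnemonic>',
    fields=['ISIN', 'MSNR', 'MSNP', 'MSPI', 'NR', 'PI', 'NT'],
    start='2015-01-01',
    end='-0D',   # today
    freq='D',    # daily data, so the 3k-row limit of get_timeseries does not apply
)
print(df.head())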
Should I timestamp my data extracts?
A few colleagues and I work together on a Python server to solve a data-science-related problem. I wrote a few functions to extract my data from my source database and save it on the Python server for further processing. Now I'm struggling with whether I should save the extract with a timestamp, meaning that every time I run my pipeline another extract is saved, or omit the timestamp and overwrite the old extract. I have read a lot about data not needing the same kind of version control as code, and I don't really want to clutter the server with multiple, vastly redundant data extracts.
save the extract with a timestamp, meaning that every time I run my pipeline another extract is saved, or omit the timestamp and overwrite the old extract.
Is the change of a feature over time important to your data science related problem?
Do you have any metrics which could tell a story if measured over time?
Perhaps you can store only the delta since the last data pull instead of redundant features (feature-engineer on a different table).
Just a couple of thoughts. Good luck :)