How to get traded volume from tick data? - python

I have this Excel data with price movement and traded volume.
By using mydf[mydf.columns[5]].resample('1Min').ohlc(), I get OHLC data but don't know how to get the traded volume for each minute. I have a few issues in mind:
The tick frequency is not uniform (for one particular minute I may have, say, 100 samples and for another it may be 120, so a plain .groupby() may not work for me).
The ohlc() function takes care of the previous issue automatically because I make column G the datetime index.
Can I have code which, based on the "G" column, sums the volume for a particular minute and then subtracts the previous minute's volume so that I get the exact traded volume for that minute?
Here is the input for ohlc, and this is the output I get (screenshots omitted).
I am not interested in CE as of now.
I just want another column added to this dataframe with the traded volume for each minute.
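One possible approach, sketched below under the assumption that column G is already the DatetimeIndex of mydf and that the traded-volume column is the cumulative total for the day (the column name 'volume' is a placeholder; adjust it to your sheet):
import pandas as pd

price = mydf[mydf.columns[5]]
ohlc = price.resample('1Min').ohlc()

# Cumulative volume: take the last reading in each minute, then the
# minute-over-minute difference is the volume traded in that minute.
ohlc['volume'] = mydf['volume'].resample('1Min').last().diff()

# If the volume column is per-tick rather than cumulative, a plain sum
# per minute is enough instead:
# ohlc['volume'] = mydf['volume'].resample('1Min').sum()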

Related

Looking to create a graph based on the average of two columns in my dataset

Ultimately, I am very new to data analysis and am in the middle of a project that is due very soon.
From the data here (screenshot omitted):
I would like to have the Station areas grouped up, and the Time_Diff averaged out for each area.
There are 35,000+ entries in this dataset, which is why I want to group it into totals so the graph will work.
Such as:
Tallaght: 13:46
Blanchardstown: 14:35
etc..
I have attempted to graph them, but my results only returned the total count of the Time_Diff column, so the areas with more entries simply showed higher counts.
I made the Time_Diff column by converting the 'text' time values into datetime using pandas, then subtracting IA from TOC to get the time difference.
My dataset: https://data.gov.ie/dataset/fire-brigade-and-ambulance?package_type=dataset
Brownie points if you can figure out how I can remove the 0 days entry from the output. I believe this was a result of me converting the 'text' to datetime.
subset.groupby('Station Area')['Time_Diff'].mean()
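A possible follow-up sketch, assuming Time_Diff is a timedelta column as described: the "0 days" prefix is just how pandas prints a Timedelta, so converting the averages to minutes both removes it and gives numbers that plot cleanly (plotting needs matplotlib installed).

avg = subset.groupby('Station Area')['Time_Diff'].mean()

# Timedeltas print as "0 days HH:MM:SS"; converting to total minutes drops
# the "0 days" prefix and yields plain numbers for the bar chart.
avg_minutes = avg.dt.total_seconds() / 60
avg_minutes.sort_values().plot(kind='bar')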

How to take the value after every 10 min?

I am using an LSTM to forecast stock prices, and I did some feature engineering on the time series data. I have two columns: the first is price and the second is date. Now I want to train the model on the price value taken every ten minutes. Suppose:
date_index date new_price
08:28:31 08:28:31 8222000.0
08:28:35 08:28:35 8222000.0
08:28:44 08:28:44 8200000.0
08:28:50 08:28:50 8210000.0
08:28:56 08:28:56 8060000.0
08:29:00 08:29:00 8110000.0
08:29:00 08:29:00 8110000.0
08:29:05 08:29:05 8010000.0
08:29:24 08:29:24 8222000.0
08:29:28 08:29:28 8210000.0
Let's say the first date is 8:28:31; it takes the corresponding price value, and the second time it should take the corresponding value ten minutes later, i.e. 8:38:31, but sometimes that time is not available in the data. How do I do it? My goal is just to train the model on a value every 10 or 15 minutes.
The main keyword you are looking for here is resampling.
You have time-series data with unevenly spaced time stamps and want to convert the index to something more regular (i.e. 10 minute intervals).
The question you have to answer is: If the exact 10 minute timestamp is not available in your data, what do you want to do? Take the most recent event instead?
(Let's say the data for 8:38:31 is not available, but there's data for 8:37:25. Do you just want to take that?)
If so, something like df.resample('10min').last() should work; '10min' is the pandas offset alias for a 10-minute interval.
See here:
https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#resampling
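A small sketch of that idea, assuming df has a DatetimeIndex and a 'new_price' column as in the question:

every_10min = df['new_price'].resample('10min').last()

# A 10-minute bucket with no trades at all is left as NaN by last();
# forward-filling carries the most recent known price into that slot.
every_10min = every_10min.ffill()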

How to get statistics of one column of a dataframe using data from a second column?

I'm trying to write a program to give a deeper analysis of stock trading data but am coming up against a wall. I'm pulling all trades for a given timeframe and creating a new CSV file in order to use that file as the input for a predictive neural network.
The dataframe I currently have has three values: (1) the price of the stock; (2) the number of shares sold at that price; and (3) the unix timestamp of that particular trade. I'm having trouble getting any accurate statistical analysis of the data. For example, if I use .median(), the program only looks at the number of values listed rather than the fact that each value may have been traded hundreds of times based on the volume column.
As an example, this is the partial trading history for one of the stocks that I'm trying to analyze.
0 227.60 40 1570699811183
1 227.40 27 1570699821641
2 227.59 50 1570699919891
3 227.60 10 1570699919891
4 227.36 100 1570699967691
5 227.35 150 1570699967691 . . .
To better understand the issue, I've also grouped it by price and summed the other columns with groupby('p').sum(). I realize this means the timestamp is useless, but it makes visualization easier.
227.22 2 1570700275307
227.23 100 1570699972526
227.25 100 4712101657427
227.30 105 4712101371199
227.33 50 1570700574172
227.35 4008 40838209836171 . . .
Is there any way to use the number from the trade volume column to perform a statistical analysis of the price column? I've considered creating a new dataframe where each price is listed the number of times that it is traded, but am not sure how to do this.
Thanks in advance for any help!
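One way to realize the idea mentioned above, sketched here with assumed column names ('p' for price and 'v' for shares traded; only 'p' appears in the question):

import numpy as np
import pandas as pd

# Repeating each price by its traded volume turns per-trade rows into
# per-share rows, so ordinary statistics become volume-weighted.
per_share = pd.Series(np.repeat(df['p'].to_numpy(), df['v'].to_numpy()))
print(per_share.median())    # volume-weighted median price
print(per_share.describe())  # other volume-weighted summary statistics

# The volume-weighted average price (VWAP) can also be computed directly:
vwap = (df['p'] * df['v']).sum() / df['v'].sum()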

How to calculate moving average incrementally with daily data added to data frame in pandas?

I have daily data and want to calculate 5-day, 30-day and 90-day moving averages per user and write them out to a CSV. New data comes in every day. How do I calculate these averages for the new data only, assuming I load the data frame with the last 89 days' data plus today's data?
date user daily_sales 5_days_MA 30_days_MA 90_days_MA
2019-05-01 1 34
2019-05-01 2 20
....
2019-07-18 .....
The number of rows per day is about 1 million. If data for 90 days is too much, 30 days is OK.
You can apply the rolling() method to your dataset if it's in DataFrame format.
your_df['MA_30_days'] = your_df[where_to_apply].rolling(window=30).mean()
If you need a different window over which the moving average is calculated, just change the window parameter. In my example I used mean(), but you can choose another statistic as well.
This code will create another column named 'MA_30_days' with the calculated moving average in your DataFrame.
You can also collect all the moving averages into another DataFrame and save it to CSV format as you wanted:
your_df.to_csv('filename.csv')
In your case the calculation should consider only the newest data. If you want to perform this on the latest data, just slice it. Note, however, that the very first rows will be NaN (depending on the window).
your_df[where_to_apply].iloc[-90:].rolling(window=30).mean()
This will calculate the moving average over the last 90 rows of the given column, and the first 29 rows will be NaN. If all of your latest 90 rows need meaningful values, start the calculation earlier than the last 90 rows, depending on the window size.
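Since the question asks for the averages per user and for three windows at once, a possible sketch (column names 'date', 'user' and 'daily_sales' taken from the question) is:

import pandas as pd

df = df.sort_values(['user', 'date'])
for window in (5, 30, 90):
    # Rolling mean computed independently within each user's history.
    df[f'{window}_days_MA'] = (
        df.groupby('user')['daily_sales']
          .transform(lambda s, w=window: s.rolling(window=w).mean())
    )

# Keep only today's rows before writing, so the CSV holds just the new day.
df[df['date'] == df['date'].max()].to_csv('daily_moving_averages.csv', index=False)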
If the df already contains yesterday's moving average, and just the new day's simple MA is required, I would say use this approach:
MAlength = 90
df.loc[day, 'MA'] = (
    (df.loc[day - 1, 'MA'] * MAlength)    # expand yesterday's MA value
    - df.loc[day - MAlength, 'Price']     # remove the oldest price
    + df.loc[day, 'Price']                # add the newest price
) / MAlength                              # re-average
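This incremental update touches only three values instead of re-summing the whole window, which matters at roughly a million rows per day, and it gives the same result as recomputing the full 90-day rolling mean (up to floating-point drift).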

Take maximum rainfall value for each season over a time period (xarray)

I'm trying to find the maximum rainfall value for each season (DJF, MAM, JJA, SON) over a 10 year period. I am using netcdf data and xarray to try and do this. The data consists of rainfall (recorded every 3 hours), lat, and lon data. Right now I have the following code:
ds.groupby('time.season').max('time')
However, when I do it this way the output has a shape of (4, 145, 192), indicating that it's taking the maximum value for each season over the entire period. I would like the maximum for each individual season of every year. In other words, the output should have a shape like (40, 145, 192) (4 values for each year × 10 years).
I've also looked into doing this with Dataset.resample using time='3M' as the frequency, but then it doesn't split the months up correctly. If I have to, I can alter the dataset so it starts in the right place, but I was hoping there would be an easier way, considering there's already a function to group it correctly.
Thanks, and let me know if you need any more details!
Resample is going to be the easiest tool for this job. You are close with the time frequency but you probably want to use the quarterly frequency with an offset:
ds.resample(time='QS-Mar').max('time')
These offsets can be further configured as described in the Pandas documentation: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
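A short sketch of how this looks end to end, assuming ds holds the 3-hourly rainfall with a 'time' coordinate as described:

# 'QS-Mar' starts quarters in March, June, September and December, which
# line up with the MAM, JJA, SON and DJF seasons; ten years of data then
# yields about 40 time slices, i.e. the (40, 145, 192) shape asked for.
seasonal_max = ds.resample(time='QS-Mar').max('time')
print(seasonal_max.sizes)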
