Using Python and Pandas to generate trends from indicators

I'm trying to determine what kinds of corrections a market makes in response to changes.
A simple version of this using Python, Pandas and matplotlib might look like:
import pandas as pd

ts = pd.read_csv("time_series.csv").iloc[:, 0]   # take the single price column as a Series
chng = ts / ts.shift(1)                          # current change relative to the previous tick
chng.name = "current"
f = chng.shift(-1)                               # the same change one tick into the future
f.name = "future"
frame = pd.concat([chng, f], axis=1)
frame.groupby(frame.current.round(2)).future.mean().plot()
For example, if my data set had a strong habit of correcting changes back to the original value within one tick, the output of the code above might show a negative correlation.
The problem with this method is that it can only show what the response is over a fixed time frame (i.e. I can change the amount that the future data set is shifted, but it can only be one value).
What I would like to do is divide the initial values into buckets and show a trendline for how each range of initial values was received over time (1 tick later, 3 ticks later, etc).
How could I go about doing that?
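One possible approach (a sketch, not from the original post): bucket the current change with pd.cut and compute the mean future change at several horizons, giving one trendline per bucket. The file name, bucket count, and horizons below are placeholders.
import pandas as pd

ts = pd.read_csv("time_series.csv").iloc[:, 0]   # hypothetical single price column
chng = ts / ts.shift(1)                          # current change, as above
buckets = pd.cut(chng, bins=20)                  # divide the initial changes into ranges

horizons = [1, 3, 5, 10]                         # ticks into the future to examine
trends = pd.DataFrame({h: chng.shift(-h).groupby(buckets).mean() for h in horizons})
trends.T.plot()                                  # one line per bucket, x-axis = horizon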


Calculating the cumulative bottom values for a stacked bar chart when the length of the array varies

Found how to do it:
I used pandas to group by strike and expiration and sum openInterest; then, after a couple of hours of scratching my head, I learned what .unstack() does and used that.
# group by strike and expiration, sum the open interest, then pivot expirations into columns
y = option_chain.groupby(['strike', 'expirationDate'])['openInterest'].sum().unstack(level=-1)
y.plot.bar(stacked=True)
I am looking to plot a stacked bar chart of options open interest, doing the exact same thing as the person in this link: https://medium.com/#txlian13/webscraping-options-data-with-python-and-yfinance-e4deb0124613 . I have the data from the same source and arranged in the same way.
My problem is that I can't find a way to calculate the bottom argument, so all the bars start at y=0 instead of stacking on top of the previous bar's height.
I tried this code, among other options, but couldn't make it work (exp is a list of all possible expiration dates for the options):
bottom = np.zeros(12)  # using 12 because I am testing with the same stock, so I know my first array needs 12 entries to match the number of strikes for the first date
for i in exp:
    z = option_chain.loc[option_chain['expirationDate'] == i]
    zx = z['strike']
    zy = z['openInterest']
    # printing bottom here shows an array of zeros, so the next bars are plotted from 0
    plt.bar(zx, zy, label=i, alpha=0.7, bottom=bottom)
    bottom += zy
    # printing bottom again here shows the 12 correct open-interest values,
    # but the next iteration raises "ValueError: shape mismatch: objects cannot be broadcast to a single shape"
So my problem is that the strike (my x values) changes with every iteration: for example, my first iteration has 12 values for x and the second one has 9.
So, is there a way to have a variable array that changes with my x? I also realize this leads to another problem: how to match the x's so that each value gets added to the correct strike.
One approach I considered is to find which date has the most strikes and use that as my base, but the date with the most strikes is not guaranteed to contain all the strikes present in the other dates.
If the problem can be easily fixed with another plotting package, I have no issue using that. I am a finance graduate just trying to learn Python, so I only used matplotlib because it's the one with the most learning materials out there.
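For completeness, a hedged sketch (not from the original post) of how the bottom-based loop could be made to work: reindex every expiration onto the union of all strikes, so bottom always has the same shape and each value is added to the correct strike.
import numpy as np
import matplotlib.pyplot as plt

all_strikes = np.sort(option_chain['strike'].unique())
bottom = np.zeros(len(all_strikes))

for i in exp:
    z = option_chain.loc[option_chain['expirationDate'] == i]
    # align onto the full strike grid; strikes missing for this expiration become 0
    zy = z.set_index('strike')['openInterest'].reindex(all_strikes, fill_value=0).values
    plt.bar(all_strikes, zy, label=i, alpha=0.7, bottom=bottom)
    bottom += zy

plt.legend()
plt.show()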

Time-series analysis with Python

So I have sensor-based time series data for a subject, measured at one-second intervals, with the corresponding heart rate at each time point, stored in an Excel file. My goal is to analyze whether there are any trends over time. When I import it into Python, the timestamp column shows up as a plain number rather than a time. However, when imported in Excel, I can convert it into time format easily.
This is what it looks like in Python (column 1 = timestamp, column 2 = heart rate in bpm).
This is what it should look like though:
This is what I tried in order to convert it into datetime format in Python:
import datetime
Time = datetime.datetime.now()
"%s:%s.%s" % (Time.minute, Time.second, str(Time.microsecond)[:2])
if isinstance(Time, datetime.datetime):
    print("Yay!")
df3.set_index('Time', inplace=True)
Time gets recognized as a float64 if I do this, not datetime64[ns].
Consequently, when I try to plot this time series, I get the following:
I even ran the Dickey-Fuller test to analyze trends in this dataset in Python. Does my misconfiguration of the time column actually affect the ADF test? I'm assuming that since only trends in the 'HeartRate' column are analyzed with this code, it shouldn't matter, right?
Here's the code I used:
# Perform Dickey-Fuller test:
from statsmodels.tsa.stattools import adfuller
import pandas as pd

print("Results of Dickey-Fuller Test:")
dftest = adfuller(df3['HeartRate'], autolag='AIC')
dfoutput = pd.Series(dftest[0:4], index=['Test Statistic', 'p-value', '#Lags Used', 'Number of Observations Used'])
for key, value in dftest[4].items():
    dfoutput['Critical Value (%s)' % key] = value
print(dfoutput)
test_stationarity(df3)
Did I do this correctly? I don't have experience in the engineering field and I'm doing this to improve healthcare for older people, so any help will be very much appreciated!
Thanks in advance! :)
It seems that the date format in Excel is expressed as the number of days that have passed since 12/30/1899. To transform the numbers in the timestamp column into seconds, you only need to multiply them by 24*60*60 = 86400 (the number of seconds in one day).
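A minimal sketch of that conversion in pandas, assuming df3['Time'] holds the Excel serial day numbers (pd.to_datetime accepts Excel's 1899-12-30 epoch as the origin):
import pandas as pd

# convert fractional days since Excel's epoch into proper timestamps
df3['Time'] = pd.to_datetime(df3['Time'], unit='D', origin='1899-12-30')
df3.set_index('Time', inplace=True)
df3['HeartRate'].plot()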

Selecting slices of a Pandas Series based on both index and value conditions

I have a Pandas Series which contains acceleration timeseries data. My goal is to select slices of extreme force given some threshold. I was able to get part way with the following:
extremes = series.where(lambda force: abs(force - RESTING_FORCE) >= THRESHOLD, other=np.nan)
Now extremes contains all values which exceed the threshold and NaN for any that don't, maintaining the original index.
However, a secondary requirement is that nearby peaks should be merged into a single event. Visually, you can picture the three extremes on the left (two high, one low) being joined into one complete segment and the two peaks on the right being joined into another complete segment.
I've read through the entire Series reference but I'm having trouble finding methods to operate on my partial dataset. For example, if I had a method that returned an array of non-NaN index ranges, I would be able to sequentially compare each range and decide whether or not to fill in the space between with values from the original series (nearby) or leave them NaN (too far apart).
Perhaps I need to abandon the intermediate step and approach this from an entirely different angle? I'm new to Python so I'm having trouble getting very far with this. Any tips would be appreciated.
It actually wasn't so simple to come up with a vectorized solution that avoids looping.
You'll probably need to go through the code step by step to see the actual outcome of each method, but here is a short sketch of the idea:
Solution outline
Identify all peaks via a simple threshold filter
Get the timestamps of the peak values into a column and forward-fill the gaps in between, to allow comparing the current valid timestamp with the previous one
Do the actual comparison via diff() to get time deltas and apply the time-delta threshold
Convert the booleans to integers and take a cumulative sum to create signal groups
Group by signal and get the min and max timestamp of each group
Example data
Here is the code with a dummy example:
%matplotlib inline
import pandas as pd
import numpy as np
size = 200
# create some dummy data
ts = pd.date_range(start="2017-10-28", freq="d", periods=size)
values = np.cumsum(np.random.normal(size=size)) + np.sin(np.linspace(0, 100, size))
series = pd.Series(values, index=ts, name="force")
series.plot(figsize=(10, 5))
Solution code
# define thresholds
threshold_value = 6
threshold_time = pd.Timedelta(days=10)
# create data frame because we'll need helper columns
df = series.reset_index()
# get all initial peaks below or above threshold
mask = df["force"].abs().gt(threshold_value)
# create variable to store only timestamps of initial peaks
df.loc[mask, "ts_gap"] = df.loc[mask, "index"]
# create forward fill to enable comparison between current and next peak
df["ts_fill"] = df["ts_gap"].ffill()
# apply time delta comparison to filter only those within given time interval
df["within"] = df["ts_fill"].diff() < threshold_time
# convert boolean values into integers and
# create cumulative sum which creates groups of consecutive timestamps
df["signals"] = (~df["within"]).astype(int).cumsum()
# create dataframe containing start and end values
df_signal = (df.dropna(subset=["ts_gap"])
               .groupby("signals")["ts_gap"]
               .agg(["min", "max"]))
# show results
df_signal
                min        max
signals
10       2017-11-06 2017-11-27
11       2017-12-13 2018-01-22
12       2018-02-03 2018-02-23
Finally, show the plot:
series.plot(figsize=(10, 5))
for _, (idx_min, idx_max) in df_signal.iterrows():
    series[idx_min:idx_max].plot()
Result
As you can see in the plot, peaks greater than an absolute value of 6 are merged into a single signal if their last and first timestamps fall within a range of 10 days. The thresholds here are arbitrary, just for illustration purposes; you can change them to whatever suits your data.

Python/Pandas: sort by date and compute two week (rolling?) average

So far I've read in two CSVs and merged them based on a common element. I take the output of the merged CSV and iterate through the unique element they've been merged on. While I have them separated, I want to generate a daily count line and a two-week rolling average from the current date going backward.

I cannot index based on the 'Date Opened' field, but I still need my outputs organized by it with the most recent first. Once these are sorted by date, my daily count plotting issue will be rectified. My remaining task would be to compute a two-week rolling average of the count. I've looked into the Pandas documentation and I think rolling_mean will work, but the parameters of this function don't really make sense to me. I've tried biwk_avg = pd.rolling_mean(open_dt, 28), but that doesn't seem to work. I know there is an easier way to do this, but I think I've hit a roadblock with the documentation available.

The end result should look something like this graph. Right now my daily count graph isn't sorted (even though I think I've instructed it to be) and is unusable in line form.
def data_sort():
    data_merge = data_extract()
    domains = data_merge.groupby('PWx Domain')
    for domain in domains.groups.items():
        dsort = data_merge.loc[domain[1]]
        print(dsort.head())
        open_dt = pd.to_datetime(dsort['Date Opened']).dt.date
        #open_dt.to_csv('output\''+str(domain)+'_out.csv', sep = ',')
        open_ct = open_dt.value_counts(sort=False)
        biwk_avg = pd.rolling_mean(open_ct, 28)
        plt.plot(open_ct, 'bo')
        plt.show()

data_sort()
Rolling mean alone is not enough in your case; you need a combination of resampling (to group the data by days) followed by a 14-day rolling mean (why do you use 28 in your code?). Something like this:
for _, domain in data_merge.groupby('PWx Domain'):
    # Convert the date to the index
    domain.index = pd.to_datetime(domain['Date Opened'])
    # Sort by dates
    domain.sort_index(inplace=True)
    # Do the averaging
    rolling = pd.rolling_mean(domain.resample('1D').mean(), 14)
    plt.plot(rolling, 'bo')
plt.show()
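Note that pd.rolling_mean was removed in later pandas versions; a hedged modern sketch of the same idea, using .size() to get the daily counts the question asks for (assuming the same data_merge frame):
import pandas as pd
import matplotlib.pyplot as plt

for _, domain in data_merge.groupby('PWx Domain'):
    domain = domain.set_index(pd.to_datetime(domain['Date Opened'])).sort_index()
    daily_counts = domain.resample('1D').size()        # tickets opened per day
    rolling = daily_counts.rolling(window=14).mean()   # two-week rolling average
    plt.plot(daily_counts, 'bo')
    plt.plot(rolling)
    plt.show()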

Resampling pandas timeseries without computing a new offset

I'm reading in timeseries data that contains only the available times. This leads to a Series with no missing values, but an unequally spaced index. I'd like to convert this to a Series with an equally spaced index with missing values. Since I don't know a priori what the spacing will be, I'm currently using a function like
min_dt = np.diff(series.index.values).min()
new_spacing = pandas.DateOffset(days=min_dt.days, seconds=min_dt.seconds,
                                microseconds=min_dt.microseconds)
series = series.asfreq(new_spacing)
to compute what the spacing should be (note that this is using Pandas 0.7.3 - the 0.8 beta code looks slightly differently since I have to use series.index.to_pydatetime() for correct behavior with Numpy 1.6).
Is there an easier way to do this operation using the pandas library?
If you want NaNs in the places where there is no data, you can just use Minute(), located in datetools (as of pandas 0.7.x):
from pandas.core.datetools import day, Minute
tseries.asfreq(Minute())
That should provide an evenly spaced time series with 1 minute differences with NaNs as the series values where there is no data.
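pandas.core.datetools has since been removed; a hedged modern equivalent uses an offset or frequency string directly, or infers the spacing from the data:
import pandas as pd

tseries = tseries.asfreq('T')                    # 'T' = one-minute frequency, gaps become NaN
# or, inferring the smallest spacing present instead of hard-coding one minute:
spacing = pd.Series(tseries.index).diff().min()
tseries = tseries.resample(spacing).asfreq()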
