Pandas Rolling Correlation Introduces Gaps - python

I have a relatively clean data set with two columns and no gaps; a snapshot is shown below:
I run the following line of code:
correlation = pd.rolling_corr(data['A'], data['B'], window=120)
and for some reason, this outputs a dataframe (shown as a plot below) with large gaps in it:
I haven't personally seen this issue before, and after reviewing the data (more so than the code) I am not sure what the cause could be.

This happens because of missing dates in the time series (weekends etc.); evidence of this in your example is the jump from 7/2/2003 to 10/2/2003. One solution is to fill in these gaps by re-indexing the time series dataframe:
df.index = pd.DatetimeIndex(df.index) # required
df = df.asfreq('D') # reindex will include missing days
df = df.fillna(method='bfill') # fill / interpolate NaNs
corr = df.A.rolling(30).corr(df.B) # no gaps
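Note that pd.rolling_corr and fillna(method='bfill') are deprecated or removed in newer pandas; a roughly equivalent sketch for current versions (assuming the same A and B columns and the window of 120 from the question) would be:
import pandas as pd

df.index = pd.DatetimeIndex(df.index)       # make sure the index is a DatetimeIndex
df = df.asfreq('D')                         # re-index to daily frequency, inserting missing days as NaN
df = df.bfill()                             # back-fill the inserted NaNs
corr = df['A'].rolling(120).corr(df['B'])   # rolling correlation with no gaps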

You are getting NaN values in your correlation variable wherever fewer observations than the window size are available.
import pandas as pd
import numpy as np
data = pd.DataFrame({'A':np.random.randn(10), 'B':np.random.randn(10)})
correlation = pd.rolling_corr(data['A'], data['B'], window=3)
print(correlation)
0 NaN
1 NaN
2 0.852602
3 0.020681
4 -0.915110
5 -0.741857
6 0.173987
7 0.874049
8 -0.874258
9 -0.835340
The docs for this function warn about this in the min_periods parameter section: "Minimum number of observations in window required to have a value (otherwise result is NA)."
It seems the default of None does not behave the way you might expect, since you'd think you wouldn't see the NaN values unless you set a value for this.
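If you don't want those leading NaN rows, you can set min_periods yourself; a small sketch using the newer .rolling() API (the old pd.rolling_corr accepted a min_periods argument too, as quoted above):
import numpy as np
import pandas as pd

data = pd.DataFrame({'A': np.random.randn(10), 'B': np.random.randn(10)})

# With min_periods=2 the correlation is computed as soon as two pairs are
# available, so only the very first row is NaN instead of the first window-1 rows.
correlation = data['A'].rolling(window=3, min_periods=2).corr(data['B'])
print(correlation)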

Related

pandas: calculate the daily average, grouped by label

I want to create a graph with lines represented by my label, so in the example picture each line represents a distinct label.
The data looks something like this, where the x-axis is the datetime and the y-axis is the count:
datetime, count, label
1656140642, 12, A
1656140643, 20, B
1656140645, 11, A
1656140676, 1, B
Because I have a lot of data, I want to aggregate it by 1 hour or even 1 day chunks.
I'm able to generate the above picture with
# df is dataframe here, result from pandas.read_csv
df.set_index("datetime").groupby("label")["count"].plot
and I can get a time-range average with
df.set_index("datetime").groupby(pd.Grouper(freq='2min')).mean().plot()
but I'm unable to get both rules applied. Can someone point me in the right direction?
You can use the .pivot function (documentation) to create a convenient structure where datetime is the index and the different labels are the columns, with count as the values:
df.set_index('datetime').pivot(columns='label', values='count')
output:
label            A     B
datetime
1656140642    12.0   NaN
1656140643     NaN  20.0
1656140645    11.0   NaN
1656140676     NaN   1.0
Now that you have your data in this format, you can perform a simple aggregation over the index (with groupby / resample / whatever suits you), and it will be applied to each column separately. Plotting the results then just means plotting a different line for each column.
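For example, a minimal sketch of the full pipeline, assuming the datetime column holds Unix timestamps in seconds as in the sample above:
import pandas as pd

# Sample data in the same shape as the question
df = pd.DataFrame({
    'datetime': [1656140642, 1656140643, 1656140645, 1656140676],
    'count': [12, 20, 11, 1],
    'label': ['A', 'B', 'A', 'B'],
})

# Convert epoch seconds to real timestamps so we can resample by time
df['datetime'] = pd.to_datetime(df['datetime'], unit='s')

# One column per label, with datetime as the index
wide = df.set_index('datetime').pivot(columns='label', values='count')

# Aggregate each label column into 1-hour buckets and plot one line per column
wide.resample('1h').mean().plot()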

Plotting a financial graph in Python with mplfinance having a NaN column

I want to plot a candlestick graph with additional indicators, where one of the indicators can be all NaN. I'm using the matplotlib-based Python utility mplfinance for this. Mplfinance takes one parameter as the main data for building the candlesticks and a second parameter with an array of additional indicator values. When I implement a custom indicator, I first add an empty column filled with NaN to the indicators array and then fill it in a loop under a condition. It can happen that the whole column is still all NaN after the loop, in which case I get an error and can't plot the graph.
df = pd.DataFrame.from_dict(pd.json_normalize(newBars), orient='columns')
idf = df.copy()
idf = idf.iloc[:, [0]]
idf.columns = ['Col0']
idf = idf.assign(Col1=float('nan'))  # Now we add column #1 (assign returns a new frame)
for i in range(len(idf) - 1):
    if a > b:  # Some condition I use to calculate Col1
        idf.iat[i, 1] = float_value
indicators = [
    mpf.make_addplot(idf['Col0'], color='grey', width=1, panel=0),
    mpf.make_addplot(idf['Col1'], color='g', type='scatter', markersize=50, marker='^', panel=0),
]
mpf.plot(df, type='candle', style='yahoo', volume=True, addplot=indicators,
         figscale=1.1, figratio=(8, 5), panel_ratios=(2, 1))
From the code there is a chance that Col1 can be all NaN, and in this case I get the following error:
ValueError: zero-size array to reduction operation maximum which has no identity
How can I avoid this error and just plot the graph without the NaN columns, even if such a column exists in the array?
Mplfinance is designed this way on purpose: if you pass all-NaN data to mpf.make_addplot() you are in effect saying "plot nothing". You can easily test whether you have any data before adding the make_addplot() call to your list of addplot indicators.
Yes, it might make your code simpler if you could just pass the indicators without having to check whether your model actually "indicated" anything. However, (1) this would make the mplfinance code have to do the check, increasing (albeit very slightly) the cost of maintaining the mplfinance library, and (2) it could be that you passed all-NaN values by mistake, in which case, if mplfinance simply ignored the data, you might spend a lot of time debugging to figure out why your indicator is not showing up on the plot.
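For example, a minimal sketch of that test, reusing the variable and column names from the question's code:
# Start with the indicator that always has data
indicators = [mpf.make_addplot(idf['Col0'], color='grey', width=1, panel=0)]

# Only add the custom indicator if it contains at least one real value
if idf['Col1'].notna().any():
    indicators.append(
        mpf.make_addplot(idf['Col1'], color='g', type='scatter',
                         markersize=50, marker='^', panel=0)
    )

mpf.plot(df, type='candle', style='yahoo', volume=True, addplot=indicators,
         figscale=1.1, figratio=(8, 5), panel_ratios=(2, 1))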
For further discussion, see: https://github.com/matplotlib/mplfinance/issues/259#issuecomment-688429294

Find Gaps in a Pandas Dataframe

I have a DataFrame with a column for minutes and a correlated value. The sampling frequency is about 79 seconds, but sometimes data is missing for a period (no rows at all). I want to detect whether there is a gap of 25 or more minutes and, if so, delete the dataset.
How do I test whether such a gap exists?
The dataframe looks like this:
INDEX minutes data
0 23.000 1.456
1 24.185 1.223
2 27.250 0.931
3 55.700 2.513
4 56.790 1.446
... ... ...
So there is an irregular but short gap and one that exceeds 25 minutes. In this case I want the dataset to be empty.
I am quite new to Python, especially to Pandas so an explanation would be helpful to learn.
You can use numpy.roll to create a column with shifted values (the last value wraps around to the front, and every other value shifts down one position: the first value from the original column becomes the second value, the second becomes the third, etc.):
import pandas as pd
import numpy as np
df = pd.DataFrame({'minutes': [23.000, 24.185, 27.250, 55.700, 56.790]})
np.roll(df['minutes'], 1)
# output: array([56.79 , 23. , 24.185, 27.25 , 55.7 ])
Add this as a new column to your dataframe and subtract the new column from the original column.
We also drop the first row beforehand, since we don't want the difference between your first timepoint in the original column and the last timepoint that got rolled to the start of the new column.
Then we just ask if any of the values resulting from the subtraction is above your threshold:
df['rolled_minutes'] = np.roll(df['minutes'], 1)
dropped_df = df.drop(index=0)
diff = dropped_df['minutes'] - dropped_df['rolled_minutes']
(diff > 25).any()
# output: True
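If such a gap is found, the dataset can then be emptied as the question asks, for example:
# Drop every row (keeping the columns) when a gap of more than 25 minutes exists
if (diff > 25).any():
    df = df.iloc[0:0]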

Python interpolate throws no errors - but also does nothing

I am trying some DataFrame manipulation in Pandas that I have learnt. The dataset that I am playing with is from the EY Data Science Challenge.
This first part may be irrelevant but just for context - I have gone through and set some indexes:
import pandas as pd
import numpy as np
# loading the main dataset
df_main = pd.read_csv(filename)
'''Sorting Indexes'''
# getting rid of the id column
del df_main['id']
# sorting values by the LOCATION and TIME columns
df_main = df_main.sort_values(['LOCATION', 'TIME'])
# setting the index to LOCATION (1st tier) then TIME (2nd tier) and re-sorting
df_main = df_main.set_index(['LOCATION', 'TIME']).sort_index()
The problem I have is with the missing values - I have decided that columns 7 ~ 18 can be interpolated because a lot of the data is very consistent year by year.
So I made a simple function to take in a list of columns and apply the interpolate function for each column.
'''Missing Values'''
x = df_main.groupby("LOCATION")

def interpolate_columns(list_of_column_names):
    for column in list_of_column_names:
        df_main[column] = x[column].apply(lambda x: x.interpolate(how='linear'))

interpolate_columns(list(df_main.columns[7:18]))
However, the problem I am getting is that one of the columns (Access to electricity (% of urban population with access) [1.3_ACCESS.ELECTRICITY.URBAN]) seems not to be interpolating, while all the other columns interpolate successfully.
I get no errors thrown when I run the function, and it is not trying to interpolate backwards either.
Any ideas regarding why this problem is occurring?
EDIT: I should also mention that the column in question was missing the same number of values - and in the same rows - as many of the other columns that interpolated successfully.
After looking at the data more closely, it seems like interpolate was not working on some columns because I was missing data at the first rows of the group in the groupby object.
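One possible fix (a sketch based on the same groupby setup; limit_direction needs a reasonably recent pandas) is to let interpolate fill the leading gaps as well:
def interpolate_columns(list_of_column_names):
    for column in list_of_column_names:
        # limit_direction='both' also fills NaNs that sit before the first
        # observed value in a group, which plain forward interpolation cannot reach
        df_main[column] = x[column].apply(
            lambda s: s.interpolate(method='linear', limit_direction='both')
        )

interpolate_columns(list(df_main.columns[7:18]))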

Pandas.DataFrame - find the oldest date for which a value is available

I have a pandas.DataFrame object containing 2 time series. One series is much shorter than the other.
I want to determine the oldest date for which data is available in the shorter series, and remove the data in both columns before that date.
What is the most pythonic way to do that?
(I apologize that I don't really follow the SO guideline for submitting questions)
Here is a fragment of my dataframe:
osr go
Date
1990-08-17 NaN 239.75
1990-08-20 NaN 251.50
1990-08-21 352.00 265.00
1990-08-22 353.25 274.25
1990-08-23 351.75 290.25
In this case, I want to get rid of all rows before 1990-08-21. (I should add that there may be NAs in one of the columns for more recent dates.)
You can reverse the series with df['osr'][::-1], use isnull().idxmax() to find the newest date for which osr is still NaN, and then take the subset of df after that date:
print(df)
# osr go
#Date
#1990-08-17 NaN 239.75
#1990-08-20 NaN 251.50
#1990-08-21 352.00 265.00
#1990-08-22 353.25 274.25
#1990-08-23 351.75 290.25
s = df['osr'][::-1]
print(s)
#Date
#1990-08-23 351.75
#1990-08-22 353.25
#1990-08-21 352.00
#1990-08-20 NaN
#1990-08-17 NaN
#Name: osr, dtype: float64
maxnull = s.isnull().idxmax()
print(maxnull)
#1990-08-20 00:00:00
print(df[df.index > maxnull])
# osr go
#Date
#1990-08-21 352.00 265.00
#1990-08-22 353.25 274.25
#1990-08-23 351.75 290.25
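A shorter equivalent for this data (a sketch, not part of the original answer) uses first_valid_index to get the oldest date with a value directly:
start = df['osr'].first_valid_index()  # first index label with a non-NaN value
print(df[df.index >= start])
# osr go
#Date
#1990-08-21 352.00 265.00
#1990-08-22 353.25 274.25
#1990-08-23 351.75 290.25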
EDIT: New answer based upon comments/edits
It sounds like the data is sequential and once you have lines that don't have data you want to throw them out. This can be done easily with dropna.
df = df.dropna()
This answer assumes that once you are past the bad rows, they stay good, or that you don't care about dropping rows in the middle; it depends on how sequential you need the data to be. If the data needs to be sequential and your input is well formed, jezrael's answer is good.
Original answer
You haven't given much here by way of structure in your dataframe, so I am going to make assumptions. I'll assume you have many columns, two of which, time_series_1 and time_series_2, are the ones you referred to in your question, and that this is all stored in df.
First we can find the shorter series (the one with fewer non-NaN values) by comparing counts:
shorter_col = df['time_series_1'] if df['time_series_1'].count() < df['time_series_2'].count() else df['time_series_2']
Now we want the oldest date for which that series has data:
remove_date = shorter_col.first_valid_index()
Now we want to remove the rows before that date:
df = df[df.index >= remove_date]
