Filling gaps based on gap length - python

I am currently working with financial data, specifically with missing financial data. What I'm trying to do is fill the gaps based on gap length, for example:
- if the gap is shorter than 5 consecutive NaNs, then interpolate
- if the gap is longer than 5 consecutive NaNs, then fill with values from a different series
So what I am trying to accomplish here is a function that scans a series for NaNs, gets the length of each gap and then fills it appropriately. I want to push as much as I can onto pandas/numpy operations and not do it in loops etc.
Below is just an example; it is not optimal at all:
import numpy as np
import pandas as pd

ser = pd.Series(np.sort(np.random.uniform(size=100)))
ser[48:52] = None
ser[10:20] = None

def count(a):
    # write the length of each NaN run into the positions of that run
    tmp = 0
    for i in range(len(a)):
        current = a[i]
        if not np.isnan(current) and tmp > 0:
            a[(i - tmp):i] = tmp
            tmp = 0
        if np.isnan(current):
            tmp = tmp + 1

g = ser.copy()
count(g)
g[g < 1] = 0                      # non-NaN positions become 0
df = pd.DataFrame(ser, columns=['ser'])
df['group'] = g                   # gap length at each NaN position
Now we want to interpolate where the gap is < 10 and put something else where the gap is > 9:
df['ready'] = df.loc[df.group < 10, 'ser'].interpolate(method='linear')
df.loc[df.group > 9, 'ready'] = 100
To sum up, 2 questions:
- can pandas do this in a robust way?
- if not, what can you suggest to make my approach more robust and faster? Let's focus on 2 points here: first, there is the loop over the series - it will take ages once I have, say, 100 series with gaps. Maybe something like Numba? Second, I'm interpolating on copies; any suggestions on how to do it in place?
Thanks for having a look.

You could leverage interpolate's limit parameter:
df['ready'] = df.loc[df.group < 10, 'ser'].interpolate(method='linear', limit=9)
limit : int, default None
    Maximum number of consecutive NaNs to fill.
Then run interpolate() a second time with a different method, or even run fillna().
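For example, a minimal sketch of that two-pass idea, working directly on the raw series (the 999 placeholder just stands in for the other data source; note that limit also fills the first 9 values of longer gaps, so this only approximates a strict gap-length rule):
# first pass: linear interpolation across at most 9 consecutive NaNs
df['ready'] = df['ser'].interpolate(method='linear', limit=9)
# second pass: anything still NaN sat in a longer gap; fill it from elsewhere
df['ready'] = df['ready'].fillna(999)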

After a lengthy search it turns out there is no built-in way of doing fillna based on gap length.
Conclusion: one can use the code from the question; the idea will work.
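That said, the run-length labelling done by count() in the question can be vectorized with the usual shift/cumsum trick; a sketch (not from the original answers):
is_na = ser.isna()
# a new block starts whenever the NaN flag flips
block = (is_na != is_na.shift()).cumsum()
# gap length at every NaN position, 0 elsewhere (equivalent to g in the question)
gap_len = is_na.groupby(block).transform('sum') * is_na
With gap_len in hand, the same masks as in the question (gap_len < 10, gap_len > 9) can be applied without the Python loop.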

Related

Pandas dataframe sum of row won't let me use result in equation

Can anybody help me understand why the code below doesn't work?
import yfinance as yf

start_date = '1990-01-01'
ticker_list = ['SPY', 'QQQ', 'IWM', 'GLD']
tickers = yf.download(ticker_list, start=start_date)['Close'].dropna()
ticker_vol_share = (tickers.pct_change().rolling(20).std()) \
                   / ((tickers.pct_change().rolling(20).std()).sum(axis=1))
Both tickers.pct_change().rolling(20).std() and tickers.pct_change().rolling(20).std().sum(axis=1) run fine by themselves, but when run together they produce a dataframe with thousands of columns, all filled with NaN.
Try this:
rolling_std = tickers.pct_change().rolling(20).std()
ticker_vol_share = rolling_std.apply(lambda row: row / sum(row), axis=1)
You will get a DataFrame in which each row of rolling standard deviations is divided by that row's sum, so each row adds up to 1.
Why it's not working as expected:
Your tickers object is a DataFrame, and so are tickers.pct_change() and tickers.pct_change().rolling(20).std(). tickers.pct_change().rolling(20).std().sum(axis=1), however, is a Series.
You're therefore doing element-wise division of a DataFrame by a Series. This yields a DataFrame, and the division aligns the Series' index against the DataFrame's columns.
Without seeing your source data it's hard to say for sure why the output DataFrame is filled with NaN, but that can certainly happen if some of the values you're dividing by are 0. It might also happen if each series is only one element long after taking the rolling window, or if tickers is actually a Series rather than a DataFrame, since Series.sum(axis=1) doesn't make much sense. It is also suspicious that the top and bottom parts of the division probably have different shapes, since sum() collapses an axis.
It's not clear to me what your expected output is, so I'll defer to others or wait for an update before answering that part.
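As an illustration of that alignment point (a sketch, not part of the original answer): the plain / operator aligns the Series' index (dates) against the DataFrame's columns (tickers), which produces a union of columns full of NaN, whereas div(..., axis=0) aligns on the row index:
rolling_std = tickers.pct_change().rolling(20).std()
# divide each row by its own sum, aligning on the date index rather than on columns
ticker_vol_share = rolling_std.div(rolling_std.sum(axis=1), axis=0)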

How to speed up a pandas loop?

I have a pandas dataframe named 'matrix', it looks like this:
  antecedent_sku consequent_sku  similarity
0            001            002         0.3
1            001            003         0.2
2            001            004         0.1
3            001            005         0.4
4            002            001         0.4
5            002            003         0.5
6            002            004         0.1
Out of this dataframe I want to create a similarity matrix for further clustering. I do it in two steps.
Step 1: to create an empty similarity matrix ('similarity')
set_name = set(matrix['antecedent_sku'].values)
similarity = pd.DataFrame(index = list(set_name), columns = list(set_name))
Step 2: to fill it with values from 'matrix':
for ind in tqdm(list(similarity.index)):
    for col in list(similarity.columns):
        if ind == col:
            similarity.loc[ind, col] = 1
        elif len(matrix.loc[(matrix['antecedent_sku'].values==f'{ind}') & (matrix['consequent_sku'].values==f'{col}'), 'similarity'].values) < 1:
            similarity.loc[ind, col] = 0
        else:
            similarity.loc[ind, col] = matrix.loc[(matrix['antecedent_sku'].values==f'{ind}') & (matrix['consequent_sku'].values==f'{col}'), 'similarity'].values[0]
The problem: it takes 4 hours to fill a matrix of shape (3000,3000).
The question: what am I doing wrong? Should I aim at speeding up the code with something like Cython/Numba, or does the problem lie in the architecture of my approach, and should I use built-in functions or some other clever way to transform 'matrix' into 'similarity' instead of a double loop?
P.S. I run Python 3.8.7
Iterating over a pandas dataframe using loc is known to be very slow, and the CPython interpreter is slow as well (especially for loops). Every pandas operation has a high overhead. The main point, however, is that you iterate over 3000x3000 elements and, for each element, evaluate things like matrix['antecedent_sku'].values==f'{ind}', which itself iterates over 3000 items that are strings - an inefficient datatype, since the processor needs to parse a variable-length UTF-8 sequence of characters. This is done twice per iteration, and a new string is formatted for each comparison, so roughly 3000*3000*3000*2 = 54_000_000_000 string comparisons are performed, with overall 3000*3000*3000*2*2*3 = 324_000_000_000 characters to (inefficiently) compare. There is no chance this can be fast. Not to mention that each of the 9_000_000 iterations creates/deletes several temporary arrays and pandas objects.
The first thing to do is to reduce the number of recomputed operations with some precomputation. You can store the values of matrix['antecedent_sku'].values==f'{ind}' (as Numpy arrays, since pandas Series are inefficient) in a dictionary indexed by ind, so they can be fetched quickly inside the loop. This should make that part about 3000 times faster (since there are only 3000 distinct values). Even better: you can use a groupby to do this more efficiently.
Moreover, you can convert the columns (i.e. antecedent_sku and consequent_sku) to integers so as to avoid many expensive string comparisons.
Then you can remove unnecessary operations such as matrix.loc[..., 'similarity'].values. Since you only want to know the length of the result, you can use np.sum on the boolean numpy array; in fact, you can even use np.any, since you only check whether the length is less than 1.
Then you can avoid creating temporary Numpy arrays by preallocating a buffer and passing it as the output of Numpy operations. For example, you can use np.logical_and(A, B, out=your_preallocated_buffer) instead of just A & B.
Finally, if (and only if) all the previous steps are not enough to make the overall computation hundreds or thousands of times faster, you can use Numba, converting your dataframe to a Numpy array first (since Numba does not support dataframes). If this is still not enough, you can use prange (instead of range) together with Numba's parallel=True flag so as to use multiple threads.
Please note that pandas is not really designed to manipulate dataframes with 3000 columns and will certainly not be very fast at it. Numpy is better suited for manipulating matrices.
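For completeness, a fully vectorized sketch of the whole transformation (not from the original answer; it assumes each (antecedent_sku, consequent_sku) pair appears at most once in matrix, as in the example):
import numpy as np
import pandas as pd

# pivot the long-format pairs straight into a square matrix
similarity = matrix.pivot(index='antecedent_sku',
                          columns='consequent_sku',
                          values='similarity')
# make it square over the full SKU set; missing pairs become 0
skus = sorted(set(matrix['antecedent_sku']) | set(matrix['consequent_sku']))
similarity = similarity.reindex(index=skus, columns=skus).fillna(0)
# self-similarity is 1 on the diagonal
vals = similarity.to_numpy(copy=True)
np.fill_diagonal(vals, 1.0)
similarity = pd.DataFrame(vals, index=skus, columns=skus)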
Following Jerome's lead with a dictionary, I've done the following:
Step 1: to create a dictionary
matrix_dict = matrix.copy()
matrix_dict = matrix_dict.set_index(['antecedent_sku', 'consequent_sku'])['similarity'].to_dict()
matrix_dict looks like this (with the SKUs converted to integers, per the advice above):
{(1, 2): 0.3}
Step 2: to fill similarity with values from matrix_dict
for ind in tqdm(list(similarity.index)):
    for col in list(similarity.columns):
        if ind == col:
            similarity.loc[ind, col] = 1
        else:
            similarity.loc[ind, col] = matrix_dict.get((int(ind), int(col)))
Step 3: fillna with zeroes
similarity = similarity.fillna(0)
Result: roughly a 35x speedup (from 4 hours 20 minutes down to 7 minutes).

Best method for non-regular index-based interpolation on grouped dataframes

Problem statement
I have the following problem:
I have samples on which independent tests were run. In my dataframe, tests with the same "test name" on different samples are also independent, so each couple (test, sample) is independent and unique.
Data are collected at non-regular sampling rates, so we are dealing with unequally spaced indices. This "time series" index is called nonreg_idx in the example below; for the sake of simplicity it is a float between 0 and 1.
I want to figure out what the value is at a specific index, e.g. at nonreg_idx=0.5. If the value is missing, I just want a linear interpolation that depends on the index. If extrapolation would be required because the target lies outside the range of the group's (test, sample) sorted nonreg_idx, it can be left as NaN.
Note the following from pandas documentation:
Please note that only method='linear' is supported for
DataFrame/Series with a MultiIndex.
’linear’: Ignore the index and treat the values as equally spaced.
This is the only method supported on MultiIndexes.
The only solution I found is long, complex and slow. I am wondering if I am missing something or if, on the contrary, something is missing from the pandas library. I believe having independent tests on various samples with non-regular indices is a typical situation in scientific and engineering fields.
What I tried
sample data set preparation
This part is just for making an example
import pandas as pd
import numpy as np

tests = (f'T{i}' for i in range(20))
samples = (chr(i) for i in range(97, 120))
idx = pd.MultiIndex.from_product((tests, samples), names=('tests', 'samples'))

dfs = list()
for ids in idx:
    group_idx = pd.MultiIndex.from_product(
        ((ids[0],), (ids[1],), tuple(np.random.random_sample(size=(90,))))
    ).sort_values()
    dfs.append(pd.DataFrame(1000 * np.random.random_sample(size=(90,)), index=group_idx))
df = pd.concat(dfs)
df = df.rename_axis(index=('test', 'sample', 'nonreg_idx')).rename({0: 'value'}, axis=1)
The (bad) solution
add_missing = df.index.droplevel('nonreg_idx').unique().to_frame().reset_index(drop=True)
add_missing['nonreg_idx'] = .5
add_missing = pd.MultiIndex.from_frame(add_missing)
added_missing = df.reindex(add_missing)
df_full = pd.concat([added_missing.loc[~added_missing.index.isin(df.index)], df])
df_full.sort_index(inplace=True)
def interp_fnc(group):
    try:
        return (group.reset_index(['test', 'sample'])
                     .interpolate(method='slinear')
                     .set_index(['test', 'sample'], append=True)
                     .reorder_levels(['test', 'sample', 'nonreg_idx'])
                     .sort_index())
    except:
        return group

grouped = df_full.groupby(level=['test', 'sample'])
df_filled = grouped.apply(interp_fnc)
Here, the wanted values are in df_filled. So I can do df_filled.loc[(slice(None), slice(None), .5),'value'] to get what I need for each sample/test.
I would have expected to be able to do the same in 1, or at most 2, lines of code; I have 14 here. apply is quite a slow method and I can't even use numba.
Question
Can someone propose a better solution?
If you think there is no better alternative, please comment and I'll open an issue...
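Not an answer from the original thread, but one possible more compact direction under the same assumptions as the example data (it still goes through a groupby apply): use np.interp per (test, sample) group and return NaN when the target lies outside the group's observed range.
import numpy as np

target = 0.5

def value_at(group, x=target):
    xs = group.index.get_level_values('nonreg_idx').to_numpy()
    ys = group['value'].to_numpy()
    # leave NaN instead of extrapolating outside the observed range
    if x < xs.min() or x > xs.max():
        return np.nan
    return np.interp(x, xs, ys)

result = df.groupby(level=['test', 'sample']).apply(value_at)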

How to optimize code that iterates on a big dataframe in Python

I have a big pandas dataframe. It has thousands of columns and over a million rows. I want to calculate the difference between the max value and the min value row-wise. Keep in mind that there are many NaN values and some rows are all NaN values (but I still want to keep them!).
I wrote the following code. It works but it's time consuming:
totTime = []
for index, row in date.iterrows():
    myRow = row.dropna()
    if len(myRow):
        tt = max(myRow) - min(myRow)
    else:
        tt = None
    totTime.append(tt)
Is there any way to optimize it? I tried with the following code but I get an error when it encounters all NaN rows:
tt = lambda x: max(x.dropna()) - min(x.dropna())
totTime = date.apply(tt, axis=1)
Any suggestions will be appreciated!
It is usually a bad idea to use a Python for loop to iterate over a large pandas.DataFrame or numpy.ndarray. You should rather use the available built-in functions on them, as they are optimized and in many cases not actually written in Python but in a compiled language. In your case you should use the methods pandas.DataFrame.max and pandas.DataFrame.min, which both offer a skipna option to skip NaN values in your DataFrame without the need to drop them manually. Furthermore, you can choose an axis to reduce along; specify axis=1 to take the minimum and maximum across the columns of each row.
This adds up to something similar to what EdChum mentioned in the comments:
data.max(axis=1, skipna=True) - data.min(axis=1, skipna=True)
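For instance, on a tiny frame with an all-NaN row (the column names here are just illustrative), the all-NaN row simply yields NaN instead of raising, so every row is kept:
import numpy as np
import pandas as pd

data = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                     'b': [5.0, np.nan, np.nan]})
tot_time = data.max(axis=1, skipna=True) - data.min(axis=1, skipna=True)
# row 0 -> 4.0, row 1 (all NaN) -> NaN, row 2 -> 0.0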
I had the same problem with iterating. Two points:
Why don't you replace the NaN values with 0? You can do it with df.replace([np.inf, -np.inf, np.nan], 0), which replaces inf and NaN values.
Take a look at this. Maybe it helps; I have a similar question about how to optimize a loop that calculates the difference between each row and the previous one.

Finding rows by their difference in Pandas dataframe

I have a data frame in which I want to identify all pairs of rows whose time value t differs by a fixed amount, say diff.
In [8]: df.t
Out[8]:
0 143.082739
1 316.285739
2 344.315561
3 272.258814
4 137.052583
5 258.279331
6 114.069608
7 159.294883
8 150.112371
9 181.537183
...
For example, if diff = 22.2423, then we would have a match between rows 4 and 7.
The obvious way to find all such matches is to iterate over each row and apply a filter to the data frame:
for t in df.t:
    matches = df[abs(df.t - (t + diff)) < EPS]
    # log matches
But as I have a lot of values (10,000+), this will be quite slow.
Further, I want to check whether any differences of a multiple of diff exist. For instance, rows 4 and 9 differ by 2 * diff in my example. So my code takes a long time.
Does anyone have any suggestions on a more efficient technique for this?
Thanks in advance.
Edit: Thinking about it some more, the question boils down to finding an efficient way to find floating-point numbers contained in two lists/Series objects, to within some tolerance.
If I can do this, then I can simply compare df.t, df.t - diff, df.t - 2 * diff, etc.
If you want to check many multiples, it might be best to work modulo diff: values whose times differ by a multiple of diff have (nearly) the same residue, so you can compare the residues within your tolerance.
Whether you use modulo or not, the efficient way to compare floats within some tolerance is numpy.allclose (or numpy.isclose for an element-wise result). In versions before 1.8, call it as numpy.testing.allclose.
So far what I've described still involves looping over rows, because you must compare each row to every other. A better, but slightly more involved, approach would use scipy.spatial.cKDTree to query all pairs within a given distance (tolerance).
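To make that combination concrete, here is an illustrative sketch (the t values are taken from the question's example; the rest is assumed): reduce the times modulo diff, then let cKDTree find all residue pairs within the tolerance.
import numpy as np
from scipy.spatial import cKDTree

EPS = 1e-4
diff = 22.2423
t = np.array([143.082739, 316.285739, 344.315561, 272.258814, 137.052583,
              258.279331, 114.069608, 159.294883, 150.112371, 181.537183])

# rows whose residues modulo diff agree differ by (approximately) a multiple of diff
residues = (t % diff).reshape(-1, 1)
pairs = cKDTree(residues).query_pairs(r=EPS)   # here: {(4, 7), (4, 9), (7, 9)}
# caveat: residues just above 0 and just below diff are also "close" (wrap-around),
# so a robust version should additionally check pairs across that boundary.
Rows 4, 7 and 9 pair up, matching the diff and 2 * diff examples from the question.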
