understanding rolling correlation in pandas - python

I am trying to understand how pandas.rolling_corr actually calculates rolling correlations. So far I have always been doing it with numpy. I prefer to use pandas due to the speed and the ease of use, but I cannot get the rolling correlation as it used to do.
I start with two numy arrays:
c = np.array([1,2,3,4,5,6,7,8,9,8,7,6,5,4,3,2,1])
d = np.array([8,9,8])
now I want to calculate the cross-correlation for which length-3-window of my array c. I define a rolling window function:
def rolling_window(a, window):
shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
strides = a.strides + (a.strides[-1],)
return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
and calculate the correlation between each of my generated windows and the second original dataset. This approach works just fine:
for win in rolling_window(c, len(d)):
print(np.correlate(win, d))
Outputs:
[50]
[75]
[100]
[125]
[150]
[175]
[200]
[209]
[200]
[175]
[150]
[125]
[100]
[75]
[50]
If I attempt to solve it with pandas:
a = pd.DataFrame([1,2,3,4,5,6,7,8,9,8,7,6,5,4,3,2,1])
b = pd.DataFrame([8,9,8])
no matter if I use DataFrame rolling_corr:
a.rolling(window=3, center=True).corr(b)
or Pandas rolling_corr:
pd.rolling_corr(a, b, window=1, center=True)
I just get a bunch of NaNs:
0
0 NaN
1 0.0
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
15 NaN
16 NaN
Can someone give me a hand? I am able to solve the problem with numpy by flattening the numpy array obtained from converting the pandas DataFrame
a.values.ravel()
However, I would like to solve the calculation entirely with pandas. I have searched the documentation but haven't found the answer I am looking for. What am I missing or not undrstanding?
Thank you very much in advance.
D.

The computation you're trying to do can be thought of as operating on the following dataframe:
pd.concat([a, b], axis=1)
0 0
0 1 8
1 2 9
2 3 8
3 4 NaN
4 5 NaN
5 6 NaN
6 7 NaN
7 8 NaN
8 9 NaN
9 8 NaN
10 7 NaN
11 6 NaN
12 5 NaN
13 4 NaN
14 3 NaN
15 2 NaN
16 1 NaN
If you're using window=3, it correlates the first three values in b with the first 3 values in a, leaving the rest with NaN, and placing the value in the center of the window (center=True).
You can try:
pd.rolling_apply(a, window=3, func=lambda x: np.correlate(x, b[0]))
Output:
0
0 NaN
1 NaN
2 50
3 75
4 100
5 125
6 150
7 175
8 200
9 209
10 200
11 175
12 150
13 125
14 100
15 75
16 50
You can add center=True here too if you'd like.
(I'm using pandas 0.17.0)

Related

Get sum of values from last nth row by group id

I just want to know how to get the sum of the last 5th values based on id from every rows.
df:
id values
-----------------
a 5
a 10
a 10
b 2
c 2
d 2
a 5
a 10
a 20
a 10
a 15
a 20
expected df:
id values sum(x.tail(5))
-------------------------------------
a 5 NaN
a 10 NaN
a 10 NaN
b 2 NaN
c 2 NaN
d 2 NaN
a 5 NaN
a 10 NaN
a 20 40
a 10 55
a 15 55
a 20 60
For simplicity, I'm trying to find the sum of values from the last 5th rows from every rows with id a only.
I tried to use code df.apply(lambda x: x.tail(5)), but that only showed me last 5 rows from the very last row of the entire df. I want to get the sum of last nth rows from every and each rows. Basically it's like rolling_sum for time series data.
you can calculate the sum of the last 5 as like this:
df["rolling As"] = df[df['id'] == 'a'].rolling(window=5).sum()["values"]
(this includes the current row as one of the 5. not sure if that is what you want)
id values rolling As
0 a 5 NaN
1 a 10 NaN
2 a 10 NaN
3 b 2 NaN
4 c 2 NaN
5 d 5 NaN
6 a 10 NaN
7 a 20 55.0
8 a 10 60.0
9 a 10 60.0
10 a 15 65.0
11 a 20 75.0
If you don't want it included. you can shift
df["rolling"] = df[df['id'] == 'a'].rolling(window=5).sum()["values"].shift()
to give:
id values rolling
0 a 5 NaN
1 a 10 NaN
2 a 10 NaN
3 b 2 NaN
4 c 2 NaN
5 d 5 NaN
6 a 10 NaN
7 a 20 NaN
8 a 10 55.0
9 a 10 60.0
10 a 15 60.0
11 a 20 65.0
Try using groupby, transform, and rolling:
df['sum(x.tail(5))'] = df.groupby('id')['values']\
.transform(lambda x: x.rolling(5, min_periods=5).sum().shift())
Output:
id values sum(x.tail(5))
1 a 5 NaN
2 a 10 NaN
3 a 10 NaN
4 b 2 NaN
5 c 2 NaN
6 d 2 NaN
7 a 5 NaN
8 a 10 NaN
9 a 20 40.0
10 a 10 55.0
11 a 15 55.0
12 a 20 60.0

Remove lesser than K consecutive NaNs from pandas DataFrame

I am working Time Series data. I am facing problem while removing consecutive NaNs less than or equal to threshold from a Data Frame column. I tried looking at some of the links like:
Identifying consecutive NaN's with pandas : Identifies where consecutive NaNs are present and what is count.
Pandas: run length of NaN holes : Outputs run Length encoding for NaNs
There are many more others along this lane, but none of them actually tells how can we remove them after identifying.
I found one similar solution but that is in R :
How to remove more than 2 consecutive NA's in a column?
I want solution in Python.
So here is the example:
Here is my dataframe column:
a
0 36.45
1 35.45
2 NaN
3 NaN
4 NaN
5 37.21
6 35.63
7 36.45
8 34.65
9 31.45
10 NaN
11 NaN
12 36.71
13 35.55
14 NaN
15 NaN
16 NaN
17 NaN
18 37.71
If k = 3, my output should be:
a
0 36.45
1 35.45
2 37.21
3 35.63
4 36.45
5 34.65
6 31.45
7 36.71
8 35.55
9 NaN
10 NaN
11 NaN
12 NaN
13 37.71
How can I go about removing the consecutive NaNs less than or equal to some threshold (k).
There are a few ways, but this is how I've done it:
Determine groups of consecutive numbers using a neat cumsum trick
Use groupby + transform to determine the size of each group
Identify groups of NaNs that are within the threshold
Filter them out with boolean indexing.
k = 3
i = df.a.isnull()
m = ~(df.groupby(i.ne(i.shift()).cumsum().values).a.transform('size').le(k) & i)
df[m]
a
0 36.45
1 35.45
5 37.21
6 35.63
7 36.45
8 34.65
9 31.45
12 36.71
13 35.55
14 NaN
15 NaN
16 NaN
17 NaN
18 37.71
You can perform df = df[m]; df.reset_index(drop=True) step at the end if you want a monotonically increasing integer index.
You can create a indicator column to count the consecutive nans.
k = 3
(
df.groupby(pd.notna(df.a).cumsum())
.apply(lambda x: x.dropna() if pd.isna(x.a).sum() <= k else x)
.reset_index(drop=True)
)
Out[375]:
a
0 36.45
1 35.45
2 37.21
3 35.63
4 36.45
5 34.65
6 31.45
7 36.71
8 35.55
9 NaN
10 NaN
11 NaN
12 NaN
13 37.71

Working with NaN values in multiple columns in Pandas

I have multiple datasets with different number of rows and same number of columns.
I would like to find Nan values in each column for example consider these two datasets:
dataset1 : dataset2:
a b a b
1 10 2 11
2 9 3 12
3 8 4 13
4 nan nan 14
5 nan nan 15
6 nan nan 16
I want to find nan values in two datasets a and b :
if it occurs in column b then remove all the rows that have nan values. and if it occurs in column a then fill that values with 0.
this is my snippet code:
a=pd.notnull(data['a'].values.any())
b= pd.notnull((data['b'].values.any()))
if a:
data = data.dropna(subset=['a'])
if b:
data[['a']] = data[['a']].fillna(value=0)
which does not work properly.
You just need fillna and dropna without control flow
data = data.dropna(subset=['b']).fillna(0)
Pass your condition to a dict
df=df.fillna({'a':0,'b':np.nan}).dropna()
You do not need 'b' here
df=df.fillna({'a':0}).dropna()
EDIT :
df.fillna({'a':0}).dropna()
Out[1319]:
a b
0 2.0 11
1 3.0 12
2 4.0 13
3 0.0 14
4 0.0 15
5 0.0 16

Add column in dataframe from list

I have a dataframe with some columns like this:
A B C
0
4
5
6
7
7
6
5
The possible range of values in A are only from 0 to 7.
Also, I have a list of 8 elements like this:
List=[2,5,6,8,12,16,26,32] //There are only 8 elements in this list
If the element in column A is n, I need to insert the n th element from the List in a new column, say 'D'.
How can I do this in one go without looping over the whole dataframe?
The resulting dataframe would look like this:
A B C D
0 2
4 12
5 16
6 26
7 32
7 32
6 26
5 16
Note: The dataframe is huge and iteration is the last option option. But I can also arrange the elements in 'List' in any other data structure like dict if necessary.
Just assign the list directly:
df['new_col'] = mylist
Alternative
Convert the list to a series or array and then assign:
se = pd.Series(mylist)
df['new_col'] = se.values
or
df['new_col'] = np.array(mylist)
IIUC, if you make your (unfortunately named) List into an ndarray, you can simply index into it naturally.
>>> import numpy as np
>>> m = np.arange(16)*10
>>> m[df.A]
array([ 0, 40, 50, 60, 150, 150, 140, 130])
>>> df["D"] = m[df.A]
>>> df
A B C D
0 0 NaN NaN 0
1 4 NaN NaN 40
2 5 NaN NaN 50
3 6 NaN NaN 60
4 15 NaN NaN 150
5 15 NaN NaN 150
6 14 NaN NaN 140
7 13 NaN NaN 130
Here I built a new m, but if you use m = np.asarray(List), the same thing should work: the values in df.A will pick out the appropriate elements of m.
Note that if you're using an old version of numpy, you might have to use m[df.A.values] instead-- in the past, numpy didn't play well with others, and some refactoring in pandas caused some headaches. Things have improved now.
A solution improving on the great one from #sparrow.
Let df, be your dataset, and mylist the list with the values you want to add to the dataframe.
Let's suppose you want to call your new column simply, new_column
First make the list into a Series:
column_values = pd.Series(mylist)
Then use the insert function to add the column. This function has the advantage to let you choose in which position you want to place the column.
In the following example we will position the new column in the first position from left (by setting loc=0)
df.insert(loc=0, column='new_column', value=column_values)
First let's create the dataframe you had, I'll ignore columns B and C as they are not relevant.
df = pd.DataFrame({'A': [0, 4, 5, 6, 7, 7, 6,5]})
And the mapping that you desire:
mapping = dict(enumerate([2,5,6,8,12,16,26,32]))
df['D'] = df['A'].map(mapping)
Done!
print df
Output:
A D
0 0 2
1 4 12
2 5 16
3 6 26
4 7 32
5 7 32
6 6 26
7 5 16
Old question; but I always try to use fastest code!
I had a huge list with 69 millions of uint64. np.array() was fastest for me.
df['hashes'] = hashes
Time spent: 17.034842014312744
df['hashes'] = pd.Series(hashes).values
Time spent: 17.141014337539673
df['key'] = np.array(hashes)
Time spent: 10.724546194076538
You can also use df.assign:
In [1559]: df
Out[1559]:
A B C
0 0 NaN NaN
1 4 NaN NaN
2 5 NaN NaN
3 6 NaN NaN
4 7 NaN NaN
5 7 NaN NaN
6 6 NaN NaN
7 5 NaN NaN
In [1560]: mylist = [2,5,6,8,12,16,26,32]
In [1567]: df = df.assign(D=mylist)
In [1568]: df
Out[1568]:
A B C D
0 0 NaN NaN 2
1 4 NaN NaN 5
2 5 NaN NaN 6
3 6 NaN NaN 8
4 7 NaN NaN 12
5 7 NaN NaN 16
6 6 NaN NaN 26
7 5 NaN NaN 32

Multi-indexed dataframe: Setting values

I already asked a related question earlier, but I didn't want to start a comment-and-edit-discussion. So here's -boiled down - what the answer to my earlier question lead me to ask. Consider
import pandas as pd
from numpy import arange
from scipy import random
index = pd.MultiIndex.from_product([arange(0,3), arange(10,15)], names=['A', 'B'])
df = pd.DataFrame(columns=['test'], index=index)
someValues = random.randint(0, 10, size=5)
df.loc[0, 'test'], df.loc[0,:] and df.ix[0] all create a representation of a part of the data frame, the first one being a Series and the other two being df slices. However
df.ix[0] = df.loc[0,'test'] = someValues sets the value for the df
df.loc[0,'test'] = someValues gives an error ValueError: total size of new array must be unchanged
df.loc[0,:] = someValues is being ignored. No error, but the df does not contain the numpy array.
I skimmed the docs but there was no clear logical and systematical explanation on what is going on with MultiIndexes in general. So far, I guess that "if the view is a Series, you can set values", and "otherwise, god knows what happens".
Could someone shed some light on the logic? Moreover, is there some deep meaning behind this or are these just constraints due to how it is set up?
These are all with 0.13.1
These are not all 'slice' representations at all.
This is a Series.
In [50]: df.loc[0,'test']
Out[50]:
B
10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
Name: test, dtype: object
These are DataFrames (and the same)
In [51]: df.loc[0,:]
Out[51]:
test
B
10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
[5 rows x 1 columns]
In [52]: df.ix[0]
Out[52]:
test
B
10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
[5 rows x 1 columns]
This is trying to assign the wrong shape (it looks like it should work, but if you have multiple columns then it won't, that is why this is not allowed)
In [54]: df.ix[0] = someValues
ValueError: could not broadcast input array from shape (5) into shape (5,1)
This works because it knows how to broadcast
In [56]: df.loc[0,:] = someValues
In [57]: df
Out[57]:
test
A B
0 10 4
11 3
12 4
13 2
14 8
1 10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
2 10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
[15 rows x 1 columns]
This works fine
In [63]: df.loc[0,'test'] = someValues+1
In [64]: df
Out[64]:
test
A B
0 10 5
11 4
12 5
13 3
14 9
1 10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
2 10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
[15 rows x 1 columns]
As does this
In [66]: df.loc[0,:] = someValues+1
In [67]: df
Out[67]:
test
A B
0 10 5
11 4
12 5
13 3
14 9
1 10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
2 10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
[15 rows x 1 columns]
Not clear where you generated the cases in your question. I think the logic is pretty straightforward and consistent (their were several inconsistencies in prior versions however).

Categories

Resources