Applying Scipy's ttest Using a Rolling Window - Part 2 - python

My question tackles a slightly different angle of this question:
I have a helper function which runs scipy's ttest_ind, the t-test for two independent samples. Here it is:
# Helper function for the independent-samples t-test
from scipy.stats import ttest_ind

def conduct_ttest(data, variable_1="bias", variable_2="score", nan_policy="omit"):
    test_result = ttest_ind(data[variable_1], data[variable_2], nan_policy=nan_policy)
    test_statistic = test_result[0]
    p_value = test_result[1]
    return test_statistic, p_value
I would like to run it using a 5 period rolling window so that it outputs the test results into the dataframe, "data". The dataframe looks like this:
date bias score
1/1/2021 5 1000
1/2/2021 13 1089
1/3/2021 21 1178
1/4/2021 29 1267
1/5/2021 37 1356
1/6/2021 45 1445
1/7/2021 53 1534
1/8/2021 61 1623
1/9/2021 69 1712
1/10/2021 77 1801
1/11/2021 85 1890
1/12/2021 93 1979
1/13/2021 101 2068
1/14/2021 109 2157
1/15/2021 117 2246
1/16/2021 125 2335
1/17/2021 133 2424
What I have tried:
data[["test_statistic", "p_value"]] = \
    data.rolling(5).apply(lambda x: conduct_ttest(x, variable_1="bias", variable_2="score", nan_policy="omit"))
However, it is not working. Does anyone have tips on what I can do?

I couldn't find a built-in rolling method for this, so try this simple iterative workaround:
# In this function I just added the window's last index to the returned values:
def conduct_ttest(data, variable_1="bias", variable_2="score", nan_policy="omit"):
    test_result = ttest_ind(data[variable_1], data[variable_2], nan_policy=nan_policy)
    test_statistic = test_result[0]
    p_value = test_result[1]
    return data.index.max(), test_statistic, p_value

# Define the rolling window length:
window = 5
pd.DataFrame(
    [conduct_ttest(df.iloc[range(i, i + window)]) for i in range(len(df) - window)],
    columns=['index', 'test_statistic', 'p_value']
).set_index('index', drop=True)
result:
test_statistic p_value
index
4 -18.310951 8.140624e-08
5 -19.592876 4.788281e-08
6 -20.874800 2.909324e-08
7 -22.156725 1.819271e-08
8 -23.438650 1.167216e-08
9 -24.720575 7.663247e-09
10 -26.002500 5.136947e-09
11 -27.284425 3.509024e-09
12 -28.566349 2.438519e-09
13 -29.848274 1.721420e-09
14 -31.130199 1.232845e-09
15 -32.412124 8.947394e-10
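To get these statistics back into the original dataframe, as the question asks, one option is to join the result on its index. A minimal sketch, assuming data uses a default integer index and conduct_ttest is the modified version above (the + 1 also covers the final window, which the loop above skips):
window = 5
results = pd.DataFrame(
    [conduct_ttest(data.iloc[i:i + window]) for i in range(len(data) - window + 1)],
    columns=['index', 'test_statistic', 'p_value']
).set_index('index')
# Each window's statistics land on the row that closes that window
data = data.join(results)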

Related

Pandas DataFrame iterate over a window of rows quickly

I've got a time-series dataframe that looks something like:
datetime gesture left-5-x ...30 columns omitted
2022-09-27 19:54:54.396680 gesture0255 533
2022-09-27 19:54:54.403298 gesture0255 534
2022-09-27 19:54:54.408938 gesture0255 535
2022-09-27 19:54:54.413995 gesture0255 523
2022-09-27 19:54:54.418666 gesture0255 522
... 95 000 rows omitted
And I want to create a new column df['cross_correlation'] which is the function of multiple sequential rows. So the cross_correlation of row i depends on the data from rows i-10 to i+10.
I could do this with df.iterrows(), but that seems like the non-idiomatic version. Is there a function like
df.window(-10, +10).apply(lambda rows: calculate_cross_correlation(rows))
or similar?
EDIT:
Thanks @chris, who pointed me towards df.rolling(), although I now have this example which better reflects the problem I'm having:
Here's a simplified version of the function I want to apply over the moving window. Note that the actual version requires the input to be the full 2D window of shape (window_size, num_columns), but the toy function below doesn't actually need the input to be 2D. I've added an assertion to make sure this is true:
def sum_over_2d(x):
    assert len(x.shape) == 2, f'shape of input is {x.shape} and not of length 2'
    return x.sum()
And now if I use .rolling with .apply
df.rolling(window=10, center=True).apply(
    sum_over_2d
)
, I get an assertion error:
AssertionError: shape of input is (10,) and not of length 2
and if I print the input x before the assertion, I get:
0 533.0
1 534.0
2 535.0
3 523.0
4 522.0
5 526.0
6 510.0
7 509.0
8 502.0
9 496.0
dtype: float64
which is one column from my many-columned dataset. What I'm wanting is for the input x to be a dataframe or 2d numpy array.
IIUC, one way using pandas.Series.rolling.apply.
Example with sum:
df["new"] = df["left-5-x"].rolling(3, center=True, min_periods=1).sum()
Output:
datetime gesture left-5-x new explain
0 2022-09-27 19:54:54.396680 gesture0255 533 1067.0 533+534
1 2022-09-27 19:54:54.403298 gesture0255 534 1602.0 533+534+535
2 2022-09-27 19:54:54.408938 gesture0255 535 1592.0 534+535+523
3 2022-09-27 19:54:54.413995 gesture0255 523 1580.0 535+523+522
4 2022-09-27 19:54:54.418666 gesture0255 522 1045.0 523+522
You can see each left-5-x value is summed with its -1/+1 neighbors.
Edit:
If you want to use the rolled dataframe, one way would be to iterate over the rolling object:
new_df = pd.concat([sum_over_2d(d) for d in df.rolling(window=10)],axis=1).T
Output:
0 1 2 3
0 0 1 2 3
1 4 6 8 10
2 12 15 18 21
3 24 28 32 36
4 40 45 50 55
5 60 66 72 78
6 84 91 98 105
7 112 120 128 136
8 144 153 162 171
9 180 190 200 210
Or, as per @Sandwichnick's comment, you can use method="table", but only if you pass engine="numba". In other words, your sum_over_2d must be numba-compilable (which is beyond the scope of this question and my knowledge):
df.rolling(window=10, center=True, method="table").sum(engine="numba")
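A minimal sketch of that route (my own addition, not from the original answer), assuming pandas >= 1.3 with numba installed; with method="table" the applied function receives each window as a 2-D array of shape (window_size, n_columns) and returns one value per column:
import numpy as np
import pandas as pd

def col_sums(window):
    # window is the full 2-D window: one row per timestep, one column per frame column
    out = np.zeros((1, window.shape[1]))
    for j in range(window.shape[1]):
        out[0, j] = window[:, j].sum()
    return out

df = pd.DataFrame(np.arange(40.0).reshape(10, 4), columns=list("abcd"))
result = df.rolling(window=10, method="table", center=True, min_periods=1).apply(
    col_sums, raw=True, engine="numba"
)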

Pandas: calculating mean value of multiple columns using datetime and Grouper removes columns or doesn't return correct Dataframe

As part of a larger task, I want to calculate the monthly mean values for each specific station. This is already difficult to do, but I am getting close.
The dataframe has many columns, but ultimately I only use the following information:
Date Value Station_Name
0 2006-01-03 18 2
1 2006-01-04 12 2
2 2006-01-05 11 2
3 2006-01-06 10 2
4 2006-01-09 22 2
... ... ...
3510 2006-12-23 47 45
3511 2006-12-24 46 45
3512 2006-12-26 35 45
3513 2006-12-27 35 45
3514 2006-12-30 28 45
I am running into two issues, using:
df.groupby(['Station_Name', pd.Grouper(freq='M')])['Value'].mean()
It results in something like:
Station_Name Date
2 2003-01-31 29.448387
2003-02-28 30.617857
2003-03-31 28.758065
2003-04-30 28.392593
2003-05-31 30.318519
...
45 2003-09-30 16.160000
2003-10-31 18.906452
2003-11-30 26.296667
2003-12-31 30.306667
2004-01-31 29.330000
Which I can't seem to use as a regular dataframe, and the datetime is messed up as it doesn't show the monthly mean but gives the last day back. Also the station name is a single index, and not for the whole column. Plus the mean value doesn't have a "column name" at all. This isn't a dataframe, but a pandas.core.series.Series. I can't convert this again because it's not correct, and using the .to_frame() method shows that it is still indeed a Dataframe. I don't get this part.
I found that in order to return a normal dataframe, I should use
as_index = False
in the groupby call. But this results in the months not being shown:
df.groupby(['Station_Name', pd.Grouper(freq='M')], as_index=False)['Value'].mean()
Gives:
Station_Name Value
0 2 29.448387
1 2 30.617857
2 2 28.758065
3 2 28.392593
4 2 30.318519
... ... ...
142 45 16.160000
143 45 18.906452
144 45 26.296667
145 45 30.306667
146 45 29.330000
I can't just simply add the month later, as not every station has an observation in every month.
I've tried using other methods, such as
df.resample("M").mean()
But it doesn't seem possible to do this on multiple columns. It returns the mean value of everything.
Edit: This is ultimately what I would want.
Station_Name Date Value
0 2 2003-01 29.448387
1 2 2003-02 30.617857
2 2 2003-03 28.758065
3 2 2003-04 28.392593
4 2 2003-05 30.318519
... ... ...
142 45 2003-08 16.160000
143 45 2003-09 18.906452
144 45 2003-10 26.296667
145 45 2003-11 30.306667
146 45 2003-12 29.330000
OK, how about this:
df = df.groupby(['Station_Name',df['Date'].dt.to_period('M')])['Value'].mean().reset_index()
Output:
Station_Name Date Value
0 2 2006-01 14.6
1 45 2006-12 38.2
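A slightly fuller sketch of the same idea (my own elaboration), assuming Date still needs to be parsed as datetime and you want the result ordered like the desired output:
df['Date'] = pd.to_datetime(df['Date'])  # skip if Date is already datetime64

monthly = (
    df.groupby(['Station_Name', df['Date'].dt.to_period('M')])['Value']
      .mean()
      .reset_index()
      .sort_values(['Station_Name', 'Date'])
      .reset_index(drop=True)
)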

How to apply a function to all the columns in a data frame and take output in the form of dataframe in python

I have two functions that do some calculation and gives me results. For now, I am able to apply it in one column and get the result in the form of a dataframe.
I need to know how I can apply the function on all the columns in the dataframe and get results as well in the form of a dataframe.
Say I have a data frame as below and I need to apply the function on each column in the data frame and get a dataframe with results corresponding for all the columns.
A B C D E F
1456 6744 9876 374 65413 1456
654 2314 674654 2156 872 6744
875 653 36541 345 4963 9876
6875 7401 3654 465 3547 374
78654 8662 35 6987 6874 65413
658 94512 687 489 8756 5854
Results
A B C D E F
2110 9058 684530 2530 66285 8200
1529 2967 711195 2501 5835 16620
7750 8054 40195 810 8510 10250
85529 16063 3689 7452 10421 65787
Here is a simple example:
df
A B C D
0 10 11 12 13
1 20 21 22 23
2 30 31 32 33
3 40 41 42 43
# Assume your user-defined function is
def mul(x, y):
    return x * y
which will multiply the values
Let's say you want to multiply the first column 'A' by 3:
df['A'].apply(lambda x: mul(x,3))
0 30
1 60
2 90
3 120
Now, to apply the mul function to all columns of the dataframe and create a new dataframe with the results:
df1 = df.applymap(lambda x: mul(x, 3))
df1
A B C D
0 30 33 36 39
1 60 63 66 69
2 90 93 96 99
3 120 123 126 129
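A side note (my addition): in newer pandas versions (2.1 and later), DataFrame.applymap is deprecated in favor of DataFrame.map, which does the same elementwise application:
df1 = df.map(lambda x: mul(x, 3))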
The pd.DataFrame object also has its own apply method.
From the example given in its documentation:
>>> df = pd.DataFrame([[4, 9],] * 3, columns=['A', 'B'])
>>> df
A B
0 4 9
1 4 9
2 4 9
>>> df.apply(np.sqrt)
A B
0 2.0 3.0
1 2.0 3.0
2 2.0 3.0
Conclusion: you should be able to apply your function to the whole dataframe.
It looks like this is what you are trying to do in your output:
df = pd.DataFrame(
    [[1456, 6744, 9876, 374, 65413, 1456],
     [654, 2314, 674654, 2156, 872, 6744],
     [875, 653, 36541, 345, 4963, 9876],
     [6875, 7401, 3654, 465, 3547, 374],
     [78654, 8662, 35, 6987, 6874, 65413],
     [658, 94512, 687, 489, 8756, 5854]],
    columns=list('ABCDEF'))

def fn(col):
    return col[:-2].values + col[1:-1].values
Apply the function as mentioned in previous answers:
>>> df.apply(fn)
A B C D E F
0 2110 9058 684530 2530 66285 8200
1 1529 2967 711195 2501 5835 16620
2 7750 8054 40195 810 8510 10250
3 85529 16063 3689 7452 10421 65787
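If the intended calculation really is the sum of each pair of consecutive rows, the same numbers can also be produced with a rolling window; a small sketch of that reading (my own, not part of the original answers):
out = df.rolling(2).sum().iloc[1:-1].reset_index(drop=True).astype(int)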

Format Pandas Pivot Table

I ran into a problem formatting a pivot table created by pandas.
I made a matrix table between two columns (A, B) from my source data, using pandas.pivot_table with A as the columns and B as the index.
>> df = PD.read_excel("data.xls")
>> table = PD.pivot_table(df, index=["B"],
                          values='Count', columns=["A"], aggfunc=[NUM.sum],
                          fill_value=0, margins=True, dropna=True)
>> table
It returns as:
sum
A 1 2 3 All
B
1 23 52 0 75
2 16 35 12 65
3 56 0 0 56
All 95 87 12 196
And I hope to have a format like this:
A All_B
1 2 3
1 23 52 0 75
B 2 16 35 12 65
3 56 0 0 56
All_A 95 87 12 196
How should I do this? Thanks very much in advance.
The table returned by pd.pivot_table is very convenient to work with (it has a single-level index/column) and normally does NOT require any further format manipulation. But if you insist on changing the format to the one you mentioned in the post, then you need to construct a multi-level index/column using pd.MultiIndex. Here is an example of how to do it.
Before manipulation,
import pandas as pd
import numpy as np
np.random.seed(0)
a = np.random.randint(1, 4, 100)
b = np.random.randint(1, 4, 100)
df = pd.DataFrame(dict(A=a,B=b,Val=np.random.randint(1,100,100)))
table = pd.pivot_table(df, index='A', columns='B', values='Val', aggfunc=sum, fill_value=0, margins=True)
print(table)
B 1 2 3 All
A
1 454 649 770 1873
2 628 576 467 1671
3 376 247 481 1104
All 1458 1472 1718 4648
After:
multi_level_column = pd.MultiIndex.from_arrays([['A', 'A', 'A', 'All_B'], [1,2,3,'']])
multi_level_index = pd.MultiIndex.from_arrays([['B', 'B', 'B', 'All_A'], [1,2,3,'']])
table.index = multi_level_index
table.columns = multi_level_column
print(table)
A All_B
1 2 3
B 1 454 649 770 1873
2 628 576 467 1671
3 376 247 481 1104
All_A 1458 1472 1718 4648
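If you'd rather not hard-code the level values, the same labels can be built from the pivoted table itself; a small generalization of the answer above (my own sketch, only checked against this margins layout):
col_labels = list(table.columns)
row_labels = list(table.index)
table.columns = pd.MultiIndex.from_arrays(
    [['A'] * (len(col_labels) - 1) + ['All_B'], col_labels[:-1] + ['']])
table.index = pd.MultiIndex.from_arrays(
    [['B'] * (len(row_labels) - 1) + ['All_A'], row_labels[:-1] + ['']])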

Why can't I apply shift from within a pandas function?

I am trying to build a function that uses .shift() but it is giving me an error.
Consider this:
In [40]:
data = {'level1': [20, 19, 20, 21, 25, 29, 30, 31, 30, 29, 31],
        'level2': [10, 10, 20, 20, 20, 10, 10, 20, 20, 10, 10]}
index = pd.date_range('12/1/2014', periods=11)
frame = DataFrame(data, index=index)
frame
Out[40]:
level1 level2
2014-12-01 20 10
2014-12-02 19 10
2014-12-03 20 20
2014-12-04 21 20
2014-12-05 25 20
2014-12-06 29 10
2014-12-07 30 10
2014-12-08 31 20
2014-12-09 30 20
2014-12-10 29 10
2014-12-11 31 10
A normal function works fine. To demonstrate I calculate the same result twice, using the direct and function approach:
In [63]:
frame['horizontaladd1'] = frame['level1'] + frame['level2']  # works

def horizontaladd(x):
    test = x['level1'] + x['level2']
    return test

frame['horizontaladd2'] = frame.apply(horizontaladd, axis=1)
frame
Out[63]:
level1 level2 horizontaladd1 horizontaladd2
2014-12-01 20 10 30 30
2014-12-02 19 10 29 29
2014-12-03 20 20 40 40
2014-12-04 21 20 41 41
2014-12-05 25 20 45 45
2014-12-06 29 10 39 39
2014-12-07 30 10 40 40
2014-12-08 31 20 51 51
2014-12-09 30 20 50 50
2014-12-10 29 10 39 39
2014-12-11 31 10 41 41
But while directly applying shift works, in a function it doesn't work:
frame['verticaladd1'] = frame['level1'] + frame['level1'].shift(1)  # works

def verticaladd(x):
    test = x['level1'] + x['level1'].shift(1)
    return test

frame.apply(verticaladd)  # error
results in
KeyError: ('level1', u'occurred at index level1')
I also tried applying to a single column which makes more sense in my mind, but no luck:
def verticaladd2(x):
    test = x - x.shift(1)
    return test

frame['level1'].map(verticaladd2)  # error, also with apply
error:
AttributeError: 'numpy.int64' object has no attribute 'shift'
Why not call shift directly? I need to embed it into a function to calculate multiple columns at the same time, along axis 1. See related question Ambiguous truth value with boolean logic
Try passing the frame to the function, rather than using apply (I am not sure why apply doesn't work, even column-wise):
def f(x):
    return x.level1 + x.level1.shift(1)

f(frame)
returns:
2014-12-01 NaN
2014-12-02 39
2014-12-03 39
2014-12-04 41
2014-12-05 46
2014-12-06 54
2014-12-07 59
2014-12-08 61
2014-12-09 61
2014-12-10 59
2014-12-11 60
Freq: D, Name: level1, dtype: float64
Check whether the values you are trying to shift are a plain array. If so, you need to convert the array to a Series; then you will be able to shift the values. I was having the same issue, and now I am able to get the shifted values.
This is the relevant part of my code for reference:
X = grouped['Confirmed_day'].values
X_series=pd.Series(X)
X_lag1 = X_series.shift(1)
I'm not entirely following along, but if frame['level1'].shift(1) works, then I can only imagine that frame['level1'] is not a numpy.int64 object while whatever you are passing into the verticaladd function is. Probably need to look at your types.
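Tying the answers together: apply with axis=1 hands the function one row at a time (a Series of scalars, with nothing to shift), so the idiomatic way to "calculate multiple columns at the same time" is a function that takes the whole frame and shifts entire columns. A minimal sketch (my own illustration, using the frame defined in the question):
def add_vertical_adds(frame):
    # Whole columns are passed to shift(), so it works here
    out = frame.copy()
    out['verticaladd1'] = frame['level1'] + frame['level1'].shift(1)
    out['verticaladd2'] = frame['level2'] + frame['level2'].shift(1)
    return out

frame = add_vertical_adds(frame)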
