Moving window on a Standard Deviation & Mean calculation - python

I have a df with many columns. I am currently using the command output = df.join(df.expanding().std().fillna(0).add_prefix("SD")) to generate a standard deviation column for column A based on cumulative values, like this:
A SDA
1 x1
2 x2
3 x3
4 x4
5 x5
Here x1 is the SD of 1; x2 is the SD of 1,2; x5 is the SD of 1,2,3,4,5; and so on.
I want to move the window in such a way that once it reaches row 11, the SD is calculated on the values 2 to 11.
A SDA
1 x1
2 x2
3 x3
.. ..
9 x9
10 x10
11 x11
12 x12
13 x13
.. ..
20 x20
21 x21
22 x22
So here x11 will be the standard deviation of the values 2,3,4...11, and x12 will be of 2 to 12. Thus x20 will be based on 2 to 20. After 20 values the window again moves one step, so x21 will be the SD of 3,4,5,6...21, x22 will be based on the values from 3 to 22, and so on. I want to do such an operation on multiple columns and generate multiple SD columns at a time.
I am not sure how to use the expanding function for this kind of moving window.
For calculating the mean in the same way, can I just use mean() in place of std()?

You only need to determine the lower- and upper-bound for each row, then it's easy:
from datar.all import (
    f, across, c, tibble, mutate, row_number, std, rnorm
)

# Create an example df
df = tibble(A=rnorm(22), B=rnorm(22))

def stdev(col, lower, upper):
    """Calculate the stdev of the current column between the lower and upper bounds"""
    return [std(col[low:up]) for low, up in zip(lower, upper)]

(
    df
    >> mutate(
        # create the lower- and upper-bound
        # note it's 0-based
        upper=row_number(),
        lower=(f.upper - 1) // 10,
    )
    >> mutate(
        across(
            # Apply the stdev func to each column except the lower and upper columns
            ~c(f.lower, f.upper),
            stdev,
            lower=f.lower,
            upper=f.upper,
            _names="SD{_col}",
        )
    )
)
A B upper lower SDA SDB
<float64> <float64> <int64> <int64> <float64> <float64>
0 -0.399324 0.740135 1 0 NaN NaN
1 -0.023364 0.468155 2 0 0.265844 0.192318
2 0.839819 -0.899893 3 0 0.635335 0.878940
3 -0.788705 0.497236 4 0 0.695902 0.744258
4 1.838374 -0.153098 5 0 1.053171 0.663758
5 0.174278 -0.938773 6 0 0.943238 0.736899
6 0.265525 1.906103 7 0 0.861060 0.998927
7 0.484971 1.687661 8 0 0.800723 1.058484
8 0.238861 1.378369 9 0 0.749275 1.041054
9 1.068637 -0.075925 10 0 0.747869 0.999481
10 -1.742042 -0.192375 11 1 0.984440 1.013941
11 -0.779599 -1.944827 12 1 0.982807 1.188045
12 -0.478696 0.211798 13 1 0.954120 1.132865
13 -2.598185 -0.747964 14 1 1.179397 1.113613
14 -0.308082 0.171333 15 1 1.134297 1.070135
15 0.700852 -2.719584 16 1 1.113848 1.261954
16 0.917145 0.375815 17 1 1.104229 1.224715
17 1.343796 -0.796525 18 1 1.118582 1.199169
18 1.024335 -0.943663 19 1 1.108354 1.180068
19 -0.877742 -0.431288 20 1 1.101227 1.148623
20 -0.584439 -0.555945 21 2 1.111302 1.141233
21 -0.946391 -1.550432 22 2 1.103871 1.149968
Finally, you can remove the lower and upper columns using select():
df >> select(~c(f.lower, f.upper))
Disclaimer: I am the author of datar, which is a wrapper around pandas to implement some features from dplyr/tidyr in R.
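For reference, the same lower/upper-bound idea can be written in plain pandas without datar; here is a minimal sketch (the frame and column names are made up for illustration), which also covers the mean part of the question by reusing the same windows:
import numpy as np
import pandas as pd

# Example frame standing in for the question's data (values are random)
df = pd.DataFrame({"A": np.random.randn(22), "B": np.random.randn(22)})

# 1-based row number as the exclusive upper bound of each window;
# the 0-based lower bound jumps forward by one every 10 rows
upper = np.arange(1, len(df) + 1)
lower = (upper - 1) // 10

for col in list(df.columns):
    windows = [df[col].iloc[lo:up] for lo, up in zip(lower, upper)]
    df["SD" + col] = [w.std() for w in windows]      # sample SD; NaN for a single value
    df["Mean" + col] = [w.mean() for w in windows]   # same windows answer the mean question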

Related

Sample from dataframe with conditions

I have a large dataset and I want to sample from it, but with a condition. What I need is a new dataframe with almost the same count of values for a boolean column of 0 and 1.
What I have:
df['target'].value_counts()
0 = 4000
1 = 120000
What I need:
new_df['target'].value_counts()
0 = 4000
1 = 6000
I know I can use df.sample, but I don't know how to apply the condition.
Thanks
Since pandas 1.1.0, you can use groupby.sample if you need the same number of rows for each group:
df.groupby('target').sample(4000)
Demo:
df = pd.DataFrame({'x': [0] * 10 + [1] * 25})
df.groupby('x').sample(5)
x
8 0
6 0
7 0
2 0
9 0
18 1
33 1
24 1
32 1
15 1
If you need to sample conditionally based on the group value, you can do:
df.groupby('target', group_keys=False).apply(
    lambda g: g.sample(4000 if g.name == 0 else 6000)
)
Demo:
df.groupby('x', group_keys=False).apply(
    lambda g: g.sample(4 if g.name == 0 else 6)
)
x
7 0
8 0
2 0
1 0
18 1
12 1
17 1
22 1
30 1
28 1
Assuming the following input and using the values 4/6 instead of 4000/6000:
df = pd.DataFrame({'target': [0,1,1,1,0,1,1,1,0,1,1,1,0,1,1,1]})
You could groupby your target and sample to take at most N values per group:
df.groupby('target', group_keys=False).apply(lambda g: g.sample(min(len(g), 6)))
example output:
target
4 0
0 0
8 0
12 0
10 1
14 1
1 1
7 1
11 1
13 1
If you want the same size you can simply use df.groupby('target').sample(n=4)

How to iteratively add a column to a dataframe X based on the values of a separated dataframe y with Pandas?

I am struggling with this problem.
These are my initial matrices:
columnsx = {'X1':[6,11,17,3,12],'X2':[1,2,10,24,18],'X3':[8,14,9,15,7], 'X4':[22,4,20,16,5],'X5':[19,21,13,23,25]}
columnsy = {'y1':[0,1,1,2,0],'y2':[1,0,0,2,1]}
X = pd.DataFrame(columnsx)
y = pd.DataFrame(columnsy)
This is the final result I am trying to get. It adds a column to X (called X_i) holding the name of the y column whose value is > 0. In other words, it keeps only the positive values of y (y > 0) and returns y as a single column.
columnsx = {'X1':[11,17,3,6,3,12],'X2':[2,10,24,1,24,18],'X3':[14,9,15,8,15,7],
            'X4':[4,20,16,22,16,5],'X5':[21,13,23,19,23,25], 'X_i':['y1','y1','y1','y2','y2','y2']}
columnsy = {'y':[1,1,2,1,2,1]}
X = pd.DataFrame(columnsx)
y = pd.DataFrame(columnsy)
Use DataFrame.melt:
# assuming df combines the question's X and y, e.g. df = X.join(y)
new_df = (df.melt(df.columns[df.columns.str.contains('X')],
                  var_name='X_y', value_name='y')
            .loc[lambda df: df['y'].gt(0)])
print(new_df)
Output
X1 X2 X3 X4 X5 X_y y
1 11 2 14 4 21 y1 1
2 17 10 9 20 13 y1 1
3 3 24 15 16 23 y1 2
5 6 1 8 22 19 y2 1
8 3 24 15 16 23 y2 2
9 12 18 7 5 25 y2 1
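If you also want the exact X / y layout from the question (an X_i label column plus a single y column), here is a self-contained sketch; note that building df as X.join(y) is my assumption, since the answer does not show how df was constructed:
import pandas as pd

# Inputs copied from the question
X = pd.DataFrame({'X1': [6, 11, 17, 3, 12], 'X2': [1, 2, 10, 24, 18],
                  'X3': [8, 14, 9, 15, 7], 'X4': [22, 4, 20, 16, 5],
                  'X5': [19, 21, 13, 23, 25]})
y = pd.DataFrame({'y1': [0, 1, 1, 2, 0], 'y2': [1, 0, 0, 2, 1]})

# Combine, melt the y columns, and keep only positive y values
df = X.join(y)
new_df = (df.melt(df.columns[df.columns.str.contains('X')],
                  var_name='X_y', value_name='y')
            .loc[lambda d: d['y'].gt(0)])

# Split back into the X / y layout shown in the question
X_out = (new_df.drop(columns='y')
               .rename(columns={'X_y': 'X_i'})
               .reset_index(drop=True))
y_out = new_df[['y']].reset_index(drop=True)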

How to extract mean and fluctuation by equal index?

I have a CSV file like the one below (after sorting the dataframe by iy):
iy,u
1,80
1,90
1,70
1,50
1,60
2,20
2,30
2,35
2,15
2,25
I'm trying to compute the mean and the fluctuation for rows with equal iy. For example, for the CSV above, what I want is something like this:
iy,u,U,u'
1,80,70,10
1,90,70,20
1,70,70,0
1,50,70,-20
1,60,70,-10
2,20,25,-5
2,30,25,5
2,35,25,10
2,15,25,-10
2,25,25,0
Here U is the average of u over rows with equal iy, and u' is simply u - U, the fluctuation. I know there's groupby.mean() in pandas, but I don't want to collapse the dataframe; I just want to take the mean, put the values in a new column, and then calculate the fluctuation.
How can I proceed?
Use groupby with transform to calculate a mean for each group and assign that value to a new column 'U', then subtract the two columns:
df['U'] = df.groupby('iy').transform('mean')
df["u'"] = df['u'] - df['U']
df
Output:
iy u U u'
0 1 80 70 10
1 1 90 70 20
2 1 70 70 0
3 1 50 70 -20
4 1 60 70 -10
5 2 20 25 -5
6 2 30 25 5
7 2 35 25 10
8 2 15 25 -10
9 2 25 25 0
You could get fancy and do it in one line:
df.assign(U=df.groupby('iy').transform('mean')).eval("u_prime = u-U")
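If the frame ever contains more columns than just iy and u, naming the column explicitly in the transform keeps the extra columns out of the mean; a small sketch built from the question's CSV:
import pandas as pd

# Frame rebuilt from the question's CSV
df = pd.DataFrame({'iy': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'u': [80, 90, 70, 50, 60, 20, 30, 35, 15, 25]})

# Select 'u' explicitly so only that column feeds the group-wise mean
df['U'] = df.groupby('iy')['u'].transform('mean')
df["u'"] = df['u'] - df['U']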

Subtracting group-wise mean from a matrix or data frame in python (the "within" transformation for panel data)

In datasets where units are observed multiple times, many statistical methods (particularly in econometrics) apply a transformation to the data in which the group-wise mean of each variable is subtracted off, creating a dataset of unit-level (non-standardized) anomalies from a unit level mean.
I want to do this in Python.
In R, it is handled quite cleanly by the demeanlist function in the lfe package. Here's an example dataset, with a grouping variable fac:
> df <- data.frame(fac = factor(c(rep("a", 5), rep("b", 6), rep("c", 4))),
+ x1 = rnorm(15),
+ x2 = rbinom(15, 10, .5))
> df
fac x1 x2
1 a -0.77738784 6
2 a 0.25487383 4
3 a 0.05457782 4
4 a 0.21403962 7
5 a 0.08518492 4
6 b -0.88929876 4
7 b -0.45661751 5
8 b 1.05712683 3
9 b -0.24521251 5
10 b -0.32859966 7
11 b -0.44601716 3
12 c -0.33795597 4
13 c -1.09185690 7
14 c -0.02502279 6
15 c -1.36800818 5
And the transformation:
> library(lfe)
> demeanlist(df[,c("x1", "x2")], list(df$fac))
x1 x2
1 -0.74364551 1.0
2 0.28861615 -1.0
3 0.08832015 -1.0
4 0.24778195 2.0
5 0.11892725 -1.0
6 -0.67119563 -0.5
7 -0.23851438 0.5
8 1.27522996 -1.5
9 -0.02710938 0.5
10 -0.11049653 2.5
11 -0.22791403 -1.5
12 0.36775499 -1.5
13 -0.38614594 1.5
14 0.68068817 0.5
15 -0.66229722 -0.5
In other words, the following numbers are subtracted from groups a, b, and c:
> library(doBy)
> summaryBy(x1+x2~fac, data = df)
fac x1.mean x2.mean
1 a -0.03374233 5.0
2 b -0.21810313 4.5
3 c -0.70571096 5.5
I'm sure I could figure out a function to do this, but I'll be calling it thousands of times on very large datasets, and would like to know if something fast and optimized has already been built, or is obvious to construct.
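In pandas, this within transformation maps directly onto groupby(...).transform('mean'); below is a minimal sketch mirroring the R example (the data are random, so the numbers will differ from those above):
import numpy as np
import pandas as pd

# Hypothetical frame mirroring the R example: a grouping factor plus two variables
df = pd.DataFrame({
    "fac": ["a"] * 5 + ["b"] * 6 + ["c"] * 4,
    "x1": np.random.normal(size=15),
    "x2": np.random.binomial(10, 0.5, size=15),
})

# The "within" transformation: subtract each group's mean from its own rows
cols = ["x1", "x2"]
demeaned = df[cols] - df.groupby("fac")[cols].transform("mean")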

In Pandas, how to operate between columns with max performance

I have the following df:
usersidid clienthostid LoginDaysSumLastMonth LoginDaysSumLast7Days LoginDaysSum
0 9 1 50 7 1728
1 3 1 43 3 1331
2 6 1 98 9 216
3 4 1 10 6 64
4 9 2 64 32 343
5 12 3 45 43 1000
6 8 3 87 76 512
7 9 3 16 3 1200
What I'm trying to do is:
For every 'clienthostid', look for the 'usersidid' with the highest 'LoginDaysSum'; I then check if there is a usersidid which has the highest LoginDaysSum in two or more different clienthostid values (for instance, usersidid = 9 has the highest LoginDaysSum in clienthostid 1, 2 and 3, in rows 0, 4 and 7 respectively).
In this case, I want to choose the highest LoginDaysSum (in the example it would be the row with 1728); let's call it maxRT.
I want to calculate the ratio of LoginDaysSumLast7Days between maxRT and each of the other rows (in the example, the rows with index 4 and 7).
If the ratio is below 0.8 then I want to drop the row:
index 4- LoginDaysSumLast7Days_ratio = 7/32 < 0.8 //row will drop!
index 7- LoginDaysSumLast7Days_ratio = 7/3 > 0.8 //row will stay!
The same condition will also be applied to LoginDaysSumLastMonth.
So for the example the result will be:
usersidid clienthostid LoginDaysSumLastMonth LoginDaysSumLast7Days LoginDaysSum
0 9 1 50 7 1728
1 3 1 43 3 1331
2 6 1 98 9 216
3 4 1 10 6 64
5 12 3 45 43 1000
6 8 3 87 76 512
7 9 3 16 3 1200
Now here's the snag: performance is critical.
I tried to implement it using .apply, but not only could I not make it work right, it also ran way too slow :(
My code so far (forgive me if it's written terribly wrong; I only started working with SQL, Pandas and Python for the first time last week, and everything I learned is from examples I found here ^_^):
df_client_Logindayssum_pairs = df.merge(df.groupby(['clienthostid'], as_index=False, sort=False)['LoginDaysSum'].max(),df, how='inner', on=['clienthostid', 'LoginDaysSum'])
UsersWithMoreThan1client = df_client_Logindayssum_pairs.groupby(['usersidid'], as_index=False, sort=False)['LoginDaysSum'].count().rename(columns={'LoginDaysSum': 'NumOfClientsPerUesr'})
UsersWithMoreThan1client = UsersWithMoreThan1client[UsersWithMoreThan1client.NumOfClientsPerUesr >= 2]
UsersWithMoreThan1client = df_client_Logindayssum_pairs[df_client_Logindayssum_pairs.usersidid.isin(UsersWithMoreThan1Device.loc[:, 'usersidid'])].reset_index(drop=True)
UsersWithMoreThan1client = UsersWithMoreThan1client.sort_values(['clienthostid', 'LoginDaysSum'], ascending=[True, False], inplace=True)
UsersWithMoreThan1client = ttm.groupby(['clienthostid'], sort=False)['LoginDaysSumLast7Days'].apply(lambda x: x.iloc[0] / x.iloc[1]).reset_index(name='ratio')
UsersWithMoreThan1client = UsersWithMoreThan1client[UsersWithMoreThan1client.ratio > 0.8]
UsersWithMoreThan1client = ttm.groupby(['clienthostid'], sort=False)['LoginDaysSumLastMonth'].apply(lambda x: x.iloc[0] / x.iloc[1]).reset_index(name='ratio2')
UsersWithMoreThan1client = UsersWithMoreThan1client[UsersWithMoreThan1client.ratio2 > 0.8]
Would very much appreciate any suggestions on how to do it
Thank you
I believe this is what you need:
# Put the index as a regular column
data = data.reset_index()
# Find the greatest LoginDaysSum for each clienthostid
agg1 = data.sort_values(by='LoginDaysSum', ascending=False).groupby(['clienthostid']).first()
# Collect the greatest LoginDaysSum for each usersidid
agg2 = agg1.sort_values(by='LoginDaysSum', ascending=False).groupby('usersidid').first()
# Join both previous aggregations
joined = agg1.set_index('usersidid').join(agg2, rsuffix='_max')
# Compute ratios
joined['LoginDaysSumLast7Days_ratio'] = joined['LoginDaysSumLast7Days_max'] / joined['LoginDaysSumLast7Days']
joined['LoginDaysSumLastMonth_ratio'] = joined['LoginDaysSumLastMonth_max'] / joined['LoginDaysSumLastMonth']
# Select index values that do not meet the required criteria
rem_idx = joined[(joined['LoginDaysSumLast7Days_ratio'] < 0.8) | (joined['LoginDaysSumLastMonth_ratio'] < 0.8)]['index']
# Restore index and remove the selected rows
data = data.set_index('index').drop(rem_idx)
The result in data is:
usersidid clienthostid LoginDaysSumLastMonth LoginDaysSumLast7Days LoginDaysSum
index
0 9 1 50 7 1728
1 3 1 43 3 1331
2 6 1 98 9 216
3 4 1 10 6 64
5 12 3 45 43 1000
6 8 3 87 76 512
7 9 3 16 3 1200
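To reproduce this locally, the question's table can be rebuilt as the data frame the snippet above expects (values copied from the question):
import pandas as pd

# The question's example frame; the answer above refers to it as `data`
data = pd.DataFrame({
    'usersidid':             [9, 3, 6, 4, 9, 12, 8, 9],
    'clienthostid':          [1, 1, 1, 1, 2, 3, 3, 3],
    'LoginDaysSumLastMonth': [50, 43, 98, 10, 64, 45, 87, 16],
    'LoginDaysSumLast7Days': [7, 3, 9, 6, 32, 43, 76, 3],
    'LoginDaysSum':          [1728, 1331, 216, 64, 343, 1000, 512, 1200],
})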
