Update multiple columns per row with loop through pandas dataframe

Update multiple columns per row with loop through pandas dataframe - python

I've reviewed several posts on here about better ways to loop through dataframes, but can't seem to figure out how to apply them to my specific situation.
I have a dataframe of about 2M rows and I need to calculate six statistics for each row, one per column. There are 3 columns so 18 total. However, the issue is that I need to update those stats using a sample of the dataframe so that the mean/median, etc is different per row.
Here's what I have so far:
r = 0
for i in imputed_df.iterrows():
t = imputed_df.sample(n=10)
for (columnName) in cols:
imputed_df.loc[r,columnName + '_mean'] = t[columnName].mean()
imputed_df.loc[r,columnName + '_var'] = t[columnName].var()
imputed_df.loc[r,columnName + '_std'] = t[columnName].std()
imputed_df.loc[r,columnName + '_skew'] = t[columnName].skew()
imputed_df.loc[r,columnName + '_kurt'] = t[columnName].kurt()
imputed_df.loc[r,columnName + '_med'] = t[columnName].median()
But this has been running for two days without finishing. I tried to take a subset of 2000 rows from the original dataframe and even that one has been running for hours.
Is there a better way to do this?
EDIT: Added a sample dataset of what it should look like. each suffixed column should have the calculated value of the subset of 10 rows.
timestamp activityID w2 w3 w4
0 41.21 1.0 -1.34587 9.57245 2.83571
1 41.22 1.0 -1.76211 10.63590 2.59496
2 41.23 1.0 -2.45116 11.09340 2.23671
3 41.24 1.0 -2.42381 11.88590 1.77260
4 41.25 1.0 -2.31581 12.45170 1.50289

The problem is that you do the operation for each column using unnecessary loops.
We could use
DataFrame.agg with DataFrame.unstack and Series.set_axis to get correct names of columns.
Setup
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0, 10, (10, 100))).add_prefix('col')
new_serie = df.agg(['sum', 'mean',
'var', 'std',
'skew', 'kurt', 'median']).unstack()
new_df = pd.concat([df, new_serie.set_axis([f'{x}_{y}'
for x, y in new_serie.index])
.to_frame().T], axis=1)
# if new_df already exist:
#new_df.loc[0, :] = new_serie.set_axis([f'{x}_{y}' for x, y in new_serie.index])
col0 col1 col2 col3 col4 col5 col6 col7 col8 col9 ... \
0 8 7 6 7 6 5 8 7 8 4 ...
1 8 1 8 7 0 8 8 4 6 1 ...
2 5 6 3 5 4 9 3 0 2 5 ...
3 3 3 3 3 5 4 5 1 3 5 ...
4 7 9 4 5 6 7 0 3 4 6 ...
5 0 5 2 0 8 0 3 7 6 5 ...
6 7 0 1 4 8 9 4 9 2 9 ...
7 0 6 1 0 6 1 3 0 3 4 ...
8 3 6 1 8 3 0 7 6 8 6 ...
9 2 5 8 5 8 4 9 1 9 9 ...
col98_skew col98_kurt col98_median col99_sum col99_mean col99_var \
0 0.456435 -0.939607 3.0 39.0 3.9 6.322222
1 NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN NaN
6 NaN NaN NaN NaN NaN NaN
7 NaN NaN NaN NaN NaN NaN
8 NaN NaN NaN NaN NaN NaN
9 NaN NaN NaN NaN NaN NaN
col99_std col99_skew col99_kurt col99_median
0 2.514403 0.402601 1.099343 4.0
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
5 NaN NaN NaN NaN
6 NaN NaN NaN NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 NaN NaN NaN NaN

Related

merge two rows in one row and convert to NA

Dataframe:
0 1 2 3 4 slicing
0 NaN Object 1 NaN NaN 0
6 NaN Object 2 NaN NaN 6
12 NaN Object 3 NaN NaN 12
18 NaN Object 4 NaN NaN 18
23 NaN Object 5 NaN NaN 23
desired output:
0 1 2 3 4 slicing
0 NaN Object1 NaN NaN NaN 0
6 NaN Object2 NaN NaN NaN 6
12 NaN Object3 NaN NaN NaN 12
18 NaN Object4 NaN NaN NaN 18
23 NaN Object5 NAN NaN NaN 23
library pandas
iterate through each row in the dataset (since there are only NA's and str'Object' with its corresponding str'1-10' number)
replace str numbers with Na and concatenate data in the same row
Code for now:
df= df[df.apply(lambda row: row.astype(str).str.contains('Desk').any().df[row]+df[row], axis=1)]
Index 0 1 2 3 4
0 NaN Desk 1 NaN NaN
5 NaN Desk 2 NaN NaN
10 NaN Desk 3 NaN NaN
15 NaN Desk 4 NaN NaN
20 NaN Desk 5 NaN NaN

Here's what I did:
Using the following dataframe as an example:
0 1 2 3 4 slicing
index
0 NaN Object 1 NaN NaN 0
6 NaN Object 2 NaN A 6
12 NaN Object 3 NaN NaN 12
18 NaN NaN 4 NaN NaN 18
23 Stuff Object NaN 5 NaN 23
I perform 4 steps in the below 4 lines of code, when 'Object' exists in column 1: 1) replace nans with nothing; 2) set everything to string type; 3) join the row, to column 1, 4) replace all the other columns with nan
df.loc[df['1']=='Object',['0', '2', '3','4']] = df.loc[df['1']=='Object',['0', '2', '3','4']].fillna('')
df.loc[df['1']=='Object',['0','1', '2', '3','4']] = df.loc[df['1']=='Object',['0','1', '2', '3','4']].astype(str)
df.loc[df['1']=='Object', ['1','0', '2', '3','4']] = df.loc[df['1']=='Object', ['1', '0', '2', '3','4']].agg(''.join, axis=1)
df.loc[df['1'].str.contains('Object', na = False), ['0', '2', '3','4']] = np.nan
df
0 1 2 3 4 slicing
index
0 NaN Object1 NaN NaN NaN 0
6 NaN Object2A NaN NaN NaN 6
12 NaN Object3 NaN NaN NaN 12
18 NaN NaN 4 NaN NaN 18
23 NaN ObjectStuff5 NaN NaN NaN 23

If I understand what you are trying to achieve, you should really try to wok with columns instead of iterating. It is way faster. You can try something like this :
import numpy as np
columns = df.columns.tolist()
ix = df[df[columns[1]].str.contains('Object')].index
df.loc[ix:columns[1]] = df.loc[ix:columns[1]]+df.loc[ix:columns[2]]
df.loc[ix:columns[2]] = np.nan

Maintaining dataframe shape when slicing in pandas

I've imported a .csv into pandas and want to extract specific values and put them into a new column whilst maintaining the existing shape.
So df[::3] extracts the data-
1 1
2 4
3 7
4
5
6
7
I want it to look like
1 1
2
3
4 4
5
6
7 7

Here is a solution:
df = pd.read_csv(r"C:/users/k_sego/colsplit.csv",sep=";")
df1 = df[['col1']]
df2 = df[['col2']]
DF = pd.merge(df1,df2, how='outer',left_on=['col1'],right_on=['col2'])
and the result is
col1 col2
0 1.0 1.0
1 2.0 NaN
2 3.0 NaN
3 4.0 4.0
4 5.0 NaN
5 6.0 NaN
6 7.0 7.0
7 NaN NaN
8 NaN NaN
9 NaN NaN
10 NaN NaN

Find first N non null values in each row

If I have a pandas dataframe like this:
NaN NaN NaN 0 5 7 2 2 3 7 8
NaN NaN 0 1 2 3 5 8 8 NaN 4
NaN 0 3 6 9 NaN 4 6 1 5 1
NaN NaN 0 1 2 3 5 8 8 NaN 2
NaN NaN NaN 0 5 7 2 2 3 7 8
NaN NaN 0 1 2 3 5 8 8 NaN 4
How do I only keep the first five non null values in each row and set the rest to nan such that I get a dataframe that looks like this:
NaN NaN NaN 0 5 7 2 2 NaN NaN NaN
NaN NaN 0 1 2 3 5 NaN NaN NaN NaN
NaN 0 3 6 9 NaN 4 NaN NaN NaN NaN
NaN NaN 0 1 2 3 5 NaN NaN NaN NaN
NaN NaN NaN 0 5 7 2 2 NaN NaN Nan
NaN NaN 0 1 2 3 5 NaN NaN NaN NaN

You can use:
df.mask(df.notna().cumsum(axis=1).gt(5))

How to remove clustered/unclustered values less than a certain length from pandas dataframe?

If I have a pandas data frame like this:
A
1 1
2 1
3 NaN
4 1
5 NaN
6 1
7 1
8 1
9 1
10 NaN
11 1
12 1
13 1
How do I remove values that are clustered in a length less than some value (in this case four) for example? Such that I get an array like this:
A
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 1
7 1
8 1
9 1
10 NaN
11 NaN
12 NaN
13 NaN

Using groupby and np.where
s = df.groupby(df.A.isnull().cumsum()).transform(lambda s: pd.notnull(s).sum())
df['B'] = np.where(s.A>=4, df.A, np.nan)
Outputs
A B
1 1.0 NaN
2 1.0 NaN
3 NaN NaN
4 1.0 NaN
5 NaN NaN
6 1.0 1.0
7 1.0 1.0
8 1.0 1.0
9 1.0 1.0
10 NaN NaN
11 1.0 NaN
12 1.0 NaN
13 1.0 NaN

How to use pandas rolling_sum with sliding windows

I would like to calculate the sum or other calculation with sliding windows.
For example I would like to calculate the sum on the last 10 data point from current position where A is True.
Is there a way to do this ?
With this it didn't return the value that I expect.
I put the expected value and the calculation on the side.
Thank you
In [63]: dt['As'] = pd.rolling_sum( dt.Val[ dt.A == True ], window=10, min_periods=1)
In [64]: dt
Out[64]:
Val A B As
0 1 NaN NaN NaN
1 1 NaN NaN NaN
2 1 NaN NaN NaN
3 1 NaN NaN NaN
4 6 NaN True NaN
5 1 NaN NaN NaN
6 2 True NaN 1 pos 6 = 2
7 1 NaN NaN NaN
8 3 NaN NaN NaN
9 9 True NaN 2 pos 9 + pos 6 = 11
10 1 NaN NaN NaN
11 9 NaN NaN NaN
12 1 NaN NaN NaN
13 1 NaN True NaN
14 1 NaN NaN NaN
15 2 True NaN 3 pos 15 + pos 9 + pos 6 = 13
16 1 NaN NaN NaN
17 8 NaN NaN NaN
18 1 NaN NaN NaN
19 5 True NaN 4 pos 19 + pos 15 = 7
20 1 NaN NaN NaN
21 1 NaN NaN NaN
22 2 NaN NaN NaN
23 1 NaN NaN NaN
24 7 NaN True NaN
25 1 NaN NaN NaN
26 1 NaN NaN NaN
27 1 NaN NaN NaN
28 3 True NaN 5 pos 28 + pos 19 = 8
This almost do it
import numpy as np
import pandas as pd
dt = pd.read_csv('test2.csv')
dt['AVal'] = dt.Val[dt.A == True]
dt['ASum'] = pd.rolling_sum( dt.AVal, window=10, min_periods=1)
dt['ACnt'] = pd.rolling_count( dt.AVal, window=10)
In [4]: dt
Out[4]:
Val A B AVal ASum ACnt
0 1 NaN NaN NaN NaN 0
1 1 NaN NaN NaN NaN 0
2 1 NaN NaN NaN NaN 0
3 1 NaN NaN NaN NaN 0
4 6 NaN True NaN NaN 0
5 1 NaN NaN NaN NaN 0
6 2 True NaN 2 2 1
7 1 NaN NaN NaN 2 1
8 3 NaN NaN NaN 2 1
9 9 True NaN 9 11 2
10 1 NaN NaN NaN 11 2
11 9 NaN NaN NaN 11 2
12 1 NaN NaN NaN 11 2
13 1 NaN True NaN 11 2
14 1 NaN NaN NaN 11 2
15 2 True NaN 2 13 3
16 1 NaN NaN NaN 11 2
17 8 NaN NaN NaN 11 2
18 1 NaN NaN NaN 11 2
19 5 True NaN 5 7 2
20 1 NaN NaN NaN 7 2
21 1 NaN NaN NaN 7 2
22 2 NaN NaN NaN 7 2
23 1 NaN NaN NaN 7 2
24 7 NaN True NaN 7 2
25 1 NaN NaN NaN 5 1
26 1 NaN NaN NaN 5 1
27 1 NaN NaN NaN 5 1
28 3 True NaN 3 8 2
but need to NaN for all the value in ASum and ACount where A is NaN
Is this the way to do it ?

Are you just doing a sum, or is this a simplified example for a more complex problem?
If it's just a sum then you can use a mix of fillna() and the fact that True and False act like 1 and 0 in np.sum:
In [8]: pd.rolling_sum(dt['A'].fillna(False), window=10,
min_periods=1)[dt['A'].fillna(False)]
Out[8]:
6 1
9 2
15 3
19 2
28 2
dtype: float64

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Update multiple columns per row with loop through pandas dataframe - python

Related

merge two rows in one row and convert to NA

Maintaining dataframe shape when slicing in pandas

Find first N non null values in each row

How to remove clustered/unclustered values less than a certain length from pandas dataframe?

How to use pandas rolling_sum with sliding windows

Categories

Resources