How to detect deviation with pandas DataFrame? - python

I have some data that contains 5 columns and 1000 rows. Now I just picked up 3 random rows:
5 5 5 0.1 0.2
4 4 4 4 0.3
4 3 3 3 1
How can I detect the deviation in each row? For example, the first row contains two near-zero values (0.1 and 0.2) and the second row contains one (0.3). I tried using the mean, but that is not the right solution.

You could do something like this:
n = 3
new_df = df.loc[:, ~(df.diff(axis=1).abs() > n).any()]
print(new_df)
col1 col2 col3
0 5.0 5.0 5.0
1 4.0 4.0 4.0
2 4.0 3.0 3.0
new_df = df.loc[:, (df.diff(axis=1).abs() > n).any()]
print(new_df)
col4 col5
0 0.1 0.2
1 4.0 0.3
2 3.0 1.0
By adjusting the threshold n you can select whichever interval you want.
Differences
print(df.diff(axis=1).abs())
col1 col2 col3 col4 col5
0 NaN 0.0 0.0 4.9 0.1
1 NaN 0.0 0.0 0.0 3.7
2 NaN 1.0 0.0 0.0 2.0
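Putting the answer together into a runnable sketch (the column names col1 through col5 match the outputs above):

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame(
    [[5, 5, 5, 0.1, 0.2],
     [4, 4, 4, 4, 0.3],
     [4, 3, 3, 3, 1]],
    columns=["col1", "col2", "col3", "col4", "col5"],
)

n = 3  # maximum allowed jump between neighbouring columns

# Columns whose row-wise difference to the previous column ever exceeds n
mask = (df.diff(axis=1).abs() > n).any()
stable = df.loc[:, ~mask]     # col1, col2, col3
deviating = df.loc[:, mask]   # col4, col5
```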

Related

Pandas: Fillna with local average if a condition is met

Let's say I have data like this:
df = pd.DataFrame({'col1': [5, np.nan, 2, 2, 5, np.nan, 4], 'col2':[1,3,np.nan,np.nan,5,np.nan,4]})
print(df)
col1 col2
0 5.0 1.0
1 NaN 3.0
2 2.0 NaN
3 2.0 NaN
4 5.0 5.0
5 NaN NaN
6 4.0 4.0
How can I use fillna() to replace NaN values with the average of the prior and the succeeding value, if neither of them is NaN?
The result would look like this:
col1 col2
0 5.0 1.0
1 3.5 3.0
2 2.0 NaN
3 2.0 NaN
4 5.0 5.0
5 4.5 4.5
6 4.0 4.0
Also, is there a way of calculating the average from the previous n and succeeding n values (if none of them is NaN)?
We can shift the dataframe forwards and backwards, add the two shifted frames together, divide by two, and use the result with fillna:
s1, s2 = df.shift(), df.shift(-1)
df = df.fillna((s1 + s2) / 2)
col1 col2
0 5.0 1.0
1 3.5 3.0
2 2.0 NaN
3 2.0 NaN
4 5.0 5.0
5 4.5 4.5
6 4.0 4.0
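The follow-up about the previous n and succeeding n values isn't covered above; one hedged sketch uses rolling means over shifted frames, where min_periods=n enforces that all n neighbours on each side are non-NaN (n = 1 reproduces the answer above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [5, np.nan, 2, 2, 5, np.nan, 4],
                   'col2': [1, 3, np.nan, np.nan, 5, np.nan, 4]})

n = 1  # neighbours on each side; with n = 1 this matches the shift answer

# Mean over the n previous values; stays NaN unless all n are present
before = df.shift().rolling(n, min_periods=n).mean()
# Mean over the n succeeding values (same trick on the reversed frame)
after = df[::-1].shift().rolling(n, min_periods=n).mean()[::-1]

filled = df.fillna((before + after) / 2)
```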

How to change row values by a condition on a column (python, pandas)

Hey all, I want to change row values based on a condition on a column: where column "type" == "A", the values in columns col1 through col5 should become 1 if the value is bigger than 2, and 0 otherwise.
The data:
data={"col1":[np.nan,3,4,5,9,2,6],
"col2":[4,2,4,6,0,1,5],
"col3":[7,6,0,11,3,6,7],
"col4":[14,11,22,8,6,np.nan,9],
"col5":[0,5,7,3,8,2,9],
"type":["A","A","C","A","B","A","E"],
"number":["one","two","two","one","one","two","two"]}
df=pd.DataFrame.from_dict(data)
df
How I expect the data to look:
data={"col1":[0,1,4,1,9,0,6],
"col2":[1,0,4,1,0,0,5],
"col3":[1,1,0,1,3,1,7],
"col4":[1,1,22,1,6,0,9],
"col5":[0,1,7,1,1,0,9],
"type":["A","A","C","A","B","A","E"],
"number":["one","two","two","one","one","two","two"]}
df=pd.DataFrame.from_dict(data)
df
You can use df.query to get all type-A rows, then df._get_numeric_data (or df.select_dtypes('number')) to get the numeric fields, compare with df.gt, cast the result to int with df.astype, and finally write the new values back with df.update:
df.update(df.query('type == "A"')._get_numeric_data().gt(2).astype(int))
#.select_dtypes('number')
df
col1 col2 col3 col4 col5 type number
0 0.0 1.0 1.0 1.0 0.0 A one
1 1.0 0.0 1.0 1.0 1.0 A two
2 4.0 4.0 0.0 22.0 7.0 C two
3 1.0 1.0 1.0 1.0 1.0 A one
4 9.0 0.0 3.0 6.0 8.0 B one
5 0.0 0.0 1.0 0.0 0.0 A two
6 6.0 5.0 7.0 9.0 9.0 E two
Use DataFrame.loc to select the rows where the condition equals A and the columns between the first and last column name, compare for greater-than with DataFrame.gt, convert the boolean mask to integers to map True/False to 1/0, and finally update with DataFrame.update:
df.update(df.loc[df['type'].eq('A'), 'col1':'col5'].gt(2).astype(int))
print (df)
col1 col2 col3 col4 col5 type number
0 0.0 1.0 1.0 1.0 0.0 A one
1 1.0 0.0 1.0 1.0 1.0 A two
2 4.0 4.0 0.0 22.0 7.0 C two
3 1.0 1.0 1.0 1.0 1.0 A one
4 9.0 0.0 3.0 6.0 8.0 B one
5 0.0 0.0 1.0 0.0 0.0 A two
6 6.0 5.0 7.0 9.0 9.0 E two
Or by assign back:
m = df['type'].eq('A')
df.loc[m, 'col1':'col5'] = df.loc[m, 'col1':'col5'].gt(2).astype(int)
print (df)
col1 col2 col3 col4 col5 type number
0 0.0 1 1 1.0 0 A one
1 1.0 0 1 1.0 1 A two
2 4.0 4 0 22.0 7 C two
3 1.0 1 1 1.0 1 A one
4 9.0 0 3 6.0 8 B one
5 0.0 0 1 0.0 0 A two
6 6.0 5 7 9.0 9 E two
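For reference, the second approach end to end as a self-contained sketch (note that NaN compares as False against 2, so NaN cells in type-A rows become 0, matching the outputs above):

```python
import numpy as np
import pandas as pd

data = {"col1": [np.nan, 3, 4, 5, 9, 2, 6],
        "col2": [4, 2, 4, 6, 0, 1, 5],
        "col3": [7, 6, 0, 11, 3, 6, 7],
        "col4": [14, 11, 22, 8, 6, np.nan, 9],
        "col5": [0, 5, 7, 3, 8, 2, 9],
        "type": ["A", "A", "C", "A", "B", "A", "E"],
        "number": ["one", "two", "two", "one", "one", "two", "two"]}
df = pd.DataFrame(data)

# Mask of type-A rows; threshold col1..col5 against 2 only on those rows
m = df["type"].eq("A")
df.loc[m, "col1":"col5"] = df.loc[m, "col1":"col5"].gt(2).astype(int)
```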

How to choose two rows in a dataframe, calculate the average of their values in each column, and append a new row with the averages to the dataframe

I have a pandas dataframe :
col1 col2 col3
0 8 7 5
1 6 2 17
2 3 1 21
3 4 3 9
I want to calculate the average of each column over row 1 and row 2, append the new row to my pandas dataframe, and get:
col1 col2 col3
0 8 7 5
1 6 2 17
2 3 1 21
3 4 3 9
4 4.5 1.5 19
You can do a concat:
pd.concat((df, df.iloc[[1,2]].mean().to_frame().T)).reset_index(drop=True)
Output:
col1 col2 col3
0 8.0 7.0 5.0
1 6.0 2.0 17.0
2 3.0 1.0 21.0
3 4.0 3.0 9.0
4 4.5 1.5 19.0
Or an append (note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so prefer concat on newer versions):
df.append(df.iloc[[1,2]].mean().rename(len(df)))
Output:
col1 col2 col3
0 8.0 7.0 5.0
1 6.0 2.0 17.0
2 3.0 1.0 21.0
3 4.0 3.0 9.0
4 4.5 1.5 19.0
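Since DataFrame.append is gone in pandas 2.0, the concat route can be written as a self-contained sketch:

```python
import pandas as pd

df = pd.DataFrame({"col1": [8, 6, 3, 4],
                   "col2": [7, 2, 1, 3],
                   "col3": [5, 17, 21, 9]})

# Mean of rows 1 and 2 as a one-row frame, appended below df
avg = df.iloc[[1, 2]].mean().to_frame().T
out = pd.concat([df, avg], ignore_index=True)
```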

How to delete the first and last rows with NaN of a dataframe and replace the remaining NaN with the average of the values below and above?

Let's take this dataframe as a simple example:
df = pd.DataFrame(dict(Col1=[np.nan,1,1,2,3,8,7], Col2=[1,1,np.nan,np.nan,3,np.nan,4], Col3=[1,1,np.nan,5,1,1,np.nan]))
Col1 Col2 Col3
0 NaN 1.0 1.0
1 1.0 1.0 1.0
2 1.0 NaN NaN
3 2.0 NaN 5.0
4 3.0 3.0 1.0
5 8.0 NaN 1.0
6 7.0 4.0 NaN
I would first like to remove rows from the top and bottom until neither the first nor the last row contains a NaN.
Intermediate expected output :
Col1 Col2 Col3
1 1.0 1.0 1.0
2 1.0 NaN NaN
3 2.0 NaN 5.0
4 3.0 3.0 1.0
Then, I would like to replace the remaining NaN by the mean of the nearest value below which is not a NaN, and the one above.
Final expected output :
Col1 Col2 Col3
0 1.0 1.0 1.0
1 1.0 2.0 3.0
2 2.0 2.0 5.0
3 3.0 3.0 1.0
I know I can have the positions of NaN in my dataframe through
df.isna()
But I can't solve my problem from there. How could I do this?
My approach:
# identify the rows that contain no NaN
s = df.notnull().all(1)
# trim the NaN-containing rows at the beginning and at the end:
new_df = df.loc[s.idxmax():s[::-1].idxmax()]
# average the nearest non-NaN values above and below:
new_df = (new_df.ffill() + new_df.bfill()) / 2
Output:
Col1 Col2 Col3
1 1.0 1.0 1.0
2 1.0 2.0 3.0
3 2.0 2.0 5.0
4 3.0 3.0 1.0
Another option would be to use DataFrame.interpolate with round:
nans = df.notna().all(axis=1).cumsum().drop_duplicates()
low, high = nans.idxmin(), nans.idxmax()
df.loc[low+1: high].interpolate().round()
Col1 Col2 Col3
1 1.0 1.0 1.0
2 1.0 2.0 3.0
3 2.0 2.0 5.0
4 3.0 3.0 1.0
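The first approach, end to end as a runnable sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(Col1=[np.nan, 1, 1, 2, 3, 8, 7],
                       Col2=[1, 1, np.nan, np.nan, 3, np.nan, 4],
                       Col3=[1, 1, np.nan, 5, 1, 1, np.nan]))

# Rows with no NaN at all
s = df.notnull().all(axis=1)
# Trim leading/trailing rows up to the first/last fully valid row
trimmed = df.loc[s.idxmax(): s[::-1].idxmax()]
# Interior NaNs become the mean of the nearest values above and below
result = (trimmed.ffill() + trimmed.bfill()) / 2
```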

How to concat columns keeping the sequence unchanged in 2 pandas dataframes

I have 2 dataframes and I want to concat them as follows:
df1:
index 394 min FIC-2000 398 min FFC
0 Recycle Gas min 20K20 Compressor min 20k
1 TT date kg/h AT date ..
2 nan 2011-03-02 08:00:00 -20.7 2011-03-02 08:00:00
3 nan 2011-03-02 08:00:10 -27.5 ...
df2:
index Unnamed:0 0 1 .. 394 395 .....
0 Service Prop Prop1 Recycle Gas RecG
the output df3 should be like this:
df3
index Unnamed:0 0 .. 394 395..
0 Service Prop Recycle Gas RecG
1 Recycle Gas min FIC-2000
2 min 20K20
3 TT date kg/h
4 nan 2011-03-02 08:00:00 -20.7
5 nan 2011-03-02 08:00:10 -27.5
I've tried this code:
df3 = pd.concat([df1, df2], axis=1)
but this just concats index 394, and the rest of df1 is appended to the end of df2.
Any idea how to do it?
Just change to axis=0.
Consider this:
Input:
>>> df
col1 col2 col3
0 1 4 2
1 2 1 5
2 3 6 319
>>> df_1
col4 col5 col6
0 1 4 12
1 32 12 3
2 3 2 319
>>> df_2
col1 col3 col6
0 12 14 2
1 4 132 3
2 23 22 9
Concat mismatched (per column name)
>>> pd.concat([df, df_1], axis=0)
col1 col2 col3 col4 col5 col6
0 1.0 4.0 2.0 NaN NaN NaN
1 2.0 1.0 5.0 NaN NaN NaN
2 3.0 6.0 319.0 NaN NaN NaN
0 NaN NaN NaN 1.0 4.0 12.0
1 NaN NaN NaN 32.0 12.0 3.0
2 NaN NaN NaN 3.0 2.0 319.0
Concat matching:
>>> pd.concat([df, df_1, df_2], axis=0)
col1 col2 col3 col4 col5 col6
0 1.0 4.0 2.0 NaN NaN NaN
1 2.0 1.0 5.0 NaN NaN NaN
2 3.0 6.0 319.0 NaN NaN NaN
0 NaN NaN NaN 1.0 4.0 12.0
1 NaN NaN NaN 32.0 12.0 3.0
2 NaN NaN NaN 3.0 2.0 319.0
0 12.0 NaN 14.0 NaN NaN 2.0
1 4.0 NaN 132.0 NaN NaN 3.0
2 23.0 NaN 22.0 NaN NaN 9.0
Concat matched, filling NaNs (analogously you can fill Nones):
>>> pd.concat([df, df_1, df_2], axis=0).fillna(0) #in case you wish to prettify it, maybe in case of strings do .fillna('')
col1 col2 col3 col4 col5 col6
0 1.0 4.0 2.0 0.0 0.0 0.0
1 2.0 1.0 5.0 0.0 0.0 0.0
2 3.0 6.0 319.0 0.0 0.0 0.0
0 0.0 0.0 0.0 1.0 4.0 12.0
1 0.0 0.0 0.0 32.0 12.0 3.0
2 0.0 0.0 0.0 3.0 2.0 319.0
0 12.0 0.0 14.0 0.0 0.0 2.0
1 4.0 0.0 132.0 0.0 0.0 3.0
2 23.0 0.0 22.0 0.0 0.0 9.0
EDIT
Triggered by the conversation with OP in the comment section below.
So you do:
(1) To concat dataframes
df3=pd.concat([df1,df2], axis=0)
(2) To join another dataframe on them:
df5=pd.merge(df3, df4[["FIC", "min"]], on="FIC", how="outer")
(you may want to set the suffixes parameter if you think it's relevant)
REF: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
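df3 and df4 come from an off-page conversation, so the frames below are hypothetical stand-ins just to illustrate the outer merge on a shared key:

```python
import pandas as pd

# Hypothetical stand-ins for the df3/df4 of the discussion
df3 = pd.DataFrame({"FIC": ["a", "b"], "val": [1, 2]})
df4 = pd.DataFrame({"FIC": ["b", "c"], "min": [10, 20]})

# Outer merge keeps keys present in either frame, padding misses with NaN
df5 = pd.merge(df3, df4[["FIC", "min"]], on="FIC", how="outer")
```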
