How to detect deviation with pandas DataFrame? - python

I have some data that contains 5 columns and 1000 rows. Now I just picked up 3 random rows:
5 5 5 0.1 0.2
4 4 4 4 0.3
4 3 3 3 1
How can I detect the deviation in each row? For example, the first row contains two near-zero values (0.1 and 0.2) and the second row contains one (0.3). I tried using the mean, but that is not the right solution.

You could do something like this:
n = 3
new_df = df.loc[:, ~(df.diff(axis=1).abs() > n).any()]
print(new_df)
col1 col2 col3
0 5.0 5.0 5.0
1 4.0 4.0 4.0
2 4.0 3.0 3.0
new_df = df.loc[:, (df.diff(axis=1).abs() > n).any()]
print(new_df)
col4 col5
0 0.1 0.2
1 4.0 0.3
2 3.0 1.0
By adjusting the threshold n you can select whichever interval you want.
Differences
print(df.diff(axis=1).abs())
col1 col2 col3 col4 col5
0 NaN 0.0 0.0 4.9 0.1
1 NaN 0.0 0.0 0.0 3.7
2 NaN 1.0 0.0 0.0 2.0
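Putting the answer together into a runnable sketch (the column names col1 through col5 match the outputs above):

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame(
    [[5, 5, 5, 0.1, 0.2],
     [4, 4, 4, 4, 0.3],
     [4, 3, 3, 3, 1]],
    columns=["col1", "col2", "col3", "col4", "col5"],
)

n = 3  # maximum allowed jump between neighbouring columns

# Columns whose row-wise difference to the previous column ever exceeds n
mask = (df.diff(axis=1).abs() > n).any()
stable = df.loc[:, ~mask]     # col1, col2, col3
deviating = df.loc[:, mask]   # col4, col5
```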

Related

Pandas: Fillna with local average if a condition is met

Let's say I have data like this:
df = pd.DataFrame({'col1': [5, np.nan, 2, 2, 5, np.nan, 4], 'col2':[1,3,np.nan,np.nan,5,np.nan,4]})
print(df)
col1 col2
0 5.0 1.0
1 NaN 3.0
2 2.0 NaN
3 2.0 NaN
4 5.0 5.0
5 NaN NaN
6 4.0 4.0
How can I use fillna() to replace NaN values with the average of the prior and the succeeding value, if neither of them is NaN?
The result would look like this:
col1 col2
0 5.0 1.0
1 3.5 3.0
2 2.0 NaN
3 2.0 NaN
4 5.0 5.0
5 4.5 4.5
6 4.0 4.0
Also, is there a way of calculating the average from the previous n and succeeding n values (if none of them is NaN)?
We can shift the dataframe forwards and backwards, add the two shifted frames together, divide by two, and use the result with fillna:
s1, s2 = df.shift(), df.shift(-1)
df = df.fillna((s1 + s2) / 2)
col1 col2
0 5.0 1.0
1 3.5 3.0
2 2.0 NaN
3 2.0 NaN
4 5.0 5.0
5 4.5 4.5
6 4.0 4.0
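The follow-up about the previous n and succeeding n values isn't covered above; one hedged sketch uses rolling means over shifted frames, where min_periods=n enforces that all n neighbours on each side are non-NaN (n = 1 reproduces the answer above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [5, np.nan, 2, 2, 5, np.nan, 4],
                   'col2': [1, 3, np.nan, np.nan, 5, np.nan, 4]})

n = 1  # neighbours on each side; with n = 1 this matches the shift answer

# Mean over the n previous values; stays NaN unless all n are present
before = df.shift().rolling(n, min_periods=n).mean()
# Mean over the n succeeding values (same trick on the reversed frame)
after = df[::-1].shift().rolling(n, min_periods=n).mean()[::-1]

filled = df.fillna((before + after) / 2)
```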

How to change row values by a condition on a column (python, pandas)

Hey all, I want to change row values based on a condition on a column: where column "type" == "A", the values in columns col1 through col5 should become 1 if the value is bigger than 2, and 0 otherwise.
The data:
data={"col1":[np.nan,3,4,5,9,2,6],
"col2":[4,2,4,6,0,1,5],
"col3":[7,6,0,11,3,6,7],
"col4":[14,11,22,8,6,np.nan,9],
"col5":[0,5,7,3,8,2,9],
"type":["A","A","C","A","B","A","E"],
"number":["one","two","two","one","one","two","two"]}
df=pd.DataFrame.from_dict(data)
df
How I expect the data to look:
data={"col1":[0,1,4,1,9,0,6],
"col2":[1,0,4,1,0,0,5],
"col3":[1,1,0,1,3,1,7],
"col4":[1,1,22,1,6,0,9],
"col5":[0,1,7,1,1,0,9],
"type":["A","A","C","A","B","A","E"],
"number":["one","two","two","one","one","two","two"]}
df=pd.DataFrame.from_dict(data)
df
You can use df.query to get all type-A rows, then df._get_numeric_data (or df.select_dtypes('number')) to get the numeric fields, compare with df.gt, cast the result to int with df.astype, and finally write the new values back with df.update:
df.update(df.query('type == "A"')._get_numeric_data().gt(2).astype(int))
#.select_dtypes('number')
df
col1 col2 col3 col4 col5 type number
0 0.0 1.0 1.0 1.0 0.0 A one
1 1.0 0.0 1.0 1.0 1.0 A two
2 4.0 4.0 0.0 22.0 7.0 C two
3 1.0 1.0 1.0 1.0 1.0 A one
4 9.0 0.0 3.0 6.0 8.0 B one
5 0.0 0.0 1.0 0.0 0.0 A two
6 6.0 5.0 7.0 9.0 9.0 E two
Use DataFrame.loc to select the rows where the condition equals A and the columns between the first and last column name, compare for greater-than with DataFrame.gt, convert the boolean mask to integers to map True/False to 1/0, and finally update with DataFrame.update:
df.update(df.loc[df['type'].eq('A'), 'col1':'col5'].gt(2).astype(int))
print (df)
col1 col2 col3 col4 col5 type number
0 0.0 1.0 1.0 1.0 0.0 A one
1 1.0 0.0 1.0 1.0 1.0 A two
2 4.0 4.0 0.0 22.0 7.0 C two
3 1.0 1.0 1.0 1.0 1.0 A one
4 9.0 0.0 3.0 6.0 8.0 B one
5 0.0 0.0 1.0 0.0 0.0 A two
6 6.0 5.0 7.0 9.0 9.0 E two
Or by assign back:
m = df['type'].eq('A')
df.loc[m, 'col1':'col5'] = df.loc[m, 'col1':'col5'].gt(2).astype(int)
print (df)
col1 col2 col3 col4 col5 type number
0 0.0 1 1 1.0 0 A one
1 1.0 0 1 1.0 1 A two
2 4.0 4 0 22.0 7 C two
3 1.0 1 1 1.0 1 A one
4 9.0 0 3 6.0 8 B one
5 0.0 0 1 0.0 0 A two
6 6.0 5 7 9.0 9 E two
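For reference, the second approach end to end as a self-contained sketch (note that NaN compares as False against 2, so NaN cells in type-A rows become 0, matching the outputs above):

```python
import numpy as np
import pandas as pd

data = {"col1": [np.nan, 3, 4, 5, 9, 2, 6],
        "col2": [4, 2, 4, 6, 0, 1, 5],
        "col3": [7, 6, 0, 11, 3, 6, 7],
        "col4": [14, 11, 22, 8, 6, np.nan, 9],
        "col5": [0, 5, 7, 3, 8, 2, 9],
        "type": ["A", "A", "C", "A", "B", "A", "E"],
        "number": ["one", "two", "two", "one", "one", "two", "two"]}
df = pd.DataFrame(data)

# Mask of type-A rows; threshold col1..col5 against 2 only on those rows
m = df["type"].eq("A")
df.loc[m, "col1":"col5"] = df.loc[m, "col1":"col5"].gt(2).astype(int)
```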

How to choose two rows in a dataframe, calculate the average of their values in each column, and append a new row with the averages to the dataframe

I have a pandas dataframe :
col1 col2 col3
0 8 7 5
1 6 2 17
2 3 1 21
3 4 3 9
I want to calculate the average of each column over row 1 and row 2, append the new row to my pandas dataframe, and get:
col1 col2 col3
0 8 7 5
1 6 2 17
2 3 1 21
3 4 3 9
4 4.5 1.5 19
You can do a concat:
pd.concat((df, df.iloc[[1,2]].mean().to_frame().T)).reset_index(drop=True)
Output:
col1 col2 col3
0 8.0 7.0 5.0
1 6.0 2.0 17.0
2 3.0 1.0 21.0
3 4.0 3.0 9.0
4 4.5 1.5 19.0
Or an append (note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so prefer concat on newer versions):
df.append(df.iloc[[1,2]].mean().rename(len(df)))
Output:
col1 col2 col3
0 8.0 7.0 5.0
1 6.0 2.0 17.0
2 3.0 1.0 21.0
3 4.0 3.0 9.0
4 4.5 1.5 19.0
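Since DataFrame.append is gone in pandas 2.0, the concat route can be written as a self-contained sketch:

```python
import pandas as pd

df = pd.DataFrame({"col1": [8, 6, 3, 4],
                   "col2": [7, 2, 1, 3],
                   "col3": [5, 17, 21, 9]})

# Mean of rows 1 and 2 as a one-row frame, appended below df
avg = df.iloc[[1, 2]].mean().to_frame().T
out = pd.concat([df, avg], ignore_index=True)
```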

How to delete the first and last rows with NaN of a dataframe and replace the remaining NaN with the average of the values below and above?

Let's take this dataframe as a simple example:
df = pd.DataFrame(dict(Col1=[np.nan,1,1,2,3,8,7], Col2=[1,1,np.nan,np.nan,3,np.nan,4], Col3=[1,1,np.nan,5,1,1,np.nan]))
Col1 Col2 Col3
0 NaN 1.0 1.0
1 1.0 1.0 1.0
2 1.0 NaN NaN
3 2.0 NaN 5.0
4 3.0 3.0 1.0
5 8.0 NaN 1.0
6 7.0 4.0 NaN
I would first like to remove rows from the top and bottom until neither the first nor the last row contains a NaN.
Intermediate expected output :
Col1 Col2 Col3
1 1.0 1.0 1.0
2 1.0 NaN NaN
3 2.0 NaN 5.0
4 3.0 3.0 1.0
Then, I would like to replace the remaining NaN by the mean of the nearest value below which is not a NaN, and the one above.
Final expected output :
Col1 Col2 Col3
0 1.0 1.0 1.0
1 1.0 2.0 3.0
2 2.0 2.0 5.0
3 3.0 3.0 1.0
I know I can have the positions of NaN in my dataframe through
df.isna()
But I can't solve my problem from there. How could I do this?
My approach:
# identify the rows that contain no NaN
s = df.notnull().all(1)
# trim the NaN-containing rows at the beginning and at the end:
new_df = df.loc[s.idxmax():s[::-1].idxmax()]
# average the nearest non-NaN values above and below:
new_df = (new_df.ffill() + new_df.bfill()) / 2
Output:
Col1 Col2 Col3
1 1.0 1.0 1.0
2 1.0 2.0 3.0
3 2.0 2.0 5.0
4 3.0 3.0 1.0
Another option would be to use DataFrame.interpolate with round:
nans = df.notna().all(axis=1).cumsum().drop_duplicates()
low, high = nans.idxmin(), nans.idxmax()
df.loc[low+1: high].interpolate().round()
Col1 Col2 Col3
1 1.0 1.0 1.0
2 1.0 2.0 3.0
3 2.0 2.0 5.0
4 3.0 3.0 1.0
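The first approach, end to end as a runnable sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(Col1=[np.nan, 1, 1, 2, 3, 8, 7],
                       Col2=[1, 1, np.nan, np.nan, 3, np.nan, 4],
                       Col3=[1, 1, np.nan, 5, 1, 1, np.nan]))

# Rows with no NaN at all
s = df.notnull().all(axis=1)
# Trim leading/trailing rows up to the first/last fully valid row
trimmed = df.loc[s.idxmax(): s[::-1].idxmax()]
# Interior NaNs become the mean of the nearest values above and below
result = (trimmed.ffill() + trimmed.bfill()) / 2
```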

How to concat columns keeping the sequence unchanged in 2 pandas dataframes

I have 2 dataframes and I want to concat them as follows:
df1:
index 394 min FIC-2000 398 min FFC
0 Recycle Gas min 20K20 Compressor min 20k
1 TT date kg/h AT date ..
2 nan 2011-03-02 08:00:00 -20.7 2011-03-02 08:00:00
3 nan 2011-03-02 08:00:10 -27.5 ...
df2:
index Unnamed:0 0 1 .. 394 395 .....
0 Service Prop Prop1 Recycle Gas RecG
the output df3 should be like this:
df3
index Unnamed:0 0 .. 394 395..
0 Service Prop Recycle Gas RecG
1 Recycle Gas min FIC-2000
2 min 20K20
3 TT date kg/h
4 nan 2011-03-02 08:00:00 -20.7
5 nan 2011-03-02 08:00:10 -27.5
I've tried this code:
df3 = pd.concat([df1, df2], axis=1)
but this just concats index 394, and the rest of df1 is appended to the end of df2.
Any idea how to do it?
Just change to axis=0.
Consider this:
Input:
>>> df
col1 col2 col3
0 1 4 2
1 2 1 5
2 3 6 319
>>> df_1
col4 col5 col6
0 1 4 12
1 32 12 3
2 3 2 319
>>> df_2
col1 col3 col6
0 12 14 2
1 4 132 3
2 23 22 9
Concat mismatched (per column name)
>>> pd.concat([df, df_1], axis=0)
col1 col2 col3 col4 col5 col6
0 1.0 4.0 2.0 NaN NaN NaN
1 2.0 1.0 5.0 NaN NaN NaN
2 3.0 6.0 319.0 NaN NaN NaN
0 NaN NaN NaN 1.0 4.0 12.0
1 NaN NaN NaN 32.0 12.0 3.0
2 NaN NaN NaN 3.0 2.0 319.0
Concat matching:
>>> pd.concat([df, df_1, df_2], axis=0)
col1 col2 col3 col4 col5 col6
0 1.0 4.0 2.0 NaN NaN NaN
1 2.0 1.0 5.0 NaN NaN NaN
2 3.0 6.0 319.0 NaN NaN NaN
0 NaN NaN NaN 1.0 4.0 12.0
1 NaN NaN NaN 32.0 12.0 3.0
2 NaN NaN NaN 3.0 2.0 319.0
0 12.0 NaN 14.0 NaN NaN 2.0
1 4.0 NaN 132.0 NaN NaN 3.0
2 23.0 NaN 22.0 NaN NaN 9.0
Concat matched, filling NaNs (analogously you can fill Nones):
>>> pd.concat([df, df_1, df_2], axis=0).fillna(0) #in case you wish to prettify it, maybe in case of strings do .fillna('')
col1 col2 col3 col4 col5 col6
0 1.0 4.0 2.0 0.0 0.0 0.0
1 2.0 1.0 5.0 0.0 0.0 0.0
2 3.0 6.0 319.0 0.0 0.0 0.0
0 0.0 0.0 0.0 1.0 4.0 12.0
1 0.0 0.0 0.0 32.0 12.0 3.0
2 0.0 0.0 0.0 3.0 2.0 319.0
0 12.0 0.0 14.0 0.0 0.0 2.0
1 4.0 0.0 132.0 0.0 0.0 3.0
2 23.0 0.0 22.0 0.0 0.0 9.0
EDIT
Triggered by the conversation with OP in the comment section below.
So you do:
(1) To concat dataframes
df3=pd.concat([df1,df2], axis=0)
(2) To join another dataframe on them:
df5=pd.merge(df3, df4[["FIC", "min"]], on="FIC", how="outer")
(you may want to set the suffixes parameter if you think it's relevant)
REF: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
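df3 and df4 come from an off-page conversation, so the frames below are hypothetical stand-ins just to illustrate the outer merge on a shared key:

```python
import pandas as pd

# Hypothetical stand-ins for the df3/df4 of the discussion
df3 = pd.DataFrame({"FIC": ["a", "b"], "val": [1, 2]})
df4 = pd.DataFrame({"FIC": ["b", "c"], "min": [10, 20]})

# Outer merge keeps keys present in either frame, padding misses with NaN
df5 = pd.merge(df3, df4[["FIC", "min"]], on="FIC", how="outer")
```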
