Cleaning outliers inside a column with interpolation - python

I'm trying to do the following.
I have some data with invalid values (x <= 0 or x >= 1100) inside a dataframe.
I am trying to change those values to values inside an acceptable range.
For the time being, this is what I do code-wise:
def while_non_nan(A, k):
    init = k
    if k+1 >= len(A)-1:
        return A.iloc[k-1]
    while np.isnan(A[k+1]):
        k += 1
    # Calculate the value.
    n = k-init+1
    value = (n*A.iloc[init-1] + A.iloc[k])/(n+1)
    return value

evoli.loc[evoli['T1'] >= 1100, 'T1'] = np.nan
evoli.loc[evoli['T1'] <= 0, 'T1'] = np.nan
inds = np.where(np.isnan(evoli))
# Place column means in the indices. Align the arrays using take
for k in inds[0]:
    evoli['T1'].iloc[k] = while_non_nan(evoli['T1'], k)
I transform the outlier values into NaN.
Afterwards, I get the positions of those NaN.
Finally, I replace each NaN with the mean of the previous value and the next one.
Since several NaN can be next to each other, while_non_nan searches for the next non-NaN value and computes the weighted mean.
Example of what I'm hoping to get:
Input:
[nan 0 1 2 nan 4 nan nan 7 nan ]
Output:
[0 0 1 2 3 4 5 6 7 7 ]
Hope it is clear enough. Thanks!

Pandas has a builtin interpolation you could use after setting your limits to NaN:
import numpy as np
import pandas as pd
df = pd.DataFrame({"T1": [1, 2, np.nan, 3, 5, np.nan, np.nan, 4, np.nan]})
df["T1"] = df["T1"].interpolate(method='linear').ffill().bfill()
print(df)
interpolate is a Series/DataFrame method that fills NaN values with the specified interpolation method (linear in this case). Chaining .bfill() (backward fill) and .ffill() (forward fill) ensures the first and last items are also replaced if needed, with the second and second-to-last items respectively. If you want a fancier strategy for the first and last items, you need to write it yourself.
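As a quick check (not part of the original answer), applying the same chain to the example from the question reproduces the desired output exactly:
import numpy as np
import pandas as pd
s = pd.Series([np.nan, 0, 1, 2, np.nan, 4, np.nan, np.nan, 7, np.nan])
# Interior NaNs are linearly interpolated; the trailing NaN is padded
# forward, and the leading NaN is filled by bfill.
print(s.interpolate(method='linear').ffill().bfill().tolist())
# [0.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 7.0]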

Related

Creating a DataFrame from a dictionary of Series results in lost indices and NaNs

dict_with_series = {'Even': pd.Series([2,4,6,8,10]), 'Odd': pd.Series([1,3,5,7,9])}
Data_frame_using_dic_Series = pd.DataFrame(dict_with_series)
# Data_frame_using_dic_Series = pd.DataFrame(dict_with_series, index=[1,2,3,4,5]) gives a NaN value and I don't know why
display(Data_frame_using_dic_Series)
I tried labeling the index, but when I did, it eliminated the first row and instead printed an extra row at the bottom with NaN values. Can anyone explain why it behaves like this? Have I done something wrong?
If I don't use the index labeling argument, it works fine.
When you run:
Data_frame_using_dic_Series = pd.DataFrame(dict_with_series,index=[1,2,3,4,5])
you ask pandas to use only the labels 1-5 from the provided Series, but a Series is indexed from 0 by default, so the data is reindexed: label 0 is dropped, and label 5, which does not exist in the Series, is filled with NaN.
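A minimal demonstration of that alignment (the NaN row is the requested label 5, which the Series don't have):
import pandas as pd
dict_with_series = {'Even': pd.Series([2, 4, 6, 8, 10]),
                    'Odd': pd.Series([1, 3, 5, 7, 9])}
# The Series carry labels 0-4; asking for 1-5 drops label 0 and
# introduces label 5, which has no data in either Series.
print(pd.DataFrame(dict_with_series, index=[1, 2, 3, 4, 5]))
#    Even  Odd
# 1   4.0  3.0
# 2   6.0  5.0
# 3   8.0  7.0
# 4  10.0  9.0
# 5   NaN  NaN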
If you want to change the index, do it afterwards:
Data_frame_using_dic_Series = (
    pd.DataFrame(dict_with_series)
    .set_axis([1, 2, 3, 4, 5])
)
Output:
Even Odd
1 2 1
2 4 3
3 6 5
4 8 7
5 10 9

How can I forward fill a dataframe column where the limit of rows filled is based on the value of a cell in another column?

So I am trying to forward fill a column where the limit is the value in another column. This is the code I run, and the error message I get is below.
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['NM'] = [0, 0, 1, np.nan, np.nan, np.nan, 0]
df['length'] = [0, 0, 2, 0, 0, 0, 0]
print(df)
NM length
0 0.0 0
1 0.0 0
2 1.0 2
3 NaN 0
4 NaN 0
5 NaN 0
6 0.0 0
df['NM'] = df['NM'].fillna(method='ffill', limit=df['length'])
print(df)
ValueError: Limit must be an integer
The dataframe I want looks like this:
NM length
0 0.0 0
1 0.0 0
2 1.0 2
3 1.0 0
4 1.0 0
5 NaN 0
6 0.0 0
Thanks in advance for any help you can provide!
I do not think you want to use ffill for this instance.
Rather, I would recommend filtering to where length is greater than 0, then iterating through those rows to write each row's NM value into the next length rows.
for row in df.loc[df.length.gt(0)].reset_index().to_dict(orient='records'):
    df.loc[row['index']+1:row['index']+row['length'], 'NM'] = row['NM']
To better break this down:
Get the rows containing change information; be sure to include the index:
df.loc[df.length.gt(0)].reset_index().to_dict(orient='records')
Iterate through them... I prefer to_dict for performance reasons on large datasets. It is a habit.
The loc assignment then sets the next length rows of NM to the NM value of the row that defines the length.
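Running the loop on the sample frame from the question confirms it produces the desired output (a quick check, not part of the original answer):
import numpy as np
import pandas as pd
df = pd.DataFrame({"NM": [0, 0, 1, np.nan, np.nan, np.nan, 0],
                   "length": [0, 0, 2, 0, 0, 0, 0]})
# Only row 2 has length > 0, so the two rows after it get its NM value.
for row in df.loc[df.length.gt(0)].reset_index().to_dict(orient='records'):
    df.loc[row['index']+1:row['index']+row['length'], 'NM'] = row['NM']
print(df['NM'].tolist())
# [0.0, 0.0, 1.0, 1.0, 1.0, nan, 0.0]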
You can first group the dataframe by the length column before filling. The only issue is that for the first group in your example the limit would be 0, which causes an error, so we make sure it's at least 1 with max. This might cause unexpected results if there are NaN values before the first non-zero value in length, but from the given data it's not clear whether that can happen.
# make groups
m = df.length.gt(0).cumsum()
# fill the column
df["NM"] = df.groupby(m).apply(
    lambda f: f.NM.fillna(
        method="ffill",
        limit=max(f.length.iloc[0], 1))
).values
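To see how the groups fall out on the sample data (an illustration, assuming the df from the question):
m = df.length.gt(0).cumsum()
print(m.tolist())  # [0, 0, 1, 1, 1, 1, 1]
# Row 2 (length == 2) opens group 1, so f.length.iloc[0] is 2 there and
# only the two rows after it are forward filled; row 5 stays NaN.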

Optimizing a function to replace a row with a previous row, given a condition, in Pandas

I have a relatively large dataframe (~24000 rows and 15 columns) which has 2D coordinate data of rat movements, outputted by a neural network (DeepLabCut).
As part of this output data, there is a p-value score that measures how certain the neural network was when applying each label. I'm trying to filter out low-quality predictions by copying the previous row into its place each time a low p-value is encountered, which assumes that the rat remained still for that frame.
Here's my function thus far:
def checkPVals(DataFrame, CutOff):
    for Cols in DataFrame.columns.values:
        if Cols % 3 == 0:
            for Vals in DataFrame.index.values:
                if float(DataFrame[Cols][Vals]) < CutOff:
                    if (Vals != 0):
                        PreviousRow = DataFrame.loc[Vals - 1, Cols - 3:Cols]
                        DataFrame.loc[Vals, Cols - 3:Cols] = PreviousRow
    return(DataFrame)
Here is a sample of the input data frame:
pd.DataFrame(data={
    "x": [1, 2, 3, 4],
    "y": [5, 4, 3, 2],
    "likelihood": [1, 1, 0.3, 1]
})
Here is a sample of the desired output:
x y Pval
0 1 5 1.0
1 2 4 1.0
2 2 4 1.0
3 4 2 1.0
With the idea being that row index 2 is replaced with values from row index 1, such that when the inter-frame Euclidean distance between these coordinates is calculated, the distance is 0, implying the label (rat) has not moved.
Clearly, my current implementation is very inefficient. I was looking at iterrows(), but that converts my data into a series and messes with it. My other thought was to convert the p-value columns into np.arrays, iterate through those, take the indices of the p-values below the threshold, and then swap the rows for the previous one in an iterative manner. However, I feel like that would take just as long.
Any help is very much appreciated. Thank you!
I'm pretty sure I understood what you are attempting to do. If you could update your question to have a sample output that's paired with your sample input, that would be greatly beneficial.
If I understood correctly, you should be using a vectorized approach instead of explicit looping (this will massively speed up your data wrangling). Essentially you can mask the rows of the dataframe depending on whether or not the "likelihood" column is above a certain value. Once you mask the low likelihoods away (i.e. replace those values with NaN), you can simply forward fill the entire dataframe to fill in the "bad" rows with the previous row's values.
df = pd.DataFrame(data={
    "x": [1, 2, 3, 4],
    "y": [5, 4, 3, 2],
    "likelihood": [1, 1, 0.3, 1]
})
cutoff = 0.5
new_df = df.mask(df["likelihood"] < cutoff).ffill()
print(new_df)
print(new_df)
x y likelihood
0 1.0 5.0 1.0
1 2.0 4.0 1.0
2 2.0 4.0 1.0
3 4.0 2.0 1.0
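The asker's real frame has several (x, y, likelihood) triples rather than one, and the same masking idea extends per triple. A sketch, assuming hypothetical column names x1, y1, likelihood1, x2, y2, likelihood2:
import pandas as pd
df = pd.DataFrame({
    "x1": [1, 2, 3, 4], "y1": [5, 4, 3, 2], "likelihood1": [1, 1, 0.3, 1],
    "x2": [9, 8, 7, 6], "y2": [0, 1, 2, 3], "likelihood2": [1, 0.1, 1, 1],
})
cutoff = 0.5
# Mask each (x, y, likelihood) triple with its own likelihood column,
# then forward fill the masked cells from the row above.
for i in (1, 2):  # hypothetical triple suffixes
    cols = [f"x{i}", f"y{i}", f"likelihood{i}"]
    df[cols] = df[cols].mask(df[f"likelihood{i}"] < cutoff)
df = df.ffill()
print(df)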

Sorting pandas dataframe to get min value along diagonal

I have a pandas dataframe that is used for a heatmap. I would like the minimal value of each column to lie along the diagonal.
I've sorted the columns using
data = data.loc[:, data.min().sort_values().index]
This works. Now I just need to sort the values such that the index of the min value in the first column is row 0, then the min value of second column is row 1, and so on.
Example
import seaborn as sns
import pandas as pd
data = [[5, 1, 9],
        [7, 8, 6],
        [5, 3, 2]]
data = pd.DataFrame(data)
#sns.heatmap(data)
data = data.loc[:, data.min().sort_values().index]
#sns.heatmap(data) # Gives result in step 1
# Step 1: columns sorted by min value: 1, 2, 5
data = [[1, 9, 5],
        [8, 6, 7],
        [3, 2, 5]]
data = pd.DataFrame(data)
#sns.heatmap(data)
# How do I perform step two, maintaining column order?
# Step 2: rows sorted by min value: 1, 2, 7
data = [[1, 9, 5],
        [3, 2, 5],
        [8, 6, 7]]
data = pd.DataFrame(data)
sns.heatmap(data)
Is this possible in pandas in a clever way?
Setup
data = pd.DataFrame([[5, 1, 9], [7, 8, 6], [5, 3, 2]])
You can accomplish this by using argsort of the diagonal elements of your sorted DataFrame, then indexing the DataFrame using these values.
Step 1
Use your initial sort:
data = data.loc[:, data.min().sort_values().index]
1 2 0
0 1 9 5
1 8 6 7
2 3 2 5
Step 2
Use np.argsort with np.diag:
data.iloc[np.argsort(np.diag(data))]
1 2 0
0 1 9 5
2 3 2 5
1 8 6 7
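Putting both steps together (the same logic as above, just combined):
import numpy as np
import pandas as pd
data = pd.DataFrame([[5, 1, 9], [7, 8, 6], [5, 3, 2]])
# Step 1: order columns by their minimum value.
data = data.loc[:, data.min().sort_values().index]
# Step 2: reorder rows so the diagonal values are ascending.
data = data.iloc[np.argsort(np.diag(data))]
print(data)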
I'm not quite sure, but you've already done the following to sort the columns:
data = data.loc[:, data.min().sort_values().index]
The same trick can also be applied to sort the rows:
data = data.loc[data.min(axis=1).sort_values().index, :]
To move some values around so that the min value within each column is placed along the diagonal, you could try something like this:
for i in range(len(data)):
    min_index = data.iloc[:, i].idxmin()
    if data.iloc[i, i] != data.iloc[min_index, i]:
        data.iloc[i, i], data.iloc[min_index, i] = data.iloc[min_index, i], data.iloc[i, i]
Basically, just swap the min with the diagonal. Note that this swaps individual cells within a column, not whole rows, so the rows no longer match the original data.

Replace (every) element in a list by the median of the nearest neighbors

I have an array A, say :
import numpy as np
A = np.array([1,2,3,4,5,6,7,8])
And I wish to create a new array B by replacing each element in A by the median of its four nearest neighbors, without taking into account the value at the given position... for example:
B[2] = np.median([A[0], A[1], A[3], A[4]]) (=3)
The thing is that I need to perform this on a gigantic A and I want to optimize the running time, so I want to avoid for loops or similar. And... I don't care about the result at the edges.
I already tried scipy.ndimage.filters.median_filter, but it is not producing the desired output:
import scipy.ndimage
B = scipy.ndimage.filters.median_filter(A,footprint=[1,1,0,1,1],mode='wrap')
which produces B=[7,4,4,5,6,7,6,6], which is clearly not the correct answer.
Any idea is welcome.
One way could be to use np.roll to shift the numbers in your array, such as:
A_1 = np.roll(A,1)
# output: array([8, 1, 2, 3, 4, 5, 6, 7])
And then the same thing with rolling by -2, -1 and 2:
A_2 = np.roll(A,2)
A_m1 = np.roll(A,-1)
A_m2 = np.roll(A,-2)
Now you just need to average your 4 arrays, as for each index you have the 4 neighbors spread across them:
B = (A_1 + A_2 + A_m1 + A_m2)/4.
And as you said you don't care about the edges, I think it works for you!
EDIT: I guess I was so focused on the rolling idea that I mixed up mean and median; the median can be calculated with B = np.median([A_1, A_2, A_m1, A_m2], axis=0).
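Putting the corrected version together and checking it against the example from the question:
import numpy as np
A = np.array([1, 2, 3, 4, 5, 6, 7, 8])
# Each roll lines one of the four neighbours up with the current index.
shifted = [np.roll(A, s) for s in (2, 1, -1, -2)]
B = np.median(shifted, axis=0)
print(B[2])  # 3.0, i.e. np.median([A[0], A[1], A[3], A[4]])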
I'd make a rolling, central window of length 5 in pandas, and apply the median function to the values of the window, the middle one masked away:
import numpy as np
import pandas as pd
A = np.array([1,2,3,4,5,6,7,8])
# Length-5 boolean footprint with the centre switched off, so the
# window's own value is excluded from the median.
mask = np.ones(5, bool)
mask[5//2] = False
df = pd.DataFrame(A)
r5 = df.rolling(5, center=True)
result = r5.apply(lambda x: np.median(x[mask]))
result
result
0
0 NaN
1 NaN
2 3.0
3 4.0
4 5.0
5 6.0
6 NaN
7 NaN
