Detecting outliers in a DataFrame column with small value changes in pandas - python

I am working with a column whose values should change only slightly between rows. The values are physical measurements, and due to environmental factors a measurement can be incorrect, with very high increments between consecutive samples. The rate of change is an input to the problem, as it can be adjusted to the precision needs of this outlier detection.
The detection method could either calculate the mean of the values seen so far and mark as outliers any values that exceed it by the given rate of change, or check the value changes between rows: mark the index where the difference was greater than the rate of change, and the index where the values returned to within the accepted rate of change with respect to the last value before the ones marked as outliers. The first approach could be harder, since the mean should be calculated only from correct values, i.e. values marked as outliers should not be included in the calculation of the mean.
The correct solution should return the list of indices that indicate the outliers, which would then be used to set the corresponding values to e.g. NaN, or to fill them in with an interpolation method.
Example
df = pd.DataFrame({'small_changing': [5.14, 5.18, 5.22, 5.18, 5.20, 5.17, 5.25, 5.55, 5.62, 5.78, 6.21, 6.13, 5.71, 5.35, 5.29, 5.24, 5.16, 5.18, 5.20, 5.15, 5.17, 5.00, 4.96, 4.88, 4.71, 4.65, 4.73, 4.79, 4.89, 4.92, 5.05, 5.11, 5.14, 5.17, 5.22, 5.24, 5.18, 5.20]})
Assuming a rate of change of 0.15, there are two outlier groups to detect with the second approach, where the difference between rows is taken into account.
The first group corresponds to the index values [7, 12], because the difference between rows 6 and 7 is 0.3, which is higher than the 0.15 limit, and the difference between rows 6 and 13 is 0.1, row 13 being the first row whose difference is within the 0.15 limit.
The second group corresponds to the index values [21, 29], because the difference between rows 20 and 21 is 0.17, which is higher than the 0.15 limit, and the difference between rows 20 and 30 is 0.12, row 30 being the first row whose difference is within the 0.15 limit.
Result for this example: [7, 8, 9, 10, 11, 12, 21, 22, 23, 24, 25, 26, 27, 28, 29]

I hope this is helpful. It isn't very Pythonic, but it works:
def outlier_detection(points, limit):
    outliers_index = []
    k = 0  # number of consecutive outliers seen so far
    for i in range(len(points) - 1):
        # Compare the next point against the last known-good point (i - k).
        if abs(points[i - k] - points[i + 1]) >= limit:
            k += 1
            outliers_index.append(i + 1)
        else:
            k = 0
    return outliers_index
outlier_detection(df['small_changing'].values, 0.15)
OUT: [7, 8, 9, 10, 11, 12, 21, 22, 23, 24, 25, 26, 27, 28, 29]
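As a sketch of the follow-up step the question describes (not part of this answer), the returned indices can be used to blank out the outliers and interpolate over them:

import numpy as np

# Hedged usage example: mask the detected outliers, then fill by interpolation.
outliers = outlier_detection(df['small_changing'].values, 0.15)
df.loc[outliers, 'small_changing'] = np.nan
df['small_changing'] = df['small_changing'].interpolate()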

This might save time on sparsely distributed outliers in a big dataset:
def df_outlier(df, threshold=0.15):
    column = df.columns[0]
    df["outlier"] = False
    # Difference to the next row, aligned so each row holds the size of the jump after it.
    df_difference = df.copy()
    df_difference["difference"] = abs(df[column] - df[column].shift(1)).shift(-1)
    df_difference = df_difference.loc[df_difference["difference"] > threshold]
    # Only the rows where a jump starts are inspected, not the whole frame.
    for index in df_difference.index:
        row = df.loc[index]
        if not row["outlier"]:
            df_check = df[index + 1:].copy()
            df_check["a_difference"] = abs(df_check[column] - row[column])
            df_check.loc[df_check["a_difference"] > threshold, "outlier"] = True
            # Flag everything from the jump up to the first row back within the threshold.
            df.loc[(df.index >= df_check.index[0])
                   & (df.index < df_check["outlier"].ne(True).idxmax()), "outlier"] = True
    return list(df.loc[df["outlier"]].index)
I am using this.

Related

Python get specific value from HDF table

I have two tables. The first one contains 300 rows; each row represents a case with 3 columns, two of which hold constant values characterizing the case. The second table is my data table collected from sensors; it contains the same indicators as the first, except the case column. The idea is to detect which case each line of the second table belongs to, knowing that the data are not identical to the first table's but fall within its range.
example:
First table:
[[1500, 22, 0], [1100, 40, 1], [2400, 19, 2]]
columns=['analog', 'temperature', 'case']
Second table:
[[1420, 20], [1000, 39], [2300, 29]]
columns=['analog', 'temperature']
I want to detect which case my first row (1420, 20) belongs to.
You can simply use a classifier; k-NN, for instance:
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

df = pd.DataFrame([[1500, 22, 0], [1100, 40, 1], [2400, 19, 2]],
                  columns=['analog', 'temperature', 'case'])
df1 = pd.DataFrame([[1420, 20], [1000, 39], [2300, 29]],
                   columns=['analog', 'temperature'])
# 1-nearest-neighbour with Euclidean distance (Minkowski, p=2)
classifier = KNeighborsClassifier(n_neighbors=1, metric='minkowski', p=2)
classifier.fit(df[['analog', 'temperature']], df['case'])
df1['case'] = classifier.predict(df1)
Output of df1:
   analog  temperature  case
0    1420           20     0
1    1000           39     1
2    2300           29     2
So, the first row (1420, 20) in df1 (the 2nd table) belongs to case 0.
What do you mean by belong? [1420, 20] belongs to [?,?,?]?

How to plot a histogram in matplotlib in python?

I know how to plot a histogram when individual datapoints are given like:
(33, 45, 54, 33, 21, 29, 15, ...)
by simply using something like matplotlib.pyplot.hist(x, bins=10),
but what if I only have grouped data like:
Marks   Number of students
0-10    8
10-20   12
20-30   24
30-40   26
and so on.
I know that I can use bar plots to mimic a histogram by changing the xticks, but what if I want to do this using only the hist function of matplotlib.pyplot?
Is it possible to do this?
You can build the hist() params manually and use the existing value counts as weights.
Say you have this df:
>>> df = pd.DataFrame({'Marks': ['0-10', '10-20', '20-30', '30-40'], 'Number of students': [8, 12, 24, 26]})
Marks Number of students
0 0-10 8
1 10-20 12
2 20-30 24
3 30-40 26
The bins are all the unique boundary values in Marks:
>>> bins = pd.unique(df.Marks.str.split('-', expand=True).astype(int).values.ravel())
array([ 0, 10, 20, 30, 40])
Choose one x value per bin, e.g. the left edge to make it easy:
>>> x = bins[:-1]
array([ 0, 10, 20, 30])
Use the existing value counts (Number of students) as weights:
>>> weights = df['Number of students'].values
array([ 8, 12, 24, 26])
Then plug these into hist():
>>> plt.hist(x=x, bins=bins, weights=weights)
One possibility is to “ungroup” data yourself.
For example, for the 8 students with a mark between 0 and 10, you can generate 8 data points with a value of 5 (the mean). For the 12 with a mark between 10 and 20, you can generate 12 data points with a value of 15.
However, the “ungrouped” data will only be an approximation of the real data. Thus, it is probably better to just use a matplotlib.pyplot.bar plot.
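A minimal sketch of that ungrouping idea, reusing the counts and bin boundaries from the example above:

import numpy as np
import matplotlib.pyplot as plt

# Repeat each bin midpoint by its count to rebuild approximate data points.
midpoints = np.array([5, 15, 25, 35])   # centers of 0-10, 10-20, 20-30, 30-40
counts = np.array([8, 12, 24, 26])      # number of students per bin
ungrouped = np.repeat(midpoints, counts)
plt.hist(ungrouped, bins=[0, 10, 20, 30, 40])
plt.show()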

Pandas Python multiply column values by choosing coefficients from row of another dataframe

I cannot figure out how to multiply a dataframe by the values selected as a row from another dataframe.
I have a number of factors that I observe for variables in a universe:
df_observations = pd.DataFrame(
    [['variable_1', 1, 1.1, 1.2],
     ['variable_2', 2, 2.1, 2.2],
     ['variable_3', 3, 3.1, 3.2]],
    columns=['observation', 'factor_1', 'factor_2', 'factor_3'])
df_observations.set_index(['observation'], inplace=True)
df_observations
For each of the factors, I want to multiply by a coefficient conditional on a state of the system:
df_factor_state_coeff = pd.DataFrame(
    [['state_1', 10, 11, 12],
     ['state_2', 20, 21, 22],
     ['state_3', 30, 31, 32],
     ['state_4', 40, 41, 42]],
    columns=['sys_state', 'factor_1', 'factor_2', 'factor_3'])
df_factor_state_coeff.set_index(['sys_state'], inplace=True)
df_factor_state_coeff
I can only make a very "procedural" approach work:
for c in df_factor_state_coeff.loc['state_2'].index:
    df_observations[c] = df_observations[c] * df_factor_state_coeff.loc['state_2'][c]
df_observations
I feel like I am missing something here and should be able to do this as a multiplication like df_observations.multiply(df_factor_state_coeff.loc['state_2']) without having to loop, but I can't figure it out. I would really appreciate a smarter way to do this.
Thanks.
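For what it's worth, the multiply call the question is reaching for should work as written, because pandas aligns the Series index (factor_1 to factor_3) with the DataFrame columns; a minimal sketch:

# Scale each factor_N column by the state_2 coefficient for factor_N.
result = df_observations.multiply(df_factor_state_coeff.loc['state_2'])
# Equivalent shorthand: df_observations * df_factor_state_coeff.loc['state_2']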

Creating new arrays based on the difference of previous and following elements of two other arrays

I have some events with their start and end time steps. Array “start” represents the start time steps of 4 events, array “end” represents the end time steps of these events, and array “prop” contains one numerical property for each event (e.g. the 2nd event (index 1) started at time step 12, finished at time step 14, and its property is 20). Array “diff” shows the difference between the events (from the end of the previous event to the start of the next one). The time difference between the end of the 1st event and the start of the 2nd event is 7 steps. Array “diff” is smaller than the other arrays (“start”, “end”, “prop”) by 1 element.
import numpy as np

start = np.array([3, 12, 16, 30])
end = np.array([5, 14, 18, 32])
prop = np.array([10, 20, 10, 30])
diff = np.zeros(len(start) - 1)
for i in range(1, len(start)):
    diff[i-1] = start[i] - end[i-1]
print('diff', diff)
# diff [ 7.  2. 12.]
Events which are close in time need to merge: if the difference between two neighboring events is smaller than 3 time steps, they need to merge. For example, the 2nd and 3rd events differ by 2 time steps, so they will merge into a new event whose start is time step 12 and whose end is time step 18. As for the “prop” array, the max prop[i] among the merged events needs to be kept (prop[1] > prop[2]), so 20 will be assigned to the new merged event (merged_prop[1] = 20). I would like to have 3 new arrays with the characteristics of all events (merged and not merged), like these:
merged_start=np.array([3,12,30])
merged_end = np.array([5,18,32]) #2nd and 3rd event have been merged
merged_prop=np.array([10,20,30])
I have attached another, larger example as well, to be clearer about what I want. The 2nd and 3rd events merged into one large event, and so did the 4th through 7th events (inclusive).
start_2 = np.array([3, 12, 16, 38, 42, 46, 50, 60])
end_2 = np.array([5, 14, 32, 40, 44, 48, 54, 70])
prop_2 = np.array([10, 8, 20, 10, 35, 10, 10, 10])
diff_2 = np.zeros(len(start_2) - 1)
for i in range(1, len(start_2)):
    diff_2[i-1] = start_2[i] - end_2[i-1]
print('diff_2', diff_2)
# diff_2 [7. 2. 6. 2. 2. 2. 6.]
#Desirable outputs
merged_start_2=np.array([3,12,38,60])
merged_end_2 = np.array([5,32,54,70])
merged_prop_2= np.array([10,20,35,10])
Another Example
start_3 = np.array([ 3, 12, 18, 38, 42, 46, 50, 60])
end_3 = np.array([ 5, 14, 32, 40, 44, 48, 54, 70])
prop_3 = np.array([10, 8, 20, 10, 35, 10, 10, 10])
#Desirable outputs
merged_start_3=np.array([3,12,18,38,60])
merged_end_3 = np.array([5,14,32,54,70])
merged_prop_3= np.array([10,8,20,35,10])
How can I do it? I am able to extract the indices from arrays "diff" and "diff_2" whose values are lower than 3, but I do not know how to continue.
Here is a way you can do that:
import numpy as np
MERGE_THRESHOLD = 3
start = np.array([ 3, 12, 16, 38, 42, 46, 50, 60])
end = np.array([ 5, 14, 32, 40, 44, 48, 54, 70])
prop = np.array([10, 8, 20, 10, 35, 10, 10, 10])
# Gap between events
dists = start[1:] - end[:-1]
# Mask of gaps wide enough to keep events separate; each True starts a new group
m = dists >= MERGE_THRESHOLD
# Find first and last indices of each merged group
first_indices = np.flatnonzero(np.r_[True, m])
last_indices = np.r_[first_indices[1:], len(start)] - 1
# Make results
merged_start = start[first_indices]
merged_end = end[last_indices]
merged_prop_max = np.maximum.reduceat(prop, first_indices)
merged_prop_sum = np.add.reduceat(prop, first_indices)
elems_per_merge = last_indices - first_indices + 1
merged_prop_avg = merged_prop_sum / elems_per_merge
print(merged_start)
# [ 3 12 38 60]
print(merged_end)
# [ 5 32 54 70]
print(merged_prop_max)
# [10 20 35 10]
print(merged_prop_sum)
# [10 28 65 10]
print(merged_prop_avg)
# [10. 14. 16.25 10. ]
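As a quick check, wrapping the same steps in a hypothetical helper (merge_events, not part of the original answer) and running it on the question's third example reproduces the desired outputs:

def merge_events(start, end, prop, threshold=3):
    # Gaps of at least `threshold` keep events separate; smaller gaps merge.
    m = (start[1:] - end[:-1]) >= threshold
    first = np.flatnonzero(np.r_[True, m])
    last = np.r_[first[1:], len(start)] - 1
    return start[first], end[last], np.maximum.reduceat(prop, first)

start_3 = np.array([3, 12, 18, 38, 42, 46, 50, 60])
end_3 = np.array([5, 14, 32, 40, 44, 48, 54, 70])
prop_3 = np.array([10, 8, 20, 10, 35, 10, 10, 10])
print(merge_events(start_3, end_3, prop_3))
# (array([ 3, 12, 18, 38, 60]), array([ 5, 14, 32, 54, 70]), array([10,  8, 20, 35, 10]))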

How to sum a slice from a pandas dataframe

I'm trying to sum a portion of the sessions in my dictionary so I can get totals for the current and previous week.
I've converted the JSON into a pandas dataframe in one test. I'm summing the total of the sessions using the .sum() function in pandas. However, I also need to know the total sessions from this week and the week prior. I've tried a few methods to sum slices of values ((-1:-7) and (-8:-15)), but I'm pretty sure I need to use .iloc.
IN:
response = requests.get("url")
data = response.json()
df = pd.DataFrame(data['DailyUsage'])
total_sessions = df['Sessions'].sum()
current_week = df['Sessions'].iloc[-1:-7]
print(current_week)
total_sessions = ['current_week'].sum
OUT:
Series([], Name: Sessions, dtype: int64)
AttributeError: 'list' object has no attribute 'sum'
Note: I've tried this with and without pd.to_numeric, and also with variations on the syntax of the slice and sum methods. Pandas doesn't feel very Pythonic, and I'm out of ideas as to what to try next.
Assuming that df['Sessions'] holds one value per day, and you are comparing the current and previous week only, you can use reshape to create weekly sums from the last 14 values.
weekly_matrix = df['Sessions'][:-15:-1].values.reshape((2, 7))
Then you can sum each row to get the weekly totals; the most recent week will be the first element.
import numpy as np
weekly_sum = np.sum(weekly_matrix, axis=1)
current_week = weekly_sum[0]
previous_week = weekly_sum[1]
EDIT: how the code works
Let's take the 1D array accessed via the values attribute of the pandas Series. It contains the last 14 days, ordered from most recent to oldest. I will call it x.
x = array([14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1])
The array's reshape function is then called on x to split the data into a 2D array (matrix) with 2 rows and 7 columns.
The default behavior of reshape is to fill all columns in a row before moving to the next row. Therefore, x[0] becomes element (1,1) of the reshaped array, x[1] becomes (1,2), and so on. After element (1,7) is filled with x[6] (completing the current week), the next element x[7] is placed at (2,1). This continues until the reshape finishes with the placement of x[13] at (2,7).
This places the first 7 elements of x (the current week) in the first row, and the last 7 elements of x (the previous week) in the second row. This matrix was called weekly_matrix.
weekly_matrix = x.reshape((2, 7))
# weekly_matrix = array([[14, 13, 12, 11, 10, 9, 8],
# [ 7, 6, 5, 4, 3, 2, 1]])
Since we now have the values of each week organized in a matrix, we can use the numpy.sum function to finish our operation. numpy.sum takes an axis argument, which controls how the value is computed:
if axis=None, all elements are added into a grand total.
if axis=0, all rows in each column are added. In the case of weekly_matrix, this results in a 7-element 1D array ([21, 19, 17, 15, 13, 11, 9]), which is not the result we want, as we would be adding equivalent days of each week.
if axis=1 (as in the solution), all columns in each row are added, producing a 2-element 1D array in the case of weekly_matrix. The order of this result array follows the order of the rows in the matrix (element 0 is the total of the first row, element 1 the total of the second row). Since we know that the first row is the current week and the second row is the previous week, we can extract the information using these indexes:
# weekly_sum = array([77, 28])
current_week = weekly_sum[0] # sum of [14, 13, 12, 11, 10, 9, 8] = 77
previous_week = weekly_sum[1] # sum of [ 7, 6, 5, 4, 3, 2, 1] = 28
To group and sum by a fixed number of values, for instance with daily data and weekly aggregation, consider groupby. You can do this forwards or backwards by slicing your series as appropriate:
np.random.seed(0)
df = pd.DataFrame({'col': np.random.randint(0, 10, 21)})
print(df['col'].values)
# array([5, 0, 3, 3, 7, 9, 3, 5, 2, 4, 7, 6, 8, 8, 1, 6, 7, 7, 8, 1, 5])
# forwards groupby
res = df['col'].groupby(df.index // 7).sum()
# 0 30
# 1 40
# 2 35
# Name: col, dtype: int32
# backwards groupby
df['col'].iloc[::-1].reset_index(drop=True).groupby(df.index // 7).sum()
# 0 35
# 1 40
# 2 30
# Name: col, dtype: int32
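For completeness, a minimal sketch of the direct slicing the question was attempting, assuming df['Sessions'] has one row per day in chronological order:

current_week = df['Sessions'].iloc[-7:].sum()       # last 7 days
previous_week = df['Sessions'].iloc[-14:-7].sum()   # the 7 days before that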
