Remove jumps like peaks and steps in timeseries - python

I have quite a few sensors in the field that measure water pressure. In the past the height of these sensors has been changed quite a few times, creating jumps in the time series. Since these time series are continuous and I have a manual measurement, I should technically be able to remove the jumps (by hand this is easy, but there are too many measurements, so I need to do it in Python).
I've tried removing the jumps using a median filter but this doesn't really work.
My code:
# filter out noise in the signal (peaks)
import numpy as np
from scipy import ndimage as im  # assuming 'im' is scipy.ndimage, given median_filter

minimumPeak = 0.03  # filter out peaks larger than 0.03 m
filtered_value = np.array(im.median_filter(data['value'], 5))
noise = np.array((filtered_value - data['value']).abs() > minimumPeak)
data.loc[noise, 'value'] = filtered_value[noise]
data is a pandas DataFrame containing two columns: 'datetime' and 'value'.
I've also tried to do this manually and got it working in a simple case, but not well in any other. Any idea how I would filter out the jumps?
An example is shown in the picture below (yellow indicates the jumps, red the manual measurement; it is quite possible that this measurement is not at the beginning, as it is in this example).

You have sharp peaks and steps in your data. I guess you want to
remove the peaks and replace them with some averaged values, and
remove the steps by cumulatively changing the offset of the remaining data values.
That's in line with what you said in your last comment. Please note that this will alter (shift) large parts of your data!
It's important to recognize that the width of both peaks and steps is one pixel in your data. Also, you can handle both effects pretty much independently.
I suggest first removing the peaks, then removing the steps.
Remove peaks by calculating the absolute difference to the previous and to the next data value, then take the minimum of both, i.e. if your data series is y(i), compute p(i) = min(abs(y(i)-y(i-1)), abs(y(i+1)-y(i))). All values above a threshold are peaks. Take them and replace the data values with the mean of the previous and the next pixel, i.e. (y(i-1)+y(i+1))/2.
Now remove the steps, again by looking at absolute differences of consecutive values (as suggested in the comment by AreTor), s(i) = abs(y(i)-y(i-1)), and look for values above a certain threshold. Those positions are the step positions. Create a zero-valued offset array of the same size, insert the signed differences of the data points (without the absolute value) at the step positions, then form the cumulative sum and subtract the result from the original data to remove the steps.
Please note that this removes peaks and steps which go up as well as down. If you want to remove only one kind, just don't take the absolute value.
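Below is a minimal NumPy sketch of the two passes described above (the function name and the threshold values are illustrative assumptions, not part of the original answer; tune the thresholds to your data):
import numpy as np

def remove_peaks_and_steps(y, peak_threshold=0.03, step_threshold=0.03):
    # illustrative sketch only; the thresholds are placeholder values
    y = np.asarray(y, dtype=float).copy()

    # 1) peaks: p(i) = min(|y(i) - y(i-1)|, |y(i+1) - y(i)|) for interior points
    d_prev = np.abs(y[1:-1] - y[:-2])
    d_next = np.abs(y[2:] - y[1:-1])
    p = np.minimum(d_prev, d_next)
    peaks = np.where(p > peak_threshold)[0] + 1        # indices into y
    y[peaks] = 0.5 * (y[peaks - 1] + y[peaks + 1])     # replace by neighbour mean

    # 2) steps: s(i) = |y(i) - y(i-1)|; accumulate the signed step heights
    diffs = np.diff(y)
    steps = np.where(np.abs(diffs) > step_threshold)[0] + 1
    offset = np.zeros_like(y)
    offset[steps] = diffs[steps - 1]                   # signed step heights
    return y - np.cumsum(offset)                       # subtract the cumulative offset

# e.g. data['value'] = remove_peaks_and_steps(data['value'].to_numpy())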

You can try it like this:
import numpy as np
import matplotlib.pyplot as plt
import h5py
%matplotlib inline
# I'm not sure that you need all of these packages

filepath = 'measurment.hdf5'
with h5py.File(filepath, 'r') as hdf:
    data_y = hdf['y'][:]
    data_x = hdf['x'][:]

data = data_y
delta_max = 1                   # maximum allowed difference in y between two points
delta = 0                       # running correction value
data_cor = list(data[:2])       # corrected series, seeded with the first two points

for i in range(2, len(data)):   # the first two points are already appended
    delta_i = data[i] - data[i-1]
    if np.abs(delta_i) > delta_max:
        # a jump: grow the running offset by the jump size,
        # corrected for the local trend seen in the already-corrected points
        delta += delta_i - (data_cor[i-1] - data_cor[i-2])
    data_cor.append(data[i] - delta)

plt.plot(data_x, data_cor)

Related

Bokeh line color based on the True/False condition

I am trying to plot anomaly regions in Bokeh. The idea is to have a line that uses red to show which samples are anomalous.
Here is a sample reproducible code.
import numpy as np
import pandas as pd

n = 300
dat = pd.DataFrame()
dat['X_axis'] = np.linspace(start=0.0, stop=1000, num=n)
mean = 4
std = 1
dat['Y_axis'] = np.random.normal(loc=mean, scale=std, size=n)
dat['anom'] = np.random.choice([False, True], size=(n,), p=[0.90, 0.10])
I was able to implement the Box Annotation, and I am trying to do the same thing but this time, the same region will just have a red color for that portion of the line.
EDIT:
Following a comment/suggestion, I plotted those two lines separately. However, Bokeh interpolates between values instead of giving a clean transition. Is there a way to drop the interpolation, or at least limit it to two adjacent values?
EDIT 2:
I was able to break it into individual segments. However, now there are gaps between data samples that need to be eliminated. Any suggestion on how to do that?
You will have to split your data up and use either multiple calls to line or a single call to multi_line. It is not possible to specify different colors along different parts of a single line.
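A minimal sketch of the multi_line route, reusing the sample DataFrame from the question; splitting the data into runs of consecutive normal/anomalous samples and repeating the boundary point (so the segments touch, addressing the gaps mentioned in the second edit) are my assumptions, not part of the original answer:
import numpy as np
import pandas as pd
from bokeh.plotting import figure, show

n = 300
dat = pd.DataFrame()
dat['X_axis'] = np.linspace(start=0.0, stop=1000, num=n)
dat['Y_axis'] = np.random.normal(loc=4, scale=1, size=n)
dat['anom'] = np.random.choice([False, True], size=(n,), p=[0.90, 0.10])

# one (x, y) segment per run of consecutive normal/anomalous samples
runs = (dat['anom'] != dat['anom'].shift()).cumsum()
xs, ys, colors = [], [], []
prev_point = None
for _, seg in dat.groupby(runs):
    x = seg['X_axis'].tolist()
    y = seg['Y_axis'].tolist()
    if prev_point is not None:
        # repeat the last point of the previous run so adjacent segments touch
        x.insert(0, prev_point[0])
        y.insert(0, prev_point[1])
    prev_point = (x[-1], y[-1])
    xs.append(x)
    ys.append(y)
    colors.append('red' if seg['anom'].iloc[0] else 'navy')

p = figure()
p.multi_line(xs, ys, line_color=colors)
show(p)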

How to deal with negative values when subtracting the mean from each pixel

I am subtracting the mean value of the image from each pixel using the following code (hopefully I am doing it right):
import glob
import cv2
import numpy as np

values_list = []
values_mean = []
for filename in glob.glob('video//frames//*.png'):
    img = cv2.imread(filename, 0)
    values_list.append(img[150, :])  # all column values of row 150
    values_mean.append(np.round(np.mean(img[150, :]), decimals=0))

output = np.array(values_list)
values_mean = np.array(values_mean).reshape(-1, 1)
new_column_value = values_mean - output
When I plot the graph, I get
What is the best way to deal with negative values? Should I simply add an if statement that clamps values above 255 or below 0 to 0? But then someone mentioned "...you kill information on where the negatives are..", so how do I deal with this correctly?
I intend to calculate the shift value between frames by getting the maximal correlation value of the subtracted image and comparing it to the adjacent frame.
There are countless similar questions, but I cannot find solid ground here, here, here, here, etc.
If you're trying to determine "how far away from the mean is a given pixel", then wouldn't it make more sense to take the absolute value of your result?
new_column_value = np.absolute(values_mean - output)

Understanding the output of fftfreq function and the fft plot for a single row in an image

I am trying to understand the function fftfreq and the resulting plot generated by adding real and imaginary components for one row in the image. Here is what I did:
import numpy as np
import cv2
import matplotlib.pyplot as plt
image = cv2.imread("images/construction_150_200_background.png", 0)
image_fft = np.fft.fft(image)
real = image_fft.real
imag = image_fft.imag
real_row_bw = image_fft[np.ceil(image.shape[0]/2).astype(int), 0:image.shape[1]]
imag_row_bw = image_fft[np.ceil(image.shape[0]/2).astype(int), 0:image.shape[1]]
sum = real_row_bw + imag_row_bw
plt.plot(np.fft.fftfreq(image.shape[1]), sum)
plt.show()
Here is the image of the plot generated:
I read the image from the disk, calculate the Fourier transform and extract the real and imaginary parts. Then I sum the sine and cosine components and plot using the pyplot library.
Could someone please help me understand the fftfreq function? Also what does the peak represent in the plot for the following image:
I understand that Fourier transform maps the image from spatial domain to the frequency domain but I cannot make much sense from the graph.
Note: I am unable to upload the images directly here, as at the moment of asking the question, I am getting an upload error.
I don't think that you really need fftfreq to look for frequency-domain information in images, but I'll try to explain it anyway.
fftfreq is used to calculate the frequencies that correspond to each bin in an FFT that you calculate. You are using fftfreq to define the x coordinates on your graph.
fftfreq has two arguments: one mandatory, one optional. The mandatory first argument is an integer, the window length you used to calculate an FFT. You will have the same number of frequency bins in the FFT as you had samples in the window. The optional second argument is the sample spacing, i.e. the time between samples. If you don't specify it, the default spacing is 1. I don't know whether a sample rate is a meaningful quantity for an image, so I can understand you not specifying it. Maybe you want to give the spacing in pixels? It's up to you.
Your FFT's frequency bins start at the negative Nyquist frequency, which is half the sample rate (default = -0.5), or a little higher; and it ends at the positive Nyquist frequency (+0.5), or a little lower.
The fftfreq function returns the frequencies in a funny order though. The zero frequency is always the zeroth element. The frequencies count up to the maximum positive frequency, then flip to the most negative frequency and count upwards towards zero. The reason for this strange ordering is that if you're doing FFTs on real-valued data (you are; image pixels do not have complex values), the negative-frequency data is just the complex conjugate of the corresponding positive-frequency data and is therefore redundant. This ordering makes it easy to throw the negative frequencies away: just take the first half of the array. Since you aren't doing that, you're plotting the negative frequencies too. If you choose to ignore the second half of the array, the negative frequencies will be removed.
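A quick illustration of that ordering (and of np.fft.fftshift, which reorders the bins monotonically if you prefer a conventional axis):
import numpy as np

print(np.fft.fftfreq(8))
# [ 0.     0.125  0.25   0.375 -0.5   -0.375 -0.25  -0.125]

print(np.fft.fftshift(np.fft.fftfreq(8)))
# [-0.5   -0.375 -0.25  -0.125  0.     0.125  0.25   0.375]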
As for the strong spike that you see at the zero frequency in your image, this is probably because your image data consists of pixel values which range from 0 to 255. There's a huge "DC offset" in your data. It looks like you're using Matplotlib. If you are plotting in an interactive window, you can use the zoom rectangle to look at that horizontal line. If you push the DC offset off scale, setting the Y axis scale to perhaps ±500, I bet you will start to see that the horizontal line isn't exactly horizontal after all.
Once you know which bin contains your DC offset, if you don't want to see it, you can just assign the value of the fft in that bin to zero. Then the graph will scale automatically.
By the way, these two lines of code perform identical calculations, so you aren't actually taking the sine and cosine components like your text says:
real_row_bw = image_fft[np.ceil(image.shape[0]/2).astype(int), 0:image.shape[1]]
imag_row_bw = image_fft[np.ceil(image.shape[0]/2).astype(int), 0:image.shape[1]]
And one last thing: to sum the sine and cosine components properly (once you have them), since they're at right angles, you need to use a vector sum rather than a scalar sum. Look at the function numpy.linalg.norm.
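As a sketch of what that could look like for one row ('image' here is assumed to be the grayscale array loaded with cv2.imread in the question), using np.abs on the complex FFT, which is the same as the vector sum of the real and imaginary parts, and zeroing the DC bin as suggested above:
import numpy as np
import matplotlib.pyplot as plt

row = image[image.shape[0] // 2, :]   # middle row of the grayscale image
row_fft = np.fft.fft(row)

magnitude = np.abs(row_fft)           # sqrt(real**2 + imag**2), the vector sum
magnitude[0] = 0                      # suppress the DC bin so the rest is visible

plt.plot(np.fft.fftfreq(row.size), magnitude)
plt.show()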

Numpy Correlate is not providing an offset

I am trying to look at astronomical spectra using Python, and I'm using numpy.correlate to try and find a radial velocity shift. I'm comparing each spectrum I have to one template spectrum. The problem that I'm encountering is that, no matter which spectra I use, numpy.correlate states that the maximal value of the correlation function occurs with a shift of zero pixels, i.e. the spectra already line up, which is very clearly not true. Here is some of the relevant code:
corr = np.correlate(temp_data, imag_data, mode='same')
ax1.plot(delta_data, corr, c='g')
ax1.plot(delta_data, 100*temp_data, c='b')
ax1.plot(delta_data, 100*imag_data, c='r')
The output of this code is shown here:
What I Have
Note that the cross-correlation function peaks at an offset of zero pixels despite the template (blue) and observed (red) spectra clearly showing an offset. What I would expect to see would be something a bit like (albeit not exactly like; this is merely the closest representation I could produce):
What I Want
Here I have introduced an artificial offset of 50 pixels in the template data, and they more or less line up now. What I would like is, for a case like this, for a peak to appear at an offset of 50 pixels rather than at zero (I don't care if the spectra at the bottom appear lined up; that is merely for visual representation). However, despite several hours of work and research online, I can't find anyone who even describes this problem, let alone a solution. I've attempted to use SciPy's correlate and MATLAB's xcorr, and both show this same thing (although I'm led to believe that they are essentially the same function).
Why is the cross-correlation not acting the way I expect, and how do I get it to act in a useful way?
The issue you're experiencing is probably because your spectra are not zero-centered; their RMS value looks to be about 100 in whichever units you're plotting, instead of 0. The reason this is an issue is that numpy.correlate works by "sliding" imag_data over temp_data to get their dot product at each possible offset between the two series. (See the Wikipedia article on cross-correlation to understand the operation itself.) When using mode='same' to produce an output that is the same length as your first input (temp_data), NumPy has to "pad" a bunch of dummy values--zeroes--to the ends of imag_data in order to be able to calculate the dot products of all the shifted versions of imag_data. When there is any non-zero offset between the spectra, some of the values in temp_data are being multiplied by those dummy zero-padding values instead of the values in imag_data. If the values in the spectra were centered around zero (RMS=0), then this zero-padding would not impact our expectation of the dot product, but because these spectra have RMS values around 100 units, that dot product (our correlation) is largest when we lay the two spectra on top of one another with no offset.
Notice that your cross-correlation result looks like a triangular pulse, which is what you might expect from the cross-correlation of two square pulses (c.f. Convolution of a Rectangular "Pulse" With Itself). That's because your spectra, once padded, look like a step function from zero up to a pulse of slightly noisy values around 100. You can try correlating with mode='full' to see the entire response of the two spectra you're correlating, or notice that with mode='valid' you should only get one value in return, since your two spectra are the exact same length, so there is only one offset (zero!) at which you can entirely line them up.
To sidestep this issue, you can try either subtracting away the RMS value of the spectra so that they are zero-centered, or manually padding the beginning and end of imag_data with (len(temp_data)/2-1) dummy values equal to np.sqrt(np.mean(imag_data**2))
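A minimal sketch of the first option (zero-centering before correlating); the synthetic spectra below are made up purely for illustration, and the sign of the recovered lag depends on the argument order:
import numpy as np

# hypothetical spectra: imag_data is temp_data shifted by 50 pixels
n = 1000
pix = np.arange(n)
temp_data = 100 + 10 * np.exp(-((pix - 400) ** 2) / 50) + np.random.randn(n)
imag_data = np.roll(temp_data, 50)

# centre both series so the zero padding added by mode='same' is harmless
corr = np.correlate(temp_data - temp_data.mean(),
                    imag_data - imag_data.mean(), mode='same')

shift = np.argmax(corr) - n // 2
print(shift)   # roughly +/-50: the 50-pixel offset is recovered instead of a peak at zero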
Edit:
In response to your questions in the comments, I thought I'd include a graphic to make the point I'm trying to describe a little clearer.
Say we have two vectors of values, not entirely unlike your spectra, each with some large non-zero mean.
# Generate two noisy, but correlated series
t = np.linspace(0,250,250) # time domain from 0 to 250 steps
# signal_model = narrow_peak + gaussian_noise + constant
f = 10*np.exp(-((t-90)**2)/8) + np.random.randn(250) + 40
g = 10*np.exp(-((t-180)**2)/8) + np.random.randn(250) + 40
f has a spike around t=90, and g has a spike around t=180. So we expect the correlation of g and f to have a spike around a lag of 90 timesteps (in the case of spectra, frequency bins instead of timesteps.)
But in order to get an output that is the same shape as our inputs, as in np.correlate(g,f,mode='same'), we have to "pad" f on either side with half its length in dummy values: np.correlate pads with zeroes. If we don't pad f (as in np.correlate(g,f,mode='valid')), we will only get one value in return (the correlation with zero offset), because f and g are the same length, and there is no room to shift one of the signals relative to the other.
When you calculate the correlation of g and f after that padding, you find that it peaks when the non-zero portion of signals aligns completely, that is, when there is no offset between the original f and g. This is because the RMS value of the signals is so much higher than zero--the size of the overlap of f and g depends much more strongly on the number of elements overlapping at this high RMS level than on the relatively small fluctuations each function has around it. We can remove this large contribution to the correlation by subtracting the RMS level from each series. In the graph below, the gray line on the right shows the cross-correlation the two series before zero-centering, and the teal line shows the cross-correlation after. The gray line is, like your first attempt, triangular with the overlap of the two non-zero signals. The teal line better reflects the correlation between the fluctuation of the two signals, as we desired.
import numpy as np
import matplotlib.pyplot as plt

xcorr = np.correlate(g, f, 'same')
xcorr_rms = np.correlate(g - 40, f - 40, 'same')

fig, axes = plt.subplots(5, 2, figsize=(18, 18), gridspec_kw={'width_ratios': [5, 2]})
for n, axis in enumerate(axes):
    offset = (0, 75, 125, 215, 250)[n]
    fp = np.pad(f, [offset, 250 - offset], mode='constant', constant_values=0.)
    gp = np.pad(g, [125, 125], mode='constant', constant_values=0.)
    axis[0].plot(fp, color='purple', lw=1.65)
    axis[0].plot(gp, color='orange', lw=1.65)
    axis[0].axvspan(max(125, offset), min(375, offset + 250), color='blue', alpha=0.06)
    axis[0].axvspan(0, max(125, offset), color='brown', alpha=0.03)
    axis[0].axvspan(min(375, offset + 250), 500, color='brown', alpha=0.03)
    if n == 0:
        axis[0].legend(['f', 'g'])
    axis[0].set_title('offset={}'.format(offset - 125))
    axis[1].plot(xcorr / (40 * 40), color='gray')
    axis[1].plot(xcorr_rms, color='teal')
    axis[1].axvline(offset, -100, 350, color='maroon', lw=5, alpha=0.5)
    if n == 0:
        axis[1].legend([r"$g \star f$", r"$g' \star f'$", "offset"], loc='upper left')
plt.show()

Counting the number of times a threshold is met or exceeded in a multidimensional array in Python

I have a numpy array that I brought in from a netCDF file with the shape (930, 360, 720), where it is organized as (time, latitudes, longitudes).
At each lat/lon pair for each of the 930 time stamps, I need to count the number of times that the value meets or exceeds a threshold "x" (such as 0.2 or 0.5 etc.) and ultimately calculate the percentage that the threshold was exceeded at each point, then output the results so they can be plotted later on.
I have attempted numerous methods but here is my most recent:
lat_length = len(lats)
#where lats has been defined earlier when unpacked from the netCDF dataset
lon_length = len(lons)
#just as lats; also these were defined before using np.meshgrid(lons, lats)

for i in range(0, lat_length):
    for j in range(0, lon_length):
        if ice[:,i,j] >= x:
            #code to count number of occurrences here
            #code to calculate percentage here
            percent_ice[i,j] += count / len(time) #calculation

#then go on to plot percent_ice
I hope this makes sense! I would greatly appreciate any help. I'm self taught in Python so I may be missing something simple.
Would this be a time to use the any() function? What would be the most efficient way to count the number of times the threshold was exceeded and then calculate the percentage?
You can compare the input 3D array with the threshold x and then sum along the first axis with ndarray.sum(axis=0) to get the count and thereby the percentages, like so -
# Calculate count after thresholding with x and summing along first axis
count = (ice >= x).sum(axis=0)
# Get percentages (ratios) by dividing with first axis length
percent_ice = np.true_divide(count,ice.shape[0])
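A quick way to sanity-check the shapes with synthetic data (the small array here is just a stand-in for the real (930, 360, 720) variable):
import numpy as np

ice = np.random.rand(930, 36, 72)    # stand-in for the real (930, 360, 720) array
x = 0.2

count = (ice >= x).sum(axis=0)       # shape (36, 72): exceedance count per grid cell
percent_ice = count / ice.shape[0]   # fraction of the 930 time steps at or above x

print(count.shape, percent_ice.min(), percent_ice.max())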
Ah, look, another meteorologist!
There are probably multiple ways to do this and my solution is unlikely to be the fastest since it uses numpy's MaskedArray, which is known to be slow, but this should work:
Numpy has a data type called a MaskedArray which actually contains two normal numpy arrays. It contains a data array as well as a boolean mask. I would first mask all data that are greater than or equal to my threshold (use np.ma.masked_greater() for just greater than):
ice = np.ma.masked_greater_equal(ice, x)
You can then use ice.count() to determine how many values are below your threshold for each lat/lon point by specifying that you want to count along a specific axis:
n_good = ice.count(axis=0)
This should return a 2-dimensional array containing the number of good points. You can then calculate the number of bad points by subtracting n_good from ice.shape[0]:
n_bad = ice.shape[0] - n_good
and calculate the percentage that are bad using:
perc_bad = n_bad/float(ice.shape[0])
There are plenty of ways to do this without using MaskedArray. This is just the easy way that comes to mind for me.
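Putting the MaskedArray steps together as one sketch (again with a small synthetic array standing in for the real data):
import numpy as np

ice = np.random.rand(930, 36, 72)             # stand-in for the (930, 360, 720) data
x = 0.2

masked = np.ma.masked_greater_equal(ice, x)   # mask values at or above the threshold
n_good = masked.count(axis=0)                 # unmasked (below-threshold) values per cell
n_bad = ice.shape[0] - n_good                 # values at or above the threshold
perc_bad = n_bad / float(ice.shape[0])        # fraction exceeding the threshold

print(perc_bad.shape)                         # (36, 72)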
