I am currently trying to learn how to work with CSV data via pandas and matplotlib. I have a dataset that clearly has spikes in it, which I would need to "clean up" before evaluating anything from it. But I am having difficulty understanding how to "detect" spikes in a graph...
The dataset I am working with is as follows:
df = pd.DataFrame({'price':[340.6, 35.66, 33.98, 38.67, 32.99, 32.04, 37.64,
38.22, 37.13, 38.57, 32.4, 34.98, 36.74, 32.9,
32.52, 38.83, 33.9, 32.62, 38.93, 32.14, 33.09,
34.25, 34.39, 33.28, 38.13, 36.25, 38.91, 38.9,
36.85, 32.17, -2.07, 34.49, 35.7, 32.54, 37.91,
37.35, 32.05, 38.03, 0.32, 33.87, 33.16, 34.74,
32.47, 33.31, 34.54, 36.6, 36.09, 35.49, 370.51,
37.33, 37.54, 33.32, 35.09, 33.08, 38.3, 34.32,
37.01, 33.63, 36.35, 33.77, 33.74, 36.62, 36.74,
37.76, 35.58, 38.76, 36.57, 37.05, 35.33, 36.41,
35.54, 37.48, 36.22, 36.19, 36.43, 34.31, 34.85,
38.76, 38.52, 38.02, 36.67, 32.51, 321.6, 37.82,
34.76, 33.55, 32.85, 32.99, 35.06]},
index = pd.date_range('2014-03-03 06:00','2014-03-06 22:00',freq='H'))
Which produces this graph:
So all of these values are in the range of 32 to 38. I've intentionally placed outlier values (very large or very small numbers) at indexes [0, 30, 38, 48, 82] to create spikes in the graph.
Now, I was trying to look up how to do this so-called "step detection" on a graph, and the only really useful answer I found was through this question here. Utilizing that, I have come up with this overall code...
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import argrelextrema
df = pd.DataFrame({'price':[340.6, 35.66, 33.98, 38.67, 32.99, 32.04, 37.64,
38.22, 37.13, 38.57, 32.4, 34.98, 36.74, 32.9,
32.52, 38.83, 33.9, 32.62, 38.93, 32.14, 33.09,
34.25, 34.39, 33.28, 38.13, 36.25, 38.91, 38.9,
36.85, 32.17, -2.07, 34.49, 35.7, 32.54, 37.91,
37.35, 32.05, 38.03, 0.32, 33.87, 33.16, 34.74,
32.47, 33.31, 34.54, 36.6, 36.09, 35.49, 370.51,
37.33, 37.54, 33.32, 35.09, 33.08, 38.3, 34.32,
37.01, 33.63, 36.35, 33.77, 33.74, 36.62, 36.74,
37.76, 35.58, 38.76, 36.57, 37.05, 35.33, 36.41,
35.54, 37.48, 36.22, 36.19, 36.43, 34.31, 34.85,
38.76, 38.52, 38.02, 36.67, 32.51, 321.6, 37.82,
34.76, 33.55, 32.85, 32.99, 35.06]},
index = pd.date_range('2014-03-03 06:00','2014-03-06 22:00',freq='H'))
# df.plot()
# plt.show()
threshold = int(len(df['price']) * 0.75)
maxPeaks = argrelextrema(df['price'].values, np.greater, order=threshold)
minPeaks = argrelextrema(df['price'].values, np.less, order=threshold)
df2 = df.copy()
price_column_index = df2.columns.get_loc('price')
allPeaks = maxPeaks + minPeaks  # concatenates the two result tuples of index arrays
for peakList in allPeaks:
    for peak in peakList:
        print(df2.iloc[peak]['price'])
But the issue with this is that it only seems to be returning indexes 30 and 82; it's not grabbing the large value at index 0, and it's also not grabbing anything in the negative dips. I am fairly sure I am using these methods incorrectly.
Now, I understand that for this SPECIFIC issue I COULD just look for values in the column that are greater or less than a certain value, but I am thinking of situations with 1000+ entries where the "lowest/highest normal values" cannot be determined accurately, so I would like a spike detection approach that works regardless of scale.
So my questions are as follows:
1) The information I've been looking at about step detection seemed really dense and very difficult for me to comprehend. Could anyone provide a general rule of thumb for how to approach these "step detection" issues?
2) Are there any public libraries that allow this kind of work to be done with a little more ease? If so, what are they?
3) How can you achieve the same results using vanilla Python? I've been in many workplaces that do not allow any other libraries to be installed, forcing solutions that do not utilize any of these useful external libraries, so I am wondering if there is some kind of formula/function that could be written to achieve similar results...
4) What other approaches could I use from a data analysis standpoint for dealing with this issue? I read something about correlation and standard deviation, but I don't actually know how any of these can be utilized to identify WHERE the spikes are...
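As a purely illustrative sketch of the standard-deviation idea mentioned in question 4 (not part of the original question or its answers): flag any point that sits far from the centre of the data. A robust median/MAD variant is used here because the spikes themselves inflate a plain standard deviation; the function name and the 3.5 cutoff are arbitrary choices for illustration.
import numpy as np

def spike_positions(values, cutoff=3.5):
    """Flag points far from the median, measured in robust (MAD-based) z-scores.

    A plain mean/std rule works too, but the spikes themselves inflate the
    standard deviation, so the median/MAD variant is less likely to miss
    the smaller dips. The 3.5 cutoff is a common but arbitrary choice.
    """
    values = np.asarray(values, dtype=float)
    med = np.median(values)
    mad = np.median(np.abs(values - med))
    robust_z = 0.6745 * (values - med) / mad
    return np.where(np.abs(robust_z) > cutoff)[0]

# spike_positions(df['price'].values) should flag the outliers placed at [0, 30, 38, 48, 82]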
EDIT: I also found this answer using scipy's find_peaks method, but reading its documentation I don't really understand what the parameters represent or where the values passed to them came from... Any clarification would be greatly appreciated...
Solution using scipy.signal.find_peaks
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import find_peaks
df = pd.DataFrame({'price':[340.6, 35.66, 33.98, 38.67, 32.99, 32.04, 37.64,
38.22, 37.13, 38.57, 32.4, 34.98, 36.74, 32.9,
32.52, 38.83, 33.9, 32.62, 38.93, 32.14, 33.09,
34.25, 34.39, 33.28, 38.13, 36.25, 38.91, 38.9,
36.85, 32.17, -2.07, 34.49, 35.7, 32.54, 37.91,
37.35, 32.05, 38.03, 0.32, 33.87, 33.16, 34.74,
32.47, 33.31, 34.54, 36.6, 36.09, 35.49, 370.51,
37.33, 37.54, 33.32, 35.09, 33.08, 38.3, 34.32,
37.01, 33.63, 36.35, 33.77, 33.74, 36.62, 36.74,
37.76, 35.58, 38.76, 36.57, 37.05, 35.33, 36.41,
35.54, 37.48, 36.22, 36.19, 36.43, 34.31, 34.85,
38.76, 38.52, 38.02, 36.67, 32.51, 321.6, 37.82,
34.76, 33.55, 32.85, 32.99, 35.06]},
index = pd.date_range('2014-03-03 06:00','2014-03-06 22:00',freq='H'))
x = df['price'].values
x = np.insert(x, 0, 0) # added padding to catch any initial peaks in data
# for positive peaks
peaks, _ = find_peaks(x, height=50) # height is the threshold value
peaks = peaks - 1
print("The indices for peaks in the dataframe: ", peaks)
print(" ")
print("The values extracted from the dataframe")
print(df['price'][peaks])
# for negative peaks: flip the sign so that dips become peaks
x = x * -1
neg_peaks, _ = find_peaks(x, height=0) # height is the threshold value
neg_peaks = neg_peaks - 1
print(" ")
print("The indices for negative peaks in the dataframe: ", neg_peaks)
print(" ")
print("The values extracted from the dataframe")
print(df['price'][neg_peaks])
First, note that the algorithm works by making comparisons between values. The upshot is that the first value of the array gets ignored; I suspect that this was the problem with the solution you posted.
To get around this I padded the x array with an extra 0 at position 0 (the value you put there is up to you),
x = np.insert(x, 0, 0)
The algorithm then returns, into the variable peaks, the indices at which the peak values are found in the array,
peaks, _ = find_peaks(x, height=50) # height is the threshold value
Since I added an initial value, I have to subtract 1 from each of these indices,
peaks = peaks - 1
I can now use these indices to extract the peak values from the dataframe,
print(df['price'][peaks])
In terms of not detecting the peak at the beginning of the data, what you would usually do is re-sample the data set periodically and overlap the start of this sample with the end of the previous sample by a little bit. This "sliding window" over the data helps you avoid exactly this scenario, missing peaks on the boundary between scans of the data. The overlap should be greater than whatever your signal detection width is, in the above examples it appears to be a single data point.
For instance, if you are looking at daily data over a period of a month, with a resolution of "1 day," you would start your scan on the last day of the previous month, in order to detect a peak that happened on the first day of this month.
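As an illustration of that sliding-window idea, here is a minimal sketch; the window and overlap sizes below are arbitrary assumptions, not values from the answer, and the padding trick is the same one used in the code above.
import numpy as np
from scipy.signal import find_peaks

def windowed_peaks(values, window=24, overlap=2, height=50):
    """Scan the series in overlapping chunks so peaks on chunk boundaries are not missed."""
    values = np.asarray(values, dtype=float)
    found = set()
    for start in range(0, len(values), window):
        lo = max(0, start - overlap)                 # overlap with the previous chunk
        chunk = values[lo:start + window]
        idx, _ = find_peaks(np.insert(chunk, 0, 0), height=height)  # pad like the answer above
        found.update(int(i) for i in lo + idx - 1)   # undo the padding offset, map back to global index
    return sorted(found)
With the price series above, windowed_peaks(df['price'].values) should pick up the high spikes, including the one at index 0, whichever chunk they fall in.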
I'm new to Python, and recently I've been working on a project to compute the joint distribution of a Markov process.
An example of a stochastic kernel is the one used in a recent study by Hamilton (2005), who investigates a nonlinear statistical model of the business cycle based on US unemployment data. As part of his calculation he estimates the kernel
pH := [ 0.971  0.029  0.000
        0.145  0.778  0.077
        0.000  0.508  0.492 ]
Here S = {x1, x2, x3} = {NG, MR, SR}, where NG corresponds to normal growth, MR to mild recession, and SR to severe recession. For example, the probability of transitioning from severe recession to mild recession in one period is 0.508. The length of the period is one month.
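As a concrete illustration (a sketch, not part of the original question): pushing a distribution over {NG, MR, SR} forward one period is just the row-vector/matrix product ψ₁ = ψ₀ · pH, which numpy expresses directly, here using the ψ = (0.2, 0.2, 0.6) from the exercise below.
import numpy as np

pH = np.array([[0.971, 0.029, 0.000],
               [0.145, 0.778, 0.077],
               [0.000, 0.508, 0.492]])
psi0 = np.array([0.2, 0.2, 0.6])   # initial distribution over (NG, MR, SR)

psi1 = psi0 @ pH                   # distribution one month later
print(psi1, psi1.sum())            # the result still sums to 1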
The exercise based on the above Markov process is:
With regard to Hamilton's kernel pH, and using the same initial condition ψ = (0.2, 0.2, 0.6), compute the probability that the economy starts and remains in recession through periods 0, 1, 2 (i.e., that x_t ≠ NG for t = 0, 1, 2).
My script is as follows:
import numpy as np
## In this case, X should be a matrix rather than a vector
## and we compute w.r.t. P rather than merely its element [i][j]
path = []
def path_prob2(p, psi, x2):               # x2 is a matrix giving the path
    prob = psi                            # initial distribution is a row vector
    for t in range(x2.shape[1] - 1):      # .shape[1] gives the number of columns
        prob = np.dot(prob, p)            # prob: marginal distribution at period t
        ression = np.dot(prob, x2[:, t])  # probability of being in recession at period t
        path.append(ression)
    return path, prob
p = ((0.971, 0.029, 0 ),
(0.145, 0.778, 0.077),
(0 , 0.508, 0.492))
# p must be a 2-D numpy array
p = np.array(p)
psi = (0.2, 0.2, 0.6)
psi = np.array(psi)
x2 = ((0,0,0),
(1,1,1),
(1,1,1))
x2 = np.array(x2)
path_prob2(p,psi,x2)
During execution, two problems arise. The first is that in the first round of the loop I don't need the initial distribution psi to postmultiply the transition matrix p, so the probability of "remaining in recession" should be 0.2 + 0.6 = 0.8, but I don't know how to write the if-statement.
The second is that, as you may note, I use a list named path to collect the probability of "remaining in recession" in each period. At the end I need to multiply every element of the list together, but I haven't managed to find a method for such a task, like path[0]*path[1]*path[2] (np.multiply can only take two arguments as far as I know). Please give me some clues if such a method exists.
An additional ask: please give me any suggestions that you think could make the code more efficient. Thank you.
If I understood you correctly this should work (I'd love to see some manual calculations for some of the steps/outcomes). Note that I didn't use an if/else statement but instead started iterating from the second column:
import numpy as np
# In this case, X should be a matrix rather than a vector
# and we compute w.r.t. P rather than merely its element [i][j]
path = []
def path_prob2(p, psi, x2):                  # x2 is a matrix giving the path
    path.append(np.dot(psi, x2[:, 0]))       # first step: no multiplication by p
    prob = psi                               # initial distribution is a row vector
    for t in range(1, x2.shape[1]):          # .shape[1] gives the number of columns
        prob = np.dot(prob, p)               # prob: marginal distribution at period t
        path.append(np.dot(prob, x2[:, t]))  # probability of remaining in recession at period t
    return path, prob
# p must be a 2-D numpy array
p = np.array([[0.971, 0.029, 0],
[0.145, 0.778, 0.077],
[0, 0.508, 0.492]])
psi = np.array([0.2, 0.2, 0.6])
x2 = np.array([[0, 0, 0],
[1, 1, 1],
[1, 1, 1]])
print(path_prob2(p, psi, x2))
For your second question, I believe numpy.prod will give you the product of all elements of a list/array.
You can use the prod as such:
>>> np.prod([15,20,31])
9300
Hi, I have two numpy arrays (in this case representing depth and percentage depth dose data) as follows:
depth = np.array([ 0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ,
1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2. , 2.2,
2.4, 2.6, 2.8, 3. , 3.5, 4. , 4.5, 5. , 5.5])
pdd = np.array([ 80.40649399, 80.35692155, 81.94323956, 83.78981286,
85.58681373, 87.47056637, 89.39149833, 91.33721651,
93.35729334, 95.25343909, 97.06283306, 98.53761309,
99.56624117, 100. , 99.62820672, 98.47564754,
96.33163961, 93.12182427, 89.0940637 , 83.82699219,
77.75436857, 63.15528566, 46.62287768, 29.9665386 ,
16.11104226, 6.92774817, 0.69401413, 0.58247614,
0.55768992, 0.53290371, 0.5205106 ])
which when plotted give the following curve:
I need to find the depth at which the pdd falls to a given value (initially 50%). I have tried slicing the arrays at the point where the pdd reaches 100% as I'm only interested in the points after this.
Unfortunately np.interp only appears to work where both x and y values are increasing.
Could anyone suggest where I should go next?
If I understand you correctly, you want to interpolate the function depth = f(pdd) at pdd = 50.0. For the purposes of the interpolation, it might help for you to think of pdd as corresponding to your "x" values, and depth as corresponding to your "y" values.
You can use np.argsort to sort your "x" and "y" by ascending order of "x" (i.e. ascending pdd), then use np.interp as usual:
# `idx` is an array of integer indices that sorts `pdd` in ascending order
idx = np.argsort(pdd)
depth_itp = np.interp([50.0], pdd[idx], depth[idx])
plt.plot(depth, pdd)
plt.plot(depth_itp, 50, 'xr', ms=20, mew=2)
This isn't really a programming solution, but it's how you can find the depth. I'm taking the liberty of renaming your variables, so x(i) = depth(i) and y(i) = pdd(i).
In a given interval [x(i),x(i+1)], your linear interpolant is
p_1(X) = y(i) + (X - x(i))*(y(i+1) - y(i))/(x(i+1) - x(i))
You want to find X such that p_1(X) = 50. First find i such that y(i) > 50 and y(i+1) < 50; then the above equation can be rearranged to give
X = x(i) + (50 - y(i))*((x(i+1) - x(i))/(y(i+1) - y(i)))
For your data (with MATLAB; sorry, no Python code) I make it approximately 2.359. This can then be verified with np.interp(X, depth, pdd).
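A rough Python translation of that manual interpolation might look like the following sketch, using the depth and pdd arrays defined in the question; the helper name is made up for illustration.
import numpy as np

def depth_at(target, depth, pdd):
    """Linearly interpolate depth at a given pdd value on the falling part of the curve."""
    for i in range(np.argmax(pdd), len(pdd) - 1):
        if pdd[i] >= target >= pdd[i + 1]:   # target lies between y(i) and y(i+1)
            return depth[i] + (target - pdd[i]) * (depth[i + 1] - depth[i]) / (pdd[i + 1] - pdd[i])
    return None

# depth_at(50.0, depth, pdd) should come out near the ~2.359 quoted above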
There are several methods to carry out interpolation. In your case, you are basically looking for the depth at 50%, which is not available in your data. The simplest interpolation is the linear case. I used the Numerical Recipes library in C++ to obtain the interpolated value via several techniques:
Linear interpolation (see page 117): depth(50%) = 2.35915
Polynomial interpolation (see page 117): depth(50%) = 2.36017
Cubic spline interpolation (see page 120): depth(50%) = 2.19401
Rational function interpolation (see page 124): depth(50%) = 2.35986
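A comparable check can be sketched directly in Python with scipy.interpolate (an illustration only, using the depth and pdd arrays from the question; only the falling portion of the curve is used so that depth = f(pdd) is single-valued):
import numpy as np
from scipy.interpolate import interp1d

start = np.argmax(pdd)                      # index of the 100% maximum
pdd_fall, depth_fall = pdd[start:], depth[start:]

for kind in ("linear", "cubic"):
    f = interp1d(pdd_fall, depth_fall, kind=kind, assume_sorted=False)
    print(kind, float(f(50.0)))             # depth at which pdd falls to 50%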
Let's say I have a list of numbers (all numbers are within 0.5 to 1.5 in this particular example, and of course it is a discrete set).
my_list= [0.564, 1.058, 0.779, 1.281, 0.656, 0.863, 0.958, 1.146, 0.742, 1.139, 0.957, 0.548, 0.572, 1.204, 0.868, 0.57, 1.456, 0.586, 0.718, 0.966, 0.625, 0.951, 0.766, 1.458, 0.83, 1.25, 0.7, 1.334, 1.015, 1.43, 1.376, 0.942, 1.252, 1.441, 0.795, 1.25, 0.851, 1.383, 0.969, 0.629, 1.008, 0.729, 0.841, 0.619, 0.63, 1.189, 0.514, 0.899, 0.807, 0.63, 1.101, 0.528, 1.385, 0.838, 0.538, 1.364, 0.702, 1.129, 0.639, 0.557, 1.28, 0.664, 1.021, 1.43, 0.792, 1.229, 0.837, 1.183, 0.54, 0.831, 1.279, 1.385, 1.377, 0.827, 1.32, 0.537, 1.19, 1.446, 1.222, 0.762, 1.302, 0.626, 1.352, 1.316, 1.286, 1.239, 1.027, 1.198, 0.961, 0.515, 0.989, 0.979, 1.123, 0.889, 1.484, 0.734, 0.718, 0.758, 0.782, 1.163, 0.579, 0.744, 0.711, 1.13, 0.598, 0.913, 1.305, 0.684, 1.108, 1.373, 0.945, 0.837, 1.129, 1.005, 1.447, 1.393, 1.493, 1.262, 0.73, 1.232, 0.838, 1.319, 0.971, 1.234, 0.738, 1.418, 1.397, 0.927, 1.309, 0.784, 1.232, 1.454, 1.387, 0.851, 1.132, 0.958, 1.467, 1.41, 1.359, 0.529, 1.139, 1.438, 0.672, 0.756, 1.356, 0.736, 1.436, 1.414, 0.921, 0.669, 1.21, 1.041, 0.597, 0.541, 1.162, 1.292, 0.538, 1.011, 0.828, 1.356, 0.897, 0.831, 1.018, 1.412, 1.363, 1.371, 1.231, 1.278, 0.564, 1.134, 1.324, 0.593, 1.307, 0.66, 1.376, 1.469, 1.315, 0.959, 1.099, 1.313, 1.032, 1.128, 1.175, 0.64, 0.581, 1.09, 0.934, 0.698, 1.272]
I can make a histogram distribution plot from it as
hist(my_list, bins=20, range=[0.5,1.5])
show()
which produces
Now, I want to create another list of random numbers (let's say this new list consists of 100 numbers) that will follow the same distribution (I'm not sure how to link a discrete set to a continuous distribution!) as the old list (my_list), so that if I plot the histogram distribution from the new list, it will essentially produce the same histogram distribution.
Is there any way to do so in Python 2.7 ? I appreciate any help in advance.
You first need to "bucket up" the range of interest, and of course you can do it with tools from scipy &c, but for the sake of understanding what's going on a little Python version might help - with no optimizations, for ease of understanding:
import collections

def buckets(discrete_set, amin=None, amax=None, bucket_size=None):
    if amin is None: amin = min(discrete_set)
    if amax is None: amax = max(discrete_set)
    if bucket_size is None: bucket_size = (amax - amin) / 20
    def to_bucket(sample):
        if not (amin <= sample <= amax): return None  # no bucket fits
        return int((sample - amin) // bucket_size)
    b = collections.Counter(to_bucket(s)
                            for s in discrete_set if to_bucket(s) is not None)
    return amin, amax, bucket_size, b
So now you have a Counter (essentially a dict) mapping each bucket, numbered from 0 upwards, to its count as observed in the discrete set.
Next, you'll want to generate a random sample matching the bucket distribution measured by calling buckets(discrete_set). A Counter's elements method can help, but you need a list for random.sample...:
mi, ma, bs, bks = buckets(discrete_set)
buckelems = list(bks.elements())
(this may waste a lot of space, but you can optimize it later, separately from this understanding-focused overview:-).
Now it's easy to get an N-sized sample, e.g.:
import random

def makesample(N, buckelems, mi, ma, bs):
    s = []
    for _ in range(N):
        buck = random.choice(buckelems)                       # pick a bucket with probability proportional to its count
        x = random.uniform(mi + buck*bs, mi + (buck+1)*bs)    # uniform draw within that bucket
        s.append(x)
    return s
Here I'm assuming the buckets are fine-grained enough that it's OK to use a uniform distribution within each bucket.
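As a possible usage (a sketch, assuming my_list from the question, the functions above, and the same pylab-style hist/show the question uses):
mi, ma, bs, bks = buckets(my_list)
buckelems = list(bks.elements())
new_list = makesample(100, buckelems, mi, ma, bs)   # 100 values following my_list's histogram

hist(new_list, bins=20, range=[0.5, 1.5])
show()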
Now, optimizing this is of course interesting -- buckelems will have as many items as originally were in discrete_set, and if that imposes an excessive load on memory, cumulative distributions can be built and used instead.
Or, one could bypass the Counter altogether, and just "round" each item in the discrete set to its bucket's lower bound, if memory's OK but one wants more speed. Or, one could leave discrete_set alone and random.choice within it before "perturbing" the chosen value (in different ways depending on the constraints of the exact problem). No end of fun...!-)
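For instance, that last "choose and perturb" alternative could be sketched like this; the jitter width is an arbitrary assumption, roughly half a bucket:
import random

def makesample_perturb(N, discrete_set, jitter=0.025):
    # draw observed values directly and nudge each one a little
    return [random.choice(discrete_set) + random.uniform(-jitter, jitter)
            for _ in range(N)]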
Don't read too much into valleys and peaks of histograms with low sample sizes when you're trying to do distribution fitting.
I performed a Kolmogorov-Smirnov test on your data to test the hypothesis that they come from a Uniform(0.5,1.5) distribution, and failed to reject. Consequently, you can generate any size sample you want of Uniform(0.5,1.5)'s.
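That check and the resulting sampler can be sketched with scipy/numpy as follows (an illustration only; in scipy's parameterization, Uniform(0.5, 1.5) is loc=0.5, scale=1.0):
import numpy as np
from scipy import stats

# Kolmogorov-Smirnov test of my_list against Uniform(0.5, 1.5)
statistic, p_value = stats.kstest(my_list, 'uniform', args=(0.5, 1.0))
print(statistic, p_value)            # a large p-value means we fail to reject uniformity

# If uniformity is not rejected, new samples are trivial to draw:
new_list = np.random.uniform(0.5, 1.5, size=100)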
Given your statement that the underlying distribution is continuous, I think that a distribution-fitting approach is better than a histogram/bucket-based approach.
I want to develop some python code to align datasets obtained by different instruments recording the same event.
As an example, say I have two sets of measurements:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Define some data
data1 = pd.DataFrame({'TIME':[1.1, 2.4, 3.2, 4.1, 5.3],\
'VALUE':[10.3, 10.5, 11.0, 10.9, 10.7],\
'ERROR':[0.2, 0.1, 0.4, 0.3, 0.2]})
data2 = pd.DataFrame({'TIME':[0.9, 2.1, 2.9, 4.2],\
'VALUE':[18.4, 18.7, 18.9, 18.8],\
'ERROR':[0.3, 0.2, 0.5, 0.4]})
# Plot the data
plt.errorbar(data1.TIME, data1.VALUE, yerr=data1.ERROR, fmt='ro')
plt.errorbar(data2.TIME, data2.VALUE, yerr=data2.ERROR, fmt='bo')
plt.show()
The result is plotted here:
What I would like to do now is to align the second dataset (data2) to the first one (data1). i.e. to get this:
The second dataset must be shifted to match the first one by subtracting a constant (to be determined) from all its values. All I know is that the datasets are correlated since the two instruments are measuring the same event but with different sampling rates.
At this stage I do not want to make any assumptions about what function best describes the data (fitting will be done after alignment).
I am cautious about using means to perform shifts, since they may produce bad results depending on how the data is sampled. I was considering taking each data2[TIME_i] and working out the distance to the data1 point nearest in time, then minimizing the sum of those differences. But I am not sure that would work well either.
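That nearest-in-time idea can be sketched quickly with numpy (an illustration only, not a recommendation; for a pure constant offset under squared error, minimizing over the matched pairs reduces to taking their mean difference):
import numpy as np

# Match each data2 sample to the data1 sample closest in time
nearest = np.abs(data1.TIME.values[:, None] - data2.TIME.values[None, :]).argmin(axis=0)

# Constant shift that minimizes the summed squared differences over those pairs
offset = (data2.VALUE.values - data1.VALUE.values[nearest]).mean()
data2_aligned = data2.VALUE - offset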
Does anyone have any suggestions on a good method to use? I looked at mlpy but it seems to only work on 1D arrays.
Thanks.
You can subtract the mean of the difference: data2.VALUE - (data2.VALUE - data1.VALUE).mean()
import pandas as pd
import matplotlib.pyplot as plt
# Define some data
data1 = pd.DataFrame({
'TIME': [1.1, 2.4, 3.2, 4.1, 5.3],
'VALUE': [10.3, 10.5, 11.0, 10.9, 10.7],
'ERROR': [0.2, 0.1, 0.4, 0.3, 0.2],
})
data2 = pd.DataFrame({
'TIME': [0.9, 2.1, 2.9, 4.2],
'VALUE': [18.4, 18.7, 18.9, 18.8],
'ERROR': [0.3, 0.2, 0.5, 0.4],
})
# Plot the data
plt.errorbar(data1.TIME, data1.VALUE, yerr=data1.ERROR, fmt='ro')
plt.errorbar(data2.TIME, data2.VALUE-(data2.VALUE - data1.VALUE).mean(),
yerr=data2.ERROR, fmt='bo')
plt.show()
Another possibility is to subtract the mean of each series
You can calculate the offset between the averages and subtract that from every value in the second dataset. If you do this for every value, they should align relatively well. This assumes both datasets look relatively similar, so it might not work the best.
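In pandas terms, that mean-of-each-series variant is just the following (a sketch using the dataframes defined above):
offset = data2.VALUE.mean() - data1.VALUE.mean()   # gap between the two series' averages
data2_shifted = data2.VALUE - offset               # data2 brought down onto data1's level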
Although this question is not Matlab related, you might still be interested in this:
Remove unknown DC Offset from a non-periodic discrete time signal