Step/Spike detection breakdown for pandas dataframe graph? - python
I am currently trying to learn how to work with CSV data via pandas and matplotlib. I have a dataset that clearly contains spikes, which I would need to "clean up" before evaluating anything from it, but I am having difficulty understanding how to "detect" spikes in a graph...
So the dataset I am working with is as follows:
df = pd.DataFrame({'price':[340.6, 35.66, 33.98, 38.67, 32.99, 32.04, 37.64,
38.22, 37.13, 38.57, 32.4, 34.98, 36.74, 32.9,
32.52, 38.83, 33.9, 32.62, 38.93, 32.14, 33.09,
34.25, 34.39, 33.28, 38.13, 36.25, 38.91, 38.9,
36.85, 32.17, -2.07, 34.49, 35.7, 32.54, 37.91,
37.35, 32.05, 38.03, 0.32, 33.87, 33.16, 34.74,
32.47, 33.31, 34.54, 36.6, 36.09, 35.49, 370.51,
37.33, 37.54, 33.32, 35.09, 33.08, 38.3, 34.32,
37.01, 33.63, 36.35, 33.77, 33.74, 36.62, 36.74,
37.76, 35.58, 38.76, 36.57, 37.05, 35.33, 36.41,
35.54, 37.48, 36.22, 36.19, 36.43, 34.31, 34.85,
38.76, 38.52, 38.02, 36.67, 32.51, 321.6, 37.82,
34.76, 33.55, 32.85, 32.99, 35.06]},
index = pd.date_range('2014-03-03 06:00','2014-03-06 22:00',freq='H'))
Which produces this graph:
So all of these values are in the range of 32 to 38. I've intentionally placed outlier values (some very large, some very small) at indexes [0, 30, 38, 48, 82] to create spikes and dips in the graph.
Now I was trying to look up how to do this so called "step detection" on a graph, and the only real useful answer I have found is through this question here, and so utilizing that I have come up with this overall code...
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import argrelextrema
df = pd.DataFrame({'price':[340.6, 35.66, 33.98, 38.67, 32.99, 32.04, 37.64,
38.22, 37.13, 38.57, 32.4, 34.98, 36.74, 32.9,
32.52, 38.83, 33.9, 32.62, 38.93, 32.14, 33.09,
34.25, 34.39, 33.28, 38.13, 36.25, 38.91, 38.9,
36.85, 32.17, -2.07, 34.49, 35.7, 32.54, 37.91,
37.35, 32.05, 38.03, 0.32, 33.87, 33.16, 34.74,
32.47, 33.31, 34.54, 36.6, 36.09, 35.49, 370.51,
37.33, 37.54, 33.32, 35.09, 33.08, 38.3, 34.32,
37.01, 33.63, 36.35, 33.77, 33.74, 36.62, 36.74,
37.76, 35.58, 38.76, 36.57, 37.05, 35.33, 36.41,
35.54, 37.48, 36.22, 36.19, 36.43, 34.31, 34.85,
38.76, 38.52, 38.02, 36.67, 32.51, 321.6, 37.82,
34.76, 33.55, 32.85, 32.99, 35.06]},
index = pd.date_range('2014-03-03 06:00','2014-03-06 22:00',freq='H'))
# df.plot()
# plt.show()
threshold = int(len(df['price']) * 0.75)
maxPeaks = argrelextrema(df['price'].values, np.greater, order=threshold)
minPeaks = argrelextrema(df['price'].values, np.less, order=threshold)
df2 = df.copy()
price_column_index = df2.columns.get_loc('price')
allPeaks = maxPeaks + minPeaks
for peakList in allPeaks:
    for peak in peakList:
        print(df2.iloc[peak]['price'])
But the issue with this is that it only seems to be returning the indexes 30 and 82; it's not grabbing the large value at index 0, and it's also not grabbing anything in the negative dips. Though I am very sure I am using these methods incorrectly.
Now, I understand that for this SPECIFIC issue I COULD just look for values in the column that are greater or less than a certain value, but I am thinking of situations with 1000+ entries where the "lowest/highest normal values" cannot be accurately determined, so I would like a spike detection that works regardless of scale.
So my questions are as follows:
1) The information I've been looking at about step detection seemed really dense and very difficult for me to comprehend. Could anyone provide a general rule about how to approach these "step detection" issues?
2) Are there any public libraries that allow this kind of work to be done with a little more ease? If so, what are they?
3) How can you achieve the same results using vanilla Python? I've been in many workplaces that do not allow any other libraries to be installed, forcing solutions that do not utilize any of these useful external libraries, so I am wondering if there is some kind of formula/function that could be written to achieve similar results...
4) What other approaches could I use from a data-analysis standpoint to deal with this issue? I read something about correlation and standard deviation, but I don't actually know how any of these can be utilized to identify WHERE the spikes are... (see the rough sketch after this list).
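As a rough, hedged illustration of point 4 (my own sketch, not taken from any linked answer): a rolling median combined with a MAD-based (median absolute deviation) estimate of spread flags points that sit far outside their local neighbourhood, with no hard-coded price cutoff. The window size of 12 and the factor of 5 below are arbitrary choices, and df is the dataframe defined above.

import numpy as np
import pandas as pd

def find_spikes(series, window=12, n_sigmas=5):
    """Return a boolean mask marking points that deviate from a centred
    rolling median by more than n_sigmas robust standard deviations."""
    rolling_median = series.rolling(window, center=True, min_periods=1).median()
    deviation = (series - rolling_median).abs()
    # 1.4826 * MAD is a robust stand-in for the standard deviation
    robust_std = 1.4826 * deviation.rolling(window, center=True, min_periods=1).median()
    return deviation > n_sigmas * robust_std

spikes = find_spikes(df['price'])
print(df[spikes])   # should list the five artificial outliers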
EDIT: I also found this answer using scipy's find_peaks method, but reading its documentation I don't really understand what the parameters represent or where the values passed to it came from... Any clarification of this would be greatly appreciated...
Solution using scipy.signal.find_peaks
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import find_peaks
df = pd.DataFrame({'price':[340.6, 35.66, 33.98, 38.67, 32.99, 32.04, 37.64,
38.22, 37.13, 38.57, 32.4, 34.98, 36.74, 32.9,
32.52, 38.83, 33.9, 32.62, 38.93, 32.14, 33.09,
34.25, 34.39, 33.28, 38.13, 36.25, 38.91, 38.9,
36.85, 32.17, -2.07, 34.49, 35.7, 32.54, 37.91,
37.35, 32.05, 38.03, 0.32, 33.87, 33.16, 34.74,
32.47, 33.31, 34.54, 36.6, 36.09, 35.49, 370.51,
37.33, 37.54, 33.32, 35.09, 33.08, 38.3, 34.32,
37.01, 33.63, 36.35, 33.77, 33.74, 36.62, 36.74,
37.76, 35.58, 38.76, 36.57, 37.05, 35.33, 36.41,
35.54, 37.48, 36.22, 36.19, 36.43, 34.31, 34.85,
38.76, 38.52, 38.02, 36.67, 32.51, 321.6, 37.82,
34.76, 33.55, 32.85, 32.99, 35.06]},
index = pd.date_range('2014-03-03 06:00','2014-03-06 22:00',freq='H'))
x = df['price'].values
x = np.insert(x, 0, 0) # added padding to catch any initial peaks in data
# for positive peaks
peaks, _ = find_peaks(x, height=50) # height is the threshold value
peaks = peaks - 1
print("The indices for peaks in the dataframe: ", peaks)
print(" ")
print("The values extracted from the dataframe")
print(df['price'].iloc[peaks])
# for negative peaks
x = x * -1
neg_peaks, _ = find_peaks(x, height=0) # height is the threshold value
neg_peaks = neg_peaks - 1
print(" ")
print("The indices for negative peaks in the dataframe: ", neg_peaks)
print(" ")
print("The values extracted from the dataframe")
print(df['price'].iloc[neg_peaks])
First note that the algorithm works by making comparisons between values. The upshot is that the first value of the array gets ignored; I suspect that this was the problem with the solution you posted.
To get around this I padded the x array with an extra 0 at position 0; the value you put there is up to you:
x = np.insert(x, 0, 0)
The algorithm then returns, into the variable peaks, the indices where the peak values are found in the array. The height=50 passed here is simply an absolute cutoff that sits well above the normal 32-38 band of this series, so only the spikes exceed it:
peaks, _ = find_peaks(x, height=50) # height is the threshold value
As I have added an initial value I have to subtract 1 from each of these indices,
peaks = peaks - 1
I can now use these indices to extract the peak values from the dataframe:
print(df['price'].iloc[peaks])
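As a quick visual check (my addition, not part of the original answer), the indices found above can be marked on the price plot; this assumes the full script above has already been run, so that peaks and neg_peaks both exist:

spike_idx = np.concatenate([peaks, neg_peaks])
plt.plot(df.index, df['price'], label='price')
plt.plot(df.index[spike_idx], df['price'].iloc[spike_idx], 'ro', label='detected spikes')
plt.legend()
plt.show()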
In terms of not detecting the peak at the beginning of the data, what you would usually do is re-sample the data set periodically and overlap the start of each sample with the end of the previous sample by a little bit. This "sliding window" over the data helps you avoid exactly this scenario: missing peaks on the boundary between scans of the data. The overlap should be greater than your signal detection width, which in the above examples appears to be a single data point.
For instance, if you are looking at daily data over a period of a month, with a resolution of "1 day," you would start your scan on the last day of the previous month, in order to detect a peak that happened on the first day of this month.
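A minimal sketch of that sliding-window idea (my own illustration; the chunk size of 24 points and the 2-point overlap are arbitrary and not taken from the answer above), reusing the df and the find_peaks import from the code above:

chunk = 24     # points per scan
overlap = 2    # should exceed the detection width (1 point in the examples above)

values = df['price'].values
found = set()
for start in range(0, len(values), chunk - overlap):
    window = values[start:start + chunk]
    idx, _ = find_peaks(window, height=50)
    found.update(start + i for i in idx)
print(sorted(found))

Note that the very first point of the series still needs the padding trick described earlier, since no window can give it a left-hand neighbour.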
Related
I want to detect ranges with the same numerical boundaries of a dataset using matplotlib or pandas in python 3.7
I have a ton of ranges. They all consist of numbers. Each range has a maximum and a minimum which cannot be exceeded, but given two ranges, the max of one may reach above the min of the other. That would mean there is a small area covered by both of them, and you can write one range that includes the other. I want to see if some ranges overlap, or if I can find some ranges that cover most of the others. The goal is to see if I can simplify them by using one smaller range that fits inside the other. For example, 7.8-9.6 and 7.9-9.6 can be covered with one range. You can see my attempt to visualize them below, but when I use my entire dataset, consisting of hundreds of ranges, my graph is no longer useful. I know that I can detect recurring ranges using Python, but I don't want to know how often a range occurs; I want to know how many ranges lie within the same numerical boundaries, and whether I can have a couple of ranges covering all of them. Finally, my goal is to have the master ranges sorted into categories, meaning that range 1 covers 50 other ranges, range 2 covers 25 ranges, and so on. My current program shows the penetration of the ranges, but I also want a printed output with the exact digits. It would be nice if you could share some ideas to solve this, or any suggestions on tools within Python 3.7.

import matplotlib.pyplot as plt

intervals = [[3.6,4.5], [3.6,4.5], [7.8,9.6], [7.9,9.6], [7.8,9.6], [3.4,4.1],
             [2.8,3.4], [8.25,9.83], [3.62,3.96], [8.25,9.83], [0.62,0.68],
             [2.15,2.49], [0.8,1.0], [0.8,1.0], [3.1,3.9], [6.7,8.3], [1,1.5],
             [1,1.2], [1.5,1.8], [1.8,2.5], [3,4.0], [6.5,8.0], [1.129,1.35],
             [2.82,3.38], [1.69,3.38], [3.38,6.21], [2.25,2.82], [5.649,6.214],
             [1.920,6.214]]

for interval in intervals:
    plt.plot(interval, [0, 0], 'b', alpha=0.2, linewidth=100)
plt.show()
Here is an idea: you make a pandas data frame from the array, subtract column 1 from column 2 (column 1 is x and column 2 is y), and then create a histogram of the range widths and their frequencies.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

intervals = [[3.6,4.5], [3.6,4.5], [7.8,9.6], [7.9,9.6], [7.8,9.6], [3.4,4.1],
             [2.8,3.4], [8.25,9.83], [3.62,3.96], [8.25,9.83], [0.62,0.68],
             [2.15,2.49], [0.8,1.0], [0.8,1.0], [3.1,3.9], [6.7,8.3], [1,1.5],
             [1,1.2], [1.5,1.8], [1.8,2.5], [3,4.0], [6.5,8.0], [1.129,1.35],
             [2.82,3.38], [1.69,3.38], [3.38,6.21], [2.25,2.82], [5.649,6.214],
             [1.920,6.214]]

intervals_ar = np.array(intervals)
df = pd.DataFrame({'Column1': intervals_ar[:, 0], 'Column2': intervals_ar[:, 1]})
df['Ranges'] = df['Column2'] - df['Column1']
print(df)

frequency_range = df['Ranges'].value_counts().sort_index()
print(frequency_range)

df.Ranges.value_counts().sort_index().plot(kind='hist', bins=5)
plt.title("Histogram: frequency vs range width (column 2 - column 1)")
plt.show()
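The answer above stops at the histogram of range widths. As a hedged addition (not part of the original answer), one simple way to get at the "master ranges" the question asks about is to sort the intervals and merge any that overlap, counting how many fall inside each merged range; this reuses the intervals list defined above:

def merge_intervals(intervals):
    """Merge overlapping [lo, hi] intervals and count how many each merged range covers."""
    merged = []
    for lo, hi in sorted(intervals):
        if merged and lo <= merged[-1][1]:        # overlaps the previous merged range
            merged[-1][1] = max(merged[-1][1], hi)
            merged[-1][2] += 1
        else:
            merged.append([lo, hi, 1])
    return merged

for lo, hi, count in merge_intervals(intervals):
    print(f"{lo:.3f} - {hi:.3f} covers {count} range(s)")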
QQ-plot python mean and standard deviation
I am trying to plot a Q-Q plot using Python. I was checking scipy.stats.probplot, and the input seems to be the measurements compared against a normal distribution.

import numpy as np
import pylab
import scipy.stats as stats

measurements = np.random.normal(loc=20, scale=5, size=100)
stats.probplot(measurements, dist="norm", plot=pylab)
pylab.show()

In my code, I had stats.probplot(mean, dist="norm", plot=plt) to compare distributions. But I am wondering where I can input the standard deviation? I thought that's a very important factor when comparing distributions, but so far I can only input the mean. Thanks.
Let's suppose you have a list of floats:

X = [-1.31, 4.82, 2.18, 1.99, 4.37, 2.58, 7.22, 3.93, 6.95, 2.41,
     2.02, 2.48, -1.01, 2.3, 2.87, -0.06, 2.13, 3.62, 5.24, 0.57]

If you want to make a Q-Q plot test, you need to compare X against a distribution, for example N(0, 1), a normal distribution whose mean = 0 and sigma = 1. In OpenTURNS, it goes like this:

import openturns as ot
from openturns.viewer import View

sample = ot.Sample([[p] for p in X])
graph = ot.VisualTest.DrawQQplot(sample, ot.Normal(0, 1))
View(graph)

Explanation: I tell OpenTURNS I have a sample of 20 points [p] coming from X, and not 1 point in dimension 20. Then I call ot.VisualTest.DrawQQplot with 2 arguments: sample and the normal distribution ot.Normal(0, 1). We see on the graph that the test fails. The question now is: what is the best normal distribution fitting the sample? Thanks to NormalFactory() the answer is simple:

BestNormalDistribution = ot.NormalFactory().build(sample)

If you print(BestNormalDistribution) you get the parameters of this distribution:

Normal(mu = 2.76832, sigma = 2.27773)

If we repeat the Q-Q plot test of the sample against BestNormalDistribution, the result is much better.
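To tie this back to the scipy function the question started from (an addition of mine, not part of the original answer): scipy.stats.probplot plots the ordered sample against quantiles of the standard distribution and returns a fitted line whose slope and intercept estimate the standard deviation and the mean, so you do not pass them in, you read them out:

import numpy as np
import scipy.stats as stats

X = [-1.31, 4.82, 2.18, 1.99, 4.37, 2.58, 7.22, 3.93, 6.95, 2.41,
     2.02, 2.48, -1.01, 2.3, 2.87, -0.06, 2.13, 3.62, 5.24, 0.57]

(osm, osr), (slope, intercept, r) = stats.probplot(np.array(X), dist="norm")
print("estimated sigma (slope):", slope)
print("estimated mu (intercept):", intercept)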
Expanding an array with equal increments towards zero
I have various dataframes, each with a different depth range. For a more complex computation (this is a fragment of a question posted here: Curve fitting for each column in Pandas + extrapolate values), I need to write a function that expands the depth column/array towards zero (the surface) in equal increments dz (in this case 0.5). Here the missing quotas are 0.15, 0.65 and 1.15.

import numpy as np

depth = np.array([1.65, 2.15, 2.65, 3.15, 3.65, 4.15, 4.65, 5.15, 5.65,
                  6.15, 6.65, 7.15, 7.65, 8.15, 8.65])

Any ideas how to write a function so it does this each time for a different depth range (i.e. depending on the varying minimum value)?
A very simple solution I came up with:

depth_min = np.min(depth)
step = 0.5
missing_vals = np.arange(depth_min - step, 0, -step)[::-1]
depth_tot = np.concatenate((missing_vals, depth), axis=0)

I'm sure there exist better ways.
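Since the question asks for a reusable function, the same idea can be wrapped up as below (an illustrative sketch, not part of the original answer; the function name and default step are my own choices):

def extend_to_surface(depth, step=0.5):
    """Prepend equally spaced values below the minimum depth, stopping above zero."""
    missing_vals = np.arange(np.min(depth) - step, 0, -step)[::-1]
    return np.concatenate((missing_vals, depth))

depth_tot = extend_to_surface(depth)   # works for any minimum depth
print(depth_tot[:5])                   # approximately [0.15 0.65 1.15 1.65 2.15]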
How to plot a value_counts in pandas that has a huge number of different counts not distributed evenly
Let's say I have the following data:

s2 = pd.Series([1,2,3,4,5,2,3,333,2,123,434,1,2,3,1,11,11,432,3,2,4,3,3,3,54,34,24,2,223,2535334,3,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,30000,
                2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2])
s2.value_counts(normalize=True).plot()

What I want to show in the plot is that a few numbers make up the majority of cases. The problem is that this will be seen at the far left side of the graph, and then there will be a straight line for all the other categories. In the real data the x axis will be categorical with about 18000 categories; 4% of the counts will be around 10000 high, and the rest will drop off to around 50. I want to show this to an audience of "ordinary" business people, so it can't be some fancy, hard-to-read solution.

Update: see unutbu's answer below. With the updated code I'm getting an error from qcut when trying to use tuples: TypeError: unsupported operand type(s) for -: 'tuple' and 'tuple'

df = pd.DataFrame({'s1':[1,0,1,0], 's2':[1,0,1,1], 's3':[1,0,1,1], 's4':[0,0,0,1]})
perms = df.apply(tuple, axis=1)
prob = perms.value_counts(normalize=True).reset_index(drop=True)
category_classes = pd.qcut(prob, q=[0, .25, 0.95, 1.],
                           labels=['bottom 25%', 'mid 70%', 'top 5%'])
prob_groups = prob.groupby(category_classes).sum()
prob_groups.plot(kind='bar')
plt.xticks(rotation=0)
plt.show()
You could keep the normalized value counts above a certain threshold, then sum together the values below the threshold and clump them into one category which could be called, say, "other". By choosing the threshold high enough, you will be able to display the most important contributors to the overall probability distribution, while still showing the size of the tail in the bar labeled "other":

import matplotlib.pyplot as plt
import pandas as pd

s2 = pd.Series([1,2,3,4,5,2,3,333,2,123,434,1,2,3,1,11,11,432,3,2,4,3,3,3,54,34,24,2,223,2535334,3,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,30000,
                2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2])

prob = s2.value_counts(normalize=True)
threshold = 0.02
mask = prob > threshold
tail_prob = prob.loc[~mask].sum()
prob = prob.loc[mask]
prob['other'] = tail_prob
prob.plot(kind='bar')
plt.xticks(rotation=25)
plt.show()

There is a limit to the number of category labels you can sensibly display on a bar graph. For a normal-sized graph, 3000 is way too many. Moreover, it is probably not reasonable to expect an audience to glean any meaning out of reading 3000 labels. The graph should summarize the data, and the main point seems to be that 4 or 5% of the categories constitute the vast majority of the cases. So to drive home that point, perhaps use pd.qcut to categorize the cases into simple categories such as bottom 25%, mid 70%, and top 5%:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

N = 18000
categories = np.arange(N)
np.random.shuffle(categories)
M = int(N * 0.04)
prob = pd.Series(np.concatenate([np.random.randint(9000, 11000, size=M),
                                 np.random.randint(0, 100, size=N-M)]),
                 index=categories)
prob /= prob.sum()

category_classes = pd.qcut(prob, q=[0, .25, 0.95, 1.],
                           labels=['bottom 25%', 'mid 70%', 'top 5%'])
prob_groups = prob.groupby(category_classes).sum()
prob_groups.plot(kind='bar')
plt.xticks(rotation=0)
plt.show()
Just log the axis (I have no pandas here, but it should be similar):

import numpy as np
import matplotlib.pyplot as plt

s2 = np.log([1,2,3,4,5,2,3,333,2,123,434,1,2,3,1,11,11,432,3,2,4,3,3,3,54,34,24,2,223,2535334,3,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,30000,
             2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2])
plt.plot(s2)
plt.show()
Fitting a Python Scipy truncnorm model to observed values and then sampling
I am struggling with the scipy truncnorm fit method, and I would appreciate help so that the fitted parameter coefficients are consistent with the observed data. As an example, I have created a small sample from the right-hand tail of a N(0,1) distribution (where the observations are larger than 2 standard deviations) and have thrown in a few outliers.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import truncnorm

values = np.array([2.01, 2.06, 2.71, 2.31, 2.58, 2.17, 3.03, 2.24, 2.12, 2.72,
                   2.46, 2.66, 2.49, 3.41, 2.46, 2.12, 2.12, 2.65, 2.32, 2.49,
                   5.15, 2.62, 2.48, 2.27, 2.05])
pd.Series(values).describe()

This produces the following summary statistics:

count    25.000
mean      2.548
std       0.633
min       2.01
25%       2.17
50%       2.46
75%       2.65
max       5.15

To illustrate the problem I am having with the scipy fit method, and to better understand the truncnorm implementation, I have built the following intuitive models by inspecting the above summary statistics, and compared sampled histograms against the observed values (see charts below). What I am struggling with is why the fit method gives such poor results when I attempt to sample using the estimated parameters? In the event I am not using the fit results correctly or making some other mistake, I would appreciate help with the transformations.

The code to build up these examples:

size = 10000
bins = 30
intuitive_models = {"model1": (2, 5),
                    "model2": (1, 4, 1),
                    "model3": (0.8, 4, 1, 1.25),
                    "fitted": truncnorm.fit(values)}

# store the truncnorm random samples in a dict
model_results = dict()
for model, params in intuitive_models.items():
    model_results[model] = truncnorm(*params).rvs(size)

# plot each random sample vs the observed values
# (note: newer matplotlib versions use density=True instead of normed=True)
for model, params in intuitive_models.items():
    plt.figure()
    plt.hist(model_results[model], bins=bins, normed=True)
    plt.title(model + ': ' + repr(params))
    plt.hist(values, normed=True, alpha=0.5)

# tabular comparison
print(pd.DataFrame(model_results).describe())

which produced the following tabular data:

             fitted        model1        model2        model3
count  10000.000000  10000.000000  10000.000000  10000.000000
mean       1.024707      2.372819      2.524923      2.698601
std        0.014362      0.333144      0.443857      0.584215
min        1.000019      2.000040      2.000007      2.000019
25%        1.012248      2.121838      2.181642      2.245088
50%        1.024518      2.280975      2.407814      2.557983
75%        1.036996      2.534782      2.757778      2.998948
max        1.049991      4.829619      4.982337      5.905201

Thanks, Bertie.

p.s. I hope this is accepted as a coding question and not a stats question, which is why I have posted it here.

-- Update 28-Aug-2014 --

The idea with this post was to get some help with the scipy.stats.truncnorm.fit method, and in the couple of days since, I have built my own clunky algorithm. From my discussions with Robert, I get the impression the R or standard implementation of truncnorm only takes 3 parameters. For those coming to this post later, once scipy has an improved fitting engine, this is what I have estimated (assuming we want an asymptotic right tail).
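Since the estimate itself is not shown above, here is a commonly suggested workaround as a hedged sketch of my own (not part of the original post): pin down the location and scale with the generic floc/fscale keywords that every scipy rv_continuous fit method accepts, so that only the truncation bounds a and b are estimated. The starting guesses of 2 and 6 are my own choices based on the sample:

from scipy.stats import truncnorm
import numpy as np

values = np.array([2.01, 2.06, 2.71, 2.31, 2.58, 2.17, 3.03, 2.24, 2.12, 2.72,
                   2.46, 2.66, 2.49, 3.41, 2.46, 2.12, 2.12, 2.65, 2.32, 2.49,
                   5.15, 2.62, 2.48, 2.27, 2.05])

# fix the underlying (untruncated) distribution to the standard normal
# (loc=0, scale=1) so that fit() only estimates the bounds a and b
a, b, loc, scale = truncnorm.fit(values, 2, 6, floc=0, fscale=1)
print(a, b)   # a should come out near the sample minimum (about 2), consistent with "model1" above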