I'm trying to compute/visualize the time lag between 2 time series (I want to know the time lag between the humidity progression of outside and inside a room).
Each data point of my series was taken hourly. Plotting the 2 series together, I can clearly see a shift between them: Sorry for hiding the axis
Here is part of my time series data, packed into 2 arrays:
inside_humidity =
[11.77961297, 11.59755268, 12.28761522, 11.88797553, 11.78122077, 11.5694668,
11.70421932, 11.78122077, 11.74272005, 11.78122077, 11.69438733, 11.54126933,
11.28460592, 11.05624965, 10.9611012, 11.07527934, 11.25417308, 11.56040908,
11.6657186, 11.51171572, 11.49246536, 11.78594142, 11.22968373, 11.26840678,
11.26840678, 11.29447992, 11.25553344, 11.19711371, 11.17764047, 11.11922075,
11.04132778, 10.86996123, 10.67410607, 10.63493504, 10.74922916, 10.74922916,
10.6294765, 10.61011497, 10.59075345, 10.80373021, 11.07479154, 11.15223764,
11.19711371, 11.17764047, 11.15816723, 11.22250051, 11.22250051, 11.202915,
11.18332948, 11.16374396, 11.14415845, 11.12457293, 11.10498742, 11.14926578,
11.16896413, 11.16896413, 11.14926578, 10.8307902, 10.51742195, 10.28187137,
10.12608544, 9.98977276, 9.62267727, 9.31289289, 8.96438546, 8.77077022,
8.69332413, 8.51907042, 8.30609366, 8.38353975, 8.4513867, 8.47085994,
8.50980642, 8.52927966, 8.50980642, 8.55887037, 8.51969934, 8.48052831,
8.30425867, 8.2177078, 7.98402891, 7.92560918, 7.89950166, 7.83489682,
7.75789537, 7.5984808, 7.28426807, 7.39778913, 7.71943214, 8.01149931,
8.18276652, 8.23009255, 8.16215295, 7.93822471, 8.00350215, 7.93843482,
7.85072729, 7.49778011, 7.31782649, 7.29862668, 7.60162032, 8.29665484,
8.58797834, 8.50011383, 8.86757784, 8.76600556, 8.60491125, 8.4222628,
8.24923231, 8.14470714, 8.17351638, 8.52530093, 8.72220151, 9.26745883,
9.1580007, 8.61762692, 8.22187405, 8.43693644, 8.32414835, 8.32463974,
8.46833012, 8.55865487, 8.72647164, 9.04112806, 9.35578449, 9.59465974,
10.47339785, 11.07218093, 10.54091351, 10.56138918, 10.46099958, 10.38129168,
10.16434831, 10.10612612, 10.009246, 10.53502351, 10.8307902, 11.13420052,
11.64337309, 11.18958511, 10.49630791, 10.60856932, 10.37029108, 9.86281478,
9.64699826, 9.95341012, 10.24329812, 10.6848196, 11.47604231, 11.30505352,
10.72194974, 10.30058448, 10.05022037, 10.06318411, 9.90118897, 9.68530059,
9.47790657, 9.48585784, 9.61639418, 9.86244265, 10.29009361, 10.28297229,
10.32073088, 10.65389513, 11.09656351, 11.20188562, 11.24124169, 10.40503955,
9.74632512, 9.07606098, 8.85145589, 9.37080152, 9.65082743, 10.0707891,
10.68776091, 11.25879751, 11.0416348, 10.89558456, 10.7908258, 10.66539685,
10.7297755, 10.77571398, 10.9268264, 11.16021492, 11.60961709, 11.43827534,
11.96155427, 12.16116437, 12.80412266, 12.52540805, 11.96752965, 11.58099292]
outside_humidity =
[10.17449206, 10.4823292, 11.06818167, 10.82768699, 11.27582592, 11.4196233,
10.99393027, 11.4122507, 11.18192837, 10.87247831, 10.68664321, 10.37949651,
9.57155882, 10.86611665, 11.62547196, 11.32004266, 11.75537602, 11.51292063,
11.03107569, 10.7297755, 10.4345622, 10.61271497, 9.49271162, 10.15594248,
9.99053828, 9.80915398, 9.6452438, 10.06900573, 11.18075689, 11.8289847,
11.83334752, 11.27480708, 11.14370467, 10.88149985, 10.73930381, 10.7236597,
10.26210496, 11.01260226, 11.05428228, 11.58321342, 12.70523808, 12.5181118,
11.90023799, 11.67756426, 11.28859471, 10.86878222, 9.73984486, 10.18253902,
9.80915398, 10.50980784, 11.38673459, 11.22751685, 10.94171823, 10.56484228,
10.38220753, 10.05388847, 9.96147203, 9.90698862, 9.7732203, 9.85262125,
8.7412938, 8.88281702, 8.07919545, 8.02883587, 8.32341424, 8.07357711,
7.27302616, 6.73660684, 6.66722819, 7.29408637, 7.00046542, 6.46322019,
6.07150988, 6.00207234, 5.8818402, 6.82443881, 7.20212882, 7.52167696,
7.88857771, 8.351627, 8.36547023, 8.24802846, 8.18520693, 7.92420816,
7.64926024, 7.87944972, 7.82118727, 8.02091833, 7.93071882, 7.75789457,
7.5416447, 6.94430133, 6.65907535, 6.67454591, 7.25493614, 7.76939457,
7.55357806, 6.61479472, 7.17641357, 7.24664082, 8.62732387, 8.66913548,
8.70925667, 9.0477017, 8.24558224, 8.4330502, 8.44366397, 8.17995798,
8.1875752, 9.33296518, 9.66567041, 9.88581085, 8.95449382, 8.3587624,
9.20584448, 8.90605388, 8.87494884, 9.12694892, 8.35055177, 7.91879933,
7.78867253, 8.22800878, 9.03685287, 12.49630018, 11.11819755, 10.98869374,
10.65897176, 10.36444573, 10.052609, 10.87627021, 10.07379564, 10.02233847,
9.62022856, 11.21575473, 10.85483543, 11.67324627, 11.89234248, 11.10068132,
10.06942096, 8.50405894, 8.13168561, 8.83616476, 8.35675085, 8.33616802,
8.35675085, 9.02209801, 9.5530404, 9.44738836, 10.89645958, 11.44771721,
11.79943601, 10.7765335, 11.1453622, 10.74874776, 10.55195175, 10.34494483,
9.83813522, 11.26931785, 11.20641798, 10.51555027, 10.90808954, 11.80923545,
11.68300879, 11.60313809, 7.95163365, 7.77213815, 7.54209557, 7.30603673,
7.17842173, 8.25899805, 8.56494995, 10.44245578, 11.08542758, 11.74129079,
11.67979686, 12.94362214, 11.96285343, 11.8289847, 11.01388413, 10.6793698,
11.20662595, 11.97684701, 12.46383177, 11.34178655, 12.12477078, 12.48698059,
12.89325064, 12.07470295, 12.6777319, 10.91689448, 10.7676326, 10.66710434]
I know cross correlation is the right term to use, but after a while I still don't get the idea of using scipy.signal.correlate and numpy.correlate, because all I got is an array full of NaNs. So clearly I need some more knowledge in this area.
What I expect to achieve is probably a plot like those in the answer section of this thread How to make a correlation plot with a certain lag of two time series where I can see at how many hours the time lag is most likely.
Thank you a lot in advance!
With the given data, you can use the numpy and matplotlib modules to achieve the desired result.
You can do something like this:
import numpy as np
from matplotlib import pyplot as plt
x = np.array(inside_humidity)
y = np.array(outside_humidity)
fig = plt.figure()
# fit a curve of your choice
a, b = np.polyfit(inside_humidity, outside_humidity, 1)
y_fit = a * x + b
# scatter plot, and fitted plot (best fit used)
plt.scatter(inside_humidity, outside_humidity)
plt.plot(x, y_fit)
plt.show()
which gives this:
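Since the question asks for the lag itself, here is a minimal cross-correlation sketch using numpy.correlate. It is not taken from the answer above: the function name estimate_lag and the synthetic series are made up for illustration, and the real inside_humidity and outside_humidity arrays would be passed in their place. Subtracting the means first keeps the constant offset from dominating the result. If the full data set contains NaN values, that would plausibly explain an all-NaN output, because a single NaN propagates through every sum, so such values would need to be dropped or filled beforehand.

```python
import numpy as np

def estimate_lag(delayed, reference):
    """Estimate by how many samples `delayed` trails `reference`.

    Returns (best_lag, lags, xcorr). A positive best_lag means that
    `delayed` follows `reference` by that many samples.
    """
    delayed = np.asarray(delayed, dtype=float)
    reference = np.asarray(reference, dtype=float)
    # Remove the means so the peak reflects co-variation, not the offset.
    d = delayed - delayed.mean()
    r = reference - reference.mean()
    # mode="full" evaluates every possible overlap of the two series.
    xcorr = np.correlate(d, r, mode="full")
    # Output index i corresponds to lag i - (len(reference) - 1).
    lags = np.arange(-(len(r) - 1), len(d))
    return int(lags[np.argmax(xcorr)]), lags, xcorr

# Synthetic check (assumed data, not the humidity measurements):
# inside_demo is outside_demo delayed by a known number of samples.
rng = np.random.default_rng(0)
true_lag = 5
base = rng.normal(size=200 + true_lag)
outside_demo = base[true_lag:]
inside_demo = base[:-true_lag]

best_lag, lags, xcorr = estimate_lag(inside_demo, outside_demo)
print("estimated lag in samples:", best_lag)
```

With hourly samples the lag is in hours, and plotting xcorr against lags (for example with plt.plot(lags, xcorr)) should give a correlation-versus-lag plot of the kind linked in the question.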
I have data that looks like:
Scenario ymin lower middle upper ymax
One 16362.586379 20911.338893 27121.693254 35219.449009 46406.087619
Two 19779.003240 25390.096116 33108.174561 43545.202225 58464.277060
Rather than use all 50 k data points for every Scenario (there are many more than One and Two), I've computed the positions I need for the box and whiskers.
I try to plot this via
import pandas
import plotnine as p9
df = pandas.read_excel('boxplot_data.xlsx', sheet_name='Sheet1')
gg = p9.ggplot()
gg += p9.geoms.geom_boxplot(mapping=p9.aes(x='Scenario', ymin='ymin', lower='lower', middle='middle', upper='upper', ymax='ymax'), data=df, color='k', show_legend=False, inherit_aes=False)
gg += p9.themes.theme_seaborn()
gg += p9.labels.xlab('Scenario')
gg.save(filename='scenario_boxplot.png', dpi=300)
The documentation at https://plotnine.readthedocs.io/en/stable/generated/plotnine.geoms.geom_boxplot.html#plotnine.geoms.geom_boxplot indicates that the geom_boxplot line of code supplies the required aesthetic parameters to define the box and whiskers.
Running this, however, gives
plotnine.exceptions.PlotnineError: 'stat_boxplot requires the
following missing aesthetics: y'
Why is stat_boxplot being called, with its required aesthetics, not geom_boxplot?
And more importantly, does anybody know how to correct this?
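By default, geom_boxplot uses stat_boxplot, which computes the box and whisker positions from a y aesthetic. Since your values are already computed, use stat_identity instead: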
geom_boxplot(stat='identity', ...)
I am new to working with pymc3 and I am having trouble generating an easy-to-read traceplot.
I'm fitting a mixture of 4 multivariate gaussians to some (x, y) points in a dataset. The model runs fine. My question is with regard to manipulating the pm.traceplot() command to make the output more user-friendly.
Here's my code:
import matplotlib.pyplot as plt
import numpy as np
import pymc3 as pm
import theano.tensor as tt
# points: the array of (x, y) observations (not shown here)
model = pm.Model()
N_CLUSTERS = 4
with model:
    #cluster prior
    w = pm.Dirichlet('w', np.ones(N_CLUSTERS))
    #latent cluster of each observation
    category = pm.Categorical('category', p=w, shape=len(points))
    #make sure each cluster has some values:
    w_min_potential = pm.Potential('w_min_potential', tt.switch(tt.min(w) < 0.1, -np.inf, 0))
    #multivariate normal means
    mu = pm.MvNormal('mu', [0,0], cov=[[1,0],[0,1]], shape = (N_CLUSTERS,2) )
    #break symmetry
    pm.Potential('order_mu_potential', tt.switch(
        tt.all(
            [mu[i, 0] < mu[i+1, 0] for i in range(N_CLUSTERS - 1)]), -np.inf, 0))
    #multivariate centers
    data = pm.MvNormal('data', mu =mu[category], cov=[[1,0],[0,1]], observed=points)
with model:
    trace = pm.sample(1000)
A call to pm.traceplot(trace, ['w', 'mu']) produces this image:
As you can see, it is ambiguous which mean peak corresponds to an x or y value, and which ones are paired together. I have managed a workaround as follows:
from cycler import cycler
#plot the x-means and y-means of our data!
fig, (ax0, ax1) = plt.subplots(nrows=2)
plt.xlabel(r'$\mu$')
plt.ylabel('frequency')
for i in range(4):
    ax0.hist(trace['mu'][:,i,0], bins=100, label='x{}'.format(i), alpha=0.6)
    ax1.hist(trace['mu'][:,i,1], bins=100, label='y{}'.format(i), alpha=0.6)
ax0.set_prop_cycle(cycler('color', ['c', 'm', 'y', 'k']))
ax1.set_prop_cycle(cycler('color', ['c', 'm', 'y', 'k']))
ax0.legend()
ax1.legend()
This produces the following, much more legible plot:
I have looked through the pymc3 documentation and recent questions here, but to no avail. My question is this: is it possible to do what I have done here with matplotlib via builtin methods in pymc3, and if so, how?
Better differentiation between multidimensional variables and the different chains was recently added to ArviZ (the library PyMC3 relies on for plotting).
In the latest ArviZ version, you should be able to do:
az.plot_trace(trace, compact=True, legend=True)
to get the different dimensions of each variable distinguished by color and the different chains distinguished by linestyle. By default, this uses matplotlib's default color cycle and 4 linestyles: solid, dashed, dotted and dash-dotted. Both can be customized: compact_prop controls how dimensions are represented and chain_prop controls how chains are represented. In addition, when using compact, it may be a good idea to also set combined=True to reduce clutter in the first column. As an example:
az.plot_trace(trace, compact=True, combined=True, legend=True, chain_prop=("ls", "-"))
would plot the KDEs in the first column using the data from all chains, and would plot all chains using a solid linestyle (due to combined arg, only relevant for the second column). Two legends will be shown, one for the chain info and another for the compact info.
At least in recent versions, you can use compact=True as in:
pm.traceplot(trace, var_names = ['parameters'], compact=True)
to get one graph with all your parameters combined.
Docs: https://arviz-devs.github.io/arviz/_modules/arviz/plots/traceplot.html
However, I haven't been able to get the colors to differ between lines.
I am trying to plot a straight line with its associated error.
I calculated values for slope (a) and intercepts (b). In addition, I calculated the error associated with these values. So I drew the line given by the typical formula below.
y=ax+b
However, in addition to the line, I also want to draw the associated error. I came up with the idea to draw the lines associated with these formulas and color the space between the lines gray.
y=(a+a_sd)x+(b+b_sd)
y=(a-a_sd)x+(b-b_sd)
Using the following piece of code, I am able to color part of the surface between the lines, but not the whole span (see included output).
I think this may be due to the fact that "distance" is not sorted, and fill_between is using distance[0] and distance[-1] as begin and end for the span, respectively.
As always, any help would be highly appreciated!
import matplotlib.pyplot as plt
distance=[0.35645334340084989, 0.55406894241607718, 0.10201413273193734, 0.13401365724625941, 0.71918808865838735, 0.14151335417722818]
time=[2.4004984846346171, 2.4909766335028447, 1.9852064018125195, 1.9083156734132103, 2.6380396934372863, 1.9114505780323543]
time_SD=[0.062393810960652669, 0.056945715242838917, 0.073960838867327183, 0.084111239062664475, 0.026912957190265499, 0.08595664694840538]
distance_SD=[0.035160608598240162, 0.032976715460514235, 0.02782911002465227, 0.035465701695038584, 0.043009444687382707, 0.038387585107200854]
a=1.17887019041
b=1.83339229489
a_sd=0.159771527859
b_sd=0.0762509747218
plt.errorbar(distance,time,yerr=time_SD, xerr=distance_SD, linestyle="None")
abline_values = [(a)*i + (b) for i in distance]
abline_values_plus = [(a+a_sd)*i + (b+b_sd) for i in distance]
abline_values_minus = [(a-a_sd)*i + (b-b_sd) for i in distance]
plt.plot(distance, abline_values,"r")
plt.fill_between(distance,abline_values_minus,abline_values_plus,facecolor='lightgrey', interpolate=True, edgecolors="None")
leg = plt.legend(loc="lower right", frameon=False, handlelength=0, handletextpad=0)
for item in leg.legendHandles:
    item.set_visible(False)
plt.show()
In order to use pyplot.fill_between(), the list of x (horizontal) coordinates should be sorted. Using an unsorted list of x values is possible, but can lead to undesired results.
Sorting a list can be done using sorted(list).
import matplotlib.pyplot as plt
distance=[0.35645334340084989, 0.55406894241607718, 0.10201413273193734, 0.13401365724625941, 0.71918808865838735, 0.14151335417722818]
time=[2.4004984846346171, 2.4909766335028447, 1.9852064018125195, 1.9083156734132103, 2.6380396934372863, 1.9114505780323543]
time_SD=[0.062393810960652669, 0.056945715242838917, 0.073960838867327183, 0.084111239062664475, 0.026912957190265499, 0.08595664694840538]
distance_SD=[0.035160608598240162, 0.032976715460514235, 0.02782911002465227, 0.035465701695038584, 0.043009444687382707, 0.038387585107200854]
a=1.17887019041
b=1.83339229489
a_sd=0.159771527859
b_sd=0.0762509747218
distance_sorted = sorted(distance)
plt.errorbar(distance,time,yerr=time_SD, xerr=distance_SD, linestyle="None")
abline_values = [(a)*i + (b) for i in distance_sorted]
abline_values_plus = [(a+a_sd)*i + (b+b_sd) for i in distance_sorted]
abline_values_minus = [(a-a_sd)*i + (b-b_sd) for i in distance_sorted]
plt.plot(distance_sorted, abline_values,"r")
plt.fill_between(distance_sorted,abline_values_minus,abline_values_plus, facecolor='lightgrey', edgecolors="None")
plt.show()
The documentation does not mention the requirement of x values being sorted. The reason is probably that fill_between actually works even with unsorted lists, just not the way one might expect. Maybe the following animation gives a more intuitive understanding on the issue:
You are right, fill_between seems to expect the values to be sorted. The documentation is not clear about this behaviour, though. The following example shows the same effect:
import matplotlib.pyplot as plt
from numpy import random, array
#x = random.randn(20) #does not work
x = array(sorted(random.randn(20))) #works
a = 2
d = .5
y_h = x*(a+d)
y_l = x*(a-d)
plt.fill_between(x,y_h, y_l)
plt.show()
As a workaround just sort your values before calculating your errorlines using sorted.
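Sorting x alone is enough in the examples above because the band edges are recomputed from the sorted x values. If the lower and upper bounds are instead stored per data point, every array has to be reordered with the same permutation. A small sketch of that case using numpy.argsort follows; the numbers and the names lower and upper are invented for illustration, and the Agg backend is selected only so the snippet runs without a display.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for interactive use
import matplotlib.pyplot as plt

# Hypothetical per-point data: x is unsorted, and the bounds are stored
# alongside it rather than computed from it.
x = np.array([0.36, 0.55, 0.10, 0.13, 0.72, 0.14])
lower = np.array([2.10, 2.30, 1.80, 1.90, 2.50, 1.85])
upper = np.array([2.60, 2.80, 2.10, 2.20, 3.00, 2.25])

# One permutation, applied to all three arrays, keeps each (x, lower, upper)
# triple together while putting x in ascending order.
order = np.argsort(x)
x_sorted = x[order]
lower_sorted = lower[order]
upper_sorted = upper[order]

fig, ax = plt.subplots()
ax.fill_between(x_sorted, lower_sorted, upper_sorted, facecolor="lightgrey")
# Call fig.savefig(...) or plt.show() here to view the result.
```

Calling sorted() on each list separately would break the pairing between x and its bounds, which is why a shared index array is used here.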
I deal with simulation data and have been using matplotlib a lot lately and have been encountering something (a bug?) that's annoying.
I have been allowing matplotlib to automatically set the tick labels and their type (scientific, etc) and with some data I get weird scientific ticker labels.
In searching for a resolution to this, I found that you can call set_powerlimits((n,m)) to set the limits of data that will be displayed using scientific notation. But I have encountered this problem (if I remember correctly) with data spanning several orders of magnitude. My data is also all over the place, so I need a programmatic solution of some sort, not a hard-coded one.
see: http://matplotlib.org/api/ticker_api.html
Below I have included example data, code, and a screenshot.
#! /usr/bin/env python
from matplotlib import pyplot as plt
data = [
[1.83186088e-08,0.03275],
[1.07139009e-07,0.03275],
[2.06376627e-07,0.03275],
[3.03918517e-07,0.03275],
[4.06032883e-07,0.03275],
[5.01194017e-07,0.03275],
[6.02195723e-07,0.03275],
[7.03536925e-07,0.03275],
[8.04625154e-07,0.03275],
[9.06401951e-07,0.03275],
[1.00041895e-06,0.03275],
[1.10230745e-06,0.03275],
[1.2042525e-06,0.03275],
[1.30647822e-06,0.03275],
[1.40109887e-06,0.03275],
[1.50380097e-06,0.03275],
[1.60683242e-06,0.03275],
[1.70208505e-06,0.03275],
[1.80545692e-06,0.03275],
[1.90090648e-06,0.03275],
[2.00453092e-06,0.03275],
[2.10018627e-06,0.03275],
[2.20401747e-06,0.03275],
[2.30009359e-06,0.03275],
[2.4043033e-06,0.03275],
[2.50066449e-06,0.03275],
[2.60513728e-06,0.03275],
[2.70165405e-06,0.03275],
[2.80635938e-06,0.03275],
[2.90331342e-06,0.03275],
[3.00021199e-06,0.03275],
[3.10546819e-06,0.03275],
[3.20257899e-06,0.03275],
[3.30032923e-06,0.0327499999],
[3.40612833e-06,0.0327499999],
[3.50401732e-06,0.0327499997],
[3.60153069e-06,0.0327499996],
[3.70700708e-06,0.0327499993],
[3.80456907e-06,0.0327499988],
[3.90259984e-06,0.0327499982],
[4.00084149e-06,0.0327499973],
[4.10700266e-06,0.0327499959],
[4.2047462e-06,0.0327499942],
[4.30209468e-06,0.0327499918],
[4.40018204e-06,0.0327499886],
[4.50712875e-06,0.032749984],
[4.60630591e-06,0.0327499785],
[4.70519881e-06,0.0327499715],
[4.80398305e-06,0.0327499628],
[4.90251297e-06,0.0327499521],
[5.00182752e-06,0.032749939],
[5.10157551e-06,0.0327499232],
[5.20157575e-06,0.0327499043],
[5.30145192e-06,0.0327498822],
[5.40127044e-06,0.0327498565],
[5.500537e-06,0.0327498272],
[5.60773155e-06,0.0327497911],
[5.70660709e-06,0.0327497534],
[5.80610521e-06,0.0327497112],
[5.90651786e-06,0.0327496642],
[6.00749437e-06,0.0327496124],
[6.10822094e-06,0.0327495566],
[6.20042255e-06,0.0327495018],
[6.30049028e-06,0.0327494386],
[6.40035803e-06,0.0327493715],
[6.50035477e-06,0.0327493004],
[6.60056805e-06,0.0327492251],
[6.70029936e-06,0.0327491461],
[6.80054193e-06,0.0327490625],
[6.90130872e-06,0.0327489743],
[7.00202598e-06,0.0327488818],
[7.10217348e-06,0.0327487855],
[7.20243015e-06,0.0327486847],
[7.30199609e-06,0.0327485801],
[7.40193254e-06,0.0327484707],
[7.50188319e-06,0.0327483567],
[7.60306205e-06,0.0327482367],
[7.70357184e-06,0.0327481129],
[7.80343389e-06,0.0327479853],
[7.90330165e-06,0.0327478532],
[8.00348513e-06,0.0327477162],
[8.10167039e-06,0.0327475777],
[8.206328e-06,0.0327474253],
[8.3020567e-06,0.0327472819],
[8.40527826e-06,0.0327471228],
[8.50095898e-06,0.0327469714],
[8.60536828e-06,0.0327468019],
[8.70106059e-06,0.0327466426],
[8.80396558e-06,0.032746467],
[8.90727378e-06,0.0327462865],
[9.00225164e-06,0.0327461166],
[9.10359892e-06,0.0327459311],
[9.20470894e-06,0.0327457418],
[9.30582982e-06,0.0327455481],
[9.40750123e-06,0.0327453488],
[9.50134495e-06,0.0327451608],
[9.60358199e-06,0.0327449513],
[9.70705637e-06,0.0327447344],
[9.80377546e-06,0.0327445269],
[9.90091941e-06,0.032744314],
]
times=[]
vals=[]
for elem in data:
    times.append(elem[0])
    vals.append(elem[1])
plt.plot(times,vals)
plt.show()
screen_shot
You might try using the Engineering Formatter:
import matplotlib.ticker
times=[]
vals=[]
for elem in data:
    times.append(elem[0])
    vals.append(elem[1])
plt.plot(times,vals)
formatter = matplotlib.ticker.EngFormatter(unit='S', places=3)
formatter.ENG_PREFIXES[-6] = 'u'
# set the formatter on the current axes before showing the figure
plt.gca().yaxis.set_major_formatter(formatter)
plt.show()
Which will look like this:
This is a known problem. You would be better off analysing the data for its limits yourself, as you have done in the screenshot, and calling ax.set_ylim(min, max) after plotting. You can also turn off the offset with:
import matplotlib.ticker as mticker
# plot some stuff
# ...
y_formatter = mticker.ScalarFormatter(useOffset=False)
ax.yaxis.set_major_formatter(y_formatter)
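For completeness, here is a self-contained sketch of the same idea. The data is synthetic and only mimics the shape of the series in the question (values near 0.03275 that differ in the last few digits); the 10% padding on the limits is an arbitrary choice, not something prescribed by matplotlib.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for interactive use
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker

# Made-up data resembling the question: tiny variation around 0.03275.
times = [i * 1e-7 for i in range(100)]
vals = [0.03275 - 6e-10 * i ** 2 for i in range(100)]

fig, ax = plt.subplots()
ax.plot(times, vals)

# Disable the additive offset so tick labels show the full values.
y_formatter = mticker.ScalarFormatter(useOffset=False)
ax.yaxis.set_major_formatter(y_formatter)

# Derive the y-limits from the data, with a small margin (assumed 10%).
pad = 0.1 * (max(vals) - min(vals))
ax.set_ylim(min(vals) - pad, max(vals) + pad)

fig.canvas.draw()  # renders the figure; use plt.show() interactively
```

The shorter ax.ticklabel_format(useOffset=False) should have the same effect on axes that still use the default ScalarFormatter.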
I think that your best option is to use a logarithmic axis, but if you need a linear axis, you must set the power limits yourself. You can compute the power limits using math.log10:
import math
from matplotlib import ticker
# Compute the span of the data
pow_min = math.floor(math.log10(min(vals)))
pow_max = math.ceil(math.log10(max(vals)))
# Create a scalar formatter without offset, in order to have
# the right exponent over the yaxis
fmt = ticker.ScalarFormatter(useOffset=False)
fmt.set_powerlimits((pow_min, pow_max))
fig = plt.figure()
ax1 = fig.add_subplot(1, 1, 1)
ax1.plot(times, vals)
ax1.yaxis.set_major_formatter(fmt) # Set the formatter