I'm trying to compute/visualize the time lag between 2 time series (I want to know the time lag between the humidity progression of outside and inside a room).
Each data point of my series was taken hourly. Plotting the 2 series together, I can clearly see a shift between them: Sorry for hiding the axis
Here are a part of my time series data. I will pack them in 2 arrays:
inside_humidity =
[11.77961297, 11.59755268, 12.28761522, 11.88797553, 11.78122077, 11.5694668,
11.70421932, 11.78122077, 11.74272005, 11.78122077, 11.69438733, 11.54126933,
11.28460592, 11.05624965, 10.9611012, 11.07527934, 11.25417308, 11.56040908,
11.6657186, 11.51171572, 11.49246536, 11.78594142, 11.22968373, 11.26840678,
11.26840678, 11.29447992, 11.25553344, 11.19711371, 11.17764047, 11.11922075,
11.04132778, 10.86996123, 10.67410607, 10.63493504, 10.74922916, 10.74922916,
10.6294765, 10.61011497, 10.59075345, 10.80373021, 11.07479154, 11.15223764,
11.19711371, 11.17764047, 11.15816723, 11.22250051, 11.22250051, 11.202915,
11.18332948, 11.16374396, 11.14415845, 11.12457293, 11.10498742, 11.14926578,
11.16896413, 11.16896413, 11.14926578, 10.8307902, 10.51742195, 10.28187137,
10.12608544, 9.98977276, 9.62267727, 9.31289289, 8.96438546, 8.77077022,
8.69332413, 8.51907042, 8.30609366, 8.38353975, 8.4513867, 8.47085994,
8.50980642, 8.52927966, 8.50980642, 8.55887037, 8.51969934, 8.48052831,
8.30425867, 8.2177078, 7.98402891, 7.92560918, 7.89950166, 7.83489682,
7.75789537, 7.5984808, 7.28426807, 7.39778913, 7.71943214, 8.01149931,
8.18276652, 8.23009255, 8.16215295, 7.93822471, 8.00350215, 7.93843482,
7.85072729, 7.49778011, 7.31782649, 7.29862668, 7.60162032, 8.29665484,
8.58797834, 8.50011383, 8.86757784, 8.76600556, 8.60491125, 8.4222628,
8.24923231, 8.14470714, 8.17351638, 8.52530093, 8.72220151, 9.26745883,
9.1580007, 8.61762692, 8.22187405, 8.43693644, 8.32414835, 8.32463974,
8.46833012, 8.55865487, 8.72647164, 9.04112806, 9.35578449, 9.59465974,
10.47339785, 11.07218093, 10.54091351, 10.56138918, 10.46099958, 10.38129168,
10.16434831, 10.10612612, 10.009246, 10.53502351, 10.8307902, 11.13420052,
11.64337309, 11.18958511, 10.49630791, 10.60856932, 10.37029108, 9.86281478,
9.64699826, 9.95341012, 10.24329812, 10.6848196, 11.47604231, 11.30505352,
10.72194974, 10.30058448, 10.05022037, 10.06318411, 9.90118897, 9.68530059,
9.47790657, 9.48585784, 9.61639418, 9.86244265, 10.29009361, 10.28297229,
10.32073088, 10.65389513, 11.09656351, 11.20188562, 11.24124169, 10.40503955,
9.74632512, 9.07606098, 8.85145589, 9.37080152, 9.65082743, 10.0707891,
10.68776091, 11.25879751, 11.0416348, 10.89558456, 10.7908258, 10.66539685,
10.7297755, 10.77571398, 10.9268264, 11.16021492, 11.60961709, 11.43827534,
11.96155427, 12.16116437, 12.80412266, 12.52540805, 11.96752965, 11.58099292]
outside_humidity =
[10.17449206, 10.4823292, 11.06818167, 10.82768699, 11.27582592, 11.4196233,
10.99393027, 11.4122507, 11.18192837, 10.87247831, 10.68664321, 10.37949651,
9.57155882, 10.86611665, 11.62547196, 11.32004266, 11.75537602, 11.51292063,
11.03107569, 10.7297755, 10.4345622, 10.61271497, 9.49271162, 10.15594248,
9.99053828, 9.80915398, 9.6452438, 10.06900573, 11.18075689, 11.8289847,
11.83334752, 11.27480708, 11.14370467, 10.88149985, 10.73930381, 10.7236597,
10.26210496, 11.01260226, 11.05428228, 11.58321342, 12.70523808, 12.5181118,
11.90023799, 11.67756426, 11.28859471, 10.86878222, 9.73984486, 10.18253902,
9.80915398, 10.50980784, 11.38673459, 11.22751685, 10.94171823, 10.56484228,
10.38220753, 10.05388847, 9.96147203, 9.90698862, 9.7732203, 9.85262125,
8.7412938, 8.88281702, 8.07919545, 8.02883587, 8.32341424, 8.07357711,
7.27302616, 6.73660684, 6.66722819, 7.29408637, 7.00046542, 6.46322019,
6.07150988, 6.00207234, 5.8818402, 6.82443881, 7.20212882, 7.52167696,
7.88857771, 8.351627, 8.36547023, 8.24802846, 8.18520693, 7.92420816,
7.64926024, 7.87944972, 7.82118727, 8.02091833, 7.93071882, 7.75789457,
7.5416447, 6.94430133, 6.65907535, 6.67454591, 7.25493614, 7.76939457,
7.55357806, 6.61479472, 7.17641357, 7.24664082, 8.62732387, 8.66913548,
8.70925667, 9.0477017, 8.24558224, 8.4330502, 8.44366397, 8.17995798,
8.1875752, 9.33296518, 9.66567041, 9.88581085, 8.95449382, 8.3587624,
9.20584448, 8.90605388, 8.87494884, 9.12694892, 8.35055177, 7.91879933,
7.78867253, 8.22800878, 9.03685287, 12.49630018, 11.11819755, 10.98869374,
10.65897176, 10.36444573, 10.052609, 10.87627021, 10.07379564, 10.02233847,
9.62022856, 11.21575473, 10.85483543, 11.67324627, 11.89234248, 11.10068132,
10.06942096, 8.50405894, 8.13168561, 8.83616476, 8.35675085, 8.33616802,
8.35675085, 9.02209801, 9.5530404, 9.44738836, 10.89645958, 11.44771721,
11.79943601, 10.7765335, 11.1453622, 10.74874776, 10.55195175, 10.34494483,
9.83813522, 11.26931785, 11.20641798, 10.51555027, 10.90808954, 11.80923545,
11.68300879, 11.60313809, 7.95163365, 7.77213815, 7.54209557, 7.30603673,
7.17842173, 8.25899805, 8.56494995, 10.44245578, 11.08542758, 11.74129079,
11.67979686, 12.94362214, 11.96285343, 11.8289847, 11.01388413, 10.6793698,
11.20662595, 11.97684701, 12.46383177, 11.34178655, 12.12477078, 12.48698059,
12.89325064, 12.07470295, 12.6777319, 10.91689448, 10.7676326, 10.66710434]
I know cross correlation is the right term to use, but after a while I still don't get the idea of using scipy.signal.correlate and numpy.correlate, because all I got is an array full of NaNs. So clearly I need some more knowledge in this area.
What I expect to achieve is probably a plot like those in the answer section of this thread How to make a correlation plot with a certain lag of two time series where I can see at how many hours the time lag is most likely.
Thank you a lot in advance!
With the given data, you can use the numpy and matplotlib modules to achieve the desired result.
so, you can do something like this:
import numpy as np
from matplotlib import pyplot as plt
x = np.array(inside_humidity)
y = np.array(outside_humidity)
fig = plt.figure()
# fit a curve of your choice
a, b = np.polyfit(inside_humidity, outside_humidity, 1)
y_fit = a * x + b
# scatter plot, and fitted plot (best fit used)
plt.scatter(inside_humidity, outside_humidity)
plt.plot(x, y_fit)
plt.show()
which gives this:
I'm trying to visualize how a distribution changes over time -- each vertical slice should be the distribution at that timestep.
I want it to look something like this (there are two such curves/temporal-histograms here).
The closest I've found is this seaborn time series, but I want the distribution or at least the min, mean, and max -- this band is a confidence interval, which I can't use (it's also prohibitively slow).
https://seaborn.pydata.org/examples/errorband_lineplots.html
Update:
Here's a snippet to product some sample data.
import numpy as np
num_timesteps = 20
samples_per_timestep = 100
timesteps = np.arange(num_timesteps)
def get_std(t):
return t if t < num_timesteps//2 else abs(num_timesteps - t)
samples = np.stack([np.random.normal(t, get_std(t), samples_per_timestep) for t in timesteps])
samples[t] is a sample of the ditribution a timestep t. The distribution starts as constant (std = 0), widens, then narrows again.
Update: So I asked about this in the Seaborn repository, and this can be way simpler using Seaborn.
Here's an example:
import seaborn as sns
import seaborn.objects as so
fmri = sns.load_dataset("fmri").query("region == 'parietal'")
p = so.Plot(fmri, "timepoint", "signal")
for tail in [25, 10, 5, 1]:
p = p.add(so.Band(), so.Perc([tail, 100 - tail]))
p.add(so.Line(), so.Agg("median"))
Which will result in this plot:
You can read more about it in Statistical estimation and error bars.
This is a lot less work and better scalable. Hope it helps!
I had the exact same problem, and took quite a few detours, but this is most definitely possible!
Imports
We need to import matplotlib, NumPy and Pandas.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
Input data
I assume you have the data as a Pandas Series, with the time/step on the index and the values as values.
Step
Wealth
0
0
0
0
0
0
0
0
1
7.89338
1
7.50838
1
2.00948
1
8.74963
I load my data from a pickle (in the format specified above):
wealth = pd.read_pickle("wealth.pickle")
You can download a ZIP with this pickle file here: wealth.zip.
Aggregate the data to percentiles
This part is a bit ugly. We first define a partial NumPy function for each percentile we want to calculate:
# Define functions to calculate percentiles
def q1(x):
return x.quantile(0.01)
def q5(x):
return x.quantile(0.05)
def q25(x):
return x.quantile(0.25)
def q50(x):
return x.quantile(0.50)
def q75(x):
return x.quantile(0.75)
def q95(x):
return x.quantile(0.95)
def q99(x):
return x.quantile(0.99)
If anyone reading this has a better/cleaner way to aggerate this, please let me know!
Note we use numpy.quantile, because (in my case) it works better with the data in the series, but numpy.percentile should be equivalent.
The data now needs to be grouped by the index (level=0), and then aggerated using the functions defined above:
w_agg = wealth.groupby(level=0).agg([q1, q5, q25, q50, q75, q95, q99])
And now w_agg looks like this:
Step
q1
q5
q25
q50
q75
q95
q99
0
0
0
0
0
0
0
0
1
-3.2311
0.759751
3.2881
6.03641
8.43206
11.9663
15.4515
2
-3.22888
-1.15079
3.13756
6.41804
8.43206
12.7269
15.4515
3
-5.31614
-1.91156
3.22287
6.54126
8.77544
14.644
15.5798
4
-5.64095
-2.52143
2.65959
6.22455
9.40699
14.6545
15.9647
Plotting
Now we can start with the plotting. Aside from the regulars, we're using matplotlib.filbetween for this.
We create a figure, and then start with the widest percentile range: From 1 to 99. Then we draw 5 to 95 on top of that, then 25 to 75 on top of that and finally the median as a line.
Play a bit with the alpha and color values to make it look nice!
# Create a figure and axes
fig, ax = plt.subplots(figsize=(10, 6))
# Add the bands to the axes
ax.fill_between(x=w_agg.index, y1=w_agg["q1"], y2=w_agg["q99"], alpha=0.3, color="tab:blue")
ax.fill_between(x=w_agg.index, y1=w_agg["q5"], y2=w_agg["q95"], alpha=0.3, color="tab:blue")
ax.fill_between(x=w_agg.index, y1=w_agg["q25"], y2=w_agg["q75"], alpha=0.3, color="tab:blue")
# Plot the median as line
ax.plot(w_agg.index, w_agg["q50"], '-', color="tab:blue")
# Add title, legend and plot
ax.set_title("Distribution of wealth between population over time")
ax.set_xlabel("Time")
ax.set_ylabel("Wealth")
ax.legend([f"{n}% distribution" for n in [99, 90, 50]] + ["Median"], loc="upper left")
fig.tight_layout()
The result
For my dataset, this was the result:
I have a ton of ranges. They all consist of numbers. The range has a maximum and a minimum which can not be exceeded, but given the example that you have two ranges and one max point of the range reaches above the min area of the other. That would mean that you have a small area that covers both of them. You can write one range that includes the others.
I want to see if some ranges overlap or if I can find some ranges that cover most of the other. The goal would be to see if I can simplify them by using one smaller range that fits inside the other. For example 7,8 - 9,6 and 7,9 - 9,6 can be covered with one range.
You can see my attempt to visualize them. But when I use my entire dataset consisting of hundreds of ranges my graph is not longer useful.
I know that I can detect recurrent ranges using python. But I don't want to know how often a range occurs. I want to know how many ranges lay in the same numerical boundaries.I want see if I can have a couple of ranges covering all of them. Finally my goal is to have the masterranges sorted in categories. Meaning that I have range 1 covering 50 other ranges. then range 2 covering 25 ranges and so on.
My current program shows the penetration of ranges but I also want that in a printed output with the exact digits.
It would be nice if you share some ideas to solve that program or if you have any suggestions on tools within python 3.7
import matplotlib.pyplot as plt
intervals = [[3.6,4.5],
[3.6,4.5],
[7.8,9.6],
[7.9,9.6],
[7.8,9.6],
[3.4,4.1],
[2.8,3.4],
[8.25,9.83],
[3.62,3.96],
[8.25,9.83],
[0.62,0.68],
[2.15,2.49],
[0.8,1.0],
[0.8,1.0],
[3.1,3.9],
[6.7,8.3],
[1,1.5],
[1,1.2],
[1.5,1.8],
[1.8,2.5],
[3,4.0],
[6.5,8.0],
[1.129,1.35],
[2.82,3.38],
[1.69,3.38],
[3.38,6.21],
[2.25,2.82],
[5.649,6.214],
[1.920,6.214]
]
for int in intervals:
plt.plot(int,[0,0], 'b', alpha = 0.2, linewidth = 100)
plt.show()
Here is an idea, You make a pandas data frame with the array. You substract the values in column2 - colum1 ( column 1 is x, and column 2 is y ). After that you create a histogram in which you take the range and the frecuency.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
intervals = [[3.6,4.5],
[3.6,4.5],
[7.8,9.6],
[7.9,9.6],
[7.8,9.6],
[3.4,4.1],
[2.8,3.4],
[8.25,9.83],
[3.62,3.96],
[8.25,9.83],
[0.62,0.68],
[2.15,2.49],
[0.8,1.0],
[0.8,1.0],
[3.1,3.9],
[6.7,8.3],
[1,1.5],
[1,1.2],
[1.5,1.8],
[1.8,2.5],
[3,4.0],
[6.5,8.0],
[1.129,1.35],
[2.82,3.38],
[1.69,3.38],
[3.38,6.21],
[2.25,2.82],
[5.649,6.214],
[1.920,6.214]]
intervals_ar = np.array(intervals)
df = pd.DataFrame({'Column1': intervals_ar[:, 0], 'Column2': intervals_ar[:, 1]})
df['Ranges'] = df['Column2'] - df ['Column1']
print(df)
frecuency_range = df['Ranges'].value_counts().sort_index()
print(frecuency_range)
df.Ranges.value_counts().sort_index().plot(kind = 'hist', bins = 5)
plt.title("Histogram Frecuency vs Range (column 2- column1)")
plt.show()
I have an array which contains error values as a function of two different quantities (alpha and eigRange).
I fill my array like this :
for j in range(n):
for i in range(alphaLen):
alpha = alpha_list[i]
c = train.eig(xt_, yt_,m-j, m,alpha, "cpu")
costListTrain[j, i] = cost.err(xt_, xt_, yt_, c)
normedValues=costListTrain/np.max(costListTrain.ravel())
where
n = 20
alpha_list = [0.0001,0.0003,0.0008,0.001,0.003,0.006,0.01,0.03,0.05]
My costListTrain array contains some values that have very small differences, e.g.:
2.809458902485728 2.809458905776425 2.809458913576337 2.809459011062461
2.030326752376704 2.030329906064879 2.030337351188699 2.030428976282031
1.919840839066182 1.919846470077076 1.919859731440199 1.920021453630778
1.858436351617677 1.858444223016128 1.858462730482461 1.858687054377165
1.475871326997542 1.475901926855846 1.475973476249240 1.476822830933632
1.475775410801635 1.475806023102173 1.475877601316863 1.476727286424228
1.475774284270633 1.475804896751524 1.475876475382906 1.476726165223209
1.463578292548192 1.463611627166494 1.463689466240788 1.464609083309240
1.462859608038034 1.462893157900139 1.462971489632478 1.463896516033939
1.461912706143012 1.461954067956570 1.462047793798572 1.463079574605320
1.450581041157659 1.452770209885761 1.454835202839513 1.459676311335618
1.450581041157643 1.452770209885764 1.454835202839484 1.459676311335624
1.450581041157651 1.452770209885735 1.454835202839484 1.459676311335610
1.450581041157597 1.452770209885784 1.454835202839503 1.459676311335620
1.450581041157575 1.452770209885757 1.454835202839496 1.459676311335619
1.450581041157716 1.452770209885711 1.454835202839499 1.459676311335613
1.450581041157667 1.452770209885744 1.454835202839509 1.459676311335625
1.450581041157649 1.452770209885750 1.454835202839476 1.459676311335617
1.450581041157655 1.452770209885708 1.454835202839442 1.459676311335622
1.450581041157571 1.452770209885700 1.454835202839498 1.459676311335622
as you can here the value are very very close together!
I am trying to plotting this data in a way where I have the two quantities in the x, y axes and the error value is represented by the dot color.
This is how I'm plotting my data:
alpha_list = np.log(alpha_list)
eigenvalues, alphaa = np.meshgrid(eigRange, alpha_list)
vMin = np.min(costListTrain)
vMax = np.max(costListTrain)
plt.scatter(x, y, s=70, c=normedValues, vmin=vMin, vmax=vMax, alpha=0.50)
but the result is not correct.
I tried to normalize my error value by dividing all values by the max, but it didn't work !
The only way that I could make it work (which is incorrect) is to normalize my data in two different ways. One is base on each column (which means factor1 is constant, factor 2 changing), and the other one based on row (means factor 2 is constant and factor one changing). But it doesn't really make sense because I need a single plot to show the tradeoff between the two quantities on the error values.
UPDATE
this is what I mean by last paragraph.
normalizing values base on max on each rows which correspond to eigenvalues:
maxsEigBasedTrain= np.amax(costListTrain.T,1)[:,np.newaxis]
maxsEigBasedTest= np.amax(costListTest.T,1)[:,np.newaxis]
normEigCostTrain=costListTrain.T/maxsEigBasedTrain
normEigCostTest=costListTest.T/maxsEigBasedTest
normalizing values base on max on each column which correspond to alphas:
maxsAlphaBasedTrain= np.amax(costListTrain,1)[:,np.newaxis]
maxsAlphaBasedTest= np.amax(costListTest,1)[:,np.newaxis]
normAlphaCostTrain=costListTrain/maxsAlphaBasedTrain
normAlphaCostTest=costListTest/maxsAlphaBasedTest
plot 1:
where no. eigenvalue = 10 and alpha changes (should correspond to column 10 of plot 1) :
where alpha = 0.0001 and eigenvalues change (should correspond to first row of plot1)
but as you can see the results are different from plot 1!
UPDATE:
just to clarify more stuff this is how I read my data:
from sklearn.datasets.samples_generator import make_regression
rng = np.random.RandomState(0)
diabetes = datasets.load_diabetes()
X_diabetes, y_diabetes = diabetes.data, diabetes.target
X_diabetes=np.c_[np.ones(len(X_diabetes)),X_diabetes]
ind = np.arange(X_diabetes.shape[0])
rng.shuffle(ind)
#===============================================================================
# Split Data
#===============================================================================
import math
cross= math.ceil(0.7*len(X_diabetes))
ind_train = ind[:cross]
X_train, y_train = X_diabetes[ind_train], y_diabetes[ind_train]
ind_val=ind[cross:]
X_val,y_val= X_diabetes[ind_val], y_diabetes[ind_val]
I also uploaded .csv files HERE
log.csv contain the original value before normalization for plot 1
normalizedLog.csv for plot 1
eigenConst.csv for plot 2
alphaConst.csv for plot 3
I think I found the answer. First of all there was one problem in my code. I was expecting the "No. of eigenvalue" correspond to rows but in my for loop they fill the columns. The currect answer is this :
for i in range(alphaLen):
for j in range(n):
alpha=alpha_list[i]
c=train.eig(xt_, yt_,m-j,m,alpha,"cpu")
costListTrain[i,j]=cost.err(xt_,xt_,yt_,c)
costListTest[i,j]=cost.err(xt_,xv_,yv_,c)
After asking questions from friends and colleagues I got this answer :
I would assume on default imshow and other plotting commands you
might want to use, do equally sized intervals on the values you are
plotting. if you can set that to logarithmic you should be fine.
Ideally, equally "populated bins" would proof most effective, i guess.
for plotting I just subtract the min value from the error and the add a small number and at the end take the log.
temp=costListTrain- costListTrain.min()
temp+=0.00000001
extent = [0, 20,alpha_list[0], alpha_list[-1]]
plt.imshow(np.log(temp),interpolation="nearest",cmap=plt.get_cmap('spectral'), extent = extent, origin="lower")
plt.colorbar()
and result is :