How to create a histogram over time? - python

I'm trying to visualize how a distribution changes over time -- each vertical slice should be the distribution at that timestep.
I want it to look something like this (there are two such curves/temporal-histograms here).
The closest I've found is this seaborn time series example, but I want the distribution, or at least the min, mean, and max -- that band is a confidence interval, which I can't use (it's also prohibitively slow).
https://seaborn.pydata.org/examples/errorband_lineplots.html
Update:
Here's a snippet to produce some sample data.
import numpy as np
num_timesteps = 20
samples_per_timestep = 100
timesteps = np.arange(num_timesteps)
def get_std(t):
    return t if t < num_timesteps // 2 else abs(num_timesteps - t)
samples = np.stack([np.random.normal(t, get_std(t), samples_per_timestep) for t in timesteps])
samples[t] is a sample of the distribution at timestep t. The distribution starts out constant (std = 0), widens, then narrows again.
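For reference, here's a rough sketch (untested beyond this sample data) of the most direct approach I can think of with plain matplotlib: compute one histogram per timestep and stack them into an image.
import matplotlib.pyplot as plt
# Shared bins so all timesteps are comparable
bins = np.linspace(samples.min(), samples.max(), 50)
# One histogram per timestep; each column of the image is one timestep
hist2d = np.stack([np.histogram(s, bins=bins)[0] for s in samples], axis=1)
plt.imshow(hist2d, origin="lower", aspect="auto",
           extent=[timesteps[0], timesteps[-1], bins[0], bins[-1]])
plt.xlabel("timestep")
plt.ylabel("value")
plt.show()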

Update: I asked about this in the Seaborn repository, and it turns out this can be much simpler using Seaborn.
Here's an example:
import seaborn as sns
import seaborn.objects as so
fmri = sns.load_dataset("fmri").query("region == 'parietal'")
p = so.Plot(fmri, "timepoint", "signal")
for tail in [25, 10, 5, 1]:
    p = p.add(so.Band(), so.Perc([tail, 100 - tail]))
p.add(so.Line(), so.Agg("median"))
Which will result in this plot:
You can read more about it in Statistical estimation and error bars.
This is a lot less work and scales much better. Hope it helps!

I had the exact same problem, and took quite a few detours, but this is most definitely possible!
Imports
We need to import matplotlib, NumPy and Pandas.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
Input data
I assume you have the data as a Pandas Series, with the time/step as the index and the wealth values as the values.
Step    Wealth
0       0
0       0
0       0
0       0
1       7.89338
1       7.50838
1       2.00948
1       8.74963
I load my data from a pickle (in the format specified above):
wealth = pd.read_pickle("wealth.pickle")
You can download a ZIP with this pickle file here: wealth.zip.
Aggregate the data to percentiles
This part is a bit ugly. We first define a small function for each percentile we want to calculate:
# Define functions to calculate percentiles
def q1(x):
    return x.quantile(0.01)
def q5(x):
    return x.quantile(0.05)
def q25(x):
    return x.quantile(0.25)
def q50(x):
    return x.quantile(0.50)
def q75(x):
    return x.quantile(0.75)
def q95(x):
    return x.quantile(0.95)
def q99(x):
    return x.quantile(0.99)
If anyone reading this has a better/cleaner way to aggregate this, please let me know!
Note that these helpers call the Series' quantile method; numpy.quantile (or the equivalent numpy.percentile) would work just as well.
The data now needs to be grouped by the index (level=0) and then aggregated using the functions defined above:
w_agg = wealth.groupby(level=0).agg([q1, q5, q25, q50, q75, q95, q99])
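As an aside, the same result can be had more compactly (a sketch, assuming wealth is the Series above): pandas can compute several quantiles in one call and unstack them into columns.
quantiles = [0.01, 0.05, 0.25, 0.50, 0.75, 0.95, 0.99]
w_agg = wealth.groupby(level=0).quantile(quantiles).unstack()
# Rename the float quantile columns to q1, q5, ... to match the code below
w_agg.columns = [f"q{round(q * 100)}" for q in quantiles]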
And now w_agg looks like this:
Step    q1         q5         q25       q50       q75       q95       q99
0        0          0          0         0         0         0         0
1       -3.2311     0.759751   3.2881    6.03641   8.43206   11.9663   15.4515
2       -3.22888   -1.15079    3.13756   6.41804   8.43206   12.7269   15.4515
3       -5.31614   -1.91156    3.22287   6.54126   8.77544   14.644    15.5798
4       -5.64095   -2.52143    2.65959   6.22455   9.40699   14.6545   15.9647
Plotting
Now we can start with the plotting. Aside from the regulars, we're using matplotlib's fill_between for this.
We create a figure, and then start with the widest percentile range: From 1 to 99. Then we draw 5 to 95 on top of that, then 25 to 75 on top of that and finally the median as a line.
Play a bit with the alpha and color values to make it look nice!
# Create a figure and axes
fig, ax = plt.subplots(figsize=(10, 6))
# Add the bands to the axes
ax.fill_between(x=w_agg.index, y1=w_agg["q1"], y2=w_agg["q99"], alpha=0.3, color="tab:blue")
ax.fill_between(x=w_agg.index, y1=w_agg["q5"], y2=w_agg["q95"], alpha=0.3, color="tab:blue")
ax.fill_between(x=w_agg.index, y1=w_agg["q25"], y2=w_agg["q75"], alpha=0.3, color="tab:blue")
# Plot the median as line
ax.plot(w_agg.index, w_agg["q50"], '-', color="tab:blue")
# Add title, legend and plot
ax.set_title("Distribution of wealth between population over time")
ax.set_xlabel("Time")
ax.set_ylabel("Wealth")
ax.legend([f"{n}% distribution" for n in [99, 90, 50]] + ["Median"], loc="upper left")
fig.tight_layout()
The result
For my dataset, this was the result:

Related

How can I calculate the time lag between two similar time series?

I'm trying to compute/visualize the time lag between 2 time series (I want to know the time lag between the humidity progression of outside and inside a room).
Each data point of my series was taken hourly. Plotting the 2 series together, I can clearly see a shift between them: Sorry for hiding the axis
Here are a part of my time series data. I will pack them in 2 arrays:
inside_humidity =
[11.77961297, 11.59755268, 12.28761522, 11.88797553, 11.78122077, 11.5694668,
11.70421932, 11.78122077, 11.74272005, 11.78122077, 11.69438733, 11.54126933,
11.28460592, 11.05624965, 10.9611012, 11.07527934, 11.25417308, 11.56040908,
11.6657186, 11.51171572, 11.49246536, 11.78594142, 11.22968373, 11.26840678,
11.26840678, 11.29447992, 11.25553344, 11.19711371, 11.17764047, 11.11922075,
11.04132778, 10.86996123, 10.67410607, 10.63493504, 10.74922916, 10.74922916,
10.6294765, 10.61011497, 10.59075345, 10.80373021, 11.07479154, 11.15223764,
11.19711371, 11.17764047, 11.15816723, 11.22250051, 11.22250051, 11.202915,
11.18332948, 11.16374396, 11.14415845, 11.12457293, 11.10498742, 11.14926578,
11.16896413, 11.16896413, 11.14926578, 10.8307902, 10.51742195, 10.28187137,
10.12608544, 9.98977276, 9.62267727, 9.31289289, 8.96438546, 8.77077022,
8.69332413, 8.51907042, 8.30609366, 8.38353975, 8.4513867, 8.47085994,
8.50980642, 8.52927966, 8.50980642, 8.55887037, 8.51969934, 8.48052831,
8.30425867, 8.2177078, 7.98402891, 7.92560918, 7.89950166, 7.83489682,
7.75789537, 7.5984808, 7.28426807, 7.39778913, 7.71943214, 8.01149931,
8.18276652, 8.23009255, 8.16215295, 7.93822471, 8.00350215, 7.93843482,
7.85072729, 7.49778011, 7.31782649, 7.29862668, 7.60162032, 8.29665484,
8.58797834, 8.50011383, 8.86757784, 8.76600556, 8.60491125, 8.4222628,
8.24923231, 8.14470714, 8.17351638, 8.52530093, 8.72220151, 9.26745883,
9.1580007, 8.61762692, 8.22187405, 8.43693644, 8.32414835, 8.32463974,
8.46833012, 8.55865487, 8.72647164, 9.04112806, 9.35578449, 9.59465974,
10.47339785, 11.07218093, 10.54091351, 10.56138918, 10.46099958, 10.38129168,
10.16434831, 10.10612612, 10.009246, 10.53502351, 10.8307902, 11.13420052,
11.64337309, 11.18958511, 10.49630791, 10.60856932, 10.37029108, 9.86281478,
9.64699826, 9.95341012, 10.24329812, 10.6848196, 11.47604231, 11.30505352,
10.72194974, 10.30058448, 10.05022037, 10.06318411, 9.90118897, 9.68530059,
9.47790657, 9.48585784, 9.61639418, 9.86244265, 10.29009361, 10.28297229,
10.32073088, 10.65389513, 11.09656351, 11.20188562, 11.24124169, 10.40503955,
9.74632512, 9.07606098, 8.85145589, 9.37080152, 9.65082743, 10.0707891,
10.68776091, 11.25879751, 11.0416348, 10.89558456, 10.7908258, 10.66539685,
10.7297755, 10.77571398, 10.9268264, 11.16021492, 11.60961709, 11.43827534,
11.96155427, 12.16116437, 12.80412266, 12.52540805, 11.96752965, 11.58099292]
outside_humidity =
[10.17449206, 10.4823292, 11.06818167, 10.82768699, 11.27582592, 11.4196233,
10.99393027, 11.4122507, 11.18192837, 10.87247831, 10.68664321, 10.37949651,
9.57155882, 10.86611665, 11.62547196, 11.32004266, 11.75537602, 11.51292063,
11.03107569, 10.7297755, 10.4345622, 10.61271497, 9.49271162, 10.15594248,
9.99053828, 9.80915398, 9.6452438, 10.06900573, 11.18075689, 11.8289847,
11.83334752, 11.27480708, 11.14370467, 10.88149985, 10.73930381, 10.7236597,
10.26210496, 11.01260226, 11.05428228, 11.58321342, 12.70523808, 12.5181118,
11.90023799, 11.67756426, 11.28859471, 10.86878222, 9.73984486, 10.18253902,
9.80915398, 10.50980784, 11.38673459, 11.22751685, 10.94171823, 10.56484228,
10.38220753, 10.05388847, 9.96147203, 9.90698862, 9.7732203, 9.85262125,
8.7412938, 8.88281702, 8.07919545, 8.02883587, 8.32341424, 8.07357711,
7.27302616, 6.73660684, 6.66722819, 7.29408637, 7.00046542, 6.46322019,
6.07150988, 6.00207234, 5.8818402, 6.82443881, 7.20212882, 7.52167696,
7.88857771, 8.351627, 8.36547023, 8.24802846, 8.18520693, 7.92420816,
7.64926024, 7.87944972, 7.82118727, 8.02091833, 7.93071882, 7.75789457,
7.5416447, 6.94430133, 6.65907535, 6.67454591, 7.25493614, 7.76939457,
7.55357806, 6.61479472, 7.17641357, 7.24664082, 8.62732387, 8.66913548,
8.70925667, 9.0477017, 8.24558224, 8.4330502, 8.44366397, 8.17995798,
8.1875752, 9.33296518, 9.66567041, 9.88581085, 8.95449382, 8.3587624,
9.20584448, 8.90605388, 8.87494884, 9.12694892, 8.35055177, 7.91879933,
7.78867253, 8.22800878, 9.03685287, 12.49630018, 11.11819755, 10.98869374,
10.65897176, 10.36444573, 10.052609, 10.87627021, 10.07379564, 10.02233847,
9.62022856, 11.21575473, 10.85483543, 11.67324627, 11.89234248, 11.10068132,
10.06942096, 8.50405894, 8.13168561, 8.83616476, 8.35675085, 8.33616802,
8.35675085, 9.02209801, 9.5530404, 9.44738836, 10.89645958, 11.44771721,
11.79943601, 10.7765335, 11.1453622, 10.74874776, 10.55195175, 10.34494483,
9.83813522, 11.26931785, 11.20641798, 10.51555027, 10.90808954, 11.80923545,
11.68300879, 11.60313809, 7.95163365, 7.77213815, 7.54209557, 7.30603673,
7.17842173, 8.25899805, 8.56494995, 10.44245578, 11.08542758, 11.74129079,
11.67979686, 12.94362214, 11.96285343, 11.8289847, 11.01388413, 10.6793698,
11.20662595, 11.97684701, 12.46383177, 11.34178655, 12.12477078, 12.48698059,
12.89325064, 12.07470295, 12.6777319, 10.91689448, 10.7676326, 10.66710434]
I know cross-correlation is the right term to use, but after a while I still don't get the idea of using scipy.signal.correlate and numpy.correlate, because all I get is an array full of NaNs. So clearly I need more knowledge in this area.
What I expect to achieve is probably a plot like those in the answer section of this thread How to make a correlation plot with a certain lag of two time series where I can see at how many hours the time lag is most likely.
Thank you a lot in advance!
With the given data, you can use the numpy and matplotlib modules to achieve the desired result.
So you can do something like this:
import numpy as np
from matplotlib import pyplot as plt
x = np.array(inside_humidity)
y = np.array(outside_humidity)
fig = plt.figure()
# fit a curve of your choice
a, b = np.polyfit(inside_humidity, outside_humidity, 1)
y_fit = a * x + b
# scatter plot, and fitted plot (best fit used)
plt.scatter(inside_humidity, outside_humidity)
plt.plot(x, y_fit)
plt.show()
which gives this:
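Note that this fit shows the overall relationship between the two series, not the lag itself. As a minimal sketch (not part of the answer above, assuming hourly samples), the lag can be estimated with plain NumPy by cross-correlating the mean-subtracted signals:
import numpy as np
x = np.asarray(inside_humidity) - np.mean(inside_humidity)
y = np.asarray(outside_humidity) - np.mean(outside_humidity)
corr = np.correlate(x, y, mode="full")   # cross-correlation at every lag
lags = np.arange(-len(y) + 1, len(x))    # lag (in hours) for each entry of corr
print("estimated lag:", lags[np.argmax(corr)], "hours")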

Making a plot with an x-axis that has negative values representing hours prior to the start of an event and positive values representing hours after

I'm not sure if my question makes sense, so apologies on that.
Basically, I am plotting some data that is ~100 hours long. On the x-axis, I want the range to go from -50 to 50, with -50 to -1 representing the 50 hours prior to the event, 0 in the middle representing the start of the event, and 1 to 50 representing the 50 hours following the start. There are 107 hours' worth of data, and I want to divide the hours between the two sides of 0.
I initially tried using the plt.xlim() function, but that just shifts all the data to one side of the plot.
I've tried using plt.xticks and then labeling the x ticks with "-50", "-25", "0", "25", and "50", and while that somewhat works, it still does not look great. I'll add an example figure of doing it this way to add better clarification of what I'm trying to do, as well as the original plot:
Original plot:
Goal:
edit
Here's my code for plotting it:
fig_1 = plt.figure(figsize=(30,20))
file.plot(x='start',y='value')
plt.xlabel('hour')
plt.ylabel('value')
plt.xticks([0,25,50,75,100],["-50","-25","0","25","50"])
You could obtain zero-centered ticks by subtracting the mean, using df.sub(df.mean()) or np.mean().
Alternative 1:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# generate data
left = np.linspace(10,60, 54)
right = np.linspace(60,10, 53)
noise_left = np.random.normal(0, 1, 54)
noise_right = np.random.normal(0, 1, 53)
data = np.append(left + noise_left, right + noise_right)  # avoid shadowing the built-in all()
file = pd.DataFrame({'start': np.linspace(1, 107, 107), 'value': data})
# subtract mean
file['start'] = file['start'].sub(file['start'].mean())
fig_1 = plt.figure(figsize=(30,20))
file.plot(x='start',y='value')
plt.xlabel('hour')
plt.ylabel('value')
Output:
Alternative 2:
# subtract the mean from start to obtain zero mean ticks
ticks = file['start'] - np.mean(file['start'])
# set distance between each tick to 10
plt.xticks(file['start'][::10], ticks[::10], rotation=45)
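A variation on the same idea (a sketch, assuming the event starts at a known hour rather than at the midpoint of the data):
# Subtract the known event-start hour from the original 'start' column,
# so that 0 falls exactly on the start of the event.
event_start_hour = 50  # hypothetical value; use your actual event start
file['start'] = file['start'] - event_start_hour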

How to set a seaborn color map in an arbitrary range?

I am creating a heatmap for the correlations between items.
sns.heatmap(df_corr, fmt=".2g", cmap='vlag', cbar='True', annot = True)
I chose vlag as it has red for high values, blue for low values, and white in the middle.
Seaborn automatically sets red for the highest value and blue for the lowest value in the dataframe.
However, as I am tracking Pearson's correlation, the value range is between -1 and 1, so I would like 1 to be represented by red, -1 by blue, and 0 by white.
What the result looks like:
What it should look like*:
*(Of course this was generated by "cheating" - setting -1 as value(s) to force the range to be from -1 to 1; I want to set this range without warping my data)
Use vmin=-1 and vmax=1:
import numpy as np
import seaborn as sn
import matplotlib.pyplot as plt
data = np.random.uniform(low=-0.5, high=0.5, size=(5,5))
hm = sn.heatmap(data = data, cmap= 'vlag', annot = True, vmin=-1, vmax=1)
plt.show()
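As an aside (not part of the answer above), sns.heatmap also accepts a center argument, which pins the colormap midpoint (white in 'vlag') to a given value without fixing the endpoints:
# Keeps white at 0 even if the data range is asymmetric
hm = sn.heatmap(data=data, cmap='vlag', annot=True, center=0)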
Here is an unorthodox solution: you can rescale your data to the range [-1, 1]. Even though the theoretical range of the Pearson coefficient is [-1, 1], strong negative correlations are not present in your dataset.
So you can create another dataframe in which the data's max is 1 and its min is -1, and then plot that dataframe to get the desired effect. The advantage of this procedure is that it generalizes to pretty much any dataframe (not verified, though).
Here is the code:
import pandas as pd
import numpy as np
# Target scale for the data
scale_minimum = -1
scale_maximum = 1
scale_range = scale_maximum - scale_minimum
# Applying the scaling
df_minimum, df_maximum = df.min(), df.max()  # range of the current dataframe
df_range = df_maximum - df_minimum  # the range of the data
df = (df - df_minimum) / df_range  # scale to [0, 1]
df_scaled = df * scale_range + scale_minimum  # scale to [-1, 1]
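To plot the rescaled frame (a one-line sketch, assuming seaborn is imported as sns as in the question):
sns.heatmap(df_scaled, fmt=".2g", cmap='vlag', cbar=True, annot=True)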
Hope this solves your problem.

Simulate the compound random variable S

Let S = X_1 + X_2 + ... + X_N, where N is a nonnegative integer-valued random variable and X_1, X_2, ... are i.i.d. random variables. (If N = 0, we set S = 0.)
Simulate S in the case where N ~ Poi(100) and X_i ~ Exp(0.5) (draw histograms and use the numpy or scipy built-in functions), and check the equations E(S) = E(N)*E(X_1) and Var(S) = E(N)*Var(X_1) + E(X_1)^2 * Var(N).
I was trying to solve it, but I'm not sure about everything yet, and I also got stuck on the histogram part. Note: I'm new to Python and, more generally, new to programming.
My work:
import scipy.stats as stats
import matplotlib as plt
N = stats.poisson(100)
X = stats.expon(0.5)
arr = X.rvs(N.rvs())
S = 0
for i in arr:
    S=S+i
print(arr)
print("S=",S)
expected_S = (N.mean())*(X.mean())
variance_S = (N.mean()*X.var()) + (X.mean()*X.mean()*N.var())
print("E(X)=",expected_S)
print("Var(S)=",variance_S)
Your existing code mostly looks sensible, but I'd simplify:
arr = X.rvs(N.rvs())
S = 0
for i in arr:
    S=S+i
down to:
S = X.rvs(N.rvs()).sum()
To draw a histogram, you need many samples from this distribution, which is now easily accomplished via:
arr = []
for _ in range(10_000):
    arr.append(X.rvs(N.rvs()).sum())
or, equivalently, using a list comprehension:
arr = [X.rvs(N.rvs()).sum() for _ in range(10_000)]
To plot these in a histogram, you need the pyplot module from Matplotlib, so your import should be:
from matplotlib import pyplot as plt
plt.hist(arr, 50)
The 50 above says to use that number of "bins" when drawing the histogram. We can also compare these to the mean and variance you calculated by assuming the distribution is well approximated by a normal:
import numpy as np
approx = stats.norm(expected_S, np.sqrt(variance_S))
_, x, _ = plt.hist(arr, 50, density=True)
plt.plot(x, approx.pdf(x))
This works because the second value returned by matplotlib's hist method is the array of bin edges. I used density=True so I could work with probability densities, but another option would be to multiply the densities by the number of samples to get expected counts, like the previous histogram.
Running this gives me:
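As a final sanity check (my addition, not in the original answer), the sample moments of arr can be compared against the theoretical values computed in the question:
import numpy as np
print("sample mean:", np.mean(arr), "vs. theoretical E(S):", expected_S)
print("sample variance:", np.var(arr), "vs. theoretical Var(S):", variance_S)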

How to split dataframe according to intersection point in Python?

I am working on a project that aims to show the difference between good form and bad form in an exercise. To do this, we collected acceleration data with a wrist-based accelerometer. The image above shows 2 sets of a fitness exercise (bench press); each set has 10 repetitions, and the image below shows the 10 repetitions of 1 set. I have a raw dataset consisting of 10 sets of an exercise. What I want to do is split the raw data into 10 parts, each containing the part between the 2 black lines in the image above, so I can analyze the data easily. My supervisor gave me a starting point: choose a cutpoint in each set, find the first interruption time, start cutting 3 seconds before that time, and count to 10 to finish cutting.
This is an idea that I don't know how to apply. At least, if you can tell me how to cut a dataframe at a cutpoint, I would be grateful.
Well, I found another way to detect the periodic parts of my accelerometer data. Here is my code:
import numpy as np
from peakdetect import peakdetect
import datetime as dt
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from matplotlib import style
import pandas as pd
style.use('ggplot')
def get_periodic(path):
    periodics = []
    # pd.read_csv replaces the long-deprecated DataFrame.from_csv
    data_frame = pd.read_csv(path, index_col=0, parse_dates=True)
    data_frame.columns = ['z', 'y', 'x']
    if '1' in path:
        if 'bench' in path:
            bench_press_1_week = data_frame.between_time('11:24', '11:52')
            peak_indexes = get_peaks(bench_press_1_week.y, lookahead=3000)
            time_indexes = bench_press_1_week.index.tolist()
            start_time = time_indexes[0]
            for i in range(0, len(peak_indexes)):
                # Each peak index is a sample number; at 100 Hz, dividing by 100 gives seconds
                periodic_start = start_time + dt.timedelta(0, peak_indexes[i] / 100)
                periodic_end = periodic_start + dt.timedelta(0, 60)
                periodic = bench_press_1_week.between_time(periodic_start.time(), periodic_end.time())
                periodics.append(periodic)
    return periodics
def get_peaks(data, lookahead):
    peak_indexes = []
    # Autocorrelate the signal and keep only the non-negative lags
    correlation = np.correlate(data, data, mode='full')
    realcorr = correlation[correlation.size // 2:]  # integer division: indices must be ints
    maxpeaks, minpeaks = peakdetect(realcorr, lookahead=lookahead)
    for i in range(0, len(maxpeaks)):
        peak_indexes.append(maxpeaks[i][0])
    return peak_indexes
def show_segment_plot(data, periodic_area, exercise_name):
    plt.figure(8)
    gs = gridspec.GridSpec(7, 2)
    ax = plt.subplot(gs[:2, :])
    plt.title(exercise_name)
    ax.plot(data)
    k = 0
    for i in range(2, 7):
        for j in range(0, 2):
            ax = plt.subplot(gs[i, j])
            title = "{}. Set".format(k + 1)
            plt.title(title)
            ax.plot(periodic_area[k])
            k = k + 1
    plt.show()
Firstly, this question gave me another perspective on my problem. The image below shows the raw accelerometer data of the bench press with 10 sets. It has 3 axes (x, y, z), and its major axis is y (blue in the image).
I used the autocorrelation function to detect the periodic parts; in the image above, every peak represents 1 set of the exercise. With this peak detection algorithm I found each peak's x-axis value:
In[196]: maxpeaks
Out[196]:
[[16204, 32910.14013671875],
[32281, 28726.95849609375],
[48515, 24583.898681640625],
[64436, 22088.130859375],
[80335, 19582.248291015625],
[96699, 16436.567626953125],
[113081, 12100.027587890625],
[129027, 8098.98486328125],
[145184, 5387.788818359375]]
Basically, each x-value represents a sample index. My sampling frequency was 100 Hz, so 16204/100 = 162.04 seconds. To find the time of a periodic part, I added 162.04 s to the start time. Each bench press set took approximately 1 min; in this example the exercise's start time was 11:24, so the first periodic part starts at 11:26 and ends 1 min later. There is some lag, but this is the best solution I found.
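The index-to-time conversion described above can also be written directly (a small sketch; the date and the 11:24 start time are placeholders):
import datetime as dt
fs = 100  # sampling frequency in Hz
start = dt.datetime(2017, 1, 1, 11, 24)  # hypothetical date, 11:24 start as above
peak_times = [start + dt.timedelta(seconds=idx / fs) for idx, _ in maxpeaks]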
