How to create a plot that summarizes multiple daily profiles? - python

I am a type 1 diabetic and wear a continuous glucose monitor that measures my blood glucose levels every 5 minutes. The company that makes the CGM generates a report with a graph that looks like the figure at the bottom of this post. My goal is to learn how to recreate this graph for myself in a Jupyter notebook.
The data that I have, for example, looks like this:
Timestamp              Glucose Value (mg/dL)
2021-07-11 00:11:25    116.0
2021-07-11 00:16:25    118.0
2021-07-11 00:21:25    121.0
2021-07-11 00:26:24    123.0
2021-07-11 00:31:25    124.0
The graph uses data from a 30-day period and summarizes the distribution of values at each point in time. Is there a name for this type of graph, and how can I create it myself using Pandas/matplotlib/seaborn?
So far, I have tried creating a graph with the IQR split by day, which is rather easy using Plotly:
import plotly.express as px

glucose['Day'] = glucose['Timestamp'].dt.day_name()
fig = px.box(glucose, x="Day", y="Glucose Value (mg/dL)",
             points="all", color='Day')
fig.show()
But now I am unsure how to easily calculate the IQR for specific time periods and average them.
Thank you so much for your help!

Answering my own question with help from the links that Joe provided in the comments:
I was able to group the dataframe by hour, then use .quantile to generate a new dataframe with rows as hours and columns for the 10%, 25%, 50%, 75%, and 90% quantiles. From there it was a matter of simple formatting with matplotlib to copy the original (a sketch of that step is included below).
# group readings by hour of day and compute the percentile bands per hour
grouped = df.groupby([df['Timestamp'].dt.hour])
i = grouped['bgl'].quantile([.1, .25, .5, .75, .9]).unstack()
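Here is a minimal sketch of that matplotlib step (the exact styling is up to you; i is the unstacked quantile frame from above, indexed by hour with the quantiles as columns):

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 4))

# shade the 10-90% and 25-75% bands, then draw the median on top
ax.fill_between(i.index, i[0.10], i[0.90], alpha=0.2, label='10-90%')
ax.fill_between(i.index, i[0.25], i[0.75], alpha=0.4, label='25-75%')
ax.plot(i.index, i[0.50], label='median')

ax.set_xlabel('Hour of day')
ax.set_ylabel('Glucose Value (mg/dL)')
ax.legend()
plt.show()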
Thanks a lot Joe!

Related

groupby by multiple variables using Altair

I have a dataset that contains timestamps roughly every 15 minutes for several years, weather stations (two stations), solar zenith angle (sza), and a categorical column that contains values such as TN, TP, FN, FP.
time                 station  sza         Hit/Miss
2016-09-01 00:15:00  LFPG     122.520350  FN
2016-09-01 00:30:00  LFPO     119.658256  TP
and so on.
I would like to make a plot where I can see how many of each category of Hit/Miss belong to which sza bin every month during the study period.
This is what I have tried so far:
alt.Chart(df_paris).mark_rect().encode(
    x=alt.X('month(time):O', title=None),
    y=alt.Y('year(time):O', title=None),
    color=alt.Color('sza', bin=True),
    row=alt.Row('station', title=None),
    column=alt.Column('Hit/Miss:N', title=None),
).resolve_axis(x='independent')
And the result looks like this:
Does each pixel represent a monthly mean sza of each categorical value? Because what I want is the exact number of every category that belongs to a certain sza bin every month and every year. Essentially, I want to find out if there's any correlation between the frequency of Hit/Miss values and sza depending on the time of year.
I have also tried this:
alt.Chart(df_paris).mark_bar().encode(
    x=alt.X('year(time):O', title=None),
    y=alt.Y('count(Hit/Miss):Q'),
    color=alt.Color('Hit/Miss', legend=alt.Legend(title=None),
                    scale=alt.Scale(range=['#5c8fff', '#ffcc5c', '#96ceb4'])),
    column=alt.Column('month(time)', title='Monthly Aggregates 2016-2022'),
    row='station'
).resolve_axis(x='independent')
which gives me:
but with that I can't see the sza distribution.
I read about aggregate and groupby in altair docs, but I'm still very much lost.
I'm pretty new at statistical analysis and python altogether, and would welcome any learning opportunities if you'll have any feedback for me. Thanks.
Does each pixel represent a monthly mean sza of each categorical value?
To represent an aggregate value like a mean, you need to use an aggregation function. In this case you should write alt.Color('mean(sza)') (and optionally add the color binning if that is important for your application; I am not sure I see the purpose).
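A sketch of that change, keeping the field names from your question:

import altair as alt

alt.Chart(df_paris).mark_rect().encode(
    x=alt.X('month(time):O', title=None),
    y=alt.Y('year(time):O', title=None),
    color=alt.Color('mean(sza):Q'),   # aggregate: monthly mean of sza per cell
    row=alt.Row('station', title=None),
    column=alt.Column('Hit/Miss:N', title=None),
)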
If you don't add an aggregation function, there will be one marker per row in the dataframe. Since you are using mark_rect, the markers will be stacked on top of each other and you can only see the last (topmost) one. Here is an example of how marks are plotted on top of each other (using mark_square so that different sizes can more clearly illustrate what is going on):
import pandas as pd
import altair as alt

df = pd.DataFrame({
    'x': [0, 0, 0],
    'y': [0, 0, 0],
    'c': [1, 2, 3],
    's': [3, 2, 1],
})

alt.Chart(df).mark_square(opacity=1).encode(
    x='x:O',
    y='y:O',
    color='c:O',
    size='s:O'
)
Reversing the order of the dataframe will cause the largest square to be on top which covers the others:
alt.Chart(df.sort_values('s')).mark_square(opacity=1).encode(
    x='x:O',
    y='y:O',
    color='c:O',
    size='s:O'
)

python percentile centered in a 15 day window

I'm fairly new to Python functions and to Python in general (and to xarray for large data sets).
Is there a function that can calculate the 90th percentile centered on a 15-day window for a long time series over a domain?
For example, I have 62 years of daily maximum temperature during the summer season over North America (I merged all the JJA days together with cdo). I want to calculate the 90th daily percentile of tmax over the period, based on a centered 15-day window (in order to identify heat waves).
I was wondering if something exists that can calculate this directly for a long time series, like np.percentile? I was thinking np.percentile might have something like np.roll, but I'm not sure how to use it properly.
Thanks in advance!
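One possible sketch using xarray's rolling window (this assumes a daily DataArray named tmax with a time dimension; the names and dimensions are illustrative, not from the question):

import xarray as xr

# collect a centered 15-day window of values along time
# (caveat: across the concatenated JJA blocks the window will briefly mix
#  late August of one year with early June of the next)
windowed = tmax.rolling(time=15, center=True, min_periods=1).construct("window")

# pool the windows by calendar day across all 62 years and take the 90th percentile
p90 = windowed.groupby("time.dayofyear").quantile(0.9, dim=("time", "window"))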

How to loop to create plots for each hour of day of geospatial data?

I am trying to create an animated map (by generating multiple plots) of road traffic throughout a week, where the thickness of roads is represented by the volume of traffic at a specific time of day.
This is sort of what I'm looking for (but for each hour of each day):
The data has a structure that looks like this:
HMGNS_LNK_ID  geometry                  DOW  Hour  Normalised Value
2             MULTILINESTRING ((251...  1    0     0.233623
2             MULTILINESTRING ((251...  1    1     0.136391
2             MULTILINESTRING ((251...  1    2     0.108916
DOW stands for 'day of the week' (1 = Monday), so for every Hour of each of the 7 days I want to plot the map with road thickness given by Normalised Value.
I encounter a problem when trying to loop with this code:
for dow in df['DOW']:
    fig, ax = plt.subplots(1)
    day_df = df[df['DOW']==dow]
    for hour in day_df['Hour']:
        day_hour_df = day_df[day_df['Hour']==hour]
        day_hour_df.plot(ax=ax, linewidth=day_hour_df['Normalised Value'])
        plt.savefig("day{}_hour{}.png".format(dow, hour), dpi=200, facecolor='#333333')
The problem is that the figures are saved only for day 1, up to day1_hour23, and after that it goes back to day1_hour0 and overwrites the plots with something new. I can't figure out why it never moves on to DOW 2.
I'm not even sure if the data structure is correct. I would greatly appreciate any help with that. Please find the full code in my repo.
Cheers!
The problem is with the way you loop and subset df. Let's go through the loop in detail. Iterating for dow in df['DOW'] yields one value per row, so the outer loop runs once for every row of the dataframe, not once per day. The first time through the outer loop, dow is 1 and day_df = df[df['DOW']==dow] selects all rows with 1 in the column DOW. The inner loop then goes through the selected rows and creates day1_hour0 to day1_hour23. Inner loop done, great.
Now we enter the outer loop a second time and, because the next row also has 1 in DOW, dow is again 1. day_df = df[df['DOW']==dow] selects the same set of rows that it used the previous time through the outer loop. So it (re)writes day1_hour0 to day1_hour23 again.
I would suggest using (geo)pandas.groupby:
for dow, day_gdf in df.groupby("DOW"):
    for hour, day_hour_gdf in day_gdf.groupby("Hour"):
        fig, ax = plt.subplots(1)
        print(f"Doing dow={dow}, hour={hour}")
        day_hour_gdf.plot(ax=ax, linewidth=day_hour_gdf['Normalised Value'])
        plt.savefig("day{}_hour{}.png".format(dow, hour), dpi=200, facecolor='#333333')
        plt.close()
Bonus Tip: Check out pandas-bokeh if you want to generate interactive graphs with background tiles that can be saved as HTML or embedded in Jupyter notebooks. The learning curve can be a bit steep with bokeh, but you can produce really nice interactive plots.
Cheers!

Cleaning up x-axis because there are too many datapoints

I have a data set that is like this
Date    Time     Cash
1/1/20  12:00pm  2
1/1/20  12:02pm  15
1/1/20  12:03pm  20
1/1/20  15:06pm  30
2/1/20  11:28am  5
...     ...      ...
3/1/20  15:00pm  3
I basically grouped the data by date (one facet per date), with time along the x-axis and Cash on the y-axis, and plotted a FacetGrid as shown below:
df_new = df[:300]
g = sns.FacetGrid(df_new.groupby(['Date', 'Time']).Cash.sum().reset_index(),
                  col="Date", col_wrap=3)
g = g.map(plt.plot, "Time", "Cash", marker=".")
g.set_xticklabels(rotation=45)
What I got back was hideous (as shown below). So I'm wondering: is there any way to tidy up the x-axis? Maybe having only 5-10 time labels so the times are visible, or maybe expanding the image?
Edit: I am plotting using seaborn. I want it to look something like the plot below, where the x-axis has only a couple of labels:
Thanks for your inputs.
Have you tried using a moving average instead of the actual data? You can compute the moving average of any data with the following function:
import numpy as np

def moving_average(a, n=10):
    # cumulative-sum trick: the difference of cumulative sums n apart
    # gives the sum of each length-n window
    ret = np.cumsum(a, dtype=float)
    ret[n:] = ret[n:] - ret[:-n]
    return ret[n - 1:] / n
Set n to the window size you need; you can play around with that value. a is, in your case, the Cash column represented as a NumPy array.
After that, replace the Cash column with the moving average computed from the real values and plot it. The curve will be smoother.
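For example, applied to your Cash column (a sketch; the output is n-1 values shorter than the input, so trim the dataframe to match):

import numpy as np

n = 10
cash = df['Cash'].to_numpy()

df_smooth = df.iloc[n - 1:].copy()          # drop the rows consumed by the window
df_smooth['Cash'] = moving_average(cash, n=n)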
P.S. the plot of suicides you added in the edit is really unreadable, as the range of the y-axis is far larger than needed. In practice, try to avoid such plots.
Edit
I did not notice at first how you aggregate the data; you might want to work with date and time merged. I do not know where you load the data from. If you load it from a CSV, you can add this to the read_csv call: parse_dates=[['Date', 'Time']]. If not, you can build the column on the dataframe:
df['datetime'] = pd.to_datetime(df['Date'] + ' ' + df['Time'])
you create a new column with datetime and can work simply with that.
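For the read_csv route mentioned above, a minimal example (the file name is hypothetical):

import pandas as pd

# merges the Date and Time columns into a single parsed Date_Time column
df = pd.read_csv('data.csv', parse_dates=[['Date', 'Time']])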

Plotting time of day and location coordinates

I have to plot data that have DateTime and GPS coordinate information. As an example:
2013-03-01 19:55:00 45.4565 65.6783
2013-03-01 01:40:00 46.3121 -12.3456
2013-03-02 11:25:00 23.1234 -85.3456
2013-03-05 05:00:00 15.4565 32.1234
......
This is just a random example matching the type of data I have. The whole data set is for a week and the timestamps are rounded to the nearest 5 minutes.
What I would like to do in Python is to visualize this data for location patterns over each 24-hour period for the entire week. So the x-axis would have time of day. I am struggling to think about how the location would be shown. Maybe a 3D plot is needed.
It would show each day in a different color and also one more for an average of the entire week (i.e. the whole week averaged into a 24 hour period).
Any idea how one would go about visualizing this using python and matplotlib?
Note that I cannot plot the locations on an actual map for now, just as (x, y) coordinates.
Try using Folium's HeatMap with Time.
First you define a function to generate the map.
import folium

def generateBaseMap(default_location, default_zoom_start=12):
    base_map = folium.Map(location=default_location, control_scale=True, zoom_start=default_zoom_start)
    return base_map
Then you collect the latitudes and longitudes in a list ordered by date.
date_list = []
for date in df.date.sort_values().unique():
    date_list.append(df.loc[df.date == date, ['lat', 'lng']].values.tolist())
Then you plot the heat map with time:
from folium.plugins import HeatMapWithTime

# lat, longitude: the centre point of your data
base_map = generateBaseMap(default_zoom_start=11, default_location=[lat, longitude])
HeatMapWithTime(date_list, radius=5,
                gradient={0.2: 'blue', 0.4: 'lime', 0.6: 'orange', 1: 'red'},
                min_opacity=0.5, max_opacity=0.8,
                use_local_extrema=True).add_to(base_map)
base_map
Hope this helps.
