groupby by multiple variables using Altair - python

I have a dataset that contains timestamps roughly every 15 minutes for several years, weather stations (two stations), solar zenith angle (sza), and a categorical column that contains values such as TN, TP, FN, FP.
time                 station  sza         Hit/Miss
2016-09-01 00:15:00  LFPG     122.520350  FN
2016-09-01 00:30:00  LFPO     119.658256  TP
and so on.
I would like to make a plot where I can see how many of each category of Hit/Miss belong to which sza bin every month during the study period.
This is what I have tried so far:
alt.Chart(df_paris).mark_rect().encode(
    x=alt.X('month(time):O', title=None),
    y=alt.Y('year(time):O', title=None),
    color=alt.Color('sza', bin=True),
    row=alt.Row('station', title=None),
    column=alt.Column('Hit/Miss:N', title=None),
).resolve_axis(x='independent')
And the result looks like this:
Does each pixel represent a monthly mean sza of each categorical value? Because what I want is the exact number of every category that belongs to a certain sza bin every month and every year. Essentially, I want to find out if there's any correlation between the frequency of Hit/Miss values and sza depending on the time of year.
I have also tried this:
alt.Chart(df_paris).mark_bar().encode(
    x=alt.X('year(time):O', title=None),
    y=alt.Y('count(Hit/Miss):Q'),
    color=alt.Color('Hit/Miss', legend=alt.Legend(title=None),
                    scale=alt.Scale(range=['#5c8fff', '#ffcc5c', '#96ceb4'])),
    column=alt.Column('month(time)', title='Monthly Aggregates 2016-2022'),
    row='station'
).resolve_axis(x='independent')
which gives me:
but with that I can't see the sza distribution.
I read about aggregate and groupby in altair docs, but I'm still very much lost.
I'm pretty new at statistical analysis and python altogether, and would welcome any learning opportunities if you'll have any feedback for me. Thanks.

Does each pixel represent a monthly mean sza of each categorical value?
To represent an aggregate value like a mean, you need to use an aggregation function. In this case you should write alt.Color('mean(sza)') (and optionally add the color binning if that is important for your application, I am not sure I see the purpose).
If you don't add an aggregation function, there will be one marker per row in the dataframe. Since you are using mark_rect, the markers will be stacked on top of each other and you can only see the last (topmost) one. Here is an example of how marks are plotted on top of each other (using mark_square to allow for different sizes, which illustrates more clearly what is going on).
import pandas as pd
import altair as alt

df = pd.DataFrame({
    'x': [0, 0, 0],
    'y': [0, 0, 0],
    'c': [1, 2, 3],
    's': [3, 2, 1],
})
alt.Chart(df).mark_square(opacity=1).encode(
    x='x:O',
    y='y:O',
    color='c:O',
    size='s:O'
)
Reversing the order of the dataframe will cause the largest square to be on top which covers the others:
alt.Chart(df.sort_values('s')).mark_square(opacity=1).encode(
    x='x:O',
    y='y:O',
    color='c:O',
    size='s:O'
)
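For the counts the question asks for (how many rows of each Hit/Miss category fall into each sza bin per month and year), one option is to compute them in pandas before plotting. The following is only a minimal sketch on made-up data: the column names follow the question, while the 30-degree bin edges and the sza_bin column name are assumptions chosen for illustration.

```python
import pandas as pd

# Small made-up sample with the same columns as the question's df_paris.
df_paris = pd.DataFrame({
    'time': pd.to_datetime([
        '2016-09-01 00:15', '2016-09-01 00:30', '2016-09-01 12:00',
        '2016-09-01 12:15', '2016-10-01 12:00', '2017-09-01 12:15',
    ]),
    'station': ['LFPG', 'LFPO', 'LFPG', 'LFPG', 'LFPG', 'LFPO'],
    'sza': [122.52, 119.66, 40.10, 41.50, 52.30, 41.00],
    'Hit/Miss': ['FN', 'TP', 'TP', 'TP', 'TN', 'TP'],
})

# Assumed bin edges: 30-degree-wide sza bins from 0 to 180.
df_paris['sza_bin'] = pd.cut(df_paris['sza'], bins=[0, 30, 60, 90, 120, 150, 180])

# Count rows per (year, month, station, sza bin, category).
counts = (
    df_paris
    .groupby([
        df_paris['time'].dt.year.rename('year'),
        df_paris['time'].dt.month.rename('month'),
        'station', 'sza_bin', 'Hit/Miss',
    ], observed=True)
    .size()
    .reset_index(name='count')
)
print(counts)
```

The resulting long-format table has one row per observed combination and could be passed to alt.Chart in place of the raw dataframe.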

Related

bin value of histograms from grouped data

I am a beginner in Python and I am making separate histograms of travel distance per departure hour. The data I'm using has about 2500 rows; Distance is float64 and Departuretime is str. However, for further calculations I'd like to have the value of each bin in a histogram, for all histograms.
Up until now, I have the following:
df['Distance'].hist(by=df['Departuretime'], color='red',
                    edgecolor='black', figsize=(15, 15), sharex=True, density=True)
This creates in my case a figure with 21 small histograms. Histogram output I'm receiving.
Of all these histograms I want to know the y-axis value of each bar, preferably in a dataframe with the distance binning as rows and the hours as columns.
With single histograms, I'd paste counts, bins, bars = in front of the entire line and the variable counts would contain the data I was looking for, however, in this case it does not work.
Ideally I'd like a dataframe or list of some sort for each histogram, containing the density values of the bins. I hope someone can help me out! Big thanks in advance!
First of all, note that the histograms you are generating don't share the same bin edges (you can see this because you are using sharex=True and the resulting bars don't have the same width). In every case you get 10 bins (the default), but they are not the same 10 bins.
This makes it impossible to combine them all in a single table in any meaningful way. You could provide a fixed list of bin edges as the bins parameter to standardize this.
Alternatively, I suggest you calculate a new column that records which bin each row belongs to; this also unifies the bin calculation.
You can do this with the cut function, which also gives you the same freedom to choose the number of bins or the specific bin edges the same way as with hist.
df['DistanceBin'] = pd.cut(df['Distance'], bins=10)
Then, you can use pivot_table to obtain a table with the counts for each combination of DistanceBin and Departuretime as rows and columns respectively as you asked.
df.pivot_table(index='DistanceBin', columns='Departuretime', values='Distance', aggfunc='count')
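As a sketch of the steps above on made-up data: fixed bin edges shared by all hours, a count table from pivot_table, and a conversion of the counts to density values comparable to what hist(density=True) draws. The sample values, the edges, and the variable names are assumptions; substitute your own.

```python
import numpy as np
import pandas as pd

# Made-up sample: distances for two departure hours.
df = pd.DataFrame({
    'Distance': [1.0, 2.5, 3.0, 7.5, 9.0, 0.5, 4.0, 4.5, 6.0, 8.0],
    'Departuretime': ['07'] * 5 + ['08'] * 5,
})

# Assumed shared bin edges: 5 equal-width bins over [0, 10].
edges = np.linspace(0, 10, 6)
# include_lowest=True keeps a value equal to the first edge inside the first bin.
df['DistanceBin'] = pd.cut(df['Distance'], bins=edges, include_lowest=True)

# Counts per (bin, hour); empty combinations become 0.
counts = df.pivot_table(index='DistanceBin', columns='Departuretime',
                        values='Distance', aggfunc='count',
                        fill_value=0, observed=False)

# Density = count / (total count in that hour * bin width).
bin_width = np.diff(edges)[0]
density = counts / (counts.sum(axis=0) * bin_width)
print(density)
```

Because every hour uses the same edges, each column of density integrates to 1 over the same bins and the columns can be compared directly.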

How to compress a time series plot, after taking out specific month from a dataset in python?

I have a dataset of OLR from 1986-2013 (daily data), and I am interested in plotting a time series which should have only boreal winter months i.e. from November to April.
(i) I am able to sort out Nov-Apr months from my datasets by using -
OLRNA = OLR.sel(TIME = OLR.TIME.dt.month.isin([11,12,1,2,3,4]))
and this is working.
(ii) But the problem is that whenever I plot the time series, it is not continuous, i.e. Nov-Apr of each year are not joined (there are gaps for the remaining months). I know this is because I have selected only the Nov-Apr months in my data. How can I join or compress the time axis?
how to plot this time series properly?
Instead, first mask the months you do not want to plot and then remove these masked rows by applying dropna
OLRNA = OLR.mask(OLR.Time.dt.month.isin([5, 6, 7, 8, 9, 10]))
OLRNA = OLRNA.dropna()
I tried to solve this issue and got a proper plot, so I am answering my own question.
After selecting the specific months from the time series, plot the series without 'time' on the x-axis: plot only the y-axis variable and let the x-axis denote serial numbers. Then, with the help of matplotlib, set the xticks and xticklabels manually wherever you want.
Thank you. Especially Bruno Vermeulen sir for the cooperation.
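A minimal sketch of that approach, using a made-up pandas Series in place of the OLR data (the original is an xarray object; the variable names, the synthetic values, the output file name, and the choice of one tick per 1 November are all assumptions):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up daily series standing in for the OLR data.
time = pd.date_range('1986-01-01', '1989-12-31', freq='D')
olr = pd.Series(np.sin(np.arange(len(time)) / 30.0), index=time)

# Keep only November to April.
olr_na = olr[olr.index.month.isin([11, 12, 1, 2, 3, 4])]

# Plot against position 0..N-1 instead of time, so the gaps disappear.
pos = np.arange(len(olr_na))
fig, ax = plt.subplots()
ax.plot(pos, olr_na.to_numpy())

# Put one tick at each 1 November and label it with the real date.
is_nov1 = (olr_na.index.month == 11) & (olr_na.index.day == 1)
tick_pos = pos[is_nov1]
ax.set_xticks(tick_pos)
ax.set_xticklabels(olr_na.index[is_nov1].strftime('%b %Y'))
fig.savefig('olr_winter.png')
```

The x-axis is then purely positional, so the spacing between ticks no longer reflects the removed months.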

Hours and minutes as labels in Altair plot spanning more than one day

I'm trying to create in Altair a Vega-Lite specification of a plot of a time series whose time range spans a few days. Since in my case, it will be clear which day is which, I want to reduce noise in my axis labels by letting labels be of the form '%H:%M', even if this causes labels to be non-distinct.
Here's some example data; my actual data has a five minute resolution, but I imagine that won't matter too much here:
import altair as alt
import numpy as np
import pandas as pd
# Create data spanning 30 hours, or just over one full day
df = pd.DataFrame({'time': pd.date_range('2018-01-01', periods=30, freq='H'),
'data': np.arange(30)**.5})
By using the otherwise trivial yearmonthdatehoursminutes transform, I get the following:
alt.Chart(df).mark_line().encode(x='yearmonthdatehoursminutes(time):T',
y='data:Q')
Now, my goal is to get rid of the dates in the labels on the horizontal axis, so they become something like ['00:00', '03:00', ..., '21:00', '00:00', '03:00'], or whatever spacing works best.
The naive approach of just using hoursminutes as a transform won't work, as that bins the actual data:
alt.Chart(df).mark_line().encode(x='hoursminutes(time):T', y='data:Q')
So, is there a declarative way of doing this? Ultimately, the visualization will be making use of selections to define the horizontal axis limits, so specifying the labels explicitly using Axis does not seem appealing.
To expand on @fuglede's answer, there are two distinct concepts at play with dates and times in Altair.
Time formats let you specify how times are displayed on an axis; they look like this:
chart.encode(
x=alt.X('time:T', axis=alt.Axis(format='%H:%M'))
)
Altair uses format codes from d3-time-format.
Time units let you specify how data will be grouped, and they also adjust the default time format to match. They look something like this:
chart.encode(
x=alt.X('time:T', timeUnit='hoursminutes')
)
or via the shorthand:
chart.encode(
x='hoursminutes(time):T'
)
Available time units are listed here.
If you want to adjust axis formats only, use time formats. If you want to group based on timespans (i.e. group data by year, by month, by hour, etc.) then use a time unit. Examples of this appear in the Altair documentation, e.g. the Seattle Weather Heatmap in Altair's example gallery.
This can actually easily be achieved by specifying format in Axis:
alt.Chart(df).mark_line().encode(x=alt.X('time:T', axis=alt.Axis(format='%H:%M')), y='data:Q')

Pandas histogram of dates with empty bins

My use case is very similar to this post, but my data is not continuous through each bin. I'm attempting to create multiple figures over the same time span to show activity (or lack thereof) over 18 months. I thought I hit the jackpot with the df.groupby(df.date.month).count() approach, but since my data is irregular I get different bins per dataset.
My question, then, is how would I go about creating some kind of master x-axis with fixed bins (month,year) and plot each dataset against these bins. I think I'm missing some fundamental understanding of either Pandas or MPL, and I apologize for what I'm sure is a silly question. First post, go easy...
Since I can't comment yet, I'll edit here:
I have 18 months generated with pd.period_range. I also have a DataFrame full of observations with timestamps within those months. Some of the months have zero observations. How do I effectively count and chart the observations by month?
Have you tried the suggestions here?
You can also try this sort of approach to manually define the bin boundaries
bins = [0, 30, 60, 90, 120]
labels = [1, 2, 3, 4]
df['new_bin'] = pd.cut(df['existing_value'], bins=bins, labels=labels)
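For the monthly case described in the question, one possible sketch (on made-up timestamps; the column name date and the variable names are assumptions) is to count observations per month and then reindex against the full pd.period_range, so that months without observations show up as zero:

```python
import pandas as pd

# Master x-axis: 18 monthly periods.
months = pd.period_range('2020-01', periods=18, freq='M')

# Made-up irregular observations; most months have none.
df = pd.DataFrame({'date': pd.to_datetime([
    '2020-01-05', '2020-01-20', '2020-03-11', '2020-11-02', '2021-06-30',
])})

# Count per month, then align to the master axis, filling gaps with 0.
counts = (
    df.groupby(df['date'].dt.to_period('M'))
    .size()
    .reindex(months, fill_value=0)
)
print(counts)
```

Reindexing every dataset against the same months object gives all figures identical bins, for example via counts.plot(kind='bar').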

How to plot CDF plot based on two selected pandas series

Background
I have a dataframe containing three variables:
city: the city names within China.
pop: the population number of the corresponding city.
conc: the concentration of ambient pollutant of the corresponding city.
I want to investigate the cumulative distribution of the concentration by the population.
The sample figure is shown like this:
The sample dataset is uploaded here
My solution
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("./data/test.csv",)
df = df[df.columns[1:]]  # drop the first column
df = df.sort_values(by=['pm25'], ascending=False)
df = df.reset_index()
x_ = df['pm25'].values
y_ = []
# Cumulative population share up to and including row i
for i in range(0, len(df)-1, 1):
    y_.append(df['pop'].iloc[:i+1].sum() / df['pop'].sum())
y_.append(1.0)
plt.plot(x_, y_)
1. Any better method is highly appreciated!
2. Also, how to make the curve smooth as the first plot?
You can replace the loop with pd.Series.cumsum (use df['pop'] rather than df.pop, because df.pop resolves to the DataFrame.pop method, not the column):
y_ = df['pop'].cumsum() / df['pop'].sum()
For smoothing, you can use pd.Series.rolling:
plt.plot(x_, y_.rolling(3).mean())
which applies a low-pass filter (of length 3). You should consider whether that is what you want, however; your plot seems correct.
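Putting the pieces together, here is a small sketch on made-up numbers. The column names pop and pm25 follow the question's code; the values are assumptions, and min_periods=1 is an addition not in the answer above, used so that the rolling mean does not start with NaN.

```python
import pandas as pd

# Made-up sample; 'pop' and 'pm25' follow the column names in the question.
df = pd.DataFrame({
    'city': ['A', 'B', 'C', 'D'],
    'pop': [100, 300, 400, 200],
    'pm25': [80.0, 35.0, 60.0, 20.0],
})

# Same ordering as in the question: highest concentration first.
df = df.sort_values(by='pm25', ascending=False).reset_index(drop=True)

# Cumulative population share; note df['pop'], not df.pop (a method).
y_ = df['pop'].cumsum() / df['pop'].sum()
x_ = df['pm25']

# Optional smoothing; min_periods=1 avoids NaN at the start of the window.
y_smooth = y_.rolling(3, min_periods=1).mean()
print(pd.DataFrame({'pm25': x_, 'cdf': y_, 'smoothed': y_smooth}))
```

With this ordering, each value of y_ is the share of the population living in cities at or above the corresponding concentration.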
