Pandas histogram of dates with empty bins - python

My use case is very similar to this post, but my data is not continuous through each bin. I'm attempting to create multiple figures over the same time span to show activity (or lack thereof) over 18 months. I thought I hit the jackpot with the df.groupby(df.date.month).count() approach, but since my data is irregular I get different bins per dataset.
My question, then, is how would I go about creating some kind of master x-axis with fixed bins (month,year) and plot each dataset against these bins. I think I'm missing some fundamental understanding of either Pandas or MPL, and I apologize for what I'm sure is a silly question. First post, go easy...
Since I can't comment yet, I'll edit here:
I have 18 months generated with pd.period_range. I also have a DataFrame full of observations with timestamps within those months. Some of the months have zero observations. How do I effectively count and chart the observations by month?

Have you tried the suggestions here?
You can also try this sort of approach to manually define the bin boundaries
bins = [0, 30, 60, 90, 120]
labels = [1, 2, 3, 4]
df['new_bin'] = pd.cut(df['existing_value'], bins=bins, labels=labels)
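For the monthly case in the question, one possible sketch is to count observations per month and then reindex the result against the fixed pd.period_range, so that months without observations appear as zeros. The sample timestamps and the start month below are made up for illustration; only the column name date comes from the question.

```python
import pandas as pd

# Hypothetical observations; only the column name "date" comes from the question.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2020-01-05", "2020-01-20", "2020-03-02",
        "2020-06-15", "2020-06-16", "2020-06-30",
    ])
})

# Master x-axis: 18 fixed monthly bins (the start month is an assumption).
months = pd.period_range("2020-01", periods=18, freq="M")

# Count observations per month, then align to the master bins.
# Months with no observations are filled with 0 instead of being dropped.
counts = (
    df.groupby(df["date"].dt.to_period("M"))
    .size()
    .reindex(months, fill_value=0)
)

print(counts)
```

Because every dataset is reindexed against the same months, each resulting series has the same 18 entries, and something like counts.plot.bar() should then give every figure the same x-axis.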

Related

Plotting Histogram with Average of Weights in Bins in Python to Reconstruct a Function from Samples [closed]

I have a vector X with corresponding weights Y. Currently, I am plotting the data with a histogram, and it adds up all the weights within each bin. I would like to reconstruct Y as a function of X. To do this, I would like to plot the average weight within each bin, for example by dividing the value of each bin by the number of data points in that bin. Is there a simple way in Python to plot the average weight within each bin, rather than just plotting the sum of all the weights within each bin?
For example code, consider the data set
import numpy as np
import matplotlib.pyplot as plt
X = np.array([1., 1.5, 1.4, 2., 2.5, 2.1])
Y = np.array([5, 7, 6.5, 8, 9, 8.1])
plt.hist(X, weights=Y, bins=4)
If one does a plt.hist of this data, using Y as the weights, then it simply adds up the weights within each bin. However, since I am trying to reconstruct Y as a function of X, I would like to know the average value of Y for a given X.
I ended up using SciPy's binned_statistic() to compute the mean of the data within each bin, and then did a plt.plot of these means, where each point along the x-axis is the midpoint of a bin. Here is some code for those curious:
from scipy import stats
# statistic='mean' is the default; it is spelled out here for clarity
bin_means, bin_edges, binnumber = stats.binned_statistic(X, Y, statistic='mean', bins=50)
# Midpoint between each pair of adjacent bin edges
bin_points = (bin_edges[:-1] + bin_edges[1:]) / 2
plt.plot(bin_points, bin_means)
Thanks to @paime for the suggestion! I ended up using binned_statistic instead of their suggestion of calling numpy.histogram twice, on the thought that, for very large data sets, generating histograms can take a long time.
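For comparison, the two-histogram idea mentioned above can be sketched roughly as follows, using the sample X and Y from the question. The variable names and the NaN handling for empty bins are choices made for this illustration, not part of the original suggestion.

```python
import numpy as np

X = np.array([1.0, 1.5, 1.4, 2.0, 2.5, 2.1])
Y = np.array([5.0, 7.0, 6.5, 8.0, 9.0, 8.1])

# First histogram: sum of the weights in each bin.
weight_sums, bin_edges = np.histogram(X, bins=4, weights=Y)

# Second histogram: number of points in each bin, reusing the same edges.
counts, _ = np.histogram(X, bins=bin_edges)

# Mean weight per bin; bins without any points are left as NaN.
mean_per_bin = np.full(len(counts), np.nan)
np.divide(weight_sums, counts, out=mean_per_bin, where=counts > 0)

# Bin midpoints, usable as x-positions for a line plot.
bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2

print(mean_per_bin)
```

Passing bin_centers and mean_per_bin to plt.plot would then draw the reconstructed curve in the same way as the binned_statistic version.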
If I understood you correctly, seaborn's histplot may help here. It has a stat option that controls what the y-axis shows. In your case, try stat='probability' or stat='percent' to see the share of your data that falls in each bin.
Afterwards, to display the y-axis in percent, have a look at this post.

bin value of histograms from grouped data

I am a beginner in Python and I am making separate histograms of travel distance per departure hour. Data I'm using, about 2500 rows of this. Distance is float64, the Departuretime is str. However, for making further calculations I'd like to have the value of each bin in a histogram, for all histograms.
Up until now, I have the following:
df['Distance'].hist(by=df['Departuretime'], color = 'red',
edgecolor = 'black',figsize=(15,15),sharex=True,density=True)
This creates in my case a figure with 21 small histograms. Histogram output I'm receiving.
Of all these histograms I want to know the y-axis value of each bar, preferably in a dataframe with the distance binning as rows and the hours as columns.
With single histograms, I'd paste counts, bins, bars = in front of the entire line and the variable counts would contain the data I was looking for, however, in this case it does not work.
Ideally I'd like a dataframe or list of some sort for each histogram, containing the density values of the bins. I hope someone can help me out! Big thanks in advance!
First of all, note that the different histograms you are generating don't share the same bin edges. You can see this because you are using sharex=True and the resulting bars don't have the same width. In every case you get 10 bins (the default), but they are not the same 10 bins.
This makes it impossible to combine them all in a single table in any meaningful way. You could provide a fixed list of bin edges as the bins parameter to standardize this.
Alternatively, I suggest you calculate a new column that describes which bin each row belongs to. This also unifies the bin calculation.
You can do this with the cut function, which gives you the same freedom to choose the number of bins or the specific bin edges as hist does.
df['DistanceBin'] = pd.cut(df['Distance'], bins=10)
Then, you can use pivot_table to obtain a table with the counts for each combination of DistanceBin and Departuretime, as rows and columns respectively, as you asked. Passing values='Distance' restricts the count to that one column.
df.pivot_table(index='DistanceBin', columns='Departuretime', values='Distance', aggfunc='count')
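The pivot table above holds counts, while the question asks for density values. One possible sketch, assuming a single shared set of bin edges, is to run np.histogram with density=True on each hour's distances and collect the results as columns. The random data, the 0 to 50 range and the hour labels below are invented stand-ins for the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data (about 2500 rows in the question).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Distance": rng.uniform(0, 50, size=300),
    "Departuretime": rng.choice(["07", "08", "09"], size=300),
})

# One shared set of bin edges, so every hour uses identical bins.
edges = np.linspace(0, 50, 11)

# Density per bin for each departure hour; rows are distance bins, columns are hours.
density = pd.DataFrame(
    {
        hour: np.histogram(group["Distance"], bins=edges, density=True)[0]
        for hour, group in df.groupby("Departuretime")
    },
    index=pd.IntervalIndex.from_breaks(edges, closed="left"),
)

print(density)
```

With density=True each column integrates to 1 over the bin widths, which should match the bar heights that hist(..., density=True) draws when it is given the same edges.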

Altair: Controlling tick counts for binned axis

I'm trying to generate a histogram in Altair, but I'm having trouble controlling the tick count for the axis corresponding to the binned variable (x-axis). I'm new to Altair, so apologies if I'm missing something obvious here. I looked for others who had faced this kind of issue but didn't find an exact match.
The code to generate the histogram is
alt.Chart(df_test).mark_bar().encode(
x=alt.X('x:Q', bin=alt.Bin(step=0.1), scale=alt.Scale(domain=[8.9, 11.6])),
y=alt.Y('count(y):Q', title='Count(Y)')
).configure_axis(labelLimit=0, tickCount=3)
df_test is a Pandas dataframe - the data for which is available here.
The above code generates the following histogram. Changing tickCount changes the y-axis tick counts, but not the x-axis.
Any guidance is appreciated.
There might be a more convenient way to do this using bin=, but one approach is to use transform_bin with mark_rect, since this does not change the axis into a binned axis (which are more difficult to customize):
import altair as alt
from vega_datasets import data
source = data.movies.url
alt.Chart(source).mark_rect(stroke='white').encode(
x=alt.X('x1:Q', title='IMDB Rating', axis=alt.Axis(tickCount=3)),
x2='x2:Q',
y='count()',
).transform_bin(
['x1', 'x2'], field='IMDB_Rating'
)
You might notice that you don't get the exact number of ticks. This is because ticks are rounded to "nice" values, such as multiples of 5. I couldn't turn this off even when setting nice=False on the scale, so another approach in those cases is to pass the exact tick values with values=.
alt.Chart(source).mark_rect(stroke='white').encode(
x=alt.X('x1:Q', title='IMDB Rating', axis=alt.Axis(values=[0, 3, 6, 9])),
x2='x2:Q',
y='count()',
).transform_bin(
['x1', 'x2'], field='IMDB_Rating'
)
Be careful with decimal values: these are automatically displayed as integers (even with tickRound=False), but in the wrong position. This seems like a bug to me, so if you investigate it further you might want to report it on the Vega-Lite issue tracker.

How to compress a time series plot, after taking out specific month from a dataset in python?

I have a dataset of OLR from 1986-2013 (daily data), and I am interested in plotting a time series which should have only boreal winter months i.e. from November to April.
(i) I am able to select the Nov-Apr months from my dataset using
OLRNA = OLR.sel(TIME = OLR.TIME.dt.month.isin([11,12,1,2,3,4]))
and this works.
(ii) The problem is that whenever I plot the time series, it is not continuous: Nov-Apr of each year is not joined to the next, and gaps appear for the remaining months. I understand that the gaps appear because I selected only the Nov-Apr months. How can I join or compress the time axis?
How do I plot this time series properly?
Instead, first mask the months you do not want to plot, and then remove the masked rows by applying dropna:
OLRNA = OLR.mask(OLR.TIME.dt.month.isin([5, 6, 7, 8, 9, 10]))
OLRNA = OLRNA.dropna()
I worked on this issue and got a proper plot, so I am answering my own question.
After selecting the specific months from the time series, plot the series without using 'time' on the x-axis. That is, plot only the y-axis variable and let the x-axis denote the serial number of each sample. Then, with the help of matplotlib, set the xticks and xticklabels manually wherever you want them.
Thank you, especially to Bruno Vermeulen for the cooperation.
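The serial-number approach described above could look roughly like this. It is a sketch on a made-up daily pandas Series rather than the original OLR dataset, and placing ticks at each 1 November is just one possible choice.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical daily series standing in for the OLR data.
time = pd.date_range("1986-01-01", "1988-12-31", freq="D")
olr = pd.Series(np.sin(np.arange(len(time)) / 30.0), index=time)

# Keep only the boreal winter months (November to April).
winter = olr[olr.index.month.isin([11, 12, 1, 2, 3, 4])]

# Plot against serial numbers instead of dates, which closes the gaps.
positions = np.arange(len(winter))
fig, ax = plt.subplots()
ax.plot(positions, winter.to_numpy())

# Place the ticks manually: here, at the first day of each November.
is_season_start = (winter.index.month == 11) & (winter.index.day == 1)
tick_positions = positions[is_season_start]
ax.set_xticks(tick_positions)
ax.set_xticklabels(winter.index[is_season_start].strftime("%b %Y"))

fig.savefig("winter_olr.png")
```

Since the x-axis now holds plain integers, the tick labels are the only place where the dates appear, so they need to be chosen with care.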

Matplotlib: plotting string values give strange behaviour

I'm trying to plot two data series:
day of the year (X-axis): ['2019-01-01', '2019-01-02', ...]
hour of sunrise (Y-axis): ['07:04', '07:03', ...]
But matplotlib is driving me crazy… here's the plot of a subset (ax.plot(datelist[130:230], hourlist[130:230], label='sunrise')):
As you can see, the Y-axis decreases from '03:57' to '03:33' and then suddenly starts to increase up to '04:26'. That makes no sense to me.
Can you help me fix that?
Bonus points if you tell me how to show a decent scale on both axis (i.e. 00:00 – 24:00 equally spaced by 1 hour with minor ticks; and a list of chosen dates for the X-axis).
Thank you in advance!
So, thanks to @ImportanceOfBeingErnest's insight, I managed to make it work by converting both data series to Python's datetime.datetime objects, but that wasn't enough.
In order to be plotted properly, the Y-values also needed to share the same date (a fixed reference date used just for plotting purposes).
For the chart's scale I found the matplotlib.dates module, which happens to contain useful Formatters and Locators for the axis attributes.
In order to get a full 24 hours range for the Y-axis I've used:
import datetime
from matplotlib.dates import DateFormatter, HourLocator
ax.set_ylim([datetime.datetime(2019, 1, 1), datetime.datetime(2019, 1, 2)])
ax.yaxis.set_major_formatter(DateFormatter('%H:%M'))
ax.yaxis.set_major_locator(HourLocator())
The overall result (with some additions) seems good enough for now (even if I have to fix the UTC's offsets):
Thank you again!
What type of variables did you use? You probably used strings in datelist and hourlist. Matplotlib treats strings as categories and places them on the axis in order of first appearance, rather than sorting them by value.
You need to convert your values to the correct object type, and then you will be able to plot them correctly.
For example:
If I plot the list ['c','a','b'] as the y values, then my y-axis would read: c, then a, then b.
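Putting the two answers together, the conversion might be sketched as follows. The three sample values and the choice of 2019-01-01 as the reference date are assumptions for illustration; the names datelist and hourlist follow the question.

```python
import datetime

import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib.dates import DateFormatter, HourLocator

# Short hypothetical excerpts of the two string lists from the question.
datelist = ["2019-01-01", "2019-01-02", "2019-01-03"]
hourlist = ["07:04", "07:03", "07:03"]

# X values: parse the date strings into datetime objects.
dates = [datetime.datetime.strptime(d, "%Y-%m-%d") for d in datelist]

# Y values: parse the times and attach one fixed reference date,
# so that all points share the same 24-hour axis.
reference_date = datetime.date(2019, 1, 1)
times = [
    datetime.datetime.combine(
        reference_date, datetime.datetime.strptime(h, "%H:%M").time()
    )
    for h in hourlist
]

fig, ax = plt.subplots()
ax.plot(dates, times, label="sunrise")

# Full 24-hour range on the y-axis with one major tick per hour.
ax.set_ylim([datetime.datetime(2019, 1, 1), datetime.datetime(2019, 1, 2)])
ax.yaxis.set_major_formatter(DateFormatter("%H:%M"))
ax.yaxis.set_major_locator(HourLocator())
ax.legend()

fig.savefig("sunrise.png")
```

With real datetime values on both axes, matplotlib orders the points numerically, so the apparent jump back and forth in the original plot should disappear.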
