Xaxis Labels not matching Data Points - Pandas/Matplotlib - python

I have a TimeSeries in Pandas that I want to plot. I have 336 records in the TimeSeries. I only want to show the date/time (index of the TimeSeries) on the x-axis once per every 20 or so data points.
Here is how I am trying to do this:
stats.plot()
ax.set_xticklabels(stats.index, rotation=45 )
ax.xaxis.set_major_locator(MultipleLocator(20))
ax.xaxis.set_minor_locator(NullLocator())
ax.yaxis.set_major_locator(MultipleLocator(.075))
draw()
My x-axis show the correct number of labels (18), but these are the first 18 in the series, they are not correctly corresponding to the datapoints in the plot.

The problem is you are using set_xticklabels which sets the value of the tick labels independent of the data. The ticks are labeled sequentially from the list you pass in.
From this I can't really tell what you are trying to do, but the behavior you are seeing is the 'correct' behavior for the library (it's doing exactly what you told it to, but that isn't what you want it to do).

Related

Matplotlib plotting data that doesnt exist

I am trying to plot three lines on one figure. I have data for three years for three sites and i am simply trying to plot them with the same x axis and same y axis. The first two lines span all three years of data, while the third dataset is usually more sparse. Using the object-oriented axes matplotlib format, when i try to plot my third set of data, I get points at the end of the graph that are out of the range of my third set of data. my third dataset is structured as tuples of dates and values such as:
data=
[('2019-07-15', 30.6),
('2019-07-16', 20.88),
('2019-07-17', 16.94),
('2019-07-18', 11.99),
('2019-07-19', 13.76),
('2019-07-20', 16.97),
('2019-07-21', 19.9),
('2019-07-22', 25.56),
('2019-07-23', 18.59),
...
('2020-08-11', 8.33),
('2020-08-12', 10.06),
('2020-08-13', 12.21),
('2020-08-15', 6.94),
('2020-08-16', 5.51),
('2020-08-17', 6.98),
('2020-08-18', 6.17)]
where the data ends in August 2020, yet the graph includes points at the end of 2020. This is happening with all my sites, as the first two datasets stay constant knowndf['DATE'] and knowndf['Value'] below.
Here is the problematic graph.
And here is what I have for the plotting:
fig, ax=plt.subplots(1,1,figsize=(15,12))
fig.tight_layout(pad=6)
ax.plot(knowndf['DATE'], knowndf['Value1'],'b',alpha=0.7)
ax.plot(knowndf['DATE'], knowndf['Value2'],color='red',alpha=0.7)
ax.plot(*zip(*data), 'g*', markersize=8) #when i plot this set of data i get nonexistent points
ax.tick_params(axis='x', rotation=45) #rotating for aesthetic
ax.set_xticks(ax.get_xticks()[::30]) #only want every 30th tick instead of every daily tick
I've tried ax.twinx() and that gives me two y axis that doesn't help me since i want to use the same x-axis and y-axis for all three sites. I've tried not using the axes approach, but there are things that come with axes that i need to plot with. Please please help!

How can I make line chart with uneven time intervals? in FacetGrid function

I made a line chart with seaborn FacetGrid, x-axis is the column called "dato", it is indicating the date of taking samples.
I have unequal time intervals between every sample taking, thereby I want to have corresponding space between every tick on x axis. how can I do that?
and I have a problem with making the correct order for dates also, tried some code, but not working for FacetGrid.
Here is my code and graph I made
mer=sns.FacetGrid(df, col='anlegg',hue='Merd',sharey=False,ylim=(0,4000))
sns.set_style('white')
sns.set_context('paper', font_scale=1.2)
mer.map_dataframe(sns.lineplot,x='dato',y='BW',marker='o',err_style='bars')
mer.add_legend()
mer.set_axis_labels('','BW')
mer.set_titles(col_template='anlegg {col_name}')
mer.set_xticklabels(rotation=90)

How to use column values for x axis labels in matplotlib

I have a basic DataFrame in pandas and using matplotlib to create a chart
I have followed advice found on SO and also on the docs for labelling the values on the x axis but they won't change from the indices.
I have this,
Presc_df_asc = Presc_df.sort_values('Total Items',ascending=True)
Presc_df_asc['Total Items'].plot.bar(x="Practice", ylim=[Presc_df_asc['Total Items'].min(), Presc_df_asc['Total Items'].max()])
plt.xlabel('Practice')
plt.ylabel('Total Items')
plt.title('practice total items')
plt.legend(('Items',),loc='upper center')
From what I have found plot.bar(x="Practice" should set the x-axis to show the values int he practice column under each bar.
But no matter what I try I get the x-axis labelled as indices with just the main label saying Practices.
In order for the plotting command to be able to access the "Practice" column, you need to apply the plot function to the entire dataframe (or a sub_dataframe that contains at least these two columns). The code below uses the corresponding labels below each bar. The rot=0 argument prevents the labels from being rotated by 90°.
Presc_df_asc.plot.bar(x="Practice", y ="Total Items",
ylim=[Presc_df_asc['Total Items'].min(),
Presc_df_asc['Total Items'].max()], rot=0)

Plotting a column with millions of rows

I have a data-frame with millions of rows (almost 8 million). I need to see the distribution of the values in one of the columns. This column is called 'price_per_mile'. I also have a column called 'Borough'. The final goal is doing a t-test.
First I want to see the distribution of data in 'price_per_mile', to see if data is normal and if I need to do some data cleaning. Then group-by based on five categories in 'borough' column and then do the t-test for each possible pair of boroughs.
I have tried to plot the distribution with sns.distplot() but it doesn't give me a clear plot as it seems there's a scaling of the values on the y-axis. Also, the range of values contained in 'price_per_mile' is big.
Then I tried to plot a section of values, again the plot doesn't look clear and informative enough. Scaling happens again.
result.drop(result[(result.price_per_mile <1) | (result.price_per_mile>200)].index, inplace=True)
What do I need to do to have a better-looking plot which gives me the true value of each bin and not just a normalized value?
I read the documentation for sns.distplot() but didn't find something helpful.
As per the documentation for displot (emphasis mine)
norm_hist : bool, optional
If True, the histogram height shows a density rather than a count. This is implied if a KDE or fitted density is plotted.
Which means that if you want the non-normalized histogram, you have to make sure to instruct seaborn to not plot the KDE at the same time
sns.distplot(a, kde=True, norm_hist=False)
sns.distplot(a, kde=False, norm_hist=False)

Extra set of bars on plot in Pandas?

I want to create a plot using Pandas to show the standard deviations of item prices on specific week days (in my case there are 6 relevant days of the week, each shown as 0-5 on the x axis).
It seems to work however there is another set of smaller bars next to each standard deviation bar that is literally also valued at 0-5.
I think this means that I'm also accidentally also plotting the day of the week.
How can I get rid of these smaller bars and only show the standard deviation bars?
sales_std=sales_std[['WeekDay','price']].groupby(['WeekDay']).std()
.reset_index()
Here is where I try to plot the graph:
p = sales_std.plot(figsize=
(15,5),legend=False,kind="bar",rot=45,color="orange",fontsize=16,
yerr=sales_std);
p.set_title("Standard Deviation", fontsize=18);
p.set_xlabel("WeekDay", fontsize=18);
p.set_ylabel("Price", fontsize=18);
p.set_ylim(0,100);
Resulting Bar Plot:
You are plotting both WeekDay and price at the same time (i.e. plotting an entire Dataframe). In order to show bars for price only, you need to plot Series given WeekDay as an index (so no reset_index() is required after groupby()).
# you don't need `reset_index()` in your code
sales_std=sales_std[['WeekDay','price']].groupby(['WeekDay']).std()
sales_std['price'].plot(kind='bar')
Note: I intentionally omitted graph-styling parts of your code to focus on fixing the issue.

Categories

Resources