Plotting Bar Graph by Years in Matplotlib - python

I am trying to plot this DataFrame which records various amounts of money over a yearly series:
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.dates import date2num
jp = pd.DataFrame([1000,2000,2500,3000,3250,3750,4500], index=['2011','2012','2013','2014','2015','2016','2017'])
jp.index = pd.to_datetime(jp.index, format='%Y')
jp.columns = ['Money']
I would simply like to make a bar graph out of this using PyPlot (i.e. pyplot.bar).
I tried:
plt.figure(figsize=(15,5))
xvals = date2num(jp.index.date)
yvals = jp['Money']
plt.bar(xvals, yvals, color='black')
ax = plt.gca()
ax.xaxis_date()
plt.show()
But the chart turns out like this:
Only by increasing the width substantially will I start seeing the bars. I have a feeling that this graph is attributing the data to the first date of the year (2011-01-01 for example), hence the massive space between each 'bar' and the thinness of the bars.
How can I plot this properly, knowing that this is a yearly series? Ideally the x-axis would contain only the years. Something tells me that I do not need to use date2num(), since this seems like a very common, ordinary plotting exercise.
My guess as to where I'm stuck is not handling the year correctly. As of now I have them as a DatetimeIndex, but maybe there are other steps I need to take.
This has puzzled me for 2 days. All the solutions I found online seem to use DataFrame.plot, but I would rather learn how to use PyPlot properly. I also intend to add two more sets of bars, and it seems like the most common way to do that is through plt.bar().
Thanks everyone.

You can either do
jp.plot.bar()
which gives:
or plot against the actual years:
plt.bar(jp.index.year, jp.Money)
which gives:
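For context on why the original attempt produced hair-thin bars: plt.bar uses a default width of 0.8 in x-axis units, and with date2num those units are days, so each bar is less than one day wide on an axis where neighbouring bars are about 365 days apart. Plotting against integer years makes 0.8 mean "0.8 of a year". Below is a minimal sketch of that idea, extended to two sets of bars side by side. The second column ("Other") is invented purely for illustration and is not part of the question's data.

```python
import matplotlib.pyplot as plt
import pandas as pd

jp = pd.DataFrame(
    {"Money": [1000, 2000, 2500, 3000, 3250, 3750, 4500]},
    index=pd.to_datetime(
        ["2011", "2012", "2013", "2014", "2015", "2016", "2017"], format="%Y"
    ),
)
# Hypothetical second series, made up only to show grouped bars.
jp["Other"] = jp["Money"] * 0.5

years = jp.index.year.to_numpy()  # plain integers 2011..2017
width = 0.4                       # bar width in x-axis units (years here)

fig, ax = plt.subplots(figsize=(15, 5))
# Shift each set by half a bar width so the pair is centred on the year.
ax.bar(years - width / 2, jp["Money"], width=width, color="black", label="Money")
ax.bar(years + width / 2, jp["Other"], width=width, color="gray", label="Other")
ax.set_xticks(years)              # one tick per year, labelled with the year
ax.legend()
# plt.show()  # uncomment to display the figure interactively
```

With three sets of bars, the same pattern should work with a narrower width and offsets of -width, 0 and +width.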

Related

Datetime plotting

Python beginner here :/!
The csv files can be found here (https://www.waterdatafortexas.org/groundwater/well/8739308)
I'm trying to subset my data and plot it by year or every 6 months, but I just can't make it work. This is my code so far:
data=pd.read_csv('Water well.csv')
data["datetime"]=pd.to_datetime(data["datetime"])
data["datetime"]
fig, ax = plt.subplots()
ax.plot(data["datetime"], data["water_level(ft below land surface)"])
ax.set_xticklabels(data["datetime"], rotation= 90)
And this is my data and the output. As you can see, it only plots 2021 by time.
[Image: my data of water levels from 2016 to 2021 and the output of the code]
When you run your script, you get the following warning:
UserWarning: FixedFormatter should only be used together with FixedLocator
ax.set_xticklabels(data["datetime"], rotation= 90)
Your example demonstrates why they included this warning.
Comment out your line
#ax.set_xticklabels(data["datetime"], rotation= 90)
and you have the following (correct) output:
Your code now takes the nine automatically generated x-axis ticks, removes the correct labels, and labels them instead with the first nine entries of the dataframe. Obviously, these labels are wrong, and this is why the warning exists: either let matplotlib do the automatic labeling, or set both tick positions and labels yourself using FixedLocator and FixedFormatter so that they match.
For more information on Tick locators and formatters consult the matplotlib documentation.
P.S.: You also have to invert the y-axis because the data are in ft below land surface.
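A minimal sketch of the two points above: keep matplotlib's automatic date ticks, rotate them without replacing their text, and invert the y-axis. The CSV is not reproduced here, so the data frame below is a synthetic stand-in that only borrows the column names from the question.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for 'Water well.csv' (assumption: same column names).
col = "water_level(ft below land surface)"
rng = np.random.default_rng(0)
data = pd.DataFrame(
    {"datetime": pd.date_range("2016-01-01", "2021-12-31", freq="D")}
)
data[col] = 50 + 0.1 * rng.standard_normal(len(data)).cumsum()

fig, ax = plt.subplots()
ax.plot(data["datetime"], data[col])
# Rotate the labels only; tick positions and label text stay automatic.
ax.tick_params(axis="x", labelrotation=90)
# Values are depths below the surface, so larger numbers go lower on the plot.
ax.invert_yaxis()
# plt.show()  # uncomment to display the figure interactively
```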
The problem is that you have too much data; you have to simplify it.
As a first step, you can try something like this:
data["datetime"]=pd.to_datetime(data["datetime"])
date = data["datetime"][0::1000][0:10]
level = data["water_level(ft below land surface)"][0::1000][0:10]
fig, ax = plt.subplots()
ax.plot(date, level)
# Rotate the automatic labels instead of overwriting them with set_xticklabels
ax.tick_params(axis="x", labelrotation=90)
date = data["datetime"][0::1000][0:10]
This line means: take the element at index 0, then 1000, then 2000, and so on, which gives you a new array. From that new array you then keep only the first 10 elements.
It's a dirty solution.
The best solution, in my opinion, is to create a new dataset with the average water level for each day or each week, and then plot that result.
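The averaging idea could be sketched roughly as follows, using pandas resampling for the daily and weekly means and a group-by on the year for the per-year view the question asks about. The data is synthetic (the real CSV is not available here), and the column names are assumed to match the question.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the well data: 20000 evenly spaced readings.
col = "water_level(ft below land surface)"
rng = np.random.default_rng(0)
data = pd.DataFrame(
    {"datetime": pd.date_range("2016-01-01", "2021-12-31", periods=20000)}
)
data[col] = 50 + 0.01 * rng.standard_normal(len(data)).cumsum()

series = data.set_index("datetime")[col]
daily = series.resample("D").mean()    # one average value per day
weekly = series.resample("W").mean()   # one average value per week
yearly = series.groupby(series.index.year).mean()  # one average per calendar year

fig, ax = plt.subplots()
ax.plot(weekly.index, weekly.to_numpy())
ax.invert_yaxis()  # depth below land surface
# plt.show()  # uncomment to display the figure interactively
```

Unlike slicing with [0::1000], this keeps every reading's contribution and the full date range on the x-axis.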

How to scale and customize axis range in matplotlib?

I am new to data visualization, so please bear with me.
I am trying to create a plot that shows various attributes of a data set of blockbuster movies. The x-axis will be the year of the movie and the y-axis will be the worldwide gross. Now, some movies have made upwards of a billion in this category, and it seems that my y-axis is overwhelmed: the labels completely block each other out and become illegible. Here is what I have thus far:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv('blockbusters.csv')
fig, ax = plt.subplots()
ax.set_title('Top Grossing Films')
ax.set_xlabel('Year')
ax.set_ylabel('Worldwide Grossing')
x = df['year'] #xaxis
y = df['worldwide_gross'] #yaxis
plt.show()
Any tips on how to scale this down? Ideally it could be presented on a scale of 10. Thanks in advance!
You could try logarithmic scaling:
ax.set_yscale('log')
You might want to manually set the ticks on the y-axis using
ax.set_yticks([list of values for which you want to have a tick])
ax.set_yticklabels([list of labels you want on each tick]) # optional
Another way to approach this might be to rank the movies (which gross is the highest, second highest, ...), i.e. on the y axis you would plot
df['worldwide_gross'].rank()
Edit: as you indicate, one might also check the dtypes to make sure the data is numerical. If not, use .astype(int) or .astype(float) to convert it.
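To make the dtype point concrete, here is a small sketch. It assumes (this is a guess, the CSV is not shown) that the gross column was read as strings such as "$1,200,000,000"; the rows below are made up. It converts the column to floats and labels the y-axis in billions with a FuncFormatter, which is one way to get the "scaled down" axis the question asks for.

```python
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.ticker import FuncFormatter

# Made-up rows standing in for 'blockbusters.csv'; gross stored as text (assumption).
df = pd.DataFrame(
    {
        "year": [2015, 2016, 2017, 2018],
        "worldwide_gross": ["$1,200,000,000", "$850,000,000",
                            "$1,500,000,000", "$2,000,000,000"],
    }
)
print(df.dtypes)  # an 'object' dtype here means the values are not numeric

# Strip '$' and ',' and convert to float so matplotlib sees real numbers.
df["gross"] = (
    df["worldwide_gross"].str.replace(r"[$,]", "", regex=True).astype(float)
)


def billions(value, pos):
    """Format a tick value given in dollars as billions, e.g. 1.5e9 -> '1.5B'."""
    return f"{value / 1e9:.1f}B"


fig, ax = plt.subplots()
ax.scatter(df["year"], df["gross"])
ax.set_title("Top Grossing Films")
ax.set_xlabel("Year")
ax.set_ylabel("Worldwide Grossing (USD)")
ax.set_xticks(df["year"])
ax.yaxis.set_major_formatter(FuncFormatter(billions))
# plt.show()  # uncomment to display the figure interactively
```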

Plotting data with matplot and python to graph

I'm currently trying to plot 7 days with varying small to large numbers.
The first set of data may look like this
dates = ['2018-09-20', '2018-09-21', '2018-09-22', '2018-09-23', '2018-09-24', '2018-09-25', '2018-09-26', '2018-09-27']
values = [107.660514, 107.550403, 107.435041, 107.435003, 107.574965, 107.449961, 107.650052, 107.649974]
whereas another set of data may have the same dates, but with values that change only in much smaller increments:
dates = ['2018-09-20', '2018-09-21', '2018-09-22', '2018-09-23', '2018-09-24', '2018-09-25', '2018-09-26', '2018-09-27']
values = [0.849215, 0.849655, 0.849655, 0.851095, 0.850885, 0.850135, 0.851203, 0.851865]
When I use this
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
plt.plot_date(x=dates, y=values, fmt="r--")
plt.ylabel(c)
plt.grid(True)
plt.savefig('static/%s.png' % c)
The resulting image for the 1st set of values comes out as a dashed line connecting the dots for each day. But the 2nd set of data makes an image of 7 parallel lines stacked on top of each other.
Should I be plotting this differently?
I assume you would like to compare the two sets of data you provided.
However, with such a gap between the two sets of values, the result could be fairly unclear if you show both sets in the same plot.
You could use plt.subplots() to do that, and you'll probably get a plot like this
Or, better, just show the two plots separately, and you'll get a much clearer plot.
If you want to just show two plots, you can do something like this.
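The answer's original code is not preserved here, so the following is only a sketch of the "two separate plots" idea, not the answerer's exact code. It uses the two value lists from the question, converts the date strings with pd.to_datetime, and gives each series its own axes so each gets its own y-range.

```python
import matplotlib.pyplot as plt
import pandas as pd

dates = pd.to_datetime(
    ["2018-09-20", "2018-09-21", "2018-09-22", "2018-09-23",
     "2018-09-24", "2018-09-25", "2018-09-26", "2018-09-27"]
)
values_a = [107.660514, 107.550403, 107.435041, 107.435003,
            107.574965, 107.449961, 107.650052, 107.649974]
values_b = [0.849215, 0.849655, 0.849655, 0.851095,
            0.850885, 0.850135, 0.851203, 0.851865]

# Two stacked axes sharing the dates, each with an independent y-scale.
fig, (ax_a, ax_b) = plt.subplots(2, 1, sharex=True, figsize=(8, 6))
ax_a.plot(dates, values_a, "r--", marker="o")
ax_b.plot(dates, values_b, "b--", marker="o")
for ax in (ax_a, ax_b):
    ax.grid(True)
    # Show plain y values instead of a small offset added to the axis corner.
    ax.ticklabel_format(axis="y", useOffset=False)
fig.autofmt_xdate()  # slant the date labels so they do not overlap
# plt.show()  # uncomment to display the figure interactively
```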

plot x-axis not displaying correctly for rolling mean

I'm obviously making a very basic mistake in adding a rolling mean plot to my figure.
The basic plot of close prices works fine, but as soon as I add the rolling mean to the plot, the x-axis dates get screwed up and I can't see what it's trying to do.
Here's the code:
import pandas as pd
import matplotlib.pyplot as plot
df = pd.read_csv('historical_price_data.csv')
df['Date'] = pd.to_datetime(df.Date, infer_datetime_format=True)
df.sort_index(inplace=True)
ax = df[['Date', 'Close']].plot(figsize=(14, 7), x='Date', color='black')
rolling_mean = df.Close.rolling(window=7).mean()
plot.plot(rolling_mean, color='blue', label='Rolling Mean')
plot.show()
With this sample data set I am getting this figure:
Given the simplicity of this code, I'm obviously making a very basic mistake; I just can't see what it is.
EDIT: Interesting. @AndreyPortnoy's suggestion to set the index to Date results in the odd error that Date is not in the index. When I use the built-ins per his suggestion, the figure is no longer a complete mess, but for some reason the x-axis is reversed and the ticks are no longer dates but apparently ints (?), even though df.dtypes shows Date is datetime64[ns].
@Sandipan Dey: Here's what the dataset looks like. Per the code above, I'm using pd.to_datetime() to convert to datetime64, and I have tried df[::-1] to fix the problem where the axis is reversed when the 2nd plot (mov_avg) is added to the figure (but not reversed when the figure only has the 1 plot).
The fact that your dates for the moving averages start at 1970 suggests that an integer range index is used. It was generated by default when you read in the csv file. Try inserting
df.set_index('Date', inplace=True)
before
df.sort_index(inplace=True)
Then you can do
ax = df['Close'].plot(figsize=(14, 7), color='black')
rolling_mean = df.Close.rolling(window=7).mean()
plot.plot(rolling_mean, color='blue', label='Rolling Mean')
Note that I'm not passing x explicitly, letting pandas and matplotlib infer it.
You can simplify your code by using the builtin plotting facilities like so:
df['mov_avg'] = df['Close'].rolling(window=7).mean()
df[['Close', 'mov_avg']].plot(figsize=(14, 7))
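Putting the pieces together, a self-contained sketch might look like this. The price data is synthetic and deliberately listed newest-first (an assumption based on the reversed-axis remark in the question), so that the effect of setting and sorting the date index is visible.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic prices in place of 'historical_price_data.csv', newest row first.
rng = np.random.default_rng(1)
dates = pd.date_range("2018-01-01", periods=60, freq="D")[::-1]
df = pd.DataFrame({"Date": dates, "Close": 100 + rng.standard_normal(60).cumsum()})

# Use the dates as the index, then sort so time runs left to right.
df = df.set_index("Date").sort_index()

# Both columns now share the same DatetimeIndex, so they line up on one axis.
df["mov_avg"] = df["Close"].rolling(window=7).mean()
ax = df[["Close", "mov_avg"]].plot(figsize=(14, 7), color=["black", "blue"])
# plt.show()  # uncomment to display the figure interactively
```

The first six values of mov_avg are NaN because a 7-row window is not yet full, so the blue line starts slightly later than the black one.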

How do I change the density of x-ticks of a pandas time series plot?

I am trying to generate a smaller figure visualising a pandas time series. The automatically generated x-ticks, however, do not adapt to the new size, which results in overlapping tick labels. How can I adapt the frequency of the x-ticks? For example:
figsize(4, 2)
num = 3000
X = linspace(0, 100, num=num)
dense_ts = pd.DataFrame(sin(X) + 0.1 * np.random.randn(num),
                        pd.date_range('2014-01-1', periods=num, freq='min'))
dense_ts.plot()
The figure that I get is:
I am able to work around this problem using the Matplotlib date plotting, but it is not a very elegant solution - the code requires me to specify all the output formatting on a per-case basis.
figsize(4, 2)
from matplotlib import dates
fig, ax = plt.subplots()
ax.plot_date(dense_ts.index.to_pydatetime(), dense_ts, 'b-')
ax.xaxis.set_minor_locator(dates.HourLocator(byhour=range(24),
                                             interval=12))
ax.xaxis.set_minor_formatter(dates.DateFormatter('%H:%M'))  # %M is minutes; %m would be the month
ax.xaxis.set_major_locator(dates.WeekdayLocator())
ax.xaxis.set_major_formatter(dates.DateFormatter('\n\n%a\n%Y'))
plt.show()
I'm wondering if there is a way to solve this issue using the pandas plotting module or maybe by setting some axes object properties? I tried playing with the ax.freq object, but couldn't really achieve anything.
You can pass a list of x axis values you want displayed in your dense_ts.plot()
dense_ts.plot(xticks=['10:01','22:01'...])
Another example for clarity
df = pd.DataFrame(np.random.randn(10,3))
Plot without specifying xticks
df.plot(legend=False)
Plot with xticks argument
df.plot(xticks=[2,4,6,8],legend=False)
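For the datetime-indexed series from the question, a different route (not the xticks argument described above) is to ask pandas for plain matplotlib date handling with x_compat=True and then cap the number of ticks with a matplotlib date locator. This is a sketch under that assumption; the minticks/maxticks values are arbitrary choices for a 4x2 inch figure.

```python
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

num = 3000
rng = np.random.default_rng(0)
X = np.linspace(0, 100, num=num)
dense_ts = pd.Series(
    np.sin(X) + 0.1 * rng.standard_normal(num),
    index=pd.date_range("2014-01-01", periods=num, freq="min"),
)

fig, ax = plt.subplots(figsize=(4, 2))
# x_compat=True makes pandas use ordinary matplotlib date units on the x-axis,
# so the standard matplotlib.dates locators and formatters apply.
dense_ts.plot(ax=ax, x_compat=True)

# Let matplotlib choose tick positions, but keep the count small.
locator = mdates.AutoDateLocator(minticks=2, maxticks=5)
ax.xaxis.set_major_locator(locator)
ax.xaxis.set_major_formatter(mdates.ConciseDateFormatter(locator))
# plt.show()  # uncomment to display the figure interactively
```

This avoids spelling out the format for every case, since the locator and formatter adapt to whatever date range is plotted.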
