Using MaxNLocator for pandas bar plot results in wrong labels

I have a pandas dataframe and I want to create a plot of it:
import pandas as pd
from matplotlib.ticker import MultipleLocator, FormatStrFormatter, MaxNLocator
df = pd.DataFrame([1, 3, 3, 5, 10, 20, 11, 7, 2, 3, 1], range(-5, 6))
df.plot(kind='barh')
Nice, everything works as expected:
Now I wanted to hide some of the ticks on the y-axis. Looking at the docs, I thought I could achieve it with:
MaxNLocator: Finds up to a max number of intervals with ticks at nice locations.
MultipleLocator: Ticks and range are a multiple of base; either integer or float.
But neither of them plots what I expected to see (the values on the y-axis do not show the correct numbers):
ax = df.plot(kind='barh')
ax.yaxis.set_major_locator(MultipleLocator(2))
ax = df.plot(kind='barh')
ax.yaxis.set_major_locator(MaxNLocator(3))
What am I doing wrong?

Problem
The problem occurs because pandas bar plots are categorical. Each bar is positioned at a successive integer value starting at 0. Only the labels are adjusted to show the actual dataframe index. So here you have a FixedLocator with values 0, 1, 2, 3, ... and a FixedFormatter with values -5, -4, -3, .... Changing the locator alone does not change the formatter, hence you get the labels -5, -4, -3, ... but at different locations (one tick is not shown, hence the plot starts at -4 here).
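To see the categorical positioning for yourself, you can print the tick positions pandas puts on the y-axis and compare them with the dataframe index. This is only a minimal sketch; the `Agg` backend line is my own assumption so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame([1, 3, 3, 5, 10, 20, 11, 7, 2, 3, 1], range(-5, 6))
ax = df.plot(kind='barh')

# The bars sit at 0, 1, 2, ... no matter what the index values are.
tick_positions = [float(t) for t in ax.get_yticks()]
print("tick positions:", tick_positions)
print("index values:  ", list(df.index))
```

The positions and the index values differ, which is why a numeric locator alone ends up pairing the wrong label with each tick.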
A. Pandas solution
In addition to setting the locator you would need to set a formatter, which returns the correct values as function of the location. In the case of a dataframe index with successive integers as used here, this can be done by adding the minimum index to the location using a FuncFormatter. For other cases, the function for the FuncFormatter may become more complicated.
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.ticker import (MultipleLocator, MaxNLocator,
                               FuncFormatter, ScalarFormatter)
df = pd.DataFrame([1, 3, 3, 5, 10, 20, 11, 7, 2, 3, 1], range(-5, 6))
ax = df.plot(kind='barh')
ax.yaxis.set_major_locator(MultipleLocator(2))
sf = ScalarFormatter()
sf.create_dummy_axis()
sf.set_locs((df.index.max(), df.index.min()))
ax.yaxis.set_major_formatter(FuncFormatter(lambda x,p: sf(x+df.index[0])))
plt.show()
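For an index that is not made of successive integers, one possible approach is to look the label up by bar position instead of adding an offset. The sketch below is an assumption on my part rather than part of the original answer: the data, the helper name `label_for_position`, and the `Agg` backend are all made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.ticker import MultipleLocator, FuncFormatter

# Made-up data with a non-successive index.
df = pd.DataFrame([1, 3, 3, 5, 10], index=[-5, -3, 0, 4, 9])
labels = list(df.index)

def label_for_position(x, pos=None):
    """Hypothetical helper: index label for bar position x, '' if there is no bar."""
    i = int(round(x))
    if abs(x - i) < 1e-9 and 0 <= i < len(labels):
        return str(labels[i])
    return ''

ax = df.plot(kind='barh')
ax.yaxis.set_major_locator(MultipleLocator(2))
ax.yaxis.set_major_formatter(FuncFormatter(label_for_position))
```

Ticks that fall outside the bars, or between them, simply get an empty label.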
B. Matplotlib solution
Using matplotlib, the solution is potentially easier. Since matplotlib bar plots are numeric in nature, they position the bars at the locations given to the first argument. Here, setting a locator alone is sufficient.
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.ticker import MultipleLocator, MaxNLocator
df = pd.DataFrame([1, 3, 3, 5, 10, 20, 11, 7, 2, 3, 1], range(-5, 6))
fig, ax = plt.subplots()
ax.barh(df.index, df.values[:,0])
ax.yaxis.set_major_locator(MultipleLocator(2))
plt.show()

Related

Plotting Time Series with Matplotlib: Using datetime.datetime() works but datetime.datetime.strptime(str, format) does not

I wish to plot some data in a bar graph using matplotlib. The x-values of the plot should be datetime.datetime objects so that matplotlib can use them. If I generate the time values with the following method it works:
import matplotlib.pyplot as plt
import datetime
x = [datetime.datetime(2010, 12, 1, 10, 0),
     datetime.datetime(2011, 1, 4, 9, 0),
     datetime.datetime(2011, 5, 5, 9, 0)]
y = [4, 9, 2]
ax = plt.subplot(111)
ax.bar(x, y, width=10)
ax.xaxis_date()
plt.show()
Yielding this plot:
However if I use the following method it does not work:
import matplotlib.pyplot as plt
import datetime
dates = ["2020-05-11 18:25:37", "2020-05-11 18:25:40", "2020-05-11 18:25:43",
         "2020-05-11 18:25:46", "2020-05-11 18:25:49"]
X = []
for date in dates:
    X.append(datetime.datetime.strptime(date, '%Y-%m-%d %H:%M:%S'))
Y = [1, 3, 4, 6, 4]
ax = plt.subplot(111)
ax.bar(X, Y, width=10)
ax.xaxis_date()
plt.show()
Yielding this abomination:
I am obviously missing something here but it appears to me that the results should be the same for:
datetime.datetime(2010, 12 ,1 ,10, 0)
datetime.datetime.strptime('2010-12-01 10:00:00','%Y-%m-%d %H:%M:%S')

You can use date2num to convert the dates to matplotlib's date format.
Then plot the dates and values using plot_date:
import matplotlib.dates
import matplotlib.pyplot
X = matplotlib.dates.date2num(X)
matplotlib.pyplot.plot_date(X, Y)
matplotlib.pyplot.show()
Or, if you already have an axes object:
ax.plot_date(X, Y)

I could not get the bar graph to display because of an error, but a line graph works and shows the x-axis in minutes:seconds. Sorry I can't help with the bar graph itself.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import matplotlib.ticker as ticker
fig = plt.figure(figsize=(8,4.5))
ax = fig.add_subplot(111)
dates = ["2020-05-11 18:25:37", "2020-05-11 18:25:40", "2020-05-11 18:25:43",
         "2020-05-11 18:25:46", "2020-05-11 18:25:49"]
Y = [1, 3, 4, 6, 4]
df = pd.DataFrame({'Y':Y}, index=pd.to_datetime(dates))
ax.plot(df.index, df['Y'])
xloc = mdates.SecondLocator(interval=1)
xfmt = mdates.DateFormatter("%M:%S")
ax.xaxis.set_major_locator(xloc)
ax.xaxis.set_major_formatter(xfmt)
plt.show()

The problem was caused by the fact that, on a datetime axis, matplotlib expresses the bar width in units of one day. So if width = 1, a bar is one day wide on the x-axis, and width=10 as in the question makes each bar ten days wide, far wider than timestamps that are only seconds apart.
This was resolved by making the width a fraction of a day appropriate for the time scale used, in this case 3 seconds. For example, if you want each bar to occupy 3 seconds on the x-axis, set the width to 3 seconds expressed as a fraction of a whole day:
# NB: There are 86400 seconds in a day and I want a width of 3 seconds.
ax.bar(X, Y, width=(1/86400)*3)
If you wish for the bars to not touch, you should reduce the width of the bars to less than the interval between timestamps as plotted on the x-axis. Further, if you wish to dynamically determine the minimum interval between timestamps, please look at this post.
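As a rough sketch of that calculation, the width can be derived from the timestamps themselves with timedelta arithmetic. This assumes the timestamps are already sorted; the 0.8 shrink factor and the variable names are my own choices, not part of the original answer.

```python
import datetime

dates = ["2020-05-11 18:25:37", "2020-05-11 18:25:40", "2020-05-11 18:25:43",
         "2020-05-11 18:25:46", "2020-05-11 18:25:49"]
X = [datetime.datetime.strptime(d, '%Y-%m-%d %H:%M:%S') for d in dates]

# Smallest gap between consecutive timestamps (assumes X is sorted).
min_gap = min(later - earlier for earlier, later in zip(X, X[1:]))

# Dividing two timedeltas gives a plain float: the gap as a fraction of a day.
gap_in_days = min_gap / datetime.timedelta(days=1)

# Arbitrary shrink factor so that neighbouring bars do not touch.
width = 0.8 * gap_in_days
print(min_gap, gap_in_days, width)
```

The resulting `width` can then be passed on as `ax.bar(X, Y, width=width)`.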

How to show the plot with correlations

I have a data frame and I want to plot a figure like this. I tried in both R and Python, but could not manage it. Can anybody help me plot this data?
Thank you.
This is my simple data and code:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.DataFrame([[1, 4, 5, 12, 5, 2,2], [-5, 8, 9, 0,2,1,8],[-6, 7, 11, 19,1,2,5],[-5, 1, 3, 7,5,2,5],[-5, 7, 3, 7,6,2,9],[2, 7, 9, 7,6,2,8]])
sns.pairplot(data)
plt.show()

sns.pairplot() is a simple function aimed at creating a pair-plot easily using the default settings. If you want more flexibility in terms of the kind of plots you want in the figure, then you have to use PairGrid directly:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.DataFrame(np.random.normal(size=(1000,4)))
def remove_ax(*args, **kwargs):
    # Hide the current axes; used to blank out the upper triangle.
    plt.axis('off')
g = sns.PairGrid(data=data)
g.map_diag(plt.hist)
g.map_lower(sns.kdeplot)
g.map_upper(remove_ax)

Altair mark_line plots noisier than matplotlib?

I am learning altair to add interactivity to my plots. I am trying to recreate a plot I do in matplotlib, however altair is adding noise to my curves.
This is my dataset, df1, linked here from GitHub: https://raw.githubusercontent.com/leoUninova/Transistor-altair-plots/master/df1.csv
This is the code:
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(8, 6))
for key, grp in df1.groupby('Name'):
    y = grp.logabsID
    x = grp.VG
    ax.plot(x, y, label=key)
plt.legend(loc='best')
plt.show()
#doing it directly from link
df1='https://raw.githubusercontent.com/leoUninova/Transistor-altair-plots/master/df1.csv'
import altair as alt
alt.Chart(df1).mark_line(size=1).encode(
    x='VG:Q',
    y='logabsID:Q',
    color='Name:N'
)
Here is the image of the plots I am generating:
matplotlib vs altair plot
How do I remove the noise from altair?

Altair sorts the x axis before drawing lines, so if you have multiple lines in one group it will often lead to "noise", as you call it. This is not noise, but rather an accurate representation of all the points in your dataset shown in the default sort order. Here is a simple example:
import numpy as np
import pandas as pd
import altair as alt
df = pd.DataFrame({
    'x': [1, 2, 3, 4, 5, 5, 4, 3, 2, 1],
    'y': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'group': [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
})
alt.Chart(df).mark_line().encode(
    x='x:Q',
    y='y:Q'
)
The best way to fix this is to set the detail encoding to a column that distinguishes between the different lines that you would like to be drawn individually:
alt.Chart(df).mark_line().encode(
    x='x:Q',
    y='y:Q',
    detail='group:N'
)
If it is not the grouping that is important, but rather the order of the points, you can specify that by instead providing an order channel:
alt.Chart(df.reset_index()).mark_line().encode(
    x='x:Q',
    y='y:Q',
    order='index:Q'
)
Notice that the two lines are connected on the right end. This is effectively what matplotlib does by default: it maintains the index order even if there is repeated data. Using the order channel for your data produces the result you're looking for:
df1 = pd.read_csv('https://raw.githubusercontent.com/leoUninova/Transistor-altair-plots/master/df1.csv')
alt.Chart(df1.reset_index()).mark_line(size=1).encode(
    x='VG:Q',
    y='logabsID:Q',
    color='Name:N',
    order='index:Q'
)
The multiple lines in each group are drawn in order connected at the ends, just as they are in matplotlib.

Show all lines in matplotlib line plot

How do I bring the other line to the front or show both the graphs together?
plot_yield_df.plot(figsize=(20,20))

If the plotted data overlaps, one way to see both series is to increase the linewidth of one of them and make it semi-transparent, as shown:
import numpy as np
import matplotlib.pyplot as plt
plt.plot(np.arange(5), [5, 8, 6, 9, 4], label='Original', linewidth=5, alpha=0.5)
plt.plot(np.arange(5), [5, 8, 6, 9, 4], label='Predicted')
plt.legend()
Subplots are another good option.
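For the subplot route, pandas can draw one axes per column directly. This is a minimal sketch with made-up data standing in for plot_yield_df; the column names and the `Agg` backend are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for plot_yield_df: two columns with identical values.
df = pd.DataFrame({"Original": [5, 8, 6, 9, 4],
                   "Predicted": [5, 8, 6, 9, 4]})

# subplots=True gives each column its own axes, so nothing is hidden.
axes = df.plot(subplots=True, sharex=True)
print(len(axes))
```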

Problem
The lines are plotted in the order their columns appear in the dataframe. So for example
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
a = np.random.rand(400)*0.9
b = np.random.rand(400)+1
a = np.c_[a,-a].flatten()
b = np.c_[b,-b].flatten()
df = pd.DataFrame({"A" : a, "B" : b})
df.plot()
plt.show()
Here the values of "B" hide those from "A".
Solution 1: Reverse column order
A solution is to reverse their order
df[df.columns[::-1]].plot()
That has also changed the order in the legend and the color coding.
Solution 2: Reverse z-order
So if that is not desired, you can instead play with the zorder.
ax = df.plot()
lines = ax.get_lines()
# Give earlier lines a higher zorder so they are drawn on top.
for line, j in zip(lines, reversed(range(len(lines)))):
    line.set_zorder(j)

matplotlib plot (not scatter) color based on extra variable

Good day,
I am trying to plot two arrays (timesin and locations) as a scatter of points. However, because timesin is a datetime object (of which I want the time only), I find that I can only plot it properly using pyplot.plot(), not pyplot.scatter(). The issue arises when I want to color the points on this plot with a third variable, idx. I know pyplot.scatter() is capable of doing this quite easily, but I don't know how to do it with pyplot.plot().
My excerpt of code:
import os
import tempfile
from datetime import datetime
import numpy as np
os.environ['MPLCONFIGDIR'] = tempfile.mkdtemp()
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages
pp = PdfPages('output.pdf')
names = ['WestHall', 'Elevator', 'EastHall', 'MathLounge']
locations = np.arange(4)+1
plt.scatter(timesin, locations, c=idx, marker="o")
plt.xlabel("Time of day")
plt.ylabel("Location")
plt.yticks(np.arange(4)+1, names)
plt.gcf().autofmt_xdate()
pp.savefig()
plt.close()
pp.close()
When I try this, I get an error, because it tries to interpret idx as rgba:
ValueError: to_rgba: Invalid rgba arg "[...]"
number in rbg sequence outside 0-1 range
How do I get it to interpret idx as conditional coloring without using pyplot.scatter()?
Thanks
Update:
As suggested by Hun, I actually can use pyplot.scatter() in this context by converting the datetime objects to numbers using matplotlib's dates module. Thus, figuring out how to use pyplot.plot() for conditional coloring was unnecessary.

It would be easier if you use plt.scatter(). But you need to convert the datetime to something scatter() can understand. There is a way to do it.
>>> dt # datetime numpy array
array(['2005-02-01', '2005-02-02', '2005-02-03', '2005-02-04'], dtype='datetime64[D]')
>>> dt.tolist() # needs to be converted to a list
[datetime.date(2005, 2, 1), datetime.date(2005, 2, 2), datetime.date(2005, 2, 3), datetime.date(2005, 2, 4)]
# Convert the list to matplotlib's internal date representation, which is a float.
# The exact numbers depend on matplotlib's date epoch, whose default changed in Matplotlib 3.3.
>>> dt1 = matplotlib.dates.date2num(dt.tolist())
>>> dt1
array([ 731978., 731979., 731980., 731981.])
With this dt1 you can use plt.scatter().
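Putting this together with the question's setup might look like the sketch below. The values of timesin, locations and idx are made up here, since the question does not show them, and the colorbar and `%H:%M` format are my own additions.

```python
import datetime
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import numpy as np

# Made-up stand-ins for the question's timesin, locations and idx.
timesin = [datetime.datetime(2016, 10, 19, 10, m) for m in (0, 5, 10, 15)]
locations = np.arange(4) + 1
idx = [0, 1, 2, 3]

x = mdates.date2num(timesin)  # datetimes converted to floats

fig, ax = plt.subplots()
sc = ax.scatter(x, locations, c=idx, marker="o")  # c=idx is mapped through a colormap
ax.xaxis_date()  # interpret the x values as dates again
ax.xaxis.set_major_formatter(mdates.DateFormatter("%H:%M"))  # show the time only
fig.colorbar(sc, ax=ax, label="idx")
fig.autofmt_xdate()
```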

I think that it is not possible to do this at once with matplotlib.pyplot.plot. However, here is my solution:
import matplotlib.pyplot as plt
import numpy as np
from matplotlib import cm
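def scatterplot(x, y, prop):
    # Normalize prop to [0, 1] and map it to RGBA colors with the jet colormap.
    prop = cm.jet((prop - np.min(prop)) / (np.max(prop) - np.min(prop)))
    ax = plt.gca()
    # Plot each point separately so that it can have its own color.
    for i in range(len(x)):
        ax.plot(x[i], y[i], color=prop[i], marker='o')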
x = np.random.rand(100)
y = np.random.rand(100)
prop = -20+40*np.random.rand(100)
fig = plt.figure(1,figsize=(5,5))
ax = fig.add_subplot(111)
scatterplot(x,y,prop)
plt.show()
which produces:
The only drawback of this approach is that if you have many points, looping over all of them can be relatively slow.
EDIT (in response to @Nathan Goedeke's comment):
I tried the same implementation but using a datetime object:
import numpy as np
import matplotlib.pyplot as plt
import datetime as dt
x = np.array([dt.datetime(2016, 10, 19, 10, 0, 0),
              dt.datetime(2016, 10, 19, 10, 0, 1),
              dt.datetime(2016, 10, 19, 10, 0, 2),
              dt.datetime(2016, 10, 19, 10, 0, 3)])
fig = plt.figure()
y = np.array([1, 3, 4, 2])
prop = np.array([2.,5.,3.,1.])
scatterplot(x,y,prop)
plt.show()
and it works as well.
