What is plotted when string data is passed to the matplotlib API?

# first, some imports:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Let's say I want to make a scatter plot, using this data:
np.random.seed(42)
x=np.arange(0,50)
y=np.random.normal(loc=3000,scale=1,size=50)
Plot via:
plt.scatter(x,y)
I get this answer:
Ok, let's create a dataframe first:
df=pd.DataFrame.from_dict({'x':x,'y':y.astype(str)})
(I am aware that I am storing y as str - this is a reproducible example, and I do this to reflect the real use case.)
Then, if I do:
plt.scatter(df.x,df.y)
I get:
What am I seeing in this second plot? I expected it to show the x column plotted against the y column, with the y values converted to float, but that is clearly not the case.

Matplotlib doesn't automatically convert str values to numbers, so your y values are treated as categorical. As far as Matplotlib is concerned, the step from '1.0' to '0.9' is no different from the step from '1.0' to '100.0'.
So the y-axis positions on the plot are the same as range(len(y)), assuming all the strings are distinct (every categorical value sits the same distance from the next), with the tick labels taken from the categorical values.
Since your x is equal to range(50), and your y positions are now also equal to range(50), it plots x = y, with the y tick labels set to the respective str values.
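If the numeric plot is what you are after, one option is to convert the strings back to numbers before plotting. This is a minimal sketch, assuming every string parses as a float; pd.to_numeric is just one way to do it (df.y.astype(float) would work here as well), and the Agg backend line is only there so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# same data as in the question: y is stored as strings
np.random.seed(42)
x = np.arange(0, 50)
y = np.random.normal(loc=3000, scale=1, size=50)
df = pd.DataFrame({"x": x, "y": y.astype(str)})

# convert the string column back to floats before handing it to matplotlib
y_numeric = pd.to_numeric(df["y"])

fig, ax = plt.subplots()
ax.scatter(df["x"], y_numeric)  # y-axis is now a continuous numeric scale
```

With the conversion in place, the y-axis limits follow the actual values (around 3000) instead of the category positions.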

As per the excellent answer by dm2, when you pass y as a string, y is simply being treated as arbitrary string labels, and being plotted one after the other in the order in which they appear. To demonstrate, here's an even simpler example.
from matplotlib import pyplot as plt
x = [1, 2, 3, 4]
y = [5, 25, 10, 1] # these are ints
plt.scatter(x, y)
So far so good. Now, different string y values.
y = list("abcd")
plt.scatter(x, y)
You can see how it just takes the y labels and just drops them on the axis one after another.
Finally,
y = ["5", "25", "10", "1"]
plt.scatter(x, y)
Compare this with the previous results and now it should become obvious what's going on.
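To see the mapping directly rather than reading it off the picture, the tick locations and labels of the last plot can be printed. This is a small sketch of that check; the Agg backend and the explicit draw call are my additions so that it runs headless and the labels are populated.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = ["5", "25", "10", "1"]  # strings, not ints

fig, ax = plt.subplots()
ax.scatter(x, y)
fig.canvas.draw()  # make sure the tick labels are populated

# numeric positions matplotlib assigned to each distinct string
tick_locs = [int(loc) for loc in ax.get_yticks()]
# the strings themselves, used only as labels
tick_labels = [label.get_text() for label in ax.get_yticklabels()]

print(tick_locs)
print(tick_labels)
```

The locations count up in order of first appearance, and the strings are attached as labels without being parsed as numbers.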

It's more obvious when the tick labels and locations are extracted: the API plots the strings as labels, and the axis locations are 0-indexed integers, one for each distinct category.
.get_xticks() and .get_yticks() extract a list of the numeric locations.
.get_xticklabels() and .get_yticklabels() extract a list of matplotlib.text.Text, Text(x, y, text).
There are fewer numbers in the list for the y axis because there were duplicate values as a result of rounding.
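The effect of duplicates can be reproduced on a tiny example. The values below are hypothetical, chosen only so that one string repeats; the Agg backend line is there so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt

# hypothetical rounded values stored as strings; "3000.1" appears twice
y = ["3000.1", "2999.8", "3000.1", "3000.4"]
x = list(range(len(y)))

fig, ax = plt.subplots()
ax.scatter(x, y)
fig.canvas.draw()  # make sure the tick labels are populated

n_points = len(y)
n_ticks = len(ax.get_yticks())
labels = [label.get_text() for label in ax.get_yticklabels()]
print(n_points, n_ticks, labels)
```

Four points are plotted, but the repeated string shares one category, so the axis has one tick fewer than there are points.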
This applies to any API that uses matplotlib as its backend, such as seaborn or pandas.
sns.scatterplot(data=df, x='x_num', y='y', ax=ax1)
ax1.scatter(data=df, x='x_num', y='y')
ax1.plot('x_num', 'y', 'o', data=df)
Labels, Locs, and Text
print(x_nums_loc)
print(y_nums_loc)
print(x_lets_loc)
print(y_lets_loc)
print(x_lets_labels)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
[Text(0, 0, 'A'), Text(1, 0, 'B'), Text(2, 0, 'C'), Text(3, 0, 'D'), Text(4, 0, 'E'),
Text(5, 0, 'F'), Text(6, 0, 'G'), Text(7, 0, 'H'), Text(8, 0, 'I'), Text(9, 0, 'J'),
Text(10, 0, 'K'), Text(11, 0, 'L'), Text(12, 0, 'M'), Text(13, 0, 'N'), Text(14, 0, 'O'),
Text(15, 0, 'P'), Text(16, 0, 'Q'), Text(17, 0, 'R'), Text(18, 0, 'S'), Text(19, 0, 'T'),
Text(20, 0, 'U'), Text(21, 0, 'V'), Text(22, 0, 'W'), Text(23, 0, 'X'), Text(24, 0, 'Y'),
Text(25, 0, 'Z')]
Imports, Data, and Plotting
import string
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# sample data
np.random.seed(45)
x_numbers = np.arange(100, 126)
x_letters = list(string.ascii_uppercase)
y = np.random.normal(loc=3000, scale=1, size=26).round(2)
df = pd.DataFrame.from_dict({'x_num': x_numbers, 'x_let': x_letters, 'y': y}).astype(str)
# plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 3.5))
df.plot(kind='scatter', x='x_num', y='y', ax=ax1, title='X Numbers', rot=90)
df.plot(kind='scatter', x='x_let', y='y', ax=ax2, title='X Letters')
x_nums_loc = ax1.get_xticks()
y_nums_loc = ax1.get_yticks()
x_lets_loc = ax2.get_xticks()
y_lets_loc = ax2.get_yticks()
x_lets_labels = ax2.get_xticklabels()
fig.tight_layout()
plt.show()

Related

Bar graph df.plot() vs ax.bar() structure matplotlib

I am trying to graph a table as a bar graph.
I get my desired outcome using df.plot(kind='bar') structure. But for certain reasons, I now need to graph it using the ax.bar() structure.
Please refer to the example screenshot. I would like to graph the x axis as categorical labels like the df.plot(kind='bar') structure rather than continuous scale, but need to learn to use ax.bar() structure to do the same.

Make the index categorical by setting the type to 'str'
import pandas as pd
import matplotlib.pyplot as plt
data = {'SA': [11, 12, 13, 16, 17, 159, 209, 216],
'ET': [36, 45, 11, 15, 16, 4, 11, 10],
'UT': [11, 26, 10, 11, 16, 7, 2, 2],
'CT': [5, 0.3, 9, 5, 0.2, 0.2, 3, 4]}
df = pd.DataFrame(data)
df['SA'] = df['SA'].astype('str')
df.set_index('SA', inplace=True)
width = 3
fig, ax = plt.subplots(figsize=(12, 8))
p1 = ax.bar(df.index, df.ET, color='b', label='ET')
p2 = ax.bar(df.index, df.UT, bottom=df.ET, color='g', label='UT')
p3 = ax.bar(df.index, df.CT, bottom=df.ET+df.UT, color='r', label='CT')
plt.legend()
plt.show()
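An alternative that keeps SA numeric is to draw the bars at evenly spaced integer positions and set the tick labels by hand. This is a sketch of that idea, not part of the answer above; the names positions and labels are my own, and the Agg backend line is only there so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

data = {'SA': [11, 12, 13, 16, 17, 159, 209, 216],
        'ET': [36, 45, 11, 15, 16, 4, 11, 10],
        'UT': [11, 26, 10, 11, 16, 7, 2, 2],
        'CT': [5, 0.3, 9, 5, 0.2, 0.2, 3, 4]}
df = pd.DataFrame(data)

# evenly spaced bar positions, independent of the SA values
positions = np.arange(len(df))
labels = df['SA'].astype(str).tolist()

fig, ax = plt.subplots(figsize=(12, 8))
ax.bar(positions, df['ET'], color='b', label='ET')
ax.bar(positions, df['UT'], bottom=df['ET'], color='g', label='UT')
ax.bar(positions, df['CT'], bottom=df['ET'] + df['UT'], color='r', label='CT')

# show the SA values as labels at those positions
ax.set_xticks(positions)
ax.set_xticklabels(labels)
ax.legend()
```

The result should look the same as the string-index version, while SA stays numeric in the dataframe for any later calculations.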

Timeline bar using matplotlib & PolyCollection - Python

I have been trying to replicate #theimportanceofbeingernest's answer to "Timeline bar graph using python and matplotlib"
and can't seem to get the correct output graph.
Here is my current output
Here is my desired output (but with using my data etc.)
I'm struggling to identify the issue.
Any help will be greatly appreciated!
Thank you.
Here's the code:
import datetime as dt
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from matplotlib.collections import PolyCollection
data = [(dt.datetime(1900, 1, 1, 14, 19, 26), dt.datetime(1900, 1, 1, 14, 19, 29), 'index'),
(dt.datetime(1900, 1, 1, 14, 19, 29), dt.datetime(1900, 1, 1, 14, 19, 31), 'links'),
(dt.datetime(1900, 1, 1, 14, 19, 31), dt.datetime(1900, 1, 1, 14, 19, 33), 'guides'),
(dt.datetime(1900, 1, 1, 14, 19, 33), dt.datetime(1900, 1, 1, 14, 19, 35), 'prices'),
(dt.datetime(1900, 1, 1, 14, 19, 35), dt.datetime(1900, 1, 1, 16, 39, 47), 'index'),
(dt.datetime(1900, 1, 1, 16, 39, 47), dt.datetime(1900, 1, 1, 16, 39, 48), 'prices')]
cats = {'index': 1, 'links': 2, 'guides': 3, 'prices': 4}
colormapping = {'index': 'C0', 'links': 'C1', 'guides': 'C2', 'prices': 'C3'}
verts = []
colors = []
for d in data:
    v = [(mdates.date2num(d[0]), cats[d[2]]-.4),
         (mdates.date2num(d[0]), cats[d[2]]+.4),
         (mdates.date2num(d[1]), cats[d[2]]+.4),
         (mdates.date2num(d[1]), cats[d[2]]-.4),
         (mdates.date2num(d[0]), cats[d[2]]-.4)]
    verts.append(v)
    colors.append(colormapping[d[2]])
bars = PolyCollection(verts, facecolors=colors)
fig, ax = plt.subplots()
ax.add_collection(bars)
ax.autoscale()
loc = mdates.MinuteLocator(byminute=[0,30])
ax.xaxis.set_major_locator(loc)
ax.xaxis.set_major_formatter(mdates.AutoDateFormatter(loc))
ax.set_yticks([1,2,3,4])
ax.set_yticklabels(['index', 'links', 'guides', 'prices'])
plt.show()

Your time differences are extremely short. They are a few seconds, while the x-range is a few hours. So these bars basically become invisible.
Note that in matplotlib areas are usually drawn without antialiasing, which is useful when putting together multiple semitransparent areas. Lines, however, are drawn with some thickness (in screenspace) and antialiased. Therefore, setting an explicit edgecolor helps to visualize these "bars".
bars = PolyCollection(verts, facecolors=colors, edgecolors=colors)
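To put numbers on how short the bars are, here is a quick check using only the standard library. It reuses the data list from the question; the variable names are mine.

```python
import datetime as dt

data = [(dt.datetime(1900, 1, 1, 14, 19, 26), dt.datetime(1900, 1, 1, 14, 19, 29), 'index'),
        (dt.datetime(1900, 1, 1, 14, 19, 29), dt.datetime(1900, 1, 1, 14, 19, 31), 'links'),
        (dt.datetime(1900, 1, 1, 14, 19, 31), dt.datetime(1900, 1, 1, 14, 19, 33), 'guides'),
        (dt.datetime(1900, 1, 1, 14, 19, 33), dt.datetime(1900, 1, 1, 14, 19, 35), 'prices'),
        (dt.datetime(1900, 1, 1, 14, 19, 35), dt.datetime(1900, 1, 1, 16, 39, 47), 'index'),
        (dt.datetime(1900, 1, 1, 16, 39, 47), dt.datetime(1900, 1, 1, 16, 39, 48), 'prices')]

# full extent of the x-axis after autoscale, in seconds
span_start = min(start for start, _, _ in data)
span_end = max(end for _, end, _ in data)
total_seconds = (span_end - span_start).total_seconds()

# length of each bar, in seconds
durations = [(end - start).total_seconds() for start, end, _ in data]

for (_, _, name), duration in zip(data, durations):
    share = duration / total_seconds
    print(f"{name:>6}: {duration:6.0f} s = {share:.4%} of the x-range")
```

All bars except the long 'index' one cover a tiny fraction of a percent of the axis, which on a typical figure is well under one pixel wide.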

Can't plot heatmap in Bokeh with datetime x axis

I'm trying to plot the following simple heatmap:
data = {
'value': [1, 2, 3, 4, 5, 6],
'x': [datetime(2016, 10, 25, 0, 0),
datetime(2016, 10, 25, 8, 0),
datetime(2016, 10, 25, 16, 0),
datetime(2016, 10, 25, 0, 0),
datetime(2016, 10, 25, 8, 0),
datetime(2016, 10, 25, 16, 0)],
'y': ['param1', 'param1', 'param1', 'param2', 'param2', 'param2']
}
hm = HeatMap(data, x='x', y='y', values='value', stat=None)
output_file('heatmap.html')
show(hm)
Unfortunately it doesn't render properly:
I've tried setting x_range but nothing seems to work.
I've managed to get something working with the following code:
d1 = data['x'][0]
d2 = data['x'][-1]
p = figure(
x_axis_type="datetime", x_range=(d1, d2), y_range=data['y'],
tools='xpan, xwheel_zoom, reset, save, resize,'
)
p.rect(
source=ColumnDataSource(data), x='x', y='y', width=12000000, height=1,
)
However as soon as I try to use the zoom tool, I get the following errors in console:
Uncaught Error: Number property 'start' given invalid value:
Uncaught TypeError: Cannot read property 'indexOf' of null
I'm using Bokeh 0.12.3.

The bokeh.charts API, including HeatMap, was deprecated and removed in 2017. You should use the stable and supported bokeh.plotting API. With your data above, a complete example:
from datetime import datetime
from bokeh.plotting import figure, show
from bokeh.transform import linear_cmap
data = {
'value': [1, 2, 3, 4, 5, 6],
'x': [datetime(2016, 10, 25, 0, 0),
datetime(2016, 10, 25, 8, 0),
datetime(2016, 10, 25, 16, 0),
datetime(2016, 10, 25, 0, 0),
datetime(2016, 10, 25, 8, 0),
datetime(2016, 10, 25, 16, 0)],
'y': ['param1', 'param1', 'param1', 'param2', 'param2', 'param2']
}
p = figure(x_axis_type='datetime', y_range=('param1', 'param2'))
EIGHT_HOURS = 8*60*60*1000
p.rect(x='x', y='y', width=EIGHT_HOURS, height=1, line_color="white",
fill_color=linear_cmap('value', 'Spectral6', 1, 6), source=data)
show(p)
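The width is given in milliseconds because Bokeh datetime axes are in units of milliseconds since the epoch. A small helper like the one below (hypothetical, not part of Bokeh) can make such widths less error-prone than spelling out the multiplication.

```python
from datetime import timedelta


def width_ms(delta):
    """Convert a timedelta to milliseconds, the unit of a Bokeh datetime axis."""
    return delta.total_seconds() * 1000


EIGHT_HOURS = width_ms(timedelta(hours=8))

# the width used in the question, converted back to minutes for comparison
question_width_minutes = 12000000 / 1000 / 60

print(EIGHT_HOURS, question_width_minutes)
```

By the same conversion, the width=12000000 in the question is 200 minutes, so those rectangles were narrower than the 8-hour spacing between the x values.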

Attempting to make a multi-column graph

I am trying to make a column graph where the y-axis is the mean grain size, the x-axis is the distance along the transect, and each series is a date and/or number value (it doesn't really matter).
I have been trying a few different methods in Excel 2010 but I cannot figure it out. My hope is that, let's say at the first location, 9, there will be three columns, and then at 12 there will be two columns. If it matters at all, let's say the total distance is 50. The result of this data should have 7 sets of columns along the transect/x-axis.
I have tried to do this using python but my coding knowledge is close to nil. Here is my code so far:
import numpy as np
import matplotlib.pyplot as plt
grainsize = [0.7912, 0.513, 0.4644, 1.0852, 1.8515, 1.812, 6.371, 1.602, 1.0251, 5.6884, 0.4166, 24.8669, 0.5223, 37.387, 0.5159, 0.6727]
series = [2, 3, 4, 1, 4, 2, 3, 4, 1, 4, 1, 4, 1, 4, 1, 4]
distance = [9, 9, 9, 12, 12, 15, 15, 15, 17, 17, 25, 25, 32.5, 32.5, 39.5, 39.5]
If someone happens to know of code to use, it would be very helpful. A recommendation for how to do this in Excel would be awesome too.

There's a plotting library called seaborn, built on top of matplotlib, that does this in one line. Your example:
import numpy as np
import seaborn as sns
from matplotlib.pyplot import show
grainsize = [0.7912, 0.513, 0.4644, 1.0852, 1.8515, 1.812, 6.371,
1.602, 1.0251, 5.6884, 0.4166, 24.8669, 0.5223, 37.387, 0.5159, 0.6727]
series = [2, 3, 4, 1, 4, 2, 3, 4, 1, 4, 1, 4, 1, 4, 1, 4]
distance = [9, 9, 9, 12, 12, 15, 15, 15, 17, 17, 25, 25, 32.5, 32.5, 39.5, 39.5]
ax = sns.barplot(x=distance, y=grainsize, hue=series, palette='muted')
ax.set_xlabel('distance')
ax.set_ylabel('grainsize')
show()
You will be able to do a lot even as a total newbie by editing the many examples in the seaborn gallery. Use them as training wheels: edit only one thing at a time and think about what changes.
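If you would rather stay with pandas and matplotlib, a similar grouped chart can be sketched by pivoting the data so that each series becomes its own column. This is an alternative to the seaborn call, not what the answer above uses; the column names and the table variable are my own, and the Agg backend line is only there so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import pandas as pd

grainsize = [0.7912, 0.513, 0.4644, 1.0852, 1.8515, 1.812, 6.371,
             1.602, 1.0251, 5.6884, 0.4166, 24.8669, 0.5223, 37.387, 0.5159, 0.6727]
series = [2, 3, 4, 1, 4, 2, 3, 4, 1, 4, 1, 4, 1, 4, 1, 4]
distance = [9, 9, 9, 12, 12, 15, 15, 15, 17, 17, 25, 25, 32.5, 32.5, 39.5, 39.5]

df = pd.DataFrame({'distance': distance, 'series': series, 'grainsize': grainsize})

# one row per distance, one column per series; missing combinations become NaN
table = df.pivot(index='distance', columns='series', values='grainsize')

# each row becomes a group of bars, each column a colour
ax = table.plot(kind='bar')
ax.set_xlabel('distance')
ax.set_ylabel('grainsize')
```

The pivoted table has one row for each of the 7 distances, which matches the 7 sets of columns asked for in the question.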

Generating an array of dates in python

I am writing a python script that produces a bar graph of data between two dates specified by the user.
For example, here the user enters 30 November and 4 December:
import datetime as dt
dateBegin = dt.date(2012,11,30)
dateEnd = dt.date(2012,12,4)
Is there a way to return an array of the dates between dateBegin and dateEnd?
What I want is something like [30, 1, 2, 3, 4]. Any suggestions?

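Sure! You are looking for matplotlib.dates.drange. Note that drange excludes the end date, so the output below stops at December 3; pass an end date one day later if December 4 should be included.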
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import matplotlib.ticker as ticker
import datetime as DT
dates = mdates.num2date(mdates.drange(DT.datetime(2012, 11, 30),
DT.datetime(2012, 12, 4),
DT.timedelta(days=1)))
print(dates)
# [datetime.datetime(2012, 11, 30, 0, 0, tzinfo=<matplotlib.dates._UTC object at 0x8c8f8ec>), datetime.datetime(2012, 12, 1, 0, 0, tzinfo=<matplotlib.dates._UTC object at 0x8c8f8ec>), datetime.datetime(2012, 12, 2, 0, 0, tzinfo=<matplotlib.dates._UTC object at 0x8c8f8ec>), datetime.datetime(2012, 12, 3, 0, 0, tzinfo=<matplotlib.dates._UTC object at 0x8c8f8ec>)]
vals = np.random.randint(10, size=len(dates))
fig, ax = plt.subplots()
ax.bar(dates, vals, align='center')
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
plt.xticks(rotation=25)
ax.set_xticks(dates)
plt.show()
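If all you need is the list of day numbers from the question, including the end date, the standard library is enough. This is a sketch of that approach rather than a replacement for drange; the snake_case names are my own.

```python
import datetime as dt

date_begin = dt.date(2012, 11, 30)
date_end = dt.date(2012, 12, 4)

# number of whole days between the two dates
n_days = (date_end - date_begin).days

# + 1 so that the end date itself is included
dates = [date_begin + dt.timedelta(days=i) for i in range(n_days + 1)]

# just the day-of-month numbers
day_numbers = [d.day for d in dates]

print(dates)
print(day_numbers)
```

The dates list holds datetime.date objects, which matplotlib can also take directly as x values for ax.bar.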
