Setting legend labels to dates in Python - python

In short, I have (a couple of days worth of) glucose values plotted against their timestamps. I have written a function which then layers the glucose values on the same x-axis so I can look for glucose trends. Ultimately, that means that glucose data from a couple of days is plotted with different lines, resulting in the graph below:
Currently, the label says 'Glucose reading' for every color. I would like each line's label to show the date of the data being plotted (2019-11-21, 2019-11-22, and so on). I'm really not sure how to do it since I've never dealt with matplotlib legends before and I couldn't find any useful solutions.
Any guidance would be much appreciated!
EDIT 1:
I am using pandas dataframe. Minimal code example - My legend is positioned in a plotting function like so:
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

def plotting_function(x, y, isoverlay=False):
    years_fmt = mdates.DateFormatter(' %H:%M:%S')
    ax = plt.axes()
    ax.xaxis.set_major_formatter(years_fmt)
    dates = [date.to_pydatetime() for date in x]
    if isoverlay:
        plt.plot(x, y, label="Glucose reading")
    else:
        plt.plot(x, y, 'rs:', label="Glucose reading")
    plt.xlabel("Time of readings")
    plt.ylabel("Glucose readings in mmol/L")
    plt.legend(ncol=2)
    plt.title("Glucose readings plotted against their timestamps")

In the plotting function you could add a list of the labels for the legend as an extra parameter and give that to plt.legend().
Here is a minimal example to show how it could work:
import numpy as np
import matplotlib.pyplot as plt
def plotting_function(x, y, labels):
    plt.plot(x, y)
    plt.legend(labels, ncol=2)
N = 100
K = 9
x = np.arange(N)
y = np.random.normal(.05, .2, (N, K)).cumsum(axis=0) + np.random.uniform(1, 10, K)
labels = [f'label {i + 1}' for i in range(K)] # as a test: ['label 1', 'label 2' ,...]
# labels = ['2019-11-21', '2019-11-22', ...] # this is another example, how dates could be used
plotting_function(x, y, labels)
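If the readings live in a pandas DataFrame, another option is to plot one line per calendar day and pass the date as that line's label. The following is only a sketch under assumptions: the column names `timestamp` and `glucose` and the synthetic data are made up for illustration, and the x-axis is expressed as hours since midnight so that the days overlay.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the glucose data: one reading every 30 minutes over 3 days
rng = np.random.default_rng(0)
timestamps = pd.date_range("2019-11-21", periods=3 * 48, freq="30min")
df = pd.DataFrame({"timestamp": timestamps,
                   "glucose": 6 + rng.normal(0, 1, len(timestamps))})

fig, ax = plt.subplots()
for day, group in df.groupby(df["timestamp"].dt.date):
    # Hours since midnight put every day on the same x-axis
    hours = group["timestamp"].dt.hour + group["timestamp"].dt.minute / 60
    ax.plot(hours, group["glucose"], label=str(day))  # label is the date string

ax.set_xlabel("Hour of day")
ax.set_ylabel("Glucose readings in mmol/L")
legend = ax.legend(ncol=2)
legend_labels = [text.get_text() for text in legend.get_texts()]
```

Because each `plot` call carries its own `label`, the legend picks the dates up automatically and no separate list of labels has to be kept in sync with the lines.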


matplotlib visualization- positive negative proportion chart

I'm trying to make the same chart as below and wonder if matplotlib has a similar chart to make that.
The chart below is the result of the STM topic model in the R package
I have probs values using DMR in Python:
array([[0.07204196, 0.04238116],
[0.04518877, 0.30546978],
[0.0587892 , 0.19870868],
[0.16710107, 0.07182639],
[0.128209 , 0.02422131],
[0.15264449, 0.07237352],
[0.2250081 , 0.06986096],
[0.1337716 , 0.10750801],
[0.01197221, 0.06736039],
[0.00527367, 0.04028973]], dtype=float32)
In these results, the left column is for negative words and the right column is for positive words.
Example of negative positive proportion chart:
It is possible to create something quite close to the image you included. As I understand it, the left column should be negative while the right column should be positive?
First make the left column negative:
import numpy as np
arr = np.array([[0.07204196, 0.04238116],
[0.04518877, 0.30546978],
[0.0587892 , 0.19870868],
[0.16710107, 0.07182639],
[0.128209 , 0.02422131],
[0.15264449, 0.07237352],
[0.2250081 , 0.06986096],
[0.1337716 , 0.10750801],
[0.01197221, 0.06736039],
[0.00527367, 0.04028973]], dtype="float32")
# Make the left col (index 0) negative
arr[:, 0] *= -1
Then we can plot like so:
from string import ascii_lowercase
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
for y, x in enumerate(arr.flatten()):
    # Get a label from the alphabet
    label = ascii_lowercase[y]
    # Plot the point
    ax.plot(x, y, "o", color="black")
    # Annotate the point with the label
    ax.annotate(label, xy=(x, y), xytext=(x - 0.036, y), verticalalignment="center")
# Add the vertical line at zero
ax.axvline(0, ls="--", color="black", lw=1.25)
# Make the x axis equal
xlim = abs(max(ax.get_xlim(), key=abs))
ax.set_xlim((-xlim, xlim))
# Remove y axis
ax.yaxis.set_visible(False)
# Add two text labels for the x axis
for text, x in zip(["Negative", "Positive"], ax.get_xlim()):
    ax.text(x / 2, -3.75, f"{text} Reviews", horizontalalignment="center")
Which outputs:
You can tweak the values in the calls to ax.annotate and ax.text if you need to change the locations of the text on the plot or x-axis.
I'm not sure what the key part of the question is: are you more interested in labeling the individual points by category, or in reproducing the distinctive circle with a line through it? With the array provided, it is a little unclear what the data represents.
What I've assumed is each sublist represents a single category. With that in mind, what I did was make a separate column (delta) for the differences in values and then plotted them vs the index.
# Assumption: df is a DataFrame built from the probs array in the question
# New column (delta) with styling
df['delta'] = df[0] - df[1]
col = np.where(df.delta > 0, 'g', np.where(df.index < 0, 'b', 'r'))
fig, ax = plt.subplots(figsize=(10, 7))
# Style it up a bit
plt.title('Difference in Topic Proportion (Negative vs Positive)')
plt.xlabel('Net Review Score')
plt.ylabel('Index Number')
plt.scatter(df['delta'], df.index, s=None, c=col, marker=None, linewidth=2)
plt.axvline(x=0, color='b', label='axvline - full height', linestyle="--")
plt.tight_layout()
# Save only after everything has been drawn, otherwise the file is empty
plt.savefig("Evolution of rapport of polarisation - (Aluminium).png")
That gives an output like this:

ValueError: The number of FixedLocator locations (5), usually from a call to set_ticks, does not match the number of ticklabels (12)

This piece of code was working before; however, after I created a new environment, it stopped working at the line
plt.xticks(x, months, rotation=25,fontsize=8)
If I comment this line out there is no error; with the line in place, the following error is thrown:
ValueError: The number of FixedLocator locations (5), usually from a call to set_ticks, does not match the number of ticklabels (12).
import numpy as np
import matplotlib.pyplot as plt
dataset = df
dfsize = dataset[df.columns[0]].size
x = []
for i in range(dfsize):
    x.append(i)
dataset.shape
# dataset.dropna(inplace=True)
dataset.columns.values
var = ""
for i in range(dataset.shape[1]):  ## 1 is for column, dataset.shape[1] calculate length of col
    y = dataset[dataset.columns[i]].values
    y = y.astype(float)
    y = y.reshape(-1, 1)
    y.shape
    from sklearn.impute import SimpleImputer
    missingvalues = SimpleImputer(missing_values=np.nan, strategy='mean', verbose=0)
    missingvalues = missingvalues.fit(y)
    y = missingvalues.transform(y[:, :])
    from sklearn.preprocessing import LabelEncoder, OneHotEncoder
    from sklearn.compose import ColumnTransformer
    labelencoder_x = LabelEncoder()
    x = labelencoder_x.fit_transform(x)
    from scipy.interpolate import *
    p1 = np.polyfit(x, y, 1)
    # from matplotlib.pyplot import *
    import matplotlib
    matplotlib.use('Agg')
    import matplotlib.pyplot as plt
    plt.figure()
    plt.xticks(x, months, rotation=25,fontsize=8)
    #print("-->"+dataset.columns[i])
    plt.suptitle(dataset.columns[i] + ' (xyz)', fontsize=10)
    plt.xlabel('month', fontsize=8)
    plt.ylabel('Age', fontsize=10)
    plt.plot(x, y, y, 'r-', linestyle='-', marker='o')
    plt.plot(x, np.polyval(p1, x), 'b-')
    y = y.round(decimals=2)
    for a, b in zip(x, y):
        plt.text(a, b, str(b), bbox=dict(facecolor='yellow', alpha=0.9))
    plt.grid()
    # plt.pause(2)
    # plt.grid()
    var = var + "," + dataset.columns[i]
    plt.savefig(path3 + dataset.columns[i] + '_1.png')
    plt.close(path3 + dataset.columns[i] + '_1.png')
plt.close('all')
I also stumbled across the error and found that making both your xtick_labels and xticks a list of equal length works. So in your case something like:
import calendar

def month(num):
    # returns month name based on month number
    # (body assumed here; the original answer left it out)
    return calendar.month_name[num]

num_elements = len(x)
X_Tick_List = []
X_Tick_Label_List = []
for item in range(0, num_elements):
    X_Tick_List.append(x[item])
    X_Tick_Label_List.append(month(item + 1))
plt.xticks(ticks=X_Tick_List, labels=X_Tick_Label_List, rotation=25, fontsize=8)
I am using subplots and came across the same error. I've noticed that the error disappears if the axis being re-labelled (in my case the y-axis) shows all labels. If it does not, then the error you have flagged appears. I suggest increasing the chart height until all the y-axis labels are shown by default (See screenshots below).
Alternatively, I suggest using ax.set_xticks(...) to define the FixedLocator tick positions and then ax.set_xticklabels(...) to set the labels.
Every 2nd y-axis label drawn
Every y-axis label drawn and over-written with custom labels
The question seems to be a bit older, but I just stumbled across a similar problem: the very same error occurred for me with the line
ax.set(xticklabels=..., xticks=...)
after an update to matplotlib version 3.2.2 which worked OK until then.
The set method of an axis seems to call other methods according to the order of the arguments of set. In older matplotlib versions, after unpacking the set-arguments there seems to have been some sort of ordering.
Rearranging the order of the parameter such that first the new number of xticks are set solved the problem for me:
ax.set(xticks=..., xticklabels=...)
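The common thread in the answers above is that the number of tick positions must equal the number of labels, and that the positions should be fixed before the labels are assigned. A minimal sketch of that pattern follows; the month names and y values are placeholders, not data from the question.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
x = np.arange(len(months))            # one tick position per label
y = np.linspace(20, 31, len(months))  # placeholder values

fig, ax = plt.subplots()
ax.plot(x, y, "o-")
ax.set_xticks(x)                                     # 12 positions first
ax.set_xticklabels(months, rotation=25, fontsize=8)  # then 12 labels
tick_labels = [t.get_text() for t in ax.get_xticklabels()]
```

If `x` had fewer entries than `months` (for example 5 positions and 12 labels), the second call is where the mismatch would be reported.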
This happens when matplotlib fails to fit all the column names on the axis. Previously I had 5 columns and it was fine; the next time I had 100+ columns and it raised the error.
Error in line:
g.set_xticklabels(df['code'], rotation=15, fontdict={'fontsize': 16})

How to make standard deviation and percentile bands in a python scatter plot

I have data for a scatter plot (for reference, x values are labelled sm, y values are labelled bhm) and my three goals are to find the medians of binned data, create standard deviation bands, and create bands at the 90th and 10th percentiles. I've managed to do the first, and while I've been able to make vertical bars indicating the standard deviation, I can't figure out how to make filled-in bands since every time I try to set parameters with the fill_between function, it says operators with sm/bhm are incompatible since they're datasets and I'm comparing them to singular values (the mean line). I copied all of my code down below and there's a comment pointing out the relevant stuff - I just kept all of it since the variable names are a bit important and also because some parts of the plot don't show up properly without the seemingly extraneous code
To create the bands at 90/10 percent, I tried this bit of code by trying to bin the mean as I did for the median, and then filling the top and bottom of the line +-90% of the data but I keep getting
patsy.PatsyError: model is missing required outcome variables
#stuff that really doesn't work
model = smf.quantreg(bhm, sm)
quantiles = [0.1, 0.9]
fits = [model.fit(q=q) for q in quantiles]
figure, axes = plt.subplots()
_sm = np.linspace(min(sm), max(sm))
for index, quantile in enumerate(quantiles):
    _bhm = fits[index].params['world'] * _sm + fits[index].params['Intercept']
    axes.plot(_sm, _bhm, label = quantile)
axes.plot(_sm, _sm, 'g--', label = 'i guess this line is the mean')
#stuff that also doesn't really work
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.patches as mpatches
import h5py
import statistics as stat
import pandas as pd
import statsmodels.formula.api as smf
#my files and labels for things
f=h5py.File(r'C:\Users\hanna\Downloads\CatalogueGalsz0p0.hdf5', 'r')
sm = f['StellarMass']
bhm = f['BHMass']
bt = f['BtoT']
dt = f['DtoT']
nbins = 125
#titles and scaling for the plot
plt.title('Relationships Between Stellar Mass, Black Hole Mass, and Bulge to Total Ratios')
plt.xlabel('Stellar Mass')
plt.ylabel('Black Hole Mass')
plt.xscale('log')
plt.yscale('log')
axes = plt.gca()
axes.set_ylim([500000,max(bhm)])
axes.set_xlim([min(sm),max(sm)])
#labels for the legend and how I colored the points in the plot
DtoT = np.copy(f['DtoT'].value)
colour = np.zeros(len(DtoT),dtype=str)
for i in np.arange(0, len(bt)):
    if bt[i]>=0.5:
        colour[i]='green'
    else:
        colour[i]='red'
redbt = mpatches.Patch(color = 'red', label = 'Bulge to Total Ratios Below 0.5')
greenbt = mpatches.Patch(color = 'green', label = 'Bulge to Total Ratios Above 0.5')
plt.legend(handles = [(redbt), (greenbt)])
#the important part - this is how I binned my data to make the median line, and this part works but not the standard deviation bands
bins = np.linspace(0, max(sm), nbins)
delta = bins[1]-bins[0]
idx = np.digitize(sm, bins)
runningmedian = [np.median(bhm[idx==k]) for k in range(nbins)]
runningstd = [bhm[idx==k].std() for k in range(nbins)]
plt.plot(bins-delta/2, runningmedian, c = 'b', lw=1)
plt.scatter(sm, bhm, c=colour, s=.2)
plt.show()
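One possible way to get filled bands is to compute one value per bin for each statistic, so that every array handed to fill_between has the same length as the array of bin centres. The sketch below uses synthetic data in place of the HDF5 catalogue and makes several assumptions that are not in the question: logarithmic bins, bands drawn around the binned median rather than the mean, and a standard deviation taken in log space so the band stays positive on a log axis. With the real file, the h5py datasets would probably need to be loaded into NumPy arrays first (for example `f['StellarMass'][:]`) before boolean indexing.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for the catalogue (assumption: positive, roughly log-linear data)
rng = np.random.default_rng(1)
sm = 10 ** rng.uniform(8, 12, 5000)
bhm = 1e-3 * sm * 10 ** rng.normal(0, 0.3, sm.size)

# Logarithmic bins suit the log-scaled axes
nbins = 20
edges = np.logspace(8, 12, nbins + 1)
centers = np.sqrt(edges[:-1] * edges[1:])  # geometric bin centres
idx = np.clip(np.digitize(sm, edges) - 1, 0, nbins - 1)

# One value per bin, so every array passed to fill_between has length nbins
median = np.array([np.median(bhm[idx == k]) for k in range(nbins)])
p10 = np.array([np.percentile(bhm[idx == k], 10) for k in range(nbins)])
p90 = np.array([np.percentile(bhm[idx == k], 90) for k in range(nbins)])
log_std = np.array([np.log10(bhm[idx == k]).std() for k in range(nbins)])

fig, ax = plt.subplots()
ax.scatter(sm, bhm, s=0.2, c="grey")
ax.fill_between(centers, p10, p90, alpha=0.3, label="10th to 90th percentile")
ax.fill_between(centers, median / 10 ** log_std, median * 10 ** log_std,
                alpha=0.3, label="median ± 1σ (log space)")
ax.plot(centers, median, c="b", lw=1, label="binned median")
ax.set_xscale("log")
ax.set_yscale("log")
ax.legend()
```

The incompatible-operand error described in the question typically goes away once the band edges are per-bin arrays like `p10` and `p90` instead of the raw `sm`/`bhm` datasets compared against a single line.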

How to make readable a line plot using a DataFrame with a large number of rows

I have a 1,000,000 x 2 DataFrame object consisting of data I'm trying to understand visually. It's basically a simulation of 1,000,000 events where a packet traveling along a network is either queued or dropped depending on the buffer's size. So, the two column values are Packets in Queue and Packets Dropped.
I'm trying to make a line plot using Python, Matplotlib and Jupyter Notebooks that has the ID of the event on the x-axis and the number of packets in the queue at a specific ID point on the y-axis. There should be two lines, the first representing the number of packets in the queue and the second representing the number of packets dropped. However, given that there are over 1,000,000 simulations, the graph isn't intelligible. The values are too squished together. Is it possible to make a readable graph with 1,000,000 event instances or do I need to dramatically trim the number of events?
With a million data points it will require a lot of effort and zooming in to see them in fine detail. Plotly has some nice tools for zooming in and out of plots as well as sliding your data window along the x-axis.
If you're okay with some averaging, you can plot a moving average and get close to a hundred thousand points. You can stack two subplots on each other to see both columns of data in reasonable detail. You can of course average them more, but you lose the ability to see fine details.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
def moving_avg(x, N=30):
    return np.convolve(x, np.ones((N,))/N, mode='valid')
plt.figure(figsize = (16,12))
plt.subplot(3,1,1)
x = np.random.random(1000)
plt.plot(x, linewidth = 1, alpha = 0.5, label = 'linewidth = 1')
plt.plot(moving_avg(x, 10), 'C0', label = 'moving average, N = 10')
plt.xlim(0,len(x))
plt.legend(loc=2)
plt.subplot(3,1,2)
x = np.random.random(10000)
plt.plot(x, linewidth = 0.2, alpha = 0.5, label = 'linewidth = 0.2')
plt.plot(moving_avg(x, 100), 'C0', label = 'moving average, N = 100')
plt.xlim(0,len(x))
plt.legend(loc=2)
plt.subplot(3,1,3)
x = np.random.random(100000)
plt.plot(x, linewidth = 0.05, alpha = 0.5, label = 'linewidth = 0.05')
plt.plot(moving_avg(x, 500), 'C0', label = 'moving average, N = 500')
plt.xlim(0,len(x))
plt.legend(loc=2)
plt.tight_layout()
Try a histogram:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame()
df['x'] = np.random.rand(1000000)
plt.hist(df.index, weights=df.x, bins=1000)
plt.show()
Method 2 line plots
df['x'] = np.random.rand(1000000)
df['y'] = np.random.rand(1000000)
w = 1000
v1 = df['x'].rolling(min_periods=1, window=w).sum()[[i*w for i in range(1, int(len(df)/w))]]/w
v2 = df['y'].rolling(min_periods=1, window=w).sum()[[i*w for i in range(1, int(len(df)/w))]]/w
plt.plot(np.arange(len(v1)),v1, c='b')
plt.plot(np.arange(len(v1)),v2, c='r')
plt.show()
Here we average each block of w=1000 consecutive points and plot the averages.
This is how it looks when the 1,000,000 points are bucketed into intervals of 1000:
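The same bucketing idea can also be written as a groupby on the event ID, which keeps both columns together and avoids the rolling window. This is only a sketch: the column names `queued` and `dropped` and the random data are hypothetical stand-ins for the simulation output.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n_events, block = 1_000_000, 1000
# Hypothetical column names standing in for the simulation output
df = pd.DataFrame({"queued": rng.integers(0, 50, n_events),
                   "dropped": rng.integers(0, 5, n_events)})

# Integer-divide the event ID by the block size to get a bucket number,
# then average every column within each bucket
downsampled = df.groupby(df.index // block).mean()
```

`downsampled` then has one row per 1000 events, so `downsampled.plot()` draws two lines of 1000 points each instead of a million. Swapping `.mean()` for `.max()` would keep short spikes visible, at the cost of hiding the typical level.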

Matplotlib Pcolormesh - in what format should I give the data?

I'm trying to use matplotlib's pcolormesh function to draw a diagram that shows dots in 2d coordinates, and the color of the dots would be defined by a number.
I have three arrays, one of which has the x-coordinates, another one with the y-coordinates, and the third one has the numbers which should represent colors.
xdata = [ 695422. 695423. 695424. 695425. 695426. 695426.]
ydata = [ 0. -15.4 -15.3 -15.7 -15.5 -19. ]
colordata = [ 0. 121. 74. 42. 8. 0.]
Now, apparently pcolormesh wants its data as three 2d arrays.
In some examples I've seen something like this being done:
newxdata, newydata = np.meshgrid(xdata,ydata)
Okay, but how do I get colordata into a similar format? I tried to do it this way:
newcolordata, zz = np.meshgrid(colordata, xdata)
But I'm not exactly sure if it's right. Now, if I try to draw the diagram:
ax.pcolormesh(newxdata, newydata, newcolordata)
I get something that looks like this.
No errors, so I guess that's good. But the picture it returns obviously doesn't look like what I want. Can someone point me in the right direction with this? Is the data array still in the wrong format?
This should be all of the important code:
newxdata, newydata = np.meshgrid(xdata,ydata)
newcolordata, zz = np.meshgrid(colordata, xdata)
print newxdata
print newydata
print newcolordata
diagram = plt.figure()
ax = diagram.add_subplot(111)
xformat = DateFormatter('%d/%m/%Y')
ax.xaxis_date()
plot1 = ax.pcolormesh(newxdata, newydata, newcolordata)
ax.set_title("A butterfly diagram of sunspots between dates %s and %s" % (date1, date2))
ax.autoscale(enable=False)
ax.xaxis.set_major_formatter(xformat)
diagram.autofmt_xdate()
if command == "save":
    diagram.savefig('diagrams//'+name+'.png')
Edit: I noticed that the colors do correspond to the number. Now I just have to turn those equally sized bars into dots.
If you want dots, use scatter. pcolormesh draws a grid. scatter draws markers colored and/or scaled by size.
For example:
import matplotlib.pyplot as plt
xdata = [695422.,695423.,695424.,695425.,695426.,695426.]
ydata = [0.,-15.4,-15.3,-15.7,-15.5,-19.]
colordata = [0.,121.,74.,42.,8.,0.]
fig, ax = plt.subplots()
ax.scatter(xdata, ydata, c=colordata, marker='o', s=200)
ax.xaxis_date()
fig.autofmt_xdate()
plt.show()
Edit:
It sounds like you want to bin your data and sum the areas inside each bin.
If so, you can just use hist2d to do this. If you specify the areas of the sunspots as the weights to the histogram, the areas inside each bin will be summed.
Here's an example (data from here: http://solarscience.msfc.nasa.gov/greenwch.shtml, specifically, this file, formatted as described here). Most of this is reading the data. Notice that I'm specifying the vmin and then using im.cmap.set_under('none') to display anything under that value as transparent.
It's entirely possible that I'm completely misunderstanding the data here. The units may be completely incorrect (the "raw" areas given are in million-ths of the sun's surface area, I think).
from glob import glob
import datetime as dt
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
def main():
    files = sorted(glob('sunspot_data/*.txt'))
    df = pd.concat([read_file(name) for name in files])
    date = mdates.date2num(df.date)
    fig, ax = plt.subplots(figsize=(10, 4))
    data, xbins, ybins, im = ax.hist2d(date, df.latitude, weights=df.area/1e4,
                                       bins=(1000, 50), vmin=1e-6)
    ax.xaxis_date()
    im.cmap.set_under('none')
    cbar = fig.colorbar(im)
    ax.set(xlabel='Date', ylabel='Solar Latitude', title='Butterfly Plot')
    cbar.set_label("Percentage of the Sun's surface")
    fig.tight_layout()
    plt.show()

def read_file(filename):
    """This data happens to be in a rather annoying format..."""
    def parse_date(year, month, day, time):
        year, month, day = [int(item) for item in [year, month, day]]
        time = 24 * float(time)
        hour = int(time)
        minute_frac = 60 * (time % 1)
        minute = int(minute_frac)
        second = int(60 * (minute_frac % 1))
        return dt.datetime(year, month, day, hour, minute, second)
    cols = dict(year=(0, 4), month=(4, 6), day=(6, 8), time=(8, 12),
                area=(41, 44), latitude=(63, 68), longitude=(57, 62))
    # read_fwf expects colspecs as a list or tuple, so convert the dict views
    df = pd.read_fwf(filename, colspecs=list(cols.values()), header=None,
                     names=list(cols.keys()), date_parser=parse_date,
                     parse_dates={'date':['year', 'month', 'day', 'time']})
    return df

main()
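The summing behaviour of the weights can be checked without the data files. Axes.hist2d relies on np.histogram2d for the binning, so a tiny example with made-up records (the positions, latitudes and areas below are invented for illustration) shows what ends up in each cell:

```python
import numpy as np

# Four made-up sunspot records: x position, latitude, area
x = np.array([0.2, 0.3, 1.5, 1.6])
lat = np.array([10.0, 12.0, -20.0, 25.0])
area = np.array([100.0, 50.0, 30.0, 7.0])

# 2 x 2 grid of bins; the weights (areas) are summed inside each bin
summed, xedges, yedges = np.histogram2d(
    x, lat, bins=[[0, 1, 2], [-30, 0, 30]], weights=area)

print(summed)
```

The first two records share a bin, so their areas (100 and 50) are added together, while the empty bin stays at zero. That zero is what `vmin` together with `set_under('none')` renders as transparent in the plot above.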
