Python Histogram using matplotlib - python

I've written a python script that parses a trace file and retrieves a list of objects (vehicle objects) containing the vehicle id, timestep and the number of other vehicles in radio range of a particular vehicle for that timestep:
for d_obj in global_list_of_nbrs:
    print("\t", d_obj.id, "\t", d_obj.time, "\t", d_obj.num_nbrs)
The sample output from the test file I am using is:
0 0 1
0 1 2
0 2 0
1 0 1
1 1 2
2 0 0
2 1 2
This can be interpreted as vehicle with id 0 at timestep 0 has 1 neighbouring vehicle, vehicle with id 0 at timestep 1 has 2 neighbouring vehicles (i.e. in radio range) etc.
I would like to plot a histogram using matplotlib to represent this data, but I am unsure how to choose the bins and how to represent the list (currently a list of objects).
Can anyone please advise on this?
Many thanks in advance.

Here's an example of something you might be able to do with this data set:
Note: You'll need to install pandas for this example to work for you.
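from numpy.random import randint, poisson
from pandas import DataFrame
from matplotlib.pyplot import subplots

# Synthetic data: 3 vehicle IDs, Poisson-distributed neighbour counts
n = 10000
id_col = randint(3, size=n)
lam = 10
num_nbrs = poisson(lam, size=n)
d = DataFrame({'id': id_col, 'num_nbrs': num_nbrs})
fig, axs = subplots(2, 3, figsize=(12, 4))

def plotter(ax, grp, grp_name, method, **kwargs):
    # Call ax.plot or ax.hist by name on this group's neighbour counts
    getattr(ax, method)(grp.num_nbrs.values, **kwargs)
    ax.set_title('ID: %i' % grp_name)

gb = d.groupby('id')
for row, method in zip((0, 1), ('plot', 'hist')):
    for ax, (grp_name, grp) in zip(axs[row].flat, gb):
        plotter(ax, grp, grp_name, method)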
What I've done is created 2 plots for each of 3 IDs. The top row shows the number of neighbors as a function of time for each ID. The bottom row shows the distribution of the number of neighbors across time.
You'll probably want to play around with sharing axes, axes labelling and all the other fun things that matplotlib offers.
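For the original list of objects, one possible starting point is to pull the num_nbrs attribute out of each object and histogram it with one bin per integer value. This is only a minimal sketch, not part of the answer above: the Vehicle namedtuple is a hypothetical stand-in for the vehicle objects, and the data is the sample output from the question.

```python
from collections import namedtuple

import matplotlib.pyplot as plt
import numpy as np

# Hypothetical stand-in for the vehicle objects described in the question.
Vehicle = namedtuple("Vehicle", ["id", "time", "num_nbrs"])

global_list_of_nbrs = [
    Vehicle(0, 0, 1), Vehicle(0, 1, 2), Vehicle(0, 2, 0),
    Vehicle(1, 0, 1), Vehicle(1, 1, 2),
    Vehicle(2, 0, 0), Vehicle(2, 1, 2),
]

# Flatten the objects into a plain list of neighbour counts.
nbr_counts = [v.num_nbrs for v in global_list_of_nbrs]

# One bin per integer value: edges at -0.5, 0.5, 1.5, ... so that each bar
# is centred on a whole number of neighbours.
bin_edges = np.arange(max(nbr_counts) + 2) - 0.5

fig, ax = plt.subplots()
counts, edges, _ = ax.hist(nbr_counts, bins=bin_edges, edgecolor="black")
ax.set_xticks(range(max(nbr_counts) + 1))
ax.set_xlabel("Number of neighbours in radio range")
ax.set_ylabel("Number of (vehicle, timestep) observations")
# Call plt.show() or fig.savefig(...) to view the figure.
```

To get one histogram per vehicle instead, the same list comprehension can be filtered on v.id before calling hist.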

Related

How to create a histogram over time?

I'm trying to visualize how a distribution changes over time -- each vertical slice should be the distribution at that timestep.
I want it to look something like this (there are two such curves/temporal-histograms here).
The closest I've found is this seaborn time series, but I want the distribution or at least the min, mean, and max -- this band is a confidence interval, which I can't use (it's also prohibitively slow).
https://seaborn.pydata.org/examples/errorband_lineplots.html
Update:
Here's a snippet to produce some sample data.
import numpy as np
num_timesteps = 20
samples_per_timestep = 100
timesteps = np.arange(num_timesteps)
def get_std(t):
    return t if t < num_timesteps//2 else abs(num_timesteps - t)
samples = np.stack([np.random.normal(t, get_std(t), samples_per_timestep) for t in timesteps])
samples[t] is a sample of the distribution at timestep t. The distribution starts as constant (std = 0), widens, then narrows again.
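If the min, mean and max per timestep are enough, they can be computed directly from the samples array and drawn as a band with a line on top. The following is a minimal sketch using plain matplotlib rather than seaborn; the seed and the styling are arbitrary choices made for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(0)  # arbitrary seed, only for reproducibility

num_timesteps = 20
samples_per_timestep = 100
timesteps = np.arange(num_timesteps)

def get_std(t):
    return t if t < num_timesteps // 2 else abs(num_timesteps - t)

# samples has shape (num_timesteps, samples_per_timestep)
samples = np.stack([np.random.normal(t, get_std(t), samples_per_timestep)
                    for t in timesteps])

# Reduce over the sample axis to get one value per timestep.
lo = samples.min(axis=1)
mean = samples.mean(axis=1)
hi = samples.max(axis=1)

fig, ax = plt.subplots()
ax.fill_between(timesteps, lo, hi, alpha=0.3, label="min to max")
ax.plot(timesteps, mean, label="mean")
ax.set_xlabel("timestep")
ax.set_ylabel("value")
ax.legend()
# Call plt.show() or fig.savefig(...) to view the figure.
```

Replacing min and max with np.percentile(samples, q, axis=1) for a few values of q gives nested bands that approximate the full distribution.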
Update: So I asked about this in the Seaborn repository, and this can be way simpler using Seaborn.
Here's an example:
import seaborn as sns
import seaborn.objects as so
fmri = sns.load_dataset("fmri").query("region == 'parietal'")
p = so.Plot(fmri, "timepoint", "signal")
for tail in [25, 10, 5, 1]:
    p = p.add(so.Band(), so.Perc([tail, 100 - tail]))
p.add(so.Line(), so.Agg("median"))
Which will result in this plot:
You can read more about it in Statistical estimation and error bars.
This is a lot less work and scales better. Hope it helps!
I had the exact same problem, and took quite a few detours, but this is most definitely possible!
Imports
We need to import matplotlib, NumPy and Pandas.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
Input data
I assume you have the data as a Pandas Series, with the time/step on the index and the values as values.
Step  Wealth
0     0
0     0
0     0
0     0
1     7.89338
1     7.50838
1     2.00948
1     8.74963
I load my data from a pickle (in the format specified above):
wealth = pd.read_pickle("wealth.pickle")
You can download a ZIP with this pickle file here: wealth.zip.
Aggregate the data to percentiles
This part is a bit ugly. We first define a small helper function for each percentile we want to calculate:
# Define functions to calculate percentiles
def q1(x):
    return x.quantile(0.01)
def q5(x):
    return x.quantile(0.05)
def q25(x):
    return x.quantile(0.25)
def q50(x):
    return x.quantile(0.50)
def q75(x):
    return x.quantile(0.75)
def q95(x):
    return x.quantile(0.95)
def q99(x):
    return x.quantile(0.99)
If anyone reading this has a better/cleaner way to aggregate this, please let me know!
Note that these helpers call the quantile method of the pandas Series, because (in my case) it works better with the data in the series, but numpy.quantile or numpy.percentile should be equivalent.
The data now needs to be grouped by the index (level=0), and then aggregated using the functions defined above:
w_agg = wealth.groupby(level=0).agg([q1, q5, q25, q50, q75, q95, q99])
And now w_agg looks like this:
Step        q1        q5       q25       q50       q75       q95       q99
0            0         0         0         0         0         0         0
1      -3.2311  0.759751    3.2881   6.03641   8.43206   11.9663   15.4515
2     -3.22888  -1.15079   3.13756   6.41804   8.43206   12.7269   15.4515
3     -5.31614  -1.91156   3.22287   6.54126   8.77544    14.644   15.5798
4     -5.64095  -2.52143   2.65959   6.22455   9.40699   14.6545   15.9647
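As one possible way to shorten the aggregation step, pandas can compute several quantiles per group in a single call and pivot them into columns. This is a sketch on made-up data (the real wealth.pickle is not reproduced here). The columns it produces are named after the quantile levels (0.01, 0.05, ...), so the rename at the end is only there to match the q1, q5, ... names used above.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the wealth Series: 5 steps, 200 values per step,
# with the step number as the index and all values zero at step 0.
rng = np.random.default_rng(0)
steps = np.repeat(np.arange(5), 200)
wealth = pd.Series(rng.normal(5, 4, size=steps.size) * (steps > 0),
                   index=pd.Index(steps, name="Step"), name="Wealth")

quantiles = [0.01, 0.05, 0.25, 0.50, 0.75, 0.95, 0.99]

# groupby(...).quantile with a list returns a Series indexed by
# (Step, quantile); unstack() pivots the quantile level into columns.
w_agg = wealth.groupby(level=0).quantile(quantiles).unstack()

# Rename the columns to q1, q5, ... to match the names used in the text.
w_agg.columns = [f"q{round(q * 100)}" for q in quantiles]
print(w_agg)
```

The resulting w_agg has the same layout as the table above, so the plotting code can use it unchanged.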
Plotting
Now we can start with the plotting. Aside from the regulars, we're using matplotlib's Axes.fill_between for this.
We create a figure, and then start with the widest percentile range: From 1 to 99. Then we draw 5 to 95 on top of that, then 25 to 75 on top of that and finally the median as a line.
Play a bit with the alpha and color values to make it look nice!
# Create a figure and axes
fig, ax = plt.subplots(figsize=(10, 6))
# Add the bands to the axes
ax.fill_between(x=w_agg.index, y1=w_agg["q1"], y2=w_agg["q99"], alpha=0.3, color="tab:blue")
ax.fill_between(x=w_agg.index, y1=w_agg["q5"], y2=w_agg["q95"], alpha=0.3, color="tab:blue")
ax.fill_between(x=w_agg.index, y1=w_agg["q25"], y2=w_agg["q75"], alpha=0.3, color="tab:blue")
# Plot the median as line
ax.plot(w_agg.index, w_agg["q50"], '-', color="tab:blue")
# Add title, legend and plot
ax.set_title("Distribution of wealth between population over time")
ax.set_xlabel("Time")
ax.set_ylabel("Wealth")
ax.legend([f"{n}% distribution" for n in [98, 90, 50]] + ["Median"], loc="upper left")
fig.tight_layout()
The result
For my dataset, this was the result:

Plotting DataFrame columns as Series sets unexpected arguments

Here is an issue when plotting a dataframe:
The dataframe looks like
i ii n b
0 1 0 3 0
1 4 1 5 0
2 4 0 1 5
3 4 1 2 6
4 6 0 3 0
5 6 1 4 3
(code to create below). I'd like to plot stacked bars for same values of i, and I want bars to belong to groups according to ii. When I select only certain rows of the dataframe, I have issues plotting, forcing me to explicitly convert the dataframe's columns (which are extracted as pandas Series) to lists. (Note that I cannot use pivot, as I have multiple rows for some (i, ii) combinations.)
Why can I not directly pass a Series to matplotlib.pyplot.bar() (code for figure 3)?
Why does using Series affect the width of bars, which cannot be overridden by an explicit argument width?
Is there a way to produce the desired plot in a better way?
import matplotlib.pyplot as plt
import pandas as pd
df = pd.DataFrame({'i':[1,4,4,4,6,6], 'n':[3,5,1,2,3,4]})
df['ii'] = df.index % 2
df2 = df.set_index(['i', 'ii'])
df2["b"] = df2.groupby(level='i')['n'].cumsum() - df2.n
df2.reset_index(inplace=True)
# This produces expected outcome
plt.figure(1)
plt.clf()
ix = df2[df2.ii==0]
plt.bar(x=list(ix.i), height=ix.n, bottom=list(ix.b))
ix = df2[df2.ii==1]
plt.bar(x=list(ix.i), height=ix.n, bottom=list(ix.b))
plt.show()
plt.figure(2)
plt.clf()
ix = df2[df2.ii==0]
plt.bar(x=ix.i, height=ix.n, bottom=list(ix.b))
ix = df2[df2.ii==1]
# The following line will draw a bar with unexpected width of bar
plt.bar(x=ix.i, height=ix.n, bottom=list(ix.b))
plt.show()
plt.figure(3)
plt.clf()
plt.show()
ix = df2[df2.ii==0]
plt.bar(x=ix.i, height=ix.n, bottom=ix.b)
ix = df2[df2.ii==1]
# The following line will fail
plt.bar(x=ix.i, height=ix.n, bottom=ix.b)
# error:
# TypeError: only size-1 arrays can be converted to Python scalars
# apparently matplotlib tries to set line width
Desired output:
You typically get this kind of error when a function expects a single value but receives an array instead. In many cases, np.vectorize can be used to apply a function that accepts a single element to every element of an array, but that does not seem to be the case here. Why not pass lists, as you did in the first plot?
plt.bar(x=list(ix.i), height=list(ix.n), bottom=list(ix.b))
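A slightly more compact variant of the list conversion is to hand matplotlib plain NumPy arrays via Series.to_numpy(), which drops the pandas index, and to loop over the ii groups instead of repeating the code. This is a sketch under the assumption that the index carried by the filtered Series is what confuses bar(); that has not been checked against every matplotlib version.

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'i': [1, 4, 4, 4, 6, 6], 'n': [3, 5, 1, 2, 3, 4]})
df['ii'] = df.index % 2
# Bottom of each bar: running total of n within the same i,
# excluding the bar itself.
df['b'] = df.groupby('i')['n'].cumsum() - df['n']

fig, ax = plt.subplots()
for ii_value, grp in df.groupby('ii'):
    # to_numpy() strips the non-contiguous index left by the row selection.
    ax.bar(x=grp['i'].to_numpy(), height=grp['n'].to_numpy(),
           bottom=grp['b'].to_numpy(), label=f"ii = {ii_value}")
ax.legend()
# Call plt.show() or fig.savefig(...) to view the figure.
```

Each pass through the loop draws one colour group, so the stacking comes entirely from the precomputed b column.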

How can I auto-adjust my scatterplot labels without them being overlapped by other labels in python?

So I have been working on this for a bit, and just wanted to see if someone could look at why I couldn't auto-adjust my scatter-plot labels. As I was searching for a solution I came across the adjustText library found here https://github.com/Phlya/adjustText and it seems like it should work, but I'm just trying to find an example that plots from a dataframe. When I tried replicating the adjustText examples, it threw an error. This is my current code.
df["category"] = df["category"].astype(int)
df2 = df.sort_values(by=['count'], ascending=False).head()
ax = df.plot.scatter(x="category", y="count")
a = df2['category']
b = df2['count']
texts = []
for xy in zip(a, b):
    texts.append(plt.text(xy))
adjust_text(texts, arrowprops=dict(arrowstyle="->", color='r', lw=0.5))
plt.title("Count of {column} in {table}".format(**sql_dict))
But then I got this TypeError: TypeError: text() missing 2 required positional arguments: 'y' and 's'. This is what I changed it to; it works, but the coordinate labels just overlap.
df["category"] = df["category"].astype(int)
df2 = df.sort_values(by=['count'], ascending=False).head()
ax = df.plot.scatter(x="category", y="count")
a = df2['category']
b = df2['count']
for xy in zip(a, b):
    ax.annotate('(%s, %s)' % xy, xy=xy)
As you can see here, my df is constructed from tables in SQL, and I'll show what this specific table looks like. In this table it's length of stay in days compared to how many people stayed that long.
A sample of the data may look like the following. I made a second dataframe above so that I would label only the highest values on the plot. This is one of my first experiences with graphing visualizations in Python, so any help would be appreciated.
[Image: picture of a graph of overlapping items]
[los_days count]
3 350
1 4000
15 34
and so forth. Thanks so much. Let me know if you need anything else.
Here is an example of the df
category count
0 2 29603
1 4 33980
2 9 21387
3 11 17661
4 18 10618
5 20 8395
6 27 5293
7 29 4121
After some reverse engineering with an example from adjustText library and my own example, I just had to change my for loop to create the labels and it worked fantastically.
labels = ['{}'.format(i) for i in zip(a, b)]
texts = []
for x, y, text in zip(a, b, labels):
    texts.append(ax.text(x, y, text))
adjust_text(texts, force_text=0.05, arrowprops=dict(arrowstyle="-|>",
                                                    color='r', alpha=0.5))
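The TypeError in the first attempt arises because plt.text(xy) receives the whole (x, y) tuple as its first argument, whereas text() expects x, y and the label string as separate arguments. The sketch below shows only that unpacking step on the sample data from the question. It deliberately leaves out adjust_text so that it runs with matplotlib and pandas alone.

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "category": [2, 4, 9, 11, 18, 20, 27, 29],
    "count": [29603, 33980, 21387, 17661, 10618, 8395, 5293, 4121],
})
# Label only the five rows with the highest counts.
top = df.sort_values(by="count", ascending=False).head()

ax = df.plot.scatter(x="category", y="count")
texts = []
for x, y in zip(top["category"], top["count"]):
    # text() takes x, y and the string as three separate arguments.
    texts.append(ax.text(float(x), float(y), f"({x}, {y})"))
# The texts list can then be passed to adjust_text(texts, ...).
```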

How do I label a specific point in a scatter plot with a unique ID?

I am creating an interactive graph for a layout that looks a lot like this:
Each point has a unique ID and is usually part of a group. Each group has their own color so I use multiple scatter plots to create the entire layout. I need the following to occur when I click on a single point:
On mouse click, retrieve the ID of the selected point.
Plug the ID into a black box function that returns a list of nearby* IDs.
Highlight the points of the IDs in the returned list.
*It is possible for some of the IDs to be from different groups/plots.
How do I:
Associate each point with an ID and return the ID when the point is clicked?
Highlight other points in the layout when all I know is their IDs?
Re-position individual points while maintaining their respective groups i.e. swapping positions with points that belong to different groups/plots.
I used pyqtgraph before switching over to matplotlib so I first thought of creating a dictionary of IDs and their point objects. After experimenting with pick_event, it seems to me that the concept of point objects does not exist in matplotlib. From what I've learned so far, each point is represented by an index and only its PathCollection can return information about itself e.g. coordinates. I also learned that color modification of a specific point is done through its PathCollection whereas in pyqtgraph I can do it through a point object e.g. point.setBrush('#000000').
I am still convinced that using a single scatter plot would be the much better option. There is nothing in the question that would contradict that.
You can merge all your data in a single DataFrame, with columns group, id, x, y, color. The part of the code below marked "create some dataset" creates such a DataFrame:
group id x y color
0 1 AEBB 0 0 palegreen
1 3 DCEB 1 0 plum
2 0 EBCC 2 0 sandybrown
3 0 BEBE 3 0 sandybrown
4 3 BEBB 4 0 plum
Note that each group has its own color. One can then create a scatter from it, using the colors from the color column.
A pick event is registered as in this previous question and once a point is clicked, which is not already black, the id from the DataFrame corresponding to the selected point is obtained. From the id, other ids are generated via the "blackbox function" and for each id obtained that way the respective index of the point in the dataframe is determined. Because we have single scatter this index is directly the index of the point in the scatter (PathCollection) and we can paint it black.
import numpy as np; np.random.seed(1)
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors

### create some dataset
x,y = np.meshgrid(np.arange(20), np.arange(20))
group = np.random.randint(0,4,size=20*20)
l = np.array(np.meshgrid(list("ABCDE"),list("ABCDE"),
                         list("ABCDE"),list("ABCDE"))).T.reshape(-1,4)
ide = np.random.choice(list(map("".join, l)), size=20*20, replace=False)
df = pd.DataFrame({"id" : ide, "group" : group ,
                   "x" : x.flatten(), "y" : y.flatten() })
colors = ["sandybrown", "palegreen", "paleturquoise", "plum"]
# map each group number to its color name
df["color"] = df["group"].map(dict(zip(range(4), colors)))
print(df.head())

### plot a single scatter plot from the table above
fig, ax = plt.subplots()
scatter = ax.scatter(df.x,df.y, facecolors=df.color, s=64, picker=4)

def getOtherIDsfromID(ID):
    """ blackbox function: create a list of other IDs from one ID """
    l = [np.random.permutation(list(ID)) for i in range(5)]
    return list(set(map("".join, l)))

def select_point(event):
    if event.mouseevent.button == 1:
        facecolor = scatter._facecolors[event.ind,:]
        if (facecolor == np.array([[0, 0, 0, 1]])).all():
            # point is already black: restore its original color
            c = df.color.values[event.ind][0]
            c = matplotlib.colors.to_rgba(c)
            scatter._facecolors[event.ind,:] = c
        else:
            ID = df.id.values[event.ind][0]
            oIDs = getOtherIDsfromID(ID)
            # for each ID obtained, make the respective point black.
            rows = df.loc[df.id.isin([ID] + oIDs)]
            for i, row in rows.iterrows():
                scatter._facecolors[i,:] = (0, 0, 0, 1)
            tx = "You selected id {}.\n".format(ID)
            tx += "Points with other ids {} will be affected as well"
            tx = tx.format(oIDs)
            print(tx)
        fig.canvas.draw_idle()

fig.canvas.mpl_connect('pick_event', select_point)
plt.show()
In the image below, the point with id DAEE has been clicked on, and other points with ids ['EDEA', 'DEEA', 'EDAE', 'DEAE'] have been chosen by the blackbox function. Not all of those IDs exist, such that two other points with an existing id are colorized as well.
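The recolouring step can also be written against the public get_facecolors() and set_facecolors() methods rather than the private _facecolors attribute. The sketch below isolates that step. Here highlight_ids is a hypothetical helper name, and the five-row DataFrame is the excerpt shown earlier. It relies on the same assumption as the answer: a single scatter whose point order matches the DataFrame rows.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "group": [1, 3, 0, 0, 3],
    "id": ["AEBB", "DCEB", "EBCC", "BEBE", "BEBB"],
    "x": [0, 1, 2, 3, 4],
    "y": [0, 0, 0, 0, 0],
    "color": ["palegreen", "plum", "sandybrown", "sandybrown", "plum"],
})

fig, ax = plt.subplots()
scatter = ax.scatter(df["x"], df["y"], c=df["color"].tolist(), s=64,
                     picker=True)

def highlight_ids(scatter, df, ids, color=(0, 0, 0, 1)):
    """Hypothetical helper: paint the points whose id is in ids."""
    # Row positions in df equal point positions in the single scatter.
    positions = np.flatnonzero(df["id"].isin(ids).to_numpy())
    facecolors = scatter.get_facecolors().copy()
    facecolors[positions] = color
    scatter.set_facecolors(facecolors)
    return positions

def on_pick(event):
    # event.ind holds the positions of the picked points; use the first one.
    clicked_id = df["id"].iloc[event.ind[0]]
    highlight_ids(scatter, df, [clicked_id])
    fig.canvas.draw_idle()

fig.canvas.mpl_connect("pick_event", on_pick)

# IDs that do not exist in the DataFrame (here "ZZZZ") are simply ignored.
changed = highlight_ids(scatter, df, ["DCEB", "BEBB", "ZZZZ"])
```

Because the lookup goes from id to row position, the same helper covers highlighting points when only their IDs are known, regardless of group.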

Weighted data problems, mean is fine, but Covar and std look wrong, how do I adjust?

I'm trying to apply a weighted filter to the data, rather than use the raw data, before calculating stats (mu, std and covar). But the results clearly need adjusting.
# generate some data and a filter
import numpy as np
from pandas import DataFrame

f_n = 100
np.random.seed(seed=101)
foo = np.random.rand(f_n, 3)
foo = DataFrame(foo).add(1).pct_change()
f_filter = np.arange(f_n, 0, -1, dtype=float)
f_filter = 1.0 / (f_filter**(f_filter/f_n))
# normalise the filter ... This could be where I'm going wrong?
f_filter = f_filter * (f_n / f_filter.sum())
Now we are ready to look at some results
print(foo.mul(f_filter, axis=0).mean())
print(foo.mean())
0 0.039147
1 0.039013
2 0.037598
dtype: float64
0 0.035006
1 0.042244
2 0.041956
dtype: float64
Means all look in line, but when we look at covar and std they are significantly different in terms of scale and also direction
print(foo.mul(f_filter, axis=0).cov())
print(foo.cov())
0 1 2
0 0.124766 -0.038954 0.027256
1 -0.038954 0.204269 0.056185
2 0.027256 0.056185 0.203934
0 1 2
0 0.070063 -0.014926 0.010434
1 -0.014926 0.099249 0.015573
2 0.010434 0.015573 0.087060
print(foo.mul(f_filter, axis=0).std())
print(foo.std())
0 0.353223
1 0.451961
2 0.451590
dtype: float64
0 0.264694
1 0.315037
2 0.295060
dtype: float64
Any ideas why? How can we adjust the filter, or adjust the covar matrix, to make the results more comparable?
The problem is your weighting function. (Do you want Gaussian random numbers or uniform r.v.?) See this plot
import numpy as np
import matplotlib.pyplot as plt
from pandas import DataFrame

f_n = 100
np.random.seed(seed=101)
# ??? Do you want uniform random variables? Or is this just a typo and you want normal random variables?
foo = np.random.rand(f_n, 3)
foo = DataFrame(foo)
f_filter = np.arange(f_n, 0, -1, dtype=float)
# Here is the problem: uneven weights create an artificial trend, which makes
# the data non-stationary. Covariance only works for stationary data.
# =============================================
f_filter = 1.0 / (f_filter**(f_filter/f_n))
fig, ax = plt.subplots()
ax.plot(f_filter)
Uneven weights create an artificial trend (your random numbers are all positive uniforms), which makes the data non-stationary. Covariance only works for stationary data. Take a look at the resulting weighted data.
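A separate point worth noting: multiplying each observation by its weight rescales the observations themselves, so their spread changes with the weights. If the intent is a weighted mean and a weighted covariance, NumPy offers np.average(..., weights=...) and np.cov(..., aweights=...), which weight each row's contribution instead. This is a sketch of that alternative, not the approach taken in the answer above. The filter is the one from the question, and the data is plain uniform noise.

```python
import numpy as np

f_n = 100
rng = np.random.default_rng(101)
foo = rng.random((f_n, 3))  # uniform noise, one row per observation

# Weights from the question: later rows get more weight.
f_filter = np.arange(f_n, 0, -1, dtype=float)
f_filter = 1.0 / (f_filter ** (f_filter / f_n))

# Weighted mean and covariance: the weights apply to each row's
# contribution, not to the values themselves.
w_mean = np.average(foo, axis=0, weights=f_filter)
w_cov = np.cov(foo, rowvar=False, aweights=f_filter)
w_std = np.sqrt(np.diag(w_cov))

print(w_mean)
print(w_std)
print(foo.std(axis=0, ddof=1))  # unweighted, for comparison
```

With this formulation no separate normalisation of the filter is needed, since both functions divide by the sum of the weights internally.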
