The initial step is a pandas DataFrame with several columns.
The second step was to convert some columns of this DataFrame into a NumPy array using the to_numpy() function.
I get something like:
[[100 200 3.5 1] [100 200 3.5 1] [100 300 6.2 1] [200 125 4.2 1] [100 300 6.2 1] [100 200 3.5 1]]
where the first element is an origin id,
the second element is a destiny id,
the third is the distance between origin and destiny,
and the fourth is just a counter (1 per element). (I have included it only because I think it could be needed to count elements; just ignore it if your proposed solution doesn't use it.)
I would like to have a scatterplot with the following specifications:
origin_id on the x axis
destiny_id on the y axis
color of the scatter dot on a warm scale that indicates the distance between the two points (3rd element)
size of the scatter dot depending on how many origin_id/destiny_id pairs we have; for example, there are three 100 200 combinations, so that dot should be bigger than the one for the 200 125 combination, which has only one entry
I have tried, but I'm not able to satisfy all of these requirements in one plot.
How can this be achieved in matplotlib? Or is there an easier approach using pandas directly?
If I understood your requirements correctly, this should do the trick:
import matplotlib.pyplot as plt
import numpy as np
data = np.array([[100,200,3.5,1],[100,200,3.5,1],[100,300,6.2,1],[200,125,4.2,1],[100,300,6.2,1],[100,200,3.5,1]])
unique, counts = np.unique(data, axis=0, return_counts=True)
x = unique[:,0]
y = unique[:,1]
c = unique[:,2]
# Figure out a nice-looking scaling factor here,
# and remember that the scatter point size is an area,
# hence squaring a base factor is ideal
s = (counts*10)**2
fig, ax = plt.subplots()
sca = ax.scatter(x,y,c=c,s=s)
plt.colorbar(sca)
plt.show()
which yields:
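Since you also asked about a pandas-only route: a minimal sketch of the same idea using groupby, reusing the data array and the plt import from above. The column names are made up for illustration, and cmap="autumn" is just one choice of warm scale:

import pandas as pd

df = pd.DataFrame(data, columns=["origin_id", "destiny_id", "distance", "count"])
# One row per unique (origin, destiny, distance) triple, plus the number of occurrences
grouped = df.groupby(["origin_id", "destiny_id", "distance"]).size().reset_index(name="n")
fig, ax = plt.subplots()
sca = ax.scatter(grouped["origin_id"], grouped["destiny_id"],
                 c=grouped["distance"], s=(grouped["n"] * 10) ** 2,
                 cmap="autumn")
plt.colorbar(sca)
plt.show()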
I'm trying to visualize how a distribution changes over time -- each vertical slice should be the distribution at that timestep.
I want it to look something like this (there are two such curves/temporal-histograms here).
The closest I've found is this seaborn time series, but I want the distribution or at least the min, mean, and max -- this band is a confidence interval, which I can't use (it's also prohibitively slow).
https://seaborn.pydata.org/examples/errorband_lineplots.html
Update:
Here's a snippet to produce some sample data.
import numpy as np
num_timesteps = 20
samples_per_timestep = 100
timesteps = np.arange(num_timesteps)
def get_std(t):
    return t if t < num_timesteps//2 else abs(num_timesteps - t)

samples = np.stack([np.random.normal(t, get_std(t), samples_per_timestep) for t in timesteps])
samples[t] is a sample of the distribution at timestep t. The distribution starts out constant (std = 0), widens, then narrows again.
Update: So I asked about this in the Seaborn repository, and this can be way simpler using Seaborn.
Here's an example:
import seaborn as sns
import seaborn.objects as so
fmri = sns.load_dataset("fmri").query("region == 'parietal'")
p = so.Plot(fmri, "timepoint", "signal")
for tail in [25, 10, 5, 1]:
    p = p.add(so.Band(), so.Perc([tail, 100 - tail]))

p.add(so.Line(), so.Agg("median"))
Which will result in this plot:
You can read more about it in Statistical estimation and error bars.
This is a lot less work and scales better. Hope it helps!
I had the exact same problem, and took quite a few detours, but this is most definitely possible!
Imports
We need to import matplotlib, NumPy and Pandas.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
Input data
I assume you have the data as a Pandas Series, with the time/step on the index and the values as values.
Step    Wealth
0       0
0       0
0       0
0       0
1       7.89338
1       7.50838
1       2.00948
1       8.74963
I load my data from a pickle (in the format specified above):
wealth = pd.read_pickle("wealth.pickle")
You can download a ZIP with this pickle file here: wealth.zip.
Aggregate the data to percentiles
This part is a bit ugly. We first define a small helper function for each percentile we want to calculate:
# Define functions to calculate percentiles
def q1(x):
    return x.quantile(0.01)

def q5(x):
    return x.quantile(0.05)

def q25(x):
    return x.quantile(0.25)

def q50(x):
    return x.quantile(0.50)

def q75(x):
    return x.quantile(0.75)

def q95(x):
    return x.quantile(0.95)

def q99(x):
    return x.quantile(0.99)
If anyone reading this has a better/cleaner way to aggregate this, please let me know!
Note that these use the quantile method of the pandas Series, which (in my case) works well with the data in the series; numpy.quantile or numpy.percentile would be equivalent.
The data now needs to be grouped by the index (level=0) and then aggregated using the functions defined above:
w_agg = wealth.groupby(level=0).agg([q1, q5, q25, q50, q75, q95, q99])
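As an aside, a possibly cleaner alternative (a sketch, not what produced the results below): pandas can compute several quantiles in a single call, and unstack turns them into columns. The column labels then come out as 0.01, 0.05, ... instead of q1, q5, ...

quantiles = [0.01, 0.05, 0.25, 0.50, 0.75, 0.95, 0.99]
# Same aggregation in one call; unstack moves the quantile level into columns
w_agg2 = wealth.groupby(level=0).quantile(quantiles).unstack()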
And now w_agg looks like this:
Step        q1         q5       q25       q50       q75       q95       q99
   0         0          0         0         0         0         0         0
   1   -3.2311   0.759751    3.2881   6.03641   8.43206   11.9663   15.4515
   2  -3.22888   -1.15079   3.13756   6.41804   8.43206   12.7269   15.4515
   3  -5.31614   -1.91156   3.22287   6.54126   8.77544    14.644   15.5798
   4  -5.64095   -2.52143   2.65959   6.22455   9.40699   14.6545   15.9647
Plotting
Now we can start with the plotting. Aside from the regulars, we're using matplotlib's fill_between for this.
We create a figure, and then start with the widest percentile range: From 1 to 99. Then we draw 5 to 95 on top of that, then 25 to 75 on top of that and finally the median as a line.
Play a bit with the alpha and color values to make it look nice!
# Create a figure and axes
fig, ax = plt.subplots(figsize=(10, 6))
# Add the bands to the axes
ax.fill_between(x=w_agg.index, y1=w_agg["q1"], y2=w_agg["q99"], alpha=0.3, color="tab:blue")
ax.fill_between(x=w_agg.index, y1=w_agg["q5"], y2=w_agg["q95"], alpha=0.3, color="tab:blue")
ax.fill_between(x=w_agg.index, y1=w_agg["q25"], y2=w_agg["q75"], alpha=0.3, color="tab:blue")
# Plot the median as line
ax.plot(w_agg.index, w_agg["q50"], '-', color="tab:blue")
# Add title, legend and plot
ax.set_title("Distribution of wealth between population over time")
ax.set_xlabel("Time")
ax.set_ylabel("Wealth")
ax.legend([f"{n}% distribution" for n in [99, 90, 50]] + ["Median"], loc="upper left")
fig.tight_layout()
The result
For my dataset, this was the result:
I am creating a heatmap for the correlations between items.
sns.heatmap(df_corr, fmt=".2g", cmap='vlag', cbar=True, annot=True)
I chose vlag as it has red for high values, blue for low values, and white for the middle.
Seaborn automatically sets red for the highest value and blue for the lowest value in the dataframe.
However, as I am tracking Pearson's correlation, the value range is between -1 and 1, so I would like 1 to be represented by red, -1 by blue, and 0 by white.
What the result looks like:
How it should be*:
*(Of course this was generated by "cheating" - setting -1 as value(s) to force the range to be from -1 to 1; I want to set this range without warping my data)
Use vmin=-1 and vmax=1:
import numpy as np
import seaborn as sn
import matplotlib.pyplot as plt
data = np.random.uniform(low=-0.5, high=0.5, size=(5,5))
hm = sn.heatmap(data=data, cmap='vlag', annot=True, vmin=-1, vmax=1)
plt.show()
Here is an unorthodox solution. You can "standardize" your data to the range [-1, 1]. Even though the theoretical range of the Pearson coefficient is [-1, 1], strong negative correlations are not present in your dataset.
So you can create another dataframe that contains the data with its max being 1 and its min being -1. You can then plot this dataframe to get the desired effect. The advantage of this procedure is that the technique generalizes to pretty much any dataframe (not verified, though).
Here is the code -
import pandas as pd
import numpy as np

# Setting the target scale of the data
scale_minimum = -1
scale_maximum = 1
scale_range = scale_maximum - scale_minimum

# Applying the scaling
df_minimum, df_maximum = df.min(), df.max()   # Getting the range of the current dataframe
df_range = df_maximum - df_minimum            # The range of the data
df = (df - df_minimum) / df_range             # Scaling between 0 and 1
df_scaled = df*scale_range + scale_minimum    # Scaling between -1 and 1
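The scaled frame can then go into the same heatmap call as in the question (a sketch, assuming df_scaled holds the rescaled correlations):

import seaborn as sns
sns.heatmap(df_scaled, fmt=".2g", cmap='vlag', annot=True)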
Hope this solves your problem.
I have a matrix of floats shaped (3000, 9).
Each row is one "simulation".
The columns of a given row hold the contents of that simulation.
I want, for each simulation, the first 8 columns to be normalized to the sum of those first 8 columns.
That is, each entry in the first 8 columns (for a fixed row) should become its old value divided by the sum of the first 8 columns of that row.
A trivial task, but when plotting with plt.scatter I go from a nice, correct (non-normalized) graph to something totally unphysical.
The last column of each row is what we are going to use for the x axis, against which the first 8 columns (the y values) are plotted.
So one row represents 8 data points for one fixed value of x.
The non-normalized graph:
https://ibb.co/Msr8RVB
The normalized graph:
https://ibb.co/tJp7bZn
The datasets:
non-normalized: https://easyupload.io/oat9kq
My code:
import numpy as np
from matplotlib import pyplot as plt

non_norm = np.loadtxt("integration_results_3000samples_10_20_10_25_Wcm2_BenSimulationFromSlack.txt")

plt.figure()
for i in range(non_norm.shape[1]-1):
    plt.scatter(non_norm[:, -1], non_norm[:, i], label="c_{}".format(i+47))
plt.xscale("log")
plt.savefig("non-norm_Ben3000samples.pdf", bbox_inches='tight')

norm = np.empty((non_norm.shape[0], non_norm.shape[1]))
norm[:, -1] = non_norm[:, -1]
for i in range(norm.shape[1]-1):
    for j in range(norm.shape[0]):
        norm[j, i] = np.true_divide(non_norm[j, i], np.sum(non_norm[j, :-1]))

plt.figure()
for i in range(norm.shape[1]-1):
    plt.scatter(norm[:, -1], norm[:, i], label="c_{}".format(i+47))
plt.xscale("log")
plt.savefig("norm_Ben3000samples.pdf", bbox_inches='tight')
Do you see what went wrong?
Thank you
When you normalise a row that has just one nonzero value and seven zeroes, that value becomes 1 and the rest of the row stays 0. This is likely why your plot is messing up.
For example, the plot for the first column looks like this before and after normalization:
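Incidentally, the per-row normalization itself can be done without the double loop; a minimal sketch that should be equivalent to the loops in the question:

norm = non_norm.copy()
# Divide the first 8 columns of each row by that row's sum over those columns
norm[:, :-1] /= norm[:, :-1].sum(axis=1, keepdims=True)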
I have a pandas data frame that looks like:
X,Y,VAL
3,1,1221.231
3,3,121.2
3,4,4354.2
3,...,...
3,1200,12.1
...
5,1,756.3
5,2,12.01
5,...,...
...,...,...
123,110,23.1
123,1119,65.9
Note that the x,y values are in the first and second columns, different from what pyplot's imshow expects (a multidimensional array).
The first n rows contain all the Y values for the first X coordinate; after that comes the second X value with all the Y values related to it, and so on until the last row. The dots here indicate that the data continues over more rows.
The third column holds the quantity measured over the map.
I have tried to iterate over the data and to build a mesh using the unique method from the pandas library to construct the coordinates, but the irregularity of the source complicated things.
Is there any function able to process this into something "graphable" with imshow, or to convert it to another kind of table/matrix?
Using one of the proposed solutions isn't viable because the coloring can't be interpolated. My source mesh isn't that sparse, but unfortunately I can't guarantee that it is regular (though it can be in some cases).
But let's suppose that I have some data like
x y value
64 4 2743
64 8 3531
64 16 4543
64 32 5222
64 64 5730
128 4 2778
128 8 3500
128 16 4657
Is there any function able to convert this table to something imshow-compatible (row/column based on the x/y values), or do I need to iterate over it?
After trying the proposed solution, I ran into an optimization issue, and I was also not able to create the v_mesh when attempting to use pcolormesh. This can be seen in another question: Optimizing non regularized data reading to image.
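(For what it's worth, pandas can reshape this kind of long-format table directly. A sketch using pivot on the sample above; missing x/y pairs would simply come out as NaN, and the axis ticks would still need mapping back to the real x/y values:)

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"x": [64, 64, 64, 64, 64, 128, 128, 128],
                   "y": [4, 8, 16, 32, 64, 4, 8, 16],
                   "value": [2743, 3531, 4543, 5222, 5730, 2778, 3500, 4657]})
# Rows become y values, columns become x values
grid = df.pivot(index="y", columns="x", values="value")
plt.imshow(grid, cmap="Reds", interpolation="none", aspect="auto")
plt.show()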
OK, the second dataset can easily be fit into an image array. Basically, if you use log2(x) and log2(y) instead of x and y, it fits naturally into an image. So fill the image with the data and use log scales.
Something along these lines (untested code!):
x = ...
y = ...
v = ...
xmin = x[0]
xmax = x[-1]
ymin = y[0]
ymax = y[-1]
nx = len(x.unique())
ny = len(y.unique())
img = np.zeros((nx, ny))
for ix in range(0, nx):
    for iy in range(0, ny):
        v_idx = ix*ny + iy
        img[ix, iy] = v[v_idx]
plt.imshow(img, extent=[xmin, xmax, ymin, ymax], cmap=plt.cm.Reds, interpolation='none')
plt.xscale('log')
plt.yscale('log')
plt.show()
UPDATE
BTW, is using imshow() a hard requirement? You might want to take a look at pcolormesh()
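A minimal sketch of the pcolormesh route, assuming the same x, y, v columns as above (with v ordered by x, then y; the cell centers are just taken from the unique coordinate values):

import numpy as np
import matplotlib.pyplot as plt

xu = np.unique(x)
yu = np.unique(y)
# Reshape the flat value column into an (ny, nx) grid; pcolormesh expects rows along y
grid = np.asarray(v).reshape(len(xu), len(yu)).T
plt.pcolormesh(xu, yu, grid, shading='nearest', cmap=plt.cm.Reds)
plt.xscale('log')
plt.yscale('log')
plt.show()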
UPDATE II
When to use imshow over pcolormesh?
from __future__ import division
from mpl_toolkits.mplot3d.axes3d import Axes3D
from pylab import *
buf = '''64 4 2743
64 8 3531
64 16 4543
64 32 5222
64 64 5730
128 4 2778
128 8 3500
128 16 4657'''
data = np.array([list(map(int, l.split())) for l in buf.splitlines()])  # list(...) needed on Python 3
x = data[:,0]
y = data[:,1]
z = data[:,2]
clf()
ax = subplot(projection='3d')
c = z/z.max()  # normalize to [0, 1] for the colormap
# ex1
ax.scatter(x,y,z,c=cm.jet(c), marker='s')
ax.view_init(0,0)
draw()
Axes3D takes (x,y,z) input, so you can get away with no interpolation and no iteration. Is this overkill? Actually, a normal 2D scatter with color should also be enough. If you use the square marker 's' and do axis('off'), it becomes more similar to imshow; see the sketch below.
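A minimal sketch of that 2D variant (same x, y, z as above; the marker size is an arbitrary choice):

# Square markers colored by value, axes hidden, to mimic imshow
scatter(x, y, c=z, marker='s', s=500, cmap=cm.Reds)
axis('off')
show()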
This picture:
Please ignore the background image. The foreground chart is what I am interested in reproducing using pandas or numpy or scipy (or anything in IPython).
I have a dataframe where each row represents temperatures for a single day.
This is an example of some rows:
            100   200   300   400   500   600  ...  2300
10/3/2013  53°C  57°C  48°C  49°C  54°C  54°C       55°C
10/4/2013  45°C  47°C  48°C  49°C  50°C  52°C       57°C
Is there a way to get a chart that represents the changes from hour to hour, using the first column as a 'zero'?
Something quick and dirty that might get you most of the way there, assuming your data frame is named df:
import matplotlib.pyplot as plt
plt.imshow(df.T.diff().fillna(0.0).T.drop(df.columns[0], axis=1).values)  # drop the first (all-zero) column after diffing
Since I can't easily construct a sample version with your exact column labels, there might be slight additional tinkering with getting rid of any index columns that are included in the diff and moved with the transposition. But this worked to make a simple heat-map-ish plot for me on a random data example.
Then you can create a matplotlib figure or axis object and specify whatever you want for the x- and y-axis labels.
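For instance (a sketch; the axis labels are placeholders for whatever your data calls for):

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
im = ax.imshow(df.T.diff().fillna(0.0).T.drop(df.columns[0], axis=1).values, aspect='auto')
ax.set_xlabel("Hour")
ax.set_ylabel("Day")
fig.colorbar(im)
plt.show()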
You could just plot lines one at a time for each row with an offset:
import numpy as np
import matplotlib.pyplot as plt

nrows, ncols = 12, 30

# make up some fake data:
d = np.random.rand(nrows, ncols)
d *= np.sin(2*np.pi*np.arange(ncols)*4/ncols)
d *= np.exp(-0.5*(np.arange(nrows)-nrows/2)**2/(nrows/4)**2)[:,None]

# this is all you need, if you already have the data:
for i, r in enumerate(d):
    plt.fill_between(np.arange(ncols), r+(nrows-i)/2., lw=2, facecolor='white')
You could do it all at once if you don't need the fill color to block the previous line:
d += np.arange(nrows)[:, None]
plt.plot(d.T)