Querying data in pandas where points are grouped by a hexbin function - python

Both seaborn and pandas provide APIs for plotting bivariate histograms as hexbin plots (example plotted below). However, I want to query the points that fall in the same hexbin. Is there a function to retrieve the rows associated with the data points in each hexbin?
To give an example:
My data frame contains 3 columns: A, B and C. I use sns.jointplot(x=A,y=B) to plot the density. Now I want to run a query on the data points located in the same bin, for instance computing, for each bin, the mean of the C values of the points it contains.

Current solution -- Quick Hack
Currently, I have implemented the following function to apply a function to the data associated with a (x,y) coordinate located in the same hexbin:
import matplotlib.pyplot as plt
def hexagonify(x, y, values, func=None):
    hexagonized_list = []
    # Draw on a hidden figure; only the binning result is needed
    fig = plt.figure()
    fig.set_visible(False)
    if func is not None:
        image = plt.hexbin(x=x, y=y, C=values, reduce_C_function=func)
    else:
        image = plt.hexbin(x=x, y=y, C=values)
    # One reduced value and one center per hexbin
    values = image.get_array()
    verts = image.get_offsets()
    for offc in range(verts.shape[0]):
        binx, biny = verts[offc][0], verts[offc][1]
        val = values[offc]
        # Note: this also skips bins whose reduced value is exactly 0
        if val:
            hexagonized_list.append((binx, biny, val))
    fig.clear()
    plt.close(fig)
    return hexagonized_list
The values (an array of the same length as x and y) are passed through the values parameter. The hexbins are computed by matplotlib's hexbin function, and the per-bin results are retrieved with the get_array() method of the returned PolyCollection. By default, np.mean is applied to the values accumulated in each bin; this can be changed by passing a function through the func parameter. The get_offsets() method then gives the centers of the bins (discussed here). In this way, each hexbin is associated with (by default) the mean of the provided values. However, this solution is a hack, so any improvements are welcome.
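As a rough sketch of the same idea without relying on pyplot's global state, the function below draws the hexbin on a throwaway Axes and returns the bin centers and reduced values as arrays. The name hexbin_reduce, the gridsize default and the random test data are my own assumptions for illustration, not part of the original function.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend, so no window is opened
import matplotlib.pyplot as plt
import numpy as np


def hexbin_reduce(x, y, values, func=np.mean, gridsize=10):
    """Return (centers, reduced) for the non-empty hexbins.

    hexbin_reduce is a hypothetical helper name used for this sketch.
    """
    fig, ax = plt.subplots()
    try:
        coll = ax.hexbin(x, y, C=values, reduce_C_function=func,
                         gridsize=gridsize)
        centers = np.asarray(coll.get_offsets())  # shape (n_bins, 2)
        reduced = np.asarray(coll.get_array())    # shape (n_bins,)
    finally:
        plt.close(fig)  # always release the temporary figure
    return centers, reduced


# Made-up data: two normal coordinates and a uniform value column
rng = np.random.default_rng(0)
x, y = rng.normal(size=500), rng.normal(size=500)
c = rng.random(500)
centers, means = hexbin_reduce(x, y, c)
```

Returning arrays instead of filtering with a truthiness test keeps bins whose reduced value happens to be 0.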

From matplotlib
If you have already drawn the plot, you can get the bin counts from the PolyCollection returned by matplotlib:
polycollection: A PolyCollection instance; use PolyCollection.get_array on this to get the counts in each hexagon.
Binned counts (on rectangular rather than hexagonal bins) are also available from:
matplotlib.pyplot.hist2d;
numpy.histogram2d.
Pure pandas
Here is an MCVE using only pandas that can handle the C property:
import numpy as np
import pandas as pd
# Trial Dataset:
N=1000
d = np.array([np.random.randn(N), np.random.randn(N), np.random.rand(N)]).T
df = pd.DataFrame(d, columns=['x', 'y', 'c'])
# Create bins:
df['xb'] = pd.cut(df.x, 3)
df['yb'] = pd.cut(df.y, 3)
# Group by and Aggregate:
p = df.groupby(['xb', 'yb']).agg('mean')['c']
p.unstack()
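First we create bins using pandas.cut. Then we group by and aggregate. You can pick whichever agg function you like to aggregate C (e.g. max, median, etc.). Note that these bins are rectangular, not hexagonal.
The output looks roughly like this (the exact values vary because the data are random):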
yb               (-2.857, -0.936]  (-0.936, 0.98]  (0.98, 2.895]
xb
(-2.867, -0.76]          0.454424        0.519920       0.507443
(-0.76, 1.34]            0.535930        0.484818       0.513158
(1.34, 3.441]            0.441094        0.493657       0.385987
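If the goal is to get at the rows themselves and not just one aggregate, the same groupby object can be iterated. This is only a sketch on made-up data, still with rectangular bins; the names rows_by_bin and summary are mine.

```python
import numpy as np
import pandas as pd

# Made-up data with the same layout as the trial dataset
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x": rng.normal(size=1000),
    "y": rng.normal(size=1000),
    "c": rng.random(1000),
})
df["xb"] = pd.cut(df["x"], 3)
df["yb"] = pd.cut(df["y"], 3)

# observed=True keeps only the bins that actually contain points
grouped = df.groupby(["xb", "yb"], observed=True)

# Several aggregates of c per bin at once
summary = grouped["c"].agg(["count", "mean", "median"])

# The rows belonging to each bin, keyed by (x interval, y interval)
rows_by_bin = {key: rows for key, rows in grouped}
```

Each value in rows_by_bin is an ordinary DataFrame, so any further query can be run on the points of a single bin.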

Related

Interpolate: spectra (wavelength, counts) at a given temperature, to create grid of temperature and counts

I have a number of spectra: wavelength/counts at a given temperature. The wavelength range is the same for each spectrum.
I would like to interpolate between the temperatures and counts to create a large grid of spectra (temperature and counts over a given wavelength range).
The code below is my current progress. When I try to get a spectrum for a given temperature I only get one value of counts when I need a range of counts representing the spectrum (I already know the wavelengths).
I think I am confused about arrays and interpolation. What am I doing wrong?
import pandas as pd
import numpy as np
from scipy import interpolate
image_template_one = pd.read_excel("mr_image_one.xlsx")
counts = np.array(image_template_one['counts'])
temp = np.array(image_template_one['temp'])
inter = interpolate.interp1d(temp, counts, kind='linear')
temp_new = np.linspace(30,50,41)  # third argument is the number of points: 30 to 50 in steps of 0.5
counts_new = inter(temp_new)
I now think that I have two arrays: [wavelength, counts] and [wavelength, temperature]. Is this correct, and do I need to interpolate between the arrays?
Example data
I think what you want to achieve can be done with interp2d (note that interp2d is deprecated since SciPy 1.10 and removed in SciPy 1.14, so this needs an older SciPy):
import numpy as np
import pandas as pd
from scipy import interpolate
# dummy data
data = pd.DataFrame({
    'temp': [30]*6 + [40]*6 + [50]*6,
    'wave': 3 * [a for a in range(400,460,10)],
    'counts': np.random.uniform(.93,.95,18),
})
# make the interpolator
inter = interpolate.interp2d(data['temp'], data['wave'], data['counts'])
# scipy's interpolators return functions,
# which you need to call with the values you want interpolated.
new_x, new_y = np.linspace(30,50,100), np.linspace(400,450,100)
interpolated_values = inter(new_x, new_y)
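Since the wavelengths are the same for every spectrum, another option is to interpolate along temperature only. The sketch below reshapes the same kind of dummy data into a temperature by wavelength grid and uses interp1d with axis=0, so one call returns a whole spectrum. The variable names and the random counts are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from scipy import interpolate

# Dummy data: three temperatures, six wavelengths each
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "temp": [30] * 6 + [40] * 6 + [50] * 6,
    "wave": 3 * list(range(400, 460, 10)),
    "counts": rng.uniform(0.93, 0.95, 18),
})

# Rows are temperatures, columns are wavelengths
grid = data.pivot(index="temp", columns="wave", values="counts")

# Interpolate along the temperature axis for all wavelengths at once
inter = interpolate.interp1d(grid.index.to_numpy(), grid.to_numpy(),
                             axis=0, kind="linear")

# A full spectrum (one count per wavelength) at 35 degrees
spectrum_35 = inter(35.0)

# A grid of spectra from 30 to 50 in steps of 0.5
temp_new = np.arange(30, 50.5, 0.5)
spectra = pd.DataFrame(inter(temp_new), index=temp_new, columns=grid.columns)
```

This also explains the single value in the question: interpolating a 1-D counts array against temperature can only return one number per requested temperature.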

Pandas style based on logarithm of value

I'd like to style a Pandas DataFrame display with a background color that is based on the logarithm (base 10) of a value, rather than the data frame value itself. The numeric display should show the original values (along with specified numeric formatting), rather than the log of the values.
I've seen many solutions involving the apply and applymap methods, but am not really clear on how to use these, especially since I don't want to change the underlying dataframe.
Here is an example of the type of data I have. Using the "gradient" to highlight is not satisfactory, but highlighting based on the log base 10 would be really useful.
import pandas as pd
import numpy as np
E = np.array([1.26528431e-03, 2.03866202e-04, 6.64793821e-05, 1.88018687e-05,
4.80967314e-06, 1.22584958e-06, 3.09260354e-07, 7.76751705e-08])
df = pd.DataFrame(E,columns=['Error'])
df.style.format('{:.2e}'.format).background_gradient(cmap='Blues')
Since pandas 1.3.0, background_gradient has a gmap (gradient map) argument that lets you set the values that determine the background colors.
See the examples in the documentation (this link points to the dev docs): https://pandas.pydata.org/pandas-docs/dev/reference/api/pandas.io.formats.style.Styler.background_gradient.html#pandas.io.formats.style.Styler.background_gradient
I figured out how to use the apply function to do exactly what I want. I also discovered a few more features in Matplotlib's colors module, including LogNorm, which normalizes on a log scale. So in the end, this was relatively easy.
What I learned:
Do not use background_gradient, but rather supply your own function that maps DataFrame values to colors. The argument to the function is the dataframe to be displayed. The return value should be a dataframe with the same columns, etc., but with values replaced by colors, e.g. strings such as background-color:#ffaa44.
Pass this function as an argument to apply.
import pandas as pd
import numpy as np
from matplotlib import colors
import seaborn as sns
def color_log(x):
    # x is the whole DataFrame because axis=None is passed to apply
    df = x.copy()
    cmap = sns.color_palette("spring",as_cmap=True).reversed()
    evals = df['Error'].values
    # Normalize on a log scale so colors follow orders of magnitude
    norm = colors.LogNorm(vmin=1e-10,vmax=1)
    normed = norm(evals)
    cstr = "background-color: {:s}".format
    # cmap is already a Colormap object, so it can be called directly
    c = [cstr(colors.rgb2hex(rgba)) for rgba in cmap(normed)]
    df['Error'] = c
    return df
E = np.array([1.26528431e-03, 2.03866202e-04, 6.64793821e-05, 1.88018687e-05,
              4.80967314e-06, 1.22584958e-06, 3.09260354e-07, 7.76751705e-08])
df = pd.DataFrame(E,columns=['Error'])
df.style.format('{:.2e}'.format).apply(color_log,axis=None)
Note (1) The second argument to the apply function is an "axis". By supplying axis=None, the entire data frame is passed to color_log. Passing axis=0 passes in each column of the data frame as a Series; in that case, the code supplied above will not work. However, this would be useful for dataframes in which each column should be handled separately.
Note (2) If axis=None is used and the DataFrame has more than one column, the color mapping function passed to apply should set colors for all columns in the DataFrame. For example,
df.loc[:,:] = 'background-color:#eeeeee'
would set all columns to grey. Selected columns could then be overwritten with other color choices.
I would be happy to know if there is yet a simpler way to do this.
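To make Note (2) concrete, here is a sketch of a styling function for a frame with more than one column: it starts from grey everywhere and overwrites only the columns that should be log-colored. The function name color_log_all, the extra column n and the choice of the Blues colormap are assumptions for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib import colors


def color_log_all(data, log_columns=("Error",)):
    """Return a DataFrame of CSS strings with the same shape as data."""
    # Default style for every cell
    styles = pd.DataFrame("background-color: #eeeeee",
                          index=data.index, columns=data.columns)
    norm = colors.LogNorm(vmin=1e-10, vmax=1)
    cmap = plt.get_cmap("Blues")
    for col in log_columns:
        rgba = cmap(norm(data[col].to_numpy()))
        styles[col] = ["background-color: " + colors.rgb2hex(c) for c in rgba]
    return styles


# Made-up two-column frame: only 'Error' gets the log-scaled colors
df = pd.DataFrame({
    "n": [10, 20, 40, 80],
    "Error": [1.3e-03, 2.0e-04, 6.6e-05, 1.9e-05],
})
styles = color_log_all(df)
# In a notebook: df.style.apply(color_log_all, axis=None)
```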

Using adjustText to avoid label overlap with Python prince correspondence analysis

I have read about how effective the adjustText package is at avoiding label overlap, and I would like to apply it to the following diagram created by prince:
Here is the code that created the image:
import pandas as pd
import prince
from adjustText import adjust_text
pd.set_option('display.float_format', lambda x: '{:.6f}'.format(x))
X=pd.DataFrame(data=[ ... my data ... ],
columns=pd.Series([ ... my data ... ]),
index=pd.Series([ ... my data ...]),
)
ca = prince.CA(n_components=2,n_iter=3,copy=True,check_input=True,engine='auto',random_state=42)
ca = ca.fit(X)
ca.row_coordinates(X)
ca.column_coordinates(X)
ax = ca.plot_coordinates(X=X,ax=None,figsize=(6, 6),x_component=0,y_component=1,show_row_labels=True,show_col_labels=True)
ax.get_figure().savefig('figure.png')
In all examples of adjustText I could find, there was always a direct access to the coordinates of labels. How do I access the coordinates of labels in this case? How can I apply adjust_text to this figure?
First, deactivate label display by plot_coordinates():
ax = ca.plot_coordinates(X=X,ax=None,figsize=(6, 6),x_component=0,y_component=1,show_row_labels=False,show_col_labels=False)
Then, extract coordinates of columns and rows:
COLS=ca.column_coordinates(X).to_dict()
XCOLS=COLS[0]
YCOLS=COLS[1]
ROWS=ca.row_coordinates(X).to_dict()
XROWS=ROWS[0]
YROWS=ROWS[1]
Structures XCOLS, YCOLS, XROWS, YROWS are dictionaries with values that are floats (the coordinates). Let us merge the two x-axis dictionaries in a single x-axis dictionary I will call XGLOBAL, same thing for the y-axis dictionaries, into YGLOBAL:
XGLOBAL={ k : XCOLS.get(k,0)+XROWS.get(k,0) for k in set(XCOLS) | set(XROWS) }
YGLOBAL={ k : YCOLS.get(k,0)+YROWS.get(k,0) for k in set(YCOLS) | set(YROWS) }
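One caveat with the merge above: adding the two lookups only gives the right coordinate when a label appears in exactly one of the dictionaries. A small sketch with made-up labels and coordinates (not the real prince output) shows an explicit check plus a plain merge that gives the same result in the disjoint case:

```python
# Hypothetical coordinates standing in for the prince output
XCOLS = {"col_a": 0.12, "col_b": -0.40}
XROWS = {"row_1": 0.33, "row_2": -0.05}

# If a label were both a row and a column, summing would add two
# unrelated coordinates, so fail loudly instead
overlap = set(XCOLS) & set(XROWS)
if overlap:
    raise ValueError(f"labels used for both rows and columns: {overlap}")

# With disjoint keys, a plain merge is enough
XGLOBAL = {**XCOLS, **XROWS}

# The summing version from above, for comparison
summed = {k: XCOLS.get(k, 0) + XROWS.get(k, 0)
          for k in set(XCOLS) | set(XROWS)}
```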
Now I just apply adjust_text() as described in the documentation:
import matplotlib.pyplot as plt
fig = ax.get_figure()
texts=[plt.text(XGLOBAL[x],YGLOBAL[x],x,fontsize=7) for x in XGLOBAL.keys()]
adjust_text(texts,arrowprops=dict(arrowstyle='-', color='red'))
fig.savefig('newfigure.png')
And the result is:
Notice that while the image generation was instantaneous without adjust_text, it took around 40 seconds with adjust_text.
You can also give each label a small, slightly different rotation angle in the texts list comprehension. On my side, this helped the adjust_text routine.
# one reading of the intended rotation: a small angle that varies with the label's position in the list
texts=[plt.text(XGLOBAL[x],YGLOBAL[x],x,fontsize=7,
                rotation=-len(XGLOBAL)+2*i) for i, x in enumerate(XGLOBAL.keys())]

How to retrieve all data from seaborn distribution plot with multiple distributions?

The post Get data points from Seaborn distplot describes how you can get data elements using sns.distplot(x).get_lines()[0].get_data(), sns.distplot(x).patches and [h.get_height() for h in sns.distplot(x).patches]
But how can you do this if you've used multiple layers by plotting the data in a loop, such as:
Snippet 1
for var in list(df):
    print(var)
    distplot = sns.distplot(df[var])
Plot
Is there a way to retrieve the X and Y values for both linecharts and the bars?
Here's the whole setup for an easy copy&paste:
#%%
# imports
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import pylab
pylab.rcParams['figure.figsize'] = (8, 4)
import seaborn as sns
from collections import OrderedDict
# Function to build synthetic data
def sample(rSeed, periodLength, colNames):
    np.random.seed(rSeed)
    date = pd.to_datetime("1st of Dec, 1999")
    cols = OrderedDict()
    for col in colNames:
        cols[col] = np.random.normal(loc=0.0, scale=1.0, size=periodLength)
    dates = date+pd.to_timedelta(np.arange(periodLength), 'D')
    df = pd.DataFrame(cols, index = dates)
    return df
# Dataframe with synthetic data
df = sample(rSeed = 123, colNames = ['X1', 'X2'], periodLength = 50)
# sns.distplot with multiple layers
for var in list(df):
    myPlot = sns.distplot(df[var])
Here's what I've tried:
Y-values for histogram:
If I run:
barX = [h.get_height() for h in myPlot.patches]
Then I get the following list of length 11:
[0.046234272703757885,
0.1387028181112736,
0.346757045278184,
0.25428849987066837,
0.2542884998706682,
0.11558568175939472,
0.11875881712519201,
0.3087729245254993,
0.3087729245254993,
0.28502116110046083,
0.1662623439752689]
And this seems reasonable, since there seem to be 6 values for the blue bars and 5 values for the red bars. But how do I tell which values belong to which variable?
Y-values for line:
This seems a bit easier than the histogram part, since you can use myPlot.get_lines()[0].get_data() and myPlot.get_lines()[1].get_data() to get:
myPlot.get_lines()[0].get_data()
Out[678]:
(array([-4.54448949, -4.47612134, -4.40775319, -4.33938504, -4.27101689,
...
3.65968859, 3.72805675, 3.7964249 , 3.86479305, 3.9331612 ,
4.00152935, 4.0698975 , 4.13826565]),
array([0.00042479, 0.00042363, 0.000473 , 0.00057404, 0.00073097,
0.00095075, 0.00124272, 0.00161819, 0.00208994, 0.00267162,
...
0.0033384 , 0.00252219, 0.00188591, 0.00139919, 0.00103544,
0.00077219, 0.00059125, 0.00047871]))
myPlot.get_lines()[1].get_data()
Out[679]:
(array([-3.68337423, -3.6256517 , -3.56792917, -3.51020664, -3.4524841 ,
-3.39476157, -3.33703904, -3.27931651, -3.22159398, -3.16387145,
...
3.24332952, 3.30105205, 3.35877458, 3.41649711, 3.47421965,
3.53194218, 3.58966471, 3.64738724]),
array([0.00035842, 0.00038018, 0.00044152, 0.00054508, 0.00069579,
0.00090076, 0.00116922, 0.00151242, 0.0019436 , 0.00247792,
...
0.00215912, 0.00163627, 0.00123281, 0.00092711, 0.00070127,
0.00054097, 0.00043517, 0.00037599]))
But the whole thing still seems a bit cumbersome. So does anyone know of a more direct approach to perhaps retrieve all data to a dictionary or dataframe?
I had the same need to retrieve data from a seaborn distribution plot. What worked for me was to call the .findobj() method on each iteration's graph. The matplotlib.lines.Line2D objects it returns have a get_data() method, which is similar to what you mentioned above with myPlot.get_lines()[1].get_data().
Following your example code:
from matplotlib.lines import Line2D
data = []
for idx, var in enumerate(list(df)):
    myPlot = sns.distplot(df[var])
    # Find Line2D objects
    lines2D = [obj for obj in myPlot.findobj() if isinstance(obj, Line2D)]
    # Retrieving x, y data
    x, y = lines2D[idx].get_data()
    # Store as dataframe
    data.append(pd.DataFrame({'x':x, 'y':y}))
Notice here that the data for the first sns.distplot call is stored at the first index of lines2D and the data for the second call at the second index. I'm not really sure why it works this way (most likely because every call draws onto the same Axes, so the lines accumulate), but if you have more than two plots, you can access each sns.distplot's data by indexing lines2D at its respective position.
Finally, to verify, one can plot each distplot:
plt.plot(data[0].x, data[0].y)
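For the histogram bars, one way to tell which bars belong to which variable is to note how many patches the Axes holds before each call and keep only the ones added by that call. The sketch below uses plain ax.hist as a stand-in for sns.distplot(df[var], ax=ax), with made-up data and a fixed bins=5; I have not run it against distplot itself.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(123)
df = pd.DataFrame({"X1": rng.normal(size=50), "X2": rng.normal(size=50)})

fig, ax = plt.subplots()
bars = {}
for var in df.columns:
    n_before = len(ax.patches)  # bars already on the Axes
    # Stand-in for sns.distplot(df[var], ax=ax)
    ax.hist(df[var], bins=5, density=True, alpha=0.5)
    new_patches = ax.patches[n_before:]  # only the bars from this call
    bars[var] = pd.DataFrame({
        "x_left": [p.get_x() for p in new_patches],
        "width": [p.get_width() for p in new_patches],
        "height": [p.get_height() for p in new_patches],
    })
plt.close(fig)
```

bars is then a dictionary with one DataFrame of bar positions and heights per variable.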

Matplotlib: Dynamically relabel the x-axis when sampling data

I'm working on plotting some time series data; I have a very long time series (~68k lines). I've got a small helper function that grabs a small subset by slicing the list, performs some analysis, then plots the results.
The resulting plot is zero-indexed because I pass in only the sliced subset. I'd like it to be absolute indexed. How do I dynamically change the labels?
Here's something similar to what I'm looking at
data = load_data() # very long
start, end = get_start_end_from_cmd_line()
subset = data[start:end]
data_to_plot = process_data(subset)
plt.plot(data_to_plot, label="P1")
plt.show()
The problem is that my graph x-axis ranges from 0 to (end - start), not from start to end. I'd like to change the labels dynamically, so they're correct for any size graph I'm looking at, and correct if I pan or zoom
You need to pass the set of x values that pairs with the calculated set of y values; otherwise matplotlib defaults to a zero-based x axis.
Something like this, maybe:
import matplotlib.pyplot as plt
def get_start_end_from_cmd_line(): # substitute for your own function
    return 5, 9
def load_data(): # substitute for your own function
    x, y = (list(range(3, 13)), list(range(10)))
    return x, y
xdata, ydata = load_data()
start, end = get_start_end_from_cmd_line()
subsetx, subsety = xdata[start: end], ydata[start: end] # slice both the x and the corresponding y
plt.plot(subsetx, subsety, label="P1")
plt.show()
As the x data may only be implied in your dataset, you will need to construct a set of x values. In the simplest case (the original x values are integers that start at zero and increment by one), you could substitute:
subsetx = range(start, end) # matplotlib also accepts a range object instead of a list, so you do not need list(range(start, end))
Check out slice to construct more sophisticated sequences
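Put together for the case in the question, where only the y values exist, a minimal sketch could look like this. The data and the start and end values are made up.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend; drop this line for interactive use
import matplotlib.pyplot as plt

data = [v * v for v in range(100)]  # stand-in for the long series
start, end = 40, 60                 # stand-in for the command-line values

subset = data[start:end]
x = range(start, end)  # absolute positions of the sliced samples

fig, ax = plt.subplots()
(line,) = ax.plot(x, subset, label="P1")
ax.legend()
```

Because the x values are real data coordinates, the tick labels stay correct when panning or zooming.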
