Is there a way to both bin and aggregate (with some other function than count) in seaborn.objects? I'd like to compute the mean per bin and right now I'm using the following:
import seaborn.objects as so
import pandas as pd
import seaborn as sns
df = sns.load_dataset("penguins")
df2 = (
    df.groupby(pd.cut(df["bill_length_mm"], bins=30))[["bill_depth_mm"]]
    .mean()
)
df2["bill_length_mm"] = [x.mid for x in df2.index]
p = so.Plot(df2, x="bill_length_mm", y="bill_depth_mm").add(so.Bars())
p
There's not yet a binning operation separate from Hist (this could make sense as a Stat or a Scale, I'm not sure).
But note that you can do the aggregation more simply than you are in your example because you can pass a Series directly and don't need to construct a new dataframe:
(
    so.Plot(
        df,
        x="bill_depth_mm",
        y=pd.cut(df["bill_length_mm"], bins=30).map(lambda x: x.mid),
    )
    .add(so.Bars(), so.Agg("mean"))
)
Note that the Series will be aligned with the DataFrame (or other Series passed directly) based on the index information rather than position.
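The alignment-by-index behaviour mentioned above is the same as what pandas does on column assignment. The following is a small illustration in plain pandas (it does not go through seaborn, and the column names and values are made up for the example):

```python
import pandas as pd

# A frame with labelled rows.
df = pd.DataFrame({"value": [10, 20, 30]}, index=["a", "b", "c"])

# A Series carrying the same labels in a different row order.
labels = pd.Series(["for-c", "for-a", "for-b"], index=["c", "a", "b"])

# Assignment matches rows on index labels, not on position.
df["label"] = labels
aligned = df["label"].tolist()
print(aligned)
```

So if a Series has been filtered, sorted or re-indexed before being passed alongside the DataFrame, its values still follow their index labels rather than their row positions.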
I am trying to change the order of the variables I use to make a facet grid in xarray. For example, I have [a,b,c,d] as column names and I want to reorder them to [c,d,a,b]. Unfortunately, unlike seaborn, I could not find parameters such as col_order or row_order in the xarray plot function (https://xarray.pydata.org/en/stable/generated/xarray.plot.FacetGrid.html).
Update:
To help myself better explain what I need, I took the example below from the user guide of xarray:
In the following example, I need to change the positions of the months. For example, I want to put month 7 in the first column, month 2 in the fifth column, and so on.
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import xarray as xr
ds = xr.tutorial.open_dataset("air_temperature.nc").rename({"air": "Tair"})
# we will add a gradient field with appropriate attributes
ds["dTdx"] = ds.Tair.differentiate("lon") / 110e3 / np.cos(ds.lat * np.pi / 180)
ds["dTdy"] = ds.Tair.differentiate("lat") / 105e3
ds.dTdx.attrs = {"long_name": "$∂T/∂x$", "units": "°C/m"}
ds.dTdy.attrs = {"long_name": "$∂T/∂y$", "units": "°C/m"}
monthly_means = ds.groupby("time.month").mean()
# xarray's groupby reductions drop attributes. Let's assign them back so we get nice labels.
monthly_means.Tair.attrs = ds.Tair.attrs
fg = monthly_means.Tair.plot(
    col="month",
    col_wrap=4,  # each row has a maximum of 4 columns
)
plt.show()
Any help is highly appreciated.
xarray will respect the shape of your data, so you can rearrange the data prior to plotting:
In [2]: ds = xr.tutorial.open_dataset("air_temperature.nc")
In [3]: ds_mon = ds.groupby("time.month").mean()
In [4]: # order the data by month, descending
...: ds_mon.air.sel(month=list(range(12, 0, -1))).plot(
...: col="month", col_wrap=4,
...: )
Out[4]: <xarray.plot.facetgrid.FacetGrid at 0x16b9a7700>
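The same idea works for an arbitrary order, not just descending. Here is a minimal sketch on a small synthetic DataArray (so it runs without downloading the tutorial dataset); the name demo and the particular order are hypothetical and only illustrate that sel with a list returns the data in the order given:

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for a monthly mean: one value per month, 1..12.
months = np.arange(1, 13)
da = xr.DataArray(months * 10.0, dims="month", coords={"month": months}, name="demo")

# Hypothetical custom order: start the sequence at month 7.
new_order = [7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6]

# Selecting with a list of labels returns the entries in that order.
reordered = da.sel(month=new_order)
print(reordered.month.values.tolist())
```

Applied to the example above, that would be something like ds_mon.air.sel(month=new_order).plot(col="month", col_wrap=4).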
I'd like to style a Pandas DataFrame display with a background color that is based on the logarithm (base 10) of a value, rather than the data frame value itself. The numeric display should show the original values (along with specified numeric formatting), rather than the log of the values.
I've seen many solutions involving the apply and applymap methods, but am not really clear on how to use these, especially since I don't want to change the underlying dataframe.
Here is an example of the type of data I have. Using the "gradient" to highlight is not satisfactory, but highlighting based on the log base 10 would be really useful.
import pandas as pd
import numpy as np
E = np.array([1.26528431e-03, 2.03866202e-04, 6.64793821e-05, 1.88018687e-05,
4.80967314e-06, 1.22584958e-06, 3.09260354e-07, 7.76751705e-08])
df = pd.DataFrame(E,columns=['Error'])
df.style.format('{:.2e}'.format).background_gradient(cmap='Blues')
Since pandas 1.3.0, background_gradient now has a gmap (gradient map) argument that allows you to set the values that determine the background colors.
See the examples here (this link is to the dev docs - can be replaced once 1.3.0 is released) https://pandas.pydata.org/pandas-docs/dev/reference/api/pandas.io.formats.style.Styler.background_gradient.html#pandas.io.formats.style.Styler.background_gradient
I figured out how to use the apply function to do exactly what I want. And also, I discovered a few more features in Matplotlib's colors module, including LogNorm which normalizes using a log. So in the end, this was relatively easy.
What I learned :
Do not use background_gradient, but rather supply your own function that maps DataFrame values to colors. The argument to the function is the DataFrame to be displayed. The return value should be a DataFrame with the same index and columns, but with the values replaced by CSS strings such as background-color:#ffaa44.
Pass this function as an argument to apply.
import pandas as pd
import numpy as np
from matplotlib import colors
import seaborn as sns

def color_log(x):
    # Work on a copy so the DataFrame being displayed is left untouched.
    df = x.copy()
    cmap = sns.color_palette("spring", as_cmap=True).reversed()
    evals = df['Error'].values
    # LogNorm maps the values onto [0, 1] on a log scale.
    norm = colors.LogNorm(vmin=1e-10, vmax=1)
    normed = norm(evals)
    cstr = "background-color: {:s}".format
    # cmap is already a Colormap object, so it can be called directly.
    c = [cstr(colors.rgb2hex(rgba)) for rgba in cmap(normed)]
    df['Error'] = c
    return df
E = np.array([1.26528431e-03, 2.03866202e-04, 6.64793821e-05, 1.88018687e-05,
4.80967314e-06, 1.22584958e-06, 3.09260354e-07, 7.76751705e-08])
df = pd.DataFrame(E,columns=['Error'])
df.style.format('{:.2e}'.format).apply(color_log,axis=None)
Note (1) The second argument to the apply function is an "axis". By supplying axis=None, the entire data frame will be passed to color_log. Passing axis=0 will pass in each column of the data frame as a Series. In this case, the code supplied above will not work. However, this would be useful for dataframes in which each column should be handled separately.
Note (2) If axis=None is used, and the DataFrame has more than one column, the color mapping function passed to apply should set colors for all columns in the DataFrame. For example,
df.loc[:, :] = 'background-color:#eeeeee'
would set all columns to grey. Selected columns could then be overwritten with other color choices.
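As a sketch of Note (2), the function below builds the frame of CSS strings from scratch instead of overwriting a copy of the data. The second column, Steps, and the function name color_all are hypothetical and only serve to show a default color plus a per-column override:

```python
import pandas as pd

def color_all(x):
    # Start with every cell grey; same index and columns as the input.
    css = pd.DataFrame("background-color: #eeeeee", index=x.index, columns=x.columns)
    # Override a single column with a different color.
    css["Error"] = "background-color: #ffaa44"
    return css

df = pd.DataFrame({"Error": [1e-3, 1e-5], "Steps": [10, 20]})
css = color_all(df)
print(css)
```

The function would then be passed to the Styler as df.style.apply(color_all, axis=None).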
I would be happy to know if there is yet a simpler way to do this.
I have converted a continuous dataset to categorical. After the conversion I get NaN values whenever the value of the continuous data is 0.0. Below is my code:
import pandas as pd
import matplotlib as plt
df = pd.read_csv('NSL-KDD/KDDTrain+.txt',header=None)
data = df[33]
bins = [0.000,0.05,0.10,0.15,0.20,0.25,0.30,0.35,0.40,0.45,0.50,0.55,0.60,0.65,0.70,0.75,0.80,0.85,0.90,0.95,1.00]
category = pd.cut(data,bins)
category = category.to_frame()
print (category)
How do I convert the values so that I don't get NaN values? I have attached two screenshots to show what the actual data looks like and what the converted data looks like: the first is the main dataset, and the second is what it becomes after using bins and pandas.cut(). How can those "0.00" values be kept like the other values in the dataset?
When using pd.cut, you can specify the parameter include_lowest=True. This makes the first interval left-inclusive, so it will include the value 0 because your first interval starts at 0.
So in your case, you can adjust your code to be
import pandas as pd
import matplotlib as plt
df = pd.read_csv('NSL-KDD/KDDTrain+.txt',header=None)
data = df[33]
bins = [0.000,0.05,0.10,0.15,0.20,0.25,0.30,0.35,0.40,0.45,0.50,0.55,0.60,0.65,0.70,0.75,0.80,0.85,0.90,0.95,1.00]
category = pd.cut(data,bins,include_lowest=True)
category = category.to_frame()
print (category)
Documentation Reference for pd.cut
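To see the effect of include_lowest in isolation, here is a small sketch on made-up values (the data and the shortened bin list are only for illustration, not taken from the NSL-KDD file):

```python
import pandas as pd

data = pd.Series([0.0, 0.03, 0.05, 0.5, 1.0])
bins = [0.0, 0.05, 0.5, 1.0]

# Default: intervals are open on the left, so 0.0 falls outside the first bin.
default = pd.cut(data, bins)

# include_lowest=True makes the first interval include its left edge.
inclusive = pd.cut(data, bins, include_lowest=True)

print(default.isna().tolist())
print(inclusive.isna().tolist())
```

With the default settings only the 0.0 entry becomes NaN; with include_lowest=True it lands in the same bin as 0.03.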
Both seaborn and pandas provide APIs in order to plot bivariate histograms as a hexbin plot (example plotted below). However, I am searching to execute a query for the points that are located in the same hexbin. Is there a function to retrieve the rows associated with the data points in the hexbin?
To give an example:
My data frame contains 3 columns: A, B and C. I use sns.jointplot(x=A,y=B) to plot the density. Now, I want to execute a query on the data points located in the same bin. For instance, for each bin, compute the mean of the C values associated with the points in that bin.
Current solution -- Quick Hack
Currently, I have implemented the following function to apply a function to the data associated with a (x,y) coordinate located in the same hexbin:
import matplotlib.pyplot as plt

def hexagonify(x, y, values, func=None):
    hexagonized_list = []
    # Draw into a hidden, throwaway figure; only the binning result is needed.
    fig = plt.figure()
    fig.set_visible(False)
    if func is not None:
        image = plt.hexbin(x=x, y=y, C=values, reduce_C_function=func)
    else:
        image = plt.hexbin(x=x, y=y, C=values)
    values = image.get_array()   # aggregated value per hexagon
    verts = image.get_offsets()  # hexagon centers
    for offc in range(verts.shape[0]):
        binx, biny = verts[offc][0], verts[offc][1]
        val = values[offc]
        if val:
            hexagonized_list.append((binx, biny, val))
    fig.clear()
    plt.close(fig)
    return hexagonized_list
The values (with the same size as x or y) are passed through the values parameter. The hexbins are computed by matplotlib's hexbin function. The aggregated values are retrieved through the get_array() method of the returned PolyCollection. By default, np.mean is applied to the accumulated values per bin. This can be changed by providing a function to the func parameter. Subsequently, the get_offsets() method gives us the centers of the bins (discussed here). In this way, we can associate (by default) the mean of the provided values with each hexbin. However, this solution is a hack, so any improvements are welcome.
From matplotlib
If you have already drawn the plot, you can get Bin Counts from polycollection returned by matplotlib:
polycollection: A PolyCollection instance; use PolyCollection.get_array on this to get the counts in each hexagon.
This functionality is also available in:
matplotlib.pyplot.hist2d;
numpy.histogram2d;
Pure pandas
Here is an MCVE using only pandas that can handle the C property:
import numpy as np
import pandas as pd
# Trial Dataset:
N=1000
d = np.array([np.random.randn(N), np.random.randn(N), np.random.rand(N)]).T
df = pd.DataFrame(d, columns=['x', 'y', 'c'])
# Create bins:
df['xb'] = pd.cut(df.x, 3)
df['yb'] = pd.cut(df.y, 3)
# Group by and Aggregate:
p = df.groupby(['xb', 'yb']).agg('mean')['c']
p.unstack()
First we create bins using pandas.cut. Then we group by and aggregate. You can pick the agg function you like to aggregate C (eg. max, median, etc.).
The output looks roughly like this (the exact numbers depend on the random data):
yb               (-2.857, -0.936]  (-0.936, 0.98]  (0.98, 2.895]
xb
(-2.867, -0.76]          0.454424        0.519920       0.507443
(-0.76, 1.34]            0.535930        0.484818       0.513158
(1.34, 3.441]            0.441094        0.493657       0.385987
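The same groupby object can also hand back the rows that fall into each bin, which is what the question asks for. A minimal sketch, assuming rectangular pd.cut bins are an acceptable substitute for hexagons (the names rows_per_bin and means are made up for the example):

```python
import numpy as np
import pandas as pd

# Reproducible trial data with the same layout as above.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=200),
                   "y": rng.normal(size=200),
                   "c": rng.random(200)})

df["xb"] = pd.cut(df.x, 3)
df["yb"] = pd.cut(df.y, 3)

# observed=True skips bin combinations that contain no points.
grouped = df.groupby(["xb", "yb"], observed=True)

# Iterating the groupby yields ((x_bin, y_bin), rows_in_that_bin) pairs.
rows_per_bin = {key: rows for key, rows in grouped}

# Any per-bin query can then run on those rows; the mean of c is one example.
means = grouped["c"].mean()
```

Each value in rows_per_bin is an ordinary DataFrame holding the original rows of that bin, so arbitrary queries can be run on it.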
I want to apply log2 to my data with applymap and np.log2 and show it using a boxplot. Here is the code I have written:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
data = pd.read_csv('testdata.csv')
df = pd.DataFrame(data)
################################
# a.
df.boxplot()
plt.title('Raw Data')
################################
# b.
df.applymap(np.log2)
df.boxplot()
plt.title('Normalized Data')
Below is the boxplot I get for my raw data, which is okay, but I get the same boxplot after applying the log2 transformation. Can anyone please tell me what I am doing wrong and what should be corrected to get the normalized data with applymap and np.log2?
A much faster way to do this would be:
df = np.log2(df)
Don't forget to assign the result back to df.
According to API Reference DataFrame.applymap(func)
Apply a function to a DataFrame that is intended to operate
elementwise, i.e. like doing map(func, series) for each series in the
DataFrame
It won't change the DataFrame in place; you need to capture the return value and use it.
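A small sketch of the difference, on made-up data, using the np.log2(df) form from the first answer (applymap behaves the same way in that it returns a new DataFrame):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 4.0], "b": [8.0, 16.0, 32.0]})

# The result is computed but thrown away; df itself is unchanged.
np.log2(df)
unchanged = df.copy()

# Assigning the result back is what actually replaces the data.
df = np.log2(df)
print(df)
```

Calling df.boxplot() after the assignment would then plot the transformed values.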
Pandas now has the transform() function, which in your case amounts to:
df = df.transform(lambda x: np.log2(x))