Pandas style based on logarithm of value - python

I'd like to style a Pandas DataFrame display with a background color that is based on the logarithm (base 10) of a value, rather than the data frame value itself. The numeric display should show the original values (along with specified numeric formatting), rather than the log of the values.
I've seen many solutions involving the apply and applymap methods, but am not really clear on how to use these, especially since I don't want to change the underlying dataframe.
Here is an example of the type of data I have. Using background_gradient on the raw values to highlight is not satisfactory, but highlighting based on the log base 10 would be really useful.
import pandas as pd
import numpy as np
E = np.array([1.26528431e-03, 2.03866202e-04, 6.64793821e-05, 1.88018687e-05,
4.80967314e-06, 1.22584958e-06, 3.09260354e-07, 7.76751705e-08])
df = pd.DataFrame(E,columns=['Error'])
df.style.format('{:.2e}'.format).background_gradient(cmap='Blues')

Since pandas 1.3.0, background_gradient now has a gmap (gradient map) argument that allows you to set the values that determine the background colors.
See the examples here (this links to the dev docs, since 1.3.0 was not yet released at the time of writing): https://pandas.pydata.org/pandas-docs/dev/reference/api/pandas.io.formats.style.Styler.background_gradient.html#pandas.io.formats.style.Styler.background_gradient

I figured out how to use the apply function to do exactly what I want. And also, I discovered a few more features in Matplotlib's colors module, including LogNorm which normalizes using a log. So in the end, this was relatively easy.
What I learned:
Do not use background_gradient; instead, supply your own function that maps DataFrame values to colors. The argument to the function is the DataFrame to be displayed. The return value should be a DataFrame with the same shape and labels, but with values replaced by CSS strings, e.g. 'background-color:#ffaa44'.
Pass this function as an argument to apply.
import pandas as pd
import numpy as np
from matplotlib import colors, cm
import seaborn as sns
def color_log(x):
    df = x.copy()
    cmap = sns.color_palette("spring", as_cmap=True).reversed()
    evals = df['Error'].values
    norm = colors.LogNorm(vmin=1e-10, vmax=1)
    normed = norm(evals)
    cstr = "background-color: {:s}".format
    c = [cstr(colors.rgb2hex(v)) for v in cm.get_cmap(cmap)(normed)]
    df['Error'] = c
    return df
E = np.array([1.26528431e-03, 2.03866202e-04, 6.64793821e-05, 1.88018687e-05,
4.80967314e-06, 1.22584958e-06, 3.09260354e-07, 7.76751705e-08])
df = pd.DataFrame(E,columns=['Error'])
df.style.format('{:.2e}'.format).apply(color_log,axis=None)
Note (1) The second argument to the apply function is an "axis". By supplying axis=None, the entire data frame will be passed to color_log. Passing axis=0 will pass in each column of the data frame as a Series. In this case, the code supplied above will not work. However, this would be useful for dataframes in which each column should be handled separately.
Note (2) If axis=None is used, and the DataFrame has more than one column, the color mapping function passed to apply should set colors for all columns in the DataFrame. For example,
df.loc[:,:] = 'background-color:#eeeeee'
would set all cells to grey. Selected columns could then be overwritten with other color choices.
I would be happy to know if there is yet a simpler way to do this.

Related

Plotnine's scale fill and axis position

I would like to move the x-axis to the top of my plot and manually fill the colors. However, the usual method in ggplot does not work in plotnine. When I provide the position='top' in my scale_x_continuous() I receive the warning: PlotnineWarning: scale_x_continuous could not recognize parameter 'position'. I understand position is not in plotnine's scale_x_continuous, but what is the replacement? Also, scale_fill_manual() results in an Invalid RGBA argument: 'color' error. Specifically, the value requires an array-like object. Thus I provided the array of colors, but still had an issue. How do I manually set the colors for a scale_fill object?
import numpy as np
import pandas as pd
from plotnine import *
lst = [[1,1,'a'],[2,2,'a'],[3,3,'a'],[4,4,'b'],[5,5,'b']]
df = pd.DataFrame(lst, columns =['xx', 'yy','lbls'])
fill_clrs = {'a': 'goldenrod1',
'b': 'darkslategray3'}
ggplot()+\
geom_tile(aes(x='xx', y='yy', fill = 'lbls'), df) +\
geom_text(aes(x='xx', y='yy', label='lbls'),df, color='white')+\
scale_x_continuous(expand=(0,0), position = "top")+\
scale_fill_manual(values = np.array(list(fill_clrs.values())))
Plotnine does not support changing the position of any axis.
You can pass a list or a dict of colour values to scale_fill_manual provided they are recognisable colour names. The colours you have are obscure and they are not recognised. To see that it works try 'red' and 'green', see https://matplotlib.org/gallery/color/named_colors.html for all the named colors. Otherwise, you can also use hex colors e.g. #ff00cc.
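Whether a name will be accepted can be checked with matplotlib's own validator (plotnine uses matplotlib's color handling under the hood), for example:

```python
from matplotlib.colors import is_color_like

# R palette names such as 'goldenrod1' are not matplotlib colors;
# plain matplotlib names and hex strings are.
for name in ['goldenrod1', 'darkslategray3', 'goldenrod', 'red', '#ff00cc']:
    print(name, is_color_like(name))
```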

Querying data in pandas where points are grouped by a hexbin function

Both seaborn and pandas provide APIs in order to plot bivariate histograms as a hexbin plot (example plotted below). However, I am searching to execute a query for the points that are located in the same hexbin. Is there a function to retrieve the rows associated with the data points in the hexbin?
To give an example:
My data frame contains three columns: A, B and C. I use sns.jointplot(x=A, y=B) to plot the density. Now, I want to execute a query on each data point located in the same bin. For instance, for each bin, compute the mean of the C value associated with each point.
Current solution -- Quick Hack
Currently, I have implemented the following function to apply a function to the data associated with a (x,y) coordinate located in the same hexbin:
import matplotlib.pyplot as plt

def hexagonify(x, y, values, func=None):
    hexagonized_list = []
    fig = plt.figure()
    fig.set_visible(False)
    if func is not None:
        image = plt.hexbin(x=x, y=y, C=values, reduce_C_function=func)
    else:
        image = plt.hexbin(x=x, y=y, C=values)
    values = image.get_array()
    verts = image.get_offsets()
    for offc in range(verts.shape[0]):
        binx, biny = verts[offc][0], verts[offc][1]
        val = values[offc]
        if val:
            hexagonized_list.append((binx, biny, val))
    fig.clear()
    plt.close(fig)
    return hexagonized_list
The values (with the same size as x or y) are passed through the values parameter. The hexbins are computed through the hexbin function of matplotlib. The values are retrieved through the get_array() function of the returned PolyCollection. By default, the np.mean function is applied to the accumulated values per bin. This behaviour can be changed by providing a function to the func parameter. Subsequently, the get_offsets() method allows us to calculate the centers of the bins (discussed here). In this way, we can associate the (by default) mean of the provided values with each hexbin. However, this solution is a hack, so any improvements to this solution are welcome.
From matplotlib
If you have already drawn the plot, you can get Bin Counts from polycollection returned by matplotlib:
polycollection: A PolyCollection instance; use PolyCollection.get_array on this to get the counts in each hexagon.
This functionality is also available in:
matplotlib.pyplot.hist2d;
numpy.histogram2d;
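For instance, numpy.histogram2d can reproduce the per-bin aggregation on a rectangular (rather than hexagonal) grid; a sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
y = rng.standard_normal(1000)
c = rng.random(1000)

# Count points per bin, and sum c per bin via weights; their ratio is the
# per-bin mean of c (NaN where a bin is empty).
counts, xedges, yedges = np.histogram2d(x, y, bins=5)
sums, _, _ = np.histogram2d(x, y, bins=[xedges, yedges], weights=c)
with np.errstate(invalid='ignore', divide='ignore'):
    mean_c = sums / counts
```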
Pure pandas
Here is an MCVE using only pandas that can handle the C property:
import numpy as np
import pandas as pd
# Trial Dataset:
N=1000
d = np.array([np.random.randn(N), np.random.randn(N), np.random.rand(N)]).T
df = pd.DataFrame(d, columns=['x', 'y', 'c'])
# Create bins:
df['xb'] = pd.cut(df.x, 3)
df['yb'] = pd.cut(df.y, 3)
# Group by and Aggregate:
p = df.groupby(['xb', 'yb']).agg('mean')['c']
p.unstack()
First we create bins using pandas.cut, then we group by and aggregate. You can pick whichever agg function you like to aggregate C (e.g. max, median, etc.).
The output looks something like:
yb (-2.857, -0.936] (-0.936, 0.98] (0.98, 2.895]
xb
(-2.867, -0.76] 0.454424 0.519920 0.507443
(-0.76, 1.34] 0.535930 0.484818 0.513158
(1.34, 3.441] 0.441094 0.493657 0.385987

Plotting pandas dataframe after doing pandas melt is slow and creates strange y-axis

This could be caused by me not understanding how pandas.melt works, but I get strange behaviour when plotting "melted" dataframes using plotnine. Both frames have been converted from wide to long format: one frame with a column containing string values (df_slow) and another with only numerical values (df_fast).
The following code gives different behaviour. Plotting df_slow is slow and gives a strange looking y-axis. Plotting df_fast looks ok and is fast. My guess is that pandas melt is doing something strange with the data which causes this behaviour. See example plots.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotnine as p9
SIZE = 200
value = np.random.rand(SIZE, 1)
# Create test data, one with a column containing strings, one with only numeric values
df_slow = pd.DataFrame({'value': value.flatten(), 'string_column': ['A']*SIZE})
df_fast = pd.DataFrame({'value': value.flatten()})
# Set index
df_slow = df_slow.reset_index()
df_fast = df_fast.reset_index()
# Convert 'df_slow', 'df_fast' to long format
df_slow = pd.melt(df_slow, id_vars='index')
df_fast = pd.melt(df_fast, id_vars='index')
print(df_slow.head())
print(df_fast.head())
df_slow = df_slow[df_slow.variable == 'value']
# This is slow and has many breaks on y-axis
p = (p9.ggplot(df_slow, p9.aes(x='index', y='value')) + p9.geom_point())
p.draw()
# This is much faster and y-axis looks good
p = (p9.ggplot(df_fast, p9.aes(x='index', y='value')) + p9.geom_point())
p.draw()
slow and strange plot
fast and good looking plot
Possible fix
Changing the type of the "value" column in df_slow makes it behave like df_fast when plotting.
# This makes df_slow behave like df_fast when plotting
df_slow['value'] = df_slow.value.astype(np.float64)
Question
Is this a bug in plotnine (or pandas) or am I doing something wrong?
Answer
When melting two columns with different data types, in this case string and float, it makes sense that the resulting column containing both strings and floats will have the dtype object. As ALollz pointed out, this probably makes plotnine interpret the values as strings, which causes this behaviour.

Apply log2 transformation to a pandas DataFrame

I want to apply a log2 transformation with applymap and np.log2 to a DataFrame and show the result using a boxplot. Here is the code I have written:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
data = pd.read_csv('testdata.csv')
df = pd.DataFrame(data)
################################
# a.
df.boxplot()
plt.title('Raw Data')
################################
# b.
df.applymap(np.log2)
df.boxplot()
plt.title('Normalized Data')
Below is the boxplot I get for my raw data, which is okay, but I get the same boxplot after applying the log2 transformation! Can anyone please tell me what I am doing wrong, and what should be corrected to get the normalized data with applymap and np.log2?
A much faster way to do this would be:
df = np.log2(df)
Don't forget to assign the result back to df.
According to API Reference DataFrame.applymap(func)
Apply a function to a DataFrame that is intended to operate
elementwise, i.e. like doing map(func, series) for each series in the
DataFrame
It won't change the DataFrame in place; you need to capture the return value and use it.
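A minimal illustration of the difference (in pandas >= 2.1 the same elementwise method is also available as DataFrame.map):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0, 4.0]})
df.applymap(np.log2)       # return value discarded; df is unchanged
df = df.applymap(np.log2)  # assign the result back to keep the transform
```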
Pandas now has the transform() function, which in your case amounts to:
df = df.transform(lambda x: np.log2(x))

Choosing the correct values in excel in Python

General Overview:
I am creating a graph of a large data set; however, I have created a sample text document so that it is easier to work through the problems.
The data is from an Excel document that will be saved as a CSV.
Problem:
I am able to compile the data and it will graph (see below). However, the way I pull the data will not work for all of the different Excel sheets I am going to pull from.
More Detail of problem:
The Y-values (labeled 'Value' and 'Value1') are being pulled from the Excel sheet by the numbers 26 and 31 (see picture and code).
This is a problem because the values 26 and 31 will not be the same for each graph.
Let's take a look to make this clearer.
Here is my code
import pandas as pd
import matplotlib.pyplot as plt
pd.read_csv('CSV_GM_NB_Test.csv').T.to_csv('GM_NB_Transpose_Test.csv', header=False)
df = pd.read_csv('GM_NB_Transpose_Test.csv', skiprows = 2)
DID = df['SN']
Value = df['26']
Value1 = df['31']
x= (DID[16:25])
y= (Value[16:25])
y1= (Value1[16:25])
"""
print(x,y)
print(x,y1)
"""
plt.plot(x.astype(int), y.astype(int))
plt.plot(x.astype(int), y1.astype(int))
plt.show()
Output:
Data Set:
Below in the comments you will find the 0bin link to my data set; this is because I do not have enough reputation to post two links.
As you can see from the Data Set
X- DID = Blue
Y-Value = Green
Y-Value1 = Grey
Troublesome Values = Red
The problem again is that the data for the Y-values are pulled from rows 10 & 11, from the values 26 and 31 under SN.
Let me know if more information is needed.
Thank you
Not sure why you are creating the transposed CSV version. It is also possible to work directly from your original data. For example:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv('CSV_GM_NB_Test.csv', skiprows=8)
data = df.iloc[:, 19:].T
data.columns = df['SN']
data.plot()
plt.show()
This would give you:
You can use integer-position slicing to get a sliced version of your data (pandas.DataFrame.iloc in current pandas; the .ix indexer this answer originally used has since been removed). The [:, 19:] says to give you columns 19 onwards, and the final .T transposes the result. You can then apply the values of the SN column as column headings by assigning to .columns.
