Apply log2 transformation to a pandas DataFrame - python

I want to apply log2 with applymap and np.log2 to a DataFrame and show the result with a boxplot. Here is the code I have written:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
data = pd.read_csv('testdata.csv')
df = pd.DataFrame(data)
################################
# a.
df.boxplot()
plt.title('Raw Data')
################################
# b.
df.applymap(np.log2)
df.boxplot()
plt.title('Normalized Data')
Below is the boxplot I get for my raw data, which looks fine, but I get exactly the same boxplot after applying the log2 transformation! Can anyone tell me what I am doing wrong and what should be corrected to get the normalized data with applymap and np.log2?

A much faster way to do this would be:
df = np.log2(df)
Don't forget to assign the result back to df.

According to API Reference DataFrame.applymap(func)
Apply a function to a DataFrame that is intended to operate
elementwise, i.e. like doing map(func, series) for each series in the
DataFrame
It won't change the DataFrame in place; you need to capture the return value and use it.
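A minimal sketch of that fix applied to the question's code (assuming testdata.csv contains only positive numeric columns):
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.read_csv('testdata.csv')

# applymap returns a new DataFrame; capture it instead of discarding it.
df_log = df.applymap(np.log2)
df_log.boxplot()
plt.title('Normalized Data')
plt.show()
On recent pandas (2.1 and later), DataFrame.map is the preferred name for the same element-wise operation, but the idea is identical: use the returned DataFrame.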

Pandas now has the transform() function, which in your case amounts to:
df = df.transform(lambda x: np.log2(x))

Related

Bin and aggregate with `seaborn.objects`

Is there a way to both bin and aggregate (with some function other than count) in seaborn.objects? I'd like to compute the mean per bin, and right now I'm using the following:
import seaborn.objects as so
import pandas as pd
import seaborn as sns
df = sns.load_dataset("penguins")
df2 = (
    df.groupby(pd.cut(df["bill_length_mm"], bins=30))[["bill_depth_mm"]]
    .mean()
)
df2["bill_length_mm"] = [x.mid for x in df2.index]
p = so.Plot(df2, x="bill_length_mm", y="bill_depth_mm").add(so.Bars())
p
There's not yet a binning operation separate from Hist (this could make sense as a Stat or a Scale, I'm not sure).
But note that you can do the aggregation more simply than in your example, because you can pass a Series directly and don't need to construct a new dataframe:
(
    so.Plot(
        df,
        x="bill_depth_mm",
        y=pd.cut(df["bill_length_mm"], bins=30).map(lambda x: x.mid),
    )
    .add(so.Bars(), so.Agg("mean"))
)
Note that the Series will be aligned with the DataFrame (or other Series passed directly) based on the index information rather than position.

What is the best way to apply several percentage differences to pandas Data Frame?

Let's consider the dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame([np.random.randn(1000)]).transpose()
I want to apply percentage-change transformations for lags 1 through 10 and add them to df. My primitive solution is:
df_copy = df.copy()
for i in range(1, 11):
    to_add = df_copy.pct_change(i)
    df = pd.concat([df, to_add], axis=1)
However, I'm not sure it's the most efficient way to do this. Do you know of a better option?
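One possible alternative (a sketch, not from the original thread; the column names like pct_change_1 are made up here): build all the lagged percentage-change columns first and concatenate once, instead of growing df inside the loop.
import numpy as np
import pandas as pd

df = pd.DataFrame([np.random.randn(1000)]).transpose()

# Compute pct_change for lags 1..10 as a dict of Series, then concat in one go.
changes = pd.concat(
    {f"pct_change_{i}": df[0].pct_change(i) for i in range(1, 11)},
    axis=1,
)
result = pd.concat([df, changes], axis=1)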

Pandas style based on logarithm of value

I'd like to style a Pandas DataFrame display with a background color that is based on the logarithm (base 10) of a value, rather than the data frame value itself. The numeric display should show the original values (along with specified numeric formatting), rather than the log of the values.
I've seen many solutions involving the apply and applymap methods, but am not really clear on how to use these, especially since I don't want to change the underlying dataframe.
Here is an example of the type of data I have. Using the "gradient" to highlight is not satisfactory, but highlighting based on the log base 10 would be really useful.
import pandas as pd
import numpy as np
E = np.array([1.26528431e-03, 2.03866202e-04, 6.64793821e-05, 1.88018687e-05,
              4.80967314e-06, 1.22584958e-06, 3.09260354e-07, 7.76751705e-08])
df = pd.DataFrame(E, columns=['Error'])
df.style.format('{:.2e}'.format).background_gradient(cmap='Blues')
Since pandas 1.3.0, background_gradient now has a gmap (gradient map) argument that allows you to set the values that determine the background colors.
See the examples here (this link is to the dev docs - can be replaced once 1.3.0 is released) https://pandas.pydata.org/pandas-docs/dev/reference/api/pandas.io.formats.style.Styler.background_gradient.html#pandas.io.formats.style.Styler.background_gradient
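For the data in the question, that might look like this (a minimal sketch, assuming pandas >= 1.3): pass the log10 of the values as gmap while still displaying the original numbers.
import numpy as np
import pandas as pd

E = np.array([1.26528431e-03, 2.03866202e-04, 6.64793821e-05, 1.88018687e-05,
              4.80967314e-06, 1.22584958e-06, 3.09260354e-07, 7.76751705e-08])
df = pd.DataFrame(E, columns=['Error'])

# Colors come from log10(E); the displayed text still uses the original values.
df.style.format('{:.2e}'.format).background_gradient(cmap='Blues', gmap=np.log10(E))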
I figured out how to use the apply function to do exactly what I want. And also, I discovered a few more features in Matplotlib's colors module, including LogNorm which normalizes using a log. So in the end, this was relatively easy.
What I learned:
Do not use background_gradient; instead, supply your own function that maps DataFrame values to colors. The argument to the function is the DataFrame to be displayed. The return value should be a DataFrame with the same columns, etc., but with the values replaced by color strings, e.g. 'background-color: #ffaa44'.
Pass this function as an argument to apply.
import pandas as pd
import numpy as np
from matplotlib import colors, cm
import seaborn as sns

def color_log(x):
    df = x.copy()
    cmap = sns.color_palette("spring", as_cmap=True).reversed()
    evals = df['Error'].values
    # Normalize on a log scale, then map each normalized value to a hex color.
    norm = colors.LogNorm(vmin=1e-10, vmax=1)
    normed = norm(evals)
    cstr = "background-color: {:s}".format
    c = [cstr(colors.rgb2hex(v)) for v in cm.get_cmap(cmap)(normed)]
    df['Error'] = c
    return df

E = np.array([1.26528431e-03, 2.03866202e-04, 6.64793821e-05, 1.88018687e-05,
              4.80967314e-06, 1.22584958e-06, 3.09260354e-07, 7.76751705e-08])
df = pd.DataFrame(E, columns=['Error'])
df.style.format('{:.2e}'.format).apply(color_log, axis=None)
Note (1) The second argument to the apply function is an "axis". By supplying axis=None, the entire data frame will be passed to color_log. Passing axis=0 will pass in each column of the data frame as a Series. In this case, the code supplied above will not work. However, this would be useful for dataframes in which each column should be handled separately.
Note (2) If axis=None is used and the DataFrame has more than one column, the color mapping function passed to apply should set colors for all columns in the DataFrame. For example,
df.loc[:, :] = 'background-color: #eeeeee'
would set all columns to grey. Then selected columns could be overwritten with other color choices.
I would be happy to know if there is yet a simpler way to do this.
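To illustrate Note (2), here is a rough sketch with made-up columns (Error and Iteration are hypothetical): start with a neutral color for every cell, then overwrite one column with log-scaled colors.
import pandas as pd
import seaborn as sns
from matplotlib import colors

def color_log_multi(x):
    # Neutral grey everywhere, then a log-scaled color only for the 'Error' column.
    styles = pd.DataFrame('background-color: #eeeeee', index=x.index, columns=x.columns)
    cmap = sns.color_palette("spring", as_cmap=True).reversed()
    norm = colors.LogNorm(vmin=1e-10, vmax=1)
    styles['Error'] = ['background-color: {:s}'.format(colors.rgb2hex(cmap(v)))
                       for v in norm(x['Error'].values)]
    return styles

df = pd.DataFrame({'Error': [1e-3, 1e-5, 1e-7], 'Iteration': [1, 2, 3]})
df.style.format({'Error': '{:.2e}'.format}).apply(color_log_multi, axis=None)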

Value Error and problem with shape during creation of Data Frame in Python?

I would like to combine the coefficients from a Linear Regression model with values from the test dataset, but I get the error below. My code is below as well; do you know where the problem is and what I can do?
I need something like the output below, where the indexes come from X.columns and the numbers come from LR.coef_.
In the following example, values is a dataframe with the same shape as your LR.coef_. To use its first row as column values in another dataframe, you can create a dict and pass that dict to pandas.DataFrame().
import pandas as pd
import numpy as np
values = pd.DataFrame(np.zeros((1, 689)))
X = pd.DataFrame(np.zeros((2096, 689)))
frame = { 'coefficient': values.iloc[0] }
coefficient = pd.DataFrame(frame, index=X.columns)
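With an actually fitted model, the same idea looks roughly like this (a sketch; the random X and y here are hypothetical stand-ins for the question's data):
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical stand-ins for the question's features and target.
X = pd.DataFrame(np.random.randn(100, 5), columns=['a', 'b', 'c', 'd', 'e'])
y = np.random.randn(100)

LR = LinearRegression().fit(X, y)

# LR.coef_ has one entry per feature, so it lines up with X.columns directly.
coefficient = pd.DataFrame({'coefficient': LR.coef_.ravel()}, index=X.columns)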

ufunc 'add' did not contain a loop with signature matching types dtype('<U23') dtype('<U23') dtype('<U23')

When trying to convert the sklearn dataset into a pandas dataframe with the following code, I get the error "ufunc 'add' did not contain a loop with signature matching types dtype('<U23') dtype('<U23') dtype('<U23')":
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
data = pd.DataFrame(data=np.c_[cancer['data'], cancer['target']], columns=cancer['feature_names'] + cancer['target'])
Here is how I converted the sklearn dataset to a pandas dataframe. The target column name needs to be appended.
bostonData = pd.DataFrame(data=np.c_[boston['data'], boston['target']],
                          columns=np.append(boston['feature_names'], ['target']))
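Applied to the cancer dataset from the question, that same pattern would look roughly like this:
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# np.append joins the string array of feature names with the extra 'target' label,
# avoiding the string + float addition that triggers the ufunc 'add' error.
data = pd.DataFrame(data=np.c_[cancer['data'], cancer['target']],
                    columns=np.append(cancer['feature_names'], ['target']))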
You have a NumPy array of strings; please provide the full error so we can figure out what's missing.
For example, assuming you got dtype('<U9'), try adding dtype=float to your array. Something like this (not certain):
data = pd.DataFrame(data=np.c_[cancer['data'], cancer['target']], columns=cancer['feature_names'] + cancer['target'], dtype=float)
Sometimes it's just easier to keep it simple. Create a DF for both data and target, then merge using pandas.
data_df = pd.DataFrame(data=cancer['data'] ,columns=cancer['feature_names'])
target_df = pd.DataFrame(data=cancer['target'], columns=['target']).reset_index(drop=True)
target_df.rename_axis(None)
df = pd.concat([data_df, target_df], axis=1)
