Pandas spline interpolation wrong? - python

Pandas (version 1.3.5) and SciPy (version 1.7.3) give different results for spline interpolation, and from my understanding pandas is wrong:
df = pd.DataFrame(data = {'values': [10, 12, 15, None, None, None, None, 10, 5, 1, None, 0, 1, 3],})
df['interpolated_pandas'] = df['values'].interpolate(method='spline', axis=0, order=3)
df[['interpolated_pandas', 'values']].plot.line();
gives me:
And
idx = ~df['values'].isna()
f = interpolate.interp1d(df[idx].index, df.loc[idx,'values'], kind=3) # kind: an integer specifying the order of the spline interpolator to use
df['interpolated_scipy'] = f(df.index)
df[['interpolated_scipy', 'values']].plot.line();
gives me:
Is there something wrong in my code or is my understanding wrong? Or is this an actual bug in Pandas?

Pandas uses UnivariateSpline, which by default uses a "smoothing factor used to choose the number of knots"; see the pandas code and the SciPy docs.
To achieve the same results, we need to add s=0 to the function call:
df['interpolated_pandas'] = df['values'].interpolate(method='spline', axis=0, order=3) # default with smoothing factor
df['interpolated_pandas_s0'] = df['values'].interpolate(method='spline', axis=0, order=3, s=0) # without smoothing factor and same as `interp1d`
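As a cross-check, here is a minimal sketch calling scipy's UnivariateSpline directly (the new column names are just illustrative): the default smoothed fit corresponds to pandas' default behaviour, while s=0 gives an interpolating spline like interp1d(kind=3).
import pandas as pd
from scipy.interpolate import UnivariateSpline

df = pd.DataFrame(data={'values': [10, 12, 15, None, None, None, None, 10, 5, 1, None, 0, 1, 3]})
idx = ~df['values'].isna()

spl_smoothed = UnivariateSpline(df[idx].index, df.loc[idx, 'values'], k=3)    # default smoothing, like pandas' default
spl_exact = UnivariateSpline(df[idx].index, df.loc[idx, 'values'], k=3, s=0)  # interpolating spline, like interp1d(kind=3)

df['interpolated_scipy_default'] = spl_smoothed(df.index)
df['interpolated_scipy_s0'] = spl_exact(df.index)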

Related

Color pandas DataFrame value if larger than 1.5*median(column)

Let's say I have a DataFrame that looks like this:
df = pd.DataFrame({'A': [1, -2, 0, -1, 17],
                   'B': [11, -23, 1, -3, 132],
                   'C': [121, 2029, -243, 17, -45]})
I use a Jupyter notebook and want to colour, with df.style, the values in each column only if they exceed a value X, where X = 1.5 * median(column). So I would like to have something like this:
Preferably, I would like to have some gradient (df.style.background_gradient) to the colouring of the values, e.g. in column A the entry 17 to be darker than 1, because 17 is further away from the median of the column. But the gradient is optional.
How can I do this?
This answer uses pandas 1.4.2; the Styler can function differently depending on the version.
The simple case is fairly straightforward. Create a function which accepts a Series as input and then use np.where to conditionally build styles:
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'A': [1, -2, 0, -1, 17],
    'B': [11, -23, 1, -3, 132],
    'C': [121, 2029, -243, 17, -45]
})

def simple_median_style(
    s: pd.Series, true_css: str, false_css: str = ''
) -> np.ndarray:
    return np.where(s > 1.5 * s.median(), true_css, false_css)

df.style.apply(simple_median_style, true_css='background-color: green')
This gives simple support for separate styles for values above and below 1.5 * median by configuring the true_css and false_css values. Naturally, more functionality can be added depending on specific needs.
The gradient piece is a bit more involved. We can use get_cmap to get a Colormap from its name. Then we can create a CenteredNorm to form a gradient around a specific value, in this case 1.5 * median for each column. Using these two together, we can create a gradient over the entire column.
Here I've used a simple list comprehension to conditionally apply the gradient or some false style ('', i.e. no styles).
from typing import List
import pandas as pd
from matplotlib.cm import get_cmap
from matplotlib.colors import Colormap, CenteredNorm, rgb2hex
df = pd.DataFrame({
    'A': [1, -2, 0, -1, 17],
    'B': [11, -23, 1, -3, 132],
    'C': [121, 2029, -243, 17, -45]
})

def centered_gradient(
    s: pd.Series, cmap: Colormap, false_css: str = ''
) -> List[str]:
    # Find the center point
    center = 1.5 * s.median()
    # Create a normaliser centered on the median
    norm = CenteredNorm(vcenter=center)
    # s = s.where(s > center, center)
    return [
        # Conditionally apply the gradient to values above the center only
        f'background-color: {rgb2hex(rgba)}' if row > center else false_css
        for row, rgba in zip(s, cmap(norm(s)))
    ]

df.style.apply(centered_gradient, cmap=get_cmap('Greens'))
Note: this approach considers all values when normalising so the gradient will be affected by all values in the column.
If the more general case is needed, an unconditional gradient centered on the median could be built with the following (the rest is the same as the complete example above):
def centered_gradient(
    s: pd.Series, cmap: Colormap, false_css: str = ''
) -> List[str]:
    # Find the center point
    center = 1.5 * s.median()
    # Create a normaliser centered on the median
    norm = CenteredNorm(vcenter=center)
    # Convert rgba value arrays to hex
    return [
        f'background-color: {rgb2hex(rgba)}' for rgba in cmap(norm(s))
    ]
Whilst @HenryEcker's solution is well explained and detailed, a very simple approach would be to tackle your problem directly with Styler chaining, something like:
styler = df.style
for col in df.columns:
    mask = df[col] > df[col].median() * 1.5
    styler.background_gradient(subset=(mask, col), cmap="Blues", vmin=-100)
styler

Python integration of Pandas dataframe

I have the following pandas dataframe df with 2 columns, which looks like:
0    0
1   22
2   34
3   21
4   21
5   92
I would like to integrate the area under this curve if we were to plot the first column as the x-axis and the second column as the y-axis. I have tried doing this using the integrate module from scipy (from scipy import integrate), applied as follows as I have seen in examples online:
print(df.integrate)
However, it seems the integrate function does not work. I'm receiving the error:
Dataframe object has no attribute integrate
How would I go about this?
Thank you
You want numerical integration given a fixed sample of data. The Scipy package lists a handful of methods to do this: https://docs.scipy.org/doc/scipy/reference/integrate.html#integrating-functions-given-fixed-samples
For your data, the trapezoidal rule is probably the most straightforward. You provide the y and x values to the function. You did not post the column names of your data frame, so I am using the 0-index for x and the 1-index for the y values.
from scipy.integrate import trapz
trapz(df.iloc[:, 1], df.iloc[:, 0])
Since integrate lives in SciPy and is not a pandas method, you need to invoke it as follows:
from scipy.integrate import trapz, simps
print(trapz(*args))
https://docs.scipy.org/doc/scipy/reference/tutorial/integrate.html
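For instance, with the frame from the question (assuming, as the first answer does, that the first column holds x and the second holds y), the concrete call would look like:
from scipy.integrate import trapz

area = trapz(df.iloc[:, 1], x=df.iloc[:, 0])
print(area)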
Try this
import pandas as pd
import numpy as np

def integrate(x, y):
    area = np.trapz(y=y, x=x)
    return area

df = pd.DataFrame({'x': [0, 1, 2, 3, 4, 4, 5], 'y': [0, 1, 3, 3, 5, 6, 7]})
x = df.x.values
y = df.y.values
print(integrate(x, y))

Dendrogram with plotly - how to set a custom linkage method for hierarchical clustering

I am new to plotly and need to draw a dendrogram with group average linkage.
I am aware that there is a distfun parameter in create_dendrogram(), but I have no idea what to pass to that argument to get group average linkage. The distfun argument apparently has to be a callable. What function should I pass to it?
As a sidenote, I have a sample pairwise distance matrix
 0
13  0
 2 14  0
17  1 18  0
which, when I pass it to the create_dendrogram() method, seems to produce an incorrect result. What am I doing wrong here?
code:
import plotly.figure_factory as ff
import numpy as np
X = np.matrix([[0,0,0,0],[13,0,0,0],[2,14,0,0],[17,1,18,0]])
names = list("0123")
fig = ff.create_dendrogram(X, orientation='left', labels=names)
fig.update_layout(width=800, height=800)
fig.show()
The code is literally copied from the plotly website because I don't know what I'm supposed to do.
This website: https://plotly.com/python/v3/dendrogram/
You can choose a linkage method using scipy.cluster.hierarchy.linkage() via the linkagefun argument of the create_dendrogram() function.
For example, to use the UPGMA (Unweighted Pair Group Method with Arithmetic mean) algorithm:
import plotly.figure_factory as ff
import scipy.cluster.hierarchy as sch
import numpy as np

X = np.matrix([[0, 0, 0, 0], [13, 0, 0, 0], [2, 14, 0, 0], [17, 1, 18, 0]])
names = "0123"
fig = ff.create_dendrogram(X,
                           orientation='left',
                           labels=names,
                           linkagefun=lambda x: sch.linkage(x, "average"))
fig.update_layout(width=800, height=800)
fig.show()
Please note that X has to be a matrix of data samples.
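If the 4x4 matrix is actually meant as a precomputed pairwise distance matrix rather than data samples (as the question suggests), one option is to pass it through distfun unchanged except for condensing it, so create_dendrogram does not recompute distances from it. This is only a sketch of that idea, not the only way to do it:
import plotly.figure_factory as ff
import scipy.cluster.hierarchy as sch
import numpy as np
from scipy.spatial.distance import squareform

# Full symmetric form of the distance matrix from the question
D = np.array([[ 0, 13,  2, 17],
              [13,  0, 14,  1],
              [ 2, 14,  0, 18],
              [17,  1, 18,  0]])

fig = ff.create_dendrogram(D,
                           orientation='left',
                           labels=list("0123"),
                           # condense the precomputed matrix instead of computing distances
                           distfun=lambda m: squareform(m, checks=False),
                           linkagefun=lambda x: sch.linkage(x, "average"))
fig.show()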
This is a bit old but, for anyone else with similar issues, I think the distfun param simply specifies how you want to convert your data matrix to a condensed distance matrix - you define the function yourself.
For example, after a bit of head banging I cobbled together data_to_dist_matrix to convert a data matrix to a Jaccard distance matrix and then condense it. You should be aware that plotly's dendrogram implementation does not check whether your matrix is condensed, so your distfun needs to ensure this happens. Maybe this is wrong, but it looks like distfun should only take one positional param (the data matrix) and return one object (the condensed distance matrix):
import plotly.figure_factory as ff
import numpy as np
from scipy.spatial.distance import jaccard, squareform

def jaccard_dissimilarity(feature_list1, feature_list2, filler_val):  # binary
    # filler_val can be used to even up ragged lists and ignore certain entries, e.g. prots not in a module
    all_features = set([i for i in feature_list1 if i != filler_val])
    all_features.update(set([i for i in feature_list2 if i != filler_val]))  # works for both numpy arrays and lists
    counts_1 = [1 if feature in feature_list1 else 0 for feature in all_features]
    counts_2 = [1 if feature in feature_list2 else 0 for feature in all_features]
    return jaccard(counts_1, counts_2)

def data_to_dist_matrix(mn_data, filler_val=0):
    # Notes:
    # The original plotly example uses pdist to find Manhattan distance for clustering.
    # pdist 'Returns a condensed distance matrix Y' - https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html#scipy.spatial.distance.pdist
    # A condensed distance matrix is required as input to scipy linkage for clustering.
    # The plotly dendrogram function does not apply this conversion to the output of a given distfun call - https://github.com/plotly/plotly.py/blob/cfad7862594b35965c0e000813bd7805e8494a5b/packages/python/plotly/plotly/figure_factory/_dendrogram.py#L340
    # Therefore you should convert the distance matrix to condensed form yourself, as below with squareform.
    distance_matrix = np.array([[jaccard_dissimilarity(a, b, filler_val) for b in mn_data] for a in mn_data])
    return squareform(distance_matrix)

# Toy data to visually check that the clustering looks sensible
data_array = np.array([[1, 2, 3, 0],
                       [2, 3, 10, 0],
                       [4, 5, 6, 0],
                       [5, 6, 7, 0],
                       [7, 8, 1, 0],
                       [1, 2, 8, 7],
                       [1, 2, 3, 8],
                       [1, 2, 3, 4]])
y_labels = [f'MODULE_{i}' for i in range(8)]

# The distance matrix and condensed distance matrix made by data_to_dist_matrix,
# included only so I can check what it's doing
dist_matrix = np.array([[jaccard_dissimilarity(a, b, 0) for b in data_array] for a in data_array])
condensed_dist_matrix = data_to_dist_matrix(data_array, 0)

# Create side dendrogram
fig = ff.create_dendrogram(data_array,
                           orientation='right',
                           labels=y_labels,
                           distfun=data_to_dist_matrix)

How to find the most frequent value in a column using np.histogram()

I have a DataFrame in which one column contains different numerical values. I would like to find the most frequently occurring value specifically using the np.histogram() function.
I know that this task can be achieved using functions such as column.value_counts().nlargest(1); however, I am interested in how the np.histogram() function can be used to achieve this goal.
With this task I am hoping to get a better understanding of the function and the resulting values, as the description from the documentation (https://numpy.org/doc/1.18/reference/generated/numpy.histogram.html) is not so clear to me.
Below I am sharing an example Series of values to be used for this task:
data = pd.Series(np.random.randint(1,10,size=100))
This is one way to do it:
import numpy as np
import pandas as pd
# Make data
np.random.seed(0)
data = pd.Series(np.random.randint(1, 10, size=100))
# Make bins
bins = np.arange(data.min(), data.max() + 2)
# Compute histogram
h, _ = np.histogram(data, bins)
# Find most frequent value
mode = bins[h.argmax()]
# Mode computed with Pandas
mode_pd = data.value_counts().nlargest(1).index[0]
# Check result
print(mode == mode_pd)
# True
You can also define bins as:
bins = np.unique(data)
bins = np.append(bins, bins[-1] + 1)
Or if your data contains only positive numbers you can directly use np.bincount:
mode = np.bincount(data).argmax()
Of course there is also scipy.stats.mode:
import scipy.stats
mode = scipy.stats.mode(data)[0][0]
It can be done with:
hist, bin_edges = np.histogram(data, bins=np.arange(0.5,10.5))
result = np.argmax(hist)
You just need to read the documentation more carefully. It says that if bins is [1, 2, 3, 4], then the first bin is [1, 2), the second is [2, 3) and the third is [3, 4).
Specifically for your problem, we count how many numbers fall into the bins [0.5, 1.5), [1.5, 2.5), ..., [8.5, 9.5) and choose the index of the maximum one.
Just in case, it's worth using
np.unique(data)[np.argmax(hist)]
if we are not sure that your sorted data set np.unique(data) includes all the consecutive integers 0, 1, 2, 3, ...
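As a small follow-up sketch (assuming the integer-valued data from the question): np.argmax(hist) gives the index of the winning bin, so the mode value itself can also be read off the bin edges.
import numpy as np
import pandas as pd

data = pd.Series(np.random.randint(1, 10, size=100))
hist, bin_edges = np.histogram(data, bins=np.arange(0.5, 10.5))
mode_value = int(bin_edges[np.argmax(hist)] + 0.5)  # centre of the most populated bin
print(mode_value)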

Difference between R.scale() and sklearn.preprocessing.scale()

I am currently moving my data analysis from R to Python. When scaling a dataset in R, I would use R.scale(), which in my understanding does the following: (x - mean(x)) / sd(x).
To replace that function I tried to use sklearn.preprocessing.scale(). From my understanding of the description it does the same thing. Nonetheless, I ran a little test file and found out that these two methods have different return values. Obviously the standard deviations are not the same... Is someone able to explain why the standard deviations "deviate" from one another?
MWE:
# import packages
from sklearn import preprocessing
import numpy
import rpy2.robjects.numpy2ri
from rpy2.robjects.packages import importr
rpy2.robjects.numpy2ri.activate()
# Set up R namespaces
R = rpy2.robjects.r
np1 = numpy.array([[1.0,2.0],[3.0,1.0]])
print "Numpy-array:"
print np1
print "Scaled numpy array through R.scale()"
print R.scale(np1)
print "-------"
print "Scaled numpy array through preprocessing.scale()"
print preprocessing.scale(np1, axis = 0, with_mean = True, with_std = True)
scaler = preprocessing.StandardScaler()
scaler.fit(np1)
print "Mean of preprocessing.scale():"
print scaler.mean_
print "Std of preprocessing.scale():"
print scaler.std_
Output:
It seems to have to do with how standard deviation is calculated.
>>> import numpy as np
>>> a = np.array([[1, 2],[3, 1]])
>>> np.std(a, axis=0)
array([ 1. , 0.5])
>>> np.std(a, axis=0, ddof=1)
array([ 1.41421356, 0.70710678])
From the numpy.std documentation:
ddof : int, optional
Means Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements. By default ddof is zero.
Apparently, R.scale() uses ddof=1, but sklearn.preprocessing.StandardScaler() uses ddof=0.
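A minimal sketch of the difference (assuming the np1 array from the question): scaling by hand with ddof=1 reproduces R.scale(), while ddof=0 reproduces preprocessing.scale().
import numpy as np

np1 = np.array([[1.0, 2.0], [3.0, 1.0]])
r_style = (np1 - np1.mean(axis=0)) / np1.std(axis=0, ddof=1)        # like R.scale()
sklearn_style = (np1 - np1.mean(axis=0)) / np1.std(axis=0, ddof=0)  # like preprocessing.scale()
print(r_style)
print(sklearn_style)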
EDIT: (To explain how to use alternate ddof)
There doesn't seem to be a straightforward way to calculate std with an alternate ddof without accessing the variables of the StandardScaler() object itself.
sc = StandardScaler()
sc.fit(data)
# Now, sc.mean_ and sc.std_ are the mean and standard deviation of the data
# Replace the sc.std_ value using std calculated using numpy
sc.std_ = numpy.std(data, axis=0, ddof=1)
The current answers are good, but sklearn has changed a bit meanwhile. The new syntax that makes sklearn behave exactly like R.scale() now is:
from sklearn.preprocessing import StandardScaler
import numpy as np
sc = StandardScaler()
sc.fit(data)
sc.scale_ = np.std(data, axis=0, ddof=1).to_list()
sc.transform(data)
Feature request:
https://github.com/scikit-learn/scikit-learn/issues/23758
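A quick usage check of this syntax (assuming data is held in a pandas DataFrame, since .to_list() is called on the result of np.std):
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

data = pd.DataFrame([[1.0, 2.0], [3.0, 1.0]])
sc = StandardScaler()
sc.fit(data)
sc.scale_ = np.std(data, axis=0, ddof=1).to_list()
print(sc.transform(data))  # matches R.scale() on the same values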
The R.scale documentation says:
The root-mean-square for a (possibly centered) column is defined as sqrt(sum(x^2)/(n-1)), where x is a vector of the non-missing values and n is the number of non-missing values. In the case center = TRUE, this is the same as the standard deviation, but in general it is not. (To scale by the standard deviations without centering, use scale(x, center = FALSE, scale = apply(x, 2, sd, na.rm = TRUE)).)
However, sklearn.preprocessing.StandardScaler always scales with the standard deviation.
In my case, I wanted to replicate R.scale in Python without centering, so I followed @Sid's advice in a slightly different way:
import numpy as np
from sklearn.preprocessing import StandardScaler

def get_scale_1d(v):
    # I copied this function from the R source code, haha
    v = v[~np.isnan(v)]
    std = np.sqrt(
        np.sum(v ** 2) / np.max([1, len(v) - 1])
    )
    return std

sc = StandardScaler()
sc.fit(data)
sc.std_ = np.apply_along_axis(func1d=get_scale_1d, axis=0, arr=data)
sc.transform(data)
