I have the following pandas dataframe df with 2 columns, which looks like:
0 0
1. 22
2. 34
3. 21
4. 21
5. 92
I would like to integrate the area under this curve, plotting the first column as the x-axis and the second column as the y-axis. I tried the integrate module from scipy (from scipy import integrate) and applied it as follows, as I have seen in examples online:
print(df.integrate)
However, it seems the integrate function does not work. I'm receiving the error:
Dataframe object has no attribute integrate
How would I go about this?
Thank you
You want numerical integration given a fixed sample of data. The SciPy package lists a handful of methods to do this: https://docs.scipy.org/doc/scipy/reference/integrate.html#integrating-functions-given-fixed-samples
For your data, the trapezoidal rule is probably the most straightforward. You pass the y values and then the x values to the function. You did not post the column names of your data frame, so I am using column index 0 for x and column index 1 for y:
from scipy.integrate import trapezoid  # named trapz in older SciPy releases
trapezoid(df.iloc[:, 1], df.iloc[:, 0])
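To see what the trapezoidal rule computes, here is a minimal sketch that writes it out with NumPy on the six points from the question. The column names x and y are assumptions, since the question does not give any.

```python
import numpy as np
import pandas as pd

# Data from the question; the column names "x" and "y" are assumed.
df = pd.DataFrame({"x": [0, 1, 2, 3, 4, 5],
                   "y": [0, 22, 34, 21, 21, 92]})

x = df["x"].to_numpy(dtype=float)
y = df["y"].to_numpy(dtype=float)

# Trapezoidal rule: each interval contributes its width times
# the average of the two heights at its ends.
area = float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2.0))
print(area)  # 144.0
```

The library function should return the same number for this data; writing the sum out by hand is only meant to show what is being added up.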
Since integrate is a SciPy module, not a pandas method, you need to call its functions directly:
from scipy.integrate import trapezoid, simpson  # trapz and simps in older SciPy releases
print(trapezoid(*args))
https://docs.scipy.org/doc/scipy/reference/tutorial/integrate.html
Try this
import pandas as pd
import numpy as np
def integrate(x, y):
    area = np.trapz(y=y, x=x)
    return area
df = pd.DataFrame({'x':[0, 1, 2, 3, 4, 4, 5],'y':[0, 1, 3, 3, 5, 6, 7]})
x = df.x.values
y = df.y.values
print(integrate(x, y))
I am new to plotly and need to draw a dendrogram with group average linkage.
I am aware that there is a distfun parameter in create_dendrogram(), but I have no idea what to pass to it to get group average linkage. The distfun argument apparently has to be callable. What function should I pass to it?
As a side note, I have a sample pairwise distance matrix
0
13 0
2 14 0
17 1 18 0
which, when passed to the create_dendrogram() method, seems to produce an incorrect result. What am I doing wrong here?
code:
import plotly.figure_factory as ff
import numpy as np
X = np.matrix([[0,0,0,0],[13,0,0,0],[2,14,0,0],[17,1,18,0]])
names = list("0123")
fig = ff.create_dendrogram(X, orientation='left', labels=names)
fig.update_layout(width=800, height=800)
fig.show()
Code literally copied from the plotly website bc idk wth I'm supposed to do.
This website: https://plotly.com/python/v3/dendrogram/
You can choose a linkage method by passing scipy.cluster.hierarchy.linkage() through the linkagefun argument of create_dendrogram().
For example, to use the UPGMA (Unweighted Pair Group Method with Arithmetic mean) algorithm, which is group average linkage:
import plotly.figure_factory as ff
import scipy.cluster.hierarchy as sch
import numpy as np
X = np.matrix([[0,0,0,0],[13,0,0,0],[2,14,0,0],[17,1,18,0]])
names = "0123"
fig = ff.create_dendrogram(X,
                           orientation='left',
                           labels=names,
                           linkagefun=lambda x: sch.linkage(x, "average"))
fig.update_layout(width=800, height=800)
fig.show()
Note that X has to be a matrix of data samples (one row per observation), not a distance matrix.
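Since the matrix in the question holds pairwise distances rather than samples, it may help to check what group average linkage gives for it outside plotly. This is a minimal sketch using SciPy only; the variable names are my own, and it assumes the question's matrix is the lower triangle of a symmetric distance matrix.

```python
import numpy as np
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import squareform

# Lower-triangular pairwise distances from the question.
lower = np.array([[0, 0, 0, 0],
                  [13, 0, 0, 0],
                  [2, 14, 0, 0],
                  [17, 1, 18, 0]], dtype=float)

# Mirror the lower triangle to get a full symmetric distance matrix.
square = lower + lower.T

# linkage() expects the condensed (1-D) form when it is given distances.
condensed = squareform(square)

# "average" is group average linkage (UPGMA).
Z = sch.linkage(condensed, method="average")
merge_heights = Z[:, 2]
print(merge_heights)
```

Items 1 and 3 merge at distance 1, items 0 and 2 at distance 2, and the two pairs join at the mean of 13, 14, 17 and 18, which is 15.5. A dendrogram drawn from the distance matrix should show these heights.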
This is a bit old but, for anyone else with similar issues, I think the distfun param simply specifies how you want to convert your data matrix to a condensed distance matrix - you define the function yourself.
For example, after a bit of head banging I cobbled together data_to_dist to convert a data matrix to a Jaccard distance matrix, then condense it. You should be aware that plotly's dendrogram implementation does not check whether your matrix is condensed so your distfun needs to ensure this occurs. Maybe this is wrong, but it looks like distfun should only take one positional param (the data matrix) and return one object (the condensed distance matrix):
import plotly.figure_factory as ff
import numpy as np
from scipy.spatial.distance import jaccard, squareform
def jaccard_dissimilarity(feature_list1, feature_list2, filler_val):  # binary
    # filler_val can be used to even up ragged lists and ignore certain dtypes, i.e. prots not in a module
    all_features = set([i for i in feature_list1 if i != filler_val])
    all_features.update(set([i for i in feature_list2 if i != filler_val]))  # works for both numpy arrays and lists
    counts_1 = [1 if feature in feature_list1 else 0 for feature in all_features]
    counts_2 = [1 if feature in feature_list2 else 0 for feature in all_features]
    return jaccard(counts_1, counts_2)

def data_to_dist_matrix(mn_data, filler_val=0):
    # notes:
    # the original plotly example uses pdist to find Manhattan distance for clustering.
    # pdist 'Returns a condensed distance matrix Y' - https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html#scipy.spatial.distance.pdist.
    # a condensed distance matrix is required as input to scipy linkage for clustering.
    # plotly's dendrogram function does not apply this conversion to the output of a given distfun call - https://github.com/plotly/plotly.py/blob/cfad7862594b35965c0e000813bd7805e8494a5b/packages/python/plotly/plotly/figure_factory/_dendrogram.py#L340
    # therefore you should convert the distance matrix to condensed form yourself, as below, with squareform
    distance_matrix = np.array([[jaccard_dissimilarity(a, b, filler_val) for b in mn_data] for a in mn_data])
    return squareform(distance_matrix)
# toy data to visually check clustering looks sensible
data_array = np.array([[1, 2, 3, 0],
                       [2, 3, 10, 0],
                       [4, 5, 6, 0],
                       [5, 6, 7, 0],
                       [7, 8, 1, 0],
                       [1, 2, 8, 7],
                       [1, 2, 3, 8],
                       [1, 2, 3, 4]])
y_labels = [f'MODULE_{i}' for i in range(8)]
#this is the distance matrix and condensed distance matrix made by data_to_dist_matrix and is only included so I can check what it's doing
dist_matrix = np.array([[jaccard_dissimilarity(a,b, 0) for b in data_array] for a in data_array])
condensed_dist_matrix = data_to_dist_matrix(data_array, 0)
# Create Side Dendrogram
fig = ff.create_dendrogram(data_array,
                           orientation='right',
                           labels=y_labels,
                           distfun=data_to_dist_matrix)
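For anyone unsure what "condensed" means here, this small sketch shows the two forms side by side. The 3x3 matrix is made-up illustration data, not taken from the example above.

```python
import numpy as np
from scipy.spatial.distance import squareform

# A small symmetric distance matrix with a zero diagonal (made-up values).
square = np.array([[0.0, 1.0, 2.0],
                   [1.0, 0.0, 3.0],
                   [2.0, 3.0, 0.0]])

# The condensed form keeps only the upper triangle, row by row.
condensed = squareform(square)
print(condensed.shape)  # (3,)

# Calling squareform on a 1-D array converts back to the square form.
restored = squareform(condensed)
print(np.array_equal(restored, square))  # True
```

For n items the condensed vector has n * (n - 1) / 2 entries, so a quick length check on the output of your distfun is a cheap way to confirm it returns the right form.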
I'm trying to make a graph I've generated a little more useful for comparisons. Is there a way to make something like what I've pictured below? Or possibly a way to draw a horizontal line at the y values of the max and min?
I've tried using max() and min() and placing it in a plot like so:
plt.plot(dat[0]['end_date'], max(dat[0]['pct']))
This throws a ValueError because the x list has 48 entries whereas the y value is a single number:
ValueError: x and y must have same first dimension, but have shapes (48,) and (1,)
Could I use some version of this that somehow fills the remaining 47 spaces with that same max value?
Thank you!
You can use plt.axhline():
plt.axhline(max(dat[0]['pct']))
plt.axhline(min(dat[0]['pct']))
Demonstration using random data:
import matplotlib.pyplot as plt
import numpy as np, pandas as pd
df = pd.DataFrame({'x':[np.random.randint(0,10) for i in range(10)]})
df.plot()
plt.axhline(df.x.max())
plt.axhline(df.x.min())
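The question also asks whether the max could simply be repeated to match the length of x. That works too. Here is a minimal sketch with made-up data standing in for dat[0]['end_date'] and dat[0]['pct']; the Agg backend is only there so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np

# Made-up stand-ins for dat[0]['end_date'] and dat[0]['pct'] (48 points).
x = np.arange(48)
pct = np.sin(x / 5.0) * 10 + 40

# Repeat the max and min so that each y array has the same length as x.
max_line = np.full(len(x), pct.max())
min_line = np.full(len(x), pct.min())

fig, ax = plt.subplots()
ax.plot(x, pct, label="pct")
ax.plot(x, max_line, linestyle="--", label="max")
ax.plot(x, min_line, linestyle="--", label="min")
ax.legend()
```

Unlike plt.axhline(), lines drawn this way stop at the first and last data point instead of spanning the full width of the axes.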
I am not sure how to use kmedoids in Python. I have installed the pyclustering module from https://pypi.org/project/pyclustering/ but I'm not sure how to call kmedoids. I am trying to run PAM on my Gower distance matrix.
I'm trying to cluster features from a trade dataset. I used https://sourceforge.net/projects/gower-distance-4python/files/ to calculate the Gower distance on my matrix. I then pass the resulting matrix, which I've called D, to PAM/kmedoids:
import pyclustering
import pyclustering.cluster.kmedoids
from sklearn.metrics.pairwise import pairwise_distances
import numpy as np
D = gower_distances(trade_data)
pam=pyclustering.kmedoids(D)
AttributeError: module 'pyclustering' has no attribute 'kmedoids'
I get the above error. How do I call kmedoids / use PAM?
You need to correct the import and the K-Medoids initialization:
from pyclustering.cluster.kmedoids import kmedoids
... ...
pam=kmedoids(D, initial_medoids)
You need to import kmedoids as
from pyclustering.cluster.kmedoids import kmedoids
You can read more about it in pyclustering's documentation here https://codedocs.xyz/annoviko/pyclustering/classpyclustering_1_1cluster_1_1kmedoids_1_1kmedoids.html
This is a very small code example from https://stats.stackexchange.com/questions/94172/how-to-perform-k-medoids-when-having-the-distance-matrix/470141#470141. It starts from a precomputed distance matrix; in your case you would use the output of gower_distances() instead.
from pyclustering.cluster.kmedoids import kmedoids
import numpy as np
dm = np.array(
[[0.,1.91,2.23,3.14,4.25,3.37],
[0.,0.,2.15,1.82,2.41,2.58],
[0.,0.,0.,3.12,3.83,4.64],
[0.,0.,0.,0.,1.9,2.66],
[0.,0.,0.,0.,0.,3.12],
[0.,0.,0.,0.,0.,0.]])
dm = dm + np.transpose(dm)
k = 2
# choose points 2 and 4 (indices 1 and 3) as the initial medoids of C1 and C2 because min(D) in their cluster
initial_medoids = [1,3]
kmedoids_instance = kmedoids(dm, initial_medoids, data_type = 'distance_matrix')
# Run cluster analysis and obtain results.
kmedoids_instance.process()
clusters = kmedoids_instance.get_clusters()
centers = kmedoids_instance.get_medoids()
print(clusters)
# [[1, 0, 2, 5], [3, 4]]
print(centers)
# [1, 3]
I'm trying to plot a histogram from a dataframe column, but I can't seem to convert the data to an object type that matplotlib can handle. Here are some failed attempts. How do I fix this?
And more generally, how do you typically salvage something like that?
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
filter(lambda v: v > 0, df['foo_col']).hist(bins=100)
---> 10 filter(lambda v: v > 0, df['foo_col']).hist(bins=100)
AttributeError: 'filter' object has no attribute 'hist'
hist(filter(lambda v: v > 0, df['foo_col']), bins=100)
---> 10 hist(filter(lambda v: v > 0, df['foo_col']), bins=100)
TypeError: 'Series' object is not callable
filter is a Python builtin that returns a plain iterator rather than a pandas object, which is why the result has no hist method. IIUC, you just want to filter your dataframe to plot a histogram of values > 0. Pandas has its own syntax for that:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data = np.random.randint(-50, 1000, 10000)
df = pd.DataFrame({'some_data': data})
df[df['some_data'] > 0].hist(bins=100)
plt.show()
Note that this will typically run much faster than Python builtins (it doesn't make much difference in my trivial example, but it will with bigger datasets). It's worth using pandas methods with dataframes wherever possible because, in many cases, the calculation is vectorized and runs in highly optimised C/C++ code.
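On the more general "how do you salvage this" part of the question, here is a small sketch without any plotting. The series values are made up; only the name foo_col comes from the question.

```python
import pandas as pd

s = pd.Series([-2, 0, 3, 5], name="foo_col")

# Boolean mask: the comparison is evaluated for the whole column at once
# and the result is still a Series, so .hist() would be available on it.
positive = s[s > 0]
print(type(positive).__name__)  # Series

# Salvaging the builtin: materialise the iterator and wrap it in a Series again.
salvaged = pd.Series(list(filter(lambda v: v > 0, s)), name="foo_col")
print(salvaged.tolist())  # [3, 5]
```

The mask version keeps the original index labels (2 and 3 here), while the rebuilt series starts again from 0, which matters if you later align the result with other columns.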
import seaborn as sns
import matplotlib.pyplot as plt
sns.tsplot(data = df_month, time = 'month', value = 'pm_local');
plt.show()
Using this code I get a blank plot, I presume because of the scale of the y-axis, but I don't know why. Here are the first 5 rows of my dataframe (which consists of 12 rows, 1 row for each month):
How can I fix this?
I think the problem is related to the unit argument. When the data is passed as a DataFrame, the function expects a unit that indicates which subject each observation belongs to. This behavior is not obvious to me, but see this example:
# Test Data
df = pd.DataFrame({'month': [1, 2, 3, 4, 5, 6],
'value': [11.5, 9.7, 12, 8, 4, 12.3]})
# Added a custom unit with a value = 1
sns.tsplot(data=df, value='value', unit=[1]*len(df), time='month')
plt.show()
You can also extract a Series and plot it:
sns.tsplot(data=df.set_index('month')['value'])
plt.show()
I had this same issue. In my case it was due to incomplete data, such that every time point had at least one missing value, causing the default estimator to return NaN for every time point.
Since you only show the first 5 records of your data we can't tell if your data has the same issue. You can see if the following fix works:
import numpy as np
sns.tsplot(data=df_month, time='month',
           value='pm_local',
           estimator=np.nanmean)  # scipy.stats.nanmean was removed from SciPy
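The effect of the missing values can be reproduced with NumPy alone. This is a minimal sketch with made-up numbers in which every time point has exactly one missing observation.

```python
import numpy as np

# Rows are units, columns are time points; each column has one NaN.
values = np.array([[1.0, np.nan, 3.0],
                   [2.0, 4.0, np.nan],
                   [np.nan, 6.0, 5.0]])

plain = np.mean(values, axis=0)      # NaN at every time point
robust = np.nanmean(values, axis=0)  # ignores the missing entries

print(plain)
print(robust)
```

With the plain mean there is nothing left to draw, which would explain an empty plot, whereas the NaN-aware mean still gives one value per time point.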