Plot CDF from Pandas series with index as x-axis - python

I have a Pandas series that has an index and the values are the counts for each value of the index. I want to plot a CDF (preferably just the line, not the full histogram) where the x-axis represents the index.
For example, if my series is s, then s.index is the array of values that should appear on the x-axis and s.values holds the counts. I have tried s.plot(cumulative=True, ...), but that puts the values on the x-axis, not the index.
Example: s.index yields an array of values from 0 to 1, with 0.01 increments (0.00, 0.01, 0.02, ... 1.00). s.values yields an array of the counts, for example (4372, 1340, 205,...), where each one corresponds to the index (0.01 has a count of 1340). I would like the x-axis to be the 0.00, 0.01,... and the y-axis goes from 0 to 1 as the cumulative distribution based on the counts.

You can achieve this with the seaborn package:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
x = np.arange(0, .1, 0.01)
df = pd.DataFrame({'value': [1340, 1200, 1300, 1150, 1421, 1175, 1232, 1432, 1123, 1231]}, index=x)
df
      value
0.00   1340
0.01   1200
0.02   1300
0.03   1150
0.04   1421
0.05   1175
0.06   1232
0.07   1432
0.08   1123
0.09   1231
# Empirical CDF with the index on the x-axis and the counts as weights (needs seaborn >= 0.11)
sns.ecdfplot(x=df.index.to_numpy(), weights=df['value'].to_numpy())
plt.show()
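The same curve can also be computed by hand, which makes the normalisation explicit. The following is a minimal sketch, assuming the counts live in a Series named s (as in the question) and reusing the sample counts from above; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Sample counts from the answer above; the index holds the x-axis values.
s = pd.Series(
    [1340, 1200, 1300, 1150, 1421, 1175, 1232, 1432, 1123, 1231],
    index=np.arange(10) / 100,
)

# Running total of the counts divided by the grand total gives a CDF ending at 1.
cdf = s.sort_index().cumsum() / s.sum()

fig, ax = plt.subplots()
ax.plot(cdf.index, cdf.values, drawstyle="steps-post")
ax.set_xlabel("value")
ax.set_ylabel("cumulative proportion")
# Call plt.show() to display the figure in an interactive session.
```

Sorting the index first matters only if the series is not already ordered; the cumulative sum assumes ascending x values.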

Related

KMeans cluster with multiple columns using grouped dataframe?

I am currently working on k-means clustering of multiple groups using groupby.
The data I'm working on looks like this
date        permno  mom1m  mom2m   ...  mom48m
2004-01-31  80000   0.515  -0.32   ...  0.773
2004-02-29  80000   0.415  -0.043  ...  0.64
2004-03-31  80000   0.314  0.045   ...  0.43
2004-01-30  80001   0.643  -0.234  ...  0.34
2004-02-29  80001   0.646  -0.456  ...  0.646
2004-03-31  80001   0.876  -0.044  ...  0.321
2004-01-31  80002   0.453  0.045   ...  0.324
I will be grouping the dataframe based on the dates and I want to perform k-means clustering starting from the columns mom2m to mom48m.
I would want to have a separate column that shows the labels as well.
What I have done until now is to make a function that performs the k-means clustering and use transform.
def cluster(X, n_clusters):
    features = X[features_to_KMeans]
    k_means = KMeans(n_clusters=n_clusters)
    y = k_means.fit_predict(features)
    return y
crsp['cluster_id'] = crsp.groupby("date").transform(cluster, n_clusters=50)
scikit-learn works with two-dimensional, array-like input, so here the data is converted into a NumPy array. Keep in mind that a one-dimensional array must be reshaped into a two-dimensional one first. For example, if you used only one column, you would need to do the following:
np.array(crsp.loc[:, 'mom2m']).reshape(-1, 1)
I do not know whether grouping is necessary here; in my opinion it is not.
At the end, the library 'mglearn' is used to draw the result. The triangles show the center of each cluster.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
import mglearn
def cluster(X, n_clusters):
    k_means = KMeans(n_clusters=n_clusters)
    y = k_means.fit_predict(X)
    return y, k_means
arr = np.array(crsp.loc[:, ['mom2m', 'mom48m']])
aaa = cluster(arr, 3)
aaa_result = aaa[1]
mglearn.discrete_scatter(arr[:, 0], arr[:, 1], aaa_result.labels_, markers='o')
mglearn.discrete_scatter(aaa_result.cluster_centers_[:, 0],
                         aaa_result.cluster_centers_[:, 1], [0, 1, 2], markers='^', markeredgewidth=2)
plt.show()
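If clustering per date is wanted after all, as the question describes, one possible approach is to fit a separate KMeans model on each date group and collect the labels into a new column. This is only a sketch under assumptions: crsp_demo is a made-up stand-in for the crsp frame, feature_cols and cluster_labels are hypothetical names, and 3 clusters are used instead of 50 to keep the example small.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical sample data standing in for the `crsp` frame from the question.
rng = np.random.default_rng(0)
dates = pd.to_datetime(["2004-01-31", "2004-02-29", "2004-03-31"])
crsp_demo = pd.DataFrame({
    "date": dates.repeat(20),
    "permno": np.tile(np.arange(80000, 80020), 3),
    "mom2m": rng.normal(size=60),
    "mom48m": rng.normal(size=60),
})
feature_cols = ["mom2m", "mom48m"]  # assumed feature columns

def cluster_labels(group, n_clusters):
    """Fit KMeans on one date's rows; return labels aligned to the group's index."""
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    labels = model.fit_predict(group[feature_cols].to_numpy())
    return pd.Series(labels, index=group.index)

# One model per date; concatenating keeps the original row index, so the
# assignment below lines the labels up with the right rows.
parts = [cluster_labels(group, 3) for _, group in crsp_demo.groupby("date")]
crsp_demo["cluster_id"] = pd.concat(parts)
```

Note that label numbers are only meaningful within a single date: cluster 0 on one date has no relation to cluster 0 on another.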

How to plot a scatter plot with values against a category and colored by a different category

I have a Python Pandas dataframe in the following format:
gender  disease1  disease2
male    0.82      0.76
female  0.75      0.93
...     ...       ...
I'm looking to plot this in Python (matplotlib, plotly express, etc.) so that it looks something like this:
How can I restructure my dataframe and/or use a python visualisation library to achieve this result?
You can create a scatterplot in Plotly where disease1 is located at x=0 and disease2 is located at x=1... and so on for more diseases, then rename the tickmarks, and set the color and offset of the marker depending on the gender.
The most dynamic way to make this plot is to add the data as you slice the DataFrame by disease and gender (I added some more points to your DataFrame to demonstrate that you can keep your DataFrame in the same format and achieve the desired plot):
import pandas as pd
import plotly.graph_objects as go
df = pd.DataFrame({'gender':['male','female','male','female'],'disease1':[0.82,0.75,0.60,0.24],'disease2':[0.76,0.93,0.51,0.44]})
fig = go.Figure()
offset = {'male': -0.1, 'female': 0.1}
marker_color_dict = {'male': 'teal', 'female':'pink'}
## set yaxis range
values = df[['disease1','disease2']].values.reshape(-1)
padding = 0.1
fig.update_yaxes(range=[min(values) - padding, 1.0])
for gender in ['male','female']:
    for i, disease in enumerate(['disease1','disease2']):
        ## show each gender only once in the legend (on its first trace)
        showlegend = (i == 0)
        y = df.loc[df['gender'] == gender, disease].values
        fig.add_trace(go.Scatter(
            x=[i + offset[gender]]*len(y),
            y=y,
            mode='markers',
            marker=dict(color=marker_color_dict[gender], size=20),
            legendgroup=gender,
            name=gender,
            showlegend=showlegend
        ))
fig.update_layout(
    xaxis = dict(
        tickmode = 'array',
        tickvals = [0.0,1.0],
        ticktext = ['disease1','disease2']
    )
)
fig.show()
The easiest option is to use seaborn.catplot with kind='swarm' or kind='strip'.
seaborn is a high-level API for matplotlib
seaborn: Plotting with categorical data
'swarm' draws a categorical scatterplot with non-overlapping points, but if there are many points, consider using 'strip'.
Reshape the dataframe from a wide to long format with pandas.DataFrame.melt, and then plot.
Incidentally, this is just two lines of code, (1) melt, and (2) plot
Tested in python 3.8.11, pandas 1.3.2, matplotlib 3.4.3, seaborn 0.11.2
import pandas as pd
import numpy as np # only for sample data
import seaborn as sns
np.random.seed(365)
rows = 200
data = {'Gender': np.random.choice(['Male', 'Female'], size=rows),
'Cancer': np.random.rand(rows).round(2),
'Covid-19': np.random.rand(rows).round(2)}
df = pd.DataFrame(data)
# display(df.head())
Gender Cancer Covid-19
0 Male 0.82 0.88
1 Male 0.02 0.95
2 Female 0.28 0.92
3 Female 0.55 0.28
4 Male 0.15 0.46
# convert to long form
data = df.melt(id_vars='Gender', var_name='Disease')
# display(data.head())
Gender Disease value
0 Male Cancer 0.82
1 Male Cancer 0.02
2 Female Cancer 0.28
3 Female Cancer 0.55
4 Male Cancer 0.15
# plot
sns.catplot(data=data, x='Disease', y='value', hue='Gender', kind='swarm', palette=['blue', 'pink'], s=4)

average of binned values

I have 2 separate dataframes and want to correlate them:
Time temperature | Time ratio
0 32 | 0 0.02
1 35 | 1 0.1
2 30 | 2 0.25
3 31 | 3 0.17
4 34 | 4 0.22
5 34 | 5 0.07
I want to bin my data every 0.05 (of ratio), with time as the index, and average all the temperature values that fall into each bin.
I will therefore obtain one averaged value for each 0.05 step.
Could anyone help out, please? Thanks!
****edit on how data look like**** (df1 on the left, df2 on the right)
Time device-1 device-2... | Time device-1 device-2...
0 32 34 | 0 0.02 0.01
1 35 31 | 1 0.1 0.23
2 30 30 | 2 0.25 0.15
3 31 32 | 3 0.17 0.21
4 34 35 | 4 0.22 0.13
5 34 31 | 5 0.07 0.06
This could work with the pandas library:
import pandas as pd
import numpy as np
temp = [32,35,30,31,34,34]
ratio = [0.02,0.1,0.25,0.17,0.22,0.07]
times = range(6)
# Create your dataframe
df = pd.DataFrame({'Time': times, 'Temperature': temp, 'Ratio': ratio})
# Bins (note: np.arange(0, 0.25, 0.05) stops at 0.2, so ratios above 0.2 fall outside every bin)
bins = pd.cut(df.Ratio, np.arange(0, 0.25, 0.05))
# get the mean temperature of each group and the list of each time
df.groupby(bins).agg({"Temperature": "mean", "Time": list})
Output:
             Temperature    Time
Ratio
(0.0, 0.05]         32.0     [0]
(0.05, 0.1]         34.5  [1, 5]
(0.1, 0.15]          NaN      []
(0.15, 0.2]         31.0     [3]
You can discard the empty bins with .dropna() like this:
df.groupby(bins).agg({"Temperature": "mean", "Time": list}).dropna()
             Temperature    Time
Ratio
(0.0, 0.05]         32.0     [0]
(0.05, 0.1]         34.5  [1, 5]
(0.15, 0.2]         31.0     [3]
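Since the question starts from two separate dataframes, a variation is to merge them on Time first and then bin. The sketch below assumes the frames are called temp_df and ratio_df (hypothetical names) and builds the bin edges with np.linspace so that they run all the way to 0.25 and the highest ratios are included.

```python
import numpy as np
import pandas as pd

# The two frames from the question, with hypothetical names.
temp_df = pd.DataFrame({"Time": range(6), "temperature": [32, 35, 30, 31, 34, 34]})
ratio_df = pd.DataFrame({"Time": range(6), "ratio": [0.02, 0.1, 0.25, 0.17, 0.22, 0.07]})

# Align temperature and ratio row by row via the shared Time column.
merged = temp_df.merge(ratio_df, on="Time")

# Six edges from 0 to 0.25 give five bins of width 0.05.
edges = np.linspace(0, 0.25, 6)
merged["bin"] = pd.cut(merged["ratio"], edges)

# observed=True drops bins that contain no rows.
mean_temp = merged.groupby("bin", observed=True)["temperature"].mean()
print(mean_temp)
```

Using np.linspace with an explicit number of edges avoids having to reason about whether np.arange includes the last step.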
EDIT: In the case of multiple machines, here is a solution:
import pandas as pd
import numpy as np
n_machines = 3
# Generate random data for temperature and ratios
temperature_df = pd.DataFrame({'Machine_{}'.format(i):
                               pd.Series(np.random.randint(30,40,10))
                               for i in range(n_machines)})
ratio_df = pd.DataFrame({'Machine_{}'.format(i):
                         pd.Series(np.random.uniform(0.01,0.5,10))
                         for i in range(n_machines)})
# If ratio is between 0 and 1, we get the bins spaced by .05
def get_bins(s):
    return pd.cut(s,np.arange(0,1,0.05))
# Get bin assignments for each machine
bins = ratio_df.apply(get_bins,axis=1)
# Get the mean of each group for each machine
df = temperature_df.apply(lambda x: x.groupby(bins[x.name]).agg("mean"))
Then if you want to display the result, you could use the seaborn package:
import matplotlib.pyplot as plt
import seaborn as sns
df_reshaped = df.reset_index().melt(id_vars='index')
df_reshaped.columns = [ 'Ratio bin','Machine','Mean temperature' ]
sns.barplot(data=df_reshaped,x="Ratio bin",y="Mean temperature",hue="Machine")
plt.show()

data visualisation - which kind of chart that comfortable with time and multiple data columns?

I have an Excel file that consists of 4 spreadsheets, each representing a period of time. Each spreadsheet has 3 columns of data: 'subject', 'measure', and 'frequency' (the data describes students' interest rate over each 10-year period).
E.G, sheet 1970-1980
frequency score
math 3.4 1
english 2.5 0.95
art 0.4 0.8
sheet 1981-1990
frequency score
math 4.7 0.5
english 2.3 0.48
art -0.4 0.13
sheet 1991-2000
frequency score
math 4.2 0.6
english 2.1 0.77
art -0.2 0.24
sheet 2000-2010
frequency score
math 4.5 0.55
english 1.9 0.66
art -0.23 0.19
I have created a scatter plot for each period, but I would like to see how the data moves over time. For example, the x-axis would represent the time period and the y-axis would represent the frequency and score.
Are there any suggestions?
First of all, I will reproduce the tables that you have here as pandas Dataframes and for three decades:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data_80s = {'math': [3.4, 1], 'english': [2.5, 0.95], 'art': [0.4, 0.8]}
df_80s = pd.DataFrame.from_dict(data_80s, orient='index', columns=['frequency', 'score'])
df_80s['decade'] = pd.to_datetime('1980', format='%Y')  # each frame gets its own decade label
df_80s['index'] = df_80s.index
data_90s = {'math': [4.7, 0.5], 'english': [2.3, 0.48], 'art': [-0.4, 0.13]}
df_90s = pd.DataFrame.from_dict(data_90s, orient='index', columns=['frequency', 'score'])
df_90s['decade'] = pd.to_datetime('1990', format='%Y')
df_90s['index'] = df_90s.index
data_20s = {'math': [4.2, 0.6], 'english': [2.1, 0.77], 'art': [-0.2, 0.24]}
df_20s = pd.DataFrame.from_dict(data_20s, orient='index', columns=['frequency', 'score'])
df_20s['decade'] = pd.to_datetime('2000', format='%Y')
df_20s['index'] = df_20s.index
You will probably just need to convert your Excel sheets to pandas DataFrames. Just don't forget to add the extra columns index and decade.
Then you can merge the dataframes:
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
frames = [df_80s, df_90s, df_20s]
result = pd.concat(frames, ignore_index=True)
And finally plot whatever you want:
f, (ax1, ax2) = plt.subplots(2, figsize=(15,10))
sns.lineplot(x='decade', y='score', hue = 'index', data=result, ax=ax1)
sns.lineplot(x='decade', y='frequency', hue = 'index', data=result, ax=ax2)
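For the step of turning the Excel sheets into DataFrames, pd.read_excel with sheet_name=None returns a dict that maps each sheet name to a DataFrame, and pd.concat can stack such a dict into one long table. The sketch below builds that dict by hand from the tables in the question so that it runs without a file; the file name in the comment and the level names period and subject are assumptions.

```python
import pandas as pd

subjects = ["math", "english", "art"]

# Stand-in for: sheets = pd.read_excel("students.xlsx", sheet_name=None, index_col=0)
# (hypothetical file name); the result has the same shape: sheet name -> DataFrame.
sheets = {
    "1970-1980": pd.DataFrame({"frequency": [3.4, 2.5, 0.4], "score": [1, 0.95, 0.8]}, index=subjects),
    "1981-1990": pd.DataFrame({"frequency": [4.7, 2.3, -0.4], "score": [0.5, 0.48, 0.13]}, index=subjects),
    "1991-2000": pd.DataFrame({"frequency": [4.2, 2.1, -0.2], "score": [0.6, 0.77, 0.24]}, index=subjects),
    "2000-2010": pd.DataFrame({"frequency": [4.5, 1.9, -0.23], "score": [0.55, 0.66, 0.19]}, index=subjects),
}

# The dict keys become the outer index level; reset_index turns both levels into columns.
long_df = pd.concat(sheets, names=["period", "subject"]).reset_index()
print(long_df.head())
```

The resulting long_df has one row per (period, subject) pair, which is the shape that seaborn's lineplot expects when hue is set to the subject column.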

Plot with pandas: group and mean

My data from my 'combos' data frame looks like this:
pr = [1.0,2.0,3.0,4.0,1.0,2.0,3.0,4.0,1.0,2.0,3.0,4.0,.....1.0,2.0,3.0,4.0]
lmi = [200, 200, 200, 250, 250,.....780, 780, 780, 800, 800, 800]
pred = [0.16, 0.18, 0.25, 0.43, 0.54......., 0.20, 0.34, 0.45, 0.66]
I plot the results like this:
fig,ax = plt.subplots()
for pr in [1.0,2.0,3.0,4.0]:
    ax.plot(combos[combos.pr==pr].lmi, combos[combos.pr==pr].pred, label=pr)
ax.set_xlabel('lmi')
ax.set_ylabel('pred')
ax.legend(loc='best')
And I get this plot:
How can I plot means of "pred" for each "lmi" data point when keeping the pairs (lmi, pr) intact?
You can first group your DataFrame by lmi then compute the mean for each group just as your title suggests:
combos.groupby('lmi').pred.mean().plot()
In one line we:
Group the combos DataFrame by the lmi column
Get the pred column for each lmi
Compute the mean across the pred column for each lmi group
Plot the mean for each lmi group
As of your updates to the question it is now clear that you want to calculate the means for each pair (pr, lmi). This can be done by grouping over these columns and then simply calling mean(). With reset_index(), we then restore the DataFrame format to the previous form.
combos.groupby(['lmi', 'pr']).mean().reset_index()
   lmi   pr  pred
0  200  1.0  0.16
1  200  2.0  0.18
2  200  3.0  0.25
3  250  1.0  0.54
4  250  4.0  0.43
5  780  2.0  0.20
6  780  3.0  0.34
7  780  4.0  0.45
8  800  1.0  0.66
In this new DataFrame pred does contain the means and you can use the same plotting procedure you have been using before.
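Put together, the grouping and the plotting loop might look like the sketch below. The data in combos_demo is made up for illustration (it repeats each (lmi, pr) pair twice so that the mean actually does something), and the variable names are hypothetical.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up data in the same shape as `combos`, with repeated (lmi, pr) pairs.
combos_demo = pd.DataFrame({
    "lmi":  [200, 200, 200, 200, 250, 250, 250, 250],
    "pr":   [1.0, 1.0, 2.0, 2.0, 1.0, 1.0, 2.0, 2.0],
    "pred": [0.10, 0.20, 0.30, 0.50, 0.40, 0.60, 0.70, 0.90],
})

# Mean of pred for every (lmi, pr) pair; as_index=False keeps lmi and pr as columns.
means = combos_demo.groupby(["lmi", "pr"], as_index=False)["pred"].mean()

# One line per pr value, as in the original plotting loop.
fig, ax = plt.subplots()
for pr_value, grp in means.groupby("pr"):
    ax.plot(grp["lmi"], grp["pred"], marker="o", label=pr_value)
ax.set_xlabel("lmi")
ax.set_ylabel("mean pred")
ax.legend(loc="best")
# Call plt.show() to display the figure in an interactive session.
```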
