So this is my dataset:
import pandas as pd

raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
            'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
            'female': [0, 1, 1, 0, 1],
            'age': [42, 52, 36, 24, 73],
            'preTestScore': [4, 24, 31, 2, 3],
            'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(raw_data, columns=['first_name', 'last_name', 'age', 'female', 'preTestScore', 'postTestScore'])
I'm new to plotting data and a bit lost here. I want to plot a line for each person, where the x-ticks are preTestScore and postTestScore and the y-ticks go from 0 to 100 (the possible range of test scores).
I was thinking that I could just make a scatter plot but then I wouldn't know how to connect the dots.
A slopegraph was what I was looking for. Thanks #mostlyoxygen
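import numpy as np
import matplotlib.pyplot as plt

# keep only the two score columns; .copy() avoids writing into a slice of df
x = df.loc[:, "preTestScore":"postTestScore"].copy()
x["full_name"] = df["first_name"] + " " + df["last_name"]
num_students = x.shape[0]
num_times = x.shape[1] - 1
plt.xlim(0, 1)
plt.ylim(0, 100)
plt.xticks(np.arange(2), ["preTestScore", "postTestScore"])
plt.title("Score changes after test taking")
plt.ylabel("Test score")
# one line per student, from the pre-test score (x=0) to the post-test score (x=1)
for row in x.values:
    plt.plot([0, 1], [row[0], row[1]], label=row[2])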
plt.legend(loc="best")
plt.show()
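For what it's worth, the same slopegraph can also be sketched with pandas alone, by transposing the two score columns so that each student becomes a column. This is only a minimal sketch: indexing by full name and the `Agg` backend line (for running without a display) are my own additions, not part of the answer above.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
            'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
            'preTestScore': [4, 24, 31, 2, 3],
            'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(raw_data)

# One row per student, indexed by full name, holding only the two scores.
full_name = df["first_name"] + " " + df["last_name"]
scores = df.set_index(full_name)[["preTestScore", "postTestScore"]]

# After transposing, rows are the two tests and columns are the students,
# so DataFrame.plot draws one line per student.
ax = scores.T.plot(marker="o", ylim=(0, 100), title="Score changes after test taking")
ax.set_xticks([0, 1])
ax.set_xticklabels(list(scores.columns))
ax.set_ylabel("Test score")
plt.savefig("slopegraph.png")  # or plt.show() in an interactive session
```

The legend comes for free from the column names, which is the main reason to prefer this form over the explicit loop.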
I'm trying to cluster time series using sklearn's OPTICS. The documentation says that the input X should have shape (n_samples, n_features), but my array has the form (n_samples, n_time_stamps, n_features). An example in code is further down.
My question is how I can use the fit function of OPTICS with a time series. I know that people have used OPTICS and DBSCAN with time series; I just can't figure out how they implemented it. Any help will be much appreciated.
[[[t00, x00], [t01, x01], ... [t0_n_timestamps, x0_n_timestamps]],
 [[t10, x10], [t11, x11], ... [t1_n_timestamps, x1_n_timestamps]],
 .
 .
 .
 [[t_n_samples_0, x_n_samples_0], [t_n_samples_1, x_n_samples_1], ... [t_n_samples_n_timestamps, x_n_samples_n_timestamps]]]
Given the following np.array as an input:
import numpy as np
import pandas as pd
from sklearn.cluster import OPTICS

data = np.array([
    [["00:00", 7], ["00:01", 37], ["00:02", 3]],
    [["00:00", 27], ["00:01", 137], ["00:02", 33]],
    [["00:00", 14], ["00:01", 17], ["00:02", 12]],
    [["00:00", 15], ["00:01", 123], ["00:02", 11]],
    [["00:00", 16], ["00:01", 12], ["00:02", 92]],
    [["00:00", 17], ["00:01", 23], ["00:02", 22]],
    [["00:00", 18], ["00:01", 23], ["00:02", 112]],
    [["00:00", 100], ["00:01", 200], ["00:02", 301]],
    [["00:00", 101], ["00:01", 201], ["00:02", 302]],
    [["00:00", 102], ["00:01", 203], ["00:02", 303]],
    [["00:00", 104], ["00:01", 207], ["00:02", 304]]])
I will proceed as follows:
# save shape info in three separate variables
x, y, z = data.shape
# idea from https://stackoverflow.com/a/36235454/5050691
output_arr = np.column_stack((np.repeat(np.arange(x), y), data.reshape(x * y, -1)))
# create a df out of the arr
df = pd.DataFrame(output_arr)
# rename for understandability
df = df.rename(columns={0: 'index', 1: 'time', 2: 'value'})
# Change the orientation between rows and columns so that rows
# that contain time info become columns
df = df.pivot(index="index", columns="time", values="value")
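# drop the "time" axis name and turn the index back into a column
# (the result has to be assigned, otherwise this line has no effect)
df = df.rename_axis(None, axis=1).reset_index()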
# get columns that refer to specific interval of time series
temporal_accessors = ["00:00", "00:01", "00:02"]
# extract data that will be used to carry out clustering
data_for_clustering = df[temporal_accessors].to_numpy()
# a set of exemplary params
params = {
    "xi": 0.05,
    "metric": "euclidean",
    "min_samples": 3
}
clusterer = OPTICS(**params)
fitted = clusterer.fit(data_for_clustering)
cluster_labels = fitted.labels_
df["cluster"] = cluster_labels
# Note: density-based algorithms have a notion of a "noise cluster", which sklearn
# marks with the label -1. That's why labels start at -1 for density-based clustering,
# and at 0 otherwise.
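For the given data and the presented choice of params, you'll get the following clusters: [0 0 1 0 0 0 0 0 1 1 1]
Note that the labels follow the row order of df after the pivot. Because the index values end up as strings here, they appear to be sorted as "0", "1", "10", "2", ..., so the third label belongs to the last sample rather than to the third one.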
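As a side note, when every series shares the same timestamps, the detour through pandas is not strictly necessary: slicing out the value column already gives an (n_samples, n_time_stamps) matrix of the shape OPTICS expects. A minimal sketch under that assumption, reusing the parameters above. The variable names and the `dtype=object` argument are my own choices.

```python
import numpy as np
from sklearn.cluster import OPTICS

# dtype=object keeps the mixed strings and integers exactly as written
# (an assumption of this sketch, not something the answer above does).
data = np.array([
    [["00:00", 7], ["00:01", 37], ["00:02", 3]],
    [["00:00", 27], ["00:01", 137], ["00:02", 33]],
    [["00:00", 14], ["00:01", 17], ["00:02", 12]],
    [["00:00", 15], ["00:01", 123], ["00:02", 11]],
    [["00:00", 16], ["00:01", 12], ["00:02", 92]],
    [["00:00", 17], ["00:01", 23], ["00:02", 22]],
    [["00:00", 18], ["00:01", 23], ["00:02", 112]],
    [["00:00", 100], ["00:01", 200], ["00:02", 301]],
    [["00:00", 101], ["00:01", 201], ["00:02", 302]],
    [["00:00", 102], ["00:01", 203], ["00:02", 303]],
    [["00:00", 104], ["00:01", 207], ["00:02", 304]]], dtype=object)

# Position 1 on the last axis holds the measured value; the timestamps are
# identical for every sample, so they carry no information for clustering.
X = data[:, :, 1].astype(float)   # shape (n_samples, n_time_stamps)

# Each row of X is one time series treated as a point in 3-dimensional space.
labels = OPTICS(xi=0.05, metric="euclidean", min_samples=3).fit(X).labels_

print(X.shape)
print(labels)  # one label per sample, in the original sample order
```

Treating each timestamp as a feature makes the Euclidean distance compare series point by point, which is only meaningful when the series are aligned and of equal length.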
This seems like it should be easy, but I cannot get it working.
import pandas as pd

data = {'Name': ['Tom', 'nick', 'krish', 'jack', 'Tom', 'nick', 'krish', 'jack'],
        'Age': [31, 46, 21, 37, 31, 46, 21, 37],
        'Times': [20, 21, 19, 18, 19, 20, 20, 19]}
df = pd.DataFrame(data)
df
# basic boxplot for 'Times'
df['Times'].plot(kind='box')
# Filtered version
filt = df['Name'] == 'Tom'
df.loc[filt, 'Times'].plot(kind='box')
# comparing two columns is easy but I want to compare the same column with different row filters.
df[['Times', 'Age']].plot(kind='box')
So how do I compare these two versions of the same column side by side?
Many thanks
You simply pass a list to plt.boxplot():
box = plt.boxplot([df['Times'], df[df['Name'] == 'Tom']['Times']],
                  labels=['all', 'Toms'])
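A self-contained version of the same idea might look like the sketch below. As far as I can tell, recent Matplotlib releases (3.9 and later) deprecate the `labels` keyword of `boxplot` in favour of `tick_labels`, so this sketch sets the tick labels on the axes instead, which should work on either side of that change. The `Agg` backend line is only there so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

data = {'Name': ['Tom', 'nick', 'krish', 'jack', 'Tom', 'nick', 'krish', 'jack'],
        'Age': [31, 46, 21, 37, 31, 46, 21, 37],
        'Times': [20, 21, 19, 18, 19, 20, 20, 19]}
df = pd.DataFrame(data)

# Two differently filtered views of the same column.
all_times = df['Times']
tom_times = df.loc[df['Name'] == 'Tom', 'Times']

fig, ax = plt.subplots()
# Each element of the list becomes one box, drawn at x = 1, 2, ...
box = ax.boxplot([all_times, tom_times])
ax.set_xticklabels(['all', 'Toms'])
ax.set_ylabel('Times')
fig.savefig("boxplot_all_vs_tom.png")  # or plt.show() interactively
```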
I compared Tom, Others, and All
import pandas as pd
import matplotlib.pyplot as plt

data = {'Name': ['Tom', 'nick', 'krish', 'jack', 'Tom', 'nick', 'krish', 'jack'],
        'Age': [31, 46, 21, 37, 31, 46, 21, 37],
        'Times': [20, 21, 19, 18, 19, 20, 20, 19]}
df = pd.DataFrame(data)
print(df)
df.boxplot(column='Times', by='Age')
grouped=df.groupby(['Name','Times']).any().unstack().reset_index().transpose()
df2=pd.DataFrame(grouped)
new_header = df2.iloc[0]
df2 = df2[1:]
df2.columns = new_header
df2.reset_index(inplace=True)
# column groups; "all_cols" avoids shadowing the built-in all()
others = [x for x in df2.columns if x not in ['Tom', 'Times']]
all_cols = [x for x in df2.columns if x not in ['Times']]
df2['Others'] = df2[others].any(axis=1)
df2['All'] = df2[all_cols].any(axis=1)
print(df2.columns)
print(df2)
df2.boxplot(column='Times',by=['Others'])
df2.boxplot(column='Times',by=['Tom'])
df2.boxplot(column='Times',by=['All'])
plt.show()
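If the goal is only to see Tom, the others, and everybody side by side, a boolean mask may be a shorter route than reshaping the frame. The following is a sketch of that alternative, not a rewrite of the code above; the dictionary of groups and the `Agg` backend line are my own additions.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

data = {'Name': ['Tom', 'nick', 'krish', 'jack', 'Tom', 'nick', 'krish', 'jack'],
        'Age': [31, 46, 21, 37, 31, 46, 21, 37],
        'Times': [20, 21, 19, 18, 19, 20, 20, 19]}
df = pd.DataFrame(data)

# True for Tom's rows, False for everybody else.
is_tom = df['Name'] == 'Tom'

# Label -> values; dictionaries keep insertion order, so boxes appear in this order.
groups = {
    'Tom': df.loc[is_tom, 'Times'],
    'Others': df.loc[~is_tom, 'Times'],
    'All': df['Times'],
}

fig, ax = plt.subplots()
box = ax.boxplot(list(groups.values()))
ax.set_xticklabels(list(groups.keys()))
ax.set_ylabel('Times')
fig.savefig("tom_others_all.png")  # or plt.show() interactively
```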
An approach similar to the accepted answer, with no need to hard-code the names:
import pandas as pd
import matplotlib.pyplot as plt
data = {'Name': ['Tom', 'nick', 'krish', 'jack', 'Tom', 'nick', 'krish', 'jack'],
        'Age': [31, 46, 21, 37, 31, 46, 21, 37],
        'Times': [20, 21, 19, 18, 19, 20, 20, 19]}
df = pd.DataFrame(data)
df_list = [df["Times"]]
labels_list = ["all"]
# if you don't want "all", just start from empty lists
#df_list = []
#labels_list = []
grouped_df = df.groupby("Name")
# add one box per name
for name in grouped_df.groups.keys():
    labels_list.append(name)
    df_list.append(grouped_df.get_group(name)["Times"])
plt.boxplot(df_list, labels = labels_list)
plt.show()
There are options to have plots side by side, likewise for pandas dataframes. Is there a way to plot a pandas dataframe and a plot side by side?
This is the code I have so far, but the dataframe is distorted.
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import table
# sample data
d = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
     'jan': [4, 24, 31, 2, 3],
     'feb': [25, 94, 57, 62, 70],
     'march': [5, 43, 23, 23, 51]}
df = pd.DataFrame(d)
df['total'] = df.iloc[:, 1:].sum(axis=1)
plt.figure(figsize=(16,8))
# plot table
ax1 = plt.subplot(121)
plt.axis('off')
tbl = table(ax1, df, loc='center')
tbl.auto_set_font_size(False)
tbl.set_fontsize(14)
# pie chart
ax2 = plt.subplot(122, aspect='equal')
df.plot(kind='pie', y='total', ax=ax2, autopct='%1.1f%%',
        startangle=90, shadow=False, labels=df['name'], legend=False, fontsize=14)
plt.show()
It's pretty simple to do with plotly and make_subplots():
- define a figure with an appropriate specs argument
- add_trace() a table built from your data frame
- add_trace() a pie chart built from your data frame
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
# sample data
d = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
     'jan': [4, 24, 31, 2, 3],
     'feb': [25, 94, 57, 62, 70],
     'march': [5, 43, 23, 23, 51]}
df = pd.DataFrame(d)
df['total'] = df.iloc[:, 1:].sum(axis=1)
fig = make_subplots(rows=1, cols=2, specs=[[{"type":"table"},{"type":"pie"}]])
fig = fig.add_trace(go.Table(cells={"values":df.T.values}, header={"values":df.columns}), row=1,col=1)
fig.add_trace(px.pie(df, names="name", values="total").data[0], row=1, col=2)
fig.show()
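If you would rather stay with matplotlib, one possible sketch is to create both axes up front and draw the table with `Axes.table` directly. Whether this cures the distortion depends on the font and figure size, so treat the `scale(1, 2)` factor as a starting point rather than a fix; it and the `Agg` backend line are my own assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

d = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
     'jan': [4, 24, 31, 2, 3],
     'feb': [25, 94, 57, 62, 70],
     'march': [5, 43, 23, 23, 51]}
df = pd.DataFrame(d)
df['total'] = df.iloc[:, 1:].sum(axis=1)

fig, (ax_table, ax_pie) = plt.subplots(1, 2, figsize=(16, 8))

# Left: hide the axes frame and draw the data frame as a table.
ax_table.axis('off')
tbl = ax_table.table(cellText=df.values.tolist(),
                     colLabels=list(df.columns),
                     loc='center')
tbl.auto_set_font_size(False)
tbl.set_fontsize(14)
tbl.scale(1, 2)  # stretch the row height so the larger font fits

# Right: pie chart of the totals.
wedges, texts, autotexts = ax_pie.pie(df['total'], labels=df['name'],
                                      autopct='%1.1f%%', startangle=90)
ax_pie.set_aspect('equal')
fig.savefig("table_and_pie.png")  # or plt.show() interactively
```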
The plot I am trying to make needs to achieve 3 things:
1. If a quiz is taken on the same day with the same score, that point needs to be bigger.
2. If two quiz scores overlap, there needs to be some jitter so we can see all points.
3. Each quiz needs to have its own color.
Here is how I am going about it.
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

data = {'Quiz': [1, 1, 2, 1, 2, 1],
        'Score': [7.5, 5.0, 10, 10, 10, 10],
        'Day': [2, 5, 5, 5, 11, 11],
        'Size': [115, 115, 115, 115, 115, 355]}
df = pd.DataFrame.from_dict(data)
sns.lmplot(x = 'Day', y='Score', data = df, fit_reg=False, x_jitter = True, scatter_kws={'s': df.Size})
plt.show()
Setting the hue, which almost does everything I need, results in this.
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

data = {'Quiz': [1, 1, 2, 1, 2, 1],
        'Score': [7.5, 5.0, 10, 10, 10, 10],
        'Day': [2, 5, 5, 5, 11, 11],
        'Size': [115, 115, 115, 115, 115, 355]}
df = pd.DataFrame.from_dict(data)
sns.lmplot(x = 'Day', y='Score', data = df, fit_reg=False, hue = 'Quiz', x_jitter = True, scatter_kws={'s': df.Size})
plt.show()
Is there a way I can have hue while keeping the size of my points?
It doesn't work because, when you use hue, seaborn draws a separate scatter plot for each hue level, so the size array you pass through scatter_kws= no longer lines up with the rows being plotted.
You can recreate the same effect by hand however:
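import numpy as np
import matplotlib.pyplot as plt

x_col = 'Day'
y_col = 'Score'
hue_col = 'Quiz'
size_col = 'Size'
jitter = 0.2  # standard deviation of the horizontal jitter
fig, ax = plt.subplots()
# one scatter call per quiz, so each group's sizes stay aligned with its rows
for q, temp in df.groupby(hue_col):
    n = len(temp[x_col])
    x = temp[x_col] + np.random.normal(scale=jitter, size=(n,))
    ax.scatter(x, temp[y_col], s=temp[size_col], label=q)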
ax.set_xlabel(x_col)
ax.set_ylabel(y_col)
ax.legend(title=hue_col)
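The first requirement in the question (bigger points for repeated results) suggests that the Size column could be derived from the data instead of being typed in by hand. Here is a rough sketch, assuming a hypothetical raw table with one row per quiz attempt; the `raw` data and the `base_size` scaling are made up for illustration.

```python
import pandas as pd

# Hypothetical raw data: one row per attempt. Quiz 1 with score 10 on day 11
# occurs three times, so it should end up as a single, larger point.
raw = pd.DataFrame({
    'Quiz':  [1, 1, 2, 1, 2, 1, 1, 1],
    'Score': [7.5, 5.0, 10, 10, 10, 10, 10, 10],
    'Day':   [2, 5, 5, 5, 11, 11, 11, 11],
})

# Count identical (Quiz, Day, Score) combinations.
counts = raw.groupby(['Quiz', 'Day', 'Score']).size().reset_index(name='n')

# Marker area grows linearly with the number of identical results.
base_size = 115  # arbitrary choice, taken from the smallest size in the question
counts['Size'] = base_size * counts['n']

print(counts)
```

The resulting frame has one row per distinct point and can be passed to the loop above in place of df.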
I tried the following manual approach:
import pandas as pd
import matplotlib.pyplot as plt
# named "data" rather than "dict" to avoid shadowing the built-in
data = {'id': ['a','b','c','d'], 'testers_time': [10, 30, 15, None], 'stage_1_to_2_time': [30, None, 30, None], 'activated_time' : [40, None, 45, None],'stage_2_to_3_time' : [30, None, None, None],'engaged_time' : [70, None, None, None]}
df = pd.DataFrame(data, columns=['id', 'testers_time', 'stage_1_to_2_time', 'activated_time', 'stage_2_to_3_time', 'engaged_time'])
df = df.dropna(subset=['testers_time']).sort_values('testers_time')
prob = df['testers_time'].value_counts(normalize=True)
print(prob)
#0.333333, 0.333333, 0.333333
plt.plot(df['testers_time'], prob, marker='.', linestyle='-')
plt.show()
And I tried the following approach I found on stackoverflow:
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt

data = {'id': ['a','b','c','d'], 'testers_time': [10, 30, 15, None], 'stage_1_to_2_time': [30, None, 30, None], 'activated_time' : [40, None, 45, None],'stage_2_to_3_time' : [30, None, None, None],'engaged_time' : [70, None, None, None]}
df = pd.DataFrame(data, columns=['id', 'testers_time', 'stage_1_to_2_time', 'activated_time', 'stage_2_to_3_time', 'engaged_time'])
df = df.dropna(subset=['testers_time']).sort_values('testers_time')
fit = stats.norm.pdf(df['testers_time'], np.mean(df['testers_time']), np.std(df['testers_time']))
print(fit)
# [0.02902547 0.04346777 0.01829513]
plt.plot(df['testers_time'], fit, marker='.', linestyle='-')
# `normed` has been removed from plt.hist in newer Matplotlib; `density=True` replaces it
plt.hist(df['testers_time'], density=True)
plt.show()
As you can see, I get completely different values. The probabilities are correct for #1, but for #2 they aren't (nor do they add up to 100%), and the y-axis (%) of the histogram is based on 6 bins, not 3.
Can you explain how I can get the right probability for #2?
The first approach gives you a probability mass function. The second gives you a probability density, hence the name probability density function (pdf). Both are correct; they just show different things. The values of a density do not have to sum to 1: it is the area under the curve that equals 1.
If you evaluate the pdf over a larger range (e.g. 10 times the standard deviation), it will look much like the expected Gaussian curve.
import pandas as pd
import scipy.stats as stats
import numpy as np
import matplotlib.pyplot as plt
data = {'id': ['a','b','c','d'], 'testers_time': [10, 30, 15, None], 'stage_1_to_2_time': [30, None, 30, None], 'activated_time' : [40, None, 45, None],'stage_2_to_3_time' : [30, None, None, None],'engaged_time' : [70, None, None, None]}
df = pd.DataFrame(data, columns=['id', 'testers_time', 'stage_1_to_2_time', 'activated_time', 'stage_2_to_3_time', 'engaged_time'])
df= df.dropna(subset=['testers_time']).sort_values('testers_time')
mean = np.mean(df['testers_time'])
std = np.std(df['testers_time'])
x = np.linspace(mean - 5*std, mean + 5*std)
fit = stats.norm.pdf(x, mean, std)
print(fit)
plt.plot(x, fit, marker='.', linestyle='-')
plt.hist(df['testers_time'], density=True)
plt.show()
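The difference between the probability mass function and the density can also be checked numerically. The sketch below uses the three observed times from the question and writes the normal density out by hand so that only NumPy and pandas are needed; the helper name `normal_pdf` and the grid size are my own choices.

```python
import numpy as np
import pandas as pd

times = pd.Series([10.0, 15.0, 30.0])

# Probability mass function: relative frequency of each observed value.
pmf = times.value_counts(normalize=True)

# Parameters of the fitted Gaussian (ddof=0 matches np.std's default).
mean = times.mean()
std = times.std(ddof=0)

def normal_pdf(x):
    """Density of a normal distribution with the mean and std defined above."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# Density evaluated at the three data points: small values that do not sum to 1.
at_points = normal_pdf(times.to_numpy())

# Approximate the area under the curve with a Riemann sum over +/- 5 std.
x = np.linspace(mean - 5 * std, mean + 5 * std, 2001)
dx = x[1] - x[0]
area = np.sum(normal_pdf(x)) * dx

print(pmf.sum())   # the masses add up to 1
print(at_points)   # density values at 10, 15 and 30
print(area)        # the area under the density is close to 1
```

The density values at the data points match the ones printed in the question, which shows that the second approach was computing the right thing, just not a probability per value.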