PDF plotting concern - python

I tried the following manual approach:
dict = {'id': ['a','b','c','d'], 'testers_time': [10, 30, 15, None], 'stage_1_to_2_time': [30, None, 30, None], 'activated_time' : [40, None, 45, None],'stage_2_to_3_time' : [30, None, None, None],'engaged_time' : [70, None, None, None]}
df = pd.DataFrame(dict, columns=['id', 'testers_time', 'stage_1_to_2_time', 'activated_time', 'stage_2_to_3_time', 'engaged_time'])
df= df.dropna(subset=['testers_time']).sort_values('testers_time')
prob = df['testers_time'].value_counts(normalize=True)
#0.333333, 0.333333, 0.333333
plt.plot(df['testers_time'], prob, marker='.', linestyle='-')
And I tried the following approach I found on stackoverflow:
dict = {'id': ['a','b','c','d'], 'testers_time': [10, 30, 15, None], 'stage_1_to_2_time': [30, None, 30, None], 'activated_time' : [40, None, 45, None],'stage_2_to_3_time' : [30, None, None, None],'engaged_time' : [70, None, None, None]}
df = pd.DataFrame(dict, columns=['id', 'testers_time', 'stage_1_to_2_time', 'activated_time', 'stage_2_to_3_time', 'engaged_time'])
df= df.dropna(subset=['testers_time']).sort_values('testers_time')
fit = stats.norm.pdf(df['testers_time'], np.mean(df['testers_time']), np.std(df['testers_time']))
# [0.02902547, 0.04346777, 0.01829513]
plt.plot(df['testers_time'], fit, marker='.', linestyle='-')
plt.hist(df['testers_time'], density=True)
As you can see, I get completely different values: the probabilities are correct for #1, but for #2 they aren't (nor do they add up to 100%), and the y axis (%) of the histogram is based on 6 bins, not 3.
Can you explain how I can get the right probability for #2?

The first approach gives you a probability mass function (PMF). The second gives you a probability density, hence the name probability density function (PDF). Both are correct; they just show different things. PMF values are probabilities and sum to 1, whereas PDF values are densities: it is the area under the curve that equals 1, so the individual values need not add up to 100%.
If you evaluate the PDF over a larger range (e.g. 10 times the standard deviation), it will look much like the expected Gaussian curve.
import pandas as pd
import scipy.stats as stats
import numpy as np
import matplotlib.pyplot as plt

# named `data` rather than `dict` to avoid shadowing the built-in
data = {'id': ['a','b','c','d'], 'testers_time': [10, 30, 15, None], 'stage_1_to_2_time': [30, None, 30, None], 'activated_time' : [40, None, 45, None],'stage_2_to_3_time' : [30, None, None, None],'engaged_time' : [70, None, None, None]}
df = pd.DataFrame(data, columns=['id', 'testers_time', 'stage_1_to_2_time', 'activated_time', 'stage_2_to_3_time', 'engaged_time'])
df = df.dropna(subset=['testers_time']).sort_values('testers_time')
mean = np.mean(df['testers_time'])
std = np.std(df['testers_time'])
# evaluate the fitted normal pdf over mean +/- 5 standard deviations
x = np.linspace(mean - 5*std, mean + 5*std)
fit = stats.norm.pdf(x, mean, std)
plt.plot(x, fit, marker='.', linestyle='-')
# density=True replaces the `normed` argument, which newer matplotlib versions no longer accept
plt.hist(df['testers_time'], density=True)
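To make the difference between the two quantities concrete, here is a minimal NumPy-only sketch. It assumes the same three non-null testers_time values and writes the normal density out by hand (the helper normal_pdf is a name introduced for this illustration, not part of scipy). It compares the sum of the PMF, the sum of the density values at the sample points, and the numerically integrated area under the density.

```python
import numpy as np

# The three non-null testers_time values from the question.
times = np.array([10.0, 15.0, 30.0])

# Empirical PMF: relative frequency of each distinct value.
values, counts = np.unique(times, return_counts=True)
pmf = counts / counts.sum()

# Sample mean and population standard deviation (ddof=0), as np.std gives.
mean, std = times.mean(), times.std()


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution (illustrative helper)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))


# Density evaluated only at the three sample points.
density_at_samples = normal_pdf(times, mean, std)

# Area under the density over mean +/- 5 std, using the trapezoidal rule.
grid = np.linspace(mean - 5 * std, mean + 5 * std, 2001)
dens = normal_pdf(grid, mean, std)
area = np.sum((dens[1:] + dens[:-1]) / 2 * np.diff(grid))

print("sum of PMF values:      ", pmf.sum())
print("sum of density values:  ", density_at_samples.sum())
print("area under the density: ", area)
```

The PMF sums to 1 and the area under the density is very close to 1, while the three density values on their own sum to roughly 0.09, which is why they do not look like probabilities.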


How to use Time series with the Sklearn OPTICS Algorithm?

I'm trying to cluster time series, and I want to use Sklearn OPTICS. The documentation says that the input array X should have dimensions (n_samples, n_features). My array is of the form (n_samples, n_time_stamps, n_features). There is an example in code further down.
My question is how I can use the fit method of OPTICS with a time series. I know that people have used OPTICS and DBSCAN with time series; I just can't figure out how they implemented it. Any help will be much appreciated.
[[[t00, x00], [t01, x01], ... [t0_n_timestamps, x0_n_timestamps]],
 [[t10, x10], [t11, x11], ... [t1_n_timestamps, x1_n_timestamps]],
 ...
 [[t_n_samples_0, x_n_samples_0], [t_n_samples_1, x_n_samples_1], ... [t_n_samples_n_timestamps, x_n_samples_n_timestamps]]]
Given the following np.array as an input:
import numpy as np
import pandas as pd
from sklearn.cluster import OPTICS

data = np.array([
[["00:00", 7], ["00:01", 37], ["00:02", 3]],
[["00:00", 27], ["00:01", 137], ["00:02", 33]],
[["00:00", 14], ["00:01", 17], ["00:02", 12]],
[["00:00", 15], ["00:01", 123], ["00:02", 11]],
[["00:00", 16], ["00:01", 12], ["00:02", 92]],
[["00:00", 17], ["00:01", 23], ["00:02", 22]],
[["00:00", 18], ["00:01", 23], ["00:02", 112]],
[["00:00", 100], ["00:01", 200], ["00:02", 301]],
[["00:00", 101], ["00:01", 201], ["00:02", 302]],
[["00:00", 102], ["00:01", 203], ["00:02", 303]],
[["00:00", 104], ["00:01", 207], ["00:02", 304]]])
I will proceed as follows:
# save shape info in three separate variables
x, y, z = data.shape
# idea from https://stackoverflow.com/a/36235454/5050691
output_arr = np.column_stack((np.repeat(np.arange(x), y), data.reshape(x * y, -1)))
# create a df out of the arr
df = pd.DataFrame(output_arr)
# rename for understandability
df = df.rename(columns={0: 'index', 1: 'time', 2: 'value'})
# Change the orientation between rows and columns so that rows
# that contain time info become columns
df = df.pivot(index="index", columns="time", values="value")
# assign the result back; otherwise this line has no effect
df = df.rename_axis(None, axis=1).reset_index()
# get columns that refer to specific interval of time series
temporal_accessors = ["00:00", "00:01", "00:02"]
# extract data that will be used to carry out clustering;
# the values were stored as strings in the mixed-type array, so cast them to float
data_for_clustering = df[temporal_accessors].to_numpy().astype(float)
# a set of exemplary params
params = {
    "xi": 0.05,
    "metric": "euclidean",
    "min_samples": 3,
}
clusterer = OPTICS(**params)
fitted = clusterer.fit(data_for_clustering)
cluster_labels = fitted.labels_
df["cluster"] = cluster_labels
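# Note: density-based algorithms have a notion of the "noise cluster", which is marked with
# -1 by sklearn algorithms. That's why the starting index is -1 for density-based clustering,
# and 0 otherwise.
For the given data and the presented choice of params, you'll get the following clusters: [0 0 1 0 0 0 0 0 1 1 1]
Note that the labels follow the row order of the pivoted frame. The sample index was stored as strings, so the rows sort as 0, 1, 10, 2, ..., 9, which means the third label belongs to sample 10.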
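If the time stamps are the same for every sample and the values are already numeric, a plain reshape is a possible alternative to the pivot. The sketch below uses a hypothetical random array named series (not the data above) and only shows the reshaping step; the resulting 2D array is the kind of (n_samples, n_features) input that OPTICS's fit method expects.

```python
import numpy as np

# Hypothetical numeric input of shape (n_samples, n_timestamps, n_features).
rng = np.random.default_rng(0)
series = rng.normal(size=(11, 3, 2))

n_samples, n_timestamps, n_features = series.shape

# Flatten each sample's (timestamp, feature) grid into a single row.
flat = series.reshape(n_samples, n_timestamps * n_features)
print(flat.shape)  # (11, 6)

# Column t * n_features + f holds feature f at timestamp t.
print(flat[4, 2 * n_features + 1] == series[4, 2, 1])
```

With a Euclidean metric, the distance between two flattened rows compares the series point by point, so this only makes sense when all series are aligned on the same time stamps.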

Why does setting hue in seaborn plot change the size of a point?

The plot I am trying to make needs to achieve 3 things.
If a quiz is taken on the same day with the same score, that point needs to be bigger.
If two quiz scores overlap there needs to be some jitter so we can see all points.
Each quiz needs to have its own color
Here is how I am going about it.
import seaborn as sns
import pandas as pd
data = {'Quiz': [1, 1, 2, 1, 2, 1],
'Score': [7.5, 5.0, 10, 10, 10, 10],
'Day': [2, 5, 5, 5, 11, 11],
'Size': [115, 115, 115, 115, 115, 355]}
df = pd.DataFrame.from_dict(data)
sns.lmplot(x = 'Day', y='Score', data = df, fit_reg=False, x_jitter = True, scatter_kws={'s': df.Size})
Setting the hue, which almost does everything I need, results in this.
import seaborn as sns
import pandas as pd
data = {'Quiz': [1, 1, 2, 1, 2, 1],
'Score': [7.5, 5.0, 10, 10, 10, 10],
'Day': [2, 5, 5, 5, 11, 11],
'Size': [115, 115, 115, 115, 115, 355]}
df = pd.DataFrame.from_dict(data)
sns.lmplot(x = 'Day', y='Score', data = df, fit_reg=False, hue = 'Quiz', x_jitter = True, scatter_kws={'s': df.Size})
Is there a way I can have hue while keeping the size of my points?
It doesn't work because, when you use hue, seaborn draws a separate scatter plot for each hue level, so the size array you pass through scatter_kws= no longer lines up with the rows being drawn in each plot.
You can recreate the same effect by hand however:
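import numpy as np
import matplotlib.pyplot as plt

x_col = 'Day'
y_col = 'Score'
hue_col = 'Quiz'
size_col = 'Size'
fig, ax = plt.subplots()
# draw one scatter per quiz so that each group keeps its own sizes
for q, temp in df.groupby(hue_col):
    n = len(temp[x_col])
    # add horizontal jitter by hand
    x = temp[x_col] + np.random.normal(scale=0.2, size=(n,))
    ax.scatter(x, temp[y_col], s=temp[size_col], label=q)
ax.legend(title=hue_col)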
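The misalignment can be seen without plotting anything. This small sketch, which assumes the per-hue drawing described in this answer, splits the question's data by Quiz and prints the sizes that belong to each group next to the full six-element array that scatter_kws would hand to every group.

```python
import pandas as pd

df = pd.DataFrame({
    'Quiz': [1, 1, 2, 1, 2, 1],
    'Score': [7.5, 5.0, 10, 10, 10, 10],
    'Day': [2, 5, 5, 5, 11, 11],
    'Size': [115, 115, 115, 115, 115, 355],
})

# The full array has one size per row of the whole frame.
print("all sizes:", df['Size'].tolist())

# Each hue level only covers a subset of the rows.
group_sizes = {}
for quiz, rows in df.groupby('Quiz'):
    group_sizes[quiz] = rows['Size'].tolist()
    print("quiz", quiz, "has", len(rows), "points with sizes", group_sizes[quiz])
```

Quiz 1 has four points and Quiz 2 has two, so a single six-element size array cannot match either group row for row.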

Plot specific values on y axis instead of increasing scale from dataframe

When plotting 2 columns from a dataframe into a line plot, is it possible to, instead of a consistently increasing scale, have fixed values on your y axis (and keep the distances between the numbers on the axis constant)? For example, instead of 0, 100, 200, 300, ... to have 0, 21, 53, 124, 287, depending on the values from your dataset? So basically to have on the axis all your possible values fixed instead of an increasing scale?
Yes, you can use: ax.set_yticks()
df = pd.DataFrame([[13, 1], [14, 1.5], [15, 1.8], [16, 2], [17, 2], [18, 3 ], [19, 3.6]], columns = ['A','B'])
fig, ax = plt.subplots()
x = df['A']
y = df['B']
ax.plot(x, y, 'g-')
# put a tick at every y value that occurs in the data
ax.set_yticks(y)
Or, if the values are very far apart, you can use ax.set_yscale('log').
df = pd.DataFrame([[13, 1], [14, 1.5], [15, 1.8], [16, 2], [17, 2], [18, 3 ], [19, 3.6], [20, 300]], columns = ['A','B'])
fig, ax = plt.subplots()
x = df['A']
y = df['B']
ax.plot(x, y, 'g-')
# matplotlib >= 3.3 takes base=; older versions used basey= for the y axis
ax.set_yscale('log', base=2)
What you need to do is:
1) get all distinct y values and sort them
2) set their y position on the plot according to their place in the ordered list
3) set the y labels according to the distinct ordered values
The code below does this:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
df = pd.DataFrame([[13, 1], [14, 1.8], [16, 2], [15, 1.5], [17, 2], [18, 3 ],
[19, 200],[20, 3.6], ], columns = ['A','B'])
x = df['A']
y = df['B']
y_keys = np.sort(y.unique())
y_values = range(len(y_keys))
y_dict = dict(zip(y_keys,y_values))
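fig, ax = plt.subplots()
ax.plot(x, [y_dict[k] for k in y], 'o-')
# label each evenly spaced position with the value it stands for
ax.set_yticks(list(y_values))
ax.set_yticklabels(y_keys)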
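The value-to-position mapping in the steps above can also be obtained in one call. As a sketch, np.unique with return_inverse=True returns the sorted distinct values (usable as tick labels) together with each point's position in that sorted list; the names labels and positions are chosen here for illustration.

```python
import numpy as np

# Column B from the example above.
y = np.array([1, 1.8, 2, 1.5, 2, 3, 200, 3.6])

# labels: sorted distinct values; positions: index of each y in labels.
labels, positions = np.unique(y, return_inverse=True)

print(labels.tolist())
print(positions.tolist())
```

Plotting positions instead of y, with ticks at range(len(labels)) labelled by labels, gives evenly spaced ticks regardless of how far apart the actual values are.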

Plot graph for every dataset entry

So this is my dataset:
raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
'female': [0, 1, 1, 0, 1],
'age': [42, 52, 36, 24, 73],
'preTestScore': [4, 24, 31, 2, 3],
'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'female', 'preTestScore', 'postTestScore'])
I'm new to plotting data and a bit lost here. I want to plot a line for each person, where the x-ticks are preTestScore and postTestScore and the y-ticks go from 0 to 100 (the possible range of test scores).
I was thinking that I could just make a scatter plot but then I wouldn't know how to connect the dots.
A slopegraph was what I was looking for. Thanks #mostlyoxygen
# copy the slice so that adding a column does not touch a view of df
x = df.loc[:, "preTestScore":"postTestScore"].copy()
x["full_name"] = df["first_name"] + " " + df["last_name"]
num_students = x.shape[0]
num_times = x.shape[1] - 1
plt.xlim(0, 1)
plt.ylim(0, 100)
plt.xticks(np.arange(2), ["preTestScore", "postTestScore"])
plt.title("Score changes after Test taking")
# one line per student, from pre-test score (x=0) to post-test score (x=1)
for row in x.values:
    plt.plot([0, 1], [row[0], row[1]], label=row[2])
plt.legend()

Add probability of x and conversion %

This is what the data currently looks like:
id  testers_time  stage_1_to_2_time  activated_time  stage_2_to_3_time  engaged_time
a   10            30                 40              30                 70
b   30
c   15            30                 45
d
dict = {'id': ['a','b','c','d'], 'testers_time': [10, 30, 15, None], 'stage_1_to_2_time': [30, None, 30, None], 'activated_time' : [40, None, 45, None],'stage_2_to_3_time' : [30, None, None, None],'engaged_time' : [70, None, None, None]}
df = pd.DataFrame(dict, columns=['id', 'testers_time', 'stage_1_to_2_time', 'activated_time', 'stage_2_to_3_time', 'engaged_time'])
I have a plot of testers_time against its cumulative probability from a CDF:
def ecdf(df):
    n = len(df)
    x = np.sort(df)
    y = np.arange(1.0, n+1) / n
    return x, y
df = df['testers_time'].dropna().sort_values()
x, y = ecdf(df)
plt.plot(x, y, marker='.', linestyle='none')
plt.axvline(x.mean(), color='gray', linestyle='dashed', linewidth=2) #Add mean
x_m = int(x.mean())
y_m = stats.percentileofscore(df, x.mean())/100.0
plt.annotate('(%s,%s)' % (x_m,int(y_m*100)) , xy=(x_m,y_m), xytext=(10,-5), textcoords='offset points')
percentiles= np.array([0,25,50,75,100])
x_p = np.percentile(df, percentiles)
y_p = percentiles/100.0
plt.plot(x_p, y_p, marker='D', color='red', linestyle='none') # Overlay quartiles
for x, y in zip(x_p, y_p):
    plt.annotate('%s' % int(x), xy=(x,y), xytext=(10,-5), textcoords='offset points')
What I am trying to do is graph testers_time against:
1) Its non-cumulative probability; if graphed, it should look like a sort of PDF
2) Its cumulative conversion %, where a conversion is any id that has a populated (not blank or null) testers_time. So id a (1 of 4 ids) converts, that's 25%; id b converts, that's 50% (since it is cumulative); id c converts, that's 75%; and id d doesn't convert, so 75% conversion is the max, at 30 days testers_time.
Can you assist with adding the above into columns in the df, or graph them? Thank you.
A1: df['prob'] = df['testers_time'].map(df.testers_time.value_counts(normalize=True))
A2: df['conv'] = df['testers_time'].rank(ascending=1)/len(df)
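As a quick check of the two one-liners, the sketch below applies them to a cut-down copy of the sample data. It assumes df is still the original four-row DataFrame (not the Series that the plotting code in the question assigns to df) and keeps only the id and testers_time columns.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': ['a', 'b', 'c', 'd'],
    'testers_time': [10, 30, 15, None],
})

# A1: relative frequency of each observed testers_time (NaN stays NaN).
df['prob'] = df['testers_time'].map(df['testers_time'].value_counts(normalize=True))

# A2: rank among the populated values, divided by the number of all ids,
# so the cumulative conversion tops out below 100% when some ids never convert.
df['conv'] = df['testers_time'].rank(ascending=True) / len(df)

print(df.sort_values('testers_time'))
```

Sorted by testers_time, conv reads 0.25, 0.50, 0.75 for ids a, c and b, and stays NaN for id d, which matches the 75% maximum described in the question.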

