Wrong logistic regression, analysis of customer churn - python

I want to predict customer churn based on two columns: total_day_minutes, which is the total number of minutes a customer spoke, and churn, where 1 means the customer left us and 0 means they did not. While exploring my data I came across some outliers (scatter plot screenshot). On the first graph you can see some abnormal values that are not lined up with the rest. I decided to clean them and fit a logistic regression with the following code:
Unfortunately, when I plotted the S-curve and tried to add the threshold as a vertical line, it looks strange: the threshold line sits at the top of the S-curve instead of crossing it. What am I doing wrong?
Screenshot of my S-curve and the logistic regression results: (screenshot omitted)
In the end I need to find out which customers will probably leave soon (based on these two columns and the logistic regression), i.e. the number of minutes at which they start leaving me (do people who speak more, or less, tend to leave?).
Thanks in advance.
# required imports
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

# cleaning outliers with an IQR rule (2 * IQR instead of the usual 1.5)
Q1 = df_data['total_day_minutes'].quantile(0.25)
Q3 = df_data['total_day_minutes'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 2 * IQR
upper_bound = Q3 + 2 * IQR
# filter the data within the bounds
df_filtered2 = df_data[(df_data['total_day_minutes'] >= lower_bound) &
                       (df_data['total_day_minutes'] <= upper_bound)]
# define the dependent and independent variables
y = df_filtered2['churn']
X = df_filtered2['total_day_minutes']
# add a constant term to X
X = sm.add_constant(X)
# transform the independent variable
#X['total_day_minutes'] = np.log(X['total_day_minutes'])
# fit the logistic regression model
result = sm.Logit(y, X).fit()
# print the model summary
print(result.summary())
# get the minimum and maximum values of X
x_min = X['total_day_minutes'].min()
x_max = X['total_day_minutes'].max()
# create a new range of values for X
X_new = pd.DataFrame({'total_day_minutes': np.linspace(x_min, x_max, 1000)})
X_new = X_new.astype(float)
# add a constant term to X_new
X_new = sm.add_constant(X_new)
# predict the probabilities of churn for X_new
y_pred = result.predict(X_new)
# plot the S-curve
plt.plot(X_new['total_day_minutes'], y_pred, label='S-curve')
plt.xlabel('Total Day Minutes')
plt.ylabel('Probability of Churn')
# calculate and plot the threshold value
threshold_value = np.exp(X_new.loc[y_pred[y_pred >= 0.5].index[0]]['total_day_minutes'])
print(threshold_value)
plt.axhline(y=threshold_value, color='black', linestyle='--', label='Threshold')
plt.legend()
plt.show()
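For reference, a minimal sketch of how the p = 0.5 decision boundary could be drawn as a vertical line instead (this assumes the fitted result, X_new and y_pred from the code above; the boundary in minutes is where the fitted log-odds b0 + b1*x equal zero, i.e. x = -b0/b1):

# sketch: draw the p = 0.5 decision boundary as a vertical line
b0 = result.params['const']
b1 = result.params['total_day_minutes']
boundary_minutes = -b0 / b1   # minutes at which the predicted probability is 0.5

plt.plot(X_new['total_day_minutes'], y_pred, label='S-curve')
plt.axhline(y=0.5, color='gray', linestyle=':', label='p = 0.5')
plt.axvline(x=boundary_minutes, color='black', linestyle='--', label='Decision boundary')
plt.xlabel('Total Day Minutes')
plt.ylabel('Probability of Churn')
plt.legend()
plt.show()

If the coefficient on total_day_minutes is positive, customers above boundary_minutes are predicted (at the 0.5 cut-off) to be the likely churners; if it is negative, it is the customers below it.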

Related

Outlier detection of time series with One Class Support Vector Machine

I'm trying to detect outliers in a time series data using the one class support vector machine from Sklearn. My problem is that every point is classified as an outlier, and I can't see what I'm doing wrong.
Here is my code:
# data consists of [index of value, value]
# required imports
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.svm import OneClassSVM

outliers_fraction = 0.01
model = OneClassSVM(nu=outliers_fraction, kernel="rbf", gamma='auto')
model.fit(data)
df['anomaly'] = pd.Series(model.predict(data))
fig, ax = plt.subplots(figsize=(10,6))
a = df.loc[df['anomaly'] == -1, ['DateTime(UTC+01:00)','feature'] ]#anomaly
ax.plot(df['DateTime(UTC+01:00)'], df['feature'], color='blue')
ax.scatter(a['DateTime(UTC+01:00)'],a['feature'], color='red')
And here is my result (plot: outliers in time series).
The red dots represent the outliers. As you can see, every point is classified as an outlier. I would expect only the points after a jump to be classified as outliers.
Edit: added the score_samples output (screenshot)
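One possible cause worth checking (an assumption, not a confirmed diagnosis): the model is fitted on the raw [index, value] pairs, and the unscaled index column can dominate the RBF kernel. A minimal sketch of scaling first and then inspecting score_samples, assuming data is the array from the question:

from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

# sketch only: scale the [index, value] features before fitting
scaled = StandardScaler().fit_transform(data)

model = OneClassSVM(nu=0.01, kernel="rbf", gamma='auto')
model.fit(scaled)

labels = model.predict(scaled)          # +1 = inlier, -1 = outlier
scores = model.score_samples(scaled)    # higher = more "normal"
print((labels == -1).mean())            # fraction flagged as outliers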

How to calculate distance of coordinates and categorical dataset with DBSCAN Algorithm?

I have a dataset containing coordinates and categorical data, such as below:
I have searched many papers and journals for guidance on which distance measure to use on my dataset with the DBSCAN algorithm. I have a mixed dataset with Latitude and Longitude (coordinates) and Jenis Kecelakaan (Accident Type) as categorical data. How do we cluster a mixed dataset like this? Are there any recommendations for a distance measure that works well with DBSCAN in my case?
I've been stuck on this problem for days. Please help me out with an explanation, a paper/journal link, or a blog post (e.g. on Medium/Towards Data Science).
Read this article; I prefer using one-hot encoding:
import pandas as pd
your_df = pd.read_csv('./your_data.csv')
# generate binary values using get_dummies
dum_df = pd.get_dummies(your_df, columns=["Jenis Kecelakaan"])
dum_df.head()
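A minimal sketch of one way to feed the mixed data to DBSCAN after one-hot encoding (this is an illustration, not part of the original answer; the coordinate column names, the weight and the eps/min_samples values are assumptions to be tuned):

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

dum_df = pd.get_dummies(your_df, columns=["Jenis Kecelakaan"])

# hypothetical coordinate column names
coords = StandardScaler().fit_transform(dum_df[["Latitude", "Longitude"]])
dummies = dum_df.filter(like="Jenis Kecelakaan_").to_numpy()

category_weight = 0.5   # how strongly a category mismatch should count vs. location
features = np.hstack([coords, category_weight * dummies])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(features)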
Try it this way.
# import necessary modules
import pandas as pd, numpy as np, matplotlib.pyplot as plt, time
from sklearn.cluster import DBSCAN
from sklearn import metrics
from geopy.distance import great_circle
from shapely.geometry import MultiPoint
# define the number of kilometers in one radian
kms_per_radian = 6371.0088
# load the data set
df = pd.read_csv('C:\\travel-gps-full.csv', encoding = "ISO-8859-1")
df.head()
# how many rows are in this data set?
len(df)
# scatterplot it to get a sense of what it looks like
df = df.sort_values(by=['lat', 'lon'])
ax = df.plot(kind='scatter', x='lon', y='lat', alpha=0.5, linewidth=0)
# represent points consistently as (lat, lon)
# coords = df.as_matrix(columns=['lat', 'lon'])
df_coords = df[['lat', 'lon']]
# coords = df.to_numpy(df_coords)
# define epsilon as 10 kilometers, converted to radians for use by haversine
epsilon = 10 / kms_per_radian
start_time = time.time()
db = DBSCAN(eps=epsilon, min_samples=10, algorithm='ball_tree', metric='haversine').fit(np.radians(df_coords))
cluster_labels = db.labels_
unique_labels = set(cluster_labels)
# get the number of clusters
num_clusters = len(set(cluster_labels))
# get colors and plot all the points, color-coded by cluster (or gray if not in any cluster, aka noise)
fig, ax = plt.subplots()
colors = plt.cm.rainbow(np.linspace(0, 1, len(unique_labels)))
# for each cluster label and color, plot the cluster's points
for cluster_label, color in zip(unique_labels, colors):
    size = 150
    if cluster_label == -1:  # make the noise (labeled -1) appear as smaller gray points
        color = 'gray'
        size = 30
    # plot only the points that match the current cluster label
    points = df_coords[cluster_labels == cluster_label]
    ax.scatter(points['lon'], points['lat'], color=color, edgecolor='k', s=size, alpha=0.5)
ax.set_title('Number of clusters: {}'.format(num_clusters))
plt.show()
coefficient = metrics.silhouette_score(df_coords, cluster_labels)
print('Silhouette coefficient: {:0.03f}'.format(coefficient))
# set eps low (1.5km) so clusters are only formed by very close points
epsilon = 1.5 / kms_per_radian
# set min_samples to 1 so we get no noise - every point will be in a cluster even if it's a cluster of 1
start_time = time.time()
db = DBSCAN(eps=epsilon, min_samples=1, algorithm='ball_tree', metric='haversine').fit(np.radians(df_coords))
cluster_labels = db.labels_
unique_labels = set(cluster_labels)
# get the number of clusters
num_clusters = len(set(cluster_labels))
# all done, print the outcome
message = 'Clustered {:,} points down to {:,} clusters, for {:.1f}% compression in {:,.2f} seconds'
print(message.format(len(df), num_clusters, 100*(1 - float(num_clusters) / len(df)), time.time()-start_time))
# Result:
Silhouette coefficient: 0.854
Clustered 1,759 points down to 138 clusters, for 92.2% compression in 0.17 seconds
coefficient = metrics.silhouette_score(df_coords, cluster_labels)
print('Silhouette coefficient: {:0.03f}'.format(coefficient))
# number of clusters, ignoring noise if present
num_clusters = len(set(cluster_labels)) #- (1 if -1 in labels else 0)
print('Number of clusters: {}'.format(num_clusters))
# Result:
Number of clusters: 138
# create a series to contain the clusters - each element in the series is the points that compose each cluster
clusters = pd.Series([df_coords[cluster_labels == n] for n in range(num_clusters)])
clusters.tail()
data:
https://github.com/gboeing/2014-summer-travels/tree/master/data
sample code:
https://geoffboeing.com/2014/08/clustering-to-reduce-spatial-data-set-size/
I had the same question and this is the best link I could find online. It's a bit complex but I think creating the distance matrix by yourself, as suggested in the link, is the best option I'm aware of.
Many ML algorithms build a distance matrix internally to find neighbours. Here you can build your own: compute a distance matrix on lat/long using the Haversine formula, compute another distance matrix for the categorical feature (e.g. 0 if the categories match, 1 if they differ), combine the two (for example as a weighted sum) into a single matrix, and pass that to the model as a precomputed metric.
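A minimal sketch of that idea (an illustration, not the linked article's code; the dataframe name, column names, weight and eps value are assumptions):

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import haversine_distances

# assumed column names for illustration
coords_rad = np.radians(df[['Latitude', 'Longitude']].to_numpy())
geo_dist = haversine_distances(coords_rad) * 6371.0088   # kilometres

# 0 if the accident types match, 1 if they differ
codes = df['Jenis Kecelakaan'].astype('category').cat.codes.to_numpy()
cat_dist = (codes[:, None] != codes[None, :]).astype(float)

# weighted sum; the weight (km added per category mismatch) is a modelling choice
combined = geo_dist + 5.0 * cat_dist

labels = DBSCAN(eps=10, min_samples=5, metric='precomputed').fit_predict(combined)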

Plotting DWT Scaleogram in Python

I have a signal from a magnetic detector that I'm interested in analyzing. I've decomposed the signal using wavedec():
coeffs = pywt.wavedec(dane_K180_40['CH1[uV]'], 'coif5', level=5)
And I've received decomposition coefficients as follows:
cA5, cD5, cD4, cD3, cD2, cD1 = coeffs
These are ndarray objects with various lengths: cD1 has shape (1519,), cD2 has shape (774,), and so on. The different array lengths are my main obstacle.
(screenshot of the coefficients)
My question:
I have to make a DWT scaleogram, and I can't stress enough that I've tried my best and couldn't do it.
What is the best approach? Using matplotlib's imshow() as follows:
plt.imshow(np.abs([cD5, cD4, cD3, cD2, cD1]), cmap='bone', interpolation='none', aspect='auto')
gives me an error:
TypeError: Image data of dtype object cannot be converted to float
I've tried to google it (I'm not an expert in Python) and I've tried to convert the ndarrays to float.
What is best for plotting a scaleogram: matshow or pcolormesh? ;D
Basically, each cDi array has half as many samples as the previous one (this is not the case for every mother wavelet!), so I create a 2D numpy array where the first row holds the 'full' number of samples, and for each subsequent level I repeat the samples 2^level times so that the end result is a rectangular block. You can choose whether the Y-axis is plotted on a linear or a logarithmic scale.
# imports and parameters (N and t_n were not shown in the original answer;
# the values below are assumptions so the example is self-contained)
import numpy as np
import matplotlib.pyplot as plt
import pywt
from math import ceil, log2

N = 512        # number of samples (assumed)
t_n = 1.0      # signal duration in seconds (assumed)

# Create signal
xc = np.linspace(0, t_n, num=N)
xd = np.linspace(0, t_n, num=32)
sig = np.sin(2*np.pi * 64 * xc[:32]) * (1 - xd)
composite_signal3 = np.concatenate([np.zeros(32), sig[:32], np.zeros(N-32-32)])
# Use the Daubechies wavelet
w = pywt.Wavelet('db1')
# Perform Wavelet transform up to log2(N) levels
lvls = ceil(log2(N))
coeffs = pywt.wavedec(composite_signal3, w, level=lvls)
# Each level of the WT splits the frequency band in two and applies a
# WT to the highest band. The lower band then gets split into two again,
# and a WT is applied to the higher band of that split. This repeats
# 'lvls' times.
#
# Since the number of samples decreases at each level, we repeat the
# samples 2^i times (where i is the level) so that every level has the
# same number of transformed samples as the first level. This is only
# necessary for plotting.
rows = [np.abs(coeffs[-1])]
for i in range(lvls - 1):
    rows.append(np.abs(np.repeat(coeffs[lvls - 1 - i], pow(2, i + 1))))
cc = np.array(rows)
plt.figure()
plt.xlabel('Time (s)')
plt.ylabel('Frequency (Hz)')
plt.title('Discrete Wavelet Transform')
# X-axis has a linear scale (time)
x = np.linspace(start=0, stop=1, num=N//2)
# Y-axis has a logarithmic scale (frequency)
y = np.logspace(start=lvls-1, stop=0, num=lvls, base=2)
X, Y = np.meshgrid(x, y)
plt.pcolormesh(X, Y, cc, shading='auto')
use_log_scale = False
if use_log_scale:
    plt.yscale('log')
else:
    yticks = [pow(2, i) for i in range(lvls)]
    plt.yticks(yticks)
plt.tight_layout()
plt.show()

How to create balanced k-means geospatial clusters?

I have 9000 USA-based points (that are accounts) with a variety of different string and numeric columns/attributes. I'm trying to evenly divide these points/accounts up into equitable groupings that are both spatially grouped as well as weighted (in a gravity sense) by number of employees, which is one of the columns/attributes. I used sklearn K-means clustering to do a grouping and it seemed to work fine but I noticed that the groupings are not equal. Some of the groups have ~600 and some of them have ~70. This is somewhat logical as there is more data in certain areas. The problem here is that I need these groups to be more equal. Here’s the code I used:
kmeans = KMeans(n_clusters = 30, max_iter=1000, init ='k-means++')
lat_long = dftobeclustered[dftobeclustered.columns[1:3]]
_employees = dftobeclustered[dftobeclustered.columns[3]]
weighted_kmeans_clusters = kmeans.fit(lat_long, sample_weight = _employees)
dftobeclustered['cluster_label'] = kmeans.predict(lat_long, sample_weight = _employees)
centers = kmeans.cluster_centers_
labels = dftobeclustered['cluster_label']
Is it possible to divide up the k-means clusters in a more equal way? I think the core problem is that it breaks low population areas like Montana or Hawaii off into their own groups when I actually need it to combine those areas into bigger groups. But I don't know.
K-means is not written to work that way. Observations are assigned to clusters based on their actual MEASURED distances from centroids.
If you try to coerce the number of members in a cluster, you completely undo the distance measurement component, especially when you are talking geographically with lat/lon.
You may need to look at another method of subsetting your observations, or reconsider requiring equally sized clusters.
In all honesty, most of the time geographic distance clustering is directly related to the similarity of observations in other ways (think of how house styles, demographics or income in a neighborhood translate to a zip code, or tree types in a localized region). These sorts of things do not respect our need for them to be groups of the same size.
Clusters based on qualities OTHER than geography are more likely to level out, if there is clear differentiation and even numbers of observations, than clusters based on straight-up lat/lon, which will be distance-sorted... no way around it.
So areas with dense populations of observations WILL have more members than those with fewer. And the distance between MT and HI will always be greater than between MT and NYC, so they will NOT be clustered together geographically by distance.
I understand that you want equal groupings... is it necessary that they are geographically grouped? Given that MT and HI would end up together, the geographic label would not mean much. It might be better to use all of the NON-geographic numerical values to cluster, so you create groups of observations that are contextually alike.
Otherwise, you can use business rules to dissect the observations (by this I mean something like if var_x > 7 & var_y < 227 & ... then label = 1) and make some groups yourself. You can use groupby() and describe() in pandas to build cross tables and see what might be good values to split on, as in the sketch below.
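A minimal sketch of that idea (the column names and thresholds are hypothetical, not from the question's data):

import pandas as pd

# hypothetical columns 'state', 'employees' and 'revenue' on the accounts dataframe
summary = dftobeclustered.groupby('state')[['employees', 'revenue']].describe()
print(summary)

# hand-made business-rule groups based on cut points read off the summary
dftobeclustered['group'] = 1
dftobeclustered.loc[(dftobeclustered['employees'] > 7) &
                    (dftobeclustered['revenue'] < 227), 'group'] = 2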
Try DBSCAN. See my sample code below.
# import necessary modules
import pandas as pd, numpy as np, matplotlib.pyplot as plt, time
from sklearn.cluster import DBSCAN
from sklearn import metrics
from geopy.distance import great_circle
from shapely.geometry import MultiPoint
# define the number of kilometers in one radian
kms_per_radian = 6371.0088
# load the data set
df = pd.read_csv('C:\\your_path\\summer-travel-gps-full.csv', encoding = "ISO-8859-1")
df.head()
# how many rows are in this data set?
len(df)
# scatterplot it to get a sense of what it looks like
df = df.sort_values(by=['lat', 'lon'])
ax = df.plot(kind='scatter', x='lon', y='lat', alpha=0.5, linewidth=0)
# represent points consistently as (lat, lon)
# coords = df.as_matrix(columns=['lat', 'lon'])
df_coords = df[['lat', 'lon']]
# coords = df.to_numpy(df_coords)
# define epsilon as 10 kilometers, converted to radians for use by haversine
epsilon = 10 / kms_per_radian
start_time = time.time()
db = DBSCAN(eps=epsilon, min_samples=10, algorithm='ball_tree', metric='haversine').fit(np.radians(df_coords))
cluster_labels = db.labels_
unique_labels = set(cluster_labels)
# get the number of clusters
num_clusters = len(set(cluster_labels))
# get colors and plot all the points, color-coded by cluster (or gray if not in any cluster, aka noise)
fig, ax = plt.subplots()
colors = plt.cm.rainbow(np.linspace(0, 1, len(unique_labels)))
# for each cluster label and color, plot the cluster's points
for cluster_label, color in zip(unique_labels, colors):
    size = 150
    if cluster_label == -1:  # make the noise (labeled -1) appear as smaller gray points
        color = 'gray'
        size = 30
    # plot only the points that match the current cluster label
    points = df_coords[cluster_labels == cluster_label]
    ax.scatter(points['lon'], points['lat'], color=color, edgecolor='k', s=size, alpha=0.5)
ax.set_title('Number of clusters: {}'.format(num_clusters))
plt.show()
coefficient = metrics.silhouette_score(df_coords, cluster_labels)
print('Silhouette coefficient: {:0.03f}'.format(coefficient))
# set eps low (1.5km) so clusters are only formed by very close points
epsilon = 1.5 / kms_per_radian
# set min_samples to 1 so we get no noise - every point will be in a cluster even if it's a cluster of 1
start_time = time.time()
db = DBSCAN(eps=epsilon, min_samples=1, algorithm='ball_tree', metric='haversine').fit(np.radians(df_coords))
cluster_labels = db.labels_
unique_labels = set(cluster_labels)
# get the number of clusters
num_clusters = len(set(cluster_labels))
# all done, print the outcome
message = 'Clustered {:,} points down to {:,} clusters, for {:.1f}% compression in {:,.2f} seconds'
print(message.format(len(df), num_clusters, 100*(1 - float(num_clusters) / len(df)), time.time()-start_time))
# Result:
Silhouette coefficient: 0.854
Clustered 1,759 points down to 138 clusters, for 92.2% compression in 0.17 seconds
coefficient = metrics.silhouette_score(df_coords, cluster_labels)
print('Silhouette coefficient: {:0.03f}'.format(coefficient))
# number of clusters, ignoring noise if present
num_clusters = len(set(cluster_labels)) #- (1 if -1 in labels else 0)
print('Number of clusters: {}'.format(num_clusters))
Result:
Number of clusters: 138
# create a series to contain the clusters - each element in the series is the points that compose each cluster
clusters = pd.Series([df_coords[cluster_labels == n] for n in range(num_clusters)])
clusters.tail()
Result:
0      lat        lon
1587   37.921659  22...
1      lat        lon
1658   37.933609  23...
2      lat        lon
1607   37.966766  23...
3      lat        lon
1586   38.149019  22...
4      lat        lon
1584   38.374766  21...
...
133    lat        lon
662    50.37369   18.889205
134    lat        lon
561    50.448704  19.0...
135    lat        lon
661    50.462271  19.0...
136    lat        lon
559    50.489304  19.0...
137    lat        lon
1      51.474005  -0.450999
Data Source:
https://github.com/gboeing/2014-summer-travels/tree/master/data
Relevant Resources:
https://github.com/gboeing/urban-data-science/blob/2017/15-Spatial-Cluster-Analysis/cluster-analysis.ipynb
https://geoffboeing.com/2014/08/clustering-to-reduce-spatial-data-set-size/
During cluster assignment, one can also add a 'frequency penalty' to the distance. This is described in "Frequency Sensitive Competitive Learning for Balanced Clustering on High-dimensional Hyperspheres" by Arindam Banerjee and Joydeep Ghosh, IEEE Transactions on Neural Networks:
http://www.ideal.ece.utexas.edu/papers/arindam04tnn.pdf
They also have an online/streaming version.
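A minimal sketch of the frequency-penalty idea (a toy illustration of frequency-sensitive assignment, not the paper's exact update rule): during assignment, the distance to a centroid is scaled by how many points that cluster has already received, which discourages any one cluster from growing much larger than the others.

import numpy as np

def frequency_sensitive_kmeans(X, k, n_iter=100, rng=np.random.default_rng(0)):
    """Toy frequency-sensitive k-means: already-large clusters pay a distance penalty."""
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        counts = np.ones(k)                      # running cluster sizes
        labels = np.empty(len(X), dtype=int)
        for i in rng.permutation(len(X)):        # assign points one at a time
            d = np.linalg.norm(X[i] - centroids, axis=1)
            labels[i] = np.argmin(d * counts)    # penalise already-large clusters
            counts[labels[i]] += 1
        for j in range(k):                       # usual centroid update
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

For geographic data you would still want projected coordinates or a haversine distance in place of the plain euclidean norm; this sketch only shows where the penalty enters.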

Plotting the 2-d diffusion equation over time

So basically, I've created a plot in Python that models two interacting populations on an island, using the diffusion equation to model movement and population change in one dimension. I'm trying to modify it into a two-dimensional plot that takes into account both the x and y position of the species on the island.
Apologies that the code is quite long; most of it is just defining parameters. I'm not sure how to create a two-dimensional plot of the population density at each position. This is my attempt below, but the plot isn't correct at all. I think my numerical solution should be correct; I'm assuming it's my initial/boundary conditions and how I've set up the array for the probability distribution that are incorrect, but I'm not sure how to change them. For example, one of the problems I have is that the Y position ends up on the x-axis - how can I fix this?
# required imports
import numpy as np
import matplotlib.pyplot as plt

r1=0.1 # growth rate B
r2=0.3 # growth rate M
KB = 20 # carrying capacity B
KM = 6000 # carrying capacity M
x=np.arange(0,70) # distance from the coast (100m)
y=np.arange(0,90) # distance from the coast
dx=1 # change in x position
dy=1 # change in y position
dt=1 # time step
m=71 # number of x steps
o=91 # number of y steps
n=21 # time
alpha = 0.00002 # predation rate
beta = 0 # growth rate from predation (assumed to be zero)
B=np.zeros(shape=(m,o,n)) # species B
M=np.zeros(shape=(m,o,n)) # species M
D=0.35 # diffusivity of B
D2=0.05 # diffusivity of M
Alpha=(D*dt)/(dx*dx)
Beta=(D*dt)/(dy*dy)
Alpha2=(D2*dt)/(dx*dx)
Beta2 = (D2*dt)/(dy*dy)
M[0,:,0]=0 #initial conditions
M[m-1,:,0]=0
M[:,0,0]=0
M[:,m-1,0]=0
B[1:26,:,0]=0.96
B[26:44,:,0]=4.6
B[44:m-1,:,0]=0.96
B[:,1:26,0]=0.96
B[:,26:44,0]=4.6
B[:,44:m-1,0]=0.96
M[1:26,:,0]=2500
M[26:44,:,0]=1000
M[44:m-1,:,0]=2500
M[:,1:26,0]=2500
M[:,26:44,0]=1000
M[:,44:m-1,0]=2500
for k in range(0,n-1): # loop for time
    B[0,:,k+1]=B[1,:,k+1] # boundary conditions
    B[m-1,:,k+1]=B[m-2,:,k+1]
    M[0,:,k+1]=0
    M[m-1,:,k+1]=0
    B[:,0,k+1]=B[:,1,k+1]
    B[:,m-1,k+1]=B[:,m-2,k+1]
    M[:,0,k+1]=0
    M[:,m-1,k+1]=0
    for i in range(1,m-1): # loop for x
        for j in range(1,o-1): # loop for y
            Bdxx=(-2*Alpha)*B[i,j,k]+Alpha*B[i+1,j,k]+Alpha*B[i-1,j,k]
            Bdyy=(-2*Beta)*B[i,j,k]+Beta*B[i,j+1,k]+Beta*B[i,j-1,k]
            B[i,j,k+1]=B[i,j,k]+Bdxx+Bdyy+r1*B[i,j,k]*(1-B[i,j,k]/KB)-alpha*M[i,j,k]*B[i,j,k] # numerical solution for B
            Mdxx=(-2*Alpha2)*M[i,j,k]+Alpha2*M[i+1,j,k]+Alpha2*M[i-1,j,k]
            Mdyy=(-2*Beta2)*M[i,j,k]+Beta2*M[i,j+1,k]+Beta2*M[i,j-1,k]
            M[i,j,k+1]=M[i,j,k]+Mdxx+Mdyy+r2*M[i,j,k]*(1-M[i,j,k]/KM)+beta*M[i,j,k]*B[i,j,k] # numerical solution for M
plt.pcolormesh(B[:,:,n-1])
plt.colorbar()
plt.xlabel('X Position')
plt.ylabel('Y Position')
Any help is really appreciated, thanks!
Edit: I've realised that part of my problem is that the initial conditions for the y coordinates are wrong.
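On the axis question specifically: plt.pcolormesh(B[:,:,n-1]) puts the first array index (x here) on the vertical axis and the second index (y) on the horizontal axis. A minimal sketch of one possible fix, keeping the rest of the setup above, is to transpose the slice (grid indices are used directly as coordinates here):

# plot x on the horizontal axis and y on the vertical axis
plt.pcolormesh(np.arange(m), np.arange(o), B[:, :, n-1].T, shading='auto')
plt.colorbar()
plt.xlabel('X Position')
plt.ylabel('Y Position')
plt.show()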
