I am currently working on k-means clustering of multiple groups using groupby.
The data I'm working on looks like this
date        permno  mom1m  mom2m   ...  mom48m
2004-01-31  80000   0.515  -0.32   ...  0.773
2004-02-29  80000   0.415  -0.043  ...  0.64
2004-03-31  80000   0.314  0.045   ...  0.43
2004-01-30  80001   0.643  -0.234  ...  0.34
2004-02-29  80001   0.646  -0.456  ...  0.646
2004-03-31  80001   0.876  -0.044  ...  0.321
2004-01-31  80002   0.453  0.045   ...  0.324
I will be grouping the dataframe by date, and within each date I want to perform k-means clustering on the columns mom2m through mom48m.
I also want a separate column that holds the resulting cluster labels.
What I have done so far is write a function that performs the k-means clustering and call it through transform.
def cluster(X, n_clusters):
    # features_to_KMeans: the list of column names mom2m ... mom48m (defined elsewhere)
    features = X[features_to_KMeans]
    k_means = KMeans(n_clusters=n_clusters)
    y = k_means.fit_predict(features)
    return y
crsp['cluster_id'] = crsp.groupby("date").transform(cluster, n_clusters=50)
'scikit-learn' expects a two-dimensional, array-like input of shape (n_samples, n_features), for example a numpy array. Keep in mind that a one-dimensional array must be converted to a two-dimensional one. For example, if you used only one column, you would need to do the following:
np.array(crsp.loc[:, 'mom2m']).reshape(-1, 1)
I do not know whether grouping is necessary here. In my opinion it is not needed.
In the example below, the library 'mglearn' is used to draw the result. The triangles show the center of each cluster.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
import mglearn
def cluster(X, n_clusters):
    # Fit k-means and return both the labels and the fitted model
    k_means = KMeans(n_clusters=n_clusters)
    y = k_means.fit_predict(X)
    return y, k_means
# Two columns already give a 2-D array, so no reshape is needed here
arr = np.array(crsp.loc[:, ['mom2m', 'mom48m']])
labels, model = cluster(arr, 3)
# Data points, marked by cluster label
mglearn.discrete_scatter(arr[:, 0], arr[:, 1], labels, markers='o')
# Cluster centers, drawn as triangles
mglearn.discrete_scatter(model.cluster_centers_[:, 0],
                         model.cluster_centers_[:, 1], [0, 1, 2], markers='^', markeredgewidth=2)
plt.show()
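If one clustering per date is what is wanted, as the question describes, a per-group fit could look like the sketch below. The data is synthetic, the helper name `cluster_one_date` and the value `n_clusters=3` are made up for illustration, and every date is assumed to have at least `n_clusters` rows.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for crsp (assumption: the real frame has the same layout).
rng = np.random.default_rng(0)
dates = pd.to_datetime(["2004-01-31", "2004-02-29", "2004-03-31"])
crsp = pd.DataFrame({
    "date": dates.repeat(20),
    "permno": np.tile(np.arange(80000, 80020), 3),
    "mom2m": rng.normal(size=60),
    "mom48m": rng.normal(size=60),
})
features_to_kmeans = ["mom2m", "mom48m"]


def cluster_one_date(group, n_clusters):
    """Fit k-means on one date's rows; return labels aligned to the group's index."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    labels = km.fit_predict(group[features_to_kmeans])
    return pd.Series(labels, index=group.index)


# Fit one model per date and write the labels back by index.
crsp["cluster_id"] = -1
for _, group in crsp.groupby("date"):
    crsp.loc[group.index, "cluster_id"] = cluster_one_date(group, n_clusters=3)

print(crsp.groupby("date")["cluster_id"].value_counts())
```

Note that the labels are only meaningful within a date: cluster 0 on one date has no relation to cluster 0 on another.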
I am not comfortable with Python; I am much less intimidated by, and more at ease with, R. So indulge me on a silly question that has cost me a ton of searches without success.
I want to fit a regression model with sklearn, both with OLS and lasso. In particular, I like the mtcars dataset that is so easy to call in R, and, as it turns out, also very accessible in Python:
import statsmodels.api as sm
import pandas as pd
import statsmodels.formula.api as smf
mtcars = sm.datasets.get_rdataset("mtcars", "datasets", cache=True).data
df = pd.DataFrame(mtcars)
It looks like this:
mpg cyl disp hp drat ... qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 ... 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 ... 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 ... 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 ... 19.44 1 0 3 1
In trying to use LinearRegression() the usual structure found is
import numpy as np
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(x, y)
but to do so, I need to select several columns of df as the regressors x, and one column as the dependent variable y. For example, I'd like an x matrix that includes a column of 1's (for the intercept) as well as disp and qsec (numerical variables) and cyl (a categorical variable). As the dependent variable, I'd like to use mpg.
If it were possible, I would word it this way:
model = LinearRegression().fit(mpg ~['disp', 'qsec', C('cyl')], data=df)
But how do I go about the syntax for it?
Similarly, how can I do the same with lasso:
from sklearn.linear_model import Lasso
lasso = Lasso(alpha=0.001)
lasso.fit(mpg ~['disp', 'qsec', C('cyl')], data=df)
but again this is not the right syntax.
I did find that you can get the actual regression (OLS or lasso) by turning the dataframe into a matrix. However, the column names are then gone, and it is hard to tell which variable corresponds to each coefficient. And I still haven't found a simple way to get diagnostic values such as p-values, or even the R-squared to begin with.
You can maybe try patsy which is used by statsmodels:
import statsmodels.api as sm
import pandas as pd
import statsmodels.formula.api as smf
from patsy import dmatrix
mtcars = sm.datasets.get_rdataset("mtcars", "datasets", cache=True).data
mat = dmatrix("disp + qsec + C(cyl)", mtcars)
It looks like this. We can omit the first column (Intercept), since sklearn fits an intercept by default:
mat
DesignMatrix with shape (32, 5)
Intercept C(cyl)[T.6] C(cyl)[T.8] disp qsec
1 1 0 160.0 16.46
1 1 0 160.0 17.02
1 0 0 108.0 18.61
1 1 0 258.0 19.44
1 0 1 360.0 17.02
X = pd.DataFrame(mat[:,1:],columns = mat.design_info.column_names[1:])
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(X,mtcars['mpg'])
But the values in model.coef_ are not named. You can put them into a Series indexed by the column names to read them:
pd.Series(model.coef_,index = X.columns)
C(cyl)[T.6] -5.087564
C(cyl)[T.8] -5.535554
disp -0.025860
qsec -0.162425
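The same naming trick carries over to the lasso part of the question. The sketch below is one possible way to do it, using `pd.get_dummies` instead of patsy to encode cyl; the data is synthetic (an assumption, so that the example runs without downloading mtcars), and the coefficients therefore mean nothing beyond illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

# Synthetic stand-in for mtcars (assumption: only these four columns matter here).
rng = np.random.default_rng(0)
n = 60
df = pd.DataFrame({
    "cyl": rng.choice([4, 6, 8], size=n),
    "disp": rng.uniform(70, 470, size=n),
    "qsec": rng.uniform(14, 23, size=n),
})
df["mpg"] = 35 - 0.03 * df["disp"] - 0.5 * (df["cyl"] == 8) + rng.normal(size=n)

# One-hot encode cyl, dropping the first level so it acts as the baseline.
X = pd.get_dummies(df[["disp", "qsec", "cyl"]], columns=["cyl"],
                   drop_first=True, dtype=float)

# sklearn fits the intercept itself, so no column of 1's is needed.
lasso = Lasso(alpha=0.001, max_iter=10_000).fit(X, df["mpg"])

# Attach the column names to the coefficients so they stay readable.
coefs = pd.Series(lasso.coef_, index=X.columns)
print(coefs)
print("intercept:", lasso.intercept_)
```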
As for p-values, sklearn's linear regression has no ready-made method for them. You can check out these answers; maybe one of them is what you are looking for.
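If leaving sklearn is acceptable for the OLS case, the statsmodels formula interface (already imported in the question as `smf`) accepts an R-style formula and reports p-values and R-squared directly. A minimal sketch, again on synthetic data rather than the real mtcars:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for mtcars (assumption, so the example needs no download).
rng = np.random.default_rng(1)
n = 60
df = pd.DataFrame({
    "cyl": rng.choice([4, 6, 8], size=n),
    "disp": rng.uniform(70, 470, size=n),
    "qsec": rng.uniform(14, 23, size=n),
})
df["mpg"] = 35 - 0.03 * df["disp"] + rng.normal(size=n)

# R-style formula; C(cyl) marks cyl as categorical and the intercept is implicit.
fit = smf.ols("mpg ~ disp + qsec + C(cyl)", data=df).fit()

print(fit.params)    # named coefficients
print(fit.pvalues)   # one p-value per coefficient
print(fit.rsquared)  # R-squared of the fit
```

`fit.summary()` prints the full regression table, much like `summary(lm(...))` in R.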
Here are two ways. Both are unsatisfactory, especially because the variable labels seem to be gone once the regression gets going:
import statsmodels.api as sm
import pandas as pd
import statsmodels.formula.api as smf
mtcars = sm.datasets.get_rdataset("mtcars", "datasets", cache=True).data
df = pd.DataFrame(mtcars)
import numpy as np
from sklearn.linear_model import LinearRegression
Single variable regression mpg (dependent variable) ~ hp (independent variable):
lm = LinearRegression()
# Plain ndarray; recent scikit-learn versions reject np.matrix input
mat = df.to_numpy()
# mat[:, [3]] keeps hp as a 2-D column; mat[:, 0] is mpg
lmFit = lm.fit(mat[:, [3]], mat[:, 0])
print(lmFit.coef_)
print(lmFit.intercept_)
For multiple regression drat ~ wt + cyl + carb:
lmm = LinearRegression()
wt = np.array(df['wt'])
cyl = np.array(df['cyl'])
carb = np.array(df['carb'])
# Coefficients come back in this column order: cyl, wt, carb
stack = np.column_stack((cyl, wt, carb))
# Column 4 of mtcars is drat
lmFit2 = lmm.fit(stack, df.to_numpy()[:, 4])
print(lmFit2.coef_)
print(lmFit2.intercept_)
New to Python. I am building a classifier that predicts the likelihood of vaccination when trust in government (trustingov) plus trust in public health (poptrusthealth) from the dataset is greater than a certain percentage. I am not sure how to get both as classes.
UPDATE: I combined the two columns into one label, but why is the accuracy of the model 1.0?
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import os
df = pd.read_csv("covidpopulation2.csv")
print(df.head())
99853 8254 219 0.649999976 0.80763793
0 99853 8254 219 0.65 0.807638
1 48490 4007 227 0.49 0.357625
2 190179 8927 107 0.54 0.853186
3 190179 8927 107 0.54 0.853186
4 190179 8927 107 0.54 0.853186
print(df.describe())
99853 8254 219 0.649999976 0.80763793
count 1.342500e+04 13425.000000 13425.000000 13425.000000 13425.000000
mean 3.095292e+05 20555.570056 225.864655 0.473157 0.684484
std 5.070872e+05 28547.608184 218.078176 0.184501 0.167985
min 1.225700e+04 26.000000 2.000000 0.000000 0.357625
25% 5.456200e+04 1674.000000 28.000000 0.370000 0.563528
50% 1.581740e+05 8254.000000 148.000000 0.490000 0.660156
75% 2.992510e+05 29575.000000 453.000000 0.630000 0.838449
max 2.234475e+06 119941.000000 621.000000 0.770000 0.983146
df = pd.read_csv("covidpopulation2.csv", na_values = ['?'], names = ['covidcases','coviddeaths','mortalityperm','trustngov','poptrusthealth'])
print(df.head())
covidcases coviddeaths mortalityperm trustngov poptrusthealth
0 99853 8254 219 0.65 0.807638
1 99853 8254 219 0.65 0.807638
2 48490 4007 227 0.49 0.357625
3 190179 8927 107 0.54 0.853186
4 190179 8927 107 0.54 0.853186
print(df.describe())
covidcases coviddeaths mortalityperm trustngov poptrusthealth
count 1.342600e+04 13426.000000 13426.000000 13426.00000 13426.000000
mean 3.095136e+05 20554.653806 225.864144 0.47317 0.684493
std 5.070715e+05 28546.742358 218.070062 0.18450 0.167982
min 1.225700e+04 26.000000 2.000000 0.00000 0.357625
25% 5.456200e+04 1674.000000 28.000000 0.37000 0.563528
50% 1.581740e+05 8254.000000 148.000000 0.49000 0.660156
75% 2.992510e+05 29575.000000 453.000000 0.63000 0.838449
max 2.234475e+06 119941.000000 621.000000 0.77000 0.983146
df.dropna(inplace=True)
print(df.describe())
covidcases coviddeaths mortalityperm trustngov poptrusthealth
count 1.342600e+04 13426.000000 13426.000000 13426.00000 13426.000000
mean 3.095136e+05 20554.653806 225.864144 0.47317 0.684493
std 5.070715e+05 28546.742358 218.070062 0.18450 0.167982
min 1.225700e+04 26.000000 2.000000 0.00000 0.357625
25% 5.456200e+04 1674.000000 28.000000 0.37000 0.563528
50% 1.581740e+05 8254.000000 148.000000 0.49000 0.660156
75% 2.992510e+05 29575.000000 453.000000 0.63000 0.838449
max 2.234475e+06 119941.000000 621.000000 0.77000 0.983146
all_features = df[['covidcases',
                   'coviddeaths',
                   'mortalityperm',
                   'trustngov',
                   'poptrusthealth']].values
# Combined trust score; the label below is derived directly from it
all_classes = (df['poptrusthealth'].values + df['trustngov'].values)
willing = 0
unwilling = 0
label = [None] * len(all_classes)  # was hard-coded to 13426
for i in range(len(all_classes)):
    if all_classes[i] > 0.70:
        willing += 1
        label[i] = 1
    else:
        unwilling += 1
        label[i] = 0
print(willing)
print(unwilling)
all_classes = label
from sklearn import preprocessing
scaler = preprocessing.StandardScaler()
all_features_scaled = scaler.fit_transform(all_features)
from sklearn.model_selection import train_test_split
np.random.seed(1234)
(training_inputs,testing_inputs,training_classes,testing_classes) = train_test_split(all_features_scaled,all_classes,train_size = 0.8,test_size = 0.2,random_state = 1)
from sklearn.tree import DecisionTreeClassifier
clf=DecisionTreeClassifier(random_state=1)
clf.fit(training_inputs, training_classes)
DecisionTreeClassifier(random_state=1)
print(clf)
DecisionTreeClassifier(random_state=1)
print('the accuracy of the decision tree is:',clf.score(testing_inputs, testing_classes))
the accuracy of the decision tree is: 1.0
import pydotplus
from sklearn import tree
import collections
import graphviz
feature_names = ['covidcases', 'coviddeaths', 'mortalityperm', 'trustngov',
                 'poptrusthealth']
dot_data = tree.export_graphviz(clf, feature_names=feature_names, out_file=None, filled=True, rounded=True)
graph = pydotplus.graph_from_dot_data(dot_data)
colors = ('turquoise', 'orange')
edges = collections.defaultdict(list)
for edge in graph.get_edge_list():
    edges[edge.get_source()].append(int(edge.get_destination()))
for edge in edges:
    edges[edge].sort()
    for i in range(2):
        dest = graph.get_node(str(edges[edge][i]))[0]
        dest.set_fillcolor(colors[i])
graph.write_png('tree.png')
Any help or ideas would be appreciated.
Sorry, but this makes no sense from a machine learning point of view. Your label is directly created from the input features. That's why the model accuracy is 100%.
Here is your final classifier (without needing any machine learning):
if trustingov + poptrusthealth > 0.7 predict 1, otherwise predict 0.
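To see the effect in isolation, the sketch below rebuilds the setup on synthetic data (an assumption; the column names follow the question, the values are random) and compares hold-out accuracy with and without the two columns the label was computed from. The helper name `holdout_accuracy` is made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: the first three columns are unrelated to the label by construction.
rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "covidcases": rng.integers(10_000, 2_000_000, size=n),
    "coviddeaths": rng.integers(26, 120_000, size=n),
    "mortalityperm": rng.integers(2, 621, size=n),
    "trustngov": rng.uniform(0.0, 0.7, size=n),
    "poptrusthealth": rng.uniform(0.0, 0.7, size=n),
})

# Same labeling rule as in the question: a deterministic function of two features.
y = ((df["trustngov"] + df["poptrusthealth"]) > 0.70).astype(int)


def holdout_accuracy(columns):
    """Train a decision tree on the given columns and score it on a 20% hold-out."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[columns], y, test_size=0.2, random_state=1)
    clf = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
    return clf.score(X_test, y_test)


acc_leaky = holdout_accuracy(list(df.columns))  # includes the label's own inputs
acc_clean = holdout_accuracy(["covidcases", "coviddeaths", "mortalityperm"])

print("with the trust columns:   ", acc_leaky)
print("without the trust columns:", acc_clean)
```

With the trust columns present, the tree only has to rediscover the threshold rule. Without them it has nothing to learn from in this synthetic setup, so the score falls to roughly chance level.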
It is perfectly possible to get 100% accuracy on training data, because the ML algorithm has already seen it.
You have to evaluate your model on data not used during the learning phase. This is usually done by splitting the data into a training set and a test set.
You then train/fit the model on the training data only, and test it and calculate the accuracy on the test data. The test accuracy tells you whether your model is well trained and working.
Held-out test data is essential for a fair test, because it gives you an unbiased estimate of the accuracy.
I have a Pandas series that has an index and the values are the counts for each value of the index. I want to plot a CDF (preferably just the line, not the full histogram) where the x-axis represents the index.
For example, if my series is s, then s.index is the array of values that should be represented on the x-axis and s.values holds the counts. I have tried s.plot(cumulative=True, ...), but that puts the values on the x-axis, not the index.
Example: s.index yields an array of values from 0 to 1, with 0.01 increments (0.00, 0.01, 0.02, ... 1.00). s.values yields an array of the counts, for example (4372, 1340, 205,...), where each one corresponds to the index (0.01 has a count of 1340). I would like the x-axis to be the 0.00, 0.01,... and the y-axis goes from 0 to 1 as the cumulative distribution based on the counts.
Using seaborn package, you can achieve that:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
x = np.arange(0,.1,0.01)
df = pd.DataFrame({'value':[1340,1200,1300,1150,1421,1175,1232,1432,1123,1231]},index=x)
df
value
0.00 1340
0.01 1200
0.02 1300
0.03 1150
0.04 1421
0.05 1175
0.06 1232
0.07 1432
0.08 1123
0.09 1231
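# distplot is deprecated and would draw a density of the index values, ignoring the counts;
# ecdfplot with the counts as weights draws the cumulative distribution as a line
sns.ecdfplot(x=df.index.to_numpy(), weights=df['value'].to_numpy())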
plt.show()
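The cumulative distribution can also be computed directly from the counts with pandas, which keeps the index on the x-axis as the question asks. A minimal sketch, reusing the example counts above; the name `cdf` is just illustrative:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Index holds the values, data holds the count for each value.
x = np.arange(10) / 100
s = pd.Series([1340, 1200, 1300, 1150, 1421, 1175, 1232, 1432, 1123, 1231], index=x)

# Running total of the counts, normalized so the last point is exactly 1.
cdf = s.sort_index().cumsum() / s.sum()

# Draw the CDF as a step line; call plt.show() in an interactive session.
fig, ax = plt.subplots()
ax.step(cdf.index, cdf.values, where="post")
ax.set_xlabel("value")
ax.set_ylabel("cumulative proportion")
ax.set_ylim(0, 1)
```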
I have this kind of data :
ID x1 x2 x3 x4 x5 x6 x7 x8 x9 x10
1 -0.18 5 -0.40 -0.26 0.53 -0.66 0.10 2 -0.20 1
2 -0.58 5 -0.52 -1.66 0.65 -0.15 0.08 3 3.03 -2
3 -0.62 5 -0.09 -0.38 0.65 0.22 0.44 4 1.49 1
4 -0.22 -3 1.64 -1.38 0.08 0.42 1.24 5 -0.34 0
5 0.00 5 1.76 -1.16 0.78 0.46 0.32 5 -0.51 -2
What's the best method for visualizing this data? I'm using matplotlib to visualize it, and I read it from a CSV file using pandas.
Thanks
Visualising data in a high-dimensional space is always a difficult problem. One solution that is commonly used (and is now available in pandas) is to inspect all of the 1D and 2D projections of the data. It doesn't give you all of the information about the data, but that's impossible to visualise unless you can see in 10D! Here's an example of how to do this with pandas (version 0.7.3 upwards):
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix
#first make some fake data with same layout as yours
data = pd.DataFrame(np.random.randn(100, 10), columns=['x1', 'x2', 'x3',\
'x4','x5','x6','x7','x8','x9','x10'])
#now plot using pandas
scatter_matrix(data, alpha=0.2, figsize=(6, 6), diagonal='kde')
This generates a plot with all of the 2D projections as scatter plots, and KDE histograms of the 1D projections:
I also have a pure matplotlib approach to this on my github page, which produces a very similar type of plot (it is designed for MCMC output, but is also appropriate here). Here's how you'd use it here:
import corner_plot as cp
# data.as_matrix() was removed in pandas 1.0; to_numpy() is the replacement
cp.corner_plot(data.to_numpy(), axis_labels=data.columns, nbins=10,
               figsize=(7,7), scatter=True, fontsize=10, tickfontsize=7)
You can also let the plot change over time, so that at each instant you plot a different "dimension" of the dataframe.
Here is an example of a plot that changes over time; you can adjust it for your purposes:
import matplotlib.pyplot as plt
import numpy as np
fig = plt.figure()
ax = fig.add_subplot(111)
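x = np.arange(-3, 3, 0.01)
# Start at n=1: with n=0 the expression divides by zero for every x
for n in range(1, 15):
    y = np.sin(np.pi*x*n) / (np.pi*x*n)
    # ax.clear() replaces plt.hold(False), which current matplotlib no longer provides
    ax.clear()
    ax.grid(True)
    line, = ax.plot(x, y)
    plt.draw()
    plt.pause(0.5)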
I am trying to learn PCA through and through, but interestingly enough, numpy and sklearn give me different covariance matrices.
The numpy results match this explanatory text here, but the sklearn results differ from both.
Is there any reason why this is so?
d = pd.read_csv("example.txt", header=None, sep = " ")
print(d)
0 1
0 0.69 0.49
1 -1.31 -1.21
2 0.39 0.99
3 0.09 0.29
4 1.29 1.09
5 0.49 0.79
6 0.19 -0.31
7 -0.81 -0.81
8 -0.31 -0.31
9 -0.71 -1.01
Numpy Results
print(np.cov(d, rowvar = 0))
[[ 0.61655556 0.61544444]
[ 0.61544444 0.71655556]]
sklearn Results
from sklearn.decomposition import PCA
clf = PCA()
clf.fit(d.values)
print(clf.get_covariance())
[[ 0.5549 0.5539]
[ 0.5539 0.6449]]
This comes from how np.cov normalizes. From its documentation:
Default normalization is by (N - 1), where N is the number of observations given (unbiased estimate). If bias is 1, then normalization is by N.
With bias=1, the result is the same as the PCA output:
In [9]: np.cov(d, rowvar=0, bias=1)
Out[9]:
array([[ 0.5549, 0.5539],
[ 0.5539, 0.6449]])
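The two normalizations differ only by the factor (N - 1)/N, which can be checked on the ten rows from the question. The sketch below also reports which of the two the installed sklearn reproduces; this may depend on the scikit-learn version, so the code tests both rather than assuming one.

```python
import numpy as np
from sklearn.decomposition import PCA

# The ten observations from the question (columns already have mean 0).
X = np.array([
    [0.69, 0.49], [-1.31, -1.21], [0.39, 0.99], [0.09, 0.29], [1.29, 1.09],
    [0.49, 0.79], [0.19, -0.31], [-0.81, -0.81], [-0.31, -0.31], [-0.71, -1.01],
])
n = X.shape[0]

cov_unbiased = np.cov(X, rowvar=False)           # divides by N - 1
cov_biased = np.cov(X, rowvar=False, bias=True)  # divides by N

# The two estimates differ only by the constant factor (N - 1) / N.
print(np.allclose(cov_biased, cov_unbiased * (n - 1) / n))

# Check which normalization the installed sklearn version reproduces.
pca_cov = PCA().fit(X).get_covariance()
if np.allclose(pca_cov, cov_unbiased):
    print("get_covariance() matches the N - 1 normalization")
elif np.allclose(pca_cov, cov_biased):
    print("get_covariance() matches the N normalization")
```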
I've encountered the same issue, and I think it returns different values because the covariance is calculated in a different way. According to the sklearn documentation, the get_covariance() method uses the noise variances to obtain the covariance matrix.