Python regression analysis error

I'm trying to run a regression analysis with the code below, but I get ImportError: No module named statsmodels.api and ImportError: No module named matplotlib.pyplot. Any suggestions for resolving these errors would be appreciated.
import pandas as pd
import numpy as np
import seaborn as sns
from scipy import stats, integrate
import matplotlib.pyplot as plt
import statsmodels.api as sm
data = pd.read_csv(r"F:\Projects\Poli_Map\DAT_OL\MASTRTAB.csv")  # raw string keeps the backslashes literal
# Select the predictors straight from the loaded table. The .data, .feature_names
# and .target attributes belong to scikit-learn dataset objects, not to a DataFrame
# returned by read_csv (assumes the CSV has HH_LATR, COMM_TOILT, PWS and IMR columns)
X = data[["HH_LATR", "COMM_TOILT", "PWS"]]
# The target (IMR) is taken from the same table
y = data["IMR"]
model = sm.OLS(y, X).fit()
predictions = model.predict(X) # make the predictions by the model
# Print out the statistics
model.summary()
plt.scatter(predictions, y, s=30, c='r', marker='+', zorder=10) #Plot graph
plt.xlabel("Independent variables")
plt.ylabel("Outcome variables")
plt.show()

I highly recommend installing Anaconda. It sets the environment variables automatically, so you don't need to configure anything else, and it bundles many useful packages (e.g. numpy, sympy, scipy).
Moreover, in my experience, using pip on Windows and compiling from source (which requires Visual Studio) can be a real pain. That is the problem Anaconda was designed to solve.
See: https://www.anaconda.com/download/
Hope this helps.
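Whichever installer you choose, it may help to first confirm which interpreter is running and whether it can see the packages. The following is a minimal sketch using only the standard library; the helper name missing_modules is made up for this illustration.

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the top-level packages from `names` that this interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# The interpreter that is actually running; packages must be installed for this one.
print("Interpreter:", sys.executable)

# Top-level packages imported by the script in the question.
required = ["pandas", "numpy", "seaborn", "scipy", "matplotlib", "statsmodels"]
print("Missing:", missing_modules(required))
```

Anything reported as missing can then be installed for that same interpreter, for example with conda install <name> under Anaconda or python -m pip install <name> otherwise.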

Related

Python completely crashes when I use kmeans clustering from sklearn, and it is not a memory issue. Any suggestions?

When I run any code using kmeans clustering from sklearn, Python crashes (e.g., the kernel dies in Jupyter). This is not a memory usage issue, and as far as I can tell sklearn is up to date (version 1.0.2).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()
sns.set_style('white')
from sklearn.cluster import KMeans
import sklearn
print('The scikit-learn version is {}.'.format(sklearn.__version__))
# Sample data for clustering
data_file = 'cluster_data.csv'
df = pd.read_csv(data_file,index_col='id')
X = df[['x1','x2']]
# Plotting data for visual inspection of clusters
plt.figure(figsize = (10, 10)) # determines the size of the plot area
ax = sns.scatterplot(x='x1', y='x2',data=df,edgecolor='grey',alpha=0.5)
# Kmeans clustering
kmeans = KMeans(n_clusters=3, init='random').fit(df) # This is where the kernel dies
kmeans_centroids = kmeans.cluster_centers_
kmeans_labels_k3 = kmeans.labels_
When running the 'sklearn.cluster.KMeans' I get the message:
'The kernel appears to have died. It will restart automatically.'
Any suggestions?
(Other sklearn packages work e.g., random forests)
Access to data can be found here:
https://github.com/JakeTufts/Health-Data-Science-Msc/blob/Stack-overflow/cluster_data.csv
Your input data is a pandas DataFrame; try passing a NumPy array instead:
df2 = df.to_numpy()
sklearn.cluster.KMeans(n_clusters=3, init='random').fit(df2)
This should now work; please let me know the result.
You may still see a warning about incompatible libraries when using the tutorial data.
https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
Setting this environment variable stopped KMeans from killing my Jupyter kernel:
export MKL_THREADING_LAYER=GNU
Reference: https://github.com/joblib/threadpoolctl/blob/master/multiple_openmp.md#workarounds-for-intel-openmp-and-llvm-openmp-case
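If exporting the variable in the shell is inconvenient (for example inside a notebook), one option is to set it from Python. This is a sketch under the assumption that it runs before NumPy or scikit-learn are first imported in the process, since the threading layer is normally chosen when those libraries load; in Jupyter that means the first cell after a kernel restart.

```python
import os
import sys

# Set the variable for this process only; it has to be in place before MKL is loaded.
os.environ["MKL_THREADING_LAYER"] = "GNU"

# If NumPy or scikit-learn are already imported, the setting may come too late.
already_loaded = [name for name in ("numpy", "sklearn") if name in sys.modules]
if already_loaded:
    print("Already imported, restart the kernel first:", already_loaded)

# Import scikit-learn only after this point, e.g.:
# from sklearn.cluster import KMeans
```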

Plot calibration curve for machine learning

The code below only works for binary classification. How can I use it with three classes?
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import scikitplot as skp
orgnal_data = pd.read_excel("movie.xls")
# Program extracting first column
text = orgnal_data.iloc[:,0]
lable = orgnal_data.iloc[:,1]
x_train,x_test,y_train,y_test=train_test_split(fe,lable,test_size=0.30,random_state=40)
DT = DecisionTreeClassifier()
DT_y = DT.fit(x_train,y_train).predict(x_test)
clf_names = ['Decision Tree']
skp.metrics.plot_calibration_curve(y_test,DT_y,clf_names)
plt.show()
The scikit-plot module you are using has no calibration-curve function for multiclass problems. The source code says:
This function currently only works for binary classification.
So you can either 1) modify the source code or 2) open a GitHub issue requesting multiclass support.
EDIT 1:
Some scikit-learn models can handle multiclass problems. For example, LinearSVC handles multiclass classification with a one-vs-the-rest scheme.
So you can use such a model and then call the plot_calibration_curve function separately for each one-vs-rest case.
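The one-vs-rest idea can be sketched without scikit-plot. The example below is an illustration, not the scikit-plot API: it swaps in scikit-learn's own calibration_curve, uses the iris data as a stand-in three-class dataset (the movie data from the question is not available), and limits the tree depth so that the predicted probabilities are not all exactly 0 or 1.

```python
from sklearn.calibration import calibration_curve
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in three-class data (assumption: replaces the question's movie data).
X, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=40
)

# A shallow tree gives fractional leaf probabilities instead of pure 0/1 values.
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(x_train, y_train)
probas = clf.predict_proba(x_test)  # shape: (n_samples, n_classes)

# One calibration curve per class: "this class" versus "all other classes".
curves = {}
for idx, cls in enumerate(clf.classes_):
    y_binary = (y_test == cls).astype(int)
    frac_pos, mean_pred = calibration_curve(y_binary, probas[:, idx], n_bins=5)
    curves[cls] = (mean_pred, frac_pos)

for cls, (mean_pred, frac_pos) in curves.items():
    print(cls, mean_pred, frac_pos)
```

Each entry of curves can then be drawn with plt.plot(mean_pred, frac_pos, marker='o'), one line per class, next to the diagonal that represents perfect calibration.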

Sklearn decision tree plot does not appear

I am trying to follow scikit learn example on decision trees:
from sklearn.datasets import load_iris
from sklearn import tree
X, y = load_iris(return_X_y=True)
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, y)
When I try to plot the tree:
tree.plot_tree(clf.fit(iris.data, iris.target))
I get
NameError Traceback (most recent call last)
<ipython-input-2-e72b33a93ee6> in <module>
----> 1 tree.plot_tree(clf.fit(iris.data, iris.target))
NameError: name 'iris' is not defined
Your problem was different, but I ended up here while googling this issue, and your code has a similar problem too.
At least on Windows, matplotlib (which tree.plot_tree uses to draw the tree) will not show anything unless you call plt.show() somewhere.
from sklearn import tree
import matplotlib.pyplot as plt
sometree = ....
tree.plot_tree(sometree)
plt.show() # mandatory on Windows
iris is not defined because you never assigned it. Use this line to plot instead:
tree.plot_tree(clf.fit(X, y))
You already assigned the return values of load_iris() to X and y, so you can use them directly.
Additionally, if you render the tree with tree.export_graphviz rather than tree.plot_tree, make sure the Graphviz bin folder is on PATH.

Tsfresh takes longer than the computer can handle

I am trying to use the tsfresh feature extraction library in Python 3.7.1 with the efficient parameters on a test file (24 rows x 366 columns).
It never finishes and just keeps processing. I also tried to run the same library on a different laptop with Python 2.7.16 installed, but tsfresh did not work there.
What should I do?
# Import Data from CSV file
#import csv
#with open('T7.csv') as T7:
# reader = csv.reader(T7)
# try:
# for row in reader:
# print(row)
# finally:
# T7.close()
from matplotlib import pyplot as plt
from matplotlib import style
import numpy as np
import pandas as pd
style.use('ggplot')
#from tsfresh import extract_features #as tsfreshobj
#from tsfresh import MinimalFeatureExtractionSettings
from tsfresh.feature_extraction import extract_features, EfficientFCParameters
#X = extract_features(df, column_id='id', column_sort='time')
y=pd.read_csv ('1.csv')#, skiprows=1)
#y=np.loadtxt('T7_2.csv')#,
#unpack=True,
# delimiter=',')
#y1=tsfreshobj.feature_extraction.extraction.generate_data_chunk_format(y)
#y2=tsfreshobj.feature_extraction.feature_calculators.absolute_sum_of_changes(y1)
#y1=extract_features(y, feature_extraction_settings=MinimalFeatureExtractionSettings)
print (y)
# from tsfresh.feature_extraction import MinimalFeatureExtractionSettings
y1=extract_features(y, column_id='time', default_fc_parameters=EfficientFCParameters())#, column_sort='time')
print (y)
print (y1)
plt.plot(y1)
print (y)
plt.title ('some numbers')
plt.ylabel('Y axis')
plt.xlabel ('X axis')
plt.show()
Have you tried MinimalFCParameters to check whether it works at all? With these, the extraction should finish in a matter of seconds.
One problem could be that you need to wrap your code in an if __name__ == "__main__": block; otherwise the multiprocessing library will have problems.
If this does not help, you could use any of the techniques I described elsewhere to parallelize the tsfresh computation.
The issue with installing tsfresh on your other machine is unrelated to tsfresh itself: the error message shows that you did not have an internet connection when calling pip install.

Cannot find reference to Python package (plt.cm.py)

I have a small issue with running code from a tutorial that isn't working as it should. It's not a syntax problem for sure. I'm working with scikit-learn and matplotlib, and I'm getting a warning message in my IDE "Cannot find reference 'gray_r' in 'cm.py'..." All my packages are installed properly (via pip) and have worked for sample programs except this.
Any advice?
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn import svm
digits = datasets.load_digits()
print(digits.data)
print(digits.target)
print(digits.images[0])
clf = svm.SVC(gamma=0.001, C=100)
print(len(digits.data))
x, y = digits.data[:-1], digits.target[:-1]
clf.fit(x,y)
print('Prediction:', clf.predict(digits.data[-1])
plt.imshow(digits.images[-1], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()
Well, for starters, you're missing a closing parenthesis on your last print statement: print('Prediction:', clf.predict(digits.data[-1])). Other than that, this code runs on my computer with only a deprecation warning. What does the traceback say?
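A likely source of that deprecation warning is that predict is given a single 1-D sample; newer scikit-learn releases reject such input and expect a 2-D array with one row per sample. The sketch below assumes a recent scikit-learn and reshapes the last digit before predicting; the variable name last_sample is made up for this example.

```python
from sklearn import datasets, svm

digits = datasets.load_digits()

# Train on everything except the last image, as in the question.
clf = svm.SVC(gamma=0.001, C=100)
x, y = digits.data[:-1], digits.target[:-1]
clf.fit(x, y)

# predict expects shape (n_samples, n_features), so turn the one sample into one row.
last_sample = digits.data[-1].reshape(1, -1)
prediction = clf.predict(last_sample)
print('Prediction:', prediction)
```

As for the IDE message, matplotlib appears to register its colormaps on plt.cm dynamically, so static inspection may not see gray_r even though it exists at runtime. Passing the name as a string, cmap='gray_r', is an equivalent spelling that sidesteps the attribute lookup.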
