I am using a linear regression model to predict my data.
Original data:
When I plot with seaborn, I can see the line cutting through all the data points.
Plot using seaborn.lmplot:
But when I use the train_test_split function:
The coefficient and intercept are as below:
Weight = [0.20504568]
Intercept = -1.0383656275693958
But the graph is totally off.
Graph using train_test_split:
How can I fix this?
My bad... I was using the wrong X value for the line graph.
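For anyone hitting the same issue, a minimal sketch of the fix on hypothetical synthetic data: pass the same X to both plot() and predict(), so the line is drawn against the points the model was fitted on.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical data, roughly matching the coefficients in the question.
X = np.linspace(0, 10, 50).reshape(-1, 1)
y = 0.205 * X.ravel() - 1.04 + np.random.normal(0, 0.2, 50)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)

plt.scatter(X, y, color='blue')
# Use the same (sorted) X for the line and for predict(), not a different array.
plt.plot(X, model.predict(X), color='red')
plt.show()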
This image includes the data I'm using for an ANOVA test.
Using this data, how can I test certain parts or comparisons for normality, in Python 3?
You could use a Q-Q plot:
from statsmodels.graphics.gofplots import qqplot
from matplotlib import pyplot

column = data["your_column"]  # the column or subset of your dataset you want to test
# q-q plot against a fitted normal distribution
qqplot(column, line='s')
pyplot.show()
Basically, neither the upper end nor the lower end of the Q-Q plot should deviate from the straight line; if they do, the sample departs from normality in the tails.
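A fuller, runnable sketch on synthetic data (the group and column names are hypothetical): draw the Q-Q plot for one comparison and back it up with a Shapiro-Wilk test.
import numpy as np
import pandas as pd
from scipy.stats import shapiro
from statsmodels.graphics.gofplots import qqplot
from matplotlib import pyplot

# Hypothetical data: two groups of 50 values, as in an ANOVA setting.
df = pd.DataFrame({'group': ['a'] * 50 + ['b'] * 50,
                   'value': np.random.normal(0, 1, 100)})

subset = df.loc[df['group'] == 'a', 'value'].values  # one comparison to test
qqplot(subset, line='s')
pyplot.show()
print(shapiro(subset))  # a small p-value is evidence against normality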
I need to run a simple linear regression on a large (30 GB) dataset that can't be loaded into memory. The features are mostly categorical. I've already built a prototype in scikit-learn that works just fine, but it only works on a subsample of the data.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import linear_model
data = pd.read_csv('datafile.csv', nrows=5_000_000)  # nrows must be an integer
""" data['categorical_feature'] it's a text field, which has categories comma separated. Example of structure is shown below.
categorical_feature
1 1671,1293
2 1293
3 1233,1671
"""
cat_vec = CountVectorizer(min_df=2)
m_cat = cat_vec.fit_transform(data['categorical_feature'])
lm = linear_model.Ridge()
lm.fit(m_cat, data['target'])
How would I write this in TensorFlow? I've looked around and didn't find much that can replicate the behaviour of CountVectorizer in scikit-learn.
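One possible direction, as a sketch rather than a definitive answer: in TensorFlow 2.x, tf.keras' TextVectorization layer with output_mode='count' behaves much like CountVectorizer, and pandas' chunksize lets you stream the CSV so the whole 30 GB file never has to fit in memory. The file and column names come from your snippet; the sample size, chunk size, and optimizer settings are assumptions.
import pandas as pd
import tensorflow as tf

# Build the category vocabulary from a manageable sample of the file.
sample = pd.read_csv('datafile.csv', nrows=1_000_000)

vectorizer = tf.keras.layers.TextVectorization(
    standardize=None,                              # keep the commas intact
    split=lambda s: tf.strings.split(s, sep=','),  # split the comma-separated categories
    output_mode='count')                           # token counts, like CountVectorizer
vectorizer.adapt(sample['categorical_feature'].astype(str).values)

# A single Dense(1) layer on the count matrix is a linear regression;
# a kernel_regularizer could mimic Ridge's L2 penalty.
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer='sgd', loss='mse')

# Stream the file chunk by chunk; each chunk is one incremental training pass.
for chunk in pd.read_csv('datafile.csv', chunksize=100_000):
    X = vectorizer(chunk['categorical_feature'].astype(str).values)
    model.fit(X, chunk['target'].values, epochs=1, verbose=0)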
Well, I should mention that this is the very first time I'm trying audio signal processing in Python. I have an audio dataset, and I am extracting pitch features using the Aubio library and MFCC features using the python_speech_features library. The thing is, for a single audio file, I get a vector of around 84 values for the pitch and a 12-value feature vector for the MFCC.
Image of extracted pitch feature vector
So how do I save this many values in a single CSV file? I have around 700 audio files separated into different directories by emotion. Should I take the mean of all of these values and save them per audio file in a CSV, like this:
Also, how would I use these values for classification afterwards?
Any help would be much appreciated. Thanks.
There is not a simple answer to your question.
I understand that for each data sample you extract the same set of features, correct?
I suppose you work within a for loop, something like this:
import numpy as np

all_features = []
for path in path_list:
    x = open_file(path)             # a hypothetical function to open your files
    features = extract_features(x)  # a hypothetical function to extract features
    all_features.append(features)
If your code looks like my simple example, you have created a list all_features whose element all_features[i] contains the features extracted from sample i. I also suppose that each extracted feature vector is a numpy vector; if it is not, you should convert it (something like features = np.array(features)).
OK, now you are ready to create a dataset:
data = np.vstack(all_features)
The vertical stack np.vstack generates a matrix of shape (n_samples, n_features). Warning: all feature vectors must have the same shape!
Now you want to save the dataset. There is an ocean of possibilities; these are my three favourite options:
1) using pandas to create a csv file:
import pandas as pd
df = pd.DataFrame(data)
df.to_csv(filename + '.csv', index=False, header=header)  # header is a list of strings naming the CSV columns
# see https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html
2) dump memory into a pickle file:
import six.moves.cPickle as pickle
with open(filename + '.pkl', 'wb') as f:
    pickle.dump(data, f)
3) save as a numpy file:
np.save(filename+'.npy', data)
Concerning the classification problem: if you want to use a supervised method (MLP, RF, SVM, KNN, ...) you need class labels (the ground truth), i.e. a vector whose length equals the number of samples and which assigns an integer to each sample (for example 0/1 in a binary classification, or 0,1,2,3 for a 4-class classification). This strongly depends on what you want, on the goal of your training.
Once you have the data matrix and the label vector, any machine-learning method will be able to classify, if you have enough samples. To that end, I suggest you use some augmentation criteria; to get an idea, have a look at this paper, it could give you some ideas.
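As a minimal sketch of that last step (emotion_of() and the class names are hypothetical; data and path_list come from the snippets above):
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

emotions = ['angry', 'happy', 'sad', 'neutral']  # hypothetical class names
# Map each sample's emotion directory to an integer label (emotion_of is hypothetical).
labels = np.array([emotions.index(emotion_of(path)) for path in path_list])

X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=0.2)
clf = SVC()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # fraction of test samples classified correctly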
Hoping I have helped you, good work!
Python has a built-in csv module.
Its documentation gives a simple example of how to use a writer to write rows to your CSV.
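Along those lines, a minimal sketch (the file name and feature values are hypothetical):
import csv

rows = [('clip1.wav', 0.12, 0.34), ('clip2.wav', 0.56, 0.78)]  # hypothetical per-file features

with open('features.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['file', 'feat_1', 'feat_2'])  # header row
    writer.writerows(rows)                         # one row per audio file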
I am learning Python and I have a side project to learn to display data using the matplotlib.pyplot module. Here is an example that displays the data using dates[] and prices[]. Does anyone know why we need line 5 and line 6? I am confused about why this step is needed to get the graph displayed.
from sklearn import linear_model
import matplotlib.pyplot as plt
import numpy

def showgraph(dates, prices):
    dates = numpy.reshape(dates, (len(dates), 1))    # line 5
    prices = numpy.reshape(prices, (len(prices), 1)) # line 6
    linear_mod = linear_model.LinearRegression()
    linear_mod.fit(dates, prices)
    plt.scatter(dates, prices, color='yellow')
    plt.plot(dates, linear_mod.predict(dates), color='green')
    plt.show()
Try the following in a Python session to check the backend:
import matplotlib
print(matplotlib.get_backend())
If it shows 'agg', it is a non-interactive backend and the plot window won't show, but plt.savefig still works.
To show the plot, you need to switch to TkAgg or Qt4Agg.
You need to edit the backend in the matplotlibrc file. To print its location, do the following:
import matplotlib
print(matplotlib.matplotlib_fname())
more about matplotlibrc
Lines 5 and 6 transform what I'm assuming are row vectors (I'm not sure how dates and prices are encoded before this transformation) into column vectors, so now you have vectors that look like this:
[0,
1,
2,
3]
which is the form that linear_model.LinearRegression().fit() expects. The reshaping is not necessary for plotting, under the assumption that dates and prices are row vectors.
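A quick illustration of what the reshape in lines 5 and 6 does, on hypothetical data:
import numpy

dates = [0, 1, 2, 3]                           # a flat list, shape (4,)
dates = numpy.reshape(dates, (len(dates), 1))  # a column vector, shape (4, 1)
print(dates)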
My approach is exactly like yours, but even without lines 5 and 6 the display is correct. I think those lines are unnecessary; it seems fit() accepts the input data even in row format.
I have two data sets, training and test. They have labels "1" and "0". I need to evaluate these data sets using the OneClassSVM algorithm with an "rbf" kernel in scikit-learn. I loaded the training data set, but I have no idea how to evaluate it against the test data set. Below is my code:
from sklearn import svm
import numpy as np
input_file_data = "/home/anuradha/TrainData.csv"
dataset = np.loadtxt(input_file_data, delimiter=",")
X = dataset[:,0:4]
y = dataset[:,4]
estimator= svm.OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1)
Can someone please help me solve this problem?
It's as simple as adding the following two lines of code at the end of your script:
estimator.fit(X_train)
y_pred_test = estimator.predict(X_test)
The first line tells the SVM which training data to use, and the second one makes predictions on the test set (be sure to load both datasets and to change variable names accordingly).
Here is a complete example of how to use OneClassSVM, and here is the class reference.
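Putting it together under the question's setup, a minimal sketch (the test file path is an assumption):
from sklearn import svm
import numpy as np

train = np.loadtxt("/home/anuradha/TrainData.csv", delimiter=",")
test = np.loadtxt("/home/anuradha/TestData.csv", delimiter=",")  # hypothetical path

X_train, X_test = train[:, 0:4], test[:, 0:4]
y_test = test[:, 4]

estimator = svm.OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1)
estimator.fit(X_train)
y_pred_test = estimator.predict(X_test)

# OneClassSVM predicts +1 for inliers and -1 for outliers, so map the
# 0/1 labels onto that convention before comparing.
accuracy = np.mean(y_pred_test == np.where(y_test == 1, 1, -1))
print(accuracy)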