Tensorflow OneHotEncoder on large file for Linear Regression - python

I need to run a simple linear regression on a large dataset (30 GB) that can't be loaded into memory. The features are mostly categorical data. I've already built a prototype in scikit-learn that works just fine, but it only works on a subsample of the data.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import linear_model
data = pd.read_csv('datafile.csv', nrows=5e6)
""" data['categorical_feature'] is a text field that holds comma-separated categories. An example of the structure is shown below.
categorical_feature
1 1671,1293
2 1293
3 1233,1671
"""
cat_vec = CountVectorizer(min_df=2)
m_cat = cat_vec.fit_transform(data['categorical_feature'])
lm = linear_model.Ridge()
lm.fit(m_cat, data['target'])
How would I write this in TensorFlow? I've looked around but didn't find much that can replicate the behaviour of CountVectorizer in scikit-learn.
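Not a TensorFlow answer, but one possible way to get past the memory limit while staying in scikit-learn is to read the file in chunks and train incrementally. The sketch below swaps in HashingVectorizer (stateless, so it needs no pass over the full data) for CountVectorizer, and SGDRegressor with an L2 penalty for Ridge. These substitutions are assumptions, not an exact equivalent: hashing cannot apply min_df=2, and SGD only approximates the Ridge solution. The in-memory CSV and the helper name train_out_of_core are made up for illustration.

```python
import io

import pandas as pd
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDRegressor

# Tiny in-memory stand-in for datafile.csv (hypothetical values).
csv_text = '''categorical_feature,target
"1671,1293",3.0
"1293",1.0
"1233,1671",4.0
"1233",2.0
'''

# Stateless vectorizer: each category token is hashed to a column index.
vec = HashingVectorizer(n_features=2**10, alternate_sign=False, norm=None)
# L2-penalised linear model that supports incremental training.
model = SGDRegressor(penalty="l2", alpha=1e-4, random_state=0)


def train_out_of_core(source, chunksize):
    """Hypothetical helper: fit the model one chunk at a time."""
    n_chunks = 0
    # dtype=str stops pandas from parsing a lone "1293" as an integer.
    reader = pd.read_csv(source, chunksize=chunksize,
                         dtype={"categorical_feature": str})
    for chunk in reader:
        X = vec.transform(chunk["categorical_feature"])
        model.partial_fit(X, chunk["target"])
        n_chunks += 1
    return n_chunks


n_chunks = train_out_of_core(io.StringIO(csv_text), chunksize=2)
```

For the real file you would pass 'datafile.csv' and a much larger chunksize, and probably loop over the file several times, since a single pass of SGD may not converge.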

Related

merging train and test datasets into one using tensorflow

I am working with the classic Titanic dataset and trying to apply NNs. My data comes already split into train and dev sets. However, I want to merge the datasets together for several reasons (for example, to do my own splitting).
Is there a way I can merge both datasets?
I have looked around and only found information about how to split a dataset, but not how to merge the parts back together.
Any help?
An MWE is provided below!
from __future__ import absolute_import,division,print_function,unicode_literals
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
from IPython.display import clear_output
from six.moves import urllib
import tensorflow.compat.v2.feature_column as fc
import tensorflow as tf
import seaborn as sns
# URL address of data
TRAIN_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/train.csv"
TEST_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/eval.csv"
# Downloading data
train_file_path = tf.keras.utils.get_file("train.csv", TRAIN_DATA_URL)
test_file_path = tf.keras.utils.get_file("eval.csv", TEST_DATA_URL)
# Reading data
data_train = pd.read_csv(train_file_path)
data_test = pd.read_csv(test_file_path)
MY_DATA= MERGE HERE????? # merge(data_train,data_test)??
Assuming data_train and data_test have the same number of columns and the same column names, just do:
merged_df= pd.concat([data_train, data_test], axis=0)
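As a small illustration of the concatenation and of re-splitting afterwards, here is a sketch with two made-up frames standing in for the Titanic files. The column values and the 60/40 split are assumptions chosen for the example. Passing ignore_index=True is optional, but it avoids duplicate row labels in the merged frame.

```python
import pandas as pd

# Hypothetical stand-ins for data_train and data_test.
data_train = pd.DataFrame({"age": [22, 38, 26], "survived": [0, 1, 1]})
data_test = pd.DataFrame({"age": [35, 54], "survived": [1, 0]})

# Stack the rows; ignore_index=True renumbers them 0..n-1.
merged_df = pd.concat([data_train, data_test], axis=0, ignore_index=True)

# Make your own split: sample 40% of the rows for testing, keep the rest.
new_test = merged_df.sample(frac=0.4, random_state=0)
new_train = merged_df.drop(new_test.index)
```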

How to use scikit-learn linear regression without using split?

I got 2 CSV named train.csv and test.csv.
Both files have the same structure, and I want to use train.csv as train data and test.csv as test data.
The thing is, I can't find anywhere how to use scikit-learn linear regression without splitting. Every tutorial and piece of documentation I find uses the function train_test_split(), but if I understand correctly, that is used to split one file (let's say data.csv) into both train and test data.
Is it even possible? If not, what alternative can I use?
If you already have separate train and test data, you don't need train_test_split(). Define X_train and y_train directly.
X_train is the features, excluding the target variable:
# Pseudocode
X_train = train.drop(target, axis=1)
y_train is the target variable:
# Pseudocode
y_train = train[target]
Then fit the model:
from sklearn.linear_model import LinearRegression
reg = LinearRegression().fit(X_train, y_train)
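To make the whole flow concrete, here is a minimal sketch that fits on one frame and evaluates on the other. The two small frames stand in for pd.read_csv('train.csv') and pd.read_csv('test.csv'); the column names x1, x2, y and the values are invented for the example.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Hypothetical stand-ins for train.csv and test.csv (here y = 2*x1 + x2 + 1).
train = pd.DataFrame({"x1": [1.0, 2.0, 3.0, 4.0],
                      "x2": [0.0, 1.0, 0.0, 1.0],
                      "y": [3.0, 6.0, 7.0, 10.0]})
test = pd.DataFrame({"x1": [5.0, 6.0],
                     "x2": [0.0, 1.0],
                     "y": [11.0, 14.0]})
target = "y"

# Separate features and target in each file; no splitting needed.
X_train, y_train = train.drop(columns=[target]), train[target]
X_test, y_test = test.drop(columns=[target]), test[target]

reg = LinearRegression().fit(X_train, y_train)

# Evaluate only on the held-out file.
y_pred = reg.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
```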

Why does the line not cut across the data?

I am using a linear regression model to predict my data.
(Image: original data)
When I plot it with seaborn, I can see the line cutting through the data points.
(Image: plot using seaborn.lmplot)
But when I use the train_test_split function, the coefficient and intercept are as follows:
Weight = [0.20504568]
Intercept = -1.0383656275693958
But the graph is totally off.
(Image: graph using train_test_split)
How can I fix this?
My bad... I was using the wrong X values for the line graph.
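For anyone hitting the same thing, the sketch below shows one way to keep the line consistent with the points: take the x-values for the line from the same array you scatter, and get the y-values by calling predict on exactly those x-values. The synthetic data (slope 0.2, intercept -1) is an assumption meant only to resemble the numbers in the question.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data, assumed for illustration: y is roughly 0.2 * x - 1.
rng = np.random.default_rng(0)
X = rng.uniform(0, 50, size=(100, 1))
y = 0.2 * X[:, 0] - 1.0 + rng.normal(0, 0.5, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
reg = LinearRegression().fit(X_train, y_train)

# Build the line from the SAME x-values that are plotted as points.
order = np.argsort(X_test[:, 0])
x_line = X_test[order]          # sorted so the line is drawn left to right
y_line = reg.predict(x_line)    # y-values computed from those x-values

# With matplotlib installed, the plot would be:
#   plt.scatter(X_test[:, 0], y_test)
#   plt.plot(x_line[:, 0], y_line)
```

Mixing arrays, for example plotting X against predictions made from X_test, gives a line that does not pass through the data.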

Sign Language Glove project from GitHub: Help in understanding code

I'm very new to Python and I'm trying to replicate this Sign Language Glove project here with my own hardware as a first exercise in machine learning. I can already write data from my accelerometers to CSV files, but I can't understand the rest of the process. The file named 'modeling' confuses me. Can anyone help me understand what processes are happening?
import numpy as np
from sklearn import svm
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
import pandas as pd
df= pd.read_csv("final.csv") ##This I understand. I've successfully created csv files with data
#########################################################################
## These below, I do not.
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size = 0.2)
train_features = train[['F1','F2','F3','F4','F5','X','Y','Z','C1','C2']]
train_label = train.cl
test_features = test[['F1','F2','F3','F4','F5','X','Y','Z','C1','C2']]
test_label = test.cl
## SVM
model = svm.SVC(kernel='linear', gamma=1, C=1)
model.fit(train_features, train_label)
model.score(train_features, train_label)
predicted_svm = model.predict(test_features)
print("svm")
print(accuracy_score(test_label, predicted_svm))
cn = confusion_matrix(test_label, predicted_svm)
Welcome to the community. That looks like a nice way to start off.
As @hilverts_drinking_problem suggested, I would recommend looking at the sklearn documentation. But here's a quick explanation of what's going on.
The train_test_split function randomly splits the dataset into two datasets, one for training and one for testing. test_size = 0.2 means 20% of the data goes into the test set and the remaining 80% into the training set.
The next two lines separate the inputs (features) from the outputs (labels, the cl column) for training. The two lines after that do the same for the test set.
Then you create an SVM object, train the model using model.fit, and get its accuracy on the training data using .score. You then use the model to predict labels for the test set. Finally, you print the accuracy score for the test set and compute its confusion matrix.
If you need me to clarify/detail something, let me know!
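Here is the same sequence of steps as a small self-contained sketch. The data is synthetic (two made-up classes "a" and "b" with only the columns F1 and X), so the numbers it produces say nothing about the glove project itself.

```python
import numpy as np
import pandas as pd
from sklearn import svm
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for final.csv: two well-separated gesture classes.
rng = np.random.default_rng(0)
n = 100
df = pd.DataFrame({
    "F1": np.concatenate([rng.normal(0, 1, n), rng.normal(5, 1, n)]),
    "X": np.concatenate([rng.normal(0, 1, n), rng.normal(5, 1, n)]),
    "cl": ["a"] * n + ["b"] * n,
})

# 1. Random 80/20 split into training and test rows.
train, test = train_test_split(df, test_size=0.2, random_state=0)

# 2. Separate inputs (features) from outputs (the cl label column).
features = ["F1", "X"]
train_features, train_label = train[features], train["cl"]
test_features, test_label = test[features], test["cl"]

# 3. Train a linear SVM on the training rows only.
model = svm.SVC(kernel="linear", C=1)
model.fit(train_features, train_label)

# 4. Predict on unseen rows and compare with the true labels.
predicted_svm = model.predict(test_features)
acc = accuracy_score(test_label, predicted_svm)
cn = confusion_matrix(test_label, predicted_svm)  # rows: true, cols: predicted
```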

OneClassSVM scikit learn

I have two data sets, training and test. They have labels "1" and "0". I need to evaluate these data sets using the OneClassSVM algorithm with the "rbf" kernel in scikit-learn. I loaded the training data set, but I have no idea how to evaluate the model on the test data set. Below is my code:
from sklearn import svm
import numpy as np
input_file_data = "/home/anuradha/TrainData.csv"
dataset = np.loadtxt(input_file_data, delimiter=",")
X = dataset[:,0:4]
y = dataset[:,4]
estimator= svm.OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1)
Can someone please help me solve this problem?
It's as simple as adding the following two lines of code at the end of your script:
estimator.fit(X_train)
y_pred_test = estimator.predict(X_test)
The first line tells the SVM which training data to use, and the second one makes predictions on the test set (be sure to load both datasets and to change the variable names accordingly).
The scikit-learn documentation has a complete example of how to use OneClassSVM, as well as the class reference.
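One detail to watch when scoring against "1"/"0" labels: OneClassSVM.predict returns +1 for inliers and -1 for outliers, so the predictions need to be mapped before comparing. The sketch below assumes that label 1 marks the normal class and that the model is trained on normal samples only; the data is synthetic and stands in for the two CSV files.

```python
import numpy as np
from sklearn import svm
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the training file: normal samples only (assumption).
X_train = rng.normal(0, 1, size=(200, 4))

# Synthetic stand-in for the test file: 50 normal rows (label 1)
# followed by 50 far-away outliers (label 0).
X_test = np.vstack([rng.normal(0, 1, size=(50, 4)),
                    rng.normal(8, 1, size=(50, 4))])
y_test = np.array([1] * 50 + [0] * 50)

estimator = svm.OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1)
estimator.fit(X_train)              # labels are not used for fitting

raw_pred = estimator.predict(X_test)        # +1 = inlier, -1 = outlier
y_pred_test = np.where(raw_pred == 1, 1, 0) # map to the 1/0 label scheme

acc = accuracy_score(y_test, y_pred_test)
```

If in your data 0 is the normal class instead, flip the mapping in np.where.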
