How to do linear regression on a string column in Python?

I have a CSV file with 2 columns. One column contains toxic comments as strings; the other column holds float toxicity values from 0 to 1 (comments are more toxic when the toxicity value is close to 1).
I want to do linear regression to correctly predict the toxicity values.
To do that, I first converted the "comment" (string) column to integers like this:
train['comment']= pd.to_numeric(train['comment'], errors='coerce').fillna(0).astype(np.int64)
Then I wrote this code for the linear regression:
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

linX = train.iloc[:, 0].values.reshape(-1, 1)   # the converted "comment" column
linY = train.iloc[:, 1].values.reshape(-1, 1)   # the toxicity scores
lr = LinearRegression()
lr.fit(linX, linY)
Y_pred = lr.predict(linX)
plt.scatter(linX, linY)
plt.plot(linX, Y_pred, color='red')
That worked, but I don't think I did it right, because the resulting regression didn't seem correct to me.
I couldn't solve the problem. My questions are:
Is my linear regression code right for this problem?
Should I separate the 0 values out of the "toxicity" column?

I'm not sure that turning the strings into numeric values with the code below will return the results you are looking for.
pd.to_numeric(train['comment'], errors='coerce')
This code only changes the variable type; the string comments themselves cannot be converted into integers. The errors='coerce' parameter turns the strings into NaN values, and those NaN values are then converted into zeros by fillna, so the regression ends up being fit on a column of zeros rather than on anything derived from the text.
To solve text classification (or regression) problems with machine learning techniques, you will need to preprocess the text using techniques such as TF-IDF.
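As a rough sketch of what that could look like (the file name and column names here are placeholders, and Ridge is used as a regularized stand-in for plain linear regression):
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

train = pd.read_csv('train.csv')  # assumed columns: 'comment', 'toxicity'

X_train, X_test, y_train, y_test = train_test_split(
    train['comment'].fillna(''), train['toxicity'], test_size=0.2, random_state=0)

vectorizer = TfidfVectorizer(max_features=20000, stop_words='english')
X_train_tfidf = vectorizer.fit_transform(X_train)  # fit the vocabulary on the training split only
X_test_tfidf = vectorizer.transform(X_test)

model = Ridge()  # regularized linear regression on the sparse TF-IDF matrix
model.fit(X_train_tfidf, y_train)
print(model.score(X_test_tfidf, y_test))  # R^2 on the held-out split
The key point is that the text is turned into numeric features by the vectorizer rather than by pd.to_numeric, so the regression actually sees information from the comments.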

Related

How to deal with really small (order of -322) floating values in pandas dataframe?

I have a pandas dataframe with feature values that are really, really small, on the order of 1e-322. I am trying to standardize the features but I am getting:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
A few values from the dataframe are as follows:
3.962406e-321
3.310240e-322
3.962406e-321
3.310240e-322
3.962406e-321
3.310240e-322
3.962406e-321
3.310240e-322
3.962406e-321
3.310240e-322
I am assuming that I am dealing with a value underflow problem. How can I deal with it?
This is with Python 3.6 and a pandas DataFrame.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
The values in the dataframe should simply be standardized, but I get this error, presumably due to value underflow.
Multiply them.
You're right: your values are too small for float64 arithmetic to handle reliably. The smallest normal np.float64 value is ~2.22e-308; below that you are in subnormal territory, where precision degrades quickly. You can handle somewhat smaller values by using more obscure types like np.longdouble, but these have their limits too and can be system-dependent.
As some of the comments point out, most plausible use cases don't require values this small. But if yours does, one simple way to get around the float boundaries is to multiply all of your values by a consistent integer that brings them within the acceptable float range (perhaps by 10^320). You're not losing any information, just dropping a long sequence of zeroes.
Note: this only works if you're not simultaneously storing numbers too huge to multiply without breaking the float limits in the other direction. But this seems unlikely.
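A minimal sketch of that workaround (the column name is made up; note that a literal 10**320 overflows float64 itself, so the factor is applied in two steps of 1e160):
import pandas as pd

df = pd.DataFrame({'feature': [3.962406e-321, 3.310240e-322]})

# Rescale by a fixed power of ten so the values land inside the normal float64 range.
# 1e320 cannot be represented as a float64, so multiply twice by 1e160 instead.
df_scaled = df * 1e160 * 1e160

print(df_scaled)  # roughly 0.396 and 0.033: same relative structure, safely scaled
Keep in mind that subnormal inputs like these carry only a few bits of precision to begin with, so the rescaled values cannot be more precise than the originals.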
Store the log of the numbers, and reverse with exp when needed later. If you then need to shift them, the shift is additive instead of multiplicative. Working in log-space helps avoid machine zero, though you will still have issues to deal with when operating on the log values, e.g. the log of a sum is not the sum of the logs.
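A small illustration of the log-space idea (assuming strictly positive values):
import numpy as np

values = np.array([3.962406e-321, 3.310240e-322])

log_values = np.log(values)     # about -737.8 and -740.3: easy magnitudes to work with
shifted = log_values + 700.0    # a rescaling is now an additive shift

recovered = np.exp(log_values)  # back to the original tiny magnitudes when needed
print(log_values, recovered)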
You should try normalizing your data to bring it into a reasonable range of values. Here is some sample code:
from sklearn import preprocessing
x = df.values #returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
You are receiving NaN because the numbers fell outside the range the scaler can handle.
EDIT 1:
Your error says that your dataset contains NaN values and cannot be converted to the float64 type. Are you sure there are no empty values? If so, try to drop those rows, for example with DataFrame.dropna(), as sketched below.
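A minimal, self-contained sketch of that clean-up step (the column names are made up):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, np.nan]})

cleaned = df.dropna()            # drop every row that contains a missing value
# cleaned = df.fillna(df.mean()) # or fill the holes instead of dropping rows
print(cleaned)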

Python SKLearn fit Value Error Input

I'm trying to fit and transform some data to use later in a classifier model, but it always gives me an error and I don't understand why.
Please, can somebody help me?
##stores the function Pipeline with parameters decided above
inputPipe = getPreProcPipe(normIn=normIn, pca=pca, pcaN=pcaN, whiten=whiten)
print inputPipe
print
#print devData[classTrainFeatures].values.astype('float32')
print devData[classTrainFeatures].shape
print type(devData[classTrainFeatures].values)
##fit pipeline to inputs features and types
inputPipe.fit(devData[classTrainFeatures].values.astype('float32'))
##transform inputs X
X_class = inputPipe.transform(devData[classTrainFeatures].values.astype(double))
## Output Y, i.e, 0 or 1 as it is the target
Y_class = devData['gen_target'].values.astype('int')
#print Y_class
Output:
Pipeline(memory=None,
steps=[('pca', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
svd_solver='auto', tol=0.0, whiten=False)), ('normPCA', StandardScaler(copy=True, with_mean=True, with_std=True))])
(32583, 2)
<type 'numpy.ndarray'>
Error at the end of the code:
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
(The full code and the two parts of the error traceback were attached as screenshots.)
You have to check the data you use (not the code) for NaN (not-a-number) values; NumPy has the function np.isnan() for this (https://docs.scipy.org/doc/numpy/reference/generated/numpy.isnan.html), see also How to get the indices list of all NaN value in numpy array?
Also check for infinite values with np.isinf().
This Kaggle kernel contains example code for filling NaNs and Infs in datasets that are then used in classifiers: https://www.kaggle.com/mknorps/titanic-with-decision-trees . See also https://datascience.stackexchange.com/questions/25924/difference-between-interpolate-and-fillna-in-pandas?rq=1 for interpolate().
Dropping the rows that contain NaNs or Infs can be done like this:
devData = devData.replace([np.inf, -np.inf], np.nan)
indx = devData[classTrainFeatures].index[devData[classTrainFeatures].isna().any(axis=1)]
devData = devData.drop(indx).copy()
devData = devData.reset_index(drop=True)
(convert Infs to NaN, get the index labels of the rows containing NaN in the training features, drop those rows, and reset the index of the dataframe)
I see 3 possibilities for this kind of error:
You may have Infs in your data. In that case you may need to remove those samples; to find the Infs, try df.index[np.isinf(df).any(axis=1)].
You may have NaNs in your data. Check it using df.index[np.isnan(df).any(axis=1)]. In that case you can replace the NaNs with the mean value of the column, e.g. df.fillna(df.mean()).dropna(axis=1, how='all').
Finally, and most probably, you have a constant or almost constant feature that, once it is centred and divided by its standard deviation, gives you NaNs or Infs. In that case you should drop that feature using VarianceThreshold, as sketched below.
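A short sketch of that last check, with made-up data:
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[1.0, 0.5, 7.0],
              [1.0, 0.3, 8.0],
              [1.0, 0.9, 6.5]])   # the first column is constant

selector = VarianceThreshold()    # the default threshold of 0.0 drops zero-variance features
X_reduced = selector.fit_transform(X)

print(selector.get_support())     # [False  True  True]
print(X_reduced.shape)            # (3, 2)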

conversion of pandas dataframe to h2o frame efficiently

I have a pandas dataframe with encoding latin-1 that is delimited by ;. The dataframe is very large, almost 350000 x 3800. I wanted to use sklearn initially, but my dataframe has missing values (NaN values), so I could not use sklearn's random forests or GBM. So I had to use H2O's distributed random forests for training on the dataset. The main problem is that the dataframe is not converted efficiently when I do h2o.H2OFrame(data). I checked for the possibility of providing encoding options, but there is nothing in the documentation.
Does anyone have an idea about this? Any leads would help. I also want to know whether there are other libraries like H2O that can handle NaN values efficiently. I know that we can impute the columns, but I should not do that in my dataset because my columns are values from different sensors; if a value is missing, it means that the sensor is not present. I can only use Python.
import h2o
import pandas as pd
df = pd.DataFrame({'col1': [1,1,2], 'col2': ['César Chávez Day', 'César Chávez Day', 'César Chávez Day']})
hf = h2o.H2OFrame(df)
Since the problem that you are facing is due to the high number of NaNs in the dataset, this should be handled first. There are two ways to do so (both sketched below).
Replace NaN with a single, obviously out-of-range value. For example, if a feature varies between 0 and 1, replace all NaNs with -1 for that feature.
Use the Imputer class (SimpleImputer in current scikit-learn) to handle the NaN values. This will replace each NaN with the mean, median, or mode of that feature.
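A rough sketch of both options (the column names are invented; SimpleImputer is used here since it replaces the older Imputer class in current scikit-learn):
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({'sensor_1': [0.2, np.nan, 0.7],
                   'sensor_2': [1.0, 0.4, np.nan]})

# Option 1: mark "sensor not present" with a value outside the normal 0-1 range.
df_flagged = df.fillna(-1)

# Option 2: impute missing readings with the column mean.
imputer = SimpleImputer(strategy='mean')
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(df_flagged)
print(df_imputed)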
If there are a large number of missing values in your data and you want to increase the efficiency of the conversion, I would recommend explicitly specifying the column types and NA strings instead of letting H2O interpret them. You can pass a list of strings to be interpreted as NAs, and a dictionary specifying the column types, to the H2OFrame() method.
This will also allow you to create custom labels for the sensors that are not present, instead of having a generic "not available" (impute the NaN values with a custom string in pandas).
import h2o
col_dtypes = {'col1_name':col1_type, 'col2_name':col2_type}
na_list = ['NA', 'none', 'nan', 'etc']
hf = h2o.H2OFrame(df, column_types=col_dtypes, na_strings=na_list)
For more information - http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/_modules/h2o/frame.html#H2OFrame
Edit: @ErinLeDell's suggestion to use h2o.import_file() directly, with the column dtypes and NA strings specified there, will give you the largest speed-up.
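A sketch of that approach (the file path, column names, and types are placeholders):
import h2o

h2o.init()

col_dtypes = {'col1': 'numeric', 'col2': 'string'}  # hypothetical column names and H2O types
na_list = ['NA', 'none', 'nan', '']

# Let H2O read and parse the file directly instead of round-tripping through pandas.
hf = h2o.import_file('data.csv', sep=';', col_types=col_dtypes, na_strings=na_list)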

Output of sklearn.ensemble.RandomForestClassifier includes NaN values

I am using sklearn.ensemble.RandomForestClassifier to analyze data, and I was puzzled to see NaN values in the predictions even though there are no NaNs in the training set or in the testing set.
print preds_y[preds_y.isnull().any(axis=1)].shape
print train_y[train_y.isnull().any(axis=1)].shape
print train_features[train_features.isnull().any(axis=1)].shape
print test_features[train_features.isnull().any(axis=1)].shape
> (4830, 1)
> (0, 1)
> (0, 22)
> (0, 22)
These NaN values are causing the call to sklearn.metrics.classification_report to fail with the following error:
> ValueError: Mix of label input types (string and number)
Right now I'm mostly interested in understanding why the random forest is spitting out NaNs. As soon as I figure that out, I can filter the results accordingly and see how well the method is performing.
Thanks in advance for your input.
(I'm sorry if this has been asked before. I searched for it but all the results I found concerned NaNs in the training data, which is not my issue at all.)
EDIT 1: Just to be clear, there are many valid predictions in the output:
print preds_y[~preds_y.isnull().any(axis=1)].shape
print train_y[~train_y.isnull().any(axis=1)].shape
> (11760, 1)
> (39749, 1)
EDIT 2:
As I wrote in a comment below, the original data has numeric and categorical columns. All the categorical columns are converted to numeric using pandas.get_dummies() before calling fit(). I convert the results back to a pandas.DataFrame and reconstruct the original categorical columns for readability. The two pandas.Series -- predicted and actual values -- I am feeding classification_report() have only one type (category).
It seems that the NaNs in the predictions arise if the random forest predicts 0 for every dummy binary column corresponding to the original categorical column. I was not expecting this to happen so often -- it seems that 30% of my entries go unclassified -- but I'm not sure there is anything further to add on this issue.
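To illustrate the mechanism (this is only a sketch of one plausible way to map dummy columns back to a category, not the code actually used): a predicted row that is all zeros across the dummy columns naturally comes back as NaN.
import pandas as pd

# Hypothetical predicted dummy columns for an original categorical feature "color".
preds = pd.DataFrame({'color_red':   [1, 0, 0],
                      'color_blue':  [0, 1, 0],
                      'color_green': [0, 0, 0]})  # last row: no class predicted at all

# Reconstruct the category; rows without a single 1 become NaN.
color = preds.idxmax(axis=1).where(preds.sum(axis=1) > 0)
print(color)  # color_red, color_blue, NaN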
You can first remove all the NaNs by replacing them with zeros, for example with df.fillna(0); then you should be fine, I suppose.

Problems with a binary one-hot (one-of-K) coding in python

Binary one-hot (also known as one-of-K) coding consists of making one binary column for each distinct value of a categorical variable. For example, if one has a color column (categorical variable) that takes the values 'red', 'blue', 'yellow', and 'unknown', then a binary one-hot coding replaces the color column with the binary columns 'color=red', 'color=blue', 'color=yellow', and 'color=unknown'. I begin with data in a pandas data-frame and I want to use this data to train a model with scikit-learn. I know two ways to do the binary one-hot coding, neither of which is satisfactory to me.
Pandas with get_dummies on the categorical columns of the data-frame. This method seems excellent as long as the original data-frame contains all the data available. That is, you do the one-hot coding before splitting your data into training, validation, and test sets. However, if the data is already split into different sets, this method doesn't work very well. Why? Because one of the data sets (say, the test set) can contain fewer values for a given variable. For example, it can happen that whereas the training set contains the values red, blue, yellow, and unknown for the variable color, the test set only contains red and blue. So the test set would end up having fewer columns than the training set. (I also don't know how the new columns are sorted, and even if the columns were the same, they could be in a different order in each set.)
Sklearn with DictVectorizer. This solves the previous issue, as we can make sure that we apply the very same transformation to the test set. However, the outcome of the transformation is a numpy array instead of a pandas data-frame. If we want to recover the output as a pandas data-frame, we need to (or at least this is the way I do it): 1) build pandas.DataFrame(data=outcome of the DictVectorizer transformation, index=index of the original pandas data-frame, columns=DictVectorizer().get_feature_names()) and 2) join the resulting data-frame along the index with the original one containing the numerical columns. This works, but it is somewhat cumbersome.
Is there a better way to do a binary one-hot encoding within a pandas data-frame if our data is split into training and test sets?
If your columns are in the same order, you can concatenate the dfs, use get_dummies, and then split them back again, e.g.,
encoded = pd.get_dummies(pd.concat([train,test], axis=0))
train_rows = train.shape[0]
train_encoded = encoded.iloc[:train_rows, :]
test_encoded = encoded.iloc[train_rows:, :]
If your columns are not in the same order, then you'll have challenges regardless of what method you try; one way to at least keep the encoded columns aligned is to reindex the encoded test set onto the training columns, as sketched below.
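A sketch of that reindexing trick, with toy data:
import pandas as pd

train = pd.DataFrame({'color': ['red', 'blue', 'yellow', 'unknown']})
test = pd.DataFrame({'color': ['red', 'blue']})

train_encoded = pd.get_dummies(train)

# Encode the test set, then force it onto the training columns:
# missing dummy columns are created and filled with 0, extra ones are dropped.
test_encoded = pd.get_dummies(test).reindex(columns=train_encoded.columns, fill_value=0)

print(train_encoded.columns.tolist() == test_encoded.columns.tolist())  # True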
You can set your data type to categorical:
In [5]: df_train = pd.DataFrame({"car": Series(["seat", "bmw"]).astype('category', categories=['seat', 'bmw', 'mercedes']), "color": ["red", "green"]})
In [6]: df_train
Out[6]:
    car  color
0  seat    red
1   bmw  green
In [7]: pd.get_dummies(df_train)
Out[7]:
   car_seat  car_bmw  car_mercedes  color_green  color_red
0         1        0             0            0          1
1         0        1             0            1          0
See this issue of Pandas.
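As a follow-up to that approach: if the same categorical dtype is applied to the test frame, get_dummies emits the full set of columns even for categories that never appear in it. A sketch using the current CategoricalDtype API (the astype('category', categories=...) form shown above is from an older pandas version):
import pandas as pd

car_dtype = pd.CategoricalDtype(categories=['seat', 'bmw', 'mercedes'])

df_test = pd.DataFrame({'car': pd.Series(['seat']).astype(car_dtype),
                        'color': ['red']})

print(pd.get_dummies(df_test))
# car_seat, car_bmw and car_mercedes all appear even though the test frame
# only contains 'seat'; 'color' gets only the columns actually present.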
