conversion of pandas dataframe to h2o frame efficiently

I have a pandas DataFrame read from a file that is encoded in latin-1 and delimited by ;. The DataFrame is very large, roughly 350000 x 3800. I initially wanted to use sklearn, but my data has missing values (NaN), so I could not use sklearn's random forests or GBM. I therefore had to use H2O's Distributed Random Forest to train on the dataset. The main problem is that the DataFrame is not converted efficiently when I call h2o.H2OFrame(data). I looked for a way to specify the encoding, but found nothing in the documentation.
Does anyone have an idea about this? Any leads would help. I would also like to know whether there are other libraries like H2O that handle NaN values efficiently. I know that we can impute the columns, but I should not do that here: my columns are readings from different sensors, and a missing value means that the sensor is not present. I can use only Python.

import h2o
import pandas as pd
df = pd.DataFrame({'col1': [1,1,2], 'col2': ['César Chávez Day', 'César Chávez Day', 'César Chávez Day']})
hf = h2o.H2OFrame(df)
Since the problem you are facing is due to the high number of NaNs in the dataset, handle those first. There are two ways to do so.
Replace NaN with a single, obviously out-of-range value.
Ex. If a feature varies between 0 and 1, replace all NaN in that feature with -1.
Use the Imputer class (SimpleImputer in current scikit-learn releases) to handle NaN values. This replaces NaN with the mean, median, or mode of that feature.
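As a minimal sketch of the first option, assuming sensor readings that are known to lie between 0 and 1 (the column names and the sentinel value are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical sensor readings; NaN means the sensor is not present.
df = pd.DataFrame({
    "sensor_a": [0.2, np.nan, 0.9],
    "sensor_b": [np.nan, 0.5, 0.1],
})

# Assumption: valid readings lie in [0, 1], so -1 cannot be mistaken for real data.
SENTINEL = -1.0

# Count the missing readings per sensor before they are replaced.
absent = df.isna().sum()

# Replace every NaN with the out-of-range marker.
filled = df.fillna(SENTINEL)
print(filled)
```

Tree-based models can split on the sentinel, so "sensor absent" stays distinguishable from any real reading.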

If there are a large number of missing values in your data and you want to speed up the conversion, I would recommend explicitly specifying the column types and NA strings instead of letting H2O infer them. You can pass a list of strings to be interpreted as NAs and a dictionary specifying column types to the H2OFrame() constructor.
This also lets you create custom labels for the sensors that are not present, instead of a generic "not available" (impute the NaN values with a custom string in pandas).
import h2o
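# col1_type and col2_type are placeholders for H2O type names such as 'numeric', 'enum' or 'string'
col_dtypes = {'col1_name': col1_type, 'col2_name': col2_type}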
na_list = ['NA', 'none', 'nan', 'etc']
hf = h2o.H2OFrame(df, column_types=col_dtypes, na_strings=na_list)
For more information - http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/_modules/h2o/frame.html#H2OFrame
Edit: @ErinLeDell's suggestion to use h2o.import_file() directly, specifying the column types and NA strings, will give you the largest speed-up.
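A rough sketch of that route, assuming the data can be re-exported as a UTF-8 CSV that H2O then parses itself. The file names, column names, and column types below are made up for illustration, and the H2O step is wrapped in a function that is not called here because it needs the h2o package and a running cluster:

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample file standing in for the real latin-1, ';'-delimited export.
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "sensors_latin1.csv")
with open(src_path, "w", encoding="latin-1") as fh:
    fh.write("sensor_a;label\n1.5;César\n;Chávez\n")

# pandas deals with the encoding and the delimiter on the way in.
df = pd.read_csv(src_path, sep=";", encoding="latin-1")

# Re-export as comma-separated UTF-8 so that H2O can read the file directly.
utf8_path = os.path.join(tmp_dir, "sensors_utf8.csv")
df.to_csv(utf8_path, index=False, encoding="utf-8")


def load_into_h2o(path):
    """Parse the CSV inside H2O instead of converting the DataFrame."""
    import h2o  # imported lazily so the pandas part runs without H2O installed

    h2o.init()
    return h2o.import_file(
        path,
        col_types={"sensor_a": "numeric", "label": "string"},  # assumed types
        na_strings=["NA", "nan"],
    )
```

Calling load_into_h2o(utf8_path) would then hand the parsing to the H2O cluster rather than pushing the DataFrame through h2o.H2OFrame().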

Related

Not getting stats analysis of binary column pandas

I have a dataframe, 11 columns 18k rows. The last column is either a 1 or 0, but when I use .describe() all I get is
count 19020
unique 2
top 1
freq 12332
Name: Class, dtype: int64
as opposed to an actual statistical analysis with mean, std, etc.
Is there a way to do this?

If your numeric (0, 1) column is not being picked up automatically by .describe(), it might be because it's not actually encoded as an int dtype. You can see this in the documentation of the .describe() method, which tells you that the default include parameter is only for numeric types:
None (default) : The result will include all numeric columns.
My suggestion would be the following:
df.dtypes # check datatypes
df['num'] = df['num'].astype(int) # if it's not integer, cast it as such
df.describe(include=['object', 'int64']) # explicitly state the data types you'd like to describe
That is, first check the datatypes (I'm assuming the column is called num and the dataframe df, but feel free to substitute with the right ones). If this indicator/(0,1) column is indeed not encoded as int/integer type, then cast it as such by using .astype(int). Then, you can freely use df.describe() and perhaps even specify columns of which data types you want to include in the description output, for more fine-grained control.
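A small sketch of the effect, assuming the indicator column was read in as strings (the column name num follows the assumption above):

```python
import pandas as pd

# Assumption: the 0/1 indicator arrived as strings, so it is not a numeric dtype.
df = pd.DataFrame({"num": ["1", "0", "1", "1"]})

# Non-numeric column: describe() reports count / unique / top / freq.
before = df["num"].describe()

# After the cast, describe() reports mean, std, min, quartiles and max.
df["num"] = df["num"].astype(int)
after = df["num"].describe()

print(before)
print(after)
```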

You could use
# percentile list
perc =[.20, .40, .60, .80]
# list of dtypes to include
include =['object', 'float', 'int']
data.describe(percentiles = perc, include = include)
where data is your dataframe (important point).
Since you are new to Stack Overflow, I suggest that you include some actual code (i.e. something showing how, and on what, you are using your methods). You'll get better answers.

How to deal with really small (order of -322) floating values in pandas dataframe?

I have a pandas dataframe with feature values that are extremely small, on the order of 1e-322. I am trying to standardize the features but I get
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
A few values from the dataframe are as follows:
3.962406e-321
3.310240e-322
3.962406e-321
3.310240e-322
3.962406e-321
3.310240e-322
3.962406e-321
3.310240e-322
3.962406e-321
3.310240e-322
I assume that I am dealing with a value underflow problem. How can I deal with it?
This is for Python 3.6 and a pandas dataframe.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
The values in the dataframe should be standardized, but I get this error, apparently due to value underflow.

Multiply them.
You're right that your values fall below the normal range of floats. The smallest positive normal np.float64 value is ~2.23e-308; anything smaller is stored as a subnormal with sharply reduced precision, down to ~4.94e-324. You can handle somewhat smaller values by using more obscure types like np.longdouble, but these have their limits too and can be system-dependent.
As some of the comments point out, most plausible use cases don't require values this small. But if yours does, one simple way to get around the float boundaries is to multiply all of your values by a constant factor that brings them into the normal float range, for example 1e300. (A factor of 10^320 is itself larger than the float64 maximum of ~1.8e308, so a shift that large has to be applied in two steps.) You're just dropping a long sequence of zeroes, although any precision already lost to subnormal storage does not come back.
Note: this only works if you're not simultaneously storing numbers so large that multiplying them breaks the float limits in the other direction. But this seems unlikely.
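A minimal sketch of the rescaling idea with plain NumPy, using the values from the question. The factor 1e300 is an assumption chosen so that the factor itself and the shifted values all stay inside the float64 range:

```python
import numpy as np

# Values from the question, all below the smallest normal float64.
x = np.array([3.962406e-321, 3.310240e-322] * 5)

# Smallest positive normal float64, about 2.2e-308.
normal_min = np.finfo(np.float64).tiny

# Assumed constant factor; it must itself be representable as a float64.
SHIFT = 1e300
shifted = x * SHIFT  # now around 1e-21, comfortably in the normal range

# Standardize by hand: zero mean, unit variance.
z = (shifted - shifted.mean()) / shifted.std()
print(z)
```

Standardization is invariant to a constant positive factor, so the result is the same as it would be on the unshifted data if that data could be handled at full precision.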

Store the log of the number, and reverse with exp when needed later. If you then need to shift the values, the shift is additive (instead of multiplicative). Working in log-space helps avoid machine zero, though you'll still have issues to deal with when operating on the log values, i.e. log-of-sum != sum-of-logs.
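A short sketch of working in log-space, assuming two values whose logs are -750 and -751 (made up for illustration; both are far too small to store directly). Products become sums of logs, while a sum needs the log-sum-exp trick:

```python
import numpy as np

# Logs of two hypothetical tiny values.
log_vals = np.array([-750.0, -751.0])

# Converting back directly underflows to machine zero.
direct = np.exp(log_vals)

# Product of the values: just add the logs.
log_product = log_vals.sum()

# Sum of the values: factor out the largest term before exponentiating.
m = log_vals.max()
log_sum = m + np.log(np.exp(log_vals - m).sum())

print(direct, log_product, log_sum)
```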

You should try normalizing your data to bring it into a fixed range of values.
Here is some sample code:
from sklearn import preprocessing
x = df.values #returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
You are receiving NaN because the numbers fell outside the range you can handle.
EDIT1:
Your error says that your dataset contains NaN, infinite, or too-large values. Are you sure there are no empty values? If there are, try dropping them with the .dropna() function, like below:
DataFrame.dropna()
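A minimal sketch of checking for such values before scaling, on a made-up frame. It drops every row that contains a NaN or an infinity, which goes slightly beyond dropna() alone:

```python
import numpy as np
import pandas as pd

# Hypothetical feature frame with one NaN and one infinity.
X = pd.DataFrame({
    "f1": [1.0, np.nan, 3.0, 4.0],
    "f2": [0.5, 0.7, np.inf, 0.9],
})

# True for rows in which every feature is a finite number.
finite_rows = np.isfinite(X).all(axis=1)

# Keep only the rows that a scaler can process.
X_clean = X[finite_rows]
print(X_clean)
```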

Pandas - String values encoding

Can anyone please suggest the best way to encode string features when I have more than 500 unique values? Does this fall under categorical data?
I basically need to normalize data with string features that have a huge number of unique values, and adjacent features are correlated (e.g. col1 and col2 have a particular combination for one class in a classification problem; similarly, col3 and col4 have some fixed pattern for each class).
How do I encode my data in this scenario before feeding it to an ML algorithm?

There are several ways to encode categorical features. The best way really depends on your dataset and which ML algorithm you are going to use, so you could try different encoding schemes and pick the one that has the best results.
I've worked with categorical features with hundreds of unique values (e.g. product brands) together with tree-based algorithms, and a label encoder worked well.
For example you could use the scikit-learn label encoder:
>>> from sklearn import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> le.fit(["paris", "paris", "tokyo", "amsterdam"])
LabelEncoder()
>>> list(le.classes_)
['amsterdam', 'paris', 'tokyo']
>>> le.transform(["tokyo", "tokyo", "paris"])
array([2, 2, 1]...)
>>> list(le.inverse_transform([2, 2, 1]))
['tokyo', 'tokyo', 'paris']
You can do that in pandas as well, for example, if you have a column with the string categories you want to encode you could try this:
df["categorical_feature"] = df["categorical_feature"].astype('category')
df["categorical_feature_enc"] = df["categorical_feature"].cat.codes
Another useful encoding you could try is the one-hot encoding. However, since you have a lot of categories to encode that would result in an addition of n columns to your dataset per categorical feature (n = number of categories). Check the pandas get_dummies to see an example.
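To make the difference in width concrete, here is a small sketch on a made-up city column: integer codes keep a single column, while one-hot encoding adds one column per category.

```python
import pandas as pd

# Hypothetical categorical column with three distinct values.
df = pd.DataFrame({"city": ["paris", "paris", "tokyo", "amsterdam"]})

# Integer codes: still one column, categories numbered in sorted order.
codes = df["city"].astype("category").cat.codes

# One-hot encoding: one indicator column per distinct category.
one_hot = pd.get_dummies(df["city"], prefix="city")

print(codes.tolist())
print(one_hot.columns.tolist())
```

With more than 500 distinct values, the one-hot variant would add more than 500 columns for that single feature.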

How to impute each categorical column in numpy array

There are good solutions for imputing a pandas DataFrame. But since I work mainly with numpy arrays, I have to create a new pandas DataFrame object, impute, and then convert back to a numpy array as follows:
nomDF=pd.DataFrame(x_nominal) #Convert np.array to pd.DataFrame
nomDF=nomDF.apply(lambda x:x.fillna(x.value_counts().index[0])) #replace NaN with most frequent in each column
x_nominal=nomDF.values #convert back pd.DataFrame to np.array
Is there a way to directly impute in numpy array?

We could use SciPy's mode to get the most frequent value in each column. The leftover work would be to get the NaN indices and replace those positions in the input array with the mode values by indexing.
So, the implementation would look something like this -
import numpy as np
from scipy.stats import mode
R,C = np.where(np.isnan(x_nominal))  # row/column indices of the NaNs
vals = np.asarray(mode(x_nominal, axis=0, nan_policy='omit')[0]).ravel()  # per-column mode, ignoring NaNs
x_nominal[R,C] = vals[C]  # fill each NaN with the mode of its column
Please note that in tie situations, i.e. when several categories/elements share the highest count, the pandas approach with value_counts would choose the highest value, whereas SciPy's mode chooses the lowest one.
If you are dealing with a mixed dtype of strings and NaNs, I would suggest a few modifications, keeping the last step unchanged, to make it work (recent SciPy releases restrict mode to numeric arrays, so this variant may require an older version) -
x_nominal_U3 = x_nominal.astype('U3')
R,C = np.where(x_nominal_U3=='nan')
vals = mode(x_nominal_U3,axis=0)[0].ravel()
This throws a warning for the mode calculation: RuntimeWarning: The input array could not be properly checked for nan values. nan values will be ignored. But since we actually want to ignore NaNs in that mode calculation, we should be okay there.
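For string data, a NumPy-only alternative is to count the values per column with np.unique. This is a swapped-in technique rather than the mode-based code above, sketched on a made-up array; it assumes that no genuine category is literally named "nan":

```python
import numpy as np

# Hypothetical nominal data with missing entries.
x_nominal = np.array(
    [["a", "x"], ["a", np.nan], [np.nan, "y"], ["b", "y"]], dtype=object
)

# Convert everything to strings; NaN becomes the text "nan".
x_str = x_nominal.astype(str)

for c in range(x_str.shape[1]):
    col = x_str[:, c]  # a view, so assignments modify x_str
    missing = col == "nan"
    # Count the non-missing values; np.unique returns them sorted.
    values, counts = np.unique(col[~missing], return_counts=True)
    # argmax picks the first maximum, i.e. the smallest value on ties.
    col[missing] = values[counts.argmax()]

print(x_str)
```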

Problems with a binary one-hot (one-of-K) coding in python

Binary one-hot (also known as one-of-K) coding consists of making one binary column for each distinct value of a categorical variable. For example, if one has a color column (categorical variable) that takes the values 'red', 'blue', 'yellow', and 'unknown', then a binary one-hot coding replaces the color column with the binary columns 'color=red', 'color=blue', and 'color=yellow'. I begin with data in a pandas data-frame and I want to use this data to train a model with scikit-learn. I know two ways to do the binary one-hot coding, neither of them satisfactory to me.
Pandas and get_dummies on the categorical columns of the data-frame. This method seems excellent as long as the original data-frame contains all the data available. That is, you do the one-hot coding before splitting your data into training, validation, and test sets. However, if the data is already split into different sets, this method doesn't work very well. Why? Because one of the data sets (say, the test set) can contain fewer values for a given variable. For example, it can happen that whereas the training set contains the values red, blue, yellow, and unknown for the variable color, the test set only contains red and blue. So the test set would end up having fewer columns than the training set. (I also don't know how the new columns are sorted, and whether, even with the same columns, they could come out in a different order in each set.)
Sklearn and DictVectorizer. This solves the previous issue, as we can make sure that we are applying the very same transformation to the test set. However, the outcome of the transformation is a numpy array instead of a pandas data-frame. If we want to recover the output as a pandas data-frame, we need to (or at least this is the way I do it): 1) pandas.DataFrame(data=outcome of DictVectorizer transformation, index=index of original pandas data frame, columns=DictVectorizer().get_feature_names_out(), or get_feature_names() in older releases) and 2) join the resulting data-frame along the index with the original one containing the numerical columns. This works, but it is somewhat cumbersome.
Is there a better way to do a binary one-hot encoding within a pandas data-frame if we have our data split in training and test set?

If your columns are in the same order, you can concatenate the dfs, use get_dummies, and then split them back again, e.g.,
encoded = pd.get_dummies(pd.concat([train,test], axis=0))
train_rows = train.shape[0]
train_encoded = encoded.iloc[:train_rows, :]
test_encoded = encoded.iloc[train_rows:, :]
If your columns are not in the same order, then you'll have challenges regardless of what method you try.
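When the sets have to be encoded separately, another option is to align the test columns to the training columns with reindex. This is a different technique from the concatenation above, sketched on made-up frames:

```python
import pandas as pd

# Hypothetical split where the test set lacks some colors.
train = pd.DataFrame({"color": ["red", "blue", "yellow", "unknown"],
                      "size": [1, 2, 3, 4]})
test = pd.DataFrame({"color": ["red", "blue"], "size": [5, 6]})

train_encoded = pd.get_dummies(train)

# Encode the test set, then force the training column set and order;
# indicator columns missing from the test set are filled with 0.
test_encoded = pd.get_dummies(test).reindex(
    columns=train_encoded.columns, fill_value=0
)

print(test_encoded)
```

Note that reindex also drops any category that appears only in the test set, which is usually the desired behavior for a model trained on the training columns.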

You can set your data type to categorical:
In [5]: df_train = pd.DataFrame({"car": pd.Series(pd.Categorical(["seat","bmw"], categories=['seat','bmw','mercedes'])), "color": ["red","green"]})
In [6]: df_train
Out[6]:
    car  color
0  seat    red
1   bmw  green
In [7]: pd.get_dummies(df_train, dtype=int)
Out[7]:
   car_seat  car_bmw  car_mercedes  color_green  color_red
0         1        0             0            0          1
1         0        1             0            1          0
See this issue of Pandas.
