One hot encoding when reading into tf.dataset - python

I am running a TensorFlow model on GCP AI Platform. The dataset is large and cannot be kept in memory all at once, so I read the data into a tf.data.Dataset using the following code:
import tensorflow as tf

def read_dataset(filepattern):
    def decode_csv(value_column):
        cols = tf.io.decode_csv(value_column, record_defaults=[[0.0], [0], [0.0]])
        features = [cols[1], cols[2]]
        label = cols[0]
        return features, label

    # Create list of files that match pattern
    file_list = tf.io.gfile.glob(filepattern)
    # Create dataset from file list
    dataset = tf.data.TextLineDataset(file_list).map(decode_csv)
    return dataset

training_data = read_dataset(<filepattern>)
The problem is that the second column in my data is categorical, and I need to one-hot encode it. How can this be done, either inside the function decode_csv or by manipulating the tf.data.Dataset afterwards?

You could use tf.one_hot. Assuming that the second column is cols[1] and that the categorical values have been converted to integers, you could do the following:
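def decode_csv(value_column):
    cols = tf.io.decode_csv(value_column, record_defaults=[[0.0], [0], [0.0]])
    # cols[1] is the integer-coded categorical column; nb_classes is the number of categories.
    # A tuple is returned because tf.data converts a Python list into a single tensor.
    features = (tf.one_hot(cols[1], nb_classes), cols[2])
    label = cols[0]
    return features, label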
NOTE: Not tested.
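As a rough illustration of the idea above, here is a minimal sketch that one-hot encodes the second column and concatenates it with the numeric column into one feature vector. The value of NB_CLASSES, the in-memory sample lines, and the choice to concatenate are assumptions for the demo, not part of the original question.

```python
import tensorflow as tf

NB_CLASSES = 3  # assumed number of distinct categories in the second column


def decode_csv(line):
    # Column order follows the question: label, categorical (int), numeric.
    label, category, value = tf.io.decode_csv(
        line, record_defaults=[[0.0], [0], [0.0]]
    )
    one_hot = tf.one_hot(category, NB_CLASSES)  # float32 vector of length NB_CLASSES
    # Append the numeric column so each example is a single flat vector.
    features = tf.concat([one_hot, tf.expand_dims(value, 0)], axis=0)
    return features, label


# Two in-memory lines stand in for tf.data.TextLineDataset(file_list).
lines = tf.data.Dataset.from_tensor_slices(["1.0,2,0.5", "0.0,0,1.5"])
dataset = lines.map(decode_csv)
rows = [(f.numpy().tolist(), float(l.numpy())) for f, l in dataset]
```

With real files, the same decode_csv could be passed to tf.data.TextLineDataset(file_list).map(...) as in the question.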

Related

Detect the two most frequent labels in the labels variable, the other records of the dataset will be eliminated

I am working with a dataset from which I need to remove some records based on the value of a variable.
The dataset is from the sklearn library:
from sklearn.datasets import fetch_kddcup99
Detect the two most frequent labels in the labels variable, the other records of the dataset will be eliminated.
datos = pd_data.groupby('labels').size().sort_values(ascending=False)
top = datos.head(2)
print(top)
I tried to delete them this way, but the records are not removed:
When looking at the dataset, the other records are still there:
And I need:
If I understand your question, you want to create a dataframe that keeps only the records carrying one of the two most frequent labels.
Assuming you have a list a of the desired labels, you can filter the dataframe as follows:
a = ["b'neptune,'", "b'normal,'"]
dfout = df[df['labels'].isin(a)]
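The list of labels does not have to be typed by hand. A minimal sketch, assuming a small made-up dataframe with a labels column (the values below are illustrative, not the real KDD Cup data), could derive the two most frequent labels and filter in one go:

```python
import pandas as pd

# Toy data standing in for the real dataframe.
df = pd.DataFrame(
    {
        "labels": ["normal", "neptune", "smurf", "normal", "neptune", "normal"],
        "duration": [0, 1, 2, 3, 4, 5],
    }
)

# value_counts() sorts by frequency, so the first two index entries
# are the two most frequent labels.
top_labels = df["labels"].value_counts().head(2).index

# Keep only the rows whose label is one of those two.
filtered = df[df["labels"].isin(top_labels)]
```

The boolean mask returned by isin has to be used to index the dataframe; on its own it only marks which rows match.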

Keeping track of the output columns in sklearn preprocessing

How do I keep track of the columns of the transformed array produced by sklearn.compose.ColumnTransformer? By "keeping track of" I mean that every bit of information required to perform an inverse transform must be shown explicitly. This includes at least the following:
What is the source variable of each column in the output array?
If a column of the output array comes from one-hot encoding of a categorical variable, what is that category?
What is the exact imputed value for each variable?
What is the (mean, stdev) used to standardize each numerical variable? (These may differ from direct calculation because of imputed missing values.)
I am using the approach from this answer. My input dataset is also a generic pandas.DataFrame with multiple numerical and categorical columns. That answer can transform the raw dataset, but I lose track of the columns in the output array. I need this information for peer review, report writing, presentation and further model-building steps. I've been searching for a systematic approach, with no luck.
The answer you mentioned is based on this example in Sklearn.
You can get the answer to your first two questions using the following snippet.
def get_feature_names(columnTransformer):
    output_features = []
    for name, pipe, features in columnTransformer.transformers_:
        if name != 'remainder':
            for i in pipe:
                trans_features = []
                if hasattr(i, 'categories_'):
                    # one-hot encoder step; scikit-learn >= 1.0 names this method get_feature_names_out
                    trans_features.extend(i.get_feature_names(features))
                else:
                    trans_features = features
            output_features.extend(trans_features)
    return output_features
import pandas as pd

pd.DataFrame(preprocessor.fit_transform(X_train),
             columns=get_feature_names(preprocessor))

transformed_cols = get_feature_names(preprocessor)

def get_original_column(col_index):
    return transformed_cols[col_index].split('_')[0]

get_original_column(3)
# 'embarked'
get_original_column(0)
# 'age'

def get_category(col_index):
    new_col = transformed_cols[col_index].split('_')
    return 'no category' if len(new_col) < 2 else new_col[-1]

print(get_category(3))
# 'Q'
print(get_category(0))
# 'no category'
Tracking whether there has been some imputation or scaling done on a feature is not trivial with the current version of Sklearn.
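For the last two questions, one possibility is to read the fitted attributes of the individual steps. The sketch below assumes the numeric columns go through a Pipeline with a SimpleImputer followed by a StandardScaler; the step names, column names and toy data are made up for illustration and may not match your pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data with one missing value per column.
X = pd.DataFrame(
    {"age": [20.0, 30.0, np.nan, 40.0], "fare": [10.0, np.nan, 30.0, 20.0]}
)
numeric_cols = ["age", "fare"]

numeric_pipe = Pipeline(
    [("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())]
)
preprocessor = ColumnTransformer([("num", numeric_pipe, numeric_cols)])
preprocessor.fit(X)

# The fitted pipeline is available by the name given in the ColumnTransformer.
fitted = preprocessor.named_transformers_["num"]

# statistics_ holds the value imputed for each column, in input order.
imputed = dict(zip(numeric_cols, fitted.named_steps["imputer"].statistics_))
# mean_ and scale_ are computed after imputation, also in input order.
means = dict(zip(numeric_cols, fitted.named_steps["scaler"].mean_))
stdevs = dict(zip(numeric_cols, fitted.named_steps["scaler"].scale_))
```

Because the scaler is fitted on the imputed data, these means and standard deviations can differ from what you would compute directly on the raw columns.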

Factorize real time data with consistent mappings to training data?

In production, I would like to use a previously saved model to predict on my real-time data.
However, I don't know how to give my real-time data a mapping consistent with the training data when factorizing categorical data.
From this article I know I can stack the training data and the new data together to make them consistent.
However, stacking and going through the whole process (feature engineering, training and prediction) is too time-consuming.
Whole process: 15 min vs. model.prediction only: 3 sec
As the production system is time-sensitive, is there any method I can use to factorize the new data with the same mapping as the training data?
Or can I only do it "manually", such as
df.loc[df['col_name']=='YES', 'col_name'] = '1'
which could lead to very long code?
If what you mean is accounting for novel categorical values as they come in (say, you get a new value of 'blue-green' for df.color), you could bounce any unexpected values to the same -1 bucket (unknown, let's say) and then handle that in post-processing or whenever it is that you re-tune the model.
Essentially, you could catch the category-exceptions and then handle them later on.
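A minimal sketch of that idea, assuming the categories seen during training are kept around (the column values here are invented): pd.factorize returns the unique training values, and pd.Categorical can reuse them so that unseen values get the code -1.

```python
import pandas as pd

# Training time: factorize and keep the unique values (the mapping).
train = pd.Series(["YES", "NO", "YES", "MAYBE"])
train_codes, uniques = pd.factorize(train)

# Prediction time: encode new data against the stored training categories.
# Values that were not seen during training receive the code -1.
new = pd.Series(["NO", "blue-green", "YES"])
new_codes = pd.Categorical(new, categories=uniques).codes
```

The uniques index would need to be saved alongside the model (for example with pickle) so the real-time process can load it without touching the training data.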
After a few hours of work, I switched from pd.factorize to LabelEncoder().
As LabelEncoder only accepts a single column (a pd.Series) at a time, I loop through all the columns and store each fitted LabelEncoder() in a dictionary.
In the training data part:
import pickle
from sklearn.preprocessing import LabelEncoder

# columns to label-encode
col_list = ['colA', 'colB', 'colC']
df[col_list] = df[col_list].fillna('NAN')
# create a dictionary to save the fitted LabelEncoder() for each column
model = {}
# convert categorical data
for x in col_list:
    encoder = LabelEncoder()
    df[x] = encoder.fit_transform(df[x])
    model[x] = encoder
# save dictionary to pickle file
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)
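In the real-time data part:
with open('model.pkl', 'rb') as f:
    model = pickle.load(f)
for x in col_list:
    encoder = model[x]
    try:
        df[x] = encoder.transform(df[x].astype(str))
    except (ValueError, TypeError):
        # fall back to integer labels if the string form does not match the fitted classes
        df[x] = encoder.transform(df[x].astype(int))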
As a result, it now takes 1.5 sec to load the data, do the feature engineering and predict.
Which algorithm are you using? I have come across the same problem, but since I am using LGBM, it turns out there is no need to factorize my data: the algorithm can handle categorical values. I only had to change the data type from 'object' to 'category'.
categorical_feats = [f for f in combined_data.columns if combined_data[f].dtype == 'object']
categorical_feats
for f_ in categorical_feats:
    # Set feature type as categorical
    combined_data[f_] = combined_data[f_].astype('category')

Appending to h5py groups

I have files with the following structure:
time 1
index 1
value x
value y
time 1
index 2
value x
value y
time 2
index 1
...
I wish to convert the file to the hdf5 format using h5py, and sort the values from each index into separate groups.
My approach is
import struct
import h5py

f = h5py.File(filename1, 'a')
trajfile = open(filename2, 'rb')
for i in range(length_of_file):
    time = struct.unpack('>d', trajfile.read(8))[0]
    index = struct.unpack('>i', trajfile.read(4))[0]
    x = struct.unpack('>d', trajfile.read(8))[0]
    y = struct.unpack('>d', trajfile.read(8))[0]
    f.create_dataset('/' + str(index), data=[time, x, y])
But in this way I am not able to append to the groups (I am only able to write to each group once...). The error message is "RuntimeError: Unable to create link (name already exists)".
Is there a way to append to the groups?
You can write to a dataset as many times as you want; you just can't create two datasets with the same name. That is the error you're getting. Note that you are creating a dataset and putting data into it at the same time. To write more data into it later, it has to be large enough to accommodate that data.
Anyway, I believe you are confusing groups and datasets.
Groups are created with e.g.
grp = f.create_group('bar') # this creates the group '/bar'
and you want to store your data in a dataset, created like you said with:
dst = f.create_dataset('foo',shape=(100,)) # this creates the dataset 'foo', with enough space for 100 elements.
You only need to create groups and datasets once, but you can refer to them through their handles (grp and dst) in order to write to them.
I suggest you first go through your file once, create your desired groups and datasets using the 'shape' parameter to size them properly, and then populate the datasets with the actual data.
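The two-pass suggestion could be sketched roughly as follows. The record layout (big-endian double, int, double, double) is taken from the question; the helper names, the in-memory demo bytes and the one-dataset-per-index layout are assumptions made for illustration.

```python
import io
import struct
from collections import Counter

import h5py

# One record: time (double), index (int), x (double), y (double), big-endian.
RECORD = struct.Struct(">didd")


def read_records(stream):
    """Yield (time, index, x, y) tuples until the stream is exhausted."""
    while True:
        chunk = stream.read(RECORD.size)
        if len(chunk) < RECORD.size:
            break
        yield RECORD.unpack(chunk)


def convert(raw_bytes, h5file):
    # First pass: count how many records belong to each index.
    counts = Counter(idx for _, idx, _, _ in read_records(io.BytesIO(raw_bytes)))
    # Create each dataset exactly once, sized for all of its rows.
    datasets = {
        idx: h5file.create_dataset(str(idx), shape=(n, 3), dtype="f8")
        for idx, n in counts.items()
    }
    # Second pass: write each record into the next free row of its dataset.
    filled = Counter()
    for time, idx, x, y in read_records(io.BytesIO(raw_bytes)):
        datasets[idx][filled[idx]] = (time, x, y)
        filled[idx] += 1


# Demo with three records held in memory instead of a real trajectory file.
raw = b"".join(
    RECORD.pack(t, i, x, y)
    for t, i, x, y in [(1.0, 1, 0.1, 0.2), (1.0, 2, 0.3, 0.4), (2.0, 1, 0.5, 0.6)]
)
with h5py.File("demo.h5", "w", driver="core", backing_store=False) as f:
    convert(raw, f)
    first = f["1"][...].tolist()
```

If the number of records per index is not known in advance and a second pass is too costly, h5py also supports resizable datasets (created with a maxshape argument), which is a different technique from the one described above.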

Pre-process non-image data to feed into Tensorflow DNN

I have a large amount of non-image data spread across several delimited files which I want to use as inputs to a DNN in TensorFlow. The data need some pre-processing, so I am trying to use the CIFAR10 example in the TensorFlow source as an example because it has pre-processing, it processes multiple files, and queues data for the model.
I cannot figure out how the data should be represented given that I'll have multiple FeatureColumns and the data are read record by record.
My input data look like below, delimited by '|'. The first column I want to pre-process, the result of which is two values; these values will then convert into Tensors with tf.contrib.layers.sparse_column_with_hash_bucket; the second is a real-valued column which I want to convert with tf.contrib.layers.real_valued_column; and the third is the label I want to predict.
uywohy|12.3|0
asdfsvjlk|2.2|1
nlnliu|1.0|1
nlwljw|9.6|0
My plan is to read the data with tf.TextLineReader, split the data on the delimiter, and then pre-process. The example code starts here.
# Read in and pre-process a single record
DELIMITER = "|"
reader = tf.TextLineReader()
unparsed_record = reader.read()
col1, col2, label = unparsed_record.split(DELIMITER)
result.label = tf.cast(label, tf.int32)
col1_a, col1_b = _preprocess(col1)
# How to convert col1_a, col1_b, and col2 into a Tensor?
However, I'm not sure how to then re-assemble the data (col1_a, col1_b, and col2) into something that can be fed into the model. The CIFAR10 model doesn't make use of feed_dict, so I don't see how to assemble the data.
Any help is much appreciated.
You can use tf.learn, which requires an input function that populates the mappings from feature columns to tensors or sparse tensors. Here is an example: https://github.com/tensorflow/tensorflow/blob/r0.11/tensorflow/examples/learn/wide_n_deep_tutorial.py
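The linked tutorial targets TensorFlow r0.11. In TensorFlow 2.x, where tf.contrib no longer exists, a comparable mapping from feature names to tensors could be sketched with tf.data instead. This swaps in a different API from the one the answer names; NUM_BUCKETS and the feature names are assumptions, and the custom _preprocess step from the question is left out.

```python
import tensorflow as tf

NUM_BUCKETS = 100  # assumed size of the hash space for the string column


def parse_line(line):
    # Split one '|'-delimited record into string, float and int columns.
    col1, col2, label = tf.io.decode_csv(
        line, record_defaults=[[""], [0.0], [0]], field_delim="|"
    )
    features = {
        # Hash the string column into a fixed number of integer buckets.
        "col1_bucket": tf.strings.to_hash_bucket_fast(col1, NUM_BUCKETS),
        "col2": col2,
    }
    return features, label


# Two sample records from the question stand in for the real files.
lines = tf.data.Dataset.from_tensor_slices(["uywohy|12.3|0", "asdfsvjlk|2.2|1"])
dataset = lines.map(parse_line).batch(2)
features, labels = next(iter(dataset))
```

Each batch is a dictionary keyed by feature name plus a label tensor, which plays the same role as the feature-column-to-tensor mapping the input function returns in the tutorial.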
