I have a large amount of non-image data spread across several delimited files which I want to use as inputs to a DNN in TensorFlow. The data need some pre-processing, so I am trying to use the CIFAR10 example in the TensorFlow source as a template because it pre-processes its inputs, reads multiple files, and queues data for the model.
I cannot figure out how the data should be represented given that I'll have multiple FeatureColumns and the data are read record by record.
My input data look like the sample below, delimited by '|'. I want to pre-process the first column, which yields two values; these will then be converted into Tensors with tf.contrib.layers.sparse_column_with_hash_bucket. The second is a real-valued column which I want to convert with tf.contrib.layers.real_valued_column, and the third is the label I want to predict.
uywohy|12.3|0
asdfsvjlk|2.2|1
nlnliu|1.0|1
nlwljw|9.6|0
My plan is to read the data with tf.TextLineReader, split the data on the delimiter, and then pre-process. The example code starts here.
# Read in and pre-process a single record
DELIMITER = "|"
reader = tf.TextLineReader()
unparsed_record = reader.read()
col1, col2, label = unparsed_record.split(DELIMITER)
result.label = tf.cast(label, tf.int32)
col1_a, col1_b = _preprocess(col1)
# How to convert col1_a, col1_b, and col2 into a Tensor?
However, I'm not sure how to then re-assemble the data (col1_a, col1_b, and col2) into something that can be fed into the model. The CIFAR10 model doesn't make use of feed_dict, so I don't see how to assemble the data.
Any help is much appreciated.
You can use tf.learn (tf.contrib.learn), which requires an input function that populates the mapping from feature columns to tensors or sparse tensors. Here is an example: https://github.com/tensorflow/tensorflow/blob/r0.11/tensorflow/examples/learn/wide_n_deep_tutorial.py
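To make the shape of that mapping concrete, here is a framework-free sketch in plain Python (no TensorFlow calls) of what an input function has to produce for the data in the question: a dict keyed by feature name, plus the labels. The names `preprocess`, `hash_bucket` and `input_fn`, the bucket count, and the split rule inside `preprocess` are all hypothetical stand-ins; in a real input function the lists would be tensors or sparse tensors.

```python
import zlib

DELIMITER = "|"
NUM_BUCKETS = 1000  # assumed bucket count, chosen only for illustration


def preprocess(col1):
    """Hypothetical stand-in for _preprocess: split the string in half."""
    mid = len(col1) // 2
    return col1[:mid], col1[mid:]


def hash_bucket(value, num_buckets=NUM_BUCKETS):
    """Deterministically map a string to a bucket id in [0, num_buckets)."""
    return zlib.crc32(value.encode("utf-8")) % num_buckets


def input_fn(lines):
    """Return (features, labels): feature name -> column values, plus labels."""
    features = {"col1_a": [], "col1_b": [], "col2": []}
    labels = []
    for line in lines:
        col1, col2, label = line.strip().split(DELIMITER)
        col1_a, col1_b = preprocess(col1)
        features["col1_a"].append(hash_bucket(col1_a))
        features["col1_b"].append(hash_bucket(col1_b))
        features["col2"].append(float(col2))
        labels.append(int(label))
    return features, labels


features, labels = input_fn(["uywohy|12.3|0", "asdfsvjlk|2.2|1"])
print(features["col2"], labels)
```

The point of the sketch is the return value: every feature column the model declares has one entry in the dict, and the label travels separately.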
Related
How do I keep track of the columns of the transformed array produced by sklearn.compose.ColumnTransformer? By "keeping track of" I mean that every bit of information required to perform an inverse transform must be shown explicitly. This includes at least the following:
What is the source variable of each column in the output array?
If a column of the output array comes from one-hot encoding of a categorical variable, what is that category?
What is the exact imputed value for each variable?
What is the (mean, stdev) used to standardize each numerical variable? (These may differ from direct calculation because of imputed missing values.)
I am using the approach from this answer. My input dataset is also a generic pandas.DataFrame with multiple numerical and categorical columns. That answer does transform the raw dataset, but I lose track of the columns in the output array. I need this information for peer review, report writing, presentation and further model-building steps. I've been searching for a systematic approach, with no luck.
The answer you mentioned is based on this in Sklearn.
You can get the answer to your first two questions using the following snippet.
def get_feature_names(columnTransformer):
    output_features = []
    for name, pipe, features in columnTransformer.transformers_:
        if name!='remainder':
            for i in pipe:
                trans_features = []
                if hasattr(i,'categories_'):
                    trans_features.extend(i.get_feature_names(features))
                else:
                    trans_features = features
            output_features.extend(trans_features)
    return output_features

import pandas as pd
pd.DataFrame(preprocessor.fit_transform(X_train),
             columns=get_feature_names(preprocessor))

transformed_cols = get_feature_names(preprocessor)

def get_original_column(col_index):
    return transformed_cols[col_index].split('_')[0]

get_original_column(3)
# 'embarked'
get_original_column(0)
# 'age'

def get_category(col_index):
    new_col = transformed_cols[col_index].split('_')
    return 'no category' if len(new_col)<2 else new_col[-1]

print(get_category(3))
# 'Q'
print(get_category(0))
# 'no category'
Tracking whether there has been some imputation or scaling done on a feature is not trivial with the current version of Sklearn.
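For the last two questions, scikit-learn does keep the fitted values: a fitted `SimpleImputer` exposes `statistics_`, and a fitted `StandardScaler` exposes `mean_` and `scale_`. They have to be collected from each fitted step by hand. The plain-Python sketch below uses made-up numbers and a hypothetical helper, `fit_impute_and_scale`, and only illustrates why the recorded mean and standard deviation differ from a direct calculation on the raw column once missing values are imputed with the median.

```python
import statistics


def fit_impute_and_scale(values):
    """Median-impute None entries, then compute the mean and the population
    standard deviation of the imputed column."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    imputed = [fill if v is None else v for v in values]
    return {
        "imputed_value": fill,
        "mean": statistics.fmean(imputed),
        "stdev": statistics.pstdev(imputed),
    }


age = [20.0, None, 40.0, 90.0]  # made-up column with one missing value
stats = fit_impute_and_scale(age)

# Mean of the observed values only, i.e. the "direct calculation".
direct_mean = statistics.fmean([v for v in age if v is not None])
print(stats, direct_mean)
```

Here the imputed column has mean 47.5 while the observed values alone have mean 50.0, which is the discrepancy the question anticipates.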
I am running a TensorFlow model on GCP AI Platform. The dataset is large and cannot all be kept in memory at once, so I read the data into a tf.data.Dataset using the following code:
def read_dataset(filepattern):
    def decode_csv(value_column):
        cols = tf.io.decode_csv(value_column, record_defaults=[[0.0],[0],[0.0]])
        features=[cols[1],cols[2]]
        label = cols[0]
        return features, label

    # Create list of files that match pattern
    file_list = tf.io.gfile.glob(filepattern)
    # Create dataset from file list
    dataset = tf.data.TextLineDataset(file_list).map(decode_csv)
    return dataset

training_data=read_dataset(<filepattern>)
The problem is that the second column in my data is categorical, and I need to one-hot encode it. How can this be done, either inside the function decode_csv or by manipulating the dataset afterwards?
You could use tf.one_hot. Assuming that the second column is cols[1] and that the categorical values have been converted to integers, you could do the following:
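def decode_csv(value_column):
    cols = tf.io.decode_csv(value_column, record_defaults=[[0.0],[0],[0.0]])
    # cols[1] is the integer-coded categorical column; nb_classes is the number of categories.
    # A tuple keeps the two components separate, since they differ in shape.
    features = (tf.one_hot(cols[1], nb_classes), cols[2])
    label = cols[0]
    return features, label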
NOTE: Not tested.
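For reference, here is a plain-Python sketch of the vector that one-hot encoding produces for an integer category id. It mirrors the documented behaviour of `tf.one_hot`, where an out-of-range index such as -1 yields an all-zero row. The function is an illustration, not TensorFlow code, and `nb_classes = 4` is an assumed value.

```python
def one_hot(index, depth):
    """Return a list of `depth` floats with 1.0 at `index` and 0.0 elsewhere.
    An out-of-range index gives an all-zero vector."""
    return [1.0 if i == index else 0.0 for i in range(depth)]


nb_classes = 4  # assumed number of categories
print(one_hot(2, nb_classes))
print(one_hot(-1, nb_classes))
```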
In production, I would like to use a previously saved model to predict on real-time data.
However, I don't know how to give my real-time data a mapping consistent with the training data when factorizing categorical columns.
From this article I know I can stack the training data and new data together to make them consistent.
However, stacking and going through the whole process (feature engineering, training and prediction) is too time-consuming.
Whole process: 15 min vs. model.prediction only: 3 sec
As the production system is time-sensitive, is there any method I can use to factorize the new data with the same mapping as the training data?
Or can I only do it manually, such as
df.loc[df['col_name']=='YES', 'col_name'] = '1'
which could lead to very long code?
If what you mean is accounting for novel categorical values as they come in (say, you get a new value of 'blue-green' for df.color), you could bounce any unexpected values to the same -1 bucket (unknown, let's say) and then handle that in post-processing or whenever it is that you re-tune the model.
Essentially, you could catch the category-exceptions and then handle them later on.
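A minimal sketch of that idea in plain Python: learn the value-to-code mapping once from the training data, persist it, and send anything unseen to -1 at prediction time. The helper names `fit_mapping` and `encode` and the sample colours are hypothetical. Codes are assigned in order of first appearance, which is also what pd.factorize does by default.

```python
UNKNOWN = -1  # shared bucket for categories not seen during training


def fit_mapping(train_values):
    """Assign consecutive integer codes in order of first appearance."""
    mapping = {}
    for value in train_values:
        mapping.setdefault(value, len(mapping))
    return mapping


def encode(values, mapping):
    """Encode with the training mapping; unseen values fall into UNKNOWN."""
    return [mapping.get(value, UNKNOWN) for value in values]


mapping = fit_mapping(["red", "blue", "red", "green"])
codes = encode(["blue", "blue-green", "red"], mapping)
print(mapping, codes)
```

Because the mapping is an ordinary dict, it can be saved next to the model and reloaded at prediction time without refitting on the stacked data.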
After a few hours of work, I switched from pd.factorize to LabelEncoder().
As LabelEncoder only works on a single column (pd.Series) at a time, I loop through all the columns and store each fitted LabelEncoder() in a dictionary.
Training data part:
import pickle
from sklearn.preprocessing import LabelEncoder

# columns to label-encode
col_list = ['colA', 'colB', 'colC']
df[col_list] = df[col_list].fillna('NAN')
# dictionary holding the fitted LabelEncoder for each column
model = {}
# convert categorical data
for x in col_list:
    encoder = LabelEncoder()
    df[x] = encoder.fit_transform(df[x])
    model[x] = encoder
# save the dictionary to a pickle file
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)
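Real-time data part:
import pickle

with open('model.pkl', 'rb') as f:
    model = pickle.load(f)
for x in col_list:  # same col_list as in the training part
    encoder = model[x]
    try:
        df[x] = encoder.transform(df[x].astype(str))
    except Exception:
        # fall back to integer labels if the string form does not match
        df[x] = encoder.transform(df[x].astype(int))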
As a result, it takes 1.5 sec to load the data, do the feature engineering and predict.
Which algorithm are you using? I have come across the same problem, but since I am using LGBM, it turns out there is no need to factorize my data: the algorithm can handle categorical values. I only had to change the data type from 'object' to 'category'.
categorical_feats = [f for f in combined_data.columns if combined_data[f].dtype == 'object']
categorical_feats
for f_ in categorical_feats:
    # Set feature type as categorical
    combined_data[f_] = combined_data[f_].astype('category')
I have nearly a TB of data to process. One field is the list of tags that a video is linked to. The problem is that there are a great many distinct tags, and a single video can be linked to many of them. How can I convert (clean) this field before processing? One-hot encoding and the other approaches I know of don't fit this case.
Example:
{"user_id":1, "vid_id":101, "name":"abc", "tags":["night", "horror"], "gender":"Male"}
{"user_id":2, "vid_id":192, "name":"xyz", "tags":["action", "twins"], "gender":"Male"}
and so on.
The JSON data above has many other parameters too, but I want to take the tags parameter into consideration.
I want to predict the gender. Please help me out with algorithms or ideas. I am currently using Python, with Spark to load the big data.
You can read all of your data into a sparse matrix. The code below was built from the brief data example you provided and will produce a sparse matrix where each record is a row and each column is the count of how many times a term appears in the list of tags for that record. The vocabulary dict provides a mapping of terms to their column index in the final matrix. While looping over the dataset and counting the tags, a separate list, targets, is also built with the outcome variables. In the end you should be able to use mat and targets to train your classifier.
import scipy.sparse

idx_pointer = [0]  # row boundaries into indices / mat_data
indices = []       # column index of each stored entry
mat_data = []      # value of each stored entry
vocabulary = {}    # tag -> column index
targets = []
for d in data:
    targets.append(d['gender'])
    for t in d['tags']:
        index = vocabulary.setdefault(t, len(vocabulary))
        indices.append(index)
        mat_data.append(1)
    idx_pointer.append(len(indices))
mat = scipy.sparse.csr_matrix((mat_data, indices, idx_pointer), dtype=int)
Using the example input you provided the dense output would be a matrix like below.
night  horror  action  twins
    1       1       0      0
    0       0       1      1
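To see how the three CSR arrays encode that matrix, the sketch below expands them into dense rows by hand, without SciPy. `csr_to_dense` is a hypothetical helper; the array values are what the loop above produces for the two example records.

```python
def csr_to_dense(data, indices, indptr, n_cols):
    """Expand CSR arrays into a list of dense rows, summing duplicates."""
    rows = []
    for r in range(len(indptr) - 1):
        row = [0] * n_cols
        # Entries of row r live in the slice indptr[r]:indptr[r + 1].
        for k in range(indptr[r], indptr[r + 1]):
            row[indices[k]] += data[k]
        rows.append(row)
    return rows


# Columns: night=0, horror=1, action=2, twins=3
dense = csr_to_dense([1, 1, 1, 1], [0, 1, 2, 3], [0, 2, 4], n_cols=4)
print(dense)
```

A tag that occurs twice in one record adds two entries with the same column index, which is why the stored values end up as counts.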
I want to read more than just numeric features and labels from each line of the input CSV data file. My features and labels are both numeric, but I also want to use a date-like or string-like identifier for each line so I can refer back to the line after a random-shuffle batch read.
Since TensorFlow requires a record_defaults array that I pre-populate programmatically, how can I make TF read mixed types in one line? If I pre-populate record_defaults with floats, I get an error for cells that contain dates/strings. If I pre-populate record_defaults with strings, with a view to casting the numeric parts into floats later, I get an error that TF doesn't support string-to-float casting.
The docs state:
A list of Tensor objects with types from: float32, int32, int64, string. One tensor per column of the input record, with either a scalar default value for that column or empty if the column is required
As per above, how can I specify 'empty' if I just want all columns as required and skip the whole defaults headache? And in that case how would TF know what data types to pick automatically for each cell? I have lots of columns, so it's not practical to write it out one-by-one as is done in the one example in the docs:
record_defaults = [[1], [1], [1], [1], [1]]
(I have hundreds of cells!)
My code looks something like this:
rDefaults = [[0.02] for row in range((300))]
reader = tf.TextLineReader(skip_header_lines=False)
_, csv_row = reader.read(filename_queue)
data = tf.decode_csv(csv_row, record_defaults=rDefaults)
dateLbl = tf.slice(data, [0], [TD]) # <- this is where I'm having the problem
features = tf.slice(data, [TD], [TS])
label = tf.slice(data, [TS], [TL])
Thanks for any input!
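One way to think about it: going by the docs quoted above, record_defaults takes one entry per column, so a mixed list can still be built programmatically, with a string default for the identifier column and float defaults for the rest. The plain-Python sketch below only imitates that per-column typing and is not TensorFlow code. `decode_line`, the column count and the sample row are made up for illustration.

```python
import csv

N_NUMERIC = 4  # assumed number of numeric columns (hundreds in practice)

# One default per column: a string for the identifier, floats for the rest.
record_defaults = [[""]] + [[0.0] for _ in range(N_NUMERIC)]


def decode_line(line, defaults):
    """Parse one CSV line, casting each cell to the type of its default."""
    cells = next(csv.reader([line]))
    parsed = []
    for cell, default in zip(cells, defaults):
        cast = type(default[0])
        # An empty cell falls back to the column's default value.
        parsed.append(cast(cell) if cell != "" else default[0])
    return parsed


row = decode_line("2016-01-31,0.5,1.25,,3.0", record_defaults)
date_label, values = row[0], row[1:]
print(date_label, values)
```

Because the decoded columns come back as a list with one entry per column, ordinary list slicing separates the identifier from the numeric part. tf.decode_csv also returns a list of per-column tensors, so the analogous step there would presumably be Python list slicing rather than tf.slice, though that is untested here.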