Hello, I'm still a beginner with TensorFlow. Below is my code: I'm trying to run a text classification DNN, and so far everything works fine.
I want to save my model and load it back so I can use it to predict new values, but I have no idea how to do that.
To give you a general idea of what I'm trying to do:
I have 2 folders (training & test), and each folder contains 4 subfolders (the classification categories).
import tensorflow as tf
import tensorflow_hub as hub
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import re
import seaborn as sns
import logging
print("Loading all files from directory ...")
# Load all files from a directory in a DataFrame.
def load_directory_data(directory):
    data = {}
    data["sentence"] = []
    data["tnemitnes"] = []
    print("getting in a loop")
    for file_path in os.listdir(directory):
        with tf.gfile.GFile(os.path.join(directory, file_path), "r") as f:
            print("directory : ", directory)
            print("file path : ", file_path)
            data["sentence"].append(f.read())
            data["tnemitnes"].append(re.match(r"(\d+)\.txt", file_path).group(1))
    return pd.DataFrame.from_dict(data)
print("merging all files in the training set ...")
# Merge all types of email examples, add a polarity column and shuffle.
def load_dataset(directory):
    pos_df = load_directory_data(os.path.join(directory, "train/br"))
    neg_df = load_directory_data(os.path.join(directory, "train/mi"))
    dos_df = load_directory_data(os.path.join(directory, "train/Brouillons"))
    nos_df = load_directory_data(os.path.join(directory, "train/favoris"))
    pos_df["polarity"] = 3
    neg_df["polarity"] = 2
    dos_df["polarity"] = 1
    nos_df["polarity"] = 0
    return pd.concat([pos_df, neg_df, dos_df, nos_df]).sample(frac=1).reset_index(drop=True)
print("Getting the data from files ...")
# Download and process the dataset files.
def download_and_load_datasets():
    train_df = load_dataset(os.path.dirname("train"))
    test_df = load_dataset(os.path.dirname("test"))
    return train_df, test_df
print("configurring all logging output ...")
# Reduce logging output. ERROR
#logging.set_verbosity(tf.logging.INFO)
logging.getLogger().setLevel(logging.INFO)
print("Setting Up the data for the trainning ...")
train_df, test_df = download_and_load_datasets()
train_df.head()
print("Setting Up a Training input on the whole training set with no limit on training epochs ...")
# Training input on the whole training set with no limit on training epochs.
train_input_fn = tf.estimator.inputs.pandas_input_fn(train_df, train_df["polarity"], num_epochs=None, shuffle=True)
print("Setting Up a Prediction on the whole training set ...")
# Prediction on the whole training set.
predict_train_input_fn = tf.estimator.inputs.pandas_input_fn(train_df, train_df["polarity"], shuffle=False)
print("Setting Up a Prediction on the test set ...")
# Prediction on the test set.
predict_test_input_fn = tf.estimator.inputs.pandas_input_fn(test_df, test_df["polarity"], shuffle=False)
print("Removal of punctuation and splitting on spaces from the data ...")
#The module is responsible for preprocessing of sentences (e.g. removal of punctuation and splitting on spaces).
embedded_text_feature_column = hub.text_embedding_column(key="sentence", module_spec="https://tfhub.dev/google/nnlm-en-dim128/1")
print("Setting Up The Classifier ...")
# Estimator: for classification I used a DNN Classifier.
estimator = tf.estimator.DNNClassifier(
    hidden_units=[10, 20],
    feature_columns=[embedded_text_feature_column],
    n_classes=4,
    optimizer=tf.train.AdagradOptimizer(learning_rate=0.003))
print("Starting the Training ...")
# Train for 20 steps; with the default batch size of 128 this covers
# 20 * 128 training examples.
estimator.train(input_fn=train_input_fn, steps=20)
print("the Training had ended...")
print("setting Up the results ...")
train_eval_result = estimator.evaluate(input_fn=predict_train_input_fn)
test_eval_result = estimator.evaluate(input_fn=predict_test_input_fn)
print("Showing the results ...")
print("Training set accuracy: {accuracy}".format(**train_eval_result))
print("Test set accuracy: {accuracy}".format(**test_eval_result))
# this is where I'm having trouble !!! <====
tf.estimator.export(
    os.path.dirname("Model"),
    serving_input_fn,
    default_output_alternative_key=None,
    assets_extra=None,
    as_text=False,
    checkpoint_path=None,
    graph_rewrite_specs=(GraphRewriteSpec((tag_constants.SERVING,), ()),),
    strip_default_attrs=False
)
Now after I have added the estimator export function, it asks me to provide a serving_input_fn, and to be honest I found it hard to understand how to create one.
If there is an easier way, that would be better.
You can easily get a serving_input_fn with tf.estimator.export.build_parsing_serving_input_receiver_fn (link)
In your case do something like:
feature_spec = tf.feature_column.make_parse_example_spec([embedded_text_feature_column])
serving_input_fn = tf.estimator.export.build_parsing_serving_input_receiver_fn(feature_spec)
If you expect to pass tensors directly there's also build_raw_serving_input_receiver_fn in the same package.
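For illustration, here is a minimal, hedged sketch of the raw variant (not from the original answer): it serves plain string tensors keyed by the "sentence" feature used above, instead of serialized tf.Example protos.
# Hedged sketch: serve raw "sentence" strings directly (TF 1.x Estimator API).
raw_serving_input_fn = tf.estimator.export.build_raw_serving_input_receiver_fn({
    "sentence": tf.placeholder(dtype=tf.string, shape=[None], name="sentence")
})
estimator.export_savedmodel("Model", raw_serving_input_fn)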
All I had to do was add model_dir=os.path.join(os.getcwd(), 'Model') to the estimator.
This is the new code; I created a new folder and named it Model.
estimator = tf.estimator.DNNClassifier(
    hidden_units=[10, 20],
    feature_columns=[embedded_text_feature_column],
    n_classes=4,
    optimizer=tf.train.AdagradOptimizer(learning_rate=0.003),
    model_dir=os.path.join(os.getcwd(), 'Model'))
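As a follow-up (not part of the original answer): once checkpoints live in that folder, recreating the estimator with the same model_dir in a later run restores the latest checkpoint, and you can predict on new data. A minimal sketch, where new_df is a hypothetical DataFrame with a "sentence" column:
# Hedged sketch: reload from the checkpoints in model_dir and predict on new text.
estimator = tf.estimator.DNNClassifier(
    hidden_units=[10, 20],
    feature_columns=[embedded_text_feature_column],
    n_classes=4,
    model_dir=os.path.join(os.getcwd(), 'Model'))

predict_input_fn = tf.estimator.inputs.pandas_input_fn(new_df, shuffle=False)  # new_df is hypothetical
for prediction in estimator.predict(input_fn=predict_input_fn):
    print(prediction["class_ids"], prediction["probabilities"])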
You might want to read this first:
Tensorflow: how to save/restore a model?
A serving_input_receiver_fn needs to be defined.
https://www.tensorflow.org/api_docs/python/tf/estimator/export/build_parsing_serving_input_receiver_fn
This document introduces a useful method for building the serving_input_receiver_fn.
Here is an example:
# First you should prepare feature_spec; it contains the specification of your feature columns.
feature_spec = tf.feature_column.make_parse_example_spec(my_feature_columns)
print(feature_spec)

serving_input_receiver_fn = tf.estimator.export.build_parsing_serving_input_receiver_fn(feature_spec)
export_model = classifier.export_savedmodel('./iris/', serving_input_receiver_fn)
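To actually query the exported SavedModel afterwards, one option in TF 1.x is tf.contrib.predictor. This is a hedged sketch, not part of the answer above; the exact feed key depends on the exported signature, so it is read from feed_tensors rather than hard-coded, and the "sentence" feature matches the text model from the question.
from tensorflow.contrib import predictor

# export_savedmodel returns the export directory as bytes, hence the decode.
predict_fn = predictor.from_saved_model(export_model.decode("utf-8"))
print(predict_fn.feed_tensors)  # shows the expected input key(s)

# With a parsing serving_input_receiver_fn, the model expects serialized tf.Example protos.
example = tf.train.Example(features=tf.train.Features(feature={
    "sentence": tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"some new text to classify"]))
}))
input_key = list(predict_fn.feed_tensors.keys())[0]
print(predict_fn({input_key: [example.SerializeToString()]}))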
I am using DistilBERT to do sentiment analysis on my dataset. The dataset contains text and a label for each row, which identifies whether the text is a positive or negative movie review (e.g. 1 = positive and 0 = negative). Here is the code from the huggingface documentation (https://huggingface.co/transformers/custom_datasets.html?highlight=imdb)
#This dataset can be explored in the Hugging Face model hub (IMDb), and can be alternatively downloaded with the 🤗 Datasets library with load_dataset("imdb").
wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
tar -xf aclImdb_v1.tar.gz
#This data is organized into pos and neg folders with one text file per example. Let’s write a function that can read this in.
from pathlib import Path

def read_imdb_split(split_dir):
    split_dir = Path(split_dir)
    texts = []
    labels = []
    for label_dir in ["pos", "neg"]:
        for text_file in (split_dir/label_dir).iterdir():
            texts.append(text_file.read_text())
            labels.append(0 if label_dir == "neg" else 1)  # use == for string comparison
    return texts, labels
train_texts, train_labels = read_imdb_split('aclImdb/train')
test_texts, test_labels = read_imdb_split('aclImdb/test')
from sklearn.model_selection import train_test_split
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.2)
from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)
import torch

class IMDbDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = IMDbDataset(train_encodings, train_labels)
val_dataset = IMDbDataset(val_encodings, val_labels)
test_dataset = IMDbDataset(test_encodings, test_labels)
#Now that our datasets are ready, we can fine-tune a model either with the 🤗 Trainer/TFTrainer or with native PyTorch/TensorFlow. See training.
#Fine-tuning with Trainer
#The steps above prepared the datasets in the way that the trainer expects. Now all we need to do is create a model to fine-tune, define the TrainingArguments/TFTrainingArguments and instantiate a Trainer/TFTrainer.
from transformers import DistilBertForSequenceClassification, Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
)

model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")

trainer = Trainer(
    model=model,                  # the instantiated 🤗 Transformers model to be trained
    args=training_args,           # training arguments, defined above
    train_dataset=train_dataset,  # training dataset
    eval_dataset=val_dataset      # evaluation dataset
)

trainer.train()
#We can also train with PyTorch/TensorFlow
from torch.utils.data import DataLoader
from transformers import DistilBertForSequenceClassification, AdamW

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
model.to(device)
model.train()

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)

optim = AdamW(model.parameters(), lr=5e-5)

for epoch in range(3):
    for batch in train_loader:
        optim.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs[0]
        loss.backward()
        optim.step()

model.eval()
I want to test this model on a new piece of data. I have a dataframe which contains a piece of text/review in each row, and I want to predict its label. Does anyone know how I would go about doing that? I apologize, I am very new to this and would greatly appreciate any help! I tried taking in the text, cleaning it, and then doing
prediction = model.predict(text)
and I got an error saying DistilBERT has no attribute .predict.
If you just want to use the model, you can use the corresponding pipeline:
from transformers import pipeline
classifier = pipeline('sentiment-analysis')
Then you can use it:
classifier("I hate this book")
The code that you've shared from the documentation essentially covers the training and evaluation loop. Note that it contains two ways of fine-tuning: one with the Trainer, which also includes evaluation, and one with native PyTorch/TF, which covers only the training portion and not the evaluation portion.
Here is how the native method can be tweaked to generate predictions on the test set:
# Put model in evaluation mode
model.eval()

# Tracking variables for storing ground truth and predictions
predictions, true_labels = [], []

# Wrap the test set in a DataLoader so that examples arrive already batched
from torch.utils.data import DataLoader
test_loader = DataLoader(test_dataset, batch_size=64)

# Prediction Loop
for batch in test_loader:
    # Unpack the inputs from our dataloader and move to GPU/accelerator
    input_ids = batch['input_ids'].to(device)
    attention_mask = batch['attention_mask'].to(device)
    labels = batch['labels'].to(device)

    # Telling the model not to compute or store gradients, saving memory and
    # speeding up prediction
    with torch.no_grad():
        # Forward pass, calculate logit predictions
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        # When labels are passed, the first output is the loss, so the logits are the second element.
        logits = outputs[1]

    # Move logits and labels to CPU
    logits = logits.detach().cpu().numpy()
    label_ids = labels.to('cpu').numpy()

    # Store predictions and true labels
    predictions.append(logits)
    true_labels.append(label_ids)
After this loop executes, predictions will contain the logits, i.e., the model's raw, unnormalized scores (before any softmax).
You can use the following to pick the label with the maximum score from the logits and produce a classification report:
import numpy as np
from sklearn.metrics import classification_report, accuracy_score

# Combine the results across all batches.
flat_predictions = np.concatenate(predictions, axis=0)

# For each sample, pick the label (0 or 1) with the higher score.
flat_predictions = np.argmax(flat_predictions, axis=1).flatten()

# Combine the correct labels for each batch into a single list.
flat_true_labels = np.concatenate(true_labels, axis=0)

# Accuracy
print(accuracy_score(flat_true_labels, flat_predictions))

# Classification Report
report = classification_report(flat_true_labels, flat_predictions)
print(report)
For a more elegant way of performing predictions, you can create a BERTModel Class that would contain different methods and variables for handling the tokenization, creation of dataloader, running the predictions, etc.
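Since the original question was about a DataFrame of new reviews, here is a minimal sketch of that idea (new_df and its 'text' column are hypothetical names, and the model/tokenizer/device variables come from the training code above):
import torch

model.eval()
texts = new_df['text'].tolist()  # hypothetical DataFrame of unseen reviews
encodings = tokenizer(texts, truncation=True, padding=True, return_tensors='pt').to(device)
with torch.no_grad():
    outputs = model(**encodings)
logits = outputs[0]  # no labels passed, so the first output is the logits
predicted_labels = logits.argmax(dim=-1).cpu().numpy()  # 0 = negative, 1 = positive
print(predicted_labels)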
You can try code like this example: Link-BERT
You'll need to arrange the dataset according to what the BERT model expects. In section D of that link, you can just change the model name and plug in your own dataset.
I'm learning to use Detectron2. I've followed this link to create a custom object detector.
My training code -
# training Detectron2
from detectron2.engine import DefaultTrainer
from detectron2.config import get_cfg
import os
cfg = get_cfg()
cfg.merge_from_file("./detectron2_repo/configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
cfg.DATASETS.TRAIN = ("pedestrian",)
cfg.DATASETS.TEST = () # no metrics implemented for this dataset
cfg.DATALOADER.NUM_WORKERS = 2
cfg.MODEL.WEIGHTS = "detectron2://COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x/137849600/model_final_f10217.pkl" # initialize from model zoo
cfg.SOLVER.IMS_PER_BATCH = 2
cfg.SOLVER.BASE_LR = 0.02
cfg.SOLVER.MAX_ITER = 300 # 300 iterations seems good enough, but you can certainly train longer
cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 128 # faster, and good enough for this dataset
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1
os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
It saves a log file in the output dir, so I can use TensorBoard to show the training accuracy -
%load_ext tensorboard
%tensorboard --logdir output
It works fine and I can see my model's training accuracy. But when testing/validating the model -
cfg.MODEL.WEIGHTS = os.path.join(cfg.OUTPUT_DIR, "model_final.pth")
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.7 # set the testing threshold for this model
cfg.DATASETS.TEST = ("pedestrian_day", )
predictor = DefaultPredictor(cfg)
Although from the Detectron2 tutorial I've got -
from detectron2.evaluation import COCOEvaluator, inference_on_dataset
from detectron2.data import build_detection_test_loader
evaluator = COCOEvaluator("pedestrian_day", cfg, False, output_dir="./output/")
val_loader = build_detection_test_loader(cfg, "pedestrian_day", mapper=None)
inference_on_dataset(trainer.model, val_loader, evaluator)
but this gives the AP, AP50, AP75, APm, APl and APs for both training and testing.
My question is: how can I see the testing accuracy in TensorBoard, like the training accuracy?
By default, evaluation during training is disabled. If you would like to enable it, you have to set the parameter below:
# set eval step interval (number of iterations between evaluations; 100 is just an example)
cfg.TEST.EVAL_PERIOD = 100
But for evaluation to work, you have to provide a build_evaluator function (the default one in detectron2/engine/defaults.py is not implemented).
An example of a build_evaluator function is provided in the tools/train_net.py script of the https://github.com/facebookresearch/detectron2 repo.
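For reference, a hedged sketch of that pattern (adapted from the idea in tools/train_net.py, not copied verbatim): subclass DefaultTrainer and override build_evaluator so that evaluation on cfg.DATASETS.TEST runs every cfg.TEST.EVAL_PERIOD iterations and its metrics land in the same TensorBoard logs.
import os
from detectron2.engine import DefaultTrainer
from detectron2.evaluation import COCOEvaluator

class TrainerWithEval(DefaultTrainer):
    @classmethod
    def build_evaluator(cls, cfg, dataset_name, output_folder=None):
        # Write COCO metrics for the test dataset under the output dir.
        if output_folder is None:
            output_folder = os.path.join(cfg.OUTPUT_DIR, "eval")
        return COCOEvaluator(dataset_name, cfg, False, output_folder)

cfg.DATASETS.TEST = ("pedestrian_day",)
cfg.TEST.EVAL_PERIOD = 100  # example interval: evaluate every 100 iterations
trainer = TrainerWithEval(cfg)
trainer.resume_or_load(resume=False)
trainer.train()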
This issue in detectron2 discusses creating a custom LossEvalHook to monitor the eval loss, which sounds like a good approach to try.
I'm making a speech recognition model with an input of shape (56088, 22050, 1), which as a whole can be loaded from a .npy file (~5 GB in size) into memory, but I wanted to figure out a better way. I came across the Keras fit_generator() method, but most examples were based on MNIST and used the ImageDataGenerator() function. I realised that I had to write a custom generator function, but I wasn't really sure how. As per this thread, I based my generator function on the one there and ended up with the code below, but I still have to load the entire dataset into memory, which takes a lot of time. Plus, I'm not sure the program runs at all, because it didn't output anything during the first 20 minutes I ran it.
Is there any other way out?
import librosa
import glob
import tensorflow as tf
import os
import numpy as np
class_list, X_train, Y_train = [],[],[]
filename = "D:\\SpeechRecognitionData\\train\\audio\\"
class_names = os.listdir(filename)
print(class_names)
for classes in class_names:
    if classes == '_background_noise_':
        continue
    else:
        class_list.append(''.join(filename+classes))

print(class_list, "\n", len(class_list))

def create_X(address):
    wave, sr = librosa.load(address)
    wave.reshape(-1, 1)
    yield wave

def getLabel(filename):
    base_name = os.path.basename(filename)
    return base_name

def onehot(Y_train):
    from sklearn import preprocessing
    enc = preprocessing.OneHotEncoder()
    Y_train = Y_train.reshape(-1, 1)
    enc.fit(Y_train)
    Y_train = enc.transform(Y_train).toarray()
    return Y_train

def execute(X_train, Y_train):
    loop = 0
    for i in class_list:
        c = 0
        loop += 1
        for file in glob.glob("".join(i+"\\*.wav")):  # iterating through each .wav audio file in the directory to create training data
            if np.array(list(create_X(file))).shape[0] == 22050:
                c += 1
                Y_train.append(class_names.index(getLabel(i)))
                X_train.append(create_X(file))
                if c % 100 == 0:
                    print("{} files processed in loop {}".format(c, loop))
    while 1:
        for i in range(1558):  # 36*1558 = 56088
            if i % 125 == 0:
                print("i= "+str(i))
            yield np.array(X_train[i*36:(i+1)*36]).reshape(X_train.shape[0], X_train.shape[1], 1), onehot(np.array(Y_train[i*36:(i+1)*36]))
input_shape = (22050,1)
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Conv1D(16,activation='relu',input_shape=input_shape,kernel_size=(10)))
model.add(tf.keras.layers.MaxPool1D())
model.add(tf.keras.layers.Dropout(0.2))
model.add(tf.keras.layers.Conv1D(32,activation='relu',kernel_size=(10)))
model.add(tf.keras.layers.MaxPool1D())
model.add(tf.keras.layers.Dropout(0.2))
model.add(tf.keras.layers.Conv1D(16,activation='relu',kernel_size=(10)))
model.add(tf.keras.layers.MaxPool1D())
model.add(tf.keras.layers.Dropout(0.2))
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(128,activation='relu'))
model.add(tf.keras.layers.Dense(64,activation='relu'))
model.add(tf.keras.layers.Dense(30,activation='softmax'))
model.compile(optimizer='adam',loss='categorical_crossentropy',metrics=['accuracy'])
generator = execute(X_train,Y_train)
model.fit_generator(generator,steps_per_epoch=56088//36,shuffle=True)
model.save("model.h5")
So I figured it out by looking at this example here- https://github.com/tjh48/keras_generators/blob/master/keras_generator_example.ipynb
If someone comes across this, they can refer to my notebook:
https://github.com/DarshanDeshpande/Speech-Recognition/blob/master/SpeechRecognitionWithGenerators.ipynb
Thanks!
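For anyone who doesn't want to open the notebook, the core idea can be sketched with tf.keras.utils.Sequence, which loads one batch of audio files at a time instead of the whole 5 GB array (file_paths and label_indices are hypothetical lists built the same way as in the question, and clips are assumed to be fixed-length 1-second recordings at 22050 Hz):
import numpy as np
import librosa
import tensorflow as tf

class AudioSequence(tf.keras.utils.Sequence):
    """Hedged sketch: load one batch of .wav files per __getitem__ call."""
    def __init__(self, file_paths, labels, num_classes, batch_size=36):
        self.file_paths, self.labels = file_paths, labels
        self.num_classes, self.batch_size = num_classes, batch_size

    def __len__(self):
        return int(np.ceil(len(self.file_paths) / self.batch_size))

    def __getitem__(self, idx):
        paths = self.file_paths[idx * self.batch_size:(idx + 1) * self.batch_size]
        labels = self.labels[idx * self.batch_size:(idx + 1) * self.batch_size]
        # Assumes every clip is exactly 1 second long at 22050 Hz, as in the question.
        waves = np.stack([librosa.load(p, sr=22050, duration=1.0)[0] for p in paths])
        return waves[..., np.newaxis], tf.keras.utils.to_categorical(labels, self.num_classes)

# usage: model.fit_generator(AudioSequence(file_paths, label_indices, num_classes=30), epochs=5)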
I'm trying to move my scikit-learn python script into tensorflow code. Keep getting stuck with errors. Please help!
import pandas as pd
import numpy as np
import tensorflow as tf
# read csv
df = pd.read_csv("/Downloads/iris-2.csv", header=0)
# get header names as array
features = list(df.columns.values)
label = features.pop()
classes = len(df[label].unique())
# encode target
X = df[features]
y = df[label]
# convert feature headers into tf
for index, value in enumerate(features):
    features[index] = tf.feature_column.numeric_column(value)
# initialize classifier
classifier = tf.estimator.DNNClassifier(
    feature_columns=features,
    hidden_units=[10, 10],
    n_classes=classes)
# train the classifier
dataset = tf.data.Dataset.from_tensor_slices((dict(X), y))
dataset = dataset.shuffle(1000).repeat().batch(0)
data = dataset.make_one_shot_iterator().get_next()
classifier.train(input_fn=lambda:data,steps=3)
predictions = classifier.predict([5.1,3.0,4.2,1.2])
print(predictions)
Latest error I'm stuck on is:
ValueError: Passed Tensor("dnn/head/weighted_loss/Sum:0", shape=(), dtype=float32) should have graph attribute that is equal to current graph <tensorflow.python.framework.ops.Graph object at 0x10dd9a190>.
Here's the dataset I'm using: https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/d546eaee765268bf2f487608c537c05e22e4b221/iris.csv
The input tensors (the variables data and dataset) cannot be precomputed. They need to be created inside the function passed as input_fn in the call to train, so that the tensors live in the graph that the Estimator (classifier) builds during the call to train(). So for your last block you could use:
# train the classifier
def my_input_fn():
    dataset = tf.data.Dataset.from_tensor_slices((dict(X), y))
    dataset = dataset.shuffle(1000).repeat().batch(32)  # note: the batch size must be a positive integer, not 0
    return dataset.make_one_shot_iterator().get_next()
classifier.train(input_fn=my_input_fn, steps=3)
predictions = classifier.predict([5.1,3.0,4.2,1.2])
print(predictions)
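One more note, not from the original answer: Estimator.predict also expects an input_fn rather than a plain list, so once training runs you could predict on a new sample with something like this hedged sketch (the feature values are just illustrative):
import numpy as np

# Build a one-row feature dict keyed by the CSV's feature column names.
sample = {name: np.array([value]) for name, value in zip(X.columns, [5.1, 3.0, 4.2, 1.2])}
predict_input_fn = tf.estimator.inputs.numpy_input_fn(x=sample, shuffle=False)
for pred in classifier.predict(input_fn=predict_input_fn):
    print(pred["class_ids"], pred["probabilities"])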
I have about 60 thousand samples of size 200x870, they are all numpy arrays and I want to build a four-dimensional tensor out of them (with one singleton dimension) and train them with a CNN in tensorflow. Up to this point, I was using data that I could just load and create batches as below:
with tf.Graph().as_default():
    data_train = tf.to_float(getInput.data_train)
    phase, lr = tf.placeholder(tf.bool), tf.placeholder(tf.float32)
    global_step = tf.Variable(0, trainable=False)
    image_train, label_train = tf.train.slice_input_producer([data_train, labels_train], num_epochs=args.num_epochs)
    images_train, batch_labels_train = tf.train.batch([image_train, label_train], batch_size=args.bsize)
Can someone suggest a way to get around this?
I wanted to split the dataset into subsets and, within one epoch, train on one subset after the other, using a Queue for the paths of these files:
import scipy.io as sc
import numpy as np
import threading
import time
import tensorflow as tf
from tensorflow.python.client import timeline

def testQueues():
    paths = ['data1', 'data2', 'data3', 'data4', 'data5']
    queue_capacity = 6
    bsize = 10
    num_epochs = 2

    filename_queue = tf.FIFOQueue(
        #min_after_dequeue=0,
        capacity=queue_capacity,
        dtypes=tf.string,
        shapes=[[]]
    )
    filenames_placeholder = tf.placeholder(dtype='string', shape=(None))
    filenames_enqueue_op = filename_queue.enqueue_many(filenames_placeholder)
    data_train, phase = tf.placeholder(tf.float32), tf.placeholder(tf.bool)

    sess = tf.Session()
    sess.run(filenames_enqueue_op, feed_dict={filenames_placeholder: paths})

    for i in range(len(paths)):
        train_set_batch_name = sess.run(filename_queue.dequeue())
        train_set_batch_name = train_set_batch_name.decode('utf-8')
        train_set_batch = np.load(train_set_batch_name + '.npy')
        train_set_batch = tf.cast(train_set_batch, tf.float32)
        init_op = tf.group(tf.initialize_all_variables(), tf.initialize_local_variables())
        sess.run(init_op)
        run_one_epoch(train_set_batch, sess)

    size = sess.run(filename_queue.size())
    print(size)
    print(train_set_batch)

def run_one_epoch(train_set, sess):
    image_train = tf.train.slice_input_producer([train_set], num_epochs=1)
    images_train = tf.train.batch(image_train, batch_size=10)
    x = tf.nn.relu(images_train)

    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    try:
        while not coord.should_stop():
            sess.run(x)
    except tf.errors.OutOfRangeError:
        pass
    finally:
        # When done, ask the threads to stop.
        coord.request_stop()
    coord.join(threads)

testQueues()
However I get an error
FailedPreconditionError: Attempting to use uninitialized value input_producer/input_producer/fraction_of_32_full/limit_epochs/epochs
[[Node: input_producer/input_producer/fraction_of_32_full/limit_epochs/CountUpTo = CountUpTo[T=DT_INT64, _class=["loc:#input_producer/input_producer/fraction_of_32_full/limit_epochs/epochs"], limit=1, _device="/job:localhost/replica:0/task:0/cpu:0"](input_producer/input_producer/fraction_of_32_full/limit_epochs/epochs)]]
Also, it seems I can't feed the dictionary with a tf.Tensor, only with a numpy array, but casting it to a tf.Tensor later is also troublesome.
Have a look at Dataset api.
"The tf.data API enables you to build complex input pipelines from simple, reusable pieces."
In this approach what you do is you model your graph such that it handles data for you and pulls in limited data at a time for you to train your model on.
If the memory issue still persists, you might want to look into using a generator to create your tf.data.Dataset, as sketched below. A further step could be to speed up the process by preparing TFRecords to create your Dataset.
Follow all the links to learn more and feel free to comment if you don't understand something.
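For example, here is a hedged sketch of a generator-backed Dataset that pulls in one .npy shard at a time (the data1..data5 file names follow the question; the per-sample shape of (200, 870) is taken from the problem description):
import numpy as np
import tensorflow as tf

paths = ['data1', 'data2', 'data3', 'data4', 'data5']

def shard_generator():
    # Load only one shard into memory at a time and yield individual samples.
    for path in paths:
        shard = np.load(path + '.npy')  # assumed shape: (n_samples, 200, 870)
        for sample in shard:
            yield sample.astype(np.float32)

dataset = (tf.data.Dataset.from_generator(shard_generator,
                                          output_types=tf.float32,
                                          output_shapes=(200, 870))
           .map(lambda x: tf.expand_dims(x, -1))  # add the singleton channel dimension
           .batch(32)
           .prefetch(1))
next_batch = dataset.make_one_shot_iterator().get_next()  # feed this tensor into the CNN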
For data that doesn't fit into memory, the standard solution is to use Queues. You can set up some ops that read directly from files (CSV files, image files) and feed them into TensorFlow -- https://www.tensorflow.org/versions/r0.11/how_tos/reading_data/index.html
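A minimal sketch of that queue-based pattern from the TF 1.x era (the CSV file names and the five-column layout are placeholders, not from the question):
import tensorflow as tf

# Legacy TF 1.x queue pipeline: read CSV rows straight from disk and batch them in background threads.
filename_queue = tf.train.string_input_producer(["file0.csv", "file1.csv"])
reader = tf.TextLineReader()
_, line = reader.read(filename_queue)
# record_defaults sets the column types: four float features and an int label here (placeholder layout).
f1, f2, f3, f4, label = tf.decode_csv(line, record_defaults=[[0.0], [0.0], [0.0], [0.0], [0]])
features = tf.stack([f1, f2, f3, f4])
feature_batch, label_batch = tf.train.shuffle_batch(
    [features, label], batch_size=32, capacity=1000, min_after_dequeue=100)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    print(sess.run([feature_batch, label_batch]))
    coord.request_stop()
    coord.join(threads)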