I'm using tf to create a sentiment analysis model. Since I'm a noob at machine learning, I followed a guide in the official TensorFlow documentation to train and test a model on the IMDB_reviews dataset. It works pretty well, but I wish I could train it with another dataset.
So I've downloaded this dataset: "movie_review.csv". It contains various columns and I want to access text and tag (where the tag is a positive or negative value and text is the text of the review).
What I want to do is to prepare the CSV as a dataset, access text and tag, vectorize them, and feed them to the network. There is no division between test and train, so I have to divide the file too.
So, I want to know how to:
0- Access the file I've downloaded and transform it into a dataset.
1- Access text and tag in the file, maybe without using pandas. If pandas is recommended and there is a simple way to access the file and pass it to a network using TensorFlow, I'll be okay with that answer.
2- Split the file into a train set and a test set (I've already found a pandas solution for this, actually).
3- Vectorize my text and tag to feed my network.
If you have an entire guide on how to do this, it'll be fine, it just has to use TensorFlow.
Questions 0 to 3 have been answered
Ok so, I have used the posted file to load a dataset and train the model on short sentences, but I'm having trouble with the training.
When I followed the guide to build the model for text classification I came out with this code:
import tensorflow as tf
import tensorflow_datasets as tfds

dataset, info = tfds.load('imdb_reviews/subwords8k', with_info=True, as_supervised=True)
train_dataset, test_dataset = dataset['train'], dataset['test']
encoder = info.features['text'].encoder
BUFFER_SIZE = 10000
BATCH_SIZE = 64
padded_shapes = ([None], ())
train_dataset = train_dataset.shuffle(BUFFER_SIZE).padded_batch(BATCH_SIZE, padded_shapes=padded_shapes)
test_dataset = test_dataset.padded_batch(BATCH_SIZE, padded_shapes=padded_shapes)
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(encoder.vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')])
model.compile(loss='binary_crossentropy', optimizer=tf.keras.optimizers.Adam(1e-4),
metrics=['accuracy'])
# cp_callback is a checkpoint callback defined elsewhere in my code
history = model.fit(train_dataset, epochs=1, validation_data=test_dataset, validation_steps=30, callbacks=[cp_callback])
So, I trained my model this way (some parts are missing; I have included all the fundamental ones). After this, I wanted to train the model with another dataset, and thanks to Andrew I was able to access a dataset I created myself, like this:
csv_dataset = tf.data.experimental.CsvDataset(filepath, default_values, header=header)
def reshape_dataset(txt, tag):
    txt = tf.reshape(txt, shape=(1,))
    tag = tf.reshape(tag, shape=(1,))
    return txt, tag

csv_dataset = csv_dataset.map(reshape_dataset)
training = csv_dataset.take(10)
testing = csv_dataset.skip(10)
And my problem is adapting the dataset to the model I already have. I have tried various solutions, but I get errors on the shapes.
Can somebody be so kind as to explain to me how to do this? Obviously the solution for step 3 has already been posted by Andrew in his file, but I'd like to use my model with the weights I saved during training.
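For concreteness, this is roughly the kind of adaptation I'm after (only a sketch, assuming csv_dataset yields scalar (text, tag) pairs with integer tags and that the text still has to be encoded with the same subwords8k encoder the saved weights were trained on; my shape errors may well come from this part):
def encode(text, tag):
    # encoder is the SubwordTextEncoder from the original IMDB training
    encoded_text = encoder.encode(text.numpy().decode("utf-8"))
    return encoded_text, tag

def encode_map_fn(text, tag):
    encoded_text, tag = tf.py_function(
        encode, inp=[text, tag], Tout=(tf.int64, tf.int64))
    encoded_text.set_shape([None])  # variable-length sequence of token ids
    tag.set_shape([])               # scalar label
    return encoded_text, tag

training = (csv_dataset.take(10)
            .map(encode_map_fn)
            .padded_batch(BATCH_SIZE, padded_shapes=([None], [])))
testing = (csv_dataset.skip(10)
           .map(encode_map_fn)
           .padded_batch(BATCH_SIZE, padded_shapes=([None], [])))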
This sounds like a great place to use Tensorflow's Dataset API. Here's a notebook/tutorial that covers how to do some basic data input and preprocessing stuff, right from Tensorflow's website!
I have also made a notebook with a quick example, answering each of your questions with implementations. You can find that here.
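Here is also a minimal, self-contained sketch (not necessarily what the notebook does) of one way to cover steps 0-3 with tf.data and a Keras TextVectorization layer. The column names "text" and "tag" and the split sizes are assumptions, and in older TF versions TextVectorization lives under tf.keras.layers.experimental.preprocessing:
import tensorflow as tf

# 0/1 - read the CSV into a dataset of (text, tag) pairs
raw = tf.data.experimental.make_csv_dataset(
    "movie_review.csv",
    batch_size=1,                    # un-batched again below
    select_columns=["text", "tag"],
    label_name="tag",
    num_epochs=1,
    shuffle=False)
raw = raw.unbatch().map(lambda features, tag: (features["text"], tag))

# 2 - split into test and train (sizes are arbitrary here)
test_ds = raw.take(1000)
train_ds = raw.skip(1000)

# 3 - vectorize the text; with output_mode="int" the layer pads each batch
#     to the length of its longest sequence
vectorizer = tf.keras.layers.TextVectorization(max_tokens=10000, output_mode="int")
vectorizer.adapt(train_ds.map(lambda text, tag: text).batch(128))

train_ds = train_ds.batch(64).map(lambda text, tag: (vectorizer(text), tag))
test_ds = test_ds.batch(64).map(lambda text, tag: (vectorizer(text), tag))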
Related
I am new to machine learning. I have a dataset with 4000-5000 items; they are all product descriptions, each with a resulting label. For example, I want to train a model to classify them into 1 or 0. Can I train it with this kind of text?
Yes, of course you can! The keyword you are searching for is sentiment analysis. Take a look at this article by huggingface for sentiment analysis in Python, both with a pre-trained model and from scratch.
Using pretrained models
Install the huggingface transformers Python package: pip install -q transformers
Import the sentiment-analysis pipeline provided by huggingface, which already implements publicly available models on huggingface:
from transformers import pipeline
sentiment_pipeline = pipeline("sentiment-analysis")
Inference
data = [PRODUCT_DESCRIPTION_1, PRODUCT_DESCRIPTION_2, ...]
results = sentiment_pipeline(data)
results then contains a list of objects, each with a "label" property ("POSITIVE"/"NEGATIVE") and a "score" (confidence score). In order to retrieve ratings (1-5) you would have to implement some sort of Gaussian random distribution to generate stars based on the positive/negative labels (e.g. positive: mean of 4 with a variance of 1; negative: mean of 2 with a variance of 1).
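As a rough sketch (the means and variances are just the example values above, not anything prescribed by the library), that mapping could look like this:
from random import gauss

def label_to_stars(result):
    # result is one entry of `results`, e.g. {"label": "POSITIVE", "score": 0.99}
    mean = 4 if result["label"] == "POSITIVE" else 2
    stars = round(gauss(mean, 1))      # variance of 1 -> standard deviation of 1
    return min(5, max(1, stars))       # clamp to the 1-5 star range

ratings = [label_to_stars(r) for r in results]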
Fine-tuning model
To achieve better results than using binary classification plus randomness to provide ratings in the form of stars, you would likely need to fine-tune an existing machine learning model for sentiment analysis. Fine-tuning in this context means that you build on an existing model with existing weights and use your own, smaller dataset to fit the existing model to your specific needs. You can do this with the huggingface library as well:
Install the Python packages: pip install datasets transformers huggingface_hub
Preprocess your own dataset by shuffling your samples and splitting off a test and a train set (a sketch of this step follows below).
Tokenize the dataset (our model doesn't understand words, so we need to encode them first; a usual practice is to one-hot encode every single word as a high-dimensional vector or to assign every word an index). As these kinds of machine learning models expect the input to have a predefined length, we need to pad shorter sentences, usually with a special padding token:
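A minimal sketch of the shuffle/split step, assuming your reviews live in a CSV file with "text" and "label" columns (the file name is made up); it also produces the small_train_dataset / small_test_dataset used below:
from datasets import load_dataset

# load your own CSV (hypothetical file name) into a huggingface Dataset
raw_dataset = load_dataset("csv", data_files="reviews.csv")["train"]

# shuffle, then split off 20% as a test set
raw_dataset = raw_dataset.shuffle(seed=42)
split = raw_dataset.train_test_split(test_size=0.2)
small_train_dataset = split["train"]
small_test_dataset = split["test"]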
from transformers import AutoTokenizer
from transformers import DataCollatorWithPadding
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)
tokenized_train = small_train_dataset.map(preprocess_function, batched=True)
tokenized_test = small_test_dataset.map(preprocess_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
We then build on a pretrained model called "distilbert", but with 5 output labels (1 label equals 1 possible rating):
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=5)
Now we can finally start the training process:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="my-model",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,  # adjust this parameter to your desired training length
    weight_decay=0.01,
    save_strategy="epoch",
    push_to_hub=False,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()
trainer.save_model("my-model")
You can then use the model for inference, with the difference that you will receive results with more than 2 classes:
model = AutoModelForSequenceClassification.from_pretrained("my-model")
# the model expects tokenized tensors rather than raw strings
inputs = tokenizer(data, padding=True, truncation=True, return_tensors="pt")
results = model(**inputs)  # logits over the 5 rating classes
Source: huggingface blog
Comment section summary
As there was a misunderstanding about the problem the question author is facing, here is a summary of the discussion in the comments:
The author wants to predict review ratings (1-5 stars) based on product DESCRIPTIONS. In my opinion, descriptions and the resulting reviews aren't related (descriptions are always written positively, to sell the products they describe), so I think no prediction based on the description itself is possible, and further inputs are needed (e.g. overall product "quality", utility of the product, etc.).
I am a beginner with PyTorch. When I read the source code of a project about Mask R-CNN, I don't know where I can get information about some methods that I don't understand. The official documentation doesn't seem very detailed.
# load an instance segmentation model pre-trained on COCO
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
# get number of input features for the classifier
in_features = model.roi_heads.box_predictor.cls_score.in_features
# replace the pre-trained head with a new one
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
Just like in the code above, I could not get detailed information about the "roi_heads" attribute from the model's documentation. Where can I learn about it?
You won't be able to find such thing on the documentation. You'll have to dive into the source code. Object Detection APIs, especially anchor-based two-stage approaches, are a little bit complex, and they tend to have too many components and hyper-parameters. PyTorch team already made an incredible job making this API modular and kinda easy-to-use. In the specific case of roi_heads you can take a look here to learn more about it. In general, all components can be found in torchvision/models/detection.
Anyway, you can always open an issue, requesting them to expand the documentation. Or we can even do it ourselves and make a pull request :)
From the docstring of FasterRCNN, box_predictor returns two things
box_predictor (nn.Module): module that takes the output of box_head
and returns the classification logits and box regression deltas.
So, cls_score is the classification logits.
Box regression deltas are stored in bbox_pred
Next, cls_score is nothing but a fully-connected (Linear) layer with input and output shape defined by in_channels and num_classes (here num_classes=2, according to the tutorial).
self.cls_score = nn.Linear(in_channels, num_classes)
in_channels is the input shape of cls_score. Because we want to replace the pre-trained head, we need to preserve information about in_channels in the new head.
# new head
FastRCNNPredictor(in_features, num_classes)
P.S.: In TensorFlow you can just freeze the learned layers and train only the head. These PyTorch lines of code achieve the same thing.
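For illustration, here is a minimal sketch of that idea in PyTorch: freeze the pretrained parameters and leave only the freshly created head trainable (the optimizer settings are assumptions, not the tutorial's exact code):
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

num_classes = 2  # background + 1 object class, as in the tutorial

# load the COCO-pretrained model and freeze everything it has learned
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
for param in model.parameters():
    param.requires_grad = False

# the replacement head is created from scratch, so its parameters stay trainable
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

# give the optimizer only the parameters that are still trainable
params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.005, momentum=0.9)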
I'm new to machine learning and I am creating one small project using a CountVectorizer model. I split my data 80%-20%: 80% for training the model and 20% for testing it. My model works properly on the 20% test data, but can I use it to test my model on a different dataset that is similar to the training dataset?
I am using joblib to dump and load my model.
from joblib import dump, load

dump(pipe, 'filename')
loaded_model = load('filename')
My question is: how do I directly test my model using a different dataset?
Yes, you can use the model to test similar datasets.
However, you must keep in mind the preprocessing step according to the model.
When you trained your model, it was trained on a particular dimension, and the size of the input would have been an AxB matrix. When you have a new test sentence or a new dataset, you must first do the same preprocessing; otherwise, it will throw dimension-mismatch errors.
Example:
suppose you have the following count vectorizer object
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
then you must first fit it on your training dataset, say:
X = dataframe['text_column_name']
X = cv.fit_transform(X) # Fit the Data
Once this is done, whenever you have a new sentence, say
test_sentence = "this is a test sentence"
then you must use the cv object in the following manner
model_input = cv.transform([test_sentence]).toarray()
and then you can make predictions:
model.predict(model_input)
This method must be followed even if you wish to test a new dataset which is in a data frame or some other file format.
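For example, scoring a whole new dataset that is loaded into a DataFrame might look like this (the file and column names are assumptions, and model is the classifier trained earlier):
import pandas as pd

new_df = pd.read_csv("new_reviews.csv")            # hypothetical file
new_X = cv.transform(new_df["text_column_name"])   # reuse the already-fitted vectorizer; do NOT refit it
predictions = model.predict(new_X)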
I have a folder (on my windows desktop) containing the images I want to use to build my deep learning classifier. I also have one .csv file which has the image number (for example img_1035) and the corresponding class label. How do I load the dataset with the labels into python/jupyter notebooks?
This is the link to the dataset on kaggle (https://www.kaggle.com/debdoot/bdrw).
I would preferably like to use PyTorch to do this but any other ways would also be highly appreciated.
Luckily, PyTorch has a convenient "ImageFolder" class that you can extend to create your own dataset.
Here's an example of a dataset that uses ImageFolder:
import torchvision

class MyDataset(torchvision.datasets.ImageFolder):
    def __init__(self, train_folder_path='.', transform=None, target_transform=None):
        super().__init__(train_folder_path, transform, target_transform)
    # [ Some functions omitted ]
Then you load your set using PyTorch's "DataLoader".
Here's an example for a training set:
import torch

training_set = MyDataset(root_path, transform)
train_loader = torch.utils.data.DataLoader(training_set, batch_size=batch_size, shuffle=True)
Using the train loader you can get batches from your dataset. You can then use these batches to train / validate and so on:
batch = next(iter(train_loader))
images, labels = batch
Training is a rather involved process so I'm not entirely sure how deep you want to dive here. I hope this was a nudge in the right direction.
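If it helps as a nudge, a bare-bones training loop over that loader might look like this (a sketch that assumes you already have a classification model and want cross-entropy loss; it is not a complete recipe):
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)                      # whatever classifier you defined
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(num_epochs):               # num_epochs is up to you
    model.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()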
I was following this Tensorflow tutorial on creating a Convolutional Neural Network.
I'm at the step where the training and test data is read:
def main(unused_argv):
    mnist = learn.datasets.load_dataset("mnist")
    train_data = mnist.train.images  # Returns np.array
    train_labels = np.asarray(mnist.train.labels, dtype=np.int32)
    eval_data = mnist.test.images  # Returns np.array
    eval_labels = np.asarray(mnist.test.labels, dtype=np.int32)
Up to here, everything is fine.
But then suddenly an estimator is created:
mnist_classifier = learn.Estimator(
    model_fn=cnn_model_fn, model_dir="/tmp/mnist_convnet_model")
My questions are:
What is an Estimator?
The previous code doesn't save anything under "/tmp/mnist_convnet_model". How come there is a model saved under that directory?
How did it get there?
EDIT:
When I run the code, I get:
Couldn't find trained model at ../tmp/mnist_convnet_model.
This is because the model isn't found under that directory structure.
How can I put the model there? Also, why do I have to put it there instead of storing it in memory for the execution of the script?
The first question is answered right there in the tutorial. Estimator is "a TensorFlow class for performing high-level model training, evaluation, and inference".
The answer to the second question is that no, nothing is saved to that directory yet. The estimator object will use this directory to save training checkpoints, logs etc. When you run this code the first time, it will not load anything. But once you train the model, it will load the saved state from there.
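For example, in the tf.contrib.learn version of that tutorial, the training call that actually populates the directory looked roughly like this (a sketch; argument names vary between TensorFlow versions):
# training writes checkpoints and event files into model_dir ("/tmp/mnist_convnet_model")
mnist_classifier.fit(
    x=train_data,
    y=train_labels,
    batch_size=100,
    steps=20000)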