Hello dear forum members,
I have a data set of 20 million randomly collected individual tweets (no two tweets come from the same account). Let me refer to this as the "general" data set. I also have another, "specific" data set of 100,000 tweets collected from drug (opioid) abusers. Each tweet has at least one tag associated with it, e.g., opioids, addiction, overdose, hydrocodone, etc. (max 25 tags).
My goal is to use the "specific" data set to train the model using Keras and then use it to tag tweets in the "general" data set to identify tweets that might have been written by drug abusers.
Following examples in source1 and source2, I managed to build a simple working version of such a model:
# deduplicated imports; only keras (not tensorflow.python.keras) is needed here
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelBinarizer
from sklearn.metrics import confusion_matrix
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.preprocessing import text
# load the opioid-specific data set, where 'post' is a tweet and 'tags' is a single tag associated with it
# how would I include multiple tags to be used in training?
data = pd.read_csv("filename.csv")
# make sure columns are strings before splitting and tokenizing
data['post'] = data['post'].astype(str)
data['tags'] = data['tags'].astype(str)
# 80/20 train/test split
train_size = int(len(data) * .8)
train_posts = data['post'][:train_size]
train_tags = data['tags'][:train_size]
test_posts = data['post'][train_size:]
test_tags = data['tags'][train_size:]
# tokenize tweets: num_words caps the vocabulary at the most frequent words seen
# during fit; texts_to_matrix then yields a bag-of-words matrix of shape
# (num_samples, vocab_size)
vocab_size = 100000 # what does vocabulary size really mean?
tokenize = text.Tokenizer(num_words=vocab_size)
tokenize.fit_on_texts(train_posts)
x_train = tokenize.texts_to_matrix(train_posts)
x_test = tokenize.texts_to_matrix(test_posts)
# labeling
# is this where I add more columns with tags for training?
encoder = LabelBinarizer()
encoder.fit(train_tags)
y_train = encoder.transform(train_tags)
y_test = encoder.transform(test_tags)
# model building
batch_size = 32
model = Sequential()
model.add(Dense(512, input_shape=(vocab_size,)))
model.add(Activation('relu'))
# LabelBinarizer one-hot encodes the tags, so the output layer needs one unit per
# distinct tag, not a hard-coded 1865
# (the original 'np.max(y_train) + 1' counts classes for integer labels;
# with one-hot labels use len(encoder.classes_))
num_labels = len(encoder.classes_)
model.add(Dense(num_labels))
model.add(Activation('softmax'))
# y_train is one-hot encoded, so use categorical (not sparse) crossentropy
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
history = model.fit(x_train, y_train, batch_size = batch_size, epochs = 5, verbose = 1, validation_split = 0.1)
# test prediction accuracy
score = model.evaluate(x_test, y_test,
batch_size=batch_size, verbose=1)
print('Test score:', score[0])
print('Test accuracy:', score[1])
# make predictions using a test set
text_labels = encoder.classes_
for i in range(1000):
    prediction = model.predict(np.array([x_test[i]]))
    predicted_label = text_labels[np.argmax(prediction[0])]
    print(test_posts.iloc[i][:50], "...")
    print('Actual label: ' + test_tags.iloc[i])
    print("Predicted label: " + predicted_label)
In order to move forward, I would like to clarify a few things:
Let's say all my training tweets have a single tag -- opioids. If I then pass non-tagged tweets through the model, isn't it likely that it simply tags all of them as opioids, since it doesn't know anything else? Should I therefore use a variety of different tweets/tags for training? Are there any general guidelines for selecting tweets/tags for training purposes?
How can I add more columns with tags for training (rather than the single one used in the code)?
Once I train the model and achieve appropriate accuracy, how do I pass non-tagged tweets through it to make predictions?
How do I add a confusion matrix?
Any other relevant feedback is also greatly appreciated.
Thanks!
Examples of "general" tweets:
everybody messages me when im in class but never communicates on the weekends like this when im free. feels like that anyway lol.
i woke up late, and now i look like shit. im the type of person who will still be early to whatever, ill just look like i just woke up.
Examples of "specific" tweets:
$2 million grant to educate clinicians who prescribe opioids
early and regular marijuana use is associated with use of other illicit drugs, including opioids
My shot at this is:
1. Create a new dataset with tweets from the general + specific data, say 200K-250K rows, where 100K come from your specific data set and the rest from the general one.
2. Take your 25 keywords/tags and write a rule: if one or more of them appears in a tweet, it is DA (Drug Abuser), otherwise NDA (Non Drug Abuser). This will be your dependent variable. (A rough sketch of this labeling step follows this list.)
3. Your new dataset will be one column with all the tweets and another column with the dependent variable saying whether it is DA or NDA.
4. Now divide it into train/test and use Keras or any other algorithm so that it can learn.
5. Then test the model by plotting a confusion matrix.
6. Pass the remaining tweets from the general data set through this model and check the results. Even if a tweet contains new words beyond the 25 keywords that do not appear in the specific dataset, the model you built will still try to intelligently guess the right category from the group of words that occur together, tone, etc.
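A rough sketch of the labeling step, assuming hypothetical DataFrame names (specific_df, general_df) with the 'post' column from the question; the keyword set is illustrative, not the actual 25 tags:
import pandas as pd

# illustrative keywords; substitute the real 25 tags
KEYWORDS = {'opioids', 'addiction', 'overdose', 'hydrocodone'}

def label_tweet(tweet):
    # DA if any keyword occurs in the tweet, NDA otherwise
    return 'DA' if set(str(tweet).lower().split()) & KEYWORDS else 'NDA'

# ~100K specific tweets plus a sample of general tweets, as suggested above
combined = pd.concat([specific_df, general_df.sample(n=150000)], ignore_index=True)
combined['label'] = combined['post'].apply(label_tweet)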
Related
I trained a model with a Random Forest classifier and saved it using pickle. Then, in a different Python file, I preprocessed a sentence from input (I vectorized it with Bag of Words and then TF-IDF). After that I used train_test_split with the parameter test_size=1 to make this sentence look like test data. When I give this test data to my trained model, it says:
ValueError: X has 14 features, but RandomForestClassifier is expecting 148409 features as input
It's probably because I used a whole dataset to train my model and now it's only 1 sample. But how am I supposed to use my model if a 1-sample array (or matrix) doesn't have the same shape as an array with thousands of samples from the dataset?
Shapes while training:
train dataset features size: (23588, 148409)
train dataset label size: (23588,)
test dataset features size: (10110, 148409)
test dataset label size: (10110,)
Shape of one sentence when I try to use my model (as an example):
text_test shape (15, 14)
Code in training (building) python file:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
vectorizer = CountVectorizer()
BoW_transformer = vectorizer.fit(data['Text'])
BoW_data = BoW_transformer.transform(data['Text'])
tf_idf_transformer = TfidfTransformer().fit(BoW_data)
data_tf_idf = tf_idf_transformer.transform(BoW_data)
text_train, text_test, label_train, label_test = train_test_split(
    data_tf_idf, data['Label'], test_size=0.3
)
print(f"train dataset features size: {text_train.shape}")
print(f"train dataset label size: {label_train.shape}")
print(f"test dataset features size: {text_test.shape}")
print(f"test dataset label size: {label_test.shape}")
RF_classifier = RandomForestClassifier()
RF_classifier.fit(text_train, label_train)
predict_train = RF_classifier.predict(text_train)
predict_test = RF_classifier.predict(text_test)
Code in 'use' python file:
import pickle
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer
vectorizer = CountVectorizer()
BoW_transformer = vectorizer.fit(input_string)
BoW_data = BoW_transformer.transform(input_string)
tf_idf_transformer = TfidfTransformer().fit(BoW_data)
data_tf_idf = tf_idf_transformer.transform(BoW_data)
text_test, label_test = train_test_split(
    data_tf_idf, test_size=1
)
print("text_test shape", text_test.shape)
with open("saved_model.pickle", 'rb') as f:
    RF_classifier = pickle.load(f)
predict_test = RF_classifier.predict(text_test)
I tried to put the messages in an array when calling fit(), but I either get an error or my computer freezes (my RAM is probably not enough to train the model with numpy arrays). I also tried to reshape it, but I cannot reshape an array with 210 elements into an array with 3,000,000...
I got an answer on the Russian Stack Overflow, so if this helped you, please upvote either both answers or only his answer: https://ru.stackoverflow.com/questions/1493423/%d0%9f%d0%be%d1%87%d0%b5%d0%bc%d1%83-%d0%b2%d0%be%d0%b7%d0%bd%d0%b8%d0%ba%d0%b0%d0%b5%d1%82-%d0%be%d1%88%d0%b8%d0%b1%d0%ba%d0%b0-valueerror
Translated:
The problem is in the preparation of the features: you need to prepare all features identically.
If you are using transformers with internal state and calling fit() on the data you used to train the model, then you need to save the transformers with pickle, just like your model. When you want to make a prediction, read the transformers back and use only transform(), never fit().
There are also transformers without internal state. For example, you can use HashingVectorizer() instead of CountVectorizer(). HashingVectorizer() has no internal state, so it consumes less memory than CountVectorizer() and you don't need to save it; you just need to initialize it with the same arguments.
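A minimal sketch of that advice, reusing the variable names from the question (the pickle filename is an assumption):
import pickle

# --- training file: persist everything with learned state, not just the model
with open("transformers.pickle", "wb") as f:
    pickle.dump((BoW_transformer, tf_idf_transformer), f)

# --- 'use' file: restore the fitted transformers and call transform() only
with open("transformers.pickle", "rb") as f:
    BoW_transformer, tf_idf_transformer = pickle.load(f)

# transform() expects an iterable of documents, hence the list around the string
features = tf_idf_transformer.transform(BoW_transformer.transform([input_string]))
predict_test = RF_classifier.predict(features)  # now 148409 features, as the model expects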
I have a dataset with multiple features, and I am trying to build an SVM model to classify new entries based on these features. To go about this, I chose CountVectorizer to convert the text data into numerical data for training. I understand how to train a model on each feature separately, but I'm having difficulty understanding how to train on them together.
Category  Lyric                                    Song_title
Rock      Master of puppets pulling the strings    Master of puppets
Rock      Let the bodies hit the floor             Bodies
Pop       dreaming about the things we could be.   Counting Stars
Pop       Im glad you came Im glad you came        NULL
[2000 rows x 3 columns]
To simplify certain steps, I decided to use built-in functions to generate the data sets.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import svm
from sklearn.linear_model import LogisticRegression
data = pd.read_excel('./music_data.xlsx',0)
train_data, test_data = train_test_split(data,test_size=0.53)
As both columns contain null values, I thought to separate the columns into two training sets and train the models with the associated categories.
lyric_train = train_data[~pd.isnull(train_data['Lyric'])]
lyric_test = test_data[~pd.isnull(test_data['Lyric'])]
vectorizer_lyric = CountVectorizer(analyzer='word', ngram_range=(1, 5))
vc_lyric = vectorizer_lyric.fit_transform(lyric_train['Lyric'])
song_title_train = train_data[~pd.isnull(train_data['Song_title'])]
song_title_test = test_data[~pd.isnull(test_data['Song_title'])]
vectorizer_song = CountVectorizer(analyzer='word', ngram_range=(1, 5))
vc_song = vectorizer_song.fit_transform(song_title_train['Song_title'])
Then I build the models and try to combine them using a stacking classifier.
# Train for the lyric feature
model_lyric = svm.SVC()
model_lyric.fit(vc_lyric, lyric_train['Category'])
features_test_lyric = vectorizer_lyric.transform(lyric_test['Lyric'])
model_lyric.score(features_test_lyric, lyric_test['Category'])
# Train for the song-title feature
model_song = svm.SVC()
model_song.fit(vc_song, song_title_train['Category'])
features_test_song = vectorizer_song.transform(song_title_test['Song_title'])
model_song.score(features_test_song, song_title_test['Category'])
# Combine the SVM models
from sklearn.ensemble import StackingClassifier  # import was missing in the original
estimators = [('lyric_svm', model_lyric),
              ('song_svm', model_song)]
stack_model = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())
From reading up online, this is not the correct way to do it, as StackingClassifier expects all of its estimators to be fit on the same dataset and features, whereas I had separated the features into two CountVectorizers.
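One way to make the stack consistent (a sketch, not a verified solution) is to give each SVC its own pipeline that selects and vectorizes a single column, so every estimator in StackingClassifier can be fit on the same two-column DataFrame:
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import StackingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn import svm

def column_svm(column):
    # vectorize one text column, then classify it with an SVM
    return Pipeline([
        ('vec', ColumnTransformer([('cv', CountVectorizer(analyzer='word', ngram_range=(1, 5)), column)])),
        ('clf', svm.SVC()),
    ])

X = data[['Lyric', 'Song_title']].fillna('')  # CountVectorizer cannot handle NaN
stack_model = StackingClassifier(
    estimators=[('lyric_svm', column_svm('Lyric')), ('song_svm', column_svm('Song_title'))],
    final_estimator=LogisticRegression(),
)
stack_model.fit(X, data['Category'])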
I have extracted word embeddings of two different texts (title and description) and want to train an XGBoost model on both embeddings. The embeddings are 200-dimensional each, as can be seen below:
I was able to train the model on one embedding and it worked perfectly like this:
import numpy as np
from sklearn.model_selection import cross_validate
from xgboost import XGBClassifier

x = df['FastText'] # training features: one 200-dim vector per row
y = df['Category'] # target variable
# Defining the model
model = XGBClassifier(objective='multi:softprob')
# Evaluation metrics
score = ['accuracy', 'precision_macro', 'recall_macro', 'f1_macro']
# Model training with 5-fold cross validation; np.vstack stacks the
# per-row vectors into a (n_samples, 200) matrix
scores = cross_validate(model, np.vstack(x), y, cv=5, scoring=score)
Now I want to use both features for training, but it gives me an error if I pass two columns of df like this:
x=df[['FastText_Title','FastText']]
One solution I tried is adding the two embeddings element-wise (x1 + x2), but it decreases accuracy significantly. How do I use both features in the cross_validate function?
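A common alternative to element-wise addition is column-wise concatenation, which keeps all 400 dimensions instead of summing them away; a minimal sketch, assuming each cell of the two columns holds a 200-dim vector:
import numpy as np
from sklearn.model_selection import cross_validate
from xgboost import XGBClassifier

x_title = np.vstack(df['FastText_Title'])  # (n_samples, 200)
x_desc = np.vstack(df['FastText'])         # (n_samples, 200)
x = np.hstack([x_title, x_desc])           # (n_samples, 400)
y = df['Category']

model = XGBClassifier(objective='multi:softprob')
scores = cross_validate(model, x, y, cv=5,
                        scoring=['accuracy', 'precision_macro', 'recall_macro', 'f1_macro'])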
In the past for multiple inputs, I've done this:
features = ['FastText_Title', 'FastText']
x = df[features]
y = df['Category']
This creates an array containing both feature sets.
I usually also need to scale the data using MinMaxScaler once the new array has been made (see the sketch below).
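A minimal sketch of that scaling step, assuming x has already been stacked into a 2-D numeric array:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
x_scaled = scaler.fit_transform(x)  # rescales every column to the [0, 1] range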
According to the error you are getting, it seems there is something wrong with the types. Try this; it will convert your features to numeric and it should work:
df['FastText'] = pd.to_numeric(df['FastText'])
df['FastText_Title'] = pd.to_numeric(df['FastText_Title'])
I am an electrical engineer looking for a way to estimate the DC current of a permanent-magnet synchronous motor, so I decided to try ANN solutions with Keras. Long story short, I'll show you a screenshot of some measured signals.
The first 5 signals are the measured signals; the last one is the DC current, which I want to estimate (its value was recorded with a current clamp). I started building a model in Python and tried some things that I assumed would increase its accuracy, but after all that I am not getting very good results from the model, and my hope is that maybe I am choosing the wrong parameters or a model that is not ideal for this purpose.
Here is my code:
import numpy as np
from keras.layers import Dense, LSTM
from keras.models import Sequential
from keras.callbacks import EarlyStopping
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from matplotlib import pyplot as plt
import seaborn as sns
# Import input (x) and output (y) data and assign them to X and Y
df = pd.read_csv('train_data.csv')
df = df[['rpm','iq','uq','udc','idc']]
X = df[df.columns[:-1]]
Y = df.idc
plt.figure()
sns.heatmap(df.corr(),annot=True)
plt.show()
# Split the data into input (x) training and testing data, and output (y) training and testing data,
# with training data being 80% of the data and testing data being the remaining 20%
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)#, shuffle=True)
# Scale both training and testing input data
X_train = preprocessing.maxabs_scale(X_train)
X_test = preprocessing.maxabs_scale(X_test)
model = Sequential()
# no activation is specified, so all three Dense layers are purely linear;
# input_shape is only needed on the first layer
model.add(Dense(4, input_shape=(4,)))
model.add(Dense(4))
model.add(Dense(1))
# note: 'accuracy' is not a meaningful metric for a regression target
model.compile(optimizer="adam", loss="msle", metrics=['mean_squared_logarithmic_error','accuracy'])
# Pass several parameters to 'EarlyStopping' function and assign it to 'earlystopper'
earlystopper = EarlyStopping(monitor='val_loss', min_delta=0, patience=15, verbose=1, mode='auto')
model.summary()
history = model.fit(X_train, y_train, epochs = 2000, validation_split = 0.3, verbose = 2, callbacks = [earlystopper])
# Run the model with its current weights on the training and testing data
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
# Calculates and prints r2 score of training and testing data
print("The R2 score on the Train set is:\t{:0.3f}".format(r2_score(y_train, y_train_pred)))
print("The R2 score on the Test set is:\t{:0.3f}".format(r2_score(y_test, y_test_pred)))
df = pd.read_csv('test_two_data.csv')
df = df[['rpm','iq','uq','udc','idc']]
X = df[df.columns[:-1]]
Y = df.idc
X_validate = preprocessing.maxabs_scale(X)
y_pred = model.predict(X_validate)
plt.plot(Y)
plt.plot(y_pred)
plt.show()
(weight_0,bias_0) = model.layers[0].get_weights()
(weight_1,bias_1) = model.layers[1].get_weights()
One limitation is that I can't use LSTM layers or other complex algorithms, because I later need to implement the trained model on a microcontroller in a motor application.
I hope you can give me some pointers for making my model a little more accurate.
At the end, here is a figure where I show you the worst prediction performance: orange is the prediction and blue is the measured current.
The training dataset was this one.
The correlation between the individual values can be found here. Since the values of id and ud have no correlation with idc, I decided to delete them.
The most important thing to keep in mind when trying to improve the accuracy of the model is to ALWAYS normalise the input data, which basically means rescaling real-valued numeric attributes into the range 0 to 1. I am not able to understand the way you are providing the training data to the model; could you please explain that? It would help in understanding and identifying the scope for higher accuracy.
Now if we talk about parameters, I would suggest adding a tuning algorithm to get an optimized value for each parameter.
It is also good practice to include hidden layers, which can provide better feature extraction; a minimal sketch follows.
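For illustration only (the layer sizes are assumptions, not tuned values), a sketch that scales the inputs to [0, 1] and gives the otherwise all-linear stack from the question non-linear hidden layers:
from keras.models import Sequential
from keras.layers import Dense
from sklearn.preprocessing import MinMaxScaler

# learn scaling statistics on the training set only, then reuse them
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = Sequential()
model.add(Dense(32, activation='relu', input_shape=(4,)))  # ReLU adds the non-linearity
model.add(Dense(16, activation='relu'))                    # the original linear stack lacked
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')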
I was training a model that uses 8 features to predict the probability of a room being sold.
Region: the region the room belongs to (an integer between 1 and 10)
Date: the date of stay (an integer between 1 and 365; here we consider only one-day requests)
Weekday: day of week (an integer between 1 and 7)
Apartment: whether the listing is a whole apartment (1) or just a room (0)
#beds: the number of beds in the room (an integer between 1 and 4)
Review: average review of the seller (a continuous variable between 1 and 5)
Pic Quality: quality of the picture of the room (a continuous variable between 0 and 1)
Price: the historic posted price of the room (a continuous variable)
Accept: whether this post got accepted (someone took it, 1) or not (0) in the end
Column Accept is the "y"; hence this is a binary classification problem.
We plotted the data; some of it was skewed, so we applied a power transform.
We tried a neural network, ExtraTrees, XGBoost, gradient boosting, and a random forest. They all gave about 0.77 AUC. However, when we tried them on the test set, the AUC dropped to 0.55 with a precision of 27%.
I am not sure where it went wrong, but my thinking was that the reason may be the mixing of discrete and continuous data, especially since some features are either 0 or 1.
Can anyone help?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
import warnings
warnings.filterwarnings('ignore')
df_train = pd.read_csv('case2_training.csv')
X, y = df_train.iloc[:, 1:-1], df_train.iloc[:, -1]
y = y.astype(np.float32)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
from sklearn.preprocessing import PowerTransformer
pt = PowerTransformer()
transform_list = ['Pic Quality', 'Review', 'Price']
X_train[transform_list] = pt.fit_transform(X_train[transform_list])
X_test[transform_list] = pt.transform(X_test[transform_list])
for i in transform_list:
    df = X_train[i]
    ax = df.plot.hist()
    ax.set_title(i)
    plt.show()
# Normalization
sc = MinMaxScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
X_train = X_train.astype(np.float32)
X_test = X_test.astype(np.float32)
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=123, n_estimators=50)
clf.fit(X_train,y_train)
yhat = clf.predict_proba(X_test)
# AUC on the held-out test split
test_auc = roc_auc_score(y_test, yhat[:, -1])
print("AUC", test_auc)
from sklearn.ensemble import GradientBoostingClassifier
clf = GradientBoostingClassifier(random_state=123, n_estimators=50)
clf.fit(X_train,y_train)
yhat = clf.predict_proba(X_test)
# AUC on the held-out test split
test_auc = roc_auc_score(y_test, yhat[:, -1])
print("AUC", test_auc)
from torch import nn
from skorch import NeuralNetBinaryClassifier
import torch
model = nn.Sequential(
    nn.Linear(8, 64),
    nn.BatchNorm1d(64),
    nn.GELU(),
    nn.Linear(64, 32),
    nn.BatchNorm1d(32),
    nn.GELU(),
    nn.Linear(32, 16),
    nn.BatchNorm1d(16),
    nn.GELU(),
    nn.Linear(16, 1),
    # nn.Sigmoid()  # omitted: NeuralNetBinaryClassifier applies the sigmoid itself
)
net = NeuralNetBinaryClassifier(
    model,
    max_epochs=100,
    lr=0.1,
    optimizer=torch.optim.Adam,
    # Shuffle training data on each epoch
    iterator_train__shuffle=True,
)
net.fit(X_train, y_train)
from xgboost.sklearn import XGBClassifier
clf = XGBClassifier(silent=0,
                    learning_rate=0.01,
                    min_child_weight=1,
                    max_depth=6,
                    objective='binary:logistic',
                    n_estimators=500,
                    seed=1000)
clf.fit(X_train,y_train)
yhat = clf.predict_proba(X_test)
# AUC on the held-out test split
test_auc = roc_auc_score(y_test, yhat[:, -1])
print("AUC", test_auc)
Here is a screenshot of the sample data.
This is the fundamental first step of data analytics. You need to do two things here:
Data understanding: do the data fields in their current format make sense (data types, value ranges, etc.)?
Data preparation: what should you do to update these data fields before passing them to the model? Which inputs do you think will be useful to the model, and which will provide little benefit? Are there outliers you need to consider/handle?
A good book if you're starting in the field of data analytics is Fundamentals of Machine Learning for Predictive Data Analytics (I have no affiliation with this book).
Looking at your dataset, there are a couple of things you could try to see how they influence your prediction results:
Unless region order actually ranks regions by importance/value, I would change Region to a one-hot encoded feature; you can do this in sklearn (or pandas, as sketched after this list). Otherwise you run the risk of your model thinking that regions with a higher number (say 10) are more important than regions with a lower number (say 1).
You could attempt to normalise certain features if they are much larger than some of your other data fields (see "Why Data Normalization is necessary for Machine Learning models").
Consider looking at the Kaggle competition House Prices: Advanced Regression Techniques. It's doing a similar thing to what you're attempting to do, and it might have some pointers for how you should approach the problem in the Notebooks and Discussion tabs.
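A small sketch of the one-hot idea from the first suggestion (using pandas; the column and file names come from the question):
import pandas as pd

df_train = pd.read_csv('case2_training.csv')
# expand Region (integers 1-10) into ten 0/1 indicator columns so the model
# cannot read a spurious ordering into the region codes
df_train = pd.get_dummies(df_train, columns=['Region'], prefix='Region')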
Without deeply exploring all the data you are using, it is hard to say for certain what is causing the drop in AUC when moving from your training set to the testing set. It is unlikely to be caused by the mixed discrete/continuous data.
The drop suggests that your models are over-fitting to your training data (and therefore not transferring well). This could be caused by too many learned parameters given the amount of data you have, which is more often a problem with neural networks than with some of the other methods you mentioned. Or the problem could be in the way the data was split into training/testing: if the two distributions differ significantly (in a way that is maybe not obvious), you wouldn't expect the testing performance to be as good. If it were me, I'd look carefully at how the data was split (assuming you have a reasonably large set of data), and I'd repeat the experiments with a number of random training/testing splits (search for k-fold cross validation if you're not familiar with it; a small sketch follows).
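A small sketch of that check, reusing the variables already defined in the question:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# AUC across 5 different splits; large variation between folds points at the
# split (or too little data) rather than at the model itself
clf = RandomForestClassifier(random_state=123, n_estimators=50)
print(cross_val_score(clf, X, y, cv=5, scoring='roc_auc'))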
Your model is overfitting. Try a simpler model first and use lower parameter values, for example as sketched below. For tree-based classifiers, scaling has no impact on the model.
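A hypothetical example of lowering capacity (the values are illustrative, not tuned):
from sklearn.ensemble import RandomForestClassifier

# shallower trees and fewer of them reduce the capacity to memorize noise
clf = RandomForestClassifier(max_depth=5, n_estimators=50, random_state=123)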