Hello, I am following a video on Udemy in which we apply a random forest classifier. Before we do so, we convert one of the columns in a data frame into a string. The 'Cabin' column holds values such as "4C"; to reduce the number of unique values, we want to keep only the first character and map it onto a new column 'Cabin_mapped'.
data['Cabin_mapped'] = data['Cabin'].astype(str).str[0]
# this transforms the letters into numbers
cabin_dict = {k: i for i, k in enumerate(data['Cabin_mapped'].unique())}
data.loc[:,'Cabin_mapped'] = data.loc[:,'Cabin_mapped'].map(cabin_dict)
data[['Cabin_mapped', 'Cabin']].head()
The part below simply splits the data into training and test sets. The parameters don't really matter for reproducing the problem.
X_train_less_cat, X_test_less_cat, y_train, y_test = \
train_test_split(data[use_cols].fillna(0), data.Survived,
test_size = 0.3, random_state=0)
I get an error here after the fit, saying it could not convert a string into a float.
rf = RandomForestClassifier(n_estimators=200, random_state=39)
rf.fit(X_train_less_cat, y_train)
It seems I need to convert one of the inputs back into floats to use the random forest algorithm, even though this error does not show up in the tutorial video. If anyone could help me out, that'd be great.
Here's a fully working example; I've highlighted the bit you are missing. You need to convert EVERY column to a number, not just "Cabin".
!wget https://raw.githubusercontent.com/agconti/kaggle-titanic/master/data/train.csv
import pandas as pd
data = pd.read_csv("train.csv")
data['Cabin_mapped'] = data['Cabin'].astype(str).str[0]
# this transforms the letters into numbers
cabin_dict = {k: i for i, k in enumerate(data['Cabin_mapped'].unique())}
data.loc[:,'Cabin_mapped'] = data.loc[:,'Cabin_mapped'].map(cabin_dict)
data[['Cabin_mapped', 'Cabin']].head()
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer scikit-learn
## YOU ARE MISSING THIS BIT, some of your columns are still strings
## they need to be converted to numbers (ints OR floats)
for n, v in data.items():
    if v.dtype == "object":
        data[n] = v.factorize()[0]
## END of the bit you're missing
use_cols = data.drop("Survived",axis=1).columns
X_train_less_cat, X_test_less_cat, y_train, y_test = \
train_test_split(data[use_cols].fillna(0), data.Survived,
test_size = 0.3, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=39)
rf.fit(X_train_less_cat, y_train)
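As an optional sanity check, you could score the fitted model on the held-out split:
# mean accuracy on the 30% test split
print(rf.score(X_test_less_cat, y_test))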
I'm new to Python and trying to do sentiment analysis using VADER.
I pulled data for various artists (13) into individual dataframes, converted the lyrics to words, kept only the unique words, removed stopwords and all that, then put it all into a single df.
# for each artist's cleaned dataframe, keep a single occurrence of each word and collect the results
df_allocate = []
for df in df_all:
    df_clean = cleaning(df)
    df_words = to_unique_words(df_clean)
    df_allocate.append(df_words)
frames = df_allocate
# concatenate the per-artist word lists into a single dataframe
df_main = pd.concat(frames, ignore_index=True)
df_main = df_main.reset_index(drop=True)
Now I'm trying to train a logistic regression model, predict test results and get a confusion matrix.
I think I'm getting confused about how data frames work and also how to train_test_split the data correctly.
Right now, I have:
for column_name in df_all:
    cv = CountVectorizer(max_features=100000)
    X = cv.fit_transform(df_main['Artist']).toarray()
    y = column_name.sentiment
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=20)
    classifier = LogisticRegression(random_state=25)
    classifier.fit(X_train, y_train)
    y_predict = classifier.predict(X_test)
    print_confusionMatrix = confusion_matrix(y_test, y_predict)
    print(print_confusionMatrix)
    print("accuracy score : ", accuracy_score(y_test, y_predict))
When I debug the program I can see why it's complaining, but I don't know how to fix it. I looked up how to iterate through a dataframe and tried
for df in df_all.index:
but it didn't work.
The columns are Artist, Title, Album, Date, Lyric, Year, and sentiment. What I want to accomplish is to iterate through each artist (df_all holds the dataframes of the individual artists, which is why I use it) and get a sentiment prediction for their lyrics, so I can build a confusion matrix for each of the 13 artists.
In a previous try I changed X and y to:
X = cv.fit_transform(df_main).toarray()
y = df_main.sentiment
However, this is where I get the error that X and y must be the same size.
Please push me in the right direction. I'm quite lost.
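For reference, the size error goes away when X and y are derived from the same dataframe, so they have one entry per row. A minimal sketch, assuming df_main has a 'Lyric' text column and a 'sentiment' label per row:
cv = CountVectorizer(max_features=100000)
X = cv.fit_transform(df_main['Lyric']).toarray()  # one feature row per song
y = df_main['sentiment']                          # one label per song, same length as X
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=20)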
I'm trying to learn ML on tweet data.
I convert the tweets via
df['vectorised_words'] = vectorizer.transform(df.tweet)
which gives me a pandas.core.series.Series; my vectorizer is a CountVectorizer.
My X and y are the following:
X = df['vectorised_words']
y = df['is_hate_speech'].astype(int)
where X is text (e.g. "This is a sample tweet") and y is a boolean, True or False.
Then I want to run the following:
svc_1 = SVC(kernel='linear')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=46)
svc_1.fit(X_train, y_train)
However, the fit function gives the following error:
TypeError: float() argument must be a string or a number, not 'csr_matrix'
If I convert the array to a float, I think the vectorization logic will be lost. What am I doing wrong?
It seems you are trying to put a whole sparse matrix into a single pandas dataframe column, which is not the way to go.
Simply define your X as
X = vectorizer.transform(df.tweet.values)
and you should be fine.
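Putting it together, a minimal end-to-end sketch (this assumes the CountVectorizer has already been fitted on the tweet text; otherwise use fit_transform):
X = vectorizer.transform(df.tweet.values)   # sparse matrix, one row per tweet
y = df['is_hate_speech'].astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=46)
svc_1 = SVC(kernel='linear')                # SVC accepts sparse input directly
svc_1.fit(X_train, y_train)
print(svc_1.score(X_test, y_test))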
I want to pre-process dates and use them to train my model in Python. My date format is like this:
22-02-2026
The code I have developed so far is attached below:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
df=pd.read_csv('data.csv')
df['previous_date'] = pd.to_datetime(df['previous_date'])
df['current_date'] = pd.to_datetime(df['current_date'])
df['previous_date_day'] = df['previous_date'].dt.day
df['previous_date_month'] = df['previous_date'].dt.month
df['previous_date_year'] = df['previous_date'].dt.year
df['current_date_day'] = df['current_date'].dt.day
df['current_date_month'] = df['current_date'].dt.month
df['current_date_year'] = df['current_date'].dt.year
X=df.iloc[:,3:]
Y=df['value']
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, np.ravel(y_train))
from sklearn.metrics import accuracy_score
y_pred=clf.predict(X_test)
acc_score=accuracy_score(y_test, y_pred)*100
print("Accuracy Score : " , acc_score)
Based on your comment, you need to convert a date to an ordinal number so that the algorithm can tell the order.
Here is one way to do it:
import datetime
origin = datetime.datetime(1970,1,1)
days = (datetime.datetime.strptime('22-02-2026', '%d-%m-%Y') - origin).days
In this case it's 20506.
I set the origin to the Unix epoch, but you can change it to your liking; it doesn't really matter, since the purpose here is to convey order. The majority of machine learning algorithms will be able to use a feature in this format, but whether it is the best representation depends on the nature of the problem.
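To apply the same conversion to a whole column at once, a sketch reusing the question's previous_date column (already parsed with pd.to_datetime in the question's code):
# days since the Unix epoch, one integer per row
df['previous_date_ordinal'] = (df['previous_date'] - pd.Timestamp('1970-01-01')).dt.days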
As there are many dates that need to be converted to a numeric representation, the first thing to make sure is that the output preserves the date order, as Lukas mentioned. The easiest way to do this is to weight each unit so that one month always outweighs any number of days, and one year outweighs any number of months.
def date2num(date_time):
    d, m, y = date_time.split('-')
    # the weights must be large enough that no amount of a smaller unit can
    # outweigh one step of a larger unit: 31 days < one month step,
    # 12 months < one year step (this is the familiar yyyymmdd encoding)
    num = int(d) + int(m) * 100 + int(y) * 10000
    return num
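A quick sanity check that the weights really preserve chronological order (1 February must come out larger than 31 January):
assert date2num('01-02-2020') > date2num('31-01-2020')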
Now, it's important to normalize the numeric values.
import numpy as np
date_features = []
for d in list(df['date_time']):
    date_features.append(date2num(d))
date_features = np.array(date_features)
date_features_normalized = (date_features - np.min(date_features))/(np.max(date_features) - np.min(date_features))
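For comparison, the same min-max normalization can be done in a couple of pandas lines (assuming the date_time column from above):
s = df['date_time'].map(date2num)
date_features_normalized = ((s - s.min()) / (s.max() - s.min())).to_numpy()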
You wrote in one of the comments on your post:
"I just want to compare 2 dates. If the first date is bigger than the second date I want to predict true, else I want my prediction as false. So my question is how should I pre-process the date to train the Machine Learning model."
You do not need machine learning for this; you can solve it with a simple if/else condition. You really do not need to make things complicated when they are simple!
All you need is this:
if first_date > second_date:
    return True
else:
    return False
Or in your case:
def get_value_for_dates(row):
    if row['first_column'] > row['second_column']:
        return 1
    else:
        return 0
df['value'] = df.apply(get_value_for_dates, axis=1)
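As an aside, the same labels can be computed without apply, which is usually faster on large frames:
# vectorized comparison; True/False cast to 1/0
df['value'] = (df['first_column'] > df['second_column']).astype(int)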
I'm new to data science and Python, and I'm working on a regression problem. My question is: when I try to plot the test part of my target variable, I get a strange-looking plot.
from sklearn.model_selection import train_test_split
train_input, test_input, train_target, test_target = \
    train_test_split(features, target, test_size=0.25, random_state=42)
# Remove the labels from the dataset
plt.xlim(0,100)
plt.plot(test_target , 'g');
Is it because of the random indices attached to test_target? How can I get a continuous curve instead?
If the index of the data is the problem, then use:
df_train = df_train.reset_index()
If you want to reset the index and then set it to another column of df, let's say "A", then do:
df_train = df_train.reset_index().set_index('A')
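For example, either plot the values by position or restore the pre-split row order (a sketch, assuming test_target is a pandas Series):
plt.plot(test_target.reset_index(drop=True), 'g')   # plot by position, ignoring the shuffled index
# or:
plt.plot(test_target.sort_index(), 'g')             # restore the original row order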
I am using scikit-learn to build a classifier that predicts if two sentences are paraphrases or not (e.g. paraphrases: How tall was Einstein vs. What was Albert Einstein's length).
My data consists of 2 columns with strings (phrase pairs) and 1 target column with 0's and 1's (= no paraphrase, paraphrase). I want to try different algorithms.
I expect the last line of code below to fit the model. Instead, the pre-processing Pipeline keeps producing an error I cannot solve: "AttributeError: 'numpy.ndarray' object has no attribute 'lower'."
The code is below; I have isolated the error to the last line shown (for brevity I have excluded the rest). I suspect it is because the target column contains 0s and 1s, which cannot be turned to lower case.
I have tried the answers to similar questions on stackoverflow, but no luck so far.
How can you work around this?
question1                                           question2                                    is_paraphrase
How long was Einstein?                              How tall was Albert Einstein?                1
Does society place too much importance on sports?   How do sports contribute to society?         0
What is a narcissistic personality disorder?        What is narcissistic personality disorder?   1
======
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
para = "paraphrases.tsv"
df = pd.read_csv(para, usecols = [3, 5], nrows = 100, header=0, sep="\t")
y = df["is_paraphrase"].values
X = df.drop("is_paraphrase", axis=1).values
X = X.astype(str) # I have tried this
X = np.char.lower(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3,
random_state = 21, stratify = y)
text_clf = Pipeline([('vect', CountVectorizer()),('tfidf', TfidfTransformer()),
('clf', MultinomialNB())])
text_clf.fit(X_train, y_train)
The error is not because of the last column; it is because your training dataset contains two columns, question1 and question2. Each row of X_train is therefore a list of values, so when CountVectorizer tries to convert a row to lower case it fails: a numpy.ndarray has no lower function.
To overcome this, you need to split the dataset X_train into two parts, say X_train_pt1 and X_train_pt2, perform CountVectorizer on each individually, followed by TfidfTransformer on each individual result. Also ensure that you use the same objects for the transformation on both datasets.
Finally, you stack these two arrays together and give the result as input to your classifier. You can find a similar implementation here.
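A minimal sketch of that two-part approach (it assumes X_train/X_test are two-column string arrays, column 0 = question1 and column 1 = question2, and fits the shared objects on both columns so they see the full vocabulary):
from scipy.sparse import hstack

both_cols = np.concatenate([X_train[:, 0], X_train[:, 1]])
vect = CountVectorizer().fit(both_cols)
tfidf = TfidfTransformer().fit(vect.transform(both_cols))

def vectorize(col):
    # the same fitted objects are reused for every column and every split
    return tfidf.transform(vect.transform(col))

X_train_stacked = hstack([vectorize(X_train[:, 0]), vectorize(X_train[:, 1])])
X_test_stacked = hstack([vectorize(X_test[:, 0]), vectorize(X_test[:, 1])])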
Update: I think the following should be of some help (I admit this code can be further improved for efficiency):
def flat_list(my_list):
    # flatten the two question columns into one list of strings
    return [str(item) for sublist in my_list for item in sublist]

def transform_data(trans_obj_list, dataset_splits):
    # start from the raw text of every split
    splits = [flat_list(ds.astype(str)) for ds in dataset_splits]
    for trfs in trans_obj_list:
        # fit each transformer on the training split only, then apply it to all
        # splits, so the next transformer (e.g. TfidfTransformer) receives the
        # already-vectorized data instead of raw strings
        transformer = trfs().fit(splits[0])
        splits = [transformer.transform(s) for s in splits]
    return splits

new_X_train, new_X_test = transform_data([CountVectorizer, TfidfTransformer], [X_train, X_test])
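With the flattened layout above, the rows of new_X_train alternate between question1 and question2, so one way to finish the "stack these two arrays" step might be (a sketch, reusing y_train/y_test from the question):
from scipy.sparse import hstack

# even rows hold question1, odd rows hold question2 (flat_list is row-major)
train_pairs = hstack([new_X_train[0::2], new_X_train[1::2]])
test_pairs = hstack([new_X_test[0::2], new_X_test[1::2]])

clf = MultinomialNB().fit(train_pairs, y_train)
print(clf.score(test_pairs, y_test))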