I am new to machine learning and Python. Recently I have been working with the Amazon Fine Food Reviews data from Kaggle and its accompanying code.
What I don't understand is how the 'partition' method is used here.
Also, what do the last three lines of the code actually do?
%matplotlib inline
import sqlite3
import pandas as pd
import numpy as np
import nltk
import string
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from nltk.stem.porter import PorterStemmer
# using the SQLite Table to read data.
con = sqlite3.connect('./amazon-fine-food-reviews/database.sqlite')
#filtering only positive and negative reviews i.e.
# not taking into consideration those reviews with Score=3
filtered_data = pd.read_sql_query("""
SELECT *
FROM Reviews
WHERE Score != 3
""", con)
# Give reviews with Score > 3 a positive rating, and reviews with
# Score < 3 a negative rating.
def partition(x):
    if x < 3:
        return 'negative'
    return 'positive'
# map scores below 3 to 'negative' and the rest to 'positive'
actualScore = filtered_data['Score']
positiveNegative = actualScore.map(partition)
filtered_data['Score'] = positiveNegative
The last three lines do the following.
actualScore = filtered_data['Score']
creates a Series called actualScore from the Score column of filtered_data.
positiveNegative = actualScore.map(partition)
creates a Series positiveNegative by applying partition to each value, coding values below 3 as 'negative' and the rest as 'positive'.
filtered_data['Score'] = positiveNegative
overwrites the old Score column with the new coded values.
In other words: to replace the Score column in the table with 'positive' or 'negative', the code uses the plain function partition (it is not a pandas method; it is just passed to Series.map). It takes the Score column as a Series called actualScore, maps each value through partition to get 'positive' or 'negative', and then writes positiveNegative back over the Score column.
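If you want to see the same pattern in isolation, here is a minimal sketch; the tiny DataFrame is made up for illustration, and the np.where version at the end is an equivalent vectorized alternative, not part of the original code:
import numpy as np
import pandas as pd
# Hypothetical stand-in for filtered_data, just for illustration
df = pd.DataFrame({'Score': [1, 2, 4, 5]})
def partition(x):
    if x < 3:
        return 'negative'
    return 'positive'
# Series.map applies partition element-wise to the column
df['Score'] = df['Score'].map(partition)
print(df['Score'].tolist())  # ['negative', 'negative', 'positive', 'positive']
# Equivalent vectorized alternative, usually faster on large columns
df2 = pd.DataFrame({'Score': [1, 2, 4, 5]})
df2['Score'] = np.where(df2['Score'] < 3, 'negative', 'positive')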
I have the code below. I am converting a dataset to a dataframe and then back to a dataset, repeating the process once with shuffled data and once with unshuffled data. When I compare the data in the shuffled case, I get False, but when I compare the data in the unshuffled case, I get True. Why is there this discrepancy?
import pandas as pd
import datasets
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from tqdm import tqdm
import os
train_data, test_data = datasets.load_dataset('imdb', split =['train', 'test'],
cache_dir='/media/data_files/github/website_tutorials/data')
train_data_s1, test_data_s1 = datasets.load_dataset('imdb', split =['train[0:500]', 'test[0:500]'],
cache_dir='/media/data_files/github/website_tutorials/data')
print (type (train_data_s1))
print (type (test_data_s1))
# shuffling and taking the first rows from train_data and test_data
# Create a list of indices 0-498 (note: range(0, 499) yields 499 indices, not 500)
l1=[*range(0,499,1)]
# Print the list
print(l1)
train_data_s1_shuffled=train_data.shuffle(seed=2).select(l1)
test_data_s1_shuffled=test_data.shuffle(seed=3).select(l1)
print (type (train_data_s1_shuffled))
print (type (test_data_s1_shuffled))
Why do I get False when I compare the data below, but True in the next block?
from datasets import Dataset
import pandas as pd
#dataset_from_pandas = Dataset.from_pandas(df_pandas)
#https://discuss.huggingface.co/t/how-to-create-custom-classlabels/13650# "basic_sentiment holds values [-1,0,1]
from datasets import ClassLabel
x1=pd.DataFrame(train_data_s1_shuffled)
x2=Dataset.from_pandas(x1).cast_column("label", ClassLabel(num_classes=2, names=['neg', 'pos'], names_file=None, id=None))
print (x2.data==train_data_s1_shuffled.data)#returns false
print (x2.features==train_data_s1_shuffled.features)
# not sure why the data matches in this case but not in the earlier one
x11=pd.DataFrame(train_data)
#print (x11.head())
x21=Dataset.from_pandas(x11).cast_column("label", ClassLabel(num_classes=2, names=['neg', 'pos'], names_file=None, id=None))
print (x21.data==train_data.data)#returns True
print (x21.features==train_data.features)
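A hedged guess at the cause, rather than a definitive answer: in the datasets library, shuffle() and select() are lazy and keep an indices mapping on top of the original Arrow table, so .data on the shuffled dataset can still be the full unshuffled table, while Dataset.from_pandas materializes only the 499 selected rows. The unshuffled train_data has no indices mapping, so its table round-trips unchanged. flatten_indices() materializes the mapping, which lets you test this idea:
# Sketch of a check, assuming the lazy indices mapping is the cause:
# flatten_indices() writes the shuffled/selected rows into a new table.
flattened = train_data_s1_shuffled.flatten_indices()
print(flattened.data == x2.data)  # if True, the indices mapping was the culprit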
I want to split my data into features and labels, where the first six columns determine the seventh. I have selected the first six columns, and that part works perfectly.
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report, confusion_matrix
#Assign column names to the dataset
names=['buying', 'maint', 'doors', 'persons', 'lug_boot','safety', 'class']
# load the dataset in csv format into the pandas dataframe
cardata= pd.read_csv(r'C:\Users\user\Downloads\car.data', names=names)
X = cardata.iloc[:, 0:6]
The above code works perfectly, and when I run
print(X.head())
it prints the first six columns, excluding the last column, which is the one to be predicted.
But the code below does not work as intended: it outputs the same thing as the code above.
y = cardata.select_dtypes(include=[object])
print(y.head())
Please help: I need to assign to y only the last (seventh) column, so that print(y.head()) prints just that column.
Try this:
X, y = cardata.iloc[:, :-1], cardata.iloc[:, -1]
This selects all rows and splits X and y on the last column (index -1): X gets every column except the last, and y gets only the last one. This should get you the result you are looking for.
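Since the columns were given explicit names when the CSV was loaded, an equivalent name-based split (using the 'class' label from the names list above) may read more clearly:
# Select by name instead of position; 'class' is the label column
X = cardata.drop(columns=['class'])
y = cardata['class']
print(y.head())  # now shows only the class column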
I have a dataset that contains a set of research papers. I merged the metadata and the JSON files and created a dataframe. Here is my code:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
import numpy as np
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(merged_df['Title'][39100])
print(X.shape)
query = "How to prevent covid19"
query_vec = vectorize.transform([query])
result = cosine_similarity(X,query_vec).reshape((-1,))
for i in result.argsort()[-10:][::-1]:
    print(merged_df.iloc['Title'][i,0], "--", merged_df.iloc['Title'][i,1])
I want to compute TF-IDF vectors for the titles to handle the query, which helps me find relevant papers.
Why does it raise NameError: name 'merged_df' is not defined?
merged_df is not defined anywhere in the code you posted: the dataframe is never created in this script, so the name is undefined when the vectorizer tries to use it. Make sure the merging step that builds merged_df actually runs in the same session before this code.
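For illustration, here is a minimal sketch of the whole pipeline once merged_df exists. The file name papers.csv is an assumption standing in for your merged data, and the sketch also fixes two small slips in the posted code (vectorize vs. vectorizer, and the iloc indexing):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
# Assumption: a CSV with a 'Title' column stands in for the merged dataframe
merged_df = pd.read_csv('papers.csv')
vectorizer = TfidfVectorizer()
# fit_transform expects an iterable of documents (the whole column),
# not a single string like merged_df['Title'][39100]
X = vectorizer.fit_transform(merged_df['Title'].fillna(''))
query = "How to prevent covid19"
query_vec = vectorizer.transform([query])  # vectorizer, not vectorize
result = cosine_similarity(X, query_vec).reshape((-1,))
for i in result.argsort()[-10:][::-1]:
    print(merged_df['Title'].iloc[i], "--", result[i])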
I am new to Python and I am trying to perform a spline interpolation. My data contains three columns, with a number of rows having NaN in one of the columns. I need to ignore/remove the NaNs without reducing the length of the data. I have tried a number of ways, but each time the length is reduced. Any help or advice would be gratefully received.
import numpy as np
import pandas as pd
import scipy.linalg
import matplotlib.style
import math
data = pd.read_excel('prob_data.xlsx')
# note: these conversions are discarded because the results are never assigned
np.array(data['A'])
np.array(data['B'])
np.array(data['C'])
# dropping NaNs shortens each column by a different amount
x = data['A'][~np.isnan(data['A'])]
print(len(x))
z = data['B'][~np.isnan(data['B'])]
print(len(z))
y = data['C'][~np.isnan(data['C'])]
print(len(y))
You can use the SimpleImputer class to fill the NaNs instead of dropping them, which keeps the length unchanged:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='median')
data = pd.read_excel('prob_data.xlsx')
nice_data = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)
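Since the end goal is a spline interpolation anyway, another option (not from the answer above, and assuming the columns are numeric) is to let pandas interpolate the missing values directly, which also preserves the length:
# Alternative sketch: fill NaNs by spline interpolation instead of the median.
# method='spline' requires SciPy; order=3 is an arbitrary choice here.
data = pd.read_excel('prob_data.xlsx')
filled = data.interpolate(method='spline', order=3)
print(len(filled) == len(data))  # True: no rows are dropped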
I'm using the Titanic data set to predict whether a passenger survived or not using a random forest. This is my code:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn import cross_validation
import matplotlib.pyplot as plt
%matplotlib inline
data=pd.read_csv("C:\\Users\\kabala\\Downloads\\Titanic.csv")
data.isnull().any()
data["Age"]=data1["Age"].fillna(data1["Age"].median())
data["PClass"]=data["PClass"].fillna("3rd")
data["PClass"].isnull().any()
data1.isnull().any()
pd.get_dummies(data.Sex)
# choosing the predictive variables
x=data[["PClass","Age","Sex"]]
# the target variable is y
y=data["Survived"]
modelrandom=RandomForestClassifier(max_depth=3)
modelrandom=cross_validation.cross_val_score(modelrandom,x,y,cv=5)
But I keep getting this error:
ValueError: could not convert string to float: 'female'
and I don't understand what the problem is, because I changed the Sex feature to a dummy.
Thanks :)
pd.get_dummies returns a new data frame and does not do the operation in place. Therefore you really are still sending the raw strings from the Sex column to the model.
So you need something like X = pd.get_dummies(data[['Sex','PClass','Age']], columns=['Sex','PClass']), and this should fix your problem. PClass is also a string column that needs dummy variables, since you fill it with '3rd'.
There are still some places where you call data.isnull().any() without using the result; that does nothing to the underlying dataframe. I left those as they were, but just FYI they may not be doing what you intended.
Full code would be:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation was removed in newer versions
import matplotlib.pyplot as plt
%matplotlib inline
data=pd.read_csv("C:\\Users\\kabala\\Downloads\\Titanic.csv")
data.isnull().any()  # <----- Beware: this is not doing anything to the data
data["Age"]=data["Age"].fillna(data["Age"].median())  # data1 was undefined; use data
data["PClass"]=data["PClass"].fillna("3rd")
data["PClass"].isnull().any()  # <----- Beware: this is not doing anything to the data
data.isnull().any()  # <----- Beware: this is not doing anything to the data
#********Fix for your code*******
X = pd.get_dummies(data[['Sex','PClass','Age']], columns=['Sex','PClass'])
# choosing the predictive variables
# x=data[["PClass","Age","Sex"]]
# the target variable is y
y=data["Survived"]
modelrandom=RandomForestClassifier(max_depth=3)
scores=cross_val_score(modelrandom,X,y,cv=5)  # use the dummied X, not the old x
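As a follow-up: cross_val_score returns one score per fold rather than a fitted model, so scores above is an array of five accuracies; averaging it gives a single summary number:
print(scores)         # five per-fold accuracy scores
print(scores.mean())  # their average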