NameError: name is not defined in TF-IDF calculation - python

I have a dataset that contains a set of research papers. I merged the metadata and the JSON files and created a dataframe. Here is my code:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
import numpy as np
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(merged_df['Title'][39100])
print(X.shape)
query = "How to prevent covid19"
query_vec = vectorize.transform([query])
result = cosine_similarity(X,query_vec).reshape((-1,))
for i in result.argsort()[-10:][::-1]:
    print(merged_df.iloc['Title'][i,0], "--", merged_df.iloc['Title'][i,1])
I want to calculate the TF-IDF of the titles and match it against the query, so that I can find relevant papers.
Why does it report that the name "merged_df" is not defined?

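Nowhere in the code you posted is merged_df defined. The dataframe is never created in this script (or in the current session), so Python raises a NameError as soon as it reaches merged_df['Title']. Run the step that builds merged_df before the TF-IDF code.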
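A minimal sketch of the intended flow once merged_df exists. The three titles below are made up and stand in for the real merged dataframe, which is not shown in the question. Two other details differ from the posted code: fit_transform is given the whole Title column (it expects an iterable of strings, not a single title), and the query is transformed with vectorizer, the name the fitted object actually has.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-in for the merged metadata/JSON dataframe.
merged_df = pd.DataFrame({
    "Title": [
        "How to prevent covid19 transmission in hospitals",
        "Deep learning for protein folding",
        "Vaccination strategies to prevent influenza",
    ]
})

# Fit TF-IDF on every title, one document per row.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(merged_df["Title"])

# Project the query into the same vocabulary and score each title.
query = "How to prevent covid19"
query_vec = vectorizer.transform([query])
result = cosine_similarity(X, query_vec).reshape((-1,))

# Indices of the best matches, highest similarity first.
top = result.argsort()[-2:][::-1]
for i in top:
    print(merged_df["Title"].iloc[i], "--", result[i])
```

With the real data, replace the stand-in dataframe with the one produced by the merge step and widen the slice (for example [-10:]) to list more results.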

Related

What function do I use to get the cluster labels when running Gaussian Mixture Model with PySpark?

Question as above^
I'm using PySpark to find clusters in a dataset. I have used the following code so far:
# Vectorizing
from pyspark.ml.feature import VectorAssembler
assembler1 = VectorAssembler(
    inputCols=["cons_fraud_prob", "merch_fraud_prob"],
    outputCol="features")
transformed_model = assembler1.transform(model_data)
from pyspark.ml.clustering import GaussianMixture
from pyspark.ml.linalg import Vectors
gmm = GaussianMixture().setK(3).setSeed(14)
model = gmm.fit(transformed_model)
print("Gaussians shown as a DataFrame: ")
model.gaussiansDF.show(truncate=False)
But what I want to do now is include a column that has the cluster that each datapoint belongs to. How can I implement this?

Extracting text features from a dataframe and use them alongside other types of features (heterogenous data) for sklearn purposes: TypeError

I am attempting to extract some features from a dataframe that looks akin to this:
feature1:float feature2:float feature3:string succeeded:boolean
I'm far from an expert on the topic but I attempted the following:
from sklearn.feature_extraction.text import CountVectorizer
import scipy as sp
vectorizer = CountVectorizer()
vectorizer.fit(small_df.feature3)
X = sp.sparse.hstack( (vectorizer.transform(small_df.feature3),
                       small_df[['feature1', 'feature2']]),
                      format='csr')
X_columns = vectorizer.get_feature_names() + df[cols].columns.tolist()
However, I end up with the following error:
TypeError: no supported conversion for types: (dtype('int64'), dtype('O'))
Any help would be appreciated!
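Solution: convert the numeric columns to a float array before stacking, so that hstack does not have to combine an int64 matrix with an object-dtype array:
X = sp.sparse.hstack( (vectorizer.transform(small_df.feature3),
                       small_df[cols].values.astype(float)))
Here cols is presumably the list of numeric column names (['feature1', 'feature2'] in the question). The np.float alias used in older answers was removed in NumPy 1.24; the built-in float does the same thing.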

Numpy Array for SVM model rather than a DataFrame

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np
# Read the data.
data = np.asarray(pd.read_csv('data.csv', header=None))
# Assign the features to the variable X, and the labels to the variable y.
X = data[:,0:2]
y = data[:,2]
# TODO: Create the model and assign it to the variable model.
# Find the right parameters for this model to achieve 100% accuracy on the dataset.
model = SVC()
model.fit(X,y)
Two questions:
The data is converted from a pandas DataFrame (returned by pd.read_csv) into a NumPy array.
Is that better? Is there a good reason for it? Why not stay with the DataFrame?
I do not understand this notation:
X = data[:,0:2]
y = data[:,2]
What does it do?
Thank you.
The data consists of a CSV file with many rows like this:
0.28917,0.65643,0
It has three columns: the first two are the coordinates of a point, and the third is its label.
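A small sketch of what the slicing notation does, using the sample row above plus two made-up rows instead of data.csv. In data[:, 0:2] the first index (:) selects every row and 0:2 selects columns 0 and 1; data[:, 2] selects column 2 of every row and returns a one-dimensional array. As far as the conversion goes, scikit-learn estimators accept DataFrames as well, so np.asarray here is a convenience rather than a requirement.

```python
import numpy as np

# First row is the sample from the question; the other two are made up.
data = np.array([
    [0.28917, 0.65643, 0],
    [0.50000, 0.10000, 1],
    [0.90000, 0.40000, 1],
])

X = data[:, 0:2]  # all rows, columns 0 and 1: the point coordinates
y = data[:, 2]    # all rows, column 2 only: the labels

print(X.shape)  # (3, 2)
print(y.shape)  # (3,)
```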

Cross validation in random forest using anaconda

I'm using the titanic data set to predict if a passenger survived or not using random forest. This is my code:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn import cross_validation
import matplotlib.pyplot as plt
%matplotlib inline
data=pd.read_csv("C:\\Users\\kabala\\Downloads\\Titanic.csv")
data.isnull().any()
data["Age"]=data1["Age"].fillna(data1["Age"].median())
data["PClass"]=data["PClass"].fillna("3rd")
data["PClass"].isnull().any()
data1.isnull().any()
pd.get_dummies(data.Sex)
# choosing the predictive variables
x=data[["PClass","Age","Sex"]]
# the target variable is y
y=data["Survived"]
modelrandom=RandomForestClassifier(max_depth=3)
modelrandom=cross_validation.cross_val_score(modelrandom,x,y,cv=5)
But, I keep on getting this error:
ValueError: could not convert string to float: 'female'
I don't understand what the problem is, because I converted the Sex feature to dummy variables.
Thanks:)
pd.get_dummies returns a new data frame and does not modify the original in place. Because the result is never assigned, you are still passing the Sex column to the model as strings.
So you would need something like X = pd.get_dummies(data[['Sex','PClass','Age']], columns=['Sex','PClass']), and this should fix your problem. PClass is probably also a string column that needs dummy variables, since you fill its missing values with '3rd'.
There are also several places where you call data.isnull().any(). That only reports whether values are missing; it does nothing to the underlying dataframe. I left those calls in, but FYI they may not be doing what you intended.
Full code would be:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
%matplotlib inline
data=pd.read_csv("C:\\Users\\kabala\\Downloads\\Titanic.csv")
data.isnull().any()  # <----- Beware: this is not doing anything to the data
# data1 is never defined in the question; data is assumed to be the intended frame
data["Age"]=data["Age"].fillna(data["Age"].median())
data["PClass"]=data["PClass"].fillna("3rd")
data["PClass"].isnull().any()  # <----- Beware: this is not doing anything to the data
data.isnull().any()  # <----- Beware: this is not doing anything to the data
#********Fix for your code*******
X = pd.get_dummies(data[['Sex','PClass','Age']], columns=['Sex','PClass'])
# choosing the predictive variables
# x=data[["PClass","Age","Sex"]]
# the target variable is y
y=data["Survived"]
modelrandom=RandomForestClassifier(max_depth=3)
# pass the encoded X (capital), and keep the scores separate from the model
scores=cross_val_score(modelrandom,X,y,cv=5)
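A self-contained variant that can be run without the CSV file. The twenty rows below are made up and only mimic the column names from the question; random_state is added so repeated runs give the same forest. It is a sketch of the encoding step under those assumptions, not a reproduction of the Titanic results.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Made-up stand-in for Titanic.csv, with some missing values.
data = pd.DataFrame({
    "PClass": ["1st", "2nd", "3rd", None] * 5,
    "Age": [29.0, None, 40.0, 18.0, 52.0] * 4,
    "Sex": ["female", "male"] * 10,
    "Survived": [1, 0] * 10,
})

# Fill the gaps the same way the question does.
data["Age"] = data["Age"].fillna(data["Age"].median())
data["PClass"] = data["PClass"].fillna("3rd")

# get_dummies returns a new frame, so the result has to be assigned.
X = pd.get_dummies(data[["Sex", "PClass", "Age"]], columns=["Sex", "PClass"])
y = data["Survived"]

model = RandomForestClassifier(max_depth=3, random_state=0)
scores = cross_val_score(model, X, y, cv=5)
print(list(X.columns))
print(scores.mean())
```

Every column of X is numeric after the encoding, which is what removes the "could not convert string to float" error.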

Trying to work out a python code

I am new to machine learning and Python. Recently I have been working with the Amazon Fine Food Reviews data from Kaggle and the code that comes with it.
What I don't understand is how the 'partition' function is used here.
Also, what do the last three lines of code actually do?
%matplotlib inline
import sqlite3
import pandas as pd
import numpy as np
import nltk
import string
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from nltk.stem.porter import PorterStemmer
# using the SQLite Table to read data.
con = sqlite3.connect('./amazon-fine-food-reviews/database.sqlite')
#filtering only positive and negative reviews i.e.
# not taking into consideration those reviews with Score=3
filtered_data = pd.read_sql_query("""
SELECT *
FROM Reviews
WHERE Score != 3
""", con)
# Give reviews with Score>3 a positive rating, and reviews with a
# Score<3 a negative rating.
def partition(x):
    if x < 3:
        return 'negative'
    return 'positive'
# changing scores less than 3 to 'negative' and scores greater than 3 to 'positive'
actualScore = filtered_data['Score']
positiveNegative = actualScore.map(partition)
filtered_data['Score'] = positiveNegative
Creates a pandas Series called actualScore from the Score column of filtered_data:
actualScore = filtered_data['Score']
Creates the Series positiveNegative by applying partition to every value, giving 'negative' for values < 3 and 'positive' otherwise (rows with Score 3 were already filtered out, so that means > 3):
positiveNegative = actualScore.map(partition)
Overwrites the old Score column with the new coded values:
filtered_data['Score'] = positiveNegative
I think the partition function is used to replace the Score column in the table with 'positive' or 'negative'. The Score column is taken as the Series actualScore, map applies partition to each value to decide whether it is positive or negative, and the result, positiveNegative, then replaces the values in the Score column.
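The three lines can be tried on a tiny made-up frame in place of the SQLite query result. The scores below are illustrative and skip 3, as the SQL filter does.

```python
import pandas as pd

def partition(x):
    # Scores below 3 are negative; everything else (4 or 5 here) is positive.
    if x < 3:
        return 'negative'
    return 'positive'

# Made-up stand-in for the filtered review table.
filtered_data = pd.DataFrame({"Score": [1, 2, 4, 5]})

actualScore = filtered_data['Score']           # the numeric Score column
positiveNegative = actualScore.map(partition)  # partition applied to each value
filtered_data['Score'] = positiveNegative      # column replaced by the labels

print(filtered_data['Score'].tolist())
# ['negative', 'negative', 'positive', 'positive']
```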
