Need help with the error
NameError: name 'countVectorizer' is not defined in PyCharm
I am trying to execute the FEATURE EXTRACTION code from this source
https://github.com/chdoig/pytexas2015-ml
File Name: 1-Feature_extraction.ipynb
import numpy as np
import pandas as pd
train_data = pd.read_csv('labeledTrainData.tsv',sep='\t')
print(train_data)
print(train_data.iloc[1].review)
test_data = pd.read_csv('testData.tsv',sep = '\t')
print(test_data)
import matplotlib.pyplot as plt
import seaborn as sns
train_data['review_len'] = train_data.review.apply(len)
len_pl = plt.hist(train_data.review_len.values)
plt.show(len_pl)
#describe negative reviews
print(train_data[train_data.sentiment==0].describe())
print(train_data[train_data.sentiment==1].describe())
#inspecting outliers
print(train_data[train_data.review_len==52].review.all())
print(train_data[train_data.review_len==13708].review.all())
#word extraction
from sklearn.feature_extraction.text import CountVectorizer
vocab = ['awesome', 'terrible']
simple_vectorizer = countVectorizer(vocabulary=vocab)
bow = simple_vectorizer.fit_transform(train_data.review).todense()
print(bow)
Error/Warning:
C:\Users\hi\PycharmProjects\Practice2\venv\Scripts\python.exe C:/Users/hi/PycharmProjects/Practice2/P1.py
C:\Users\hi\PycharmProjects\Practice2\venv\lib\site-packages\sklearn\externals\joblib\externals\cloudpickle\cloudpickle.py:47: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
import imp
Traceback (most recent call last):
File "C:/Users/hi/PycharmProjects/Practice2/P1.py", line 32, in
simple_vectorizer = countVectorizer(vocabulary=vocab)
NameError: name 'countVectorizer' is not defined
Process finished with exit code 1
You are importing CountVectorizer but referencing countVectorizer; Python names are case-sensitive.
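The offending lines from the question, corrected:
from sklearn.feature_extraction.text import CountVectorizer

vocab = ['awesome', 'terrible']
simple_vectorizer = CountVectorizer(vocabulary=vocab)  # capital C, matching the import
bow = simple_vectorizer.fit_transform(train_data.review).todense()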
I have been trying to split the dataset into train and test data for deployment using Streamlit.
import streamlit as st
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, KFold,cross_val_score
from sklearn.cluster import KMeans
import xgboost as xgb
from xgboost import XGBClassifier
def load_dataset():
df = pd.read_csv('txn.csv')
return df
df = load_dataset()
#create X and y, X will be feature set and y is the label - LTV
X = df.drop(['LTVCluster','m1_Revenue'],axis=1)
y = df(['LTVCluster'])
But I'm getting this error while executing the file:
TypeError: 'DataFrame' object is not callable
Traceback:
File "c:\users\anish\anaconda3\lib\site-packages\streamlit\script_runner.py", line 333, in _run_script
exec(code, module.__dict__)
File "C:\Users\Anish\Desktop\myenv\P52 - Retail Ecommerce\new1.py", line 25, in <module>
y = df(['LTVCluster'],axis=1)
What could be the error?
You have an extra set of parentheses in your last line, so Python thinks you're calling df. To filter by columns in pandas, you use square brackets, so remove the parentheses:
y = df['LTVCluster']
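With the column selection fixed, a minimal sketch of the split for deployment (test_size and random_state here are illustrative choices, not from the question):
X = df.drop(['LTVCluster', 'm1_Revenue'], axis=1)
y = df['LTVCluster']

# train_test_split is already imported in the question's code
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)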
I'm trying to impute NaN values, but first I want to check which method estimates them best. I'm new to these methods, so I want to use some code I found to compare the different regressors and choose the best one. The original code is this:
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.datasets import fetch_california_housing
from sklearn.impute import SimpleImputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

N_SPLITS = 5
rng = np.random.RandomState(0)
X_full, y_full = fetch_california_housing(return_X_y=True)
# ~2k samples is enough for the purpose of the example.
# Remove the following two lines for a slower run with different error bars.
X_full = X_full[::10]
y_full = y_full[::10]
n_samples, n_features = X_full.shape
fetch_california_housing is the dataset used in that example.
So when I tried to adapt the code to my case, I wrote this:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from numpy import genfromtxt
data = genfromtxt('documents/datasets/df.csv', delimiter=',')
features = data[:, :2]
targets = data[:, 2]
N_SPLITS = 5
rng = np.random.RandomState(0)
X_full, y_full = data(return_X_y= True)
# ~2k samples is enough for the purpose of the example.
# Remove the following two lines for a slower run with different error bars.
X_full = X_full[::10]
y_full = y_full[::10]
n_samples, n_features = X_full.shape
I always get the same error:
AttributeError: 'numpy.ndarray' object is not callable
and before I used my DataFrame as a CSV (df.csv), the error was the same:
AttributeError: 'Dataset' object is not callable
the complete error is this:
TypeError                                 Traceback (most recent call last)
<ipython-input-8-3b63ca34361e> in <module>
      3 rng = np.random.RandomState(0)
      4
----> 5 X_full, y_full = df(return_X_y=True)
      6 # ~2k samples is enough for the purpose of the example.
      7 # Remove the following two lines for a slower run with different error bars.

TypeError: 'DataFrame' object is not callable
I don't know how to solve either error. I hope I have explained my problem well; my English is not very good.
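Since data here is a NumPy array loaded with genfromtxt (and a pandas DataFrame is likewise not callable), there is no return_X_y to call. A minimal sketch of the adaptation, reusing the slices the question already computes:
import numpy as np
from numpy import genfromtxt

data = genfromtxt('documents/datasets/df.csv', delimiter=',')

# Slice the array instead of calling it: the first two columns are the
# features and the third column is the target, as in the question.
X_full = data[:, :2]
y_full = data[:, 2]

n_samples, n_features = X_full.shape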
I have CSV files with all-numeric values except the header row. When trying to build tensors, I get the following exception:
Traceback (most recent call last):
File "pytorch.py", line 14, in <module>
test_tensor = torch.tensor(test)
ValueError: could not determine the shape of object type 'DataFrame'
This is my code:
import torch
import dask.dataframe as dd
device = torch.device("cuda:0")
print("Loading CSV...")
test = dd.read_csv("test.csv", encoding = "UTF-8")
train = dd.read_csv("train.csv", encoding = "UTF-8")
print("Converting to Tensor...")
test_tensor = torch.tensor(test)
train_tensor = torch.tensor(train)
Using pandas instead of Dask for CSV parsing produced the same error. I also tried to specify dtype=torch.float64 inside the call to torch.tensor(data), but got the same error again.
Try converting it to an array first:
test_tensor = torch.Tensor(test.values)
I think you're just missing .values
import torch
import pandas as pd
train = pd.read_csv('train.csv')
train_tensor = torch.tensor(train.values)
Newer versions of pandas recommend using to_numpy() instead of .values:
train_tensor = torch.tensor(train.to_numpy())
Only using NumPy (skip_header=1 skips the header row mentioned in the question, which would otherwise become a row of NaNs):
import numpy as np
import torch

tensor = torch.from_numpy(
    np.genfromtxt("train.csv", delimiter=",", skip_header=1)
)
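The question also creates a CUDA device that is never used; a follow-up sketch (assuming a CUDA-capable GPU, falling back to CPU otherwise) for moving the tensor there:
import torch
import pandas as pd

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

train = pd.read_csv("train.csv")
train_tensor = torch.tensor(train.to_numpy(), dtype=torch.float64).to(device)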
I have a CSV file with 2 columns as
actual,predicted
1,0
1,0
1,1
0,1
.,.
.,.
How do I read this file and plot a confusion matrix in Python?
I tried the following code from a program.
import pandas as pd
from sklearn.metrics import confusion_matrix
import numpy
CSVFILE='./mappings.csv'
test_df=pd.read_csv[CSVFILE]
actualValue=test_df['actual']
predictedValue=test_df['predicted']
actualValue=actualValue.values
predictedValue=predictedValue.values
cmt=confusion_matrix(actualValue,predictedValue)
print cmt
but it gives me this error.
Traceback (most recent call last):
File "confusionMatrixCSV.py", line 7, in <module>
test_df=pd.read_csv[CSVFILE]
TypeError: 'function' object has no attribute '__getitem__'
pd.read_csv is a function. You call a function in Python by using parentheses.
You should use pd.read_csv(CSVFILE) instead of pd.read_csv[CSVFILE].
import pandas as pd
from sklearn.metrics import confusion_matrix

CSVFILE = './mappings.csv'
test_df = pd.read_csv(CSVFILE)
actualValue = test_df['actual'].values
predictedValue = test_df['predicted'].values
cmt = confusion_matrix(actualValue, predictedValue)
print(cmt)
Here's a simple solution to calculate the accuracy and build the confusion matrix for input in the format mentioned in the question:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

result = []
actual = []
with open("results.txt", "r") as f:
    for line in f:
        fields = line.split(",")  # comma-separated, matching the question's format
        actual.append(int(fields[0]))
        result.append(int(fields[1]))

cnf_mat = confusion_matrix(actual, result)
print(cnf_mat)
print('Test Accuracy:', accuracy_score(actual, result))
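The question asks for a plot, while the answers above only print the matrix. One option (assuming scikit-learn 0.22 or newer, which provides ConfusionMatrixDisplay) is:
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# cnf_mat is the matrix computed above
disp = ConfusionMatrixDisplay(confusion_matrix=cnf_mat)
disp.plot()
plt.show()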
I am using Python to cluster text documents that I have in a dataframe. This is what I am doing:
from __future__ import division
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.cluster.hierarchy import dendrogram, linkage, ward, is_valid_linkage
import numpy as np
import pandas as pd
data_lst = data_rd['text'].values.tolist()
tfidf_vectorizer = TfidfVectorizer( max_features=200000, stop_words='english',use_idf=True, tokenizer=lambda x: x.split(' '), ngram_range=(1,3))
tfidf_matrix = tfidf_vectorizer.fit_transform(data_lst)
print(tfidf_matrix.shape)
from sklearn.metrics.pairwise import cosine_similarity
dist = 1 - cosine_similarity(tfidf_matrix)
#(10193, 32757)
linkage_dist=ward(dist)
linkage_matrix = linkage(tfidf_matrix.todense(), 'ward')
dendrogram(linkage_matrix,truncate_mode="lastp",p=40,
show_leaf_counts=True,leaf_rotation=60.,leaf_font_size=8.,
show_contracted=True, )
is_valid_linkage(linkage_matrix)
is_valid_linkage(linkage_dist)
#False
#False
I get this error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.6/site-packages/scipy/cluster/hierarchy.py", line 2227, in dendrogram
    is_valid_linkage(Z, throw=True, name='Z')
  File "/usr/lib64/python2.6/site-packages/scipy/cluster/hierarchy.py", line 1421, in is_valid_linkage
    % name_str)
ValueError: Linkage 'Z' uses the same cluster more than once.
Is there any other way, apart from fastcluster, to solve this, and why is it happening?
There is one row in the text column that is blank and has no text.
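A minimal sketch of one way to act on that diagnosis, assuming (as in the question) the dataframe is data_rd with a 'text' column; the filtering step itself is a suggestion, not part of the original answer:
# Drop rows whose text is missing or blank before vectorizing,
# so no all-zero row enters the tf-idf matrix.
data_rd = data_rd.dropna(subset=['text'])
data_rd = data_rd[data_rd['text'].str.strip() != '']
data_lst = data_rd['text'].values.tolist()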