huggingface converting dataframe to dataset - python

I have the code below. I convert a dataset to a DataFrame and then back to a dataset, once with shuffled data and once with unshuffled data. When I compare the data in the shuffled case I get False, but in the unshuffled case I get True. Why is there a discrepancy?
import pandas as pd
import datasets
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from tqdm import tqdm
import os
train_data, test_data = datasets.load_dataset('imdb', split =['train', 'test'],
cache_dir='/media/data_files/github/website_tutorials/data')
train_data_s1, test_data_s1 = datasets.load_dataset('imdb', split =['train[0:500]', 'test[0:500]'],
cache_dir='/media/data_files/github/website_tutorials/data')
print (type (train_data_s1))
print (type (test_data_s1))
# shuffle and take the first 500 rows from train_data and test_data
# Build the list of indices 0..499
l1 = list(range(500))
# Print the list
print(l1)
train_data_s1_shuffled=train_data.shuffle(seed=2).select(l1)
test_data_s1_shuffled=test_data.shuffle(seed=3).select(l1)
print (type (train_data_s1_shuffled))
print (type (test_data_s1_shuffled))
Why do I get False when I compare the data below, but True in the next block?
from datasets import Dataset
import pandas as pd
#dataset_from_pandas = Dataset.from_pandas(df_pandas)
#https://discuss.huggingface.co/t/how-to-create-custom-classlabels/13650# "basic_sentiment holds values [-1,0,1]
from datasets import ClassLabel
x1=pd.DataFrame(train_data_s1_shuffled)
x2=Dataset.from_pandas(x1).cast_column("label", ClassLabel(num_classes=2, names=['neg', 'pos'], names_file=None, id=None))
print (x2.data==train_data_s1_shuffled.data)#returns false
print (x2.features==train_data_s1_shuffled.features)
#not sure why data matches in this case but not in earlier case?
x11=pd.DataFrame(train_data)
#print (x11.head())
x21=Dataset.from_pandas(x11).cast_column("label", ClassLabel(num_classes=2, names=['neg', 'pos'], names_file=None, id=None))
print (x21.data==train_data.data)#returns True
print (x21.features==train_data.features)

Related

Pandas preprocessing data and labelling

I want to split my data into features and labels: the first 6 columns determine the 7th column. Selecting the first 6 columns works perfectly:
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report, confusion_matrix
#Assign column names to the dataset
names=['buying', 'maint', 'doors', 'persons', 'lug_boot','safety', 'class']
# load the dataset in csv format into the pandas dataframe
cardata= pd.read_csv(r'C:\Users\user\Downloads\car.data', names=names)
X = cardata.iloc[:, 0:6]
The above code works, and when I run
print(X.head())
it prints the first 6 columns, excluding the last column, which is the one to be predicted.
But the code below does not work: its output looks the same as the output above
y = cardata.select_dtypes(include=[object])
print(y.head())
Please help: I need y to hold only the last (seventh) column.
Currently the output is the same as for X, but when I run print(y.head()) it should print only the last column.
Try this
X,y = cardata.iloc[:,:-1],cardata.iloc[:,-1]
This selects all rows and separates X and y based on the last column (index = -1). This should get you the result you are looking for
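As a small illustration of the iloc split above, here is a sketch on a made-up two-row frame that reuses the column names from the question (the row values are hypothetical). The earlier select_dtypes(include=[object]) call presumably returned every column because all columns in that file hold strings.

```python
import pandas as pd

# Small made-up frame with the same column layout as car.data.
names = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']
cardata = pd.DataFrame(
    [['vhigh', 'vhigh', '2', '2', 'small', 'low', 'unacc'],
     ['low', 'med', '4', 'more', 'big', 'high', 'vgood']],
    columns=names,
)

X = cardata.iloc[:, :-1]   # every column except the last -> DataFrame
y = cardata.iloc[:, -1]    # only the last column -> Series

print(list(X.columns))
print(y.name)
```

Here X keeps the six feature columns and y is a Series named class.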

Numpy Array for SVM model rather than a DataFrame

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np
# Read the data.
data = np.asarray(pd.read_csv('data.csv', header=None))
# Assign the features to the variable X, and the labels to the variable y.
X = data[:,0:2]
y = data[:,2]
# TODO: Create the model and assign it to the variable model.
# Find the right parameters for this model to achieve 100% accuracy on the dataset.
model = SVC()
model.fit(X,y)
Two questions:
The data goes from a pandas DataFrame (created by pd.read_csv) into a NumPy array.
Is that better? Is there a good reason for it? Why not stay with the DataFrame?
I do not understand this notation:
X = data[:,0:2]
y = data[:,2]
What does it do?
Thank you.
The data consists of a CSV file with many rows like this:
0.28917,0.65643,0
It has three columns: the first two hold the coordinates of a point, and the third holds its label.
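To illustrate the slicing notation asked about, here is a minimal sketch on three made-up rows in the same x, y, label format. In data[rows, columns], a bare : means "all rows", 0:2 selects columns 0 and 1, and 2 selects the third column only. As for the first question, scikit-learn estimators also accept a DataFrame directly, so converting to a NumPy array is a choice rather than a requirement.

```python
import numpy as np

# Three made-up rows in the same format as the CSV: x, y, label.
data = np.array([
    [0.28917, 0.65643, 0.0],
    [0.50000, 0.10000, 1.0],
    [0.90000, 0.30000, 1.0],
])

X = data[:, 0:2]   # all rows, columns 0 and 1 (the stop index 2 is excluded)
y = data[:, 2]     # all rows, column 2 only -> 1-D array of labels

print(X.shape)  # (3, 2)
print(y.shape)  # (3,)
```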

Trying to work out a python code

I am new to machine learning and Python. Recently I have been working with the Amazon Fine Food Reviews data from Kaggle and its accompanying code.
What I don't understand is how the 'partition' method is used here.
Also, what do the last 3 lines of code actually do?
%matplotlib inline
import sqlite3
import pandas as pd
import numpy as np
import nltk
import string
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from nltk.stem.porter import PorterStemmer
# using the SQLite Table to read data.
con = sqlite3.connect('./amazon-fine-food-reviews/database.sqlite')
#filtering only positive and negative reviews i.e.
# not taking into consideration those reviews with Score=3
filtered_data = pd.read_sql_query("""
SELECT *
FROM Reviews
WHERE Score != 3
""", con)
# Give reviews with Score > 3 a positive rating, and reviews with a
# Score < 3 a negative rating.
def partition(x):
    if x < 3:
        return 'negative'
    return 'positive'
# change scores below 3 to 'negative' and scores above 3 to 'positive'
actualScore = filtered_data['Score']
positiveNegative = actualScore.map(partition)
filtered_data['Score'] = positiveNegative
This selects the Score column from filtered_data as a pandas Series called actualScore:
actualScore = filtered_data['Score']
This creates the Series positiveNegative, coded 'negative' for values < 3 and 'positive' for values > 3:
positiveNegative = actualScore.map(partition)
This overwrites the old Score column with the new coded values:
filtered_data['Score'] = positiveNegative
In other words, the partition function is used to replace the Score column with 'positive' or 'negative'. The Score column is first taken as the Series actualScore, then map applies partition to every value to decide whether it is positive or negative, and finally the Score column is replaced with positiveNegative.
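The three steps can be tried on a tiny made-up column, as in the sketch below. The scores are hypothetical, and 3 is left out because the SQL query in the question filters it away.

```python
import pandas as pd

def partition(x):
    # Scores below 3 are negative; everything else is positive.
    if x < 3:
        return 'negative'
    return 'positive'

# Made-up scores; 3 is absent because the SQL query filters it out.
filtered_data = pd.DataFrame({'Score': [1, 2, 4, 5]})

# Series.map calls partition once per element and returns a new Series.
filtered_data['Score'] = filtered_data['Score'].map(partition)

print(filtered_data['Score'].tolist())
# ['negative', 'negative', 'positive', 'positive']
```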

Error in my dataset

Currently, I am getting this error in my code
'ValueError: Input contains NaN, infinity or a value too large for dtype('float64')'
when I want to run this code
import pandas as pd
train=pd.read_csv(r'C:\Users\ABDILLAH\Desktop\datasets\Rails\RailsDataset.csv')
features_col=['Num_comments', 'Num_Commits','Changed_files']
X=train.loc[:,features_col]
y=train.classes
from sklearn.linear_model import LogisticRegression
logreg=LogisticRegression()
logreg.fit(X,y)
If you need a sample of my dataset to check what is really happening, please let me know.
I loaded the sample set, and the code below ran on my computer:
import pandas as pd
from sklearn.linear_model import LogisticRegression
train = pd.read_csv('RailsDataset_bis.csv')
features_col = ['Num_Comments', 'Num_Commits', 'Changed_files']
X = train[features_col].dropna()
y = train['class'].dropna()
logreg = LogisticRegression()
logreg.fit(X, y)
I corrected the following issues:
There is no Num_comments column, only a Num_Comments column, because pandas column names are case-sensitive. In the pandas version used here, the line X=train.loc[:,features_col] did not raise an error but produced a column full of NaN (recent pandas versions raise a KeyError instead). Selecting columns with X = train[features_col] raises an error if a column name does not exist.
There is no train.classes, because the column is named class, not classes.
There was a row full of NaN at the bottom of the set that needed to be removed with dropna().
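The answer's code calls dropna() on X and y separately, which works here because the same row is empty in both. A slightly more defensive variant, sketched below on a made-up frame (the values are hypothetical stand-ins for RailsDataset.csv), drops incomplete rows once so that X and y are guaranteed to stay aligned.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for RailsDataset.csv, ending with an all-NaN row.
train = pd.DataFrame({
    'Num_Comments': [3, 0, 7, np.nan],
    'Num_Commits': [1, 2, 5, np.nan],
    'Changed_files': [4, 1, 9, np.nan],
    'class': [0, 1, 1, np.nan],
})

features_col = ['Num_Comments', 'Num_Commits', 'Changed_files']

# Drop incomplete rows once, so X and y keep the same rows in the same order.
clean = train.dropna(subset=features_col + ['class'])
X = clean[features_col]
y = clean['class']

print(len(X), len(y))
```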

How can I split a Dataset from a .csv file for Training and Testing?

I'm using Python and I need to split my imported .csv data into two parts, a training set and a test set, e.g. 70% training and 30% test.
I keep getting various errors, such as 'list' object is not callable and so on.
Is there any easy way of doing this?
Thanks
EDIT:
The code is basic, I'm just looking to split the dataset.
from csv import reader
with open('C:/Dataset.csv', 'r') as f:
    data = list(reader(f))  # Imports the CSV
data[0:1] ( data )
TypeError: 'list' object is not callable
You can use pandas:
import pandas as pd
import numpy as np
df = pd.read_csv('C:/Dataset.csv')
df['split'] = np.random.randn(df.shape[0])  # 1-D random column; not used by the split below
msk = np.random.rand(len(df)) <= 0.7
train = df[msk]
test = df[~msk]
A cleaner approach is df.sample, which draws exactly 70% of the rows (rounded to a whole number), whereas the random mask above gives only approximately 70%:
from numpy.random import RandomState
import pandas as pd
df = pd.read_csv('C:/Dataset.csv')
rng = RandomState()
train = df.sample(frac=0.7, random_state=rng)
test = df.loc[~df.index.isin(train.index)]
You should use the read_csv() function from the pandas module. It reads all your data straight into a DataFrame, which you can then split into train and test sets. Alternatively, you can use the train_test_split() function from the scikit-learn module.
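A minimal sketch of the train_test_split() route, using a made-up ten-row frame in place of the CSV file. The column names and the random_state value are arbitrary choices for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up frame standing in for the contents of Dataset.csv.
df = pd.DataFrame({'feature': range(10), 'label': [0, 1] * 5})

# test_size=0.3 keeps 70% for training; random_state makes the split repeatable.
train, test = train_test_split(df, test_size=0.3, random_state=42)

print(len(train), len(test))
```

With a real file, df would come from pd.read_csv, and the rest stays the same.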
