Use multiple csv files as test and training set for CNN - python

I am writing code to detect minerals from their Raman spectra using a CNN. I have data (the RRUFF dataset) for different minerals, written into separate csv/text files, each consisting of 2 columns: Intensity and the corresponding Raman Shift value of the mineral.
How should I use these multiple files for training and testing my CNN?
Can I use flow_from_directory directly on csv files under Train and Test folders?
Total csv/txt files in the dataset: 3696

def merge_data(csv_files, columns, output_file):
    # Concatenate all files (pd.concat replaces the removed DataFrame.append)
    df = pandas.concat(
        (pandas.read_csv(f) for f in csv_files),
        ignore_index=True, sort=False
    )
    df = df[columns]                     # keep the expected column order
    df.to_csv(output_file, index=False)  # actually write the merged file
    return df
Now, call the function: df = merge_data(['file1.csv', 'file2.csv'], ['column1', 'column2'], 'all_data.csv')
Then, split the merged data into train and test sets using train_test_split from sklearn.model_selection.
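The split step can be sketched like this (the column names "intensity" and "raman_shift" and the per-row mineral labels are assumptions for illustration; the real labels would come from the RRUFF file names):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy merged data standing in for the output of merge_data()
df = pd.DataFrame({
    "intensity":   [0.1, 0.4, 0.3, 0.9, 0.7, 0.2],
    "raman_shift": [100, 200, 300, 400, 500, 600],
    "label":       ["quartz", "quartz", "quartz",
                    "calcite", "calcite", "calcite"],
})

X = df[["intensity", "raman_shift"]]
y = df["label"]

# 80/20 split; stratify keeps the class balance equal in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
print(len(X_train), len(X_test))  # 4 2
```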


Keras preprocessing trading data

I have a problem with preprocessing my trading data from .csv so that it fits my SGD neural network model's input/output.
I have imported the data using the pandas library, but maybe there's a better way to do it?
I need to set column names, the data inside needs to be of type double, and I need to convert it into a tf.data.Dataset.
I have 2 data sets: testingdata.csv and trainingdata.csv.
Each has 4 columns: Open, max, min, close.
The 'Open' column is the forecast value Y, while 'max', 'min' and 'close' are the X inputs.
Also, I have no idea what a 'metric' is in Keras and what metric I should use here.
So my question: what is the best way to do this, and how?
Thanks
Using pd.read_csv is a good way to import .csv files:
import pandas as pd

df = pd.read_csv('data.csv')
but you need to change the column names to custom names, so you can do this:
df = pd.read_csv('data.csv',
                 header=None,
                 names=["open", "max", "min", "close"],
                 encoding='utf-16')
You can inspect the head of the imported .csv file in the dataframe:
df.head(5)
If you want to convert the pandas dataframe to a TensorFlow Dataset:
import tensorflow as tf

target = df.pop('open')  # note: the column was renamed to lowercase above
dataset = tf.data.Dataset.from_tensor_slices((df.values, target.values))
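df.pop removes the target column from the frame in place and returns it, which is why the remaining columns can then be passed as features. A pandas-only sketch of that step (toy values, using the lowercase names set above):

```python
import pandas as pd

# Toy frame with the renamed lowercase columns
df = pd.DataFrame({
    "open":  [1.0, 1.1, 1.2],
    "max":   [1.5, 1.6, 1.7],
    "min":   [0.9, 1.0, 1.1],
    "close": [1.2, 1.3, 1.4],
})

target = df.pop("open")  # removes 'open' from df and returns it as a Series
print(list(df.columns))  # ['max', 'min', 'close']
print(target.tolist())   # [1.0, 1.1, 1.2]
```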

I would like to use a feature set (vector) for my data in Python for my machine learning algorithm. How can I do it?

I have data in the following form:
Class        Feature set list
classlabel1  [size, time]    example: [6780.3, 350.00]
classlabel2  [size, time]
classlabel3  [size, time]
classlabel4  [size, time]
How do I save this data in an excel sheet, and how can I train a model using this feature set? Currently I am working on an SVM classifier.
I have tried saving the feature set list in a dataframe and saving this dataframe to a csv file, but the size and time are getting split into two different columns.
The dataframe is getting saved in the csv file in the following way:
col0   col1         col2
62309  396.5099154  label1
I would like to train and test on the combined feature vector [size, time]. Is it possible, and is this the right way? If so, how can I do it?
Firstly, responding to your question:
I would like to train and test on the feature vector [size,time]
combined. Is it possible and is this a right way? If it is possible,
how can I do it?
Combining the two is not the right thing to do, because the two features are on different scales (if they actually are what their names suggest), and combining them would lose the information each provides on its own; they are two totally independent features for any supervised ML algorithm. So I would suggest treating these two features separately rather than combining them into one.
Now let's move on to the next section:
How do I save this data in excel sheet and how can I train the model
using this feature set? Currently I am working on SVM classifier.
Storing data: In my opinion, you can store data in whichever format you want, but I would prefer storing data in csv format, as it is convenient and loading the data file is faster.
sample_data.csv
size,time,class_label
100,150,label1
200,250,label2
240,180,label1
Below is the code for reading the data from csv and training an SVM:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# loading data
data = pd.read_csv("sample_data.csv")

# Dividing into dependent and independent features
Y = data.class_label.values
X = data.drop("class_label", axis=1).values

# encode the class column values
label_encoded_Y = preprocessing.LabelEncoder().fit_transform(list(Y))

# split training and testing data
x_train, x_test, y_train, y_test = train_test_split(X, label_encoded_Y,
                                                    train_size=0.8,
                                                    test_size=0.2)

# Now use whichever training algorithm you want
clf = SVC(gamma='auto')
clf.fit(x_train, y_train)

# Using the predictor
y_pred = clf.predict(x_test)
print(accuracy_score(y_test, y_pred))
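As a quick illustration of what the LabelEncoder step produces on the sample labels:

```python
from sklearn import preprocessing

labels = ["label1", "label2", "label1"]
# Classes are sorted alphabetically, so label1 -> 0, label2 -> 1
encoded = preprocessing.LabelEncoder().fit_transform(labels)
print(list(encoded))  # [0, 1, 0]
```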
Since size and time are different features, you should separate them into 2 different columns so your model can assign a separate weight to each of them, i.e.
# data.csv
size    time    label
6780.3  350.00  classLabel1
...
If you want to transform the data you have into the format above, you could use pandas.read_excel and use ast to turn the string list into a Python list object:
import pandas as pd
import ast

df = pd.read_excel("data.xlsx")
# Parse each "[size, time]" string once and unpack the two values
size_time = [ast.literal_eval(x) for x in df["Feature set list"]]
size = [x[0] for x in size_time]
time = [x[1] for x in size_time]
label = df["Class"]

new_df = pd.DataFrame({"size": size, "time": time, "label": label})
# This will result in the DataFrame below.
#   size    time   label
#   6780.3  350.0  classlabel1

# Save DataFrame to csv
new_df.to_csv("data_fix.csv", index=False)

# Use it
x = new_df.drop("label", axis=1)
y = new_df.label

# Further data preparation, such as splitting the dataset
# into train and test sets, etc.
...
Hope this helps
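The ast.literal_eval call is what turns the stored "[size, time]" string back into a real Python list; in isolation:

```python
import ast

raw = "[6780.3, 350.00]"
parsed = ast.literal_eval(raw)  # safely parses a Python literal, unlike eval()
print(parsed)     # [6780.3, 350.0]
print(parsed[0])  # 6780.3
```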

Getting an empty dataframe after splitting my dataset back into train and test datasets

First I used:
train['source'] = 'train'
test['source'] = 'test'
data = pd.concat([train, test], ignore_index=True)
Then I performed data exploration and cleaning on the full data, and now I want to split it back into train and test datasets.
I used the following code:
# Divide into test and train:
train = data.loc[data['source'] == "train"]
test = data.loc[data['source'] == "test"]
But after doing so, the train and test datasets are empty dataframes containing just the column names. What am I doing wrong?
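For reference, the concat/split round-trip itself does work; a minimal sketch (this assumes the 'source' column survives the cleaning steps untouched — if an intermediate step drops or overwrites it, the filters below return empty frames):

```python
import pandas as pd

train = pd.DataFrame({"value": [1, 2]})
test = pd.DataFrame({"value": [3]})

# Tag each frame before concatenating
train["source"] = "train"
test["source"] = "test"
data = pd.concat([train, test], ignore_index=True)

# ... exploration / cleaning on `data` would happen here ...

# Split back using the tag column
train_back = data.loc[data["source"] == "train"]
test_back = data.loc[data["source"] == "test"]
print(len(train_back), len(test_back))  # 2 1
```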

Looping scikit-learn machine learning datasets

How can I use datasets.load_DATASET_NAME with every string from the Datasets array when looping, to apply some ML algorithms to one dataset at a time?
I have the following sample program:
from sklearn import datasets

_Datasets_ = ['iris', 'breast_cancer', 'wine', 'diabetes', 'linnerud', 'boston']

for Dataset_name in _Datasets_:
    # Load the dataset (pseudocode -- this is what I want to do)
    Dataset = datasets.load_'DATASET_NAME'()
You could make a dictionary mapping names to loader functions, then call the function as you iterate:
Datasets = {'data1': load_data_1}
Data = Datasets['data1']()
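A concrete sketch of that idea, building the dictionary with getattr from the loader names (load_boston has been removed from recent scikit-learn releases, so it is omitted here):

```python
from sklearn import datasets

names = ['iris', 'breast_cancer', 'wine', 'diabetes']

# Map each name to its loader function, e.g. 'iris' -> datasets.load_iris
loaders = {name: getattr(datasets, 'load_' + name) for name in names}

for name, loader in loaders.items():
    bunch = loader()  # each loader returns a Bunch with .data and .target
    print(name, bunch.data.shape)
```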

How to match the name column with result after classification scikit-learn

This is an example of my data:
filename,2,3,4,5,6,7,class
a.txt,0,0,0,0,0,0,0
b.txt,0,0,0,0,0,1,0
c.txt,0,0,0,0,1,0,0
d.txt,1,0,1,0,0,1,1
When I train my data, I just use columns 2 -> 7 as input and class as output. But when I test the model after it is trained and saved, I need to know which files belong to which class; I mean, how do I know that d.txt is class 1?
I use pandas to import the data from the .csv files; the train set and test set are in 2 different csv files. In the training phase, I use columns 2-7 as input and the class column as target; these columns are numerical, while the filename column is just text. In the test phase, I need to know the filename along with the predicted class, but I don't know how to do that.
Thanks
P/s: I used MLP, SVM and NB as classifiers.
Assuming your data is in .csv format:
filename,2,3,4,5,6,7,class
a.txt,0,0,0,0,0,0,0
b.txt,0,0,0,0,0,1,0
c.txt,0,0,0,0,1,0,0
d.txt,1,0,1,0,0,1,1
you can output the filenames corresponding to a predicted class using:
features = [1, 0, 1, 0, 0, 1]                 # input
output = clf.predict([features])[0]           # predicted class
print(df[df["class"] == output]["filename"])  # corresponding filenames
Note that in your example you are facing the problem where the number of features is greater than the number of examples, so the classifier may deteriorate.
Hopefully you just gave a sample of your data; in that case you are likely to be fine. Just watch out for which classifier to use.
Full code:
import pandas as pd
from sklearn import svm

df = pd.read_csv('file.csv')
X = df.iloc[:, 1:7].values
y = df.iloc[:, 7].values  # 1-D target avoids a shape warning in fit()

clf = svm.SVC()  # using SVM as classifier
clf.fit(X, y)

features = [1, 0, 1, 0, 0, 1]
output = clf.predict([features])[0]
print(df[df["class"] == output]["filename"])
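To tie each test row's filename to its own prediction (rather than looking up every row that shares the predicted class), one option is to keep the filename column aside and reattach the predictions; a sketch, assuming the same column layout as the question:

```python
import pandas as pd
from sklearn import svm

# Toy train/test frames with the layout from the question
train = pd.DataFrame({
    "filename": ["a.txt", "b.txt", "c.txt", "d.txt"],
    "f1": [0, 0, 0, 1], "f2": [0, 0, 0, 0], "f3": [0, 0, 0, 1],
    "f4": [0, 0, 0, 0], "f5": [0, 0, 1, 0], "f6": [0, 1, 0, 1],
    "class": [0, 0, 0, 1],
})
test = train.copy()  # stand-in for a separate test csv

clf = svm.SVC()
clf.fit(train.drop(columns=["filename", "class"]), train["class"])

# Predict on the test features and pair each filename with its prediction
preds = clf.predict(test.drop(columns=["filename", "class"]))
result = pd.DataFrame({"filename": test["filename"], "predicted": preds})
print(result)
```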
