I have a problem preprocessing my trading data from .csv so that it fits the input/output of a neural network trained with SGD.
I have imported the data using the pandas library, but maybe there's a better way to do it?
I need to set column names, the data inside needs to be of type double, and I need to convert it into a tf.data.Dataset.
I have 2 data sets: testingdata.csv and trainingdata.csv
Each has 4 columns: Open, max, min, close
The 'Open' column is the forecast value Y, while 'max', 'min' and 'close' are the X inputs.
Also, I have no idea what a 'metric' is in Keras and which metric I should use here.
So my questions: what is the best way to do this, and how do I do it?
Thanks
Using pd.read_csv is a good way to import .csv files:
import pandas as pd
df= pd.read_csv('data.csv')
To set custom column names, pass them through the names argument:
df = pd.read_csv('data.csv',
                 header=None,  # the file has no header row
                 names=["Open", "max", "min", "close"],
                 encoding='utf-16')  # only needed if the file is UTF-16 encoded
You can inspect the first rows of the imported DataFrame:
df.head(5)
To convert the pandas DataFrame to a TensorFlow Dataset:
import tensorflow as tf
target = df.pop('Open')
dataset = tf.data.Dataset.from_tensor_slices((df.values, target.values))
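The question also asks for the data to be of type double, which the steps above do not enforce. A minimal pandas-only sketch of that step, assuming a header-less file with four numeric columns (the inline CSV text is a made-up stand-in for trainingdata.csv):

```python
import io
import pandas as pd

# Hypothetical stand-in for trainingdata.csv: no header row, four numeric columns.
csv_text = "1.10,1.25,1.05,1.20\n1.20,1.30,1.15,1.22\n1.22,1.28,1.18,1.19\n"

df = pd.read_csv(io.StringIO(csv_text),
                 header=None,
                 names=["Open", "max", "min", "close"])

# Force every column to double precision (float64).
df = df.astype("float64")

# Split off the target column; what remains are the inputs.
target = df.pop("Open")
features = df

print(features.dtypes.tolist(), target.dtype)
print(features.shape, target.shape)
```

features.values and target.values can then be passed to tf.data.Dataset.from_tensor_slices as shown above. On the metric question: in Keras a metric is a quantity reported during training and evaluation to judge the model; unlike the loss, it is not used to update the weights. For a regression target like Open, mean absolute error is a common choice, e.g. metrics=['mae'] in model.compile.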
How do I load an Excel file from my system into scikit-learn for training and testing?
I tried pandas for importing, and I was able to import the file using
data=pd.read_excel("IRIS.xlsx")
print(data.head())
but then I am not able to run this line
X,y=data(return_X_y=True)
It gives me a DataFrame error.
What should I do or import before this line so that I can carry out training and testing successfully?
data is a pandas DataFrame, which is not callable and has no return_X_y=True option. That argument only exists on most of sklearn's dataset loaders, e.g. `sklearn.datasets.load_XXX`.
In your case you load the data manually, so you just need to slice out the desired columns.
y = data["target"].values
# select all columns except the target
X = data.loc[:, data.columns != "target"].values
Use .iloc (assuming the target is the last column):
data = pd.read_excel('IRIS.xlsx')
X, y = data.iloc[:, :-1], data.iloc[:, -1]
If you really want to use the iris dataset, load it from sklearn directly:
from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris['data'], iris['target']
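To make the .iloc slicing concrete, here is a small sketch on a made-up DataFrame (the column names and values are hypothetical, and the target is assumed to be the last column). It also shows why data.iloc[-1] is not the target: it selects the last row, not the last column.

```python
import pandas as pd

# Hypothetical stand-in for the spreadsheet: features first, target last.
data = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3, 4.9],
    "sepal_width": [3.5, 3.2, 3.3, 3.0],
    "target": [0, 1, 2, 0],
})

X = data.iloc[:, :-1]      # all rows, every column except the last
y = data.iloc[:, -1]       # all rows, last column only
last_row = data.iloc[-1]   # a single row: one value per column

print(X.shape, y.shape, last_row.shape)
```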
I'm working with a large dataset whose data I want to standardize to use with a CNN.
Does Keras have a quick utility to standardize a block of numbers column-wise that you can use inside a Sequential model? I'm asking because I expect the data to eventually be used online, so ideally this standardization could also be applied to incoming data, i.e. a trailing moving average of the mean and std used to normalize the incoming data.
import numpy as np
import pandas as pd
np.random.seed(42)
col_names = ['Column' + str(x+1) for x in range(5)]
training_data = pd.DataFrame(np.random.randint(1,10 **6, 50).reshape(-1,5), columns = col_names)
I am not sure about the online case, but sklearn's StandardScaler should do the right thing for a fixed block of data.
It standardizes each column (feature) independently, so no transpose is needed:
from sklearn.preprocessing import StandardScaler
training_data = pd.DataFrame(StandardScaler().fit_transform(training_data), columns=col_names)
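For the online part of the question, one possible approach is to keep running column-wise statistics and update them as rows arrive. The sketch below uses Welford's algorithm in plain NumPy; RunningStandardizer is a hypothetical helper written for illustration, not a Keras or sklearn class, and it tracks the whole history rather than a trailing window.

```python
import numpy as np


class RunningStandardizer:
    """Column-wise running mean/variance via Welford's algorithm (hypothetical helper)."""

    def __init__(self, n_features):
        self.count = 0
        self.mean = np.zeros(n_features)
        self.m2 = np.zeros(n_features)  # running sum of squared deviations

    def update(self, row):
        row = np.asarray(row, dtype=float)
        self.count += 1
        delta = row - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (row - self.mean)

    def transform(self, row, eps=1e-12):
        # Population std (divide by n), the same convention StandardScaler uses.
        std = np.sqrt(self.m2 / max(self.count, 1))
        return (np.asarray(row, dtype=float) - self.mean) / (std + eps)


rng = np.random.default_rng(42)
stream = rng.integers(1, 10**6, size=(10, 5))  # made-up incoming rows

scaler = RunningStandardizer(n_features=5)
for row in stream:
    scaler.update(row)

standardized = np.array([scaler.transform(row) for row in stream])
print(standardized.mean(axis=0))
print(standardized.std(axis=0))
```

A true trailing window would need either a buffer of recent rows or an exponentially decayed variant of the same update; which one fits depends on how quickly the incoming data drifts.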
I have data in the following form
Class Feature set list
classlabel1 - [size,time] example:[6780.3,350.00]
classlabel2 - [size,time]
classlabel3 - [size,time]
classlabel4 - [size,time]
How do I save this data in excel sheet and how can I train the model using this feature set? Currently I am working on SVM classifier.
I have tried saving the feature set list in a dataframe and saving this dataframe to a csv file. But the size and time are getting split into two different columns.
The dataframe is getting saved in csv file in the following way:
col 0 col1 col2
62309 396.5099154 label1
I would like to train and test on the feature vector [size,time] combined. Is it possible and is this a right way? If it is possible, how can I do it?
Firstly responding to your question:
I would like to train and test on the feature vector [size,time]
combined. Is it possible and is this a right way? If it is possible,
how can I do it?
Combining the two is not the right thing to do: they are on different scales (if they are what their names suggest), and merging them would throw away the information each one provides on its own. They are two independent features for any supervised ML algorithm, so I would suggest treating them separately rather than combining them into one.
Now let's move on to the next part:
How do I save this data in excel sheet and how can I train the model
using this feature set? Currently I am working on SVM classifier.
Storing data: in my opinion you can store the data in whichever format you want, but I prefer CSV because it is convenient and the file loads faster.
sample_data.csv
size,time,class_label
100,150,label1
200,250,label2
240,180,label1
Below is the code for reading the data from csv and training SVM :
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
# loading data
data = pd.read_csv("sample_data.csv")  # malformed lines raise an error by default
# Dividing into dependent and independent features
Y = data["class_label"].values
X = data.drop("class_label", axis=1).values
# encode the class column values
label_encoded_Y = preprocessing.LabelEncoder().fit_transform(list(Y))
# split training and testing data
x_train,x_test,y_train,y_test=train_test_split(X,label_encoded_Y,
train_size=0.8,
test_size=0.2)
# Now use whichever training algorithm you want
clf = SVC(gamma='auto')
clf.fit(x_train, y_train)
# Using the predictor
y_pred = clf.predict(x_test)
print(accuracy_score(y_test, y_pred))
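Since the answer points out that size and time are on different scales, one common option is to standardize the features before they reach the SVM. The following is a minimal sketch of that idea using an sklearn pipeline; the rows and labels are made up for illustration, and the scaling step is an addition, not part of the code above.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical [size, time] rows and labels, made up for illustration.
X = np.array([[100.0, 150.0], [200.0, 250.0], [240.0, 180.0],
              [120.0, 140.0], [210.0, 260.0], [250.0, 190.0]])
y = np.array(["label1", "label2", "label1", "label1", "label2", "label1"])

# Scale each feature to zero mean and unit variance, then fit the SVM.
model = make_pipeline(StandardScaler(), SVC(gamma="auto"))
model.fit(X, y)

predictions = model.predict(X)
print(predictions)
```

Putting the scaler inside the pipeline means it is fitted only on the data passed to fit, so the same object can be used after a train/test split without leaking test statistics into training.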
Since size and time are different features, you should put them in 2 separate columns so that your model can assign a separate weight to each of them, i.e.
# data.csv
size     time     label
6780.3   350.00   classLabel1
...
If you want to transform the data you have into the format above, you can use pandas.read_excel and use ast to turn each string list into a Python list.
import pandas as pd
import ast
df = pd.read_excel("data.xlsx")
size_time = [(ast.literal_eval(x)[0], ast.literal_eval(x)[1]) for x in df["Feature set list"]]
size = [x[0] for x in size_time]
time = [x[1] for x in size_time]
label = df["Class"]
new_df = pd.DataFrame({"size":size, "time":time, "label":label})
# This will result in the DataFrame below.
# size time label
# 6780.3 350.0 classlabel1
# Save DataFrame to csv
new_df.to_csv("data_fix.csv")
# Use it
x = new_df.drop("label", axis=1)
y = new_df.label
# Further data preparation, such as split the dataset
# into train and test set, etc.
...
Hope this helps
I am writing code to detect minerals from their Raman spectra using a CNN. I have data (the RRUFF dataset) for different minerals stored in separate csv/text files, each consisting of 2 columns: the intensity and the corresponding Raman shift value of the mineral.
How should I use these multiple files for training and testing my CNN?
Can I use flow_from_directory directly for csv files under Train and Test folders?
Total csv/txt files in dataset: 3696
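import pandas

def merge_data(csv_files, columns, output_file):
    # DataFrame.append was removed in pandas 2.0; collect the frames and concatenate once
    frames = [pandas.read_csv(file) for file in csv_files]
    df = pandas.concat(frames, ignore_index=True, sort=False)[columns]
    df.to_csv(output_file, index=False)  # save the merged data
    return df
Now call the function: df = merge_data(['file1.csv', 'file2.csv'], ['column1', 'column2'], 'all_data.csv')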
Then, split the merged data into train and test set using train_test_split from sklearn.model_selection
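A small end-to-end sketch of merging and then splitting, using in-memory text in place of real files. The file names, column names and values are hypothetical, and the extra source_file column is an assumption added here so that each row can still be traced back to its spectrum after merging.

```python
import io
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical in-memory stand-ins for two spectrum files.
files = {
    "mineral_a.csv": "intensity,raman_shift\n10,100\n12,110\n9,120\n",
    "mineral_b.csv": "intensity,raman_shift\n30,100\n28,110\n",
}

frames = []
for name, text in files.items():
    frame = pd.read_csv(io.StringIO(text))
    frame["source_file"] = name  # remember which file each row came from
    frames.append(frame)

merged = pd.concat(frames, ignore_index=True)

# Row-level split, as suggested above.
train, test = train_test_split(merged, test_size=0.4, random_state=0)
print(len(train), len(test))
```

Whether rows or whole files should be the unit of the split depends on how the CNN consumes a spectrum; this sketch simply follows the row-level approach described in the answer.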
I'm using the LightGBM Package.
I have successfully created a new tree using "create_tree_digraph" but I face some trouble understanding the result.
There is "leaf_value" in a leaf node. I don't know what it means. Please, somebody help me understand this. Thanks. :)
I used this example code from here: https://www.analyticsvidhya.com/blog/2017/06/which-algorithm-takes-the-crown-light-gbm-vs-xgboost/
#importing standard libraries
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import graphviz
import lightgbm as lgb
#loading our training dataset 'adult.csv' with name 'data' using pandas
data=pd.read_csv('./adult.csv',header=None)
#Assigning names to the columns
data.columns=['age','workclass','fnlwgt','education','education-num','marital_Status','occupation','relationship','race','sex','capital_gain','capital_loss','hours_per_week','native_country','Income']
# Label Encoding our target variable
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
l=LabelEncoder()
l.fit(data.Income)
data.Income=Series(l.transform(data.Income)) #label encoding our target variable
#One Hot Encoding of the Categorical features
one_hot_workclass=pd.get_dummies(data.workclass)
one_hot_education=pd.get_dummies(data.education)
#removing categorical features
data.drop(['workclass','education','marital_Status','occupation','relationship','race','sex','native_country'],axis=1,inplace=True)
#Merging one hot encoded features with our dataset 'data'
data=pd.concat([data,one_hot_workclass,one_hot_education],axis=1)
#Here our target variable is 'Income' with values as 1 or 0.
#Separating our data into features dataset x and our target dataset y
x=data.drop('Income',axis=1)
y=data.Income
#Imputing missing values in our target variable
y.fillna(y.mode()[0],inplace=True)
#Now splitting our dataset into test and train
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=.3)
train_data=lgb.Dataset(x_train,label=y_train)
#setting parameters for lightgbm
param = {'num_leaves':150, 'objective':'binary','max_depth':3,'learning_rate':.05,'max_bin':200}
param['metric'] = ['auc', 'binary_logloss']
#training our model using light gbm
num_round=50
lgbm=lgb.train(param,train_data,num_round)
graph = lgb.create_tree_digraph(lgbm)
graph.render(view=True)
Then I applied 'create_tree_digraph' function.
These are the raw scores (log-odds) before the sigmoid function is applied, not probabilities. One thing to be aware of: your image shows only 1 tree out of the entire model, so its leaf value will not match the model's actual output (unless your model is just this 1 tree).
This image shows what it would look like if you applied the sigmoid to the leaf values before creating the plots.
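As a rough illustration of how leaf values relate to the final prediction, assuming the default binary objective: the leaf values a sample lands in are summed across all trees, and the sigmoid is applied to that sum. The leaf values below are made up, not taken from the model in the question.

```python
import math


def sigmoid(raw_score):
    """Map a raw score (log-odds) to a probability."""
    return 1.0 / (1.0 + math.exp(-raw_score))


# Hypothetical leaf values that one sample lands in, one per tree.
leaf_values = [0.12, -0.05, 0.30]

raw_score = sum(leaf_values)       # trees are added together in raw-score space
probability = sigmoid(raw_score)   # probability of the positive class

print(raw_score, probability)
```

This is why applying the sigmoid to a single tree's leaf value only gives the model's prediction when the model consists of that one tree.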