Using PCA on train and test sets - python

I'm trying to reduce the number of features I use in my TensorFlow model. For that, I'm trying to use PCA. Here is the code that I've written:
from sklearn.decomposition import PCA
from sklearn import preprocessing
import numpy as np
import pandas as pd
import csv
import matplotlib.pyplot as plt
number_of_PCA = 20
# Reading csv file
training_file = 'Training.csv'
testing_file = 'Test.csv'
dataframe_train = pd.read_csv(training_file)
dataframe_test = pd.read_csv(testing_file)
#Training values
features_labels_train = dataframe_train.columns.values[:-2]
class_labels_train = dataframe_train.iloc[:,-2]
feature_values_train = dataframe_train.iloc[:,:-2]
train_onehot = dataframe_train.iloc[:,-1]
#Test values
feature_labels_test = dataframe_test.columns.values[:-2]
class_labels_test = dataframe_test.iloc[:,-2]
features_values_test = dataframe_test.iloc[:,:-2]
test_onehot = dataframe_test.iloc[:,-1]
#values standardisation
stdsc = preprocessing.StandardScaler()
np_scaled_train = stdsc.fit_transform(feature_values_train)
np_scaled_test = stdsc.transform(features_values_test)
pca = PCA(n_components=number_of_PCA)
X_train_pca = pca.fit_transform(np_scaled_train) # This is the result
X_test_pca = pca.transform(np_scaled_test)
......................................................
When I run the TensorFlow model on the result, I get heavy overfitting, as shown below:
I'm assuming that the way I'm using PCA on the test set is the issue.
Does anyone have an idea what I'm missing here?
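The preprocessing pattern in the code above (fit the scaler and the PCA on the training set only, then reuse both to transform the test set) can also be written as a scikit-learn Pipeline. This is a minimal sketch on synthetic data: the array shapes, the random data and the variable names are assumptions for illustration. It only shows the transform step and does not diagnose the overfitting.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the feature matrices (assumed shapes)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 50))
X_test = rng.normal(size=(40, 50))

# Scaler and PCA are chained so both are fitted on the training data only
reducer = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=20)),
])

X_train_pca = reducer.fit_transform(X_train)  # fit + transform on train
X_test_pca = reducer.transform(X_test)        # transform only on test

# Share of the training variance kept by the 20 components
explained = reducer.named_steps["pca"].explained_variance_ratio_.sum()
print(X_train_pca.shape, X_test_pca.shape)
print(f"variance retained by 20 components: {explained:.2f}")
```

Checking the retained variance is one way to see how much information the 20 components keep before the data reaches the model.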

Related

python time series synthetic data using ydata-synthetic package - Time series GAN

Hello, as the title says, I am trying to use the ydata-synthetic package for a time series GAN.
At first I thought that if I put in integers the output would also be integers, but it wasn't: the output data are decimal numbers. I am using ydata-synthetic (https://github.com/ydataai/ydata-synthetic).
Here is my code for generating the data. Please help me.
#Importing the required libs for the exercise
from os import path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from ydata_synthetic.synthesizers import ModelParameters
from ydata_synthetic.preprocessing.timeseries import processed_stock
from ydata_synthetic.synthesizers.timeseries import TimeGAN
import torch
arr_data = np.random.randint(0,600000,(100,1))
#Specific to TimeGANs
#stock_data
seq_len=20
n_seq = 1 #number of columns
hidden_dim=24
gamma=1
noise_dim = 32
dim = 128
batch_size = len(arr_data) - seq_len
log_step = 100
learning_rate = 5e-4
gan_args = ModelParameters(batch_size=batch_size,
                           lr=learning_rate,
                           noise_dim=noise_dim,
                           layers_dim=dim)
# Build overlapping windows of seq_len consecutive values
lst_temp = []
for i in range(0, len(arr_data) - seq_len):
    _x = arr_data[i:i+seq_len]
    lst_temp.append(_x)
tens_rand_data = torch.tensor(lst_temp)
lst_rand_data = tens_rand_data.numpy()
synth = TimeGAN(model_parameters=gan_args, hidden_dim=24, seq_len=seq_len, n_seq=n_seq, gamma=1)
synth.train(lst_rand_data, train_steps=10)
synth_data = synth.sample(len(lst_rand_data))
print(synth_data.shape)
cols = ['Car price']
for j, col in enumerate(cols):
    df = pd.DataFrame({'Real': lst_rand_data[-1][:, j],'Synthetic': synth_data[-1][:, j]})
    df.plot(title = "Car price",secondary_y='Synthetic', style=['-', '--'])
    print(df)

Your input should be scaled with a MinMaxScaler before it is passed to TimeGAN, and you will always receive decimal output between 0 and 1 because of the sigmoid activation on the last layer of its generator.
You can change your code in two ways:
1. Change your input from integers to decimals in the range [0,1]:
arr_data = np.random.randint(0,600000,(100,1))
into
arr_data = np.random.uniform(0,1,(100,1))
This way your dummy input doesn't need to be scaled, since it is already in [0,1].
2. Use MinMaxScaler to scale your data:
from sklearn.preprocessing import MinMaxScaler
arr_data = np.random.randint(0,600000,(100,1))
scaler = MinMaxScaler(feature_range = (0,1))
scaled_data = scaler.fit_transform(arr_data)
...
Please note that you will always receive decimal output in [0,1] when using TimeGAN. If you want to convert the synthetic data back into integers, consider using the scaler's inverse transform.
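The inverse-transform step mentioned above could look roughly like this. It is a minimal sketch: the random array `synth_data` is a hypothetical stand-in for the output of `synth.sample(...)` (assumed to have shape (samples, seq_len, n_seq) with values in [0,1]), and rounding to the nearest integer is an assumption about what "integer output" should mean.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
arr_data = rng.integers(0, 600000, size=(100, 1))

# Fit the scaler on the original integer data
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_data = scaler.fit_transform(arr_data)

# Hypothetical stand-in for TimeGAN output: values in [0, 1],
# shape (n_samples, seq_len, n_seq)
synth_data = rng.uniform(0, 1, size=(80, 20, 1))

# inverse_transform expects 2D input, so flatten the windows and restore the shape
flat = synth_data.reshape(-1, synth_data.shape[-1])
restored = scaler.inverse_transform(flat).reshape(synth_data.shape)

# Round back to integers on the original scale
restored_int = np.rint(restored).astype(np.int64)
print(restored_int.shape, restored_int.dtype)
```

The same fitted `scaler` object has to be used for both directions, otherwise the restored values will not be on the original scale.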

2D output on linear regression model

I'm getting the following error from my code:
ValueError: Expected 2D array, got scalar array instead:
array=99.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Here is the code used:
#importing libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import linear_model
Physical_activity_df = pd.read_excel('C:/Users/Usuario/Desktop/LW_docs/Physical_activity_nopass.xlsx')
prediction_df = Physical_activity_df[['Activity_Score','Calories']]
prediction_df.plot(kind='scatter', x= 'Activity_Score', y= 'Calories')
plt.show()
#change to df variables
activity_score = pd.DataFrame(prediction_df['Activity_Score'])
calories = pd.DataFrame(prediction_df['Calories'])
lm = linear_model.LinearRegression()
model = lm.fit(activity_score,calories)
#predict new values for calories (FROM HERE COMES THE ERROR)
activity_score_new = 99
calories_predict = model.predict(activity_score_new)
calories_predict
Any idea about how to fix this issue? Thanks!
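As the error message itself suggests, `predict` expects a 2D input with one row per sample and one column per feature, so the scalar 99 has to be wrapped accordingly. A minimal sketch, using a small synthetic table in place of the Excel file (the numbers and the exact linear relation are assumptions for illustration):

```python
import pandas as pd
from sklearn import linear_model

# Synthetic stand-in for Physical_activity_nopass.xlsx (assumed values)
prediction_df = pd.DataFrame({"Activity_Score": [10, 20, 30, 40, 50]})
prediction_df["Calories"] = 10 * prediction_df["Activity_Score"] + 50

# Double brackets keep these as 2D DataFrames
activity_score = prediction_df[["Activity_Score"]]
calories = prediction_df[["Calories"]]

model = linear_model.LinearRegression().fit(activity_score, calories)

# One sample, one feature: a 1x1 DataFrame with the same column name
activity_score_new = pd.DataFrame({"Activity_Score": [99]})
calories_predict = model.predict(activity_score_new)
print(calories_predict)
```

A plain array such as `np.array([[99]])` has the same (1, 1) shape; a DataFrame with the original column name is used here so the input matches what the model was fitted on.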

Python Matplotlib plotting prediction graph in wrong order

I'm trying to predict a future stock price using a linear regression model, but there is a problem regarding the plotting of the prediction graph. Every time I run the code, the original graph plot (blue) is correct, but the prediction (green) and valid graph (orange) are in the wrong order. Also, the valid graph is red when it is supposed to be orange.
import requests # For http request to https://marketstack.com
import pandas as pd # For pandas datatable
import numpy as np
# Api Key
params = {
    'access_key': '****************************'
}
# Request Api Key Data
api_result = requests.get('https://api.marketstack.com/v1/eod?access_key=************************&symbols=FB&interval=1min&sort=DESC&limit=1000', params)
api_response = api_result.json()
# Sorts the data into a table
df = pd.DataFrame(api_response['data'])
print(df)
# Exports and then imports csv data
df.to_csv('Test_Sample.csv', index=False)
dataframe = pd.read_csv('Test_Sample.csv', header=0)
#Reverse data table
dataframe2 = dataframe.iloc[::-1]
print(dataframe2)
#Convert string to floats
dataframe2['symbol']=pd.to_numeric(dataframe2['symbol'], errors='coerce')
dataframe2['exchange']=pd.to_numeric(dataframe2['exchange'], errors='coerce')
dataframe2['date']=pd.to_numeric(dataframe2['date'], errors='coerce')
dataframe2.info()
#Create target volume
dataframe2['Price_up'] = np.where(dataframe2['close'].shift(-1) > dataframe2['close'], 1, 0)
data = dataframe2[["close"]]
print(data.head())
#The number of days for prediction
futureDays = 30
data["prediction"] = data[["close"]].shift(-futureDays)
print(data.head())
print(data.tail())
x = np.array(data.drop(columns=["prediction"]))[:-futureDays]
print(x)
y = np.array(data["prediction"])[:-futureDays]
print(y)
#75% training data and 25% testing data
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.25)
# creating the Linear Regression model
from sklearn.linear_model import LinearRegression
linear = LinearRegression().fit(xtrain, ytrain)
xfuture = data.drop(columns=["prediction"])[:-futureDays]
xfuture = xfuture.tail(futureDays)
xfuture = np.array(xfuture)
print(xfuture)
linearPrediction = linear.predict(xfuture)
print("Linear regression Prediction =",linearPrediction)
#Correct chart appropriate way
sort_x = dataframe2['date']
sort_y = dataframe2['close']
#sort_x, sort_y = zip(*sorted(zip(sort_x, sort_y)))
#sort_x = dataframe2['date']
#sort_y = dataframe2['close']
sort_z = data['close']
sort_x, sort_y, sort_z = zip(*sorted(zip(sort_x, sort_y, sort_z)))
#Linear Regression Model
import matplotlib.pyplot as plt
predictions = linearPrediction
valid = data[x.shape[0]:]
valid["predictions"] = predictions
plt.figure(figsize=(10, 5))
plt.title("Financial Instrument Price Prediction Model (Linear Regression Model)")
plt.xlabel("Days")
plt.ylabel("Close Price")
plt.plot(sort_z)
plt.plot(sort_x, sort_y)
plt.plot(valid[["close", "predictions"]])
plt.legend(["Original", "Valid", "predictions"])
plt.show()
I've tried using the zip command (all the way at the bottom of the code) in order to reverse the graphs, but it only worked for the original one (blue). Does anyone have a suggestion on how I can fix this issue?
Thank you!
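One possible reading of the code above: `dataframe.iloc[::-1]` reverses the rows but keeps the original index labels, and pandas objects are plotted against those labels, so the reversed slices can land in the wrong place on the x-axis. The sketch below, on synthetic prices, resets the index after reversing and plots every series against the same index. The data, the placeholder predictions and the variable names are assumptions for illustration, not the data from the question.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic newest-first closing prices, like an API sorted DESC
closes_desc = pd.DataFrame({"close": np.linspace(200.0, 100.0, 100)})

# Reversing keeps the old labels: the index now runs 99, 98, ..., 0
reversed_df = closes_desc.iloc[::-1]

# Reset the index so position and label agree (0 = oldest row)
data = reversed_df.reset_index(drop=True)

future_days = 30
train = data.iloc[:-future_days]
valid = data.iloc[-future_days:].copy()
valid["predictions"] = valid["close"] + 1.0  # placeholder for model output

# One plot call and one label per line, all on the same x-axis
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(train.index, train["close"], label="Original")
ax.plot(valid.index, valid["close"], label="Valid")
ax.plot(valid.index, valid["predictions"], label="Predictions")
ax.set_xlabel("Days")
ax.set_ylabel("Close Price")
ax.legend()
```

Giving each line its own label also keeps the legend in step with the colors. In the original code, `plt.plot(valid[["close", "predictions"]])` draws two lines, so four lines are plotted in total while the legend lists only three names.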

How to know the order of eigenvalues in PCA

I performed a PCA analysis in Python and got the eigenvalues, but I don't know which variables of my dataset are represented in the components.
Is there a way to know which component represents each variable of my data?
For example: 4.669473069609005 corresponds to sillas, etc.
Here is the file:
https://storage.googleapis.com/min_ambiente/servi_acc/datos.csv
Here is the code:
# Some of these libraries are for other methods I implemented here.
import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt
from google.cloud import bigquery
from sklearn.preprocessing import StandardScaler
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity
from factor_analyzer.factor_analyzer import calculate_kmo
from factor_analyzer import FactorAnalyzer
%matplotlib inline
#load csv
from google.colab import files
uploaded = files.upload()
data = pd.read_csv("datos.csv")
data.fillna(0, inplace=True)
a,b = data.shape
X= data.iloc[:,0:b-1]
X.head()
#####################################################
### Standardize and compute the covariance matrix ###
#####################################################
#Standardize features by removing the mean and scaling to unit variance
#used for generating learning model parameters from training data and
#generate transformed data set
X_std = StandardScaler().fit_transform(X)
mean_vec = np.mean(X_std, axis=0)
cov_mat = (X_std - mean_vec).T.dot((X_std - mean_vec)) / (X_std.shape[0]-1)
### Eigenvalues and eigenvectors obtained from the covariance matrix
cov_mat = np.cov(X_std.T)
eig_vals, eig_vecs = np.linalg.eig(cov_mat)
dictionary = dict(zip(lst2, lst1))
print(dictionary)
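### print from the highest to the lowest
# eig_pairs is not defined in the snippet; it is assumed here to hold
# (eigenvalue, eigenvector) pairs built from eig_vals and eig_vecs
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:, i]) for i in range(len(eig_vals))]
eig_pairs.sort(key=lambda pair: pair[0], reverse=True)
print('eigenvalues in descending order :')
for i in eig_pairs:
    print(i[0])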
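An eigenvalue belongs to a component, not to a single variable. The entries of the matching eigenvector (often called loadings) show how strongly each original variable contributes to that component. A minimal sketch with scikit-learn on synthetic data: the column names other than "sillas" and the random values are hypothetical, chosen only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data; only "sillas" appears in the question, other names are made up
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 4)),
                 columns=["sillas", "mesas", "puertas", "ventanas"])
X["mesas"] += 2 * X["sillas"]  # make two variables correlated

X_std = StandardScaler().fit_transform(X)
pca = PCA().fit(X_std)

# Rows: components, already ordered by decreasing eigenvalue.
# Columns: original variables. Each cell is that variable's loading.
loadings = pd.DataFrame(
    pca.components_,
    columns=X.columns,
    index=[f"PC{i + 1}" for i in range(pca.n_components_)],
)

print(pca.explained_variance_)  # eigenvalues of the covariance matrix
print(loadings.round(2))

# Variable with the largest absolute loading on each component
dominant = loadings.abs().idxmax(axis=1)
print(dominant)
```

Reading a row of `loadings` tells you which variables drive that component; several variables usually share one component, so a one-to-one mapping from eigenvalue to variable generally does not exist.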

Providing data to sklearn.svm.SVC()

I am trying to pass this training data to sklearn.svm.SVC(), but it raises the error ValueError: setting an array element with a sequence. when I call clf.fit(v,v2). How should this data be processed before giving it to SVC()?
import numpy as np
from PIL import Image
from sklearn import svm
v = []  # one list of pixel values per training image
for i in range(1, 55):
    t = list(Image.open("train/"+str(i)+".png").getdata())
    v.append(t)
v = np.asarray(v)
v2 = np.array(["1","F","9","D","E","E","E","9","0","D","0","3","C","B","F","9","A","E","B","8","A","8","7",
"9","9","3","C","6","1","E","6","6","C","C","F","A","8","0","1","F","F","E","9","4","6","0",
"7","2","D","9","A","C","7","E"])
clf = svm.SVC()

I think you are looking for something like this:
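import glob
import numpy as np
from PIL import Image
from sklearn import svm
filenames = glob.glob('train/*.png')
# scipy.misc.imread was removed in SciPy 1.2, so the images are read with PIL here
X = [np.asarray(Image.open(each)).flatten() for each in filenames]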
y = ["1","F","9","D","E","E", ...]
model = svm.SVC().fit(X, y)
Notes:
X has the form (n_images, n_pixels) where n_pixels=width*height
y has length n_images (54 in your example)
This is just a start; you should try to feed the classifier more meaningful features than single pixels.
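The shapes described in the notes can be checked on synthetic data. This is a minimal sketch that assumes all images are grayscale and share the same size (if the real images differ in size, they would need to be resized first, otherwise the rows cannot be stacked into one 2D array). The 8x8 random images and the repeated labels are hypothetical stand-ins for the 54 PNG files.

```python
import numpy as np
from sklearn import svm

rng = np.random.default_rng(0)

# Hypothetical stand-ins for 54 grayscale images of identical size (8x8)
images = [rng.integers(0, 256, size=(8, 8)) for _ in range(54)]
labels = ["1", "F", "9"] * 18  # made-up labels, one per image

# Flatten each image to one row: X has shape (n_images, n_pixels)
X = np.stack([img.ravel() for img in images])
y = np.array(labels)

model = svm.SVC().fit(X, y)
print(X.shape)
print(model.predict(X[:3]).shape)
```

When the files are read from disk, the order of `filenames` has to match the order of the labels in `y`; `glob.glob` does not guarantee any particular order.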
