Hello. As the title says, I am trying to use the ydata-synthetic package (https://github.com/ydataai/ydata-synthetic) to generate data with a time-series GAN (TimeGAN).
I expected that feeding in integers would give integer output, but it did not: the output values are decimals.
Here is the code I use to generate the data. Please help.
#Importing the required libs for the exercise
from os import path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from ydata_synthetic.synthesizers import ModelParameters
from ydata_synthetic.preprocessing.timeseries import processed_stock
from ydata_synthetic.synthesizers.timeseries import TimeGAN
import torch
arr_data = np.random.randint(0,600000,(100,1))
#Specific to TimeGANs
#stock_data
seq_len=20
n_seq = 1 #number of columns
hidden_dim=24
gamma=1
noise_dim = 32
dim = 128
batch_size = len(arr_data) - seq_len
log_step = 100
learning_rate = 5e-4
gan_args = ModelParameters(batch_size=batch_size,
lr=learning_rate,
noise_dim=noise_dim,
layers_dim=dim)
lst_temp = []
for i in range(0,len(arr_data) - seq_len):
    # sliding window of seq_len consecutive values
    _x = arr_data[i:i+seq_len]
    lst_temp.append(_x)
tens_rand_data = torch.tensor(lst_temp)
lst_rand_data = tens_rand_data.numpy()
synth = TimeGAN(model_parameters=gan_args, hidden_dim=24, seq_len=seq_len, n_seq=n_seq, gamma=1)
synth.train(lst_rand_data, train_steps=10)
synth_data = synth.sample(len(lst_rand_data))
print(synth_data.shape)
cols = ['Car price']
for j, col in enumerate(cols):
    df = pd.DataFrame({'Real': lst_rand_data[-1][:, j],'Synthetic': synth_data[-1][:, j]})
    df.plot(title = "Car price",secondary_y='Synthetic data', style=['-', '--'])
    print(df)
Your input should be scaled with a MinMaxScaler before it is fed to TimeGAN. You will always receive decimal output between 0 and 1, because the last layer of its Generator uses a sigmoid activation.
You can change your code in one of two ways:
1. Change your input from integers to decimals in the range [0,1]:
arr_data = np.random.randint(0,600000,(100,1))
into
arr_data = np.random.uniform(0,1,(100,1))
This way your dummy input doesn't need to be scaled, since it is already in [0,1].
2. Use MinMaxScaler to scale your data:
from sklearn.preprocessing import MinMaxScaler
arr_data = np.random.randint(0,600000,(100,1))
scaler = MinMaxScaler(feature_range = (0,1))
scaled_data = scaler.fit_transform(arr_data)
...
Please note that you will always receive decimal output in [0,1] when using TimeGAN. If you want to convert the synthetic data back to integers on the original scale, apply the scaler's inverse_transform and then round the result.
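The inverse step can be sketched roughly as follows. This is only an illustration: the `synth_data` array is a random stand-in for the output of `synth.sample(...)`, and its `(n_samples, seq_len, n_seq)` shape is an assumption based on how the question indexes the samples.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
arr_data = rng.integers(0, 600000, size=(100, 1))

# Fit the scaler on the original integer data, as in option 2 above.
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_data = scaler.fit_transform(arr_data)

# Stand-in for synth.sample(...): values in [0, 1] with an assumed
# (n_samples, seq_len, n_seq) shape. Replace with the real TimeGAN output.
synth_data = rng.uniform(0, 1, size=(80, 20, 1))

# inverse_transform expects 2-D input, so flatten the sequences,
# map back to the original scale, then restore the 3-D shape.
n_samples, seq_len, n_seq = synth_data.shape
flat = synth_data.reshape(-1, n_seq)
restored = scaler.inverse_transform(flat).reshape(n_samples, seq_len, n_seq)

# Round to the nearest integer to get integer-valued synthetic data.
synth_int = np.rint(restored).astype(np.int64)
print(synth_int[-1][:5, 0])
```

Keep the fitted `scaler` object around after training, since the same instance is needed to undo the scaling.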
I am porting some distribution fitting code from R to Python and I noticed that the shape and scale parameter estimation in R and Python are different after 3 decimal places, and I wondered why this would be the case.
R code:
library(fitdistrplus)
library(ADGofTest)
set.seed(66)
weibull_sample <- rweibull(150, shape = 0.75, scale = 1)
weibull_fit <- fitdist(weibull_sample,"weibull",method="mle")
summary(weibull_fit) # shape = 0.888309653152, scale = 1.065783323933
gofstat(weibull_fit) #AD 0.9755906963522
plot(weibull_fit)
ad.test(weibull_sample, pweibull, shape = 0.888309653152, scale = 1.065783323933)
# AD = 0.9755906964
write.csv(weibull_sample, file = "weibull_sample.csv", row.names = FALSE, col.names = FALSE)
Here is the Python code:
# Import libraries
import pandas as pd
import numpy as np
import math
from scipy import stats
import statistics as stat
# Read in a dataset from disk (n)
file_path = "weibull_sample.csv"
weibull_df = pd.read_csv(file_path)
weibull_df = weibull_df.sort_values(by=['Wait_Times'], ascending=True)
weibull_df = weibull_df.reset_index(drop=True)
# Find the parameters of a Weibull distribution fitted to the dataset
weibull_fit = stats.weibull_min.fit(weibull_df['Wait_Times'], floc=0)
# Extract parameters to individual values (the location is fixed at 0 and unused)
weibull_shape, weibull_unused, weibull_scale = weibull_fit
# Print the values
print("Weibull Shape Parameter for Dataset", weibull_shape)
print("Weibull Scale Parameter for Dataset", weibull_scale)
#shape R = 0.888309653152
#shape Py = 0.8883784856851663
#scale R = 1.065783323933
#scale Py = 1.0659294522799554
Note that in R I set the number of printed digits to 12 with options(digits = 12), while Python appears to print about 16 decimal places by default.
Of course, if I round to 3 decimal places I am none the wiser but I am wondering why there is a difference in the first place.
But more importantly how do I know which set of parameters is "correct"?
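One possible way to compare the two parameter sets, sketched here as an assumption rather than a definitive answer: both fits claim to maximise the same likelihood, so evaluating the Weibull log-likelihood of the data at each pair shows which one sits closer to the maximum. The `weibull_loglik` helper and the stand-in sample are hypothetical. The printed numbers only mean something when `sample` holds the actual values from weibull_sample.csv.

```python
import numpy as np

def weibull_loglik(x, shape, scale):
    """Log-likelihood of a two-parameter Weibull (location fixed at 0)."""
    x = np.asarray(x, dtype=float)
    z = x / scale
    return np.sum(np.log(shape / scale) + (shape - 1.0) * np.log(z) - z ** shape)

# Stand-in sample (NOT the R sample); replace with the CSV values.
rng = np.random.default_rng(66)
sample = rng.weibull(0.75, size=150)

# Estimates reported in the question.
ll_r = weibull_loglik(sample, 0.888309653152, 1.065783323933)
ll_py = weibull_loglik(sample, 0.8883784856851663, 1.0659294522799554)
print("log-likelihood at R estimates:     ", ll_r)
print("log-likelihood at Python estimates:", ll_py)
```

On the real data, the pair with the higher log-likelihood is the better maximum-likelihood estimate. Differences this small often come down to the optimizer and its stopping tolerance rather than to a different model.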
I'm trying to predict a future stock price using a linear regression model, but there is a problem regarding the plotting of the prediction graph. Every time I run the code, the original graph plot (blue) is correct, but the prediction (green) and valid graph (orange) are in the wrong order. Also, the valid graph is red when it is supposed to be orange.
import requests # For http request to https://marketstack.com
import pandas as pd # For pandas datatable
import numpy as np
# Api Key
params = {
'access_key': '****************************'
}
# Request Api Key Data
api_result = requests.get('https://api.marketstack.com/v1/eod?access_key=************************&symbols=FB&interval=1min&sort=DESC&limit=1000', params)
api_response = api_result.json()
# Sorts the data into a table
df = pd.DataFrame(api_response['data'])
print(df)
# Exports and then imports csv data
df.to_csv('Test_Sample.csv', index=False)
dataframe = pd.read_csv('Test_Sample.csv', header=0)
#Reverse data table
dataframe2 = dataframe.iloc[::-1]
print(dataframe2)
#Convert string to floats
dataframe2['symbol']=pd.to_numeric(dataframe2['symbol'], errors='coerce')
dataframe2['exchange']=pd.to_numeric(dataframe2['exchange'], errors='coerce')
dataframe2['date']=pd.to_numeric(dataframe2['date'], errors='coerce')
dataframe2.info()
#Create target volume
dataframe2['Price_up'] = np.where(dataframe2['close'].shift(-1) > dataframe2['close'], 1, 0)
data = dataframe2[["close"]]
print(data.head())
#The number of days for prediction
futureDays = 30
data["prediction"] = data[["close"]].shift(-futureDays)
print(data.head())
print(data.tail())
x = np.array(data.drop(["prediction"], axis=1))[:-futureDays]
print(x)
y = np.array(data["prediction"])[:-futureDays]
print(y)
#75% training data and 25% testing data
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.25)
# creating the Linear Regression model
from sklearn.linear_model import LinearRegression
linear = LinearRegression().fit(xtrain, ytrain)
xfuture = data.drop(["prediction"], axis=1)[:-futureDays]
xfuture = xfuture.tail(futureDays)
xfuture = np.array(xfuture)
print(xfuture)
linearPrediction = linear.predict(xfuture)
print("Linear regression Prediction =",linearPrediction)
#Correct chart appropriate way
sort_x = dataframe2['date']
sort_y = dataframe2['close']
#sort_x, sort_y = zip(*sorted(zip(sort_x, sort_y)))
#sort_x = dataframe2['date']
#sort_y = dataframe2['close']
sort_z = data['close']
sort_x, sort_y, sort_z = zip(*sorted(zip(sort_x, sort_y, sort_z)))
#Linear Regression Model
import matplotlib.pyplot as plt
predictions = linearPrediction
valid = data[x.shape[0]:]
valid["predictions"] = predictions
plt.figure(figsize=(10, 5))
plt.title("Financial Instrument Price Prediction Model (Linear Regression Model)")
plt.xlabel("Days")
plt.ylabel("Close Price")
plt.plot(sort_z)
plt.plot(sort_x, sort_y)
plt.plot(valid[["close", "predictions"]])
plt.legend(["Original", "Valid", "predictions"])
plt.show()
I've tried using zip (all the way at the bottom of the code) to reverse the graphs, but it only worked for the original one (blue). Does anyone have a suggestion on how I can fix this issue?
Thank you!
I would like to scale an array of size [192,4000] to a specific range. I would like each row (1:192) to be rescaled to a specific range e.g. (-840,840). I run a very simple code:
import numpy as np
from sklearn import preprocessing as sp
sample_mat = np.random.randint(-840,840, size=(192, 4000))
scaler = sp.MinMaxScaler(feature_range=(-840,840))
scaler = scaler.fit(sample_mat)
scaled_mat= scaler.transform(sample_mat)
This messes up the range of my matrix, even when the max and min of my original matrix are exactly the same as the target range. I can't figure out what is wrong. Any ideas?
You can do this manually.
It is a linear transformation of the minmax normalized data.
interval_min = -840
interval_max = 840
scaled_mat = (sample_mat - np.min(sample_mat)) / (np.max(sample_mat) - np.min(sample_mat)) * (interval_max - interval_min) + interval_min
MinMaxScaler supports a feature_range argument on initialization, which produces output in the given range.
For example, scaler = MinMaxScaler(feature_range=(1, 2)) will yield output in the [1, 2] range.
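MinMaxScaler scales each column (feature) independently, which would explain why the rows of the question's matrix do not end up in the requested range. A row-wise version of the manual formula above could look like the sketch below. The `scale_rows` helper is a hypothetical name, and its handling of constant rows is an assumption.

```python
import numpy as np

def scale_rows(mat, interval_min, interval_max):
    """Min-max scale every row of a 2-D array to [interval_min, interval_max]."""
    mat = np.asarray(mat, dtype=float)
    row_min = mat.min(axis=1, keepdims=True)
    row_max = mat.max(axis=1, keepdims=True)
    span = row_max - row_min
    # Constant rows would divide by zero; they are mapped to interval_min.
    span[span == 0] = 1.0
    return (mat - row_min) / span * (interval_max - interval_min) + interval_min

rng = np.random.default_rng(0)
sample_mat = rng.integers(-840, 840, size=(192, 4000))
scaled_mat = scale_rows(sample_mat, -840, 840)

# Every row now spans the full target interval.
print(scaled_mat.min(axis=1)[:3], scaled_mat.max(axis=1)[:3])
```

An alternative that stays with scikit-learn is to transpose the matrix before scaling and transpose the result back, so that each original row is treated as a feature.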
I have trained and dumped this ML model so I can use it anywhere. Besides the score and the predicted values, I also need the predict_proba values.
I can get them, but the problem is that I expected the probabilities to be between 0 and 1, and instead I get something like the output below.
array([[1.00000000e+00, 2.46920929e-12],
[1.00000000e+00, 9.89834607e-11],
[9.99993281e-01, 6.71853451e-06],
...,
[1.22327143e-01, 8.77672857e-01],
[9.99999653e-01, 3.47049875e-07],
[1.00000000e+00, 3.79462343e-10]])
And this is the python code I am using.
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import pickle
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
# dataframe = pd.read_csv("hr_dataset.csv")
dataframe = pd.read_csv("formodel.csv")
dataframe.head(2)
# separate input and target variables
inputs = dataframe.drop('PerformanceRating', axis='columns')
target = dataframe['PerformanceRating']
MaritalStatus_ = LabelEncoder()
JobRole_ = LabelEncoder()
Gender_ = LabelEncoder()
EducationField_ = LabelEncoder()
Department_ = LabelEncoder()
BusinessTravel_ = LabelEncoder()
Attrition_ = LabelEncoder()
OverTime_ = LabelEncoder()
Over18_ = LabelEncoder()
inputs['MaritalStatus_'] = MaritalStatus_.fit_transform(inputs['MaritalStatus'])
inputs['JobRole_'] = JobRole_.fit_transform(inputs['JobRole'])
inputs['Gender_'] = Gender_.fit_transform(inputs['Gender'])
inputs['EducationField_'] = EducationField_.fit_transform(inputs['EducationField'])
inputs['Department_'] = Department_.fit_transform(inputs['Department'])
inputs['BusinessTravel_'] = BusinessTravel_.fit_transform(inputs['BusinessTravel'])
inputs['Attrition_'] = Attrition_.fit_transform(inputs['Attrition'])
inputs['OverTime_'] = OverTime_.fit_transform(inputs['OverTime'])
inputs['Over18_'] = Over18_.fit_transform(inputs['Over18'])
inputs.drop(['MaritalStatus', 'JobRole', 'Attrition' , 'OverTime' , 'EmployeeCount', 'EmployeeNumber',
'Gender', 'EducationField', 'Department', 'BusinessTravel', 'Over18'], axis='columns', inplace=True)
inputsNew = inputs
inputs.head(2)
# inputs = scaled_df
X_train, X_testt, y_train, y_testt = train_test_split(inputs, target, test_size=0.2)
loaded_model = pickle.load(open(filename, 'rb'))
result = loaded_model.score(X_testt, y_testt)
print(result)
loaded_model.predict_proba(inputs) # this produces the result above; repeated below as well
Output produced by loaded_model.predict_proba(inputs):
array([[1.00000000e+00, 2.46920929e-12],
[1.00000000e+00, 9.89834607e-11],
[9.99993281e-01, 6.71853451e-06],
...,
[1.22327143e-01, 8.77672857e-01],
[9.99999653e-01, 3.47049875e-07],
[1.00000000e+00, 3.79462343e-10]])
How can I convert these values or get an output like a percentage? (eg: 12%, 50%, 96%)
loaded_model.predict_proba(inputs) outputs the probability of 1st class as well as 2nd class (as you have 2 classes). That's why you see 2 outputs for each occurrence of the data. The total probability for each occurrence sums up to 1.
If you only care about the probability of the second class, you can fetch it with the line below.
loaded_model.predict_proba(inputs)[:,1]
I am not sure if this is what you are looking for, apologies if I misunderstood your question.
To convert the probability array from decimal to percentage, you can write (loaded_model.predict_proba(inputs)) * 100.
EDIT: The format that is outputted by loaded_model.predict_proba(inputs) is just scientific notation, i.e. all of those numbers are between 0 and 1, but many of them are extremely small probabilities and so are represented in scientific notation.
The reason that you see such small probabilities is that loaded_model.predict_proba(inputs)[:,0] (the first column of the probability array) represents the probabilities of the data belonging to one class, and loaded_model.predict_proba(inputs)[:,1] represents the probabilities of the data belonging to the other class.
In other words, this means that each row of the probability array should add up to 1.
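As a small illustration of the points above, the sketch below uses a few rows copied from the question's output as a stand-in for `loaded_model.predict_proba(inputs)`. It checks that each row sums to 1, prints the array without scientific notation, and formats the second-class probabilities as percentages. The variable names are illustrative only.

```python
import numpy as np

# A few rows copied from the output shown in the question.
proba = np.array([
    [1.00000000e+00, 2.46920929e-12],
    [9.99993281e-01, 6.71853451e-06],
    [1.22327143e-01, 8.77672857e-01],
])

# Each row holds one probability per class, so rows sum to (about) 1.
row_sums = proba.sum(axis=1)
print(row_sums)

# Show fixed-point numbers instead of scientific notation.
np.set_printoptions(suppress=True, precision=4)
print(proba)

# Probability of the second class, formatted as percentages.
percent_labels = [f"{p * 100:.2f}%" for p in proba[:, 1]]
print(percent_labels)
```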
I hope this helps!
Try this if you only want the probability of the predicted (most likely) class for each row, as a percentage.
pred_prob = []
pred_labels = loaded_model.predict_proba(inputs)
for each_pred in pred_labels:
    # keep the highest class probability in each row (the predicted class)
    each_pred_max = max(each_pred)
    pred_prob.append(each_pred_max)
# convert the probabilities to percentages
probability_list = [item*100 for item in pred_prob]
I am trying to pass this training data to sklearn.svm.SVC(), but it raises ValueError: setting an array element with a sequence. when I call clf.fit(v,v2). How should this data be processed before giving it to SVC()?
import numpy as np
from PIL import Image
from sklearn import svm
v = []
for i in range(1,55):
    t = list(Image.open("train/"+str(i)+".png").getdata())
    v.append(t)
v = np.asarray(v)
v2 = np.array(["1","F","9","D","E","E","E","9","0","D","0","3","C","B","F","9","A","E","B","8","A","8","7",
"9","9","3","C","6","1","E","6","6","C","C","F","A","8","0","1","F","F","E","9","4","6","0",
"7","2","D","9","A","C","7","E"])
clf = svm.SVC()
I think you are looking for something like this:
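import glob
import numpy as np
from PIL import Image
from sklearn import svm
filenames = glob.glob('train/*.png')
# scipy.misc.imread was removed in SciPy 1.2, so the images are read with PIL instead
X = [np.asarray(Image.open(each)).flatten() for each in filenames]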
y = ["1","F","9","D","E","E", ...]
model = svm.SVC().fit(X, y)
Notes:
X has the form (n_images, n_pixels) where n_pixels=width*height
y has length n_images (54 in your example)
This is just a start; you should try to feed the classifier more meaningful features than single pixels.
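The shape requirement in the notes can be checked before fitting. The sketch below uses random arrays as stand-ins for the loaded images and a hypothetical `to_feature_matrix` helper. It assumes that one common cause of the "setting an array element with a sequence" error is images of different sizes, which cannot be stacked into a regular 2-D array.

```python
import numpy as np

def to_feature_matrix(images):
    """Stack equally sized images into an (n_images, n_pixels) array."""
    arrays = [np.asarray(img) for img in images]
    shapes = {a.shape for a in arrays}
    if len(shapes) != 1:
        # Ragged input cannot form a regular 2-D feature matrix.
        raise ValueError(f"images have different shapes: {sorted(shapes)}")
    return np.stack([a.ravel() for a in arrays])

# Stand-in for 54 grayscale images of 20x30 pixels each.
rng = np.random.default_rng(0)
images = [rng.integers(0, 256, size=(20, 30), dtype=np.uint8) for _ in range(54)]

X = to_feature_matrix(images)
print(X.shape)  # (54, 600)
```

If the helper raises, resizing or cropping the images to a common size before flattening is one way to get a valid input for the classifier.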