I am trying to give those training data to sklearn.svm.SVC() but it returns the error ValueError: setting an array element with a sequence. when I try to clf.fit(v,v2). How do we process this data before giving it to SVC()?
from PIL import Image
from sklearn import svm
for i in xrange(1,55):
t = list(Image.open("train/"+str(i)+".png").getdata())
v.append(t)
v = np.asarray(v)
v2 = np.array(["1","F","9","D","E","E","E","9","0","D","0","3","C","B","F","9","A","E","B","8","A","8","7",
"9","9","3","C","6","1","E","6","6","C","C","F","A","8","0","1","F","F","E","9","4","6","0",
"7","2","D","9","A","C","7","E"])
clf = svm.SVC()
I think you are looking for something like this:
from scipy import misc
import glob
from sklearn import svm
filenames = glob.glob('train/*.png')
X = [misc.imread(each).flatten() for each in filenames]
y = ["1","F","9","D","E","E", ...]
model = svm.SVC().fit(X, y)
Notes:
X has the form (n_images, n_pixels) where n_pixels=width*height
y has length n_images (54 in your example)
This is just a start, you should try to feed the classifier with more meaningful features that single pixels.
Related
hello like a title i try to using synthetic package for Time series GAN
at the first time i was thinking putting integer then output also numerical but it wasn't, output data are decimal number i using ydata-synthetic (https://github.com/ydataai/ydata-synthetic)
here is my code for make data please help me
#Importing the required libs for the exercise
from os import path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from ydata_synthetic.synthesizers import ModelParameters
from ydata_synthetic.preprocessing.timeseries import processed_stock
from ydata_synthetic.synthesizers.timeseries import TimeGAN
import torch
arr_data = np.random.randint(0,600000,(100,1))
#Specific to TimeGANs
#stock_data
seq_len=20
n_seq = 1 #number of columns
hidden_dim=24
gamma=1
noise_dim = 32
dim = 128
batch_size = len(arr_data) - seq_len
log_step = 100
learning_rate = 5e-4
gan_args = ModelParameters(batch_size=batch_size,
lr=learning_rate,
noise_dim=noise_dim,
layers_dim=dim)
lst_temp = []
for i in range(0,len(arr_data) - seq_len):
_x = arr_data[i:i+20]
lst_temp.append(_x)
tens_rand_data = torch.tensor(lst_temp)
lst_rand_data = tens_rand_data.numpy()
synth = TimeGAN(model_parameters=gan_args, hidden_dim=24, seq_len=seq_len, n_seq=n_seq, gamma=1)
synth.train(lst_rand_data, train_steps=10)
synth_data = synth.sample(len(lst_rand_data))
print(synth_data.shape)
cols = ['Car price']
for j, col in enumerate(cols):
df = pd.DataFrame({'Real': lst_rand_data[-1][:, j],'Synthetic': synth_data[-1][:, j]})
df.plot(title = "Car price",secondary_y='Synthetic data', style=['-', '--'])
print(df)
enter image description here
Your input should be processed using a MinMaxScaler before fitting into TimeGAN, and you will always receive decimal output between 0 and 1 due to sigmoid activation on the last layer of its Generator.
You can change your code in 2 ways:
Change your input from integer to decimal range [0,1].
arr_data = np.random.randint(0,600000,(100,1))
into
arr_data = np.random.uniform(0,1,(100,1))
This way your dummy input doesn't need to be scaler since it's already in [0,1].
Use MinMaxScaler to scale your data
from sklearn.preprocessing import MinMaxScaler
arr_data = np.random.randint(0,600000,(100,1))
scaler = MinMaxScaler(feature_range = (0,1))
scaled_data = scaler.fit_transform(arr_data)
...
Please note that you will always receive decimal output from [0,1] when using TimeGAN. Now if you want to inverse synthetic data into integer, consider using inverse transform.
Python t-sne implementation from this resource: https://lvdmaaten.github.io/tsne/
Btw I'm a beginner to scRNA-seq.
What I am trying to do: Use a scRNA-seq data set and run t-SNE on it but with using previously calculated PCAs (I have PCA.score and PCA.load files)
Q1: I should be able to use my selected calculated PCAs in the tSNE, but which file do I use the pca.score or pca.load when running Y = tsne.tsne(X)?
Q2: I've tried removing/replacing parts of the PCA calculating code to attempt to remove PCA preprocessing but it always seems to give an error. What should I change for it to properly use my already PCA data and not calculate PCA from it again?
The piece of PCA processing code is this in its raw form:
def pca(X=np.array([]), no_dims=50):
"""
Runs PCA on the NxD array X in order to reduce its dimensionality to
no_dims dimensions.
"""
print("Preprocessing the data using PCA...")
(n, d) = X.shape
X = X - np.tile(np.mean(X, 0), (n, 1))
(l, M) = X #np.linalg.eig(np.dot(X.T, X))
Y = np.dot(X, M[:, 0:no_dims])
return Y
You should use the PCA score.
As for not running pca, you can just comment out this line:
X = pca(X, initial_dims).real
What I did is to add a parameter do_pca and edit the function such:
def tsne(X=np.array([]), no_dims=2, initial_dims=50, perplexity=30.0,do_pca=True):
"""
Runs t-SNE on the dataset in the NxD array X to reduce its
dimensionality to no_dims dimensions. The syntaxis of the function is
`Y = tsne.tsne(X, no_dims, perplexity), where X is an NxD NumPy array.
"""
# Check inputs
if isinstance(no_dims, float):
print("Error: array X should have type float.")
return -1
if round(no_dims) != no_dims:
print("Error: number of dimensions should be an integer.")
return -1
# Initialize variables
if do_pca:
X = pca(X, initial_dims).real
(n, d) = X.shape
max_iter = 50
[.. rest stays the same..]
Using an example dataset, without commenting out that line:
import numpy as np
from sklearn.manifold import TSNE
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
import sys
import os
from tsne import *
X,y = load_digits(return_X_y=True,n_class=3)
If we run the default:
res = tsne(X=X,initial_dims=20,do_pca=True)
plt.scatter(res[:,0],res[:,1],c=y)
If we pass it a pca :
pc = pca(X)[:,:20]
res = tsne(X=pc,initial_dims=20,do_pca=False)
plt.scatter(res[:,0],res[:,1],c=y)
I wanted to create my own Transformer using scikit-learn FunctionTransformer and followed their example as a dry run. It worked, but then I wanted to take the inverse of that transformation just to see the end result. However, when I tried the inverse_transform, it returned the same thing as the transformation. How do I get the original values? I ask this because I plan on using this transformation to transform a target variable, then make predictions. Those predictions will need be inversely transformed after I predict.
As a side bar, should I fit on y_train and transform on my y_test? Or can I transform y all at once?
My transformer:
import numpy as np
from sklearn.preprocessing import FunctionTransformer
import random
randomlist = []
for i in range(0,100):
n = random.randint(1,100)
randomlist.append(n)
y = pd.Series(randomlist)
y_train = y[:80]
y_test = y[80:]
target_trans = FunctionTransformer(np.log, validate=True, check_inverse = True)
logy_train = target_trans.fit_transform(y_train.values.reshape(-1,1))
logy_test = target_trans.transform(y_test.values.reshape(-1,1))
target_trans.inverse_transform(y_train.values.reshape(-1,1))
Within FunctionTransformer() you not only need to define check_inverse=True but also define the actual inverse function itself.
So for the above,
target_trans = FunctionTransformer(np.log, inverse_func = np.exp
,validate=True, check_inverse = True)
which yields the desired result.
I am trying to make a custom kernel in python. This is my code :
from sklearn import datasets
from sklearn.svm import SVC
from sklearn import svm
from sklearn import metrics
import numpy as np
import matplotlib.pyplot as plt
import PIL
from PIL import Image
import pylab as pl
import math
digits = datasets.load_digits()
X = digits.data[:-200]
Y = digits.target[:-200]
def kernal6(x,y):
d=np.linalg.norm(x-y)
Xn=np.linalg.norm(x)
Yn=np.linalg.norm(y)
return (Xn+Yn-d)/np.sqrt(Xn*Yn)
clf5 = svm.SVC(kernel=kernal6)
clf5.fit(X,Y)
but I keep getting this error :
IndexError: tuple index out of range
You are returning the wrong value. The kernel function should return a matrix. Have a look at this to see an example of a proper kernel function
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets
digits = datasets.load_digits()
X = digits.data[:-200, :2] #You were doing this wrong too
Y = digits.target[:-200]
def my_kernel(x, y):
M = np.array([[2, 0], [0, 1.0]])
return np.dot(np.dot(x, M), y.T) #returns a matrix
def kernal6(x,y):
d=np.linalg.norm(x-y)
Xn=np.linalg.norm(x)
Yn=np.linalg.norm(y)
return (Xn+Yn-d)/np.sqrt(Xn*Yn) #returns a float
print "Testing SVC with my_kernel"
clf = svm.SVC(kernel=my_kernel)
clf.fit(X, Y) #works fine
print "Success!"
print "Testing kernal6"
print "kernal6 direct call:",kernal6(X, X) #will return a result
clf = svm.SVC(kernel=kernal6)
try:
clf.fit(X, Y)#fails
except:
print "Failed to fit with kernal6"
IndexError means that you are trying to access an array/tuple's value of an index that is not defined. The only time where you try to access a tuple (it says tuple index out of range) is when you declare X and Y. Therefore, it must be a problem with the slicing notation. I think that the reason is because the tuple does not have more or equal than 200 elements (the array[:-200] returns len(array) - 200, which may be a negative integer again); however, I cannot run your code because my interpreter throws an error, so I am sorry if I am wrong.
I am following the book Building Machine Learning Systems with Python. After loading the dataset from scipy I need to extract index of all features belonging to setosa. But I am unable to extract. Probably because I am not using a numpy array. can someone please help me in extracting index numbers? Code below
from matplotlib import pyplot as plt
from sklearn.datasets import load_iris
import numpy as np
# We load the data with load_iris from sklearn
data = load_iris()
features = data['data']
feature_names = data['feature_names']
target = data['target']
for t,marker,c in zip(xrange(3),">ox","rgb"):
# We plot each class on its own to get different colored markers
plt.scatter(features[target == t,0], features[target == t,1],
marker=marker, c=c)
plength = features[:, 2]
# use numpy operations to get setosa features
is_setosa = (labels == 'setosa')
# This is the important step:
max_setosa = plength[is_setosa].max()
min_non_setosa = plength[~is_setosa].min()
print('Maximum of setosa: {0}.'.format(max_setosa))
print('Minimum of others: {0}.'.format(min_non_setosa))
Define labels before the problem line.
target_names = data['target_names']
labels = target_names[target]
Now these lines will work fine:
is_setosa = (labels == 'setosa')
setosa_petal_length = plength[is_setosa].
Extra.
Data Bunch from sklearn ( data = load_iris() ) consists of target array with numbers 0-2 which are related to features and means sort of the flower. Using that you can extract all features belonged to setosa (where target equals 0) like:
petal_length = features[:, 2]
setosa_petal_length = petal_length[target == 0]
Confront this with data['target_names'] and you will get two lines on the top which are solution to your question. By the way all arrays from the data are ndarrays from NumPy.