Why are full stops appearing in the printed statement of an array? - python

I am currently learning how to use python and jupyter notebook. I want to create my own dataset. The code for that is as follows (which was taken from this website: How to create my own datasets using in scikit-learn?):
import numpy as np
import csv
from sklearn.datasets.base import Bunch

def load_movies_dataset():
    with open('Documents/movies_dataset.csv') as csv_file:
        data_file = csv.reader(csv_file)
        temp = next(data_file)
        n_samples = int(temp[0])
        n_features = int(temp[1])
        data = np.empty((n_samples, n_features))
        target = np.empty((n_samples,), dtype=np.int)
        for i, sample in enumerate(data_file):
            data[i] = np.asarray(sample[:-1], dtype=np.int)
            target[i] = np.asarray(sample[-1], dtype=np.int)
        return Bunch(data=data, target=target)
This is the csv file that I'm using:
"6","2","numKicks","numKisses"
"3","104","0"
"2","100","0"
"1","81","0"
"101","10","1"
"99","5","1"
"98","2","1"
This example determines if a movie is a romance(0) or action(1) based on the number of kicks and number of kisses.
This is the code I'm using to test the creation of the dataset:
md = load_movies_dataset()
X = md.data
y = md.target
X
And this is the output:
array([[  3., 104.],
       [  2., 100.],
       [  1.,  81.],
       [101.,  10.],
       [ 99.,   5.],
       [ 98.,   2.]])
My question is, why are there full stops in the array display?

They are decimal points. It is an array of floats:
>>> x
array([[  3., 104.],
       [  2., 100.],
       [  1.,  81.],
       [101.,  10.],
       [ 99.,   5.],
       [ 98.,   2.]])
>>> y
array([0, 0, 0, 1, 1, 1])
>>> x.dtype
dtype('float64')
>>>
The default data-type for numpy.empty is numpy.float64.
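If you want the array to print without the trailing full stops, one option (a minimal sketch, assuming the CSV really contains only integers) is to allocate data with an integer dtype, or to cast the finished array:

import numpy as np

X = np.array([[3., 104.], [2., 100.]])   # float64 by default
X_int = X.astype(int)                    # prints as array([[  3, 104], [  2, 100]])

data = np.empty((2, 2), dtype=int)       # or allocate integer storage up front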

Related

python PCA dimensionality reduction

I'm learning to use PCA for dimensionality reduction (Python 3.6), but I get very similar yet different results when using two different methods. Here's my code:
from numpy import *
from sklearn.decomposition import PCA

data_set = [[-1., -2.],
            [-1., 0.],
            [0., 0.],
            [2., 1.],
            [0., 1.]]

# 1
pca_sk = PCA(n_components=1)
newmat = pca_sk.fit_transform(data_set)
print(newmat)

# 2
meanVals = mean(data_set, axis=0)
meanRemoved = data_set - meanVals
covMat = cov(meanRemoved, rowvar=0)
eigVals, eigVects = linalg.eig(mat(covMat))
eigValInd = argsort(eigVals)
eigValInd = eigValInd[:-(1 + 1):-1]
redEigVects = eigVects[:, eigValInd]
lowDDataMat = meanRemoved * redEigVects
print(lowDDataMat)
The first one outputs:
[[ 2.12132034]
 [ 0.70710678]
 [-0.        ]
 [-2.12132034]
 [-0.70710678]]
but the other outputs:
[[-2.12132034]
 [-0.70710678]
 [ 0.        ]
 [ 2.12132034]
 [ 0.70710678]]
Why does this happen?
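For what it's worth, the sign flip is expected: an eigenvector is only defined up to sign, so v and -v describe the same principal axis, and different implementations may return either. A quick check, as a sketch:

import numpy as np

data = np.array([[-1., -2.], [-1., 0.], [0., 0.], [2., 1.], [0., 1.]])
covMat = np.cov(data.T)
vals, vecs = np.linalg.eig(covMat)
v = vecs[:, np.argmax(vals)]             # first principal axis
# Both v and -v satisfy the eigenvalue equation, so either sign is valid:
assert np.allclose(covMat @ v, vals.max() * v)
assert np.allclose(covMat @ (-v), vals.max() * (-v))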

How do I average an irregularly spaced x & y coordinate tensor into a grid with a specific cell size?

I have an algorithm that generates a tensor of irregularly spaced x and y coordinates (ex: torch.size([3600, 2])), and I need to average the points into grid cells of a specific size (ex: 8 by 8). The resulting grid needs to be either an array or tensor.
It's not required, but I would also like to be able to determine if any of the resulting cells have less than a specified number of points in them.
For example, I can graph the tensor using matplotlib's plt.scatter, and it looks like this: [scatter plot image]
In the above example, 100,000 points exist but the number of points can sometimes be in the tens of millions.
I've tried using histogram approaches, and most of them use a specific number of cells vs a specific cell size. Matplotlib can seemingly do it in a graph, but that doesn't help me get an array or tensor.
Edit:
This code might work, if it can be made to work properly.
def grid_torch(x_coords, y_coords, grid_size=(8,8), x_extent=(0., 1.), y_extent=(0., 1.)):
    x_coords = ((x_coords - x_extent[0]) / (x_extent[1] - x_extent[0])) * grid_size[0]
    y_coords = ((y_coords - y_extent[0]) / (y_extent[1] - y_extent[0])) * grid_size[1]
    x_list = []
    for x in range(grid_size[0]):
        x = torch.ones_like(x_coords) * x
        y_list = []
        for y in range(grid_size[1]):
            y = torch.ones_like(y_coords) * y
            in_bounds_x = torch.logical_and(x <= x_coords, x_coords <= x + 1)
            in_bounds_y = torch.logical_and(y <= y_coords, y_coords <= y + 1)
            in_bounds = torch.logical_and(in_bounds_x, in_bounds_y)
            in_bounds_indices = torch.where(in_bounds)
            print(in_bounds_indices)
            y_list.append(in_bounds_indices)
        x_list.append(torch.stack(y_list))
    return torch.stack(x_list)

out = grid_torch(xy_tensor[:,0], xy_tensor[:,1])
print(out.shape)
def create_grid(grid_layout, activ, grid_size=(8,8), min_density=8):
    cells = []
    for x in range(grid_size[0]):
        for y in range(grid_size[1]):
            indices = grid_layout[x, y]
            if len(indices) > min_density:
                average_activation = torch.mean(activ[indices])
                cells.append((average_activation, x, y))
                print(average_activation, x, y)
    return torch.stack(cells)

grid_test = create_grid(out, xy_tensor, grid_size=(8,8))
I think this code would give you a good starting point.
import numpy as np
import torch

def grid_torch(x_coords, y_coords, grid_size=(8,8), x_extent=(0., 1.), y_extent=(0., 1.)):
    # This part converts coordinates to bin numbers (like (2,5), (7,7) etc.)
    x_bin = (((x_coords - x_extent[0]) / (x_extent[1] - x_extent[0])) * grid_size[0]).int()
    y_bin = (((y_coords - y_extent[0]) / (y_extent[1] - y_extent[0])) * grid_size[1]).int()
    counts = torch.zeros(grid_size)
    means = torch.zeros(list(grid_size) + [2])
    for x in range(grid_size[0]):
        for y in range(grid_size[1]):
            # these tensors are 1 where (x_bin == x and y_bin == y), 0 elsewhere
            x_where = 1 * (x_bin == x)
            y_where = 1 * (y_bin == y)
            p_where = (x_where * y_where)
            cnt = p_where.sum()
            counts[x, y] = cnt
            # we'll average the x and y coords separately;
            # you can embed the min_density logic here.
            if cnt > 0:
                means[x, y, 0] = (x_coords * p_where).sum() / p_where.sum()
                means[x, y, 1] = (y_coords * p_where).sum() / p_where.sum()
    return counts, means

# Generate sample points
points = torch.tensor(np.concatenate([
    np.random.normal(loc=0.2, scale=0.1, size=(1000, 2)),
    np.random.normal(loc=0.6, scale=0.1, size=(1000, 2))
]).clip(0, 1)).float()

# plt.scatter(points[:,0], points[:,1])
# plt.grid()

counts, means = grid_torch(points[:,0], points[:,1])
counts
>>>
tensor([[ 47., 114.,  75.,  10.,   0.,   0.,   0.,   0.],
        [102., 204., 141.,  27.,   0.,   0.,   0.,   0.],
        [ 60., 101.,  74.,  16.,   7.,   4.,   1.,   0.],
        [  5.,  17.,   9.,  23.,  72.,  51.,  10.,   0.],
        [  1.,   1.,   4.,  54., 186., 141.,  28.,   3.],
        [  0.,   0.,   3.,  47., 154., 117.,  14.,   0.],
        [  0.,   0.,   0.,   9.,  37.,  24.,   4.,   0.],
        [  0.,   0.,   0.,   2.,   0.,   1.,   0.,   0.]])
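If the point count gets into the tens of millions, the double loop over cells may become slow. A vectorized variant, sketched below under the same binning assumptions, flattens the 2-D bin index and uses torch.bincount for both the counts and the coordinate sums:

import torch

def grid_torch_vec(x_coords, y_coords, grid_size=(8, 8), x_extent=(0., 1.), y_extent=(0., 1.)):
    gx, gy = grid_size
    # Same bin convention as grid_torch above, clamped so points at the extent
    # maximum land in the last bin rather than out of range.
    x_bin = (((x_coords - x_extent[0]) / (x_extent[1] - x_extent[0])) * gx).long().clamp(0, gx - 1)
    y_bin = (((y_coords - y_extent[0]) / (y_extent[1] - y_extent[0])) * gy).long().clamp(0, gy - 1)
    flat = x_bin * gy + y_bin                                   # one flat cell index per point
    counts = torch.bincount(flat, minlength=gx * gy).reshape(gx, gy).float()
    sum_x = torch.bincount(flat, weights=x_coords, minlength=gx * gy).reshape(gx, gy)
    sum_y = torch.bincount(flat, weights=y_coords, minlength=gx * gy).reshape(gx, gy)
    # clamp avoids division by zero for empty cells (their means stay 0)
    means = torch.stack([sum_x, sum_y], dim=-1) / counts.clamp(min=1).unsqueeze(-1)
    return counts, means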

numpy concatenate not appending new array to empty multidimensional array

I bet I am doing something very simple wrong. I want to start with an empty 2D numpy array and append arrays to it (with dimensions 1 row by 4 columns).
open_cost_mat_train = np.matrix([])
for i in xrange(10):
    open_cost_mat = np.array([i,0,0,0])
    open_cost_mat_train = np.vstack([open_cost_mat_train,open_cost_mat])
my error trace is:
File "/Users/me/anaconda/lib/python2.7/site-packages/numpy/core/shape_base.py", line 230, in vstack
return _nx.concatenate([atleast_2d(_m) for _m in tup], 0)
ValueError: all the input array dimensions except for the concatenation axis must match exactly
What am I doing wrong? I have tried append, concatenate, defining the empty 2D array as [[]], as [], array([]) and many others.
You need to reshape your original matrix so that the number of columns matches the appended arrays:
open_cost_mat_train = np.matrix([]).reshape((0,4))
After which, it gives:
open_cost_mat_train
# matrix([[ 0.,  0.,  0.,  0.],
#         [ 1.,  0.,  0.,  0.],
#         [ 2.,  0.,  0.,  0.],
#         [ 3.,  0.,  0.,  0.],
#         [ 4.,  0.,  0.,  0.],
#         [ 5.,  0.,  0.,  0.],
#         [ 6.,  0.,  0.,  0.],
#         [ 7.,  0.,  0.,  0.],
#         [ 8.,  0.,  0.,  0.],
#         [ 9.,  0.,  0.,  0.]])
If open_cost_mat_train is large, I would encourage you to replace the for loop with a vectorized algorithm. I will use the following functions to show how efficiency is improved by vectorizing loops:
def fvstack():
    import numpy as np
    np.random.seed(100)
    ocmt = np.matrix([]).reshape((0, 4))
    for i in xrange(10):
        x = np.random.random()
        ocm = np.array([x, x + 1, 10*x, x/10])
        ocmt = np.vstack([ocmt, ocm])
    return ocmt

def fshape():
    import numpy as np
    from numpy.matlib import empty
    np.random.seed(100)
    ocmt = empty((10, 4))
    for i in xrange(ocmt.shape[0]):
        ocmt[i, 0] = np.random.random()
    ocmt[:, 1] = ocmt[:, 0] + 1
    ocmt[:, 2] = 10*ocmt[:, 0]
    ocmt[:, 3] = ocmt[:, 0]/10
    return ocmt
I've assumed that the values that populate the first column of ocmt (shorthand for open_cost_mat_train) are obtained from a for loop, and the remaining columns are a function of the first column, as stated in your comments to my original answer. As real cost data are not available, in the forthcoming example the values in the first column are random numbers, and the second, third and fourth columns are the functions x + 1, 10*x and x/10, respectively, where x is the corresponding value in the first column.
In [594]: fvstack()
Out[594]:
matrix([[  5.43404942e-01,   1.54340494e+00,   5.43404942e+00,   5.43404942e-02],
        [  2.78369385e-01,   1.27836939e+00,   2.78369385e+00,   2.78369385e-02],
        [  4.24517591e-01,   1.42451759e+00,   4.24517591e+00,   4.24517591e-02],
        [  8.44776132e-01,   1.84477613e+00,   8.44776132e+00,   8.44776132e-02],
        [  4.71885619e-03,   1.00471886e+00,   4.71885619e-02,   4.71885619e-04],
        [  1.21569121e-01,   1.12156912e+00,   1.21569121e+00,   1.21569121e-02],
        [  6.70749085e-01,   1.67074908e+00,   6.70749085e+00,   6.70749085e-02],
        [  8.25852755e-01,   1.82585276e+00,   8.25852755e+00,   8.25852755e-02],
        [  1.36706590e-01,   1.13670659e+00,   1.36706590e+00,   1.36706590e-02],
        [  5.75093329e-01,   1.57509333e+00,   5.75093329e+00,   5.75093329e-02]])
In [595]: np.allclose(fvstack(), fshape())
Out[595]: True
In order for the calls to fvstack() and fshape() to produce the same results, the random number generator is initialized in both functions through np.random.seed(100). Notice that the equality test has been performed using numpy.allclose instead of fvstack() == fshape() to avoid the round-off errors associated with floating-point arithmetic.
As for efficiency, the following interactive session shows that initializing ocmt with its final shape is significantly faster than repeatedly stacking rows:
In [596]: import timeit
In [597]: timeit.timeit('fvstack()', setup="from __main__ import fvstack", number=10000)
Out[597]: 1.4884241055042366
In [598]: timeit.timeit('fshape()', setup="from __main__ import fshape", number=10000)
Out[598]: 0.8819408006311278
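A further option, sketched here as a middle ground, is to collect the rows in a plain Python list and stack once at the end; it avoids the per-iteration copy that repeated vstack incurs, without requiring the final shape up front:

import numpy as np

rows = []
for i in range(10):
    rows.append(np.array([i, 0, 0, 0]))
open_cost_mat_train = np.vstack(rows)   # single concatenation at the end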

Python Numpy Error: ValueError: setting an array element with a sequence

I am trying to build a dataset similar to mnist.pkl.gz provided in theano logistic_sgd.py implementation. Following is my code snippet.
import numpy as np
import csv
from PIL import Image
import gzip, cPickle
import theano
from theano import tensor as T

def load_dir_data(csv_file=""):
    print(" reading: %s" %csv_file)
    dataset=[]
    labels=[]
    cr=csv.reader(open(csv_file,"rb"))
    for row in cr:
        print row[0], row[1]
        try:
            image=Image.open(row[0]+'.jpg').convert('LA')
            pixels=[f[0] for f in list(image.getdata())]
            dataset.append(pixels)
            labels.append(row[1])
            del image
        except:
            print("image not found")
    ret_val=np.array(dataset,dtype=theano.config.floatX)
    return ret_val,np.array(labels).astype(float)

def generate_pkl_file(csv_file=""):
    Data, y =load_dir_data(csv_file)
    train_set_x = Data[:1500]
    val_set_x = Data[1501:1750]
    test_set_x = Data[1751:1900]
    train_set_y = y[:1500]
    val_set_y = y[1501:1750]
    test_set_y = y[1751:1900]
    # Divided dataset into 3 parts. I had 2000 images.
    train_set = train_set_x, train_set_y
    val_set = val_set_x, val_set_y
    test_set = test_set_x, val_set_y
    dataset = [train_set, val_set, test_set]
    f = gzip.open('file.pkl.gz','wb')
    cPickle.dump(dataset, f, protocol=2)
    f.close()

if __name__=='__main__':
    generate_pkl_file("trainLabels.csv")
Error Message:
Traceback (most recent call last):
  File "convert_dataset_pkl_file.py", line 50, in <module>
    generate_pkl_file("trainLabels.csv")
  File "convert_dataset_pkl_file.py", line 29, in generate_pkl_file
    Data, y =load_dir_data(csv_file)
  File "convert_dataset_pkl_file.py", line 24, in load_dir_data
    ret_val=np.array(dataset,dtype=theano.config.floatX)
ValueError: setting an array element with a sequence.
The csv file contains two fields: image name and classification label.
When I run this in the Python interpreter, it seems to work for me, as follows; I don't get the "setting an array element with a sequence" error there:
---------python interpreter output----------
image=Image.open('sample.jpg').convert('LA')
pixels=[f[0] for f in list(image.getdata())]
dataset=[]
dataset.append(pixels)
dataset.append(pixels)
dataset.append(pixels)
dataset.append(pixels)
dataset.append(pixels)
b=numpy.array(dataset,dtype=theano.config.floatX)
b
array([[ 2.,  0.,  0., ...,  0.,  0.,  0.],
       [ 2.,  0.,  0., ...,  0.,  0.,  0.],
       [ 2.,  0.,  0., ...,  0.,  0.,  0.],
       [ 2.,  0.,  0., ...,  0.,  0.,  0.],
       [ 2.,  0.,  0., ...,  0.,  0.,  0.]])
Even though I am running the same set of instructions (logically), when I run sample.py I get ValueError: setting an array element with a sequence. I am trying to understand this behavior; any help would be great.
The problem is probably similar to that of this question.
You're trying to create a matrix of pixel values with a row per image. But each image has a different size so the number of pixels in each row is different.
You can't create a "jagged" float typed array in numpy -- every row must be of the same length.
You'll need to pad each row to the length of the largest image.
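A minimal sketch of that padding idea, assuming zero-padding is acceptable for your model (resizing every image to a common size with PIL before calling getdata() is the usual alternative):

import numpy as np

rows = [[2., 0., 0.], [2., 0.], [2., 0., 0., 0.]]        # jagged pixel lists
max_len = max(len(r) for r in rows)
padded = np.zeros((len(rows), max_len), dtype=np.float32)
for i, r in enumerate(rows):
    padded[i, :len(r)] = r                               # left-aligned, zero-padded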

OpenCV Python findHomography srcPoint input not compatible

I am trying to find the transformation of a loaded image to a plane detected off of a marker so that I can transform it to appear perpendicular to the marker plane. I am having trouble putting inputs to cv2.findHomography. Please help me fix the input format for this function.
Here is my code that is causing the issue:
muffinImg = cv2.imread('muffin.jpg',0)
muffinCoords = np.zeros((4,2), np.float32)
muffheight, muffwidth = muffinImg.shape
muffinCoords[0] = (0,muffwidth)
muffinCoords[1] = (muffwidth,muffheight)
muffinCoords[2] = (0,muffheight)
muffinCoords[3] = (0,0)

found, corners = cv2.findChessboardCorners(frameLeft, (5,4), None)
if (found):
    corners2 = cv2.cornerSubPix(grayframeLeft,corners,(11,11),(-1,-1),criteria)
    q = [(0,0)]*4
    q[0] = corners[0][0]
    q[1] = corners[3][0]
    q[2] = corners[19][0]
    q[3] = corners[16][0]
    retvalHomography, mask = cv2.findHomography(q, muffinCoords, cv2.RANSAC)
    cv2.warpPerspective(muffinImg, retvalHomography, (400, 500), muffinImg, cv2.INTER_NEAREST, cv2.BORDER_CONSTANT, 0)
I am getting this error on the cv2.findHomography line: srcPoints is not a numpy array, neither a scalar
Here is what the Microsoft Visual Studio object inspection tool gives me for q and muffin: [inspector screenshots omitted]
EDIT: I have some additional info about the inputs, but I don't see how they are different; maybe I am just making a noob mistake. From here http://opencv-users.1802565.n2.nabble.com/Anyone-have-a-Python2-example-using-estimateRigidTransform-td7322817.html:
One quick answer which might help (depending on what you are trying to do) is that the cv2.findHomography function does work from python. It returns 3x3 rather than 2x3 matrix but you will find coefficients in the bottom row close to either zero or one if the transform really is rigid so slice them off.
a = np.array([0,0,1,0,0,1,1,1], np.float32).reshape(-1,2)  # small square
b = a*2     # scale x2
b += 0.5    # translate across and down
H, matches = cv2.findHomography(a, b, cv2.RANSAC)
Your variable q is not a numpy array. Try converting it to an array before passing it to cv2.findHomography().
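As a minimal sketch, something like this should satisfy the type check (float32 is the dtype OpenCV expects for point arrays):

import numpy as np

q = np.array(q, dtype=np.float32)   # list of 4 corner points -> (4, 2) float32 array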
I don't know the cv2 API well enough to be sure, but I think you should change this:
q = [(0,0)]*4
q[0] = corners[0][0]
q[1] = corners[3][0]
q[2] = corners[19][0]
q[3] = corners[16][0]
to something like this:
q = np.zeros((4,2), dtype=np.float32)
q[0] = corners[0][0]
q[1] = corners[3][0]
q[2] = corners[19][0]
q[3] = corners[16][0]
After a brief look at the cv2 docs, I think corners is an array with shape n x 2, so those assignments don't make much sense to me. corners[0][0] (which could be written more succinctly as corners[0, 0]) is the first coordinate of the first corner, i.e. corners[0][0] is a scalar. Why are you assigning only the first coordinate to q[0]? What is the intent of that code? I suspect it could be simplified to:
q = corners[[0, 3, 19, 16]]
For example:
In [12]: corners = np.arange(40).reshape(20,2).astype(np.float32)

In [13]: corners
Out[13]:
array([[  0.,   1.],
       [  2.,   3.],
       [  4.,   5.],
       [  6.,   7.],
       [  8.,   9.],
       [ 10.,  11.],
       [ 12.,  13.],
       [ 14.,  15.],
       [ 16.,  17.],
       [ 18.,  19.],
       [ 20.,  21.],
       [ 22.,  23.],
       [ 24.,  25.],
       [ 26.,  27.],
       [ 28.,  29.],
       [ 30.,  31.],
       [ 32.,  33.],
       [ 34.,  35.],
       [ 36.,  37.],
       [ 38.,  39.]], dtype=float32)

In [14]: q = corners[[0, 3, 19, 16]]

In [15]: q
Out[15]:
array([[  0.,   1.],
       [  6.,   7.],
       [ 38.,  39.],
       [ 32.,  33.]], dtype=float32)
