load data from csv into Scikit learn SVM - python

I want to train an SVM to perform classification of samples. I have a csv file with 3 columns with headers: feature 1, feature 2, class label, and 20 rows (= number of samples).
Now I quote from the Scikit-Learn documentation
" As other classifiers, SVC, NuSVC and LinearSVC take as input two arrays: an array X of size [n_samples, n_features] holding the training samples, and an array y of class labels (strings or integers), size [n_samples]:"
I understand that I need to obtain two arrays (one 2D and one 1D) in order to feed data into the SVM. However, I am unable to understand how to obtain the required arrays from the csv file.
I have tried the following code:
import numpy as np
data = np.loadtxt('test.csv', delimiter=',')
print(data)
However, it shows an error:
"ValueError: could not convert string to float: ��ࡱ�"
There are no column headers in the csv. Am I making a mistake in calling np.loadtxt, or should something else be used?
Update:
Here's what my .csv file looks like:
12 122 34
12234 54 23
23 34 23

You passed the param delimiter=',', but your csv is not comma-separated.
So the following works:
In [378]:
data = np.loadtxt(path_to_data)
data
Out[378]:
array([[  1.20000000e+01,   1.22000000e+02,   3.40000000e+01],
       [  1.22340000e+04,   5.40000000e+01,   2.30000000e+01],
       [  2.30000000e+01,   3.40000000e+01,   2.30000000e+01]])
The docs show that the delimiter defaults to None, which treats any whitespace as the delimiter:
delimiter : str, optional
    The string used to separate values. By default, this is any whitespace.
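As a minimal sketch (using an in-memory file in place of the real test.csv), the default whitespace delimiter loads the sample above directly, and slicing then yields the two arrays an SVM classifier expects:

```python
import io
import numpy as np

# Stand-in for the question's whitespace-separated test.csv.
sample = io.StringIO("12 122 34\n12234 54 23\n23 34 23\n")

# delimiter defaults to None, so any whitespace separates the values.
data = np.loadtxt(sample)
print(data.shape)          # (3, 3)

# Split into the two arrays SVC expects: features X and labels y.
X = data[:, :2]            # first two columns -> shape (3, 2)
y = data[:, 2]             # last column -> shape (3,)
print(X.shape, y.shape)
```

From here, `X` and `y` can be passed straight to `SVC().fit(X, y)`.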

The issue was with the csv file rather than with the loadtxt() function. The format in which I saved it was not producing a proper .csv file (I don't know why; maybe I didn't save it at all). There is a way to verify whether the csv file was saved in the right format: open the .csv file in Notepad. If the data has commas between the values, it was saved properly and loadtxt() will work. If it shows gibberish, create the file again and check again.

Related

How do I make my data 1-dimensional? - neural network

When I pass the code below through some other functions (sigmoid/weight functions etc.), I get the error that my data 'must be one dimensional'.
The data is from a csv that is 329 x 31. I have split it, as I need the first column as my 'y' value; the remaining 30 columns and all their rows will be my 'X'. How do I go about making this 1-dimensional for my functions?
Is this section of code, where I process my data, even the issue? Could it be an issue from a later function call? I'm new to Python, so I'm not sure what could be causing it; I was wondering if I converted my data into an array correctly.
df = pd.read_csv('data.csv', header=None)
#splitting dataframe into 70/30 split
trainingdata = df.sample(frac=0.7)
testingdata = df.drop(trainingdata.index)
#splitting very first column to 'y' value
y = trainingdata.loc[:,0]
#splitting rest of columns to 'X' value
X = trainingdata.loc[:,1:]
#printing shape for testing
print(X.shape, y.shape)
If I understand your question correctly, you can flatten the array using flatten(), or you can use reshape(); for more information, read the documentation. Note that y here is a pandas Series rather than a NumPy array, so convert it first:
y = y.to_numpy().flatten()
print(y.ndim)
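To make that concrete, here is a small sketch with made-up numbers (not the asker's 329 x 31 csv) showing that a single-column selection is already 1-D once converted to a NumPy array, and that flatten() collapses a 2-D array:

```python
import numpy as np
import pandas as pd

# Made-up 4 x 3 frame standing in for the csv data (header=None style).
df = pd.DataFrame(np.arange(12).reshape(4, 3))

y = df.loc[:, 0].to_numpy()   # first column as a 1-D array, shape (4,)
X = df.loc[:, 1:].to_numpy()  # remaining columns, shape (4, 2)

flat = X.flatten()            # collapse (4, 2) down to shape (8,)
print(y.ndim, flat.ndim)      # 1 1
```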

Can tf.contrib.layers.sparse_column_with_integerized_feature handle categorical features with multiple inputs within one column?

I'm using TensorFlow and its tf.learn API to create and train a DNNRegressor model. I have an integer feature column that is multivalent (there can be more than one integer value in that column for each row), and I use tf.contrib.layers.sparse_column_with_integerized_feature for this feature column.
Now my question is: what is the right delimiter for the multivalent feature column in the csv file?
For example, suppose I have a csv where col2 is a multivalent feature and is not one-hot:
1, 2, 1:2:3:4, 5
2, 1, 4:5, 6
As you can see, I use ':' to separate the integer feature values in col2, but it seems that's not right; I got this error while running DNNRegressor with this feature column declared as tf.contrib.layers.sparse_column_with_integerized_feature:
'Value passed to parameter 'x' has DataType string not in list of allowed
values: int32, int64, float32, float64'.
I really appreciate your help.
tf.contrib.layers.sparse_column_with_integerized_feature is for int32 or int64 values only, so it won't work exactly as you want.
But tensorflow supports multi-dimensions in numerical columns, so you can work with tf.feature_column.numeric_column and specify the shape that you have. Note that tensorflow will expect that all of those shapes are the same, so you'll need to pad all of your values to a common shape.
The colon ':' delimiter is fine for multivalent columns; here's an example of how to read multiple values into a DataFrame with pandas (the question is about XML, but the same works for CSV). You can pass this data frame into the model.train() function as input_fn.
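A rough sketch of that preprocessing with pandas (the column names here are made up): split col2 on ':' and zero-pad every row to a common width, so all rows share the fixed shape a numeric column requires:

```python
import io
import pandas as pd

# Hypothetical csv where col2 holds ':'-separated integer lists.
raw = io.StringIO("1, 2, 1:2:3:4, 5\n2, 1, 4:5, 6\n")
df = pd.read_csv(raw, header=None, skipinitialspace=True,
                 names=['col0', 'col1', 'col2', 'col3'])

# Split col2 on ':' and zero-pad each row to a common width, since a
# fixed-shape numeric column needs every row to have the same length.
values = [[int(v) for v in s.split(':')] for s in df['col2']]
width = max(len(v) for v in values)
padded = [v + [0] * (width - len(v)) for v in values]
print(padded)   # [[1, 2, 3, 4], [4, 5, 0, 0]]
```

The padded lists can then back a fixed-shape feature column such as tf.feature_column.numeric_column('col2', shape=(width,)).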

How to convert dataframe to 1D array ?

First of all, apologies: I am very new to pandas, scikit-learn and Python, so I am sure I am doing something silly. Let me give a little background.
I am trying to run KNeighborsClassifier from scikit-learn (Python).
Following is my strategy:
#Reading the Training set
data = pd.read_csv('Path_TO_File\\Train_Set.csv', sep=',') # reading CSV File
X = data[['Attribute 1','Attribute 2']]
y = data['Target_Column'] # the output is a Dataframe of single column with many rows
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X,y)
Next, I try to read the test data:
test = pd.read_csv('PATH_TO_FILE\\Test.csv', sep=',')
t = test[['Attribute 1','Attribute 2']]
pred = neigh.predict(t)
actual = test['Target_Column']
Next, I try to check the accuracy with the following call, which throws an error:
accuracy=neigh.score(actual,pred)
ERROR: ValueError: could not convert string to float: N
I checked both actual and pred; they have the following data types and content:
actual
Out[161]:
Target_Column
0 Y
1 N
:
[614 rows x 1 columns]
pred
Out[162]:
array(['Y', 'N', .....'N'], dtype=object)
N.B.: pred has 614 values.
I thought that if I converted the "actual" variable to a 1D array I might be able to execute the function; however, I was not successful.
I think I need to do the following two things, but I was not able to (even after googling):
1) Convert actual into a 1-dimensional array
2) Transpose the 1-dimensional array, since pred has 614 values.
Please let me know how to correct the function.
Thanks in advance!
Raj
Thanks Vivek and Thornhale. Indeed I was doing two things wrong:
1) As pointed out by you guys, I should have been using 1, 0 instead of Y, N.
2) I was giving the wrong parameters to the score function. It should be accuracy = neigh.score(t, actual), where t is the test feature set and actual is the test label information.
You could convert your Series (which is what you get when you do test[COLUMN_NAME]) into an array like so:
actual = np.array(test['Target_Column'])
To then reshape a NumPy array, you would employ this command:
actual.reshape(1, 614) # <- could be the other way around as well
Your main issue, though, is the argument order to score: it expects the feature matrix first and the true labels second (converting the labels to 0/1 is optional; scikit-learn accepts string class labels).
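A minimal end-to-end sketch with made-up data (standing in for Train_Set.csv and Test.csv) showing the corrected argument order for score; the string 'Y'/'N' labels are handled directly:

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Tiny made-up training set standing in for Train_Set.csv.
train = pd.DataFrame({'Attribute 1': [1.0, 2.0, 3.0, 8.0, 9.0, 10.0],
                      'Attribute 2': [1.0, 2.0, 3.0, 8.0, 9.0, 10.0],
                      'Target_Column': ['N', 'N', 'N', 'Y', 'Y', 'Y']})
X = train[['Attribute 1', 'Attribute 2']]
y = train['Target_Column']            # string labels are fine

neigh = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# Made-up test set standing in for Test.csv.
test = pd.DataFrame({'Attribute 1': [1.5, 9.5],
                     'Attribute 2': [1.5, 9.5],
                     'Target_Column': ['N', 'Y']})
t = test[['Attribute 1', 'Attribute 2']]
actual = test['Target_Column']

# score takes (features, true_labels), not (true_labels, predictions).
print(neigh.score(t, actual))         # 1.0
```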

h5py cannot convert element 0 to hsize_t

I have a boatload of images in an hdf5 file that I would like to load and analyse. Each image is 1920x1920 uint16, and loading all of them into memory crashes the computer. I have been told that others work around this by slicing the images: e.g., if the data is 1920x1920x100 (100 images), they read in the first 80 rows of each image, analyse that slice, then move to the next slice. This I can do without problems, but when I try to create a dataset in the hdf5 file, I get a TypeError: Can't convert element 0 ... to hsize_t
I can recreate the problem with this very simplified code:
with h5py.File('h5file.hdf5', 'w') as f:
    data = np.random.randint(100, size=(15, 15, 20))
    data_set = f.create_dataset('data', data, dtype='uint16')
which gives the output:
TypeError: Can't convert element 0 ([[29 50 75...4 50 28 36 13 72]]) to hsize_t
I have also tried omitting the data_set = and the dtype='uint16', but I still get the same error. The code is then:
with h5py.File('h5file.hdf5', 'w') as f:
    data = np.random.randint(100, size=(15, 15, 20))
    f.create_dataset('data', data)
Can anyone give me any hints as to what the problem is?
Cheers!
The second parameter of create_dataset is the shape parameter (see the docs), but you pass the entire array. If you want to initialize the dataset with an existing array, you must specify this with the data keyword, like this:
data_set = f.create_dataset('data', data=data, dtype="uint16")
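With that fix, a small sketch (same toy shape as the question) writes the dataset and then reads it back one row-slice at a time, which is the memory-saving pattern the question describes:

```python
import numpy as np
import h5py

with h5py.File('h5file.hdf5', 'w') as f:
    data = np.random.randint(100, size=(15, 15, 20))
    # The array must be passed via the data= keyword.
    f.create_dataset('data', data=data, dtype='uint16')

with h5py.File('h5file.hdf5', 'r') as f:
    dset = f['data']
    first_rows = dset[:8]    # only these rows are read into memory
print(first_rows.shape)      # (8, 15, 20)
```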

numpy loadtxt not resulting in array

I seem to be hitting a wall with a simple problem. I'm trying to read an array in from a file. The columns are a mix of integers and strings; I'm only interested in columns 0, 2, 3.
import numpy as np
network = np.loadtxt('temp.biflows',skiprows=1, usecols=(0,2,3), delimiter = '\t', dtype=[('ts','i10'), ('sndr','|S14'), ('recr', '|S14')])
print(network.shape)
A sample of the input file; columns are separated by tabs (\t):
1441087368 1441087365 186.251.68.208 186.251.68.145 17 137 137 3 0 150 0
1441087342 1441087341 125.144.214.126 125.144.195.105 17 137 137 2 0 100 0
1441087370 1441087370 186.251.139.178 170.85.175.203 17 35905 161 2 2 760 850
There are actually 30104 lines. The resulting shape is network.shape == (30104,). What I am looking for is for network to be an array with shape (30104, 3).
FWIW my goal is to sort the lines based on the first column (a timestamp).
Any suggestions as to what I might be doing wrong would be greatly appreciated (as well as suggestions for how to do the sort).
You can't create a numpy array with shape (n, 3) where each column has a different type. What you can create (and what you did when you used loadtxt with dtype=[('ts','i10'), ('sndr','|S14'), ('recr', '|S14')]) is create a structured array, where each element in the array is a structure composed of several fields. In your case, you have three fields: one is an integer and two are strings. The array created by loadtxt is a one-dimensional array. Each element in the array is a structure with three fields. You can access the fields (which you can interpret as "columns") as network['ts'], network['sndr'] and network['recr'].
See http://docs.scipy.org/doc/numpy/user/basics.rec.html for more information. There is probably a lot of related information here on SO, too. For example, Access Columns of Numpy Array? Errors Trying to Do by Transpose or by Column Access
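For the sorting goal, np.sort with the order keyword works on structured arrays. A sketch with a three-line in-memory stand-in for the real file (string fields widened to 'U16' so the longest address fits, and 'i8' in place of the question's 'i10'):

```python
import io
import numpy as np

# Three tab-separated lines mimicking columns 0, 2, 3 of the real file.
sample = io.StringIO(
    "1441087368\t186.251.68.208\t186.251.68.145\n"
    "1441087342\t125.144.214.126\t125.144.195.105\n"
    "1441087370\t186.251.139.178\t170.85.175.203\n"
)
network = np.loadtxt(
    sample, delimiter='\t',
    dtype=[('ts', 'i8'), ('sndr', 'U16'), ('recr', 'U16')],
)

print(network['ts'])                 # access a "column" by field name

# Sort the whole structured array by the timestamp field.
network_sorted = np.sort(network, order='ts')
print(network_sorted['ts'])          # [1441087342 1441087368 1441087370]
```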
