I have two datasets in CSV and ARFF format which I have been using with classification models in Weka. I was wondering if these formats can be used in scikit-learn to try other classification methods in Python.
This is what my dataset looks like:
ASSAY_CHEMBLID...MDEN.23...MA,TARGET_TYPE...No...MA,TARGET_TYPE...apol...MA,TARGET_TYPE...ATSm5...MA,TARGET_TYPE...SCH.6...MA,TARGET_TYPE...SPC.6...MA,TARGET_TYPE...SP.3...MA,TARGET_TYPE...MDEN.12...MA,TARGET_TYPE...MDEN.22...MA,TARGET_TYPE...MLogP...MA,TARGET_TYPE...R...MA,TARGET_TYPE...G...MA,TARGET_TYPE...I...MA,ORGANISM...No...MA,ORGANISM...C2SP1...MA,ORGANISM...VC.6...MA,ORGANISM...ECCEN...MA,ORGANISM...khs.aasC...MA,ORGANISM...MDEC.12...MA,ORGANISM...MDEC.13...MA,ORGANISM...MDEC.23...MA,ORGANISM...MDEC.33...MA,ORGANISM...MDEO.11...MA,ORGANISM...MDEN.22...MA,ORGANISM...topoShape...MA,ORGANISM...WPATH...MA,ORGANISM...P...MA,Lij
0.202796,0.426972,0.117596,0.143818,0.072542,0.158172,0.136301,0.007245,0.016986,0.488281,0.300438,0.541931,0.644161,0.048149,0.02002,0,0.503415,0.153457,0.288099,0.186024,0.216833,0.184642,0,0.011592,0.00089,0,0.209406,0
where Lij is my class label (0 or 1). I was wondering if a prior transformation with numpy is needed.
To read ARFF files, you'll need to install liac-arff; see the link for installation details. Once you have that installed, use the following code to read the ARFF file:
import arff
import numpy as np
# read arff data
with open("file.arff") as f:
# load reads the arff db as a dictionary with
# the data as a list of lists at key "data"
dataDictionary = arff.load(f)
f.close()
# extract data and convert to numpy array
arffData = np.array(dataDictionary['data'])
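The dictionary returned by arff.load also carries the attribute names under the key 'attributes', which is handy if you want to keep the Weka feature names around; a minimal sketch using the objects defined above:
# each entry in dataDictionary['attributes'] is a (name, type) pair
attributeNames = [name for name, attrType in dataDictionary['attributes']]
print(attributeNames[:5])  # first few feature names
print(arffData.shape)      # rows x columns; the last column is the class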
There are several ways in which CSV data can be read; I find the easiest is the read_csv function from the pandas module. See the link for details regarding installation.
The code for reading a CSV data file is below:
# read csv data
import pandas as pd
csvData = pd.read_csv("filename.csv",sep=',').values
In either case, you'll have a numpy array with your data. Since the last column represents the classes (target / ground truth / labels), you'll need to separate the data into a feature array X and a target vector y, e.g.:
X = arffData[:, :-1]
y = arffData[:, -1]
where X contains all the data in arffData except for the last column and y contains the last column in arffData
Now you can use any supervised learning binary classifier from scikit-learn.
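For example, a minimal sketch with a decision tree (any other scikit-learn classifier follows the same fit/predict pattern; the split parameters here are purely illustrative):
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# hold out part of the data to estimate generalisation performance;
# cast in case the ARFF loader returned the values as strings
X_train, X_test, y_train, y_test = train_test_split(
    X.astype(float), y.astype(int), test_size=0.3, random_state=0)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))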
Related
So I've been tasked with creating a suitable 2D array to contain all of the data from a CSV with rainfall data for the whole year. In the CSV file, the rows represent the weeks of the year and the columns represent the days of the week.
I'm able to display the data I want using the following code.
import csv
data = list(csv.reader(open("rainfall.csv")))
print(data[1][2])
My issue is I'm not sure how to store this data in a 2D array.
I'm not sure how to go about doing this. Help would be appreciated, thanks!
You could use numpy for that. It seems to me that you have already created a list of lists in data. With that you can directly create a 2D numpy array:
import numpy as np
data_2d = np.array(data)
Or you could even try to directly read the file with numpy:
import numpy as np
# Use the appropriate delimiter here
data_2d = np.genfromtxt("rainfall.csv", delimiter=",")
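If rainfall.csv happens to start with a header row of day names (an assumption about the file layout, not something stated in the question), genfromtxt would need to skip it:
# skip_header=1 drops a single header line before parsing the numbers
data_2d = np.genfromtxt("rainfall.csv", delimiter=",", skip_header=1)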
With pandas (read_csv returns a DataFrame; call .to_numpy() if you want a plain 2D numpy array):
import pandas as pd
# Use the appropriate delimiter here
data_2d = pd.read_csv("rainfall.csv").to_numpy()
I need to load a cell array generated in Matlab into Python. Each element in the cell is a 2D matrix, and the matrix sizes vary.
I tried both scipy.io.loadmat and mat2py.loadmat, but neither gives the desired result (i.e., a list of numpy arrays). With the former, the resulting data is of object dtype, and the latter gives a list but does not maintain the shape of the array elements in the cell.
In MATLAB, save the data as JSON using JSONLab: https://github.com/fangq/jsonlab
or save the data as HDF5 using EasyH5: https://github.com/fangq/easyh5
Then, in Python, import the JSON file using:
import json
with open('mydata.json', 'r') as fid:
    data = json.load(fid, strict=False)
Or import the HDF5 file using:
import h5py
covid19 = h5py.File('mydata.h5', 'r')
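Since the original goal was a list of numpy arrays, you can then walk the datasets in the HDF5 file; a minimal sketch that assumes the cell elements ended up as individual top-level datasets (the exact layout depends on how EasyH5 wrote the file):
import numpy as np
# collect every top-level dataset into a list of numpy arrays
cellArrays = [np.array(covid19[name]) for name in covid19.keys()
              if isinstance(covid19[name], h5py.Dataset)]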
If the exported JSON file contains JData structures, you need to install pyjdata (https://pypi.org/project/jdata/) via
pip install jdata
and then load the .json file using
import jdata as jd
import numpy as np
newdata=jd.load('mydata.json')
Is it possible to retrieve point indices from a PCL point cloud?
I have point cloud data in a txt file with XYZ and some other column information. I use the following code to convert the txt file into a PCL cloud:
import pandas as pd
import numpy as np
import pcl
data = pd.read_csv('data.txt', usecols=[0,1,2], delimiter=' ')
pcl_cloud = pcl.PointCloud()
cloud = pcl_cloud.from_array(np.array(data, dtype=np.float32))
As far as I know, from_array only needs the XYZ columns. After some processing (e.g. filtering), the number of rows in the result will most probably differ from the raw data. Is it possible to know which point numbers remain in the result, so I can join them with the other information from the raw data?
I tried to match points by comparing coordinates, but it doesn't work because the coordinates change slightly during the conversion from double to float.
Any idea? Thank you very much
I just found the answer: use extract indices.
eg:
filter = pcl.RadiusOutlierRemoval(data)
indices = filter.Extract()
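Assuming indices ends up holding the row numbers of the points kept by the filter (the exact return value depends on the python-pcl version), you can then pull the remaining columns back out of the raw table with pandas:
# re-read the raw file with all columns, then select only the surviving rows
raw = pd.read_csv('data.txt', delimiter=' ')
kept = raw.iloc[list(indices)]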
Thanks
I save a number of numpy arrays into an HDF5 file with different names corresponding to different datasets. Assuming I don't know those dataset names, how do I access the saved data after opening the file with h5py? For example:
f = h5py.File('filename', 'w')
f.create_dataset('file1', data=data1)
....
f.close()
F = h5py.File('filename', 'r')
# next, how to read out all the datasets without knowing their names a priori
filenames = list(F.keys())  # contains all the dataset names
data1 = F[filenames[0]][()].astype('float32')  # [()] replaces the deprecated .value
...
See also the post How to know HDF5 dataset name in python
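If the file also contains nested groups, keys() only lists the top level; a minimal sketch using h5py's visititems to collect every dataset regardless of depth (same placeholder filename as above):
import h5py

datasets = {}

def collect(name, obj):
    # called once per object in the file; keep only datasets, not groups
    if isinstance(obj, h5py.Dataset):
        datasets[name] = obj[()]

with h5py.File('filename', 'r') as F:
    F.visititems(collect)

print(list(datasets.keys()))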
I have a big 17 GB JSON file in HDFS. I need to read that file and convert it into a numpy array, which is then passed to a K-Means clustering algorithm. I tried many ways, but the system slows down and I get a memory error or the kernel dies.
The code I tried is:
from hdfs3 import HDFileSystem
import pandas as pd
import numpy as nm
import json
hdfs = HDFileSystem(host='hostname', port=8020)
with hdfs.open('/user/iot_all_valid.json/') as f:
    for line in f:
        data = json.loads(line)
df = pd.DataFrame(data)
dataset = nm.array(df)
I tried using ijson but am still not sure which is the right way to do this faster.
I would stay away from both numpy and pandas, since you will get memory issues in both cases. I'd rather stick with SFrame or the Blaze ecosystem, which are designed specifically to handle this kind of "big data" case. Amazing tools!
Because the data types are going to differ per column, a pandas DataFrame is a more appropriate data structure to keep it in. You can still manipulate the data with numpy functions.
import pandas as pd
data = pd.read_json('/user/iot_all_valid.json', dtype={<express converters for the different types here>})
To avoid the crashing issue, try running the k-means on a small sample of the data set. Make sure that works as expected. Then you can increase the data size until you feel comfortable with the whole data set.
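A minimal sketch of that idea, assuming the DataFrame from read_json above and purely illustrative sample and cluster sizes:
from sklearn.cluster import KMeans

# cluster a small random sample of numeric rows first
sample = data.select_dtypes(include='number').dropna().sample(n=10000, random_state=0)
kmeans = KMeans(n_clusters=8, random_state=0).fit(sample.to_numpy())
print(kmeans.cluster_centers_)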
To deal with a numpy array potentially larger than the available RAM, I would use a memory-mapped numpy array. On my machine, ujson was 3.8 times faster than the builtin json. Assuming rows is the number of lines of JSON:
from hdfs3 import HDFileSystem
import numpy as np
import ujson as json

hdfs = HDFileSystem(host='hostname', port=8020)
rows = int(1e8)
columns = 4
# 'w+' will overwrite any existing output.npy
out = np.memmap('output.npy', dtype='float32', mode='w+', shape=(rows, columns))
with hdfs.open('/user/iot_all_valid.json/') as f:
    for row, line in enumerate(f):
        data = json.loads(line)
        # convert data to numerical array
        out[row] = data
out.flush()
# memmap closes on delete.
del out
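The memmapped array can then be fed to a clustering algorithm that supports incremental fitting, e.g. scikit-learn's MiniBatchKMeans; a minimal sketch with illustrative chunk and cluster sizes:
from sklearn.cluster import MiniBatchKMeans

# reopen the memmap read-only; the shape must match what was written above
X = np.memmap('output.npy', dtype='float32', mode='r', shape=(rows, columns))

mbk = MiniBatchKMeans(n_clusters=8, random_state=0)
chunk = 100000
for start in range(0, rows, chunk):
    # partial_fit processes one chunk at a time, keeping memory use bounded
    mbk.partial_fit(X[start:start + chunk])
print(mbk.cluster_centers_)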