My question is about preprocessing CSV files before feeding them into a neural network.
I want to build a deep neural network for the famous iris dataset using tflearn in python 3.
Dataset: http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
I'm using tflearn to load the CSV file. However, the class column of my dataset contains words such as iris-setosa, iris-versicolor, and iris-virginica.
Neural networks work only with numbers, so I have to find a way to change the classes from words to numbers. Since it is a very small dataset, I was able to do it manually in Excel/a text editor, assigning a number to each class.
But I can't possibly do that for every dataset I work with, so I tried using pandas to perform one-hot encoding:
preprocess_data = pd.read_csv(r"F:\Gautam\.....\Dataset\iris_data.csv")
preprocess_data = pd.get_dummies(preprocess_data)
But now, I can't use this piece of code:
data, labels = load_csv('filepath', categorical_labels=True,
n_classes=3)
because 'filepath' must be a path to a CSV file on disk, not a variable like preprocess_data.
Original Dataset:
Sepal Length Sepal Width Petal Length Petal Width Class
89 5.5 2.5 4.0 1.3 iris-versicolor
85 6.0 3.4 4.5 1.6 iris-versicolor
31 5.4 3.4 1.5 0.4 iris-setosa
52 6.9 3.1 4.9 1.5 iris-versicolor
111 6.4 2.7 5.3 1.9 iris-virginica
Manually modified dataset:
Sepal Length Sepal Width Petal Length Petal Width Class
89 5.5 2.5 4.0 1.3 1
85 6.0 3.4 4.5 1.6 1
31 5.4 3.4 1.5 0.4 0
52 6.9 3.1 4.9 1.5 1
111 6.4 2.7 5.3 1.9 2
Here's my code, which runs perfectly, but only because I modified the dataset manually:
import numpy as np
import pandas as pd
import tflearn
from tflearn.layers.core import input_data, fully_connected
from tflearn.layers.estimator import regression
from tflearn.data_utils import load_csv
data_source = r'F:\Gautam\.....\Dataset\iris_data.csv'
data, labels = load_csv(data_source, categorical_labels=True,
n_classes=3)
network = input_data(shape=[None, 4], name='InputLayer')
network = fully_connected(network, 9, activation='sigmoid', name='Hidden_Layer_1')
network = fully_connected(network, 3, activation='softmax', name='Output_Layer')
network = regression(network, batch_size=1, optimizer='sgd', learning_rate=0.2)
model = tflearn.DNN(network)
model.fit(data, labels, show_metric=True, run_id='iris_dataset', validation_set=0.1, n_epoch=2000)
I want to know if there's any other built-in function in tflearn (or in any other module, for that matter) that I can use to convert the classes from words to numbers. I don't think manually modifying datasets would be productive.
I'm a beginner with tflearn and neural networks, so any help would be appreciated. Thanks.
Use LabelEncoder from the sklearn library:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

df = pd.read_csv('iris_data.csv', header=None)
df.columns = ['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width', 'Class']
enc = LabelEncoder()
df['Class'] = enc.fit_transform(df['Class'])
print(df.head(5))
If you want one-hot encoding, then (with older versions of sklearn) you first need to label-encode and then one-hot-encode. Note that the result has one column per class, so it can't be stored back into the single 'Class' column:
enc = LabelEncoder()
enc_1 = OneHotEncoder()
df['Class'] = enc.fit_transform(df['Class'])
onehot = enc_1.fit_transform(df[['Class']]).toarray()  # shape (n_samples, n_classes)
print(onehot[:5])
These encoders first sort the words alphabetically, then assign them labels. If you want to see which label is assigned to which class, do:
for k in list(enc.classes_):
    print('name :: {}, label :: {}'.format(k, enc.transform([k])))
If you want to save this dataframe as a csv file, do:
df.to_csv('Processed_Irisdataset.csv',sep=',')
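By default, to_csv also writes the row index as an extra first column; pass index=False if you don't want it:
df.to_csv('Processed_Irisdataset.csv', sep=',', index=False)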
The simplest solution is to map by a dict of all possible values:
df['Class'] = df['Class'].map({'iris-versicolor': 1, 'iris-setosa': 0, 'iris-virginica': 2})
print (df)
Sepal Length Sepal Width Petal Length Petal Width Class
0 89 5.5 2.5 4.0 1.3 1
1 85 6.0 3.4 4.5 1.6 1
2 31 5.4 3.4 1.5 0.4 0
3 52 6.9 3.1 4.9 1.5 1
4 111 6.4 2.7 5.3 1.9 2
If you want to generate the dictionary from all unique values:
d = {v:k for k, v in enumerate(df['Class'].unique())}
print (d)
{'iris-versicolor': 0, 'iris-virginica': 2, 'iris-setosa': 1}
df['Class'] = df['Class'].map(d)
print (df)
Sepal Length Sepal Width Petal Length Petal Width Class
0 89 5.5 2.5 4.0 1.3 0
1 85 6.0 3.4 4.5 1.6 0
2 31 5.4 3.4 1.5 0.4 1
3 52 6.9 3.1 4.9 1.5 0
4 111 6.4 2.7 5.3 1.9 2
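One caution with map: any class missing from the dict becomes NaN, so on a new dataset it's worth verifying that everything was mapped:
print(df['Class'].isna().sum())  # should be 0 if every class was in the dict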
I have a dataset where each sample/row is a unique protein and that protein is quantified across 7 features/columns. This dataset includes thousands of proteins and will be classified by machine learning (Support Vector Machine). To give an example of the data:
Protein     Feature 1   Feature 2   Feature 3   Feature 4   Feature 5   Feature 6   Feature 7
Protein 1   10.0        8.7         5.4         28.0        7.9         11.3        5.3
Protein 2   6.5         9.3         4.8         2.7         12.3        14.2        0.7
...         ...         ...         ...         ...         ...         ...         ...
Protein N   8.0         6.8         4.9         6.2         10.0        19.3        4.8
In addition to this dataset, I also have 2 more replicates that are structured exactly the same and contain the same proteins, for a total of 3 replicates. Normally, if I wanted to visualize one of these datasets, I could transform my 7 features using PCA and plot the first two principal components, with each point/protein colored by its classification. However, is there a way I can take my 3 replicates and get some sort of "consensus" PCA plot for them?
I've seen two possible solutions for handling this:
1. Average each feature for each protein to get a single dataset with N rows and 7 columns, then PCA-transform and plot.
2. Concatenate the 3 replicates into a single dataset so that each row now has 7x3 = 21 columns, then PCA-transform and plot.
To clarify what's being said in solution 2, let's call Feature 1 from replicate 1 Feature 1.1, Feature 1 from replicate 2 Feature 1.2, etc.:
Protein     Feature 1.1   ...   Feature 7.1   Feature 1.2   ...   Feature 7.2   Feature 1.3   ...   Feature 7.3
Protein 1   10.0          ...   5.3           8.4           ...   5.9           9.7           ...   5.2
Protein 2   6.5           ...   0.7           6.8           ...   0.8           6.3           ...   0.7
...         ...           ...   ...           ...           ...   ...           ...           ...   ...
Protein N   8.0           ...   4.8           7.9           ...   4.9           8.1           ...   4.7
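In code, the two candidate approaches would look something like this (a minimal sketch with stand-in random data; rep1, rep2, rep3 are placeholders for the real (N, 7) replicate matrices, rows aligned by protein):
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
rep1, rep2, rep3 = [rng.normal(size=(100, 7)) for _ in range(3)]  # stand-ins for the real replicates

# Solution 1: average the replicates, then PCA on the resulting (N, 7) matrix
averaged = np.mean([rep1, rep2, rep3], axis=0)
pcs_avg = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(averaged))

# Solution 2: concatenate the replicates column-wise, then PCA on the (N, 21) matrix
concatenated = np.hstack([rep1, rep2, rep3])
pcs_cat = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(concatenated))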
What I'm looking for is whether there's an accepted solution for this kind of problem, or a solution that's more statistically sound. Thanks in advance!
Background:
I'm currently developing some data profiling in SQL Server. This consists of calculating aggregate statistics on the values in targeted columns.
I'm using SQL for most of the heavy lifting, but calling Python for some of the statistics that SQL is poor at calculating. I'm leveraging the pandas package through SQL Server Machine Learning Services.
However, I'm currently developing this script in Visual Studio. The SQL portion is irrelevant other than as background.
Problem:
My issue is that when I call one of the Python statistics functions, it produces the output as a series whose labels seemingly aren't part of the data, and I cannot access the labels at all. I need the values of these labels, and I need to normalize the data and insert a column of static values describing which calculation was performed on that row.
Constraints:
I will need to normalize each statistic so I can union the datasets and pass the values back to SQL for further processing. All output needs to accept dynamic schemas, so no hardcoding labels etc.
Attempted solutions:
I've tried explicitly coercing the output to a DataFrame. This just results in a frame whose single column is labeled "0".
I've also tried adding static values to the columns. This just adds the target column name as one of the inaccessible labels, and the intended static value as part of the series.
I've searched many times for a solution, and couldn't find anything relevant to the problem.
Code and results below. Using the iris dataset as an example.
###########################
## AGG STATS TEST SCRIPT
##
###########################
#LOAD MODULES
import pandas as pds
#GET SAMPLE DATASET
iris = pds.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
#CENTRAL TENDENCY
mode1 = iris.mode()
stat_mode = pds.melt(mode1)
stat_median = iris.median(numeric_only=True)
stat_median['STAT_NAME'] = 'STAT_MEDIAN' #Try to add a column with the value 'STAT_MEDIAN'
#AGGREGATE STATS
stat_describe = iris.describe()
#PRINT RESULTS
print(iris)
print(stat_median)
print(stat_describe)
###########################
## OUTPUT
##
###########################
>>> #PRINT RESULTS
... print(iris) #ORIGINAL DATASET
...
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
.. ... ... ... ... ...
145 6.7 3.0 5.2 2.3 virginica
146 6.3 2.5 5.0 1.9 virginica
147 6.5 3.0 5.2 2.0 virginica
148 6.2 3.4 5.4 2.3 virginica
149 5.9 3.0 5.1 1.8 virginica
[150 rows x 5 columns]
>>> print(stat_median) #YOU CAN SEE THAT IT INSERTED COLUMN INTO ROW LABELS, VALUE INTO RESULTS SERIES
sepal_length 5.8
sepal_width 3
petal_length 4.35
petal_width 1.3
STAT_NAME STAT_MEDIAN
dtype: object
>>> print(stat_describe) #BASIC DESCRIPTIVE STATS, NEED TO LABEL THE STATISTIC NAMES TO UNPIVOT THIS
sepal_length sepal_width petal_length petal_width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.057333 3.758000 1.199333
std 0.828066 0.435866 1.765298 0.762238
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
>>>
Any assistance is greatly appreciated. Thank you!
I figured it out. There's a function called reset_index that will convert the index to a column, and create a new numerical index.
stat_median = pds.DataFrame(stat_median)
stat_median.reset_index(inplace=True)
stat_median = stat_median.rename(columns={'index' : 'fieldname', 0: 'value'})
stat_median['stat_name'] = 'median'
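The same pattern also unpivots describe(), whose statistic names live in the index (a sketch; the long-format column names are my choice):
stat_describe = iris.describe()
stat_describe = stat_describe.reset_index().rename(columns={'index': 'stat_name'})
# melt to long (stat_name, fieldname, value) rows, matching the shape of the median output
stat_long = stat_describe.melt(id_vars='stat_name', var_name='fieldname', value_name='value')
print(stat_long.head())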
I'm trying to figure out if there is a good way to manage units in my pandas data. For example, I have a DataFrame that looks like this:
length (m) width (m) thickness (cm)
0 1.2 3.4 5.6
1 7.8 9.0 1.2
2 3.4 5.6 7.8
Currently, the measurement units are encoded in column names. Downsides include:
column selection is awkward -- df['width (m)'] vs. df['width']
things will likely break if the units of my source data change
If I wanted to strip the units out of the column names, is there somewhere else that the information could be stored?
There isn't any great way to do this right now; see the GitHub issue here for some discussion.
As a quick hack, you could do something like this, maintaining a separate dict with the units.
In [3]: units = {}
In [5]: newcols = []
...: for col in df:
...: name, unit = col.split(' ')
...: units[name] = unit
...: newcols.append(name)
In [6]: df.columns = newcols
In [7]: df
Out[7]:
length width thickness
0 1.2 3.4 5.6
1 7.8 9.0 1.2
2 3.4 5.6 7.8
In [8]: units['length']
Out[8]: '(m)'
I was searching for this too. Here is what pint and the (experimental) pint-pandas are capable of today:
import pandas as pd
import pint
import pint_pandas
ureg = pint.UnitRegistry()
ureg.Unit.default_format = "~P"
pint_pandas.PintType.ureg.default_format = "~P"
df = pd.DataFrame({
"length": pd.Series([1.2, 7.8, 3.4], dtype="pint[m]"),
"width": pd.Series([3.4, 9.0, 5.6], dtype="pint[m]"),
"thickness": pd.Series([5.6, 1.2, 7.8], dtype="pint[cm]"),
})
print(df.pint.dequantify())
length width thickness
unit m m cm
0 1.2 3.4 5.6
1 7.8 9.0 1.2
2 3.4 5.6 7.8
df['width'] = df['width'].pint.to("inch")
print(df.pint.dequantify())
length width thickness
unit m in cm
0 1.2 133.858268 5.6
1 7.8 354.330709 1.2
2 3.4 220.472441 7.8
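Note that dequantify() moves the units into a second column-header level (the unit row in the output above), so they survive a round trip through to_csv()/read_csv(); to go back the other way, pint-pandas offers df.pint.quantify(level=-1).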
Here are some options:
pandas-units-extension: janpipek/pandas-units-extension: Units extension array for pandas based on astropy
pint-pandas: hgrecco/pint-pandas: Pandas support for pint
You can also extend pandas yourself, following "Extending pandas — pandas 1.3.0 documentation"; a minimal sketch of that route follows.
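For example, a bare-bones version of the do-it-yourself route (illustrative only, not a full extension array: it stores units in DataFrame.attrs and exposes them through a registered accessor):
import pandas as pd

@pd.api.extensions.register_dataframe_accessor("units")
class UnitsAccessor:
    def __init__(self, df):
        self._df = df

    def set(self, mapping):
        # mapping like {"length": "m"}; attrs survives some, but not all, pandas operations
        self._df.attrs["units"] = dict(mapping)
        return self._df

    def get(self, column):
        return self._df.attrs.get("units", {}).get(column)

df = pd.DataFrame({"length": [1.2, 7.8, 3.4], "width": [3.4, 9.0, 5.6]})
df = df.units.set({"length": "m", "width": "m"})
print(df.units.get("length"))  # m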
import numpy as np
from pandas import Series, DataFrame
import pandas as pd
import matplotlib.pyplot as plt
iris_df = DataFrame()
iris_data_path = 'Z:\WORK\Programming\Python\irisdata.csv'
iris_df = pd.read_csv(iris_data_path,index_col=False,header=None,encoding='utf-8')
iris_df.columns = ['sepal length','sepal width','petal length','petal width','class']
print iris_df.columns.values
print iris_df.head()
print iris_df.tail()
irisX = irisdata[['sepal length','sepal width','petal length','petal width']]
print irisX.tail()
irisy = irisdata['class']
print irisy.head()
print irisy.tail()
colors = ['red','green','blue']
markers = ['o','>','x']
irisyn = np.where(irisy=='Iris-setosa',0,np.where(irisy=='Iris-virginica',2,1))
Col0 = irisdata['sepal length']
Col1 = irisdata['sepal width']
Col2 = irisdata['petal length']
Col3 = irisdata['petal width']
plt.figure(num=1,figsize=(16,10))
plt.subplot(2,3.1)
for i in range(len(colors)):
xs = Col0[irisyn==i]
xy = Col1[irisyn==i]
plt.scatter(xs,xy,color=colors[i],marker=markers[i])
plt.legend( ('Iris-setosa', 'Iris-versicolor', 'Iris-virginica') )
plt.xlabel(irisdata.columns[0])
plt.ylabel(irisdata.columns[1])
plt.subplot(2,3,2)
for i in range(len(colors)):
xs = Col0[irisyn==i]
xy = Col2[irisyn==i]
plt.scatter(xs,xy,color=colors[i],marker=markers[i])
plt.xlabel(irisdata.columns[0])
plt.ylabel(irisdata.columns[2])
plt.subplot(2,3,3)
for i in range(len(colors)):
xs = Col0[irisyn==i]
xy = Col3[irisyn==i]
plt.scatter(xs,xy,color=colors[i],marker=markers[i])
plt.xlabel(irisdata.columns[0])
plt.ylabel(irisdata.columns[3])
plt.subplot(2,3,4)
for i in range(len(colors)):
xs = Col1[irisyn==i]
xy = Col2[irisyn==i]
plt.scatter(xs,xy,color=colors[i],marker=markers[i])
plt.xlabel(irisdata.columns[1])
plt.ylabel(irisdata.columns[2])
plt.subplot(2,3,5)
for i in range(len(colors)):
xs = Col1[irisyn==i]
xy = Col3[irisyn==i]
plt.scatter(xs,xy,color=colors[i],marker=markers[i])
plt.xlabel(irisdata.columns[1])
plt.ylabel(irisdata.columns[3])
plt.subplot(2,3,6)
for i in range(len(colors)):
xs = Col2[irisyn==i]
xy = Col3[irisyn==i]
plt.scatter(xs,xy,color=colors[i],marker=markers[i])
plt.xlabel(irisdata.columns[2])
plt.ylabel(irisdata.columns[3])
plt.show()
This is code from Howard Bandy's book Quantitative Technical Analysis. The problem is that it is giving me errors even though I typed it out exactly like it is in the book.
I still get the "Series imported but unused" and "undefined name irisdata" errors/warnings.
This is in the console:
Code:
runfile('Z:/WORK/Programming/Python/Scripts/irisplotpairsdata2.py', wdir='//AMN/annex/WORK/Programming/Python/Scripts')
['sepal length' 'sepal width' 'petal length' 'petal width' 'class']
sepal length sepal width petal length petal width class
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
sepal length sepal width petal length petal width class
145 6.7 3.0 5.2 2.3 Iris-virginica
146 6.3 2.5 5.0 1.9 Iris-virginica
147 6.5 3.0 5.2 2.0 Iris-virginica
148 6.2 3.4 5.4 2.3 Iris-virginica
149 5.9 3.0 5.1 1.8 Iris-virginica
Traceback (most recent call last):
File "<ipython-input-100-f0b2002668bd>", line 1, in <module>
runfile('Z:/WORK/Programming/Python/Scripts/irisplotpairsdata2.py', wdir='//AMN/annex/WORK/Programming/Python/Scripts')
File "C:\MyPrograms\Spyder(Python)\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 685, in runfile
execfile(filename, namespace)
File "C:\MyPrograms\Spyder(Python)\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 71, in execfile
exec(compile(scripttext, filename, 'exec'), glob, loc)
File "Z:/WORK/Programming/Python/Scripts/irisplotpairsdata2.py", line 24, in <module>
irisX = irisdata[['sepal length','sepal width','petal length','petal width']]
TypeError: list indices must be integers, not list
Obviously, the program does not run.
I'm using Spyder with Python 2.7, which is the platform he was using in the book.
Thanks for any insight.
Well, Python is not wrong. You imported Series but never used it, which is a warning and does not cause a crash. The crash happens because you are dereferencing a variable, irisdata, which was never defined. (Ctrl+F irisdata in your code and take a look.) Judging by your code, irisdata probably needs to contain the parsed data of Z:\WORK\Programming\Python\irisdata.csv, doesn't it? So you need to parse that out and assign it to irisdata. See this post
e.g.:
import csv
...
irisdata = list(csv.reader(open(iris_data_path, 'rb')))
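That said, since the script already parses the file into a pandas DataFrame (iris_df), and the later irisdata[['sepal length', ...]] indexing expects a DataFrame rather than a list of lists, the simpler fix is probably just:
irisdata = iris_df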
I have a DataFrame (df) in pandas with four columns, and I want a new column to represent the mean of these four columns: df['mean'] = df.mean(1)
1 2 3 4 mean
NaN NaN NaN NaN NaN
5.9 5.4 2.4 3.2 4.225
0.6 0.7 0.7 0.7 0.675
2.5 1.6 1.5 1.2 1.700
0.4 0.4 0.4 0.4 0.400
So far so good. But when I save the results to a csv file this is what I found:
5.9,5.4,2.4,3.2,4.2250000000000005
0.6,0.7,0.7,0.7,0.6749999999999999
2.5,1.6,1.5,1.2,1.7
0.4,0.4,0.4,0.4,0.4
I guess I can force the format in the mean column, but any idea why this is happening?
I am using WinPython with Python 3.3.2 and pandas 0.11.0.
You could use the float_format parameter:
import pandas as pd
import io
content = '''\
1 2 3 4 mean
NaN NaN NaN NaN NaN
5.9 5.4 2.4 3.2 4.225
0.6 0.7 0.7 0.7 0.675
2.5 1.6 1.5 1.2 1.700
0.4 0.4 0.4 0.4 0.400'''
df = pd.read_table(io.StringIO(content), sep=r'\s+')
df.to_csv('/tmp/test.csv', float_format='%g', index=False)
yields
1,2,3,4,mean
,,,,
5.9,5.4,2.4,3.2,4.225
0.6,0.7,0.7,0.7,0.675
2.5,1.6,1.5,1.2,1.7
0.4,0.4,0.4,0.4,0.4
The answers seem correct. Floating point numbers cannot be perfectly represented on our systems. There are bound to be some differences. Read The Floating Point Guide.
>>> a = 5.9+5.4+2.4+3.2
>>> a / 4
4.2250000000000005
As you said, you could always format the results if you want only a fixed number of digits after the decimal point.
>>> "{:.3f}".format(a/4)
'4.225'