I have a .csv file whose entries look like this:
b0002 ,0,>0.00 ,3,<=0.644 ,<=0.472 ,<=0.690 ,<=0.069672 ,>15.00 ,>21.00 ,>16.00 ,>6.00 ,>16.00 ,>21.00 ,>9.00 ,>11.00 ,>20.00 ,>7.00 ,>4.00 ,>9.00 ,>9.00 ,>13.00 ,>8.00 ,>14.00 ,>3.00 ,"(1.00, 8.00] ",>10.00 ,>9.00 ,>183.00 ,1
I want to use GaussianNB() to classify this. So far I've managed to do that with another CSV of numerical data, but now I want to use this one and I'm stuck.
What's the best way to transform categorical data for a classifier?
This:
from pandas import read_csv
from sklearn.feature_extraction import DictVectorizer

p = read_csv("C:path to\\file.csv")
trainSet = p.iloc[1:20, 2:5]  # rows 1-19 and just 3 attributes
dic = trainSet.transpose().to_dict()
vec = DictVectorizer()
vec.fit_transform(dic)

gives this error:
Traceback (most recent call last):
File "\prova.py", line 23, in <module>
vec.fit_transform(dic)
File "\dict_vectorizer.py", line 142, in fit_transform
return self.transform(X)
File "\\dict_vectorizer.py", line 230, in transform
values.append(dtype(v))
TypeError: float() argument must be a string or a number
The issue is that calling .to_dict() on the transposed DataFrame returns a nested dict (one inner dict per row), while DictVectorizer expects an iterable of flat feature dicts.
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# create a dummy frame
df = pd.DataFrame({'factor': ['a', 'a', 'a', 'b', 'c', 'c', 'c'],
                   'factor1': ['d', 'a', 'd', 'b', 'c', 'd', 'c'],
                   'num': range(1, 8)})

# transpose the dataframe and take the inner dicts from to_dict()
# (note: .T is a property, not a method -- no parentheses)
feats = df.T.to_dict().values()

Dvec = DictVectorizer()
Dvec.fit_transform(feats).toarray()
The solution is to call .values() on the outer dict to get the inner per-row dicts.
Get new feature names from Dvec:
Dvec.get_feature_names()
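Tying this back to the original question, a minimal sketch of feeding the vectorized features to GaussianNB might look like this (the 'label' column is hypothetical, standing in for the last column of the .csv above):

import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import GaussianNB

# same dummy frame as above, plus a hypothetical 'label' column
df = pd.DataFrame({'factor': ['a', 'a', 'a', 'b', 'c', 'c', 'c'],
                   'factor1': ['d', 'a', 'd', 'b', 'c', 'd', 'c'],
                   'num': range(1, 8),
                   'label': [0, 0, 0, 1, 1, 1, 1]})

feats = df.drop('label', axis=1).T.to_dict().values()  # one flat dict per row

vec = DictVectorizer()
X = vec.fit_transform(feats).toarray()  # dense array, as GaussianNB requires

clf = GaussianNB()
clf.fit(X, df['label'])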
I'm writing a script to perform LLoD (limit of detection) analysis for qPCR assays for my lab. I import the relevant columns from the instrument's .csv of data using pandas.read_csv() with the usecols parameter, make a list of the unique values of the RNA quantity/concentration column, and then need to determine the detection rate / hit rate at each given concentration. If the target is detected, the result will be a number; if not, it'll be listed as "TND" or "Undetermined" or some other non-numeric string (it depends on the instrument). So I wrote a function that (should) take a quantity and the dataframe of results and return the probability of detection for that quantity. However, on running the script, I get the following error:
Traceback (most recent call last):
File "C:\Python\llod_custom.py", line 34, in <module>
prop[idx] = hitrate(val, data)
File "C:\Python\llod_custom.py", line 29, in hitrate
df = pd.to_numeric(list[:,1], errors='coerce').isna()
File "C:\Users\wmacturk\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\frame.py", line 3024, in __getitem__
indexer = self.columns.get_loc(key)
File "C:\Users\wmacturk\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\indexes\base.py", line 3080, in get_loc
return self._engine.get_loc(casted_key)
File "pandas\_libs\index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\index.pyx", line 75, in pandas._libs.index.IndexEngine.get_loc
TypeError: '(slice(None, None, None), 1)' is an invalid key
The idea in the line that's throwing the error (df = pd.to_numeric(list[:,1], errors='coerce').isna()) is to change any non-numeric values in the column to NaN, then get a boolean array telling me whether a given row's entry is NaN, so I can count the number of numeric entries with df.sum() later.
I'm sure it's something that would be obvious to anyone who's worked with pandas dataframes, but I haven't used dataframes in Python before, so I'm at a loss. I'm also much more familiar with C and JavaScript, so a language like Python that isn't as rigid can actually be a bit confusing because it's so flexible. Any help would be greatly appreciated.
N.B. the conc column will consist of 5 to 10 different values, each repeated 5-10 times (i.e. 5-10 replicates at each of the 5-10 concentrations); the detect column will contain either a number or a character string in each row -- numbers mean success, strings mean failure... For my purposes the value of the numbers is irrelevant, I only need to know if the target was detected or not for a given replicate. My script (up to this point) follows:
import os
import pandas as pd
import numpy as np
import statsmodels as sm
from scipy.stats import norm
from tkinter import filedialog
from tkinter import *

# initialize tkinter
root = Tk()
root.withdraw()

# prompt for data file and column headers, then read those columns into a dataframe
print("In the directory prompt, select the .csv file containing data for analysis")
path = filedialog.askopenfilename()
conc = input("Enter the column header for concentration/number of copies: ")
detect = input("Enter the column header for target detection: ")
tnd = input("Enter the value listed when a target is not detected (e.g. \"TND\", \"Undetected\", etc.): ")
data = pd.read_csv(path, usecols=[conc, detect])

# create list of unique values for quantity of RNA, initialize vectors of same length
# to store probabilities and probit scores for each
qtys = data[conc].unique()
prop = probit = [0] * len(qtys)

# Function to get the hitrate/probability of detection for a given quantity
def hitrate(qty, dataFrame):
    list = dataFrame[dataFrame.iloc[:, 0] == qty]
    df = pd.to_numeric(list[:, 1], errors='coerce').isna()
    return (len(df) - (len(df) - df.sum())) / len(df)

# iterate over quantities to calculate the corresponding probability of detection
# and its associated probit score
for idx, val in enumerate(qtys):
    prop[idx] = hitrate(val, data)
    probit[idx] = norm.ppf(hitrate(val, data))

# create an array of the quantities with their associated probabilities & probit scores
hitTable = np.vstack([qtys, prop, probit])
A sample dataframe can be created with:
d = {'qty':[1,1,1,1,1, 10,10,10,10,10, 20,20,20,20,20, 50,50,50,50,50, 100,100,100,100,100], 'result':['TND','TND','TND',5,'TND', 'TND',5,'TND',5,'TND', 5,'TND',5,'TND',5, 5,6,5,5,'TND', 5,5,5,5,5]}
exData = pd.DataFrame(data=d)
Then just use exData as the dataframe data in the original code.
EDIT: I've fixed the problem by tweaking Loic RW's answer slightly. The function hitrate should be:
def hitrate(qty, df):
    t_s = df[df.qty == qty].result
    t_s = t_s.apply(pd.to_numeric, args=('coerce',)).isna()
    return (len(t_s) - t_s.sum()) / len(t_s)
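As a quick sanity check, here's that fixed function run against the exData sample frame defined above (expected values worked out by hand from that sample):

import pandas as pd

# sample data from above: 5 replicates at each of 5 concentrations
d = {'qty': [1]*5 + [10]*5 + [20]*5 + [50]*5 + [100]*5,
     'result': ['TND','TND','TND',5,'TND',
                'TND',5,'TND',5,'TND',
                5,'TND',5,'TND',5,
                5,6,5,5,'TND',
                5,5,5,5,5]}
exData = pd.DataFrame(data=d)

def hitrate(qty, df):
    t_s = df[df.qty == qty].result
    t_s = t_s.apply(pd.to_numeric, args=('coerce',)).isna()
    return (len(t_s) - t_s.sum()) / len(t_s)

print(hitrate(10, exData))   # 0.4 -- 2 numeric results out of 5
print(hitrate(100, exData))  # 1.0 -- detected in all 5 replicates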
Does the following achieve what you want? I made some assumptions on the structure of your data.
def hitrate(qty, df):
    target_subset = df[df.qty == qty].target
    target_subset = target_subset.apply(pd.to_numeric, args=('coerce',)).isna()
    return 1 - ((target_subset.sum()) / len(target_subset))
If I run the following:
data = pd.DataFrame({'qty': [1, 2, 2, 2, 3],
                     'target': [.5, .8, 'TND', 'Undetermined', .99]})
hitrate(2, data)
I get:
0.33333333333333337
I have a data frame with six categorical columns that I would like to change to categorical codes. I used to use the following:
cat_columns = ['col1', 'col2', 'col3']
df[cat_columns] = df[cat_columns].astype('category')
df[cat_columns] = df[cat_columns].cat.codes
I'm on pandas 1.0.5.
I'm getting the following error:
Traceback (most recent call last):
File "<ipython-input-54-80cc82e5db1f>", line 1, in <module>
train_sample[non_loca_cat_columns].astype('category').cat.codes
File "C:\Users\JORDAN.HOWELL.GITDIR\AppData\Local\Continuum\anaconda3\envs\torch_env\lib\site-packages\pandas\core\generic.py", line 5274, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'cat'
I am not sure how to accomplish what I'm trying to do.
The .cat accessor is not applicable to a DataFrame, so you have to apply it to each column separately, as a Series.
You can use .apply() with a lambda that calls .cat.codes on each column:
df[cat_columns] = df[cat_columns].apply(lambda x: x.cat.codes)
Or loop through the columns and use the .cat accessor:
for col in cat_columns:
    df[col] = df[col].cat.codes
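Putting it together, a minimal end-to-end sketch (with made-up column names and values):

import pandas as pd

df = pd.DataFrame({'col1': ['a', 'b', 'a'],
                   'col2': ['x', 'x', 'y'],
                   'col3': ['low', 'high', 'low']})
cat_columns = ['col1', 'col2', 'col3']

# cast first, then take the codes column by column
df[cat_columns] = df[cat_columns].astype('category')
df[cat_columns] = df[cat_columns].apply(lambda x: x.cat.codes)

print(df)
#    col1  col2  col3
# 0     0     0     1
# 1     1     0     0
# 2     0     1     1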
This very simple piece of code,
# imports...
from lifelines import CoxPHFitter
import pandas as pd
src_file = "Pred.csv"
df = pd.read_csv(src_file, header=0, delimiter=',')
df = df.drop(columns=['score'])
cph = CoxPHFitter()
cph.fit(df, duration_col='Length', event_col='Status', show_progress=True)
produces an error:
Traceback (most recent call last):
  File "C:/Users/.../predictor.py", line 11, in <module>
    cph.fit(df, duration_col='Length', event_col='Status', show_progress=True)
  File "C:\Users\...\AppData\Local\conda\conda\envs\hrpred\lib\site-packages\lifelines\fitters\coxph_fitter.py", line 298, in fit
    self._check_values(df)
  File "C:\Users\...\AppData\Local\conda\conda\envs\hrpred\lib\site-packages\lifelines\fitters\coxph_fitter.py", line 323, in _check_values
    cols = str(list(X.columns[low_var]))
  File "C:\Users\...\AppData\Local\conda\conda\envs\hrpred\lib\site-packages\pandas\core\indexes\base.py", line 1754, in __getitem__
    result = getitem(key)
IndexError: boolean index did not match indexed array along dimension 0; dimension is 88 but corresponding boolean dimension is 76
However, when I print df itself, everything looks all right. As you can see, everything happens inside the library, and the library's own examples work fine.
Without knowing what your data look like: I had the same error, and it was resolved when I removed all but the duration, event, and coefficient column(s) from the pandas df I was using. That is, I had a lot of extra columns in the df that were confusing the Cox PH fitter, since you don't actually specify which coefficients to include as an argument to cph.fit().
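Concretely, something like the following sketch, where 'age' and 'treatment' are hypothetical covariate names standing in for whatever columns you actually want to keep:

import pandas as pd
from lifelines import CoxPHFitter

df = pd.read_csv("Pred.csv", header=0, delimiter=',')

# keep only the duration, event, and covariate columns;
# 'age' and 'treatment' are made-up covariate names
df = df[['Length', 'Status', 'age', 'treatment']]

cph = CoxPHFitter()
cph.fit(df, duration_col='Length', event_col='Status', show_progress=True)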
I am trying to build this converter for one of my personal projects using numpy, and I am getting a MemoryError. I am new to Python. This works fine for small data but breaks when I give it 5 MB of data as input (the data is attached). Here is the code. Could experts point out where the memory is blowing up? Link to the data can be found here.
import numpy as np
import gc as gc

"""
USAGE: convert(data, cols)
data - numpy array of data
cols - tuple of columns to process. These columns should be categorical columns.
IMP: Indexing of columns in data starts with 0. You can't index the last column.
Ex: if you want to index the second col here, then
data
a b c
a b c
x y z
cols=(1,)
if you want to index the 1st and second, then
cols=(0,1)
All 3:
cols=(0,1,2)
You can also skip a numeric column which you don't want to encode, e.g.
cols=(0,2) will skip 1 col
"""

def lookupBuilder(strArray):
    # map each unique string to an integer code starting at 1
    a = np.arange(len(strArray)) + 1
    lookups = {k: v for (k, v) in zip(strArray, a)}
    return lookups

def convert(data, cols):
    for ix, i in enumerate(cols):
        col = data[:, i:i+1]
        lookup_data = lookupBuilder(np.unique(col))
        for idx, value in enumerate(col):
            col[idx] = lookup_data[value[0]]
        np.delete(data, i, 1)
        gc.collect()
        np.insert(data, i, col, axis=1)
    return data

if __name__ == "__main__":
    pass
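For context, usage per the docstring would look something like this toy example (array contents made up, not from the original post; assumes the convert() function above is in scope):

import numpy as np

# small object-dtype array with two categorical columns and one numeric one
data = np.array([['a', '10', 'x'],
                 ['b', '20', 'x'],
                 ['a', '30', 'y']], dtype=object)

# encode columns 0 and 2, skipping the numeric column 1
encoded = convert(data, cols=(0, 2))
print(encoded)
# [[1 '10' 1]
#  [2 '20' 1]
#  [1 '30' 2]]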
Error
Traceback (most recent call last):
File "C:\MLDatabases\python_scripts\MLP.py", line 230, in <module>
data=cc.convert(data,(1,2,3,4,5,6,7,8,9,13,19))
File "C:\MLDatabases\python_scripts\categorical_converter.py", line 49, in convert
np.insert(data,i,col,axis=1)
File "C:\python\lib\site-packages\numpy\lib\function_base.py", line 4906, in insert
new = empty(newshape, arr.dtype, arrorder)
MemoryError
I am an avid user of R, but recently switched to Python for a few different reasons. However, I am struggling a little to run the vector AR model in Python from statsmodels.
Q#1. I get an error when I run this, and I have a suspicion it has something to do with the type of my vector.
import numpy as np
import statsmodels.tsa.api
from statsmodels import datasets
import datetime as dt
import pandas as pd
from pandas import Series
from pandas import DataFrame
import os
df = pd.read_csv('myfile.csv')
speedonly = DataFrame(df['speed'])
results = statsmodels.tsa.api.VAR(speedonly)
Traceback (most recent call last):
File "<pyshell#14>", line 1, in <module>
results = statsmodels.tsa.api.VAR(speedonly)
File "C:\Python27\lib\site-packages\statsmodels\tsa\vector_ar\var_model.py", line 336, in __init__
super(VAR, self).__init__(endog, None, dates, freq)
File "C:\Python27\lib\site-packages\statsmodels\tsa\base\tsa_model.py", line 40, in __init__
self._init_dates(dates, freq)
File "C:\Python27\lib\site-packages\statsmodels\tsa\base\tsa_model.py", line 54, in _init_dates
raise ValueError("dates must be of type datetime")
ValueError: dates must be of type datetime
Now, interestingly, when I run the VAR example from here https://github.com/statsmodels/statsmodels/blob/master/docs/source/vector_ar.rst#id5, it works fine.
I try the VAR model with a third, shorter vector, ts, from Wes McKinney's "Python for Data Analysis," page 293 and it doesn't work.
Okay, so now I'm thinking it's because the vectors are different types:
>>> speedonly.head()
speed
0 559.984
1 559.984
2 559.984
3 559.984
4 559.984
>>> type(speedonly)
<class 'pandas.core.frame.DataFrame'> #DOESN'T WORK
>>> type(data)
<type 'numpy.ndarray'> #WORKS
>>> ts
2011-01-02 -0.682317
2011-01-05 1.121983
2011-01-07 0.507047
2011-01-08 -0.038240
2011-01-10 -0.890730
2011-01-12 -0.388685
>>> type(ts)
<class 'pandas.core.series.TimeSeries'> #DOESN'T WORK
So I convert speedonly to an ndarray... and it still doesn't work. But this time I get another error:
>>> nda_speedonly = np.array(speedonly)
>>> results = statsmodels.tsa.api.VAR(nda_speedonly)
Traceback (most recent call last):
File "<pyshell#47>", line 1, in <module>
results = statsmodels.tsa.api.VAR(nda_speedonly)
File "C:\Python27\lib\site-packages\statsmodels\tsa\vector_ar\var_model.py", line 345, in __init__
self.neqs = self.endog.shape[1]
IndexError: tuple index out of range
Any suggestions?
Q#2. I have exogenous feature variables in my data set that appear to be useful for predictions. Is the above model from statsmodels even the best one to use?
When you give a pandas object to a time-series model, it expects that the index is dates. The error message is improved in the current source (to be released soon).
ValueError: Given a pandas object and the index does not contain dates
In the second case, you're giving a single 1d series to a VAR. VARs are used when you have more than one series, which is why you get the shape error: the model expects a second dimension in your array. We could probably improve the error message here. For a single-series AR model with exogenous variables, you probably want to use sm.tsa.ARMA. Note that there is a known bug in ARMA.predict for models with exogenous variables, to be fixed soon. If you could provide a test case for this, it would be helpful.
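To illustrate the fix for both errors, here's a minimal sketch with made-up data: two series (so the array has a second dimension) on a date index (so the pandas check passes):

import numpy as np
import pandas as pd
import statsmodels.tsa.api

# made-up data: two related series indexed by dates -- what VAR expects
idx = pd.date_range('2011-01-01', periods=100, freq='D')
df = pd.DataFrame({'speed': np.random.randn(100).cumsum(),
                   'accel': np.random.randn(100).cumsum()}, index=idx)

model = statsmodels.tsa.api.VAR(df)
results = model.fit(2)  # fit a VAR with 2 lags
print(results.summary())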