How can I apply one-hot encoding only to the columns that hold numeric categorical values? I want to modify the same dataframe, which also has other features with string values. Thanks.
If you've got a dataframe, you can use the pd.get_dummies(...) function.
>>> import pandas as pd
>>> s = pd.Series(list('abca'))
>>> pd.get_dummies(s)
a b c
0 1 0 0
1 0 1 0
2 0 0 1
3 1 0 0
You can check out the docs for more.
There is also an optional columns argument which takes in a list of the columns to turn into dummies.
Here is an SO question pertaining to how to get a list of columns and types.
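As a minimal sketch of the columns argument on a mixed dataframe: the column names grade and city and the list numeric_cat_cols are made up for illustration, and which numeric columns really are categorical is something you have to decide yourself. Assigning the result back to df is the closest get_dummies comes to modifying the same dataframe.

```python
import pandas as pd

# Hypothetical data: "grade" holds numeric codes that are really categories,
# "city" is a string feature that should stay as it is.
df = pd.DataFrame({
    "grade": [1, 2, 3, 1],
    "city": ["Oslo", "Rome", "Oslo", "Lima"],
})

# All numeric columns are candidates; trim this list by hand if some of
# them are true measurements rather than category codes.
numeric_cat_cols = df.select_dtypes(include="number").columns.tolist()

# Only the listed columns are expanded; the rest pass through unchanged.
df = pd.get_dummies(df, columns=numeric_cat_cols)
print(df)
```

The string column city is kept as-is, and grade is replaced by one indicator column per distinct value.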
Related
I have noticed OneHotEncoder fails when a categorical variable column has 6 or more categories.
For instance, I have a TestData.csv file with two columns: Geography and Continent.
Geography's distinct values are France, Spain, Kenya, Botswana, and Nigeria, while Continent's distinct values are Europe and Africa.
My goal is to encode the Geography column using OneHotEncoder. I run the following code to do this:
import numpy as np
import pandas as pd
#Importing the dataset
dataset = pd.read_csv('TestData.csv')
X = dataset.iloc[:,:].values #X is hence a 2-dimensional numpy.ndarray
#Encoding categorical column Geography
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough') #the 0 is the column index for the categorical column we want to encode, in this case Geography
X = np.array(ct.fit_transform(X))
I then print(X) to make sure I get the expected output which I do and it looks like this (also notice the Size of X):
However, if I add one new country to the TestData file, let's say Belgium. We now have 6 distinct countries. And now running the exact same code produces the following:
It fails at the line
X = np.array(ct.fit_transform(X))
As you can see, X is not changed and there is no encoding done. I have tested this multiple times. So it seems like OneHotEncoder can only handle up to 5 different category values.
Is there a parameter that I can change or another method I can do to encode categorical variables with more than 5 values?
PS - I know to remove the dummy variable after the encoding ;)
I am running Python 3.7.7
Thanks!
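I think the issue is with the sparse_threshold parameter of ColumnTransformer. Try setting it to 0 so that the output is always a dense NumPy array. The density of your output is falling below 0.3 (the default value), which prompts ColumnTransformer to switch to a sparse matrix, but the output still contains the string column Continent, and sparse matrices can't hold strings. If density is counted as nonzero cells over total cells, each row has one nonzero dummy plus the Continent value, so five countries give about 2/6 ≈ 0.33 and six give about 2/7 ≈ 0.29, which would explain why the problem starts at the sixth category.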
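A minimal sketch of that suggestion, using a small in-memory array in place of TestData.csv. The six rows below are made up to match the country and continent names in the question; the real file is assumed to have the same two-column layout.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Stand-in for TestData.csv: six distinct countries plus a string Continent column.
X = np.array([
    ["France", "Europe"],
    ["Spain", "Europe"],
    ["Belgium", "Europe"],
    ["Kenya", "Africa"],
    ["Botswana", "Africa"],
    ["Nigeria", "Africa"],
], dtype=object)

ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0])],  # column 0 is Geography
    remainder="passthrough",                           # keep Continent as-is
    sparse_threshold=0,                                # always stack the result densely
)
X_encoded = ct.fit_transform(X)

print(X_encoded.shape)  # (6, 7): six dummy columns plus Continent
```

The encoded columns come first and the passthrough column last, so Continent ends up in the final position.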
Since you are already using pandas, there is another way to do one-hot encoding. @the_martian's solution is a better answer to your question; mine is more of an extended comment.
Preparing example data similar to yours.
import numpy as np
import pandas as pd
a = np.random.choice(['afr','deu','swe','fi','rus','eng','wu'], 40)
b = np.random.choice(['eu','as'], 40)
df = pd.DataFrame({'a':a, 'b':b})
df.head()
Output
a b
0 rus as
1 eng as
2 fi eu
3 swe eu
4 eng eu
You can use get_dummies for one-hot encoding
pd.get_dummies(df, columns=['a'])
Output(clipped)
b a_afr a_deu a_eng a_fi a_rus a_swe a_wu
0 eu 0 0 0 1 0 0 0
1 eu 0 0 0 0 1 0 0
2 as 0 0 0 0 1 0 0
3 eu 0 0 0 1 0 0 0
4 eu 0 0 0 0 0 0 1
5 as 0 0 0 0 0 1 0
...
I have a dataset with a number of numeric variables, and I want to use the KNN method to fill in the missing values. The following code does not fill them in correctly, because some of the filled values are out of range. For example, I have a binary variable, but it gets filled with a floating-point number.
As you can see in the tables below, I get 0.66 instead of 1.
Please advise why the code is wrong.
import pandas as pd
from sklearn.impute import KNNImputer
df = pd.read_csv('sample.csv')
imputer = KNNImputer(n_neighbors=3)
df = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
Data set:
Column 1  Column 2
5         1
NAN       0
NAN       NAN
1         1
Result:
Column 1  Column 2
5         1
3         0
3         0.66
1         1
Column 2 contains only two values (0 and 1). It is a categorical feature, not a numerical one, so you can use the mode to fill the missing values in Column 2.
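A minimal sketch of that idea in pandas, rebuilding the four-row table from the question by hand. Taking the first entry of mode() is an arbitrary tie-break that I am assuming is acceptable; Column 1 is left untouched here.

```python
import numpy as np
import pandas as pd

# The four rows from the question, with NaN for the missing cells.
df = pd.DataFrame({
    "Column 1": [5, np.nan, np.nan, 1],
    "Column 2": [1, 0, np.nan, 1],
})

# mode() ignores NaN and returns the most frequent value(s); take the first.
most_frequent = df["Column 2"].mode().iloc[0]

# Fill only the binary column, so it can never receive a value like 0.66.
df["Column 2"] = df["Column 2"].fillna(most_frequent)
print(df)
```

Because the filled value is always one that already occurs in the column, the binary variable stays binary.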
I would like to transform a table which looks similar to the one below:
X|Y|Z|
1|2|3|
3|5|2|
4|2|1|
The result I want to achieve should look like this:
col|1|2|3|4|5|
X |1|0|1|1|0|
Y |0|2|0|0|1|
Z |1|1|1|0|0|
So, after the transformation, the new columns should be the unique values from the previous table, the new values should be the count of appearances, and the index should hold the old column names.
I got stuck and do not know how to handle this because I am a newbie in Python, so thanks in advance for your support.
Regards,
guddy_7
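Use apply with value_counts, replace missing values with 0, and transpose with T (pd.Series.value_counts is used here because the top-level pd.value_counts is deprecated in recent pandas versions):
df = df.apply(pd.Series.value_counts).fillna(0).astype(int).T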
print (df)
1 2 3 4 5
X 1 0 1 1 0
Y 0 2 0 0 1
Z 1 1 1 0 0
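For a version that can be run on its own, here is a minimal sketch that builds the three-column table from the question by hand; the variable name counts is my own choice.

```python
import pandas as pd

# The X/Y/Z table from the question.
df = pd.DataFrame({"X": [1, 3, 4], "Y": [2, 5, 2], "Z": [3, 2, 1]})

# Count each value per column, fill the gaps with 0, then flip so that
# the old column names become the index.
counts = df.apply(pd.Series.value_counts).fillna(0).astype(int).T
print(counts)
```

Each row of counts sums to 3, the number of rows in the original table, which is a quick sanity check on the result.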
Pandas:
data = data.dropna(axis = 'columns')
I am trying to do something similar with a cuDF dataframe, but the API doesn't offer this functionality.
My current solution is to convert to a pandas dataframe, run the command above, then convert back to cuDF. Is there a better solution?
cuDF now supports column based dropna, so the following will work:
import cudf
df = cudf.DataFrame({'a':[0,1,None], 'b':[None,0,2], 'c':[1,2,3]})
print(df)
a b c
0 0 null 1
1 1 0 2
2 null 2 3
df.dropna(axis='columns')
c
0 1
1 2
2 3
If your cuDF version does not yet implement column-based dropna, you can check the null_count of each column and drop the ones with null_count > 0.
For example, the Gender attribute should be transformed into two attributes, "Genre=M" and "Genre=F".
I need two columns, Male and Female, holding binary values that indicate whether the attribute is present or not.
Method 1: You can use pd.get_dummies(df.Colname), which gives you n new columns (where n is the number of distinct values in that column), each a binary flag indicating whether the row has that value.
Method 2:
You can also use df.Colname.map({'M': 0, 'F': 1})
Method 3:
You can use replace and assign the result back, like df['Colname'] = df['Colname'].replace(['M', 'F'], [1, 0])
The first method is one-hot encoding; the other two are similar to label encoding.
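A minimal sketch comparing the first two methods; the column name Gender, the four sample rows, and the renaming to Male and Female are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical data with a two-valued Gender column.
df = pd.DataFrame({"Gender": ["M", "F", "F", "M"]})

# Method 1: one indicator column per distinct value (one-hot encoding).
# astype(int) gives 0/1 whether pandas returns booleans or small integers.
dummies = (
    pd.get_dummies(df["Gender"])
    .astype(int)
    .rename(columns={"M": "Male", "F": "Female"})
)

# Method 2: a single column of integer codes (label-style encoding).
codes = df["Gender"].map({"M": 0, "F": 1})

print(dummies)
print(codes)
```

If you want the Male and Female columns alongside the original data, pd.concat([df, dummies], axis=1) would join them by index.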
Use the pandas function get_dummies.
get_dummies: Convert categorical variable into dummy/indicator variables. Source.
Example of usage:
s = pd.Series(list('abca'))
pd.get_dummies(s)
Output:
a b c
0 1 0 0
1 0 1 0
2 0 0 1
3 1 0 0