Let's suppose I have the following DataFrame:
Column
0 A - B - C
1 A - B
2 A - C
3 A
4 B
5 C
I want to encode "Column", but each cell may contain multiple classes. Using pandas I can do the following to get the proper encoding output:
df['Column'].str.get_dummies(sep=' - ')
A B C
0 1 1 1
1 1 1 0
2 1 0 1
3 1 0 0
4 0 1 0
5 0 0 1
How can I do the same transformation using scikit-learn?
Another approach is to use the MultiLabelBinarizer class, since it accepts an iterable of label collections as input.
from sklearn.preprocessing import MultiLabelBinarizer

df['Column'] = df['Column'].str.split(' - ')
enc = MultiLabelBinarizer()
enc.fit_transform(df['Column'])
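To get back a labeled frame like the get_dummies output, you can wrap the binarized matrix with the learned classes; a minimal, self-contained sketch:
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({'Column': ['A - B - C', 'A - B', 'A - C', 'A', 'B', 'C']})
enc = MultiLabelBinarizer()
# enc.classes_ holds the sorted label names, one per output column
encoded = pd.DataFrame(
    enc.fit_transform(df['Column'].str.split(' - ')),
    columns=enc.classes_,
    index=df.index,
)
print(encoded)  # matches the get_dummies table above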
Looks like you can do this with CountVectorizer by specifying, via a custom analyzer, how you want to identify token boundaries.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
df = pd.DataFrame({'Column': ['A - B - C', 'A - B', 'A - C', 'A', 'B', 'C']})
vectorizer = CountVectorizer(analyzer=lambda x: x.split(' - '))
X = vectorizer.fit_transform(df['Column'])
X.toarray()
array([[1, 1, 1],
       [1, 1, 0],
       [1, 0, 1],
       [1, 0, 0],
       [0, 1, 0],
       [0, 0, 1]])
vectorizer.get_feature_names()
['A', 'B', 'C']
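Note: get_feature_names was deprecated in scikit-learn 1.0 and removed in 1.2; on newer versions use:
vectorizer.get_feature_names_out()
array(['A', 'B', 'C'], dtype=object)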
I'm trying to aggregate a DataFrame such that for each from / to pair in the mappings table (e.g. .iloc[0], where a maps to b), we look up the f# (feature) columns of both elements in the labels table and count how often each feature-to-feature mapping occurs.
The expected output is given in the output table.
Example: in the output table, the value 4 at row f1, column f2 means there are 4 mappings where the from element has the f1 feature and the to element has the f2 feature. We can deduce these as being a->b, a->c, d->e, and d->g.
Mappings
  from to
0    a  b
1    a  c
2    d  e
3    d  f
4    d  g
Labels
  name  f1  f2  f3
0    a   1   0   0
1    b   0   1   0
2    c   0   1   0
3    d   1   1   0
4    e   0   1   0
5    f   0   0   1
6    g   1   1   0
Output
    f1  f2  f3
f1   1   4   1
f2   1   2   1
f3   0   0   0
Table construction code
import pandas as pd

# dataframe 1 - the mappings
mappings = pd.DataFrame({
    'from': ['a', 'a', 'd', 'd', 'd'],
    'to': ['b', 'c', 'e', 'f', 'g'],
})

# dataframe 2 - the labels
labels = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
    'f1': [1, 0, 0, 1, 0, 0, 1],
    'f2': [0, 1, 1, 1, 1, 0, 1],
    'f3': [0, 0, 0, 0, 0, 1, 0],
})

# dataframe 3 - the expected output
output = pd.DataFrame(
    index=['f1', 'f2', 'f3'],
    data={
        'f1': [1, 1, 0],
        'f2': [4, 2, 0],
        'f3': [1, 1, 0],
    })
First we melt the labels dataframe from columns to rows, so we can easily match on the features. Then we merge these values onto the mappings and finally use crosstab to get the final result:
labels = labels.set_index('name').where(lambda x: x > 0).melt(ignore_index=False).dropna()
df = (
    mappings.merge(labels.add_suffix('_from'), left_on='from', right_on='name')
            .merge(labels.add_suffix('_to'), left_on='to', right_on='name')
)

final = pd.crosstab(index=df['variable_from'], columns=df['variable_to'])
final = (
    final.reindex(index=final.columns, fill_value=0)
         .rename_axis(index=None, columns=None)
).convert_dtypes()
Output
    f1  f2  f3
f1   1   4   1
f2   1   2   1
f3   0   0   0
Note:
melt(ignore_index=False) requires pandas >= 1.1.0
convert_dtypes requires pandas >= 1.0.0
For pandas < 1.1.0 we can use stack instead of melt:
(
    labels.set_index('name')
          .where(lambda x: x > 0)
          .stack()
          .reset_index(level=1)
          .rename(columns={'level_1': 'variable', 0: 'value'})
)
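For intuition, both the melt and the stack variants produce the same reshaped labels frame (up to row order), mapping each name to the features it carries; the values below are derived from the tables above:
print(labels)
#      variable  value
# name
# a          f1    1.0
# d          f1    1.0
# g          f1    1.0
# b          f2    1.0
# c          f2    1.0
# d          f2    1.0
# e          f2    1.0
# g          f2    1.0
# f          f3    1.0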
I am working with a pandas dataframe and trying to concatenate multiple strings and numbers into one string.
This works
df1 = pd.DataFrame({'Col1': ['a', 'b', 'c'], 'Col2': ['a', 'b', 'c']})
df1.apply(lambda x: ', '.join(x), axis=1)
0 a, a
1 b, b
2 c, c
How can I make this work just like df1?
df2 = pd.DataFrame({'Col1': ['a', 'b', 1], 'Col2': ['a', 'b', 1]})
df2.apply(lambda x: ', '.join(x), axis=1)
TypeError: ('sequence item 0: expected str instance, int found', 'occurred at index 2')
Consider the dataframe df
import numpy as np
import pandas as pd

np.random.seed([3,1415])
df = pd.DataFrame(
    np.random.randint(10, size=(3, 3)),
    columns=list('abc')
)
print(df)
a b c
0 0 2 7
1 3 8 7
2 0 6 8
You can use astype(str) ahead of the join
df.astype(str).apply(', '.join, axis=1)
0 0, 2, 7
1 3, 8, 7
2 0, 6, 8
dtype: object
Using a comprehension
pd.Series([', '.join(l) for l in df.values.astype(str).tolist()], df.index)
0 0, 2, 7
1 3, 8, 7
2 0, 6, 8
dtype: object
Another option: stringify everything, append the separator to every cell, sum across each row (string addition concatenates), then strip the trailing separator:
In [75]: df2
Out[75]:
  Col1 Col2 Col3
0    a    a    x
1    b    b    y
2    1    1    2

In [76]: df2.astype(str).add(', ').sum(1).str[:-2]
Out[76]:
0    a, a, x
1    b, b, y
2    1, 1, 2
dtype: object
You have to convert column types to strings.
import pandas as pd
df2 = pd.DataFrame({'Col1': ['a', 'b', 1], 'Col2': ['a', 'b', 1]})
df2.apply(lambda x: ', '.join(x.astype('str')), axis=1)
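On recent pandas versions you can also spell this with agg, which reads a bit more directly; a minimal sketch of the same idea:
import pandas as pd

df2 = pd.DataFrame({'Col1': ['a', 'b', 1], 'Col2': ['a', 'b', 1]})
df2.astype(str).agg(', '.join, axis=1)
0    a, a
1    b, b
2    1, 1
dtype: object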
I have a series of values and I'm looking to compute the pearson correlation with every row of a given table.
How do I do I do that?
Example:
import pandas as pd
v = [-1, 5, 0, 0, 10, 0, -7]
v1 = [1, 0, 0, 0, 0, 0, 0]
v2 = [0, 1, 0, 0, 1, 0, 0]
v3 = [1, 1, 0, 0, 0, 0, 1]
s = pd.Series(v)
df = pd.DataFrame([v1, v2, v3], columns=['a', 'b', 'c', 'd', 'e', 'f', 'g'])
# Here I expect to do df.corrwith(s) - but it won't work as-is
Calculated row by row with Series.corr(), the expected output is:
-0.1666666666666666 # correlation with the first row
0.83914639167827343 # correlation with the second row
-0.35355339059327379 # correlation with the third row
You need the Series to have the same index as the DataFrame's columns so the values align, and you need axis=1 in corrwith for row-wise correlation:
s1 = pd.Series(s.values, index=df.columns)
print (s1)
a -1
b 5
c 0
d 0
e 10
f 0
g -7
dtype: int64
print (df.corrwith(s1, axis=1))
0 -0.166667
1 0.839146
2 -0.353553
dtype: float64
print (df.corrwith(pd.Series(v, index=df.columns), axis=1))
0 -0.166667
1 0.839146
2 -0.353553
dtype: float64
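For contrast, passing the original s (with its default RangeIndex 0..6) should return all NaN, because corrwith aligns on labels and the row labels a..g never match:
print(df.corrwith(s, axis=1))
# 0   NaN
# 1   NaN
# 2   NaN
# dtype: float64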
EDIT:
You can also restrict the correlation to a subset of columns:
cols = ['a','b','e']
print (df[cols])
a b e
0 1 0 0
1 0 1 1
2 1 1 0
print (df[cols].corrwith(pd.Series(v, index=df.columns), axis=1))
0 -0.891042
1 0.891042
2 -0.838628
dtype: float64
This might be useful to those concerned with performance.
I have found this runs in half the time compared to pandas corrwith.
Your data:
import pandas as pd
v = [-1, 5, 0, 0, 10, 0, -7]
v1 = [1, 0, 0, 0, 0, 0, 0]
v2 = [0, 1, 0, 0, 1, 0, 0]
v3 = [1, 1, 0, 0, 0, 0, 1]
df = pd.DataFrame([v1, v2, v3], columns=['a', 'b', 'c', 'd', 'e', 'f', 'g'])
The solution (note that v is not transformed into a series):
from scipy.stats import pearsonr  # scipy.stats.stats is a deprecated import path
s_corrs = df.apply(lambda x: pearsonr(x.values, v)[0], axis=1)
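If you need more speed still, the whole computation vectorizes in NumPy; a minimal sketch (rowwise_pearson is my own helper name), assuming every row has nonzero variance:
import numpy as np
import pandas as pd

def rowwise_pearson(df, v):
    # Pearson r per row: center each row and the vector,
    # then dot product divided by the product of the norms
    X = df.to_numpy(dtype=float)
    y = np.asarray(v, dtype=float)
    Xc = X - X.mean(axis=1, keepdims=True)
    yc = y - y.mean()
    num = Xc @ yc
    den = np.sqrt((Xc ** 2).sum(axis=1) * (yc ** 2).sum())
    return pd.Series(num / den, index=df.index)

rowwise_pearson(df, v)  # matches the pearsonr loop above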
I have a dataset in csv format made up of some text columns (each with a limited set of possible values) and some numeric columns. Is there any way to automatically transform the text columns to numbers (for example: A becomes 0, B becomes 1, and so on) so that the dataset can be converted to an np.array?
This will be later used on scikit-learn, so it needs to be np.array at the end of all the processing.
EDIT: Adding one line of the dataset:
ENABLED;ENABLED;10;MANUAL;ENABLED;ENABLED;1800000;OFF;0.175;5.0;0.13;OFF;NEITHER;ENABLED;-65;2417;"wifi01";65;-75;DISCONNECTED;NO;NO;2621454;432477;3759;2.2436838539123705E-6;
You can apply sklearn.preprocessing.LabelEncoder to each text column. Here is an example:
import pandas as pd
df = pd.DataFrame({'col1': [1, 2, 3, 4, 5],
                   'col2': ['ON', 'ON', 'OFF', 'OFF', 'ON']})
from sklearn.preprocessing import LabelEncoder
lb = LabelEncoder()
df['encoded'] = lb.fit_transform(df.col2)
df
col1 col2 encoded
0 1 ON 1
1 2 ON 1
2 3 OFF 0
3 4 OFF 0
4 5 ON 1
I just added the encoded values in another column, but you can replace the original one. You can also convert the frame into a numpy array (note that as_matrix was removed in pandas 1.0; to_numpy is the modern equivalent):
df.to_numpy()
array([[1, 'ON', 1],
       [2, 'ON', 1],
       [3, 'OFF', 0],
       [4, 'OFF', 0],
       [5, 'ON', 1]], dtype=object)
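Since the question mentions several text columns, here is a minimal sketch that encodes every object-dtype column of a frame in place (assuming df already holds the csv data):
from sklearn.preprocessing import LabelEncoder

# fit a separate encoder per text column; numeric columns pass through
for col in df.select_dtypes(include='object').columns:
    df[col] = LabelEncoder().fit_transform(df[col])

X = df.to_numpy()  # now a purely numeric array, ready for scikit-learn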
Here is how you may encode with numpy. In this example I am just passing a python list:
import numpy as np

alist = ['ON', 'ON', 'OFF', 'OFF', 'ON']
unique_values, y = np.unique(alist, return_inverse=True)
print(unique_values)
print(y)
The results are:
['OFF' 'ON']
[1 1 0 0 1]
In pandas, I have (app_categ_events is a dataframe):
print(app_categ_events.label_id.unique().shape)
print(app_categ_events.category.unique().shape)
Out:
(492,)
(458,)
I want to look at the categories that have more than one label_id (because I thought there was supposed to be a one-to-one mapping).
In R's data.table, I can do:
app_categ_events[, count_rows := .N, by = list(category, label_id)]
# (or something of that sort...)
print(app_categ_events[count_rows > 1])
What’s the best way of doing that in pandas?
We group by 'category' and 'label_id', then use transform to create the 'count_rows' column:
app_categ_events['count_rows'] = (
    app_categ_events.groupby(['category', 'label_id'])['label_id']
                    .transform('count')
)
print(app_categ_events)
#   category  label_id  count_rows
# 0        a         1           2
# 1        a         1           2
# 2        b         2           1
# 3        b         3           1
Now, the equivalent of the data.table code shown in the OP's post would be
print(app_categ_events[app_categ_events.count_rows>1])
#   category  label_id  count_rows
# 0        a         1           2
# 1        a         1           2
data
import pandas as pd

app_categ_events = pd.DataFrame({'category': ['a', 'a', 'b', 'b'],
                                 'label_id': [1, 1, 2, 3]})
You can use groupby filtering to return the desired results.
df = pd.DataFrame({'label_id': [1, 1, 2, 3],
                   'category': ['a', 'b', 'b', 'c']})

df.groupby(['category']).filter(lambda group: len(group) > 1)
  category  label_id
1        b         1
2        b         2
Given:
app_categ_events = pd.DataFrame({'category': ['a', 'a', 'b', 'b'],
                                 'label_id': [1, 1, 2, 3]})
Solution:
# identify categories with more than one distinct label_id
cat_mask = app_categ_events.groupby('category')['label_id'].nunique().gt(1)
cats = cat_mask[cat_mask]
# filter data
app_categ_events[app_categ_events.category.isin(cats.index)]
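For a closer analogue of the data.table .N idiom, a minimal sketch that counts rows per (category, label_id) pair and keeps the pairs seen more than once:
counts = app_categ_events.groupby(['category', 'label_id']).size()
print(counts[counts > 1])
# category  label_id
# a         1           2
# dtype: int64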