How do I encode the table below efficiently?
e.g.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.array([[1, 2, 3], [2, 3, 4], [1, 3, 4]]), columns=['col_1', 'col_2', 'col_3'])
   col_1  col_2  col_3
0      1      2      3
1      2      3      4
2      1      3      4
to
   1  2  3  4
0  1  1  1  0
1  0  1  1  1
2  1  0  1  1
Here's one way -
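def hotencode(df):
    # Map every cell to the position of its value among the sorted unique values
    unq, idx = np.unique(df.to_numpy(), return_inverse=True)
    col_idx = idx.reshape(df.shape)
    # One output column per unique value, one output row per input row
    out = np.zeros((len(col_idx), col_idx.max() + 1), dtype=int)
    # Set out[row, value_position] = 1 for every cell in that row
    out[np.arange(len(col_idx))[:, None], col_idx] = 1
    return pd.DataFrame(out, columns=unq, index=df.index)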
Another way with broadcasting would be -
unq = np.unique(df)
out = (df.values[...,None] == unq).any(1).astype(int)
Sample run -
In [81]: df
Out[81]:
   col_1  col_2  col_3
0      1      2      3
1      2      3      4
2      1      3      4
In [82]: hotencode(df)
Out[82]:
   1  2  3  4
0  1  1  1  0
1  0  1  1  1
2  1  0  1  1
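The broadcasting variant above stops at a bare NumPy array. A minimal sketch of how it could be wrapped back into a labelled DataFrame, assuming the same df as in the question (the names mask and encoded are illustrative, not from the original answer):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.array([[1, 2, 3], [2, 3, 4], [1, 3, 4]]),
    columns=["col_1", "col_2", "col_3"],
)

# Sorted unique values across the whole table become the output columns.
unq = np.unique(df.to_numpy())

# Compare every cell with every unique value: shape (rows, cols, n_unique).
# any(axis=1) collapses the original columns, leaving (rows, n_unique).
mask = (df.to_numpy()[..., None] == unq).any(axis=1)

# Illustrative wrapper: reuse the original row index and label columns by value.
encoded = pd.DataFrame(mask.astype(int), columns=unq, index=df.index)
print(encoded)
```

The intermediate boolean array has rows * cols * n_unique elements, so for tables with many distinct values the index-based hotencode function is likely the lighter option.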
Suppose I have a 2*3 dataframe:
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})
A B C
0 1 3 5
1 2 4 6
How can I convert df to a (2*3)*1 DataFrame of the following form? I've tried pd.DataFrame.explode() and pd.wide_to_long(), but neither appears to be the function I'm looking for.
     value
A 0      1
A 1      2
B 0      3
B 1      4
C 0      5
C 1      6
You just need to stack:
df.stack().swaplevel().sort_index()
output:
A  0    1
   1    2
B  0    3
   1    4
C  0    5
   1    6
Or use melt after resetting the index:
df.reset_index().melt(id_vars='index')
output:
index variable value
0 0 A 1
1 1 A 2
2 0 B 3
3 1 B 4
4 0 C 5
5 1 C 6
Alternative outputs
As dataframe:
(df.stack()
   .rename('value')
   .swaplevel()
   .sort_index()
   .to_frame()
)
     value
A 0      1
  1      2
B 0      3
  1      4
C 0      5
  1      6
All as columns:
(df.stack()
   .rename('value')
   .swaplevel()
   .rename_axis(['col1', 'col2'])
   .sort_index()
   .reset_index()
)
col1 col2 value
0 A 0 1
1 A 1 2
2 B 0 3
3 B 1 4
4 C 0 5
5 C 1 6
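As a quick check that the stack and melt routes carry the same data, here is a small sketch that builds both from the question's df and compares the values. The names stacked and melted are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6]})

# Route 1: stack, put the column label first, sort, and name the values.
stacked = df.stack().swaplevel().sort_index().rename("value").to_frame()

# Route 2: melt after exposing the row index as a regular column,
# then move the two key columns back into the index for comparison.
melted = (
    df.reset_index()
      .melt(id_vars="index")
      .set_index(["variable", "index"])
)

print(stacked["value"].tolist())
print(melted["value"].tolist())
```

Both routes list the values column by column; they differ mainly in whether the keys end up in the index (stack) or in ordinary columns (melt).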
I was wondering if there's an easier way to create the variables "freq_t1" and "freq_t2", grouped by id, from the following data:
import numpy as np
import pandas as pd
df = pd.DataFrame({
'id':[1,1,1,2,2,2],
'time':[1,1,2,3,2,2]
})
to
df = pd.DataFrame({
'id':[1,1,1,2,2,2],
'time':[1,1,2,3,2,2],
'freq_t1':[2,2,2,0,0,0],
'freq_t2':[1,1,1,2,2,2]
})
That is, id == 1 has two observations of time == 1, while id == 2 has zero. Similarly, id == 1 has one observation of time == 2, while id == 2 has two.
Use a broadcasted comparison of the "time" column against your selected time values, then groupby and transform to broadcast the per-id sums back to the original rows. Here's an example:
tvals = [1, 2]
(pd.DataFrame(df['time'].values[:,None] == tvals, columns=tvals)
.groupby(df['id'])
.transform('sum')
.astype(int)
.add_prefix('freq_t'))
freq_t1 freq_t2
0 2 1
1 2 1
2 2 1
3 0 2
4 0 2
5 0 2
When tvals = [1, 2, 3], this produces
freq_t1 freq_t2 freq_t3
0 2 1 0
1 2 1 0
2 2 1 0
3 0 2 1
4 0 2 1
5 0 2 1
If you want columns for all t-values, you can also use get_dummies:
pd.get_dummies(df.time).groupby(df.id).transform('sum').add_prefix('freq_t')
freq_t1 freq_t2 freq_t3
0 2 1 0
1 2 1 0
2 2 1 0
3 0 2 1
4 0 2 1
5 0 2 1
Finally, to concatenate the result to df, use pd.concat:
res = pd.get_dummies(df.time).groupby(df.id).transform('sum').add_prefix('freq_t')
pd.concat([df, res], axis=1)
id time freq_t1 freq_t2 freq_t3
0 1 1 2 1 0
1 1 1 2 1 0
2 1 2 2 1 0
3 2 3 0 2 1
4 2 2 0 2 1
5 2 2 0 2 1
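For reference, the broadcasted-comparison answer can be put together as one self-contained script. This is a sketch under the question's data; the intermediate names hits, freq and result are illustrative, and passing index=df.index is an added precaution in case df does not have a default index.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 1, 2, 2, 2], "time": [1, 1, 2, 3, 2, 2]})
tvals = [1, 2]

# One boolean column per selected time value: shape (len(df), len(tvals)).
hits = pd.DataFrame(
    df["time"].to_numpy()[:, None] == tvals, columns=tvals, index=df.index
)

# Sum the hits within each id and broadcast the totals back to every row.
freq = hits.groupby(df["id"]).transform("sum").astype(int).add_prefix("freq_t")

result = pd.concat([df, freq], axis=1)
print(result)
```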
Supposing I have the two DataFrames shown below:
dd = pd.DataFrame([1,0, 3, 0, 5])
0
0 1
1 0
2 3
3 0
4 5
and
df = pd.DataFrame([2,4])
0
0 2
1 4
How can I broadcast the values of df into dd with step = 2 so I end up with
0
0 1
1 2
2 3
3 4
4 5
Another solution:
dd = pd.DataFrame([1, 0, 3, 0, 5])
df = pd.DataFrame([2, 4])
dd.iloc[1::2] = df.values
dd
# Out:
0
0 1
1 2
2 3
3 4
4 5
dd.values[1::2] = df.values
dd now contains:
0
0 1
1 2
2 3
3 4
4 5
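Note that step=2 is used here: array[1::2] means start at the element with index 1 and continue to the end in steps of 2. Assigning through dd.values relies on it returning a writable view of the underlying data, which is not guaranteed (for example with mixed dtypes or with Copy-on-Write enabled).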
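To make the slice semantics concrete, the sketch below lists the positions that 1::2 selects and then assigns through iloc. The positions variable and the length check are illustrative additions, not part of the original answers.

```python
import numpy as np
import pandas as pd

dd = pd.DataFrame([1, 0, 3, 0, 5])
df = pd.DataFrame([2, 4])

# 1::2 starts at position 1 and takes every second position to the end.
positions = np.arange(len(dd))[1::2]
print(positions.tolist())

# Illustrative guard: the slice must select exactly as many rows as df has.
if len(positions) != len(df):
    raise ValueError("slice selects a different number of rows than df provides")

# Positional assignment; to_numpy() avoids index alignment between dd and df.
dd.iloc[positions] = df.to_numpy()
print(dd[0].tolist())
```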
Set df.index to the target positions using range, then assign df into dd:
df.index = range(1, len(dd)+1, 2)[:len(df)]
print (df)
0
1 2
3 4
dd.loc[df.index] = df
print (dd)
0
0 1
1 2
2 3
3 4
4 5
I have a pandas DataFrame
>>> import pandas as pd
>>> df = pd.DataFrame([['a', 2, 3], ['a,b', 5, 6], ['c', 8, 9]])
0 1 2
0 a 2 3
1 a,b 5 6
2 c 8 9
I want to spread the first column into n columns (where n is the number of unique comma-separated values, in this case 3). Each resulting column should be 1 if the value is present and 0 otherwise. The expected result is:
1 2 a c b
0 2 3 1 0 0
1 5 6 1 0 1
2 8 9 0 1 0
I came up with the following code, but it seems a bit circuitous to me.
>>> import re
>>> dfSpread = pd.get_dummies(df[0].str.split(',', expand=True)).\
rename(columns=lambda x: re.sub('.*_','',x))
>>> pd.concat([df.iloc[:,1:], dfSpread], axis = 1)
Is there a built-in function that does this that I wasn't able to find?
Using str.get_dummies:
df.set_index([1,2])[0].str.get_dummies(',').reset_index()
Out[229]:
1 2 a b c
0 2 3 1 0 0
1 5 6 1 1 0
2 8 9 0 0 1
You can use pop + concat here for an alternative version of Wen's answer.
pd.concat([df, df.pop(df.columns[0]).str.get_dummies(sep=',')], axis=1)
1 2 a b c
0 2 3 1 0 0
1 5 6 1 1 0
2 8 9 0 0 1
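A self-contained sketch of the str.get_dummies approach that leaves the original df untouched. It uses drop instead of pop, which is a small substitution of my own; the names dummies and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame([["a", 2, 3], ["a,b", 5, 6], ["c", 8, 9]])

# Split column 0 on commas and build one indicator column per distinct token.
dummies = df[0].str.get_dummies(sep=",")

# Keep the remaining columns and append the indicators; df itself is not modified.
result = pd.concat([df.drop(columns=0), dummies], axis=1)
print(result)
```

The indicator columns come out in sorted order (a, b, c), so the column order differs from the one in the question, while the values match.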
Say you have a multiindex DataFrame
     x  y  z
a 1  0  1  2
  2  3  4  5
b 1  0  1  2
  2  3  4  5
  3  6  7  8
c 1  0  1  2
  2  0  4  6
Now you have another DataFrame which is
col1 col2
0 a 1
1 b 1
2 b 3
3 c 1
4 c 2
How do you split the multiindex DataFrame based on the one above?
Use loc by tuples:
df = df1.loc[df2.set_index(['col1','col2']).index.tolist()]
print (df)
     x  y  z
a 1  0  1  2
b 1  0  1  2
  3  6  7  8
c 1  0  1  2
  2  0  4  6
df = df1.loc[[tuple(x) for x in df2.values.tolist()]]
print (df)
     x  y  z
a 1  0  1  2
b 1  0  1  2
  3  6  7  8
c 1  0  1  2
  2  0  4  6
Or join:
df = df2.join(df1, on=['col1','col2']).set_index(['col1','col2'])
print (df)
           x  y  z
col1 col2
a    1     0  1  2
b    1     0  1  2
     3     6  7  8
c    1     0  1  2
     2     0  4  6
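The join variant can be tried end to end with the sketch below. The question does not show how df1 and df2 were built, so the constructors here are assumptions that reproduce the printed tables.

```python
import pandas as pd

# Assumed construction of the MultiIndex frame shown in the question.
idx = pd.MultiIndex.from_tuples(
    [("a", 1), ("a", 2), ("b", 1), ("b", 2), ("b", 3), ("c", 1), ("c", 2)]
)
df1 = pd.DataFrame(
    [[0, 1, 2], [3, 4, 5], [0, 1, 2], [3, 4, 5], [6, 7, 8], [0, 1, 2], [0, 4, 6]],
    index=idx,
    columns=["x", "y", "z"],
)
df2 = pd.DataFrame({"col1": ["a", "b", "b", "c", "c"], "col2": [1, 1, 3, 1, 2]})

# Match df2's two key columns against the two levels of df1's index,
# then move the keys back into the index.
selected = df2.join(df1, on=["col1", "col2"]).set_index(["col1", "col2"])
print(selected)
```

Because join defaults to a left join, the result keeps one row per row of df2, in df2's order.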
Simply using isin
df[df.index.isin(list(zip(df2['col1'],df2['col2'])))]
Out[342]:
               0  1  2  3
index1 index2
a      1       1  0  1  2
b      1       1  0  1  2
       3       3  6  7  8
c      1       1  0  1  2
       2       2  0  4  6
You can also do this by reindexing with a MultiIndex, using the reindex method: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reindex.html
import pandas as pd
## Recreate your dataframes
tuples = [('a', 1), ('a', 2),
('b', 1), ('b', 2),
('b', 3), ('c', 1),
('c', 2)]
data = [[1, 0, 1, 2],
[2, 3, 4, 5],
[1, 0, 1, 2],
[2, 3, 4, 5],
[3, 6, 7, 8],
[1, 0, 1, 2],
[2, 0, 4, 6]]
idx = pd.MultiIndex.from_tuples(tuples, names=['index1','index2'])
df= pd.DataFrame(data=data, index=idx)
df2 = pd.DataFrame([['a', 1],
['b', 1],
['b', 3],
['c', 1],
['c', 2]])
# Answer Question
idx_subset = pd.MultiIndex.from_tuples([(a, b) for a, b in df2.values], names=['index1', 'index2'])
out = df.reindex(idx_subset)
print(out)
               0  1  2  3
index1 index2
a      1       1  0  1  2
b      1       1  0  1  2
       3       3  6  7  8
c      1       1  0  1  2
       2       2  0  4  6
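The isin and reindex answers give the same rows when every requested key exists, but they behave differently when a key is missing. A small sketch of that difference, where the key ("d", 9) is a hypothetical lookup that is not in the frame and the x, y, z column names follow the question's table:

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [("a", 1), ("a", 2), ("b", 1), ("b", 2), ("b", 3), ("c", 1), ("c", 2)],
    names=["index1", "index2"],
)
df1 = pd.DataFrame(
    [[0, 1, 2], [3, 4, 5], [0, 1, 2], [3, 4, 5], [6, 7, 8], [0, 1, 2], [0, 4, 6]],
    index=idx,
    columns=["x", "y", "z"],
)

# Hypothetical lookup list: ("d", 9) does not exist in df1.
keys = [("a", 1), ("b", 3), ("d", 9)]

# isin keeps only the rows that match and silently ignores the missing key.
by_isin = df1[df1.index.isin(keys)]

# reindex returns one row per requested key and fills the missing one with NaN.
by_reindex = df1.reindex(pd.MultiIndex.from_tuples(keys, names=idx.names))

print(by_isin)
print(by_reindex)
```

Which behaviour is preferable depends on whether a missing key should be dropped quietly or remain visible as an all-NaN row.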