Expand Pandas series into dataframe by unique values - python

I would like to expand a series or dataframe into a sparse matrix based on the unique values of the series. It's a bit hard to explain verbally but an example should be clearer.
First, simpler version - if I start with this:
Idx Tag
0 A
1 B
2 A
3 C
4 B
I'd like to get something like this, where the unique values in the starting series are the column values here (could be 1s and 0s, Boolean, etc.):
Idx A B C
0 1 0 0
1 0 1 0
2 1 0 0
3 0 0 1
4 0 1 0
Second, more advanced version - if I have values associated with each entry, preserving those and filling the rest of the matrix with a placeholder (0, NaN, something else), e.g. starting from this:
Idx Tag Val
0 A 5
1 B 2
2 A 3
3 C 7
4 B 1
And ending up with this:
Idx A B C
0 5 0 0
1 0 2 0
2 3 0 0
3 0 0 7
4 0 1 0
What's a Pythonic way to do this?

Here's how to do it, using pandas.get_dummies(), which was designed specifically for this (often called "one-hot encoding" in ML). I've done it step by step so you can see how it works ;)
>>> df
Idx Tag Val
0 0 A 5
1 1 B 2
2 2 A 3
3 3 C 7
4 4 B 1
>>> pd.get_dummies(df['Tag'])
A B C
0 1 0 0
1 0 1 0
2 1 0 0
3 0 0 1
4 0 1 0
>>> pd.concat([df[['Idx']], pd.get_dummies(df['Tag'])], axis=1)
Idx A B C
0 0 1 0 0
1 1 0 1 0
2 2 1 0 0
3 3 0 0 1
4 4 0 1 0
>>> pd.get_dummies(df['Tag']).to_numpy()
array([[1, 0, 0],
[0, 1, 0],
[1, 0, 0],
[0, 0, 1],
[0, 1, 0]], dtype=uint8)
>>> df[['Val']].to_numpy()
array([[5],
[2],
[3],
[7],
[1]])
>>> pd.get_dummies(df['Tag']).to_numpy() * df[['Val']].to_numpy()
array([[5, 0, 0],
[0, 2, 0],
[3, 0, 0],
[0, 0, 7],
[0, 1, 0]])
>>> pd.DataFrame(pd.get_dummies(df['Tag']).to_numpy() * df[['Val']].to_numpy(), columns=df['Tag'].unique())
A B C
0 5 0 0
1 0 2 0
2 3 0 0
3 0 0 7
4 0 1 0
>>> pd.concat([df, pd.DataFrame(pd.get_dummies(df['Tag']).to_numpy() * df[['Val']].to_numpy(), columns=df['Tag'].unique())], axis=1)
Idx Tag Val A B C
0 0 A 5 5 0 0
1 1 B 2 0 2 0
2 2 A 3 3 0 0
3 3 C 7 0 0 7
4 4 B 1 0 1 0
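As an aside, since the question asks for a Pythonic way: DataFrame.pivot can reach the second result directly, without building dummies at all. A sketch, assuming the same df with Idx as the default index (note pivot requires unique index/Tag pairs, which holds here):

```python
import pandas as pd

# Sample frame from the question (Idx left as the default index)
df = pd.DataFrame({'Tag': ['A', 'B', 'A', 'C', 'B'],
                   'Val': [5, 2, 3, 7, 1]})

# pivot spreads 'Tag' into columns and keeps 'Val' as the cell values;
# missing cells come back as NaN, so fill them with the placeholder
out = df.pivot(columns='Tag', values='Val').fillna(0).astype(int)
print(out)
```

Swapping the fill value for NaN or anything else is just a matter of changing (or dropping) the fillna call.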

Based on @user17242583's answer, I found a pretty simple way to do it using pd.get_dummies combined with DataFrame.multiply:
>>> df
Tag Val
0 A 5
1 B 2
2 A 3
3 C 7
4 B 1
>>> pd.get_dummies(df['Tag'])
A B C
0 1 0 0
1 0 1 0
2 1 0 0
3 0 0 1
4 0 1 0
>>> pd.get_dummies(df['Tag']).multiply(df['Val'], axis=0)
A B C
0 5 0 0
1 0 2 0
2 3 0 0
3 0 0 7
4 0 1 0
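Since the question literally mentions a sparse matrix: get_dummies also accepts sparse=True, which stores the dummy columns in pandas' sparse dtype. A sketch (whether downstream operations keep the result sparse varies by operation and pandas version):

```python
import pandas as pd

s = pd.Series(['A', 'B', 'A', 'C', 'B'], name='Tag')

# sparse=True stores only the non-fill positions internally,
# which saves memory when there are many unique tags
dummies = pd.get_dummies(s, sparse=True)
print(dummies.dtypes)          # each column has a SparseDtype
print(dummies.sparse.density)  # fraction of stored (non-fill) values
```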

Related

Transforming multilabels to single label problem

I am working on a data manipulation exercise, where the original dataset looks like:
df = pd.DataFrame({
'x1': [1, 2, 3, 4, 5],
'x2': [2, -7, 4, 3, 2],
'a': [0, 1, 0, 1, 1],
'b': [0, 1, 1, 0, 0],
'c': [0, 1, 1, 1, 1],
'd': [0, 0, 1, 0, 1]})
Here the columns a, b, c, d are categories, whereas x1, x2 are features. The goal is to convert this dataset into the following format:
dfnew1 = pd.DataFrame({
'x1': [1, 2,2,2, 3,3,3, 4,4, 5,5,5],
'x2': [2, -7,-7,-7, 4,4,4, 3,3, 2,2,2],
'a': [0, 1,0,0, 0,0,0, 1,0, 1,0,0],
'b': [0, 0,1,0, 1,0,0, 0,0, 0,0,0],
'c': [0, 0,0,1, 0,1,0, 0,1, 0,1,0],
'd': [0, 0,0,0, 0,0,1, 0,0, 0,0,1],
'y': [0, 'a','b','c', 'b','c','d', 'a','c', 'a','c','d']})
Can I get some help on how to do it? On my part, I was able to get it in the following form:
df.loc[:, 'a':'d']=df.loc[:, 'a':'d'].replace(1, pd.Series(df.columns, df.columns))
df['label_concat']=df.loc[:, 'a':'d'].apply(lambda x: '-'.join([i for i in x if i!=0]),axis=1)
This gave me the following output:
x1 x2 a b c d label_concat
0 1 2 0 0 0 0
1 2 -7 a b c 0 a-b-c
2 3 4 0 b c d b-c-d
3 4 3 a 0 c 0 a-c
4 5 2 a 0 c d a-c-d
As seen, it is not the desired output. Can I please get some help on how to modify my approach to get the desired output? Thanks!
You could try this, to get the desired output based on your original approach:
Option 1
temp=df.loc[:, 'a':'d'].replace(1, pd.Series(df.columns, df.columns))
df['y']=temp.apply(lambda x: [i for i in x if i!=0],axis=1)
df=df.explode('y').fillna(0).reset_index(drop=True)
m=df.loc[1:, 'a':'d'].replace(1, pd.Series(df.columns, df.columns)).apply(lambda x: x==df.y.values[int(x.name)], axis=1)
df.loc[1:, 'a':'d']=m.astype(int)
Another approach, similar to @ALollz's solution:
Option 2
df=df.assign(y=[np.array(range(i))+1 for i in df.loc[:, 'a':'d'].sum(axis=1)]).explode('y').fillna(1)
m = df.loc[:, 'a':'d'].cumsum(axis=1).eq(df.y, axis=0)
df.loc[:, 'a':'d'] = df.loc[:, 'a':'d'].where(m).fillna(0).astype(int)
df['y']=df.loc[:, 'a':'d'].dot(df.columns[list(df.columns).index('a'):list(df.columns).index('d')+1]).replace('',0)
Output:
df
x1 x2 a b c d y
0 1 2 0 0 0 0 0
1 2 -7 1 0 0 0 a
1 2 -7 0 1 0 0 b
1 2 -7 0 0 1 0 c
2 3 4 0 1 0 0 b
2 3 4 0 0 1 0 c
2 3 4 0 0 0 1 d
3 4 3 1 0 0 0 a
3 4 3 0 0 1 0 c
4 5 2 1 0 0 0 a
4 5 2 0 0 1 0 c
4 5 2 0 0 0 1 d
Explanation of Option 1:
First, we use your approach, but instead of changing the original data we work on a copy, temp; and instead of joining the columns into a string, we keep them as a list:
temp=df.loc[:, 'a':'d'].replace(1, pd.Series(df.columns, df.columns))
df['y']=temp.apply(lambda x: [i for i in x if i!=0],axis=1) #without join
df['y']
0 []
1 [a, b, c]
2 [b, c, d]
3 [a, c]
4 [a, c, d]
Then we can use pd.DataFrame.explode to expand the lists, pd.DataFrame.fillna(0) to fill the first row (whose list was empty), and pd.DataFrame.reset_index(drop=True):
df=df.explode('y').fillna(0).reset_index(drop=True)
df
x1 x2 a b c d y
0 1 2 0 0 0 0 0
1 2 -7 1 1 1 0 a
2 2 -7 1 1 1 0 b
3 2 -7 1 1 1 0 c
4 3 4 0 1 1 1 b
5 3 4 0 1 1 1 c
6 3 4 0 1 1 1 d
7 4 3 1 0 1 0 a
8 4 3 1 0 1 0 c
9 5 2 1 0 1 1 a
10 5 2 1 0 1 1 c
11 5 2 1 0 1 1 d
Then we build a mask over df.loc[1:, 'a':'d'] that is True where the column name equals the y column, and later cast the mask to int using astype(int):
m=df.loc[1:, 'a':'d'].replace(1, pd.Series(df.columns, df.columns)).apply(lambda x: x==df.y.values[int(x.name)], axis=1)
m
a b c d
1 True False False False
2 False True False False
3 False False True False
4 False True False False
5 False False True False
6 False False False True
7 True False False False
8 False False True False
9 True False False False
10 False False True False
11 False False False True
df.loc[1:, 'a':'d']=m.astype(int)
df.loc[1:, 'a':'d']
a b c d
1 1 0 0 0
2 0 1 0 0
3 0 0 1 0
4 0 1 0 0
5 0 0 1 0
6 0 0 0 1
7 1 0 0 0
8 0 0 1 0
9 1 0 0 0
10 0 0 1 0
11 0 0 0 1
Important: note that in the last step we exclude the first row, because in that row the mask would be True for every value (all its values are 0, and y is 0). For a general way that handles this, you could try:
#Replace NaN values (the empty list from original df) with ''
df=df.explode('y').fillna('').reset_index(drop=True)
#make the mask with all the rows
msk=df.loc[:, 'a':'d'].replace(1, pd.Series(df.columns, df.columns)).apply(lambda x: x==df.y.values[int(x.name)], axis=1)
df.loc[:, 'a':'d']=msk.astype(int)
#Then, replace the original '' (NaN values) with 0
df=df.replace('',0)
Tricky problem. Here's one of probably many methods.
We set the index, then use .loc to repeat each row as many times as we will need, based on the sum of the other columns (clipped at 1 so every row appears at least once). Then we use where to mask the DataFrame, turning the repeated 1s into 0s, and dot with the columns to get the 'y' column you want, replacing the empty string (produced when an entire row is 0) with 0.
df1 = df.set_index(['x1', 'x2'])
df1 = df1.loc[df1.index.repeat(df1.sum(1).clip(lower=1))]
# a b c d
#x1 x2
#1 2 0 0 0 0
#2 -7 1 1 1 0
# -7 1 1 1 0
# -7 1 1 1 0
#3 4 0 1 1 1
# 4 0 1 1 1
# 4 0 1 1 1
#4 3 1 0 1 0
# 3 1 0 1 0
#5 2 1 0 1 1
# 2 1 0 1 1
# 2 1 0 1 1
N = df1.groupby(level=0).cumcount()+1
m = df1.cumsum(axis=1).eq(N, axis=0)
df1 = df1.where(m).fillna(0, downcast='infer')
df1['y'] = df1.dot(df1.columns).replace('', 0)
df1 = df1.reset_index()
x1 x2 a b c d y
0 1 2 0 0 0 0 0
1 2 -7 1 0 0 0 a
2 2 -7 0 1 0 0 b
3 2 -7 0 0 1 0 c
4 3 4 0 1 0 0 b
5 3 4 0 0 1 0 c
6 3 4 0 0 0 1 d
7 4 3 1 0 0 0 a
8 4 3 0 0 1 0 c
9 5 2 1 0 0 0 a
10 5 2 0 0 1 0 c
11 5 2 0 0 0 1 d
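For completeness, a melt-based sketch (my own variant, not from the answers above): melt the label columns into long form, keep only the 1s, and rebuild the one-hot block with get_dummies. Note this drops the all-zero first row, which the desired output keeps with y=0, so that row would need special handling:

```python
import pandas as pd

df = pd.DataFrame({
    'x1': [1, 2, 3, 4, 5],
    'x2': [2, -7, 4, 3, 2],
    'a': [0, 1, 0, 1, 1],
    'b': [0, 1, 1, 0, 0],
    'c': [0, 1, 1, 1, 1],
    'd': [0, 0, 1, 0, 1]})

# long form: one row per (original row, label) pair
long = df.reset_index().melt(id_vars=['index', 'x1', 'x2'], var_name='y')
long = long[long['value'] == 1].sort_values(['index', 'y'])

# rebuild the one-hot columns from the single label of each long row
out = pd.concat([long[['x1', 'x2', 'y']].reset_index(drop=True),
                 pd.get_dummies(long['y']).astype(int).reset_index(drop=True)],
                axis=1)
print(out)
```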

Concatenate multiple values in same row into a list

I have a df where I want to do multi-label classification. One of the ways which was suggested to me was to calculate the probability vector. Here's an example of my DF with what would represent training data.
id ABC DEF GHI
1 0 0 0 1
2 1 0 1 0
3 2 1 0 0
4 3 0 1 1
5 4 0 0 0
6 5 0 1 1
7 6 1 1 1
8 7 1 0 1
9 8 1 1 0
And I would like to concatenate columns ABC, DEF, GHI into a new column. I will also have to do this with more than 3 columns, so I want to do it relatively cleanly using a column list or something similar:
col_list = ['ABC','DEF','GHI']
The result I am looking for would be something like:
id ABC DEF GHI Conc
1 0 0 0 1 [0,0,1]
2 1 0 1 0 [0,1,0]
3 2 1 0 0 [1,0,0]
4 3 0 1 1 [0,1,1]
5 4 0 0 0 [0,0,0]
6 5 0 1 1 [0,1,1]
7 6 1 1 1 [1,1,1]
8 7 1 0 1 [1,0,1]
9 8 1 1 0 [1,1,0]
Try:
col_list = ['ABC','DEF','GHI']
df['agg_lst']=df.apply(lambda x: list(x[col] for col in col_list), axis=1)
You can use agg with the function list:
df[col_list].agg(list, axis=1)
1 [0, 0, 1]
2 [0, 1, 0]
3 [1, 0, 0]
4 [0, 1, 1]
5 [0, 0, 0]
6 [0, 1, 1]
7 [1, 1, 1]
8 [1, 0, 1]
9 [1, 1, 0]
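A faster alternative to the row-wise apply, if it's of use: pull the block out as a NumPy array and convert it to a list of lists in one shot. A sketch assuming the same df and col_list:

```python
import pandas as pd

df = pd.DataFrame({'id': range(9),
                   'ABC': [0, 0, 1, 0, 0, 0, 1, 1, 1],
                   'DEF': [0, 1, 0, 1, 0, 1, 1, 0, 1],
                   'GHI': [1, 0, 0, 1, 0, 1, 1, 1, 0]},
                  index=range(1, 10))
col_list = ['ABC', 'DEF', 'GHI']

# one vectorized conversion instead of a Python-level pass per row
df['Conc'] = df[col_list].to_numpy().tolist()
print(df)
```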

How to label ascending consecutive numbers onto values in a column?

I have a column that looks something like this:
1
0
0
1
0
0
0
1
I want the output to look something like this:
1 <--
0
0
2 <--
0
0
0
3 <--
And so forth. I'm not sure where to begin. There are about 10,000 rows and I feel like making an if statement might take a while. How do I achieve this output?
Efficient and concise:
s.cumsum()*s
0 1
1 0
2 0
3 2
4 0
5 0
6 0
7 3
dtype: int64
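A self-contained version of the one-liner above (assuming the column lives in a Series s):

```python
import pandas as pd

s = pd.Series([1, 0, 0, 1, 0, 0, 0, 1])

# cumsum numbers every position running through the 1s;
# multiplying by s zeroes out the positions that were 0
print((s.cumsum() * s).tolist())  # [1, 0, 0, 2, 0, 0, 0, 3]
```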
Use Series.cumsum + Series.where
Here is an example:
print(df)
0
0 1
1 0
2 0
3 1
4 0
5 0
6 0
7 1
df['0']=df['0'].cumsum().where(df['0'].ne(0),df['0'])
print(df)
0
0 1
1 0
2 0
3 2
4 0
5 0
6 0
7 3
Try this:
s = pd.Series([1,0,0,1,0,0,0,1])
s.cumsum().mask(s==0, 0)
Output:
0 1
1 0
2 0
3 2
4 0
5 0
6 0
7 3
dtype: int64
np.where and cumsum:
df['cum_sum'] = np.where(df.val>0, df.val.cumsum(), 0)
output:
val cum_sum
0 1 1
1 0 0
2 0 0
3 1 2
4 0 0
5 0 0
6 0 0
7 1 3
you could do something like this:
df = {'col1': [1, 0, 0, 0, 1, 0, 0, 1]}
count = 0
col = []
for val in df['col1']:
    if val == 1:
        count += 1
        col.append(count)
    else:
        col.append(val)
and you get [1, 0, 0, 0, 2, 0, 0, 3]
Only select the rows that are non-zero and replace those values with the cumsum:
import pandas as pd
df=pd.DataFrame({'col': [0,1,0,0,1,0,0,0,1,0] })
index=df["col"]!=0
df.loc[index,"col"]=df.loc[index,"col"].cumsum()
print(df)
col
0 0
1 1
2 0
3 0
4 2
5 0
6 0
7 0
8 3
9 0

How to replace integer values in pandas in python?

I have a pandas dataframe as follows.
a b c d e
a 0 1 0 1 1
b 1 0 1 6 3
c 0 1 0 1 2
d 5 1 1 0 8
e 1 3 2 8 0
I want to replace values that are below 6 (i.e. <= 5) with 0. So my output should be as follows.
a b c d e
a 0 0 0 0 0
b 0 0 0 6 0
c 0 0 0 0 0
d 0 0 0 0 8
e 0 0 0 8 0
I was trying to do this using the following code.
df['a'].replace([1, 2, 3, 4, 5], 0)
df['b'].replace([1, 2, 3, 4, 5], 0)
df['c'].replace([1, 2, 3, 4, 5], 0)
df['d'].replace([1, 2, 3, 4, 5], 0)
df['e'].replace([1, 2, 3, 4, 5], 0)
However, I am sure that there is an easier way of doing this task in pandas.
I am happy to provide more details if needed.
Using mask
df=df.mask(df<=5,0)
df
Out[380]:
a b c d e
a 0 0 0 0 0
b 0 0 0 6 0
c 0 0 0 0 0
d 0 0 0 0 8
e 0 0 0 8 0
For performance, I recommend np.where. You can assign the array back in place using sliced assignment (df[:] = ...).
df[:] = np.where(df < 6, 0, df)
df
a b c d e
a 0 0 0 0 0
b 0 0 0 6 0
c 0 0 0 0 0
d 0 0 0 0 8
e 0 0 0 8 0
Another option involves fillna:
df[df>=6].fillna(0, downcast='infer')
a b c d e
a 0 0 0 0 0
b 0 0 0 6 0
c 0 0 0 0 0
d 0 0 0 0 8
e 0 0 0 8 0
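Equivalently, DataFrame.where (the complement of mask: it keeps values where the condition holds and replaces the rest) gives the same result. A sketch with the question's data:

```python
import pandas as pd

df = pd.DataFrame([[0, 1, 0, 1, 1],
                   [1, 0, 1, 6, 3],
                   [0, 1, 0, 1, 2],
                   [5, 1, 1, 0, 8],
                   [1, 3, 2, 8, 0]],
                  index=list('abcde'), columns=list('abcde'))

# keep values >= 6, replace everything else with 0
out = df.where(df >= 6, 0)
print(out)
```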

Efficient way to get a subset of indices in numpy

I have the following indices as you would get them from np.where(...):
coords = (
    np.asarray([0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6]),
    np.asarray([2, 2, 8, 2, 2, 4, 4, 6, 2, 2, 6, 2, 2, 4, 6, 2, 2, 6, 2, 2, 4, 4, 6, 2, 2, 6]),
    np.asarray([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
    np.asarray([0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1])
)
Another tuple with indices is meant to select those that are in coords:
index = (
    np.asarray([0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6]),
    np.asarray([2, 8, 2, 4, 4, 6, 2, 2, 6, 2, 2, 4, 6, 2, 2, 6, 2, 2, 4, 4, 6, 2, 2, 6]),
    np.asarray([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
    np.asarray([0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1])
)
So, for instance, the first coordinate in coords is selected because it also appears in index (at position 0), but the second coordinate isn't selected, because it's not in index.
I can calculate the mask easily with [x in zip(*index) for x in zip(*coords)] (converted from bool to int for better readability):
[1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
but this wouldn't be very efficient for larger arrays. Is there a more "numpy-based" way that could calculate the mask?
Not so sure about efficiency, but since you're basically comparing coordinate tuples, you could use SciPy's distance functions. Something along the lines of:
from scipy.spatial.distance import cdist
c = np.stack(coords).T
i = np.stack(index).T
d = cdist(c, i)
In [113]: np.any(d == 0, axis=1).astype(int)
Out[113]:
array([1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1])
By default it uses the L2 norm; you could probably make it slightly faster with a simpler "distance" function, e.g.:
d = cdist(c,i, lambda u, v: np.all(np.equal(u,v)))
np.any(d != 0, axis=1).astype(int)
You can use np.ravel_multi_index to compress the columns into unique numbers which are easier to handle:
cmx = *map(np.max, coords),
imx = *map(np.max, index),
shape = np.maximum(cmx, imx) + 1
ct = np.ravel_multi_index(coords, shape)
it = np.ravel_multi_index(index, shape)
it.sort()
result = ct == it[it.searchsorted(ct)]
print(result.view(np.int8))
Prints:
[1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
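If the sort + searchsorted step feels fiddly, np.isin on the raveled keys is a simpler (if not always faster) sketch of the same idea, shown here on a small hypothetical 2-D example rather than the question's full arrays:

```python
import numpy as np

coords = (np.asarray([0, 1, 1, 2]),
          np.asarray([2, 2, 4, 6]))
index = (np.asarray([0, 1, 2]),
         np.asarray([2, 4, 6]))

# compress each coordinate tuple into a single unique integer key
shape = tuple(np.maximum([a.max() for a in coords],
                         [a.max() for a in index]) + 1)
ct = np.ravel_multi_index(coords, shape)
it = np.ravel_multi_index(index, shape)

# membership test on the scalar keys
mask = np.isin(ct, it)
print(mask.astype(int))  # [1 0 1 1]
```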
