I would like to expand a series or dataframe into a sparse matrix based on the unique values of the series. It's a bit hard to explain verbally but an example should be clearer.
First, simpler version - if I start with this:
Idx Tag
0 A
1 B
2 A
3 C
4 B
I'd like to get something like this, where the unique values in the starting series are the column values here (could be 1s and 0s, Boolean, etc.):
Idx A B C
0 1 0 0
1 0 1 0
2 1 0 0
3 0 0 1
4 0 1 0
Second, a more advanced version: if I have values associated with each entry, I'd like to preserve those and fill the rest of the matrix with a placeholder (0, NaN, something else), e.g. starting from this:
Idx Tag Val
0 A 5
1 B 2
2 A 3
3 C 7
4 B 1
And ending up with this:
Idx A B C
0 5 0 0
1 0 2 0
2 3 0 0
3 0 0 7
4 0 1 0
What's a Pythonic way to do this?
Here's how to do it, using pandas.get_dummies() which was designed specifically for this (often called "one-hot-encoding" in ML). I've done it step-by-step so you can see how it's done ;)
>>> df
Idx Tag Val
0 0 A 5
1 1 B 2
2 2 A 3
3 3 C 7
4 4 B 1
>>> pd.get_dummies(df['Tag'])
A B C
0 1 0 0
1 0 1 0
2 1 0 0
3 0 0 1
4 0 1 0
>>> pd.concat([df[['Idx']], pd.get_dummies(df['Tag'])], axis=1)
Idx A B C
0 0 1 0 0
1 1 0 1 0
2 2 1 0 0
3 3 0 0 1
4 4 0 1 0
>>> pd.get_dummies(df['Tag']).to_numpy()
array([[1, 0, 0],
[0, 1, 0],
[1, 0, 0],
[0, 0, 1],
[0, 1, 0]], dtype=uint8)
>>> df[['Val']].to_numpy()
array([[5],
[2],
[3],
[7],
[1]])
>>> pd.get_dummies(df['Tag']).to_numpy() * df[['Val']].to_numpy()
array([[5, 0, 0],
[0, 2, 0],
[3, 0, 0],
[0, 0, 7],
[0, 1, 0]])
>>> pd.DataFrame(pd.get_dummies(df['Tag']).to_numpy() * df[['Val']].to_numpy(), columns=pd.get_dummies(df['Tag']).columns)
A B C
0 5 0 0
1 0 2 0
2 3 0 0
3 0 0 7
4 0 1 0
>>> pd.concat([df, pd.DataFrame(pd.get_dummies(df['Tag']).to_numpy() * df[['Val']].to_numpy(), columns=pd.get_dummies(df['Tag']).columns)], axis=1)
Idx Tag Val A B C
0 0 A 5 5 0 0
1 1 B 2 0 2 0
2 2 A 3 3 0 0
3 3 C 7 0 0 7
4 4 B 1 0 1 0
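One thing to watch when labelling the NumPy result yourself: get_dummies sorts its column labels, while Series.unique() returns them in order of first appearance. The two agree in the example above only because the tags happen to appear as A, B, C. A minimal sketch with made-up data where they differ (the frame below is hypothetical, not from the question):

```python
import pandas as pd

# Hypothetical data: the tags do NOT first appear in sorted order.
df = pd.DataFrame({"Tag": ["B", "A", "C", "A"], "Val": [2, 5, 7, 3]})

dummies = pd.get_dummies(df["Tag"], dtype=int)

sorted_labels = list(dummies.columns)        # labels as get_dummies orders them
appearance_labels = list(df["Tag"].unique()) # labels in order of first appearance

# Taking the labels from the dummies frame keeps values and headers aligned.
wide = pd.DataFrame(
    dummies.to_numpy() * df[["Val"]].to_numpy(),
    columns=dummies.columns,
)
print(sorted_labels, appearance_labels)
print(wide)
```

Passing dtype=int is just a way to get 0/1 integers regardless of the pandas version's default dummy dtype.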
Based on @user17242583's answer, I found a pretty simple way to do it using pd.get_dummies combined with DataFrame.multiply:
>>> df
Tag Val
0 A 5
1 B 2
2 A 3
3 C 7
4 B 1
>>> pd.get_dummies(df['Tag'])
A B C
0 1 0 0
1 0 1 0
2 1 0 0
3 0 0 1
4 0 1 0
>>> pd.get_dummies(df['Tag']).multiply(df['Val'], axis=0)
A B C
0 5 0 0
1 0 2 0
2 3 0 0
3 0 0 7
4 0 1 0
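For completeness, here is the same thing as a self-contained script, plus one possible alternative (not from either answer) for when you would rather have NaN than 0 as the placeholder: DataFrame.pivot with the existing row index. Treat it as a sketch; it assumes each row carries exactly one tag.

```python
import pandas as pd

df = pd.DataFrame({"Tag": ["A", "B", "A", "C", "B"], "Val": [5, 2, 3, 7, 1]})

# Route 1: one-hot encode the tags, then scale every row by its value.
onehot = pd.get_dummies(df["Tag"], dtype=int)
zero_filled = onehot.multiply(df["Val"], axis=0)

# Route 2 (alternative): pivot on the tag, keeping the current row index.
# Cells without a value come out as NaN, so the result is float.
nan_filled = df.pivot(columns="Tag", values="Val")

print(zero_filled)
print(nan_filled)
```

From the NaN version you can get to any other placeholder with fillna, e.g. nan_filled.fillna(0).astype(int).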
I am working on a data manipulation exercise, where the original dataset looks like this:
df = pd.DataFrame({
'x1': [1, 2, 3, 4, 5],
'x2': [2, -7, 4, 3, 2],
'a': [0, 1, 0, 1, 1],
'b': [0, 1, 1, 0, 0],
'c': [0, 1, 1, 1, 1],
'd': [0, 0, 1, 0, 1]})
Here the columns a, b, c, d are categories whereas x1, x2 are features. The goal is to convert this dataset into the following format:
dfnew1 = pd.DataFrame({
'x1': [1, 2,2,2, 3,3,3, 4,4, 5,5,5],
'x2': [2, -7,-7,-7, 4,4,4, 3,3, 2,2,2],
'a': [0, 1,0,0, 0,0,0, 1,0,1,0,0],
'b': [0, 0,1,0, 1,0,0,0, 0, 0,0,0],
'c': [0,0,0,1,0,1,0,0,1,0,1,0],
'd': [0,0,0,0,0,0,1,0,0,0,0,1],
'y':[0,'a','b','c','b','c','d','a','c','a','c','d']})
Can I get some help on how to do it? So far, I was able to get it into the following form:
df.loc[:, 'a':'d']=df.loc[:, 'a':'d'].replace(1, pd.Series(df.columns, df.columns))
df['label_concat']=df.loc[:, 'a':'d'].apply(lambda x: '-'.join([i for i in x if i!=0]),axis=1)
This gave me the following output:
x1 x2 a b c d label_concat
0 1 2 0 0 0 0
1 2 -7 a b c 0 a-b-c
2 3 4 0 b c d b-c-d
3 4 3 a 0 c 0 a-c
4 5 2 a 0 c d a-c-d
As you can see, this is not the desired output. Can I please get some help on how to modify my approach to get the desired output? Thanks.
You could try this, to get the desired output based on your original approach:
Option 1
temp=df.loc[:, 'a':'d'].replace(1, pd.Series(df.columns, df.columns))
df['y']=temp.apply(lambda x: [i for i in x if i!=0],axis=1)
df=df.explode('y').fillna(0).reset_index(drop=True)
m=df.loc[1:, 'a':'d'].replace(1, pd.Series(df.columns, df.columns)).apply(lambda x: x==df.y.values[int(x.name)] ,axis=1).astype(int)
df.loc[1:, 'a':'d']=m.astype(int)
Another approach, similar to @ALollz's solution:
Option 2
df=df.assign(y=[np.array(range(i))+1 for i in df.loc[:, 'a':'d'].sum(axis=1)]).explode('y').fillna(1)
m = df.loc[:, 'a':'d'].groupby(level=0).cumsum(1).eq(df.y, axis=0)
df.loc[:, 'a':'d'] = df.loc[:, 'a':'d'].where(m).fillna(0).astype(int)
df['y']=df.loc[:, 'a':'d'].dot(df.columns[list(df.columns).index('a'):list(df.columns).index('d')+1]).replace('',0)
Output:
df
x1 x2 a b c d y
0 1 2 0 0 0 0 0
1 2 -7 1 0 0 0 a
1 2 -7 0 1 0 0 b
1 2 -7 0 0 1 0 c
2 3 4 0 1 0 0 b
2 3 4 0 0 1 0 c
2 3 4 0 0 0 1 d
3 4 3 1 0 0 0 a
3 4 3 0 0 1 0 c
4 5 2 1 0 0 0 a
4 5 2 0 0 1 0 c
4 5 2 0 0 0 1 d
Explanation of Option 1:
First, we use your approach, but instead of changing the original data we work on a copy, temp, and instead of joining the column names into a string we keep them as a list:
temp=df.loc[:, 'a':'d'].replace(1, pd.Series(df.columns, df.columns))
df['y']=temp.apply(lambda x: [i for i in x if i!=0],axis=1) #without join
df['y']
0 []
1 [a, b, c]
2 [b, c, d]
3 [a, c]
4 [a, c, d]
Then we can use pd.DataFrame.explode to expand the lists into one row per label, fillna(0) to fill the NaN that the empty list in the first row produces, and reset_index(drop=True) to renumber the rows:
df=df.explode('y').fillna(0).reset_index(drop=True)
df
x1 x2 a b c d y
0 1 2 0 0 0 0 0
1 2 -7 1 1 1 0 a
2 2 -7 1 1 1 0 b
3 2 -7 1 1 1 0 c
4 3 4 0 1 1 1 b
5 3 4 0 1 1 1 c
6 3 4 0 1 1 1 d
7 4 3 1 0 1 0 a
8 4 3 1 0 1 0 c
9 5 2 1 0 1 1 a
10 5 2 1 0 1 1 c
11 5 2 1 0 1 1 d
Then we build a mask over df.loc[1:, 'a':'d'] that is True where the column name equals the value in the y column, and cast the mask to int using astype(int):
m=df.loc[1:, 'a':'d'].replace(1, pd.Series(df.columns, df.columns)).apply(lambda x: x==df.y.values[int(x.name)] ,axis=1)
m
a b c d
1 True False False False
2 False True False False
3 False False True False
4 False True False False
5 False False True False
6 False False False True
7 True False False False
8 False False True False
9 True False False False
10 False False True False
11 False False False True
df.loc[1:, 'a':'d']=m.astype(int)
df.loc[1:, 'a':'d']
a b c d
1 1 0 0 0
2 0 1 0 0
3 0 0 1 0
4 0 1 0 0
5 0 0 1 0
6 0 0 0 1
7 1 0 0 0
8 0 0 1 0
9 1 0 0 0
10 0 0 1 0
11 0 0 0 1
Important: in the last step we exclude the first row, because all of its values are 0 and so is its y value, which would make the mask True across the whole row. For a general approach you could try this:
#Replace NaN values (the empty list from original df) with ''
df=df.explode('y').fillna('').reset_index(drop=True)
#make the mask with all the rows
msk=df.loc[:, 'a':'d'].replace(1, pd.Series(df.columns, df.columns)).apply(lambda x: x==df.y.values[int(x.name)] ,axis=1)
df.loc[:, 'a':'d']=msk.astype(int)
#Then, replace the original '' (NaN values) with 0
df=df.replace('',0)
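If the row-wise mask feels heavy, the same idea (list of labels, then explode) can be finished with pd.get_dummies instead of a comparison. This is only a sketch of an alternative, not a change to Option 1; the names cats, labels, out and onehot are mine:

```python
import pandas as pd

df = pd.DataFrame({
    'x1': [1, 2, 3, 4, 5],
    'x2': [2, -7, 4, 3, 2],
    'a': [0, 1, 0, 1, 1],
    'b': [0, 1, 1, 0, 0],
    'c': [0, 1, 1, 1, 1],
    'd': [0, 0, 1, 0, 1]})
cats = ['a', 'b', 'c', 'd']

# One list of active category names per row (empty list for an all-zero row).
labels = df[cats].apply(lambda row: [c for c in cats if row[c] == 1], axis=1)

# One output row per label; an empty list explodes to a single NaN row.
out = df[['x1', 'x2']].assign(y=labels).explode('y').reset_index(drop=True)

# One-hot encode y. NaN rows become all zeros; reindex guarantees that every
# category column exists even if it never occurs.
onehot = pd.get_dummies(out['y'], dtype=int).reindex(columns=cats, fill_value=0)

out = pd.concat([out[['x1', 'x2']], onehot, out['y'].fillna(0)], axis=1)
print(out)
```

Because get_dummies ignores NaN by default, the all-zero first row needs no special handling here.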
Tricky problem. Here's one of probably many methods.
We set the index, then use .loc to repeat each row as many times as we will need, based on the sum of the other columns (clipped at 1 so every row appears at least once). Then we can use where to mask the DataFrame and turn the repeated 1s into 0s. Finally we dot with the columns to get the 'y' column you desire, replacing the empty string (when a row is 0 all the way across) with 0.
df1 = df.set_index(['x1', 'x2'])
df1 = df1.loc[df1.index.repeat(df1.sum(1).clip(lower=1))]
# a b c d
#x1 x2
#1 2 0 0 0 0
#2 -7 1 1 1 0
# -7 1 1 1 0
# -7 1 1 1 0
#3 4 0 1 1 1
# 4 0 1 1 1
# 4 0 1 1 1
#4 3 1 0 1 0
# 3 1 0 1 0
#5 2 1 0 1 1
# 2 1 0 1 1
# 2 1 0 1 1
N = df1.groupby(level=0).cumcount()+1
m = df1.groupby(level=0).cumsum(1).eq(N, axis=0)
df1 = df1.where(m).fillna(0, downcast='infer')
df1['y'] = df1.dot(df1.columns).replace('', 0)
df1 = df1.reset_index()
x1 x2 a b c d y
0 1 2 0 0 0 0 0
1 2 -7 1 0 0 0 a
2 2 -7 0 1 0 0 b
3 2 -7 0 0 1 0 c
4 3 4 0 1 0 0 b
5 3 4 0 0 1 0 c
6 3 4 0 0 0 1 d
7 4 3 1 0 0 0 a
8 4 3 0 0 1 0 c
9 5 2 1 0 0 0 a
10 5 2 0 0 1 0 c
11 5 2 0 0 0 1 d
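As far as I know, newer pandas releases deprecate both the axis argument of groupby(...).cumsum and the downcast argument of fillna. Here is a sketch of the same repeat-and-mask idea that avoids them. It keeps the default integer index instead of set_index, and the names cats, reps, rep, n and onehot are mine:

```python
import pandas as pd

df = pd.DataFrame({
    'x1': [1, 2, 3, 4, 5],
    'x2': [2, -7, 4, 3, 2],
    'a': [0, 1, 0, 1, 1],
    'b': [0, 1, 1, 0, 0],
    'c': [0, 1, 1, 1, 1],
    'd': [0, 0, 1, 0, 1]})
cats = ['a', 'b', 'c', 'd']

# Repeat each row once per active category (at least once).
reps = df[cats].sum(axis=1).clip(lower=1)
rep = df.loc[df.index.repeat(reps)]

# n = 1, 2, 3, ... within each group of repeated rows.
n = (rep.groupby(level=0).cumcount() + 1).reset_index(drop=True)
rep = rep.reset_index(drop=True)

# The running count of 1s across the columns equals n at the n-th active
# category; trailing matches land on 0s, so where() keeps exactly one 1.
mask = rep[cats].cumsum(axis=1).eq(n, axis=0)
onehot = rep[cats].where(mask, 0)

out = rep[['x1', 'x2']].join(onehot)
# Same dot trick as above: 1 * 'a' + 0 * 'b' + ... gives the label.
out['y'] = onehot.dot(pd.Index(cats)).replace('', 0)
print(out)
```

The row-wise cumsum does not need the groupby at all, since every repeated row is an identical copy of its source row.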
I need to transform one df into another. The original (df1) looks like this:
value
A--A 4
A--B 2
A--C 1
B--B 2
C--C 3
D--B 2
E--E 6
Then I have this other df2, filled with 0:
A B C D E
A 0 0 0 0 0
B 0 0 0 0 0
C 0 0 0 0 0
D 0 0 0 0 0
E 0 0 0 0 0
F 0 0 0 0 0
G 0 0 0 0 0
I need to convert it to a final df3, getting the values from the pairs in the index of df1, separated by "--", and filling it like this:
A B C D E
A 4 2 1 0 0
B 2 2 0 2 0
C 1 0 3 0 0
D 0 2 0 0 0
E 0 0 0 0 6
F 0 0 0 0 0
G 0 0 0 0 0
There can be pairs in df2 that do not exist in df1. In that case the cell remains 0. Any suggestions?
You can create this from df itself (df here is the question's df1). First, set df.index to a MultiIndex using str.split, then unstack and reindex.
df.index = pd.MultiIndex.from_arrays(zip(*df.index.str.split('--')))
(df['value'].unstack()
.reindex(index=df2.index, columns=df2.columns)
.fillna(0, downcast='infer'))
A B C D E
A 4 2 1 0 0
B 0 2 0 0 0
C 0 0 3 0 0
D 0 2 0 0 0
E 0 0 0 0 6
F 0 0 0 0 0
G 0 0 0 0 0
If you know what rows and columns you want to use, you don't even need df2.
(df['value'].unstack()
.reindex(index=list('ABCDEFG'), columns=list('ABCDE'))
.fillna(0, downcast='infer'))
A B C D E
A 4 2 1 0 0
B 0 2 0 0 0
C 0 0 3 0 0
D 0 2 0 0 0
E 0 0 0 0 6
F 0 0 0 0 0
G 0 0 0 0 0
As per OP's comment, to keep the result symmetric, unstack and reindex without filling so that the NaNs are preserved, then fillna with the transpose:
v = (df['value'].unstack()
.reindex(index=df2.index, columns=df2.columns))
v.fillna(v.T.reindex_like(v)).fillna(0, downcast='infer')
A B C D E
A 4 2 1 0 0
B 2 2 0 2 0
C 1 0 3 0 0
D 0 2 0 0 0
E 0 0 0 0 6
F 0 0 0 0 0
G 0 0 0 0 0
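Putting the pieces together as a script you can run end to end. This is a sketch under a couple of assumptions: the row and column labels are written out by hand instead of being taken from df2, str.split(expand=True) is used to build the MultiIndex in one step, and the final cast uses astype(int) in place of downcast='infer'.

```python
import pandas as pd

df1 = pd.DataFrame(
    {'value': [4, 2, 1, 2, 3, 2, 6]},
    index=['A--A', 'A--B', 'A--C', 'B--B', 'C--C', 'D--B', 'E--E'])

rows = list('ABCDEFG')   # assumed target row labels (df2.index)
cols = list('ABCDE')     # assumed target column labels (df2.columns)

# Split 'A--B' into the two levels ('A', 'B') of a MultiIndex.
pairs = df1.index.str.split('--', expand=True)
s = pd.Series(df1['value'].to_numpy(), index=pairs)

# Wide table; pairs that are absent stay NaN for now.
v = s.unstack().reindex(index=rows, columns=cols)

# Fill each missing cell from its mirror cell, then default the rest to 0.
sym = v.fillna(v.T.reindex_like(v)).fillna(0).astype(int)
print(sym)
```

Keeping the NaNs until the very end is what lets the transpose fill only the cells that were really missing.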
This is a continuation of something I learned earlier about numpy arrays.
A structured array is created from the elements of a list and thereafter populated with values (not shown below).
>>> o = ['x','y','z']
>>> import numpy as np
>>> b = np.zeros((len(o),), dtype=[(i,object) for i in o])
>>> b
array([(0, 0, 0), (0, 0, 0), (0, 0, 0)],
dtype=[('x', '|O4'), ('y', '|O4'), ('z', '|O4')])
The populated array looks as below:
x y z
x 0 1 0
y 1 0 1,5
z 0 1,5 0
1. How do we add new vertices to the above?
2. Once the vertices have been added, what is the cleanest process to add the following array to the structured array (NOTE: not all vertices in this array are new):
d e y
d 0 '1,2' 0
e '1,2' 0 '1'
f 0 '1' 0
The expected output (please bear with me):
x y z d e f
x 0 1 0 0 0 0
y 1 0 1,5 0 1 0
z 0 1,5 0 0 0 0
d 0 0 0 0 1,2 0
e 0 1 0 1,2 0 0
f 0 0 0 0 1 0
Seems like a job for python pandas.
>>> import numpy as np
>>> import pandas as pd
>>> data=np.zeros((4,5))
>>> df=pd.DataFrame(data,columns=['x','y','z','a','b'])
>>> df
x y z a b
0 0 0 0 0 0
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
>>> df['c']=0 #Add a new column
>>> df
x y z a b c
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
>>> new_data=pd.DataFrame([['0','1,2','0'],['1,2','0','1'],['0','1','0']],columns=['d','e','y'])
>>> new_data
d e y
0 0 1,2 0
1 1,2 0 1
2 0 1 0
>>> df.merge(new_data,how='outer') #Merge data
x y z a b c d e
0 0 0 0 0 0 0 NaN NaN
1 0 0 0 0 0 0 NaN NaN
2 0 0 0 0 0 0 NaN NaN
3 0 0 0 0 0 0 NaN NaN
4 NaN 0 NaN NaN NaN NaN 0 1,2
5 NaN 0 NaN NaN NaN NaN 0 1
6 NaN 1 NaN NaN NaN NaN 1,2 0
There are many ways to merge the data that you showed. Can you please explain in more detail exactly what you would like the ending array to look like?
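In the meantime, here is one possible reading, sketched with labelled DataFrames: take the union of all vertex names, reindex both tables onto that square, and keep whatever either table has in each cell. The values below are my transcription of the tables in the question (edges as strings, 0 for no edge), and the names old, new, labels and merged are mine. It does not mirror entries across the diagonal, so it will not reproduce every cell of the expected output.

```python
import pandas as pd

# Existing adjacency table over x, y, z.
old = pd.DataFrame(
    [[0, '1', 0], ['1', 0, '1,5'], [0, '1,5', 0]],
    index=list('xyz'), columns=list('xyz'))

# New table: rows d, e, f against columns d, e, y.
new = pd.DataFrame(
    [[0, '1,2', 0], ['1,2', 0, '1'], [0, '1', 0]],
    index=list('def'), columns=list('dey'))

# Union of all vertex names, in order of first appearance.
labels = list(dict.fromkeys([*old.index, *old.columns, *new.index, *new.columns]))

# Put both tables on the same square grid; missing cells become NaN.
full_old = old.reindex(index=labels, columns=labels)
full_new = new.reindex(index=labels, columns=labels)

# Keep the old value where there is one, otherwise the new one, otherwise 0.
merged = full_old.where(full_old.notna(), full_new).fillna(0)
print(merged)
```

If both tables could define the same cell with different values, you would have to decide which one wins; here the old table takes priority.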