I would like to transform the below pandas dataframe:
dd = pd.DataFrame({ "zz":[1,3], "y": ["a","b"], "x": [[1,2],[1]]})
        x  y  zz
0  [1, 2]  a   1
1     [1]  b   3
into :
   x  y  zz
0  1  a   1
1  1  b   3
2  2  a   1
As you can see, the first row is expanded in column x into its individual elements, while the other columns (y, zz) are repeated. Can I do this without using a for loop?
Use:
import numpy as np

# get the length of each list, repeat rows accordingly, then flatten the lists
l = dd['x'].str.len()
df = dd.loc[dd.index.repeat(l)].assign(x=np.concatenate(dd['x'])).reset_index(drop=True)
print (df)
x y zz
0 1 a 1
1 2 a 1
2 1 b 3
But if order is important:
df1 = (pd.DataFrame(dd['x'].values.tolist())
         .stack()
         .sort_index(level=[1, 0])
         .reset_index(name='x'))
print (df1)
level_0 level_1 x
0 0 0 1.0
1 1 0 1.0
2 0 1 2.0
df = df1.join(dd.drop('x', axis=1), on='level_0').drop(['level_0', 'level_1'], axis=1)
print (df)
x y zz
0 1.0 a 1
1 1.0 b 3
2 2.0 a 1
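For completeness: on newer pandas (0.25+ for explode, 1.1+ for ignore_index), DataFrame.explode does this in one call. A minimal sketch using the frame from the question; note the row order matches the first variant above, not the order-preserving one:

import pandas as pd

dd = pd.DataFrame({"zz": [1, 3], "y": ["a", "b"], "x": [[1, 2], [1]]})

# expand the list column into one row per element, repeating the other columns
out = dd.explode("x", ignore_index=True)
print(out)
#    zz  y  x
# 0   1  a  1
# 1   1  a  2
# 2   3  b  1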
Using join and stack, you can do:
In [655]: dd.drop('x', axis=1).join(
     ...:     dd.apply(lambda x: pd.Series(x.x), axis=1)
     ...:       .stack().reset_index(level=1, drop=True).to_frame('x'))
Out[655]:
   y  zz    x
0  a   1  1.0
0  a   1  2.0
1  b   3  1.0
Details
In [656]: dd.apply(lambda x: pd.Series(x.x), axis=1).stack().reset_index(level=1,drop=True)
Out[656]:
0 1.0
0 2.0
1 1.0
dtype: float64
In [657]: dd
Out[657]:
        x  y  zz
0  [1, 2]  a   1
1     [1]  b   3
# expand each list into its own row, keeping the original index
new_dd = pd.DataFrame(dd.apply(lambda x: pd.Series(x['x']), axis=1).stack().reset_index(level=1, drop=True))
new_dd.columns = ['x']
# bring back the repeated columns with an index-aligned merge
new_dd.merge(dd[['y', 'zz']], left_index=True, right_index=True)
Related
I want to separate the groups in the "alpha" column like this:
Start:

alpha  beta  gamma
A      1     0
A      1     1
B      1     0
B      1     1
B      1     0
C      1     1
End:

alpha  beta  gamma
A      1     0
A      1     1
X      X     X
B      1     0
B      1     1
B      1     0
X      X     X
C      1     1
Thanks for the help <3
You can try
# append a separator row to each group, then drop the trailing separator with [:-1]
out = (df.groupby('alpha')
         .apply(lambda g: pd.concat([g, pd.DataFrame([['X', 'X', 'X']], columns=df.columns)]))
         .reset_index(drop=True)[:-1])
print(out)
alpha beta gamma
0 A 1 0
1 A 1 1
2 X X X
3 B 1 0
4 B 1 1
5 B 1 0
6 X X X
7 C 1 1
Assuming a range index as in the example, you can use:
# flag the last row of each group (except the final group)
idx = df['alpha'].ne(df['alpha'].shift(-1).ffill())
df2 = pd.concat([df, df[idx].assign(**{c: 'X' for c in df})]).sort_index(kind='stable')
Or without groupby and sort_index:
idx = df['alpha'].ne(df['alpha'].shift(-1).ffill())
df2 = df.loc[df.index.repeat(idx+1)]
df2.loc[df2.index.duplicated()] = 'X'
output:
alpha beta gamma
0 A 1 0
1 A 1 1
1 X X X
2 B 1 0
3 B 1 1
4 B 1 0
4 X X X
5 C 1 1
NB. add reset_index(drop=True) to get a new index
You can do:
dfx = pd.DataFrame({'alpha':['X'],'beta':['X'],'gamma':['X']})
df = df.groupby('alpha',as_index=False).apply(lambda x:x.append(dfx)).reset_index(drop=True)
Output:
alpha beta gamma
0 A 1 0
1 A 1 1
2 X X X
3 B 1 0
4 B 1 1
5 B 1 0
6 X X X
7 C 1 1
8 X X X
To avoid adding an [X, X, X] row at the end, you can check the index first:
df.groupby('alpha', as_index=False).apply(
    lambda x: x.append(dfx)
    if x.index[-1] != df.index[-1] else x).reset_index(drop=True)
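Note that DataFrame.append was removed in pandas 2.0. A minimal sketch of the same idea using pd.concat instead (the example frame is reconstructed from the table above, and dfx is the separator row defined earlier):

import pandas as pd

# the example frame from the question, reconstructed
df = pd.DataFrame({'alpha': list('AABBBC'),
                   'beta':  [1, 1, 1, 1, 1, 1],
                   'gamma': [0, 1, 0, 1, 0, 1]})
dfx = pd.DataFrame({'alpha': ['X'], 'beta': ['X'], 'gamma': ['X']})

out = (df.groupby('alpha', as_index=False)
         .apply(lambda x: pd.concat([x, dfx])   # pd.concat replaces the removed append
                if x.index[-1] != df.index[-1] else x)
         .reset_index(drop=True))
print(out)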
I want to change the shape of a dataframe from (x, y) to (1, x, y), (x, 1, y), or (x, y, 1). I know that in NumPy I can do something like arr[np.newaxis, ...]; how can I achieve the same for a dataframe?
The pandas.Panel object is deprecated (and has since been removed). Use a pandas.MultiIndex to handle higher-dimensional data.
Consider the data frame df
df = pd.DataFrame(1, list('abc'), list('xyz'))
df
x y z
a 1 1 1
b 1 1 1
c 1 1 1
Add Level
The following are various ways to add a level and dimensionality.
axis=0, level=0
pd.concat([df], keys=['A'])
x y z
A a 1 1 1
b 1 1 1
c 1 1 1
df.set_index(pd.MultiIndex.from_product([['B'], df.index]))
x y z
B a 1 1 1
b 1 1 1
c 1 1 1
axis=0, level=1
pd.concat([df], keys=['A']).swaplevel(0, 1)
x y z
a A 1 1 1
b A 1 1 1
c A 1 1 1
df.set_index(pd.MultiIndex.from_product([df.index, ['B']]))
x y z
a B 1 1 1
b B 1 1 1
c B 1 1 1
axis=1, level=0
pd.concat([df], axis=1, keys=['A'])
A
x y z
a 1 1 1
b 1 1 1
c 1 1 1
df.set_axis(pd.MultiIndex.from_product([['B'], df.columns]), axis=1)
B
x y z
a 1 1 1
b 1 1 1
c 1 1 1
axis=1, level=1
pd.concat([df], axis=1, keys=['A']).swaplevel(0, 1, axis=1)
x y z
A A A
a 1 1 1
b 1 1 1
c 1 1 1
df.set_axis(pd.MultiIndex.from_product([df.columns, ['B']]), axis=1)
x y z
B B B
a 1 1 1
b 1 1 1
c 1 1 1
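Since the question mentions arr[np.newaxis, ...]: a small sketch (not from the original answer) showing how the concat-with-keys frame corresponds to the 3-D NumPy array you would get by adding an axis directly:

import numpy as np
import pandas as pd

df = pd.DataFrame(1, list('abc'), list('xyz'))

wide = pd.concat([df], keys=['A'])           # the axis=0, level=0 version from above
arr = wide.to_numpy().reshape(1, *df.shape)  # view the values as a (1, 3, 3) array

print(arr.shape)                                            # (1, 3, 3)
print(np.array_equal(arr, df.to_numpy()[np.newaxis, ...]))  # True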
I'm trying to transform a dataframe
df = pd.DataFrame({
'c1': ['x','y','z'],
'c2': [[1,2,3],[1,3],[2,4]]})
which looks like
c1 c2
0 x [1, 2, 3]
1 y [1, 3]
2 z [2, 4]
into
p = pd.DataFrame({
'c1': ['x','y','z'],
1: [1,1,0],
2: [1,0,1],
3: [1,1,0],
4: [0,0,1]
})
which looks like
c1 1 2 3 4
0 x 1 1 1 0
1 y 1 0 1 0
2 z 0 1 0 1
The 1's and 0's are meant to represent true and false. I'm still learning pivots; please point me in the right direction.
You can use:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df1 = pd.DataFrame(mlb.fit_transform(df['c2']),columns=mlb.classes_, index=df.index)
df = df.drop('c2', axis=1).join(df1)
print (df)
c1 1 2 3 4
0 x 1 1 1 0
1 y 1 0 1 0
2 z 0 1 0 1
Another solution:
df1 = df['c2'].apply(lambda x: '|'.join([str(y) for y in x])).str.get_dummies()
df = df.drop('c2', axis=1).join(df1)
print (df)
c1 1 2 3 4
0 x 1 1 1 0
1 y 1 0 1 0
2 z 0 1 0 1
EDIT:
Thanks to MaxU for the nice suggestion:
df = df.join(pd.DataFrame(mlb.fit_transform(df.pop('c2')),
columns=mlb.classes_,
index=df.index))
You can use:
In [235]: df.join(pd.DataFrame([{x: 1 for x in r} for r in df.c2]).fillna(0))
Out[235]:
c1 c2 1 2 3 4
0 x [1, 2, 3] 1.0 1.0 1.0 0.0
1 y [1, 3] 1.0 0.0 1.0 0.0
2 z [2, 4] 0.0 1.0 0.0 1.0
Details
In [236]: pd.DataFrame([{x: 1 for x in r} for r in df.c2]).fillna(0)
Out[236]:
1 2 3 4
0 1.0 1.0 1.0 0.0
1 1.0 0.0 1.0 0.0
2 0.0 1.0 0.0 1.0
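On newer pandas (0.25+), another alternative worth sketching is to explode the list column, one-hot encode it, and collapse back to one row per original index (my addition, not from the answers above):

import pandas as pd

df = pd.DataFrame({'c1': ['x', 'y', 'z'],
                   'c2': [[1, 2, 3], [1, 3], [2, 4]]})

# explode -> one row per list element, get_dummies -> indicator columns,
# groupby(level=0).max() -> back to one row per original row
dummies = pd.get_dummies(df['c2'].explode()).groupby(level=0).max().astype(int)
out = df.drop(columns='c2').join(dummies)
print(out)
#   c1  1  2  3  4
# 0  x  1  1  1  0
# 1  y  1  0  1  0
# 2  z  0  1  0  1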
Consider the following dataframes d1 and d2:
d1 = pd.DataFrame([
[1, 2, 3],
[2, 3, 4],
[3, 4, 5],
[1, 2, 3],
[2, 3, 4],
[3, 4, 5]
], columns=list('ABC'))
d2 = pd.get_dummies(list('XYZZXY'))
d1
A B C
0 1 2 3
1 2 3 4
2 3 4 5
3 1 2 3
4 2 3 4
5 3 4 5
d2
X Y Z
0 1 0 0
1 0 1 0
2 0 0 1
3 0 0 1
4 1 0 0
5 0 1 0
I need to get a new dataframe with a MultiIndex columns object that contains the product of every combination of columns from d1 and d2.
So far I've done this...
from itertools import product
pd.concat({(x, y): d1[x] * d2[y] for x, y in product(d1, d2)}, axis=1)
A B C
X Y Z X Y Z X Y Z
0 1 0 0 2 0 0 3 0 0
1 0 2 0 0 3 0 0 4 0
2 0 0 3 0 0 4 0 0 5
3 0 0 1 0 0 2 0 0 3
4 2 0 0 3 0 0 4 0 0
5 0 3 0 0 4 0 0 5 0
There is nothing wrong with this method. But I'm looking for alternatives to evaluate.
Inspired by Yakym Pirozhenko's answer:
m, n = len(d1.columns), len(d2.columns)
lvl0 = np.repeat(np.arange(m), n)
lvl1 = np.tile(np.arange(n), m)
v1, v2 = d1.values, d2.values
pd.DataFrame(
v1[:, lvl0] * v2[:, lvl1],
d1.index,
pd.MultiIndex.from_tuples(list(zip(d1.columns[lvl0], d2.columns[lvl1])))
)
However, this is a clumsier implementation of NumPy broadcasting, which is covered better in Divakar's answer.
Timing
All of the answers are good and demonstrate different aspects of pandas and NumPy. Please consider up-voting them if you found them useful and informative.
%%timeit
m, n = len(d1.columns), len(d2.columns)
lvl0 = np.repeat(np.arange(m), n)
lvl1 = np.tile(np.arange(n), m)
v1, v2 = d1.values, d2.values
pd.DataFrame(
v1[:, lvl0] * v2[:, lvl1],
d1.index,
pd.MultiIndex.from_tuples(list(zip(d1.columns[lvl0], d2.columns[lvl1])))
)
%%timeit
vals = (d2.values[:,None,:] * d1.values[:,:,None]).reshape(d1.shape[0],-1)
cols = pd.MultiIndex.from_product([d1.columns, d2.columns])
pd.DataFrame(vals, columns=cols, index=d1.index)
%timeit d1.apply(lambda x: d2.mul(x, axis=0).stack()).unstack()
%timeit pd.concat({x : d2.mul(d1[x], axis=0) for x in d1.columns}, axis=1)
%timeit pd.concat({(x, y): d1[x] * d2[y] for x, y in product(d1, d2)}, axis=1)
1000 loops, best of 3: 663 µs per loop
1000 loops, best of 3: 624 µs per loop
100 loops, best of 3: 3.38 ms per loop
1000 loops, best of 3: 860 µs per loop
100 loops, best of 3: 2.01 ms per loop
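As a quick sanity check (not part of the original timings), the broadcasting result can be compared against the dict-comprehension baseline from the question; the sketch below assumes the d1/d2 defined there:

import numpy as np
import pandas as pd
from itertools import product

d1 = pd.DataFrame([[1, 2, 3], [2, 3, 4], [3, 4, 5],
                   [1, 2, 3], [2, 3, 4], [3, 4, 5]], columns=list('ABC'))
d2 = pd.get_dummies(list('XYZZXY'))

# the original dict-comprehension approach
baseline = pd.concat({(x, y): d1[x] * d2[y] for x, y in product(d1, d2)}, axis=1)

# the NumPy broadcasting approach
vals = (d2.values[:, None, :] * d1.values[:, :, None]).reshape(d1.shape[0], -1)
cols = pd.MultiIndex.from_product([d1.columns, d2.columns])
broadcast = pd.DataFrame(vals, columns=cols, index=d1.index)

print(baseline.equals(broadcast))  # expected True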
Here is a one-liner that uses the pandas stack and unstack methods.
The "trick" is to use stack, so that the result of each computation within apply is a Series with a MultiIndex. Then use unstack to obtain the MultiIndex column form.
d1.apply(lambda x: d2.mul(x, axis=0).stack()).unstack()
Which gives:
A B C
X Y Z X Y Z X Y Z
0 1.0 0.0 0.0 2.0 0.0 0.0 3.0 0.0 0.0
1 0.0 2.0 0.0 0.0 3.0 0.0 0.0 4.0 0.0
2 0.0 0.0 3.0 0.0 0.0 4.0 0.0 0.0 5.0
3 0.0 0.0 1.0 0.0 0.0 2.0 0.0 0.0 3.0
4 2.0 0.0 0.0 3.0 0.0 0.0 4.0 0.0 0.0
5 0.0 3.0 0.0 0.0 4.0 0.0 0.0 5.0 0.0
Here's one approach with NumPy broadcasting -
vals = (d2.values[:,None,:] * d1.values[:,:,None]).reshape(d1.shape[0],-1)
cols = pd.MultiIndex.from_product([d1.columns, d2.columns])
df_out = pd.DataFrame(vals, columns=cols, index=d1.index)
Sample run -
In [92]: d1
Out[92]:
A B C
0 1 2 3
1 2 3 4
2 3 4 5
3 1 2 3
4 2 3 4
5 3 4 5
In [93]: d2
Out[93]:
X Y Z
0 1 0 0
1 0 1 0
2 0 0 1
3 0 0 1
4 1 0 0
5 0 1 0
In [110]: vals = (d2.values[:,None,:] * d1.values[:,:,None]).reshape(d1.shape[0],-1)
...: cols = pd.MultiIndex.from_product([d1.columns, d2.columns])
...: df_out = pd.DataFrame(vals, columns=cols, index=d1.index)
...:
In [111]: df_out
Out[111]:
A B C
X Y Z X Y Z X Y Z
0 1 0 0 2 0 0 3 0 0
1 0 2 0 0 3 0 0 4 0
2 0 0 3 0 0 4 0 0 5
3 0 0 1 0 0 2 0 0 3
4 2 0 0 3 0 0 4 0 0
5 0 3 0 0 4 0 0 5 0
Here's a somewhat vectorized version. There could be a better way.
In [846]: pd.concat({x : d2.mul(d1[x], axis=0) for x in d1.columns}, axis=1)
Out[846]:
A B C
X Y Z X Y Z X Y Z
0 1 0 0 2 0 0 3 0 0
1 0 2 0 0 3 0 0 4 0
2 0 0 3 0 0 4 0 0 5
3 0 0 1 0 0 2 0 0 3
4 2 0 0 3 0 0 4 0 0
5 0 3 0 0 4 0 0 5 0
You could build the MultiIndex first, use it to expand both frames to the same column layout, and multiply directly.
cols = pd.MultiIndex.from_tuples(
[(c1, c2) for c1 in d1.columns for c2 in d2.columns])
a = d1.loc[:,cols.get_level_values(0)]
b = d2.loc[:,cols.get_level_values(1)]
a.columns = b.columns = cols
res = a * b
I have a pandas dataframe and want to replace each Y value with the mean for its group.
ID X Y
1 a 1
2 a 2
3 a 3
4 b 2
5 b 4
How do I replace Y values with mean Y for every unique X?
ID X Y
1 a 2
2 a 2
3 a 2
4 b 3
5 b 3
Use transform:
df['Y'] = df.groupby('X')['Y'].transform('mean')
print (df)
ID X Y
0 1 a 2
1 2 a 2
2 3 a 2
3 4 b 3
4 5 b 3
For a new column in another DataFrame, use map with drop_duplicates:
df1 = pd.DataFrame({'X':['a','a','b']})
print (df1)
X
0 a
1 a
2 b
df1['Y'] = df1['X'].map(df.drop_duplicates('X').set_index('X')['Y'])
print (df1)
X Y
0 a 2
1 a 2
2 b 3
Another solution:
df1['Y'] = df1['X'].map(df.groupby('X')['Y'].mean())
print (df1)
X Y
0 a 2
1 a 2
2 b 3
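A small self-contained check (the comparison itself is my addition, using the data from the question) that the transform-based and map-based approaches agree:

import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                   'X':  ['a', 'a', 'a', 'b', 'b'],
                   'Y':  [1, 2, 3, 2, 4]})

via_transform = df.groupby('X')['Y'].transform('mean')   # broadcast group means back to every row
via_map = df['X'].map(df.groupby('X')['Y'].mean())        # look up per-group means by key

print(via_transform.equals(via_map))  # True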