How to reshape a multi-column dataframe by index? - python

Following from here . The solution works for only one column. How to improve the solution for multiple columns. i.e If I have a dataframe like
df= pd.DataFrame([['a','b'],['b','c'],['c','z'],['d','b']],index=[0,0,1,1])
0 1
0 a b
0 b c
1 c z
1 d b
How to reshape them like
0 1 2 3
0 a b b c
1 c z d b
If df is
0 1
0 a b
1 c z
1 d b
Then
0 1 2 3
0 a b NaN NaN
1 c z d b

Use flatten/ravel
In [4401]: df.groupby(level=0).apply(lambda x: pd.Series(x.values.flatten()))
Out[4401]:
0 1 2 3
0 a b b c
1 c z d b
Or, stack
In [4413]: df.groupby(level=0).apply(lambda x: pd.Series(x.stack().values))
Out[4413]:
0 1 2 3
0 a b b c
1 c z d b
Also, with unequal indices
In [4435]: df.groupby(level=0).apply(lambda x: x.values.ravel()).apply(pd.Series)
Out[4435]:
0 1 2 3
0 a b NaN NaN
1 c z d b

Use groupby + pd.Series + np.reshape:
df.groupby(level=0).apply(lambda x: pd.Series(x.values.reshape(-1, )))
0 1 2 3
0 a b b c
1 c z d b
Solution for unequal number of indices - call the pd.DataFrame constructor instead.
df
0 1
0 a b
1 c z
1 d b
df.groupby(level=0).apply(lambda x: \
pd.DataFrame(x.values.reshape(1, -1))).reset_index(drop=True)
0 1 2 3
0 a b NaN NaN
1 c z d b

pd.DataFrame({n: g.values.ravel() for n, g in df.groupby(level=0)}).T
0 1 2 3
0 a b b c
1 c z d b
This is all over the place and I'm too tired to make it pretty
v = df.values
cc = df.groupby(level=0).cumcount().values
i0, r = pd.factorize(df.index.values)
n, m = v.shape
j0 = np.tile(np.arange(m), n)
j = np.arange(r.size * m).reshape(-1, m)[cc].ravel()
i = i0.repeat(m)
e = np.empty((r.size, m * r.size), dtype=object)
e[i, j] = v.ravel()
pd.DataFrame(e, r)
0 1 2 3
0 a b None None
1 c z d b

Let's try
df1 = df.set_index(df.groupby(level=0).cumcount(), append=True).unstack()
df1.set_axis(labels=pd.np.arange(len(df1.columns)), axis=1)
Output:
0 1 2 3
0 a b b c
1 c d z b
Output for df with NaN:
0 1 2 3
0 a None b None
1 c d z b

Related

separating values ​between rows with pandas

I want to separate values in "alpha" column like this
Start:
alpha
beta
gamma
A
1
0
A
1
1
B
1
0
B
1
1
B
1
0
C
1
1
End:
alpha
beta
gamma
A
1
0
A
1
1
X
X
X
B
1
0
B
1
1
B
1
0
X
X
X
C
1
1
Thanks for help <3
You can try
out = (df.groupby('alpha')
.apply(lambda g: pd.concat([g, pd.DataFrame([['X', 'X', 'X']], columns=df.columns)]))
.reset_index(drop=True)[:-1])
print(out)
alpha beta gamma
0 A 1 0
1 A 1 1
2 X X X
3 B 1 0
4 B 1 1
5 B 1 0
6 X X X
7 C 1 1
Assuming a range index as in the example, you can use:
# get indices in between 2 groups
idx = df['alpha'].ne(df['alpha'].shift(-1).ffill())
df2 = pd.concat([df, df[idx].assign(**{c: 'X' for c in df})]).sort_index(kind='stable')
Or without groupby and sort_index:
idx = df['alpha'].ne(df['alpha'].shift(-1).ffill())
df2 = df.loc[df.index.repeat(idx+1)]
df2.loc[df2.index.duplicated()] = 'X'
output:
alpha beta gamma
0 A 1 0
1 A 1 1
1 X X X
2 B 1 0
3 B 1 1
4 B 1 0
4 X X X
5 C 1 1
NB. add reset_index(drop=True) to get a new index
You can do:
dfx = pd.DataFrame({'alpha':['X'],'beta':['X'],'gamma':['X']})
df = df.groupby('alpha',as_index=False).apply(lambda x:x.append(dfx)).reset_index(drop=True)
Output:
alpha beta gamma
0 A 1 0
1 A 1 1
2 X X X
3 B 1 0
4 B 1 1
5 B 1 0
6 X X X
7 C 1 1
8 X X X
To avoid adding a [X, X, X] at the end you can check the index first like:
df.groupby('alpha',as_index=False).apply(
lambda x:x.append(dfx)
if x.index[-1] != df.index[-1] else x).reset_index(drop=True)

pandas compare first rows and make identical

I have two dfs.
df1 = pd.DataFrame(["bazzar","dogsss","zxvfzx","anythi"], columns = [0], index = [0,1,2,3])
df2 = pd.DataFrame(["baar","maar","cats","$%&*"], columns = [0], index = [0,1,2,3])
df1 = df1[0].apply(lambda x: pd.Series(list(x)))
df2 = df2[0].apply(lambda x: pd.Series(list(x)))
which look like
df1
0 1 2 3 4 5
0 b a z z a r
1 d o g s s s
2 z x v f z x
3 a n y t h i
df2
0 1 2 3
0 b a a r
1 m a a r
2 c a t s
3 $ % & *
I want to compare their first rows and make them identical by inserting new columns containing the character z to df2, so that df2 becomes
0 1 2 3 4 5
0 b a z z a r
1 m a z z a r
2 c a z z t s
3 $ % z z & *
An additional example:
df3 = pd.DataFrame(["aazzbbzcc","bbbbbbbbb","ccccccccc","ddddddddd"], columns = [0], index = [0,1,2,3])
df4 = pd.DataFrame(["aabbcc","111111","222222","333333"], columns = [0], index = [0,1,2,3])
df3 = df3[0].apply(lambda x: pd.Series(list(x)))
df4 = df4[0].apply(lambda x: pd.Series(list(x)))
df3
0 1 2 3 4 5 6 7 8
0 a a z z b b z c c
1 b b b b b b b b b
2 c c c c c c c c c
3 d d d d d d d d d
df4
0 1 2 3 4 5
0 a a b b c c
1 1 1 1 1 1 1
2 2 2 2 2 2 2
3 3 3 3 3 3 3
You can see, an important relationship between the first rows of the two dataframes: they will eventually become the same when character z are added to the later dataframe (i.e. df2 and df4), so that the expected output for this example is:
0 1 2 3 4 5 6 7 8
0 a a z z b b z c c
1 1 1 z z 1 1 z 1 1
2 2 2 z z 2 2 z 2 2
3 3 3 z z 3 3 z 3 3
Any idea how to do that?
Because in first rows are duplicated values are create MultiIndex with first rows and GroupBy.cumcount for both DataFrames:
a = df1.iloc[[0]].T
df1.columns = [a[0], a.groupby(a[0]).cumcount()]
b = df2.iloc[[0]].T
df2.columns = [b[0], b.groupby(b[0]).cumcount()]
print (df1)
0 b a z a r
0 0 0 1 1 0
0 b a z z a r
1 d o g s s s
2 z x v f z x
3 a n y t h i
print (df2)
0 b a r
0 0 1 0
0 b a a r
1 m a a r
2 c a t s
3 $ % & *
And then is used DataFrame.reindex with replace missing values by first row of df1:
df = df2.reindex(df1.columns, axis=1).fillna(df1.iloc[0])
print (df)
0 b a z a r
0 0 0 1 1 0
0 b a z z a r
1 m a z z a r
2 c a z z t s
3 $ % z z & *
Last set range to columns:
df.columns = range(len(df.columns))
print (df)
0 1 2 3 4 5
0 b a z z a r
1 m a z z a r
2 c a z z t s
3 $ % z z & *
Check where to add:
list(difflib.ndiff(df2[0][0], df1[0][0]))
[' b', ' a', '+ z', '+ z', ' a', ' r']
Add manually
df2[0].str.replace('(.){2}', '\\1zz', regex = True).str.split('(?<=\\S)(?=\\S)', expand = True)
Out[1557]:
0 1 2 3 4 5
0 a z z r z z
1 a z z r z z
2 a z z s z z
3 % z z * z z

How to split dataframe in pandas

I have dataframe below
A B C
0 a h
0 b i
0 c j
1 d k
1 e l
2 f m
2 g n
I would like to split dataframe by df.A
A B C
0 a h
0 b i
0 c j
and
A B C
1 d k
1 e l
and
A B C
2 f m
2 g n
I tried groupby but It didnt work well. how can I split dataframe to multiple dataframe?
You can create dictionary of DataFrames by dict comprehension:
dfs = {k:v for k, v in df.groupby('A')}
print (dfs)
{0: A B C
0 0 a h
1 0 b i
2 0 c j, 1: A B C
3 1 d k
4 1 e l, 2: A B C
5 2 f m
6 2 g n}
print (dfs[0])
A B C
0 0 a h
1 0 b i
2 0 c j
print (dfs[1])
A B C
3 1 d k
4 1 e l
If necessary you can reset index:
dfs = {k:v.reset_index(drop=True) for k, v in df.groupby('A')}
print (dfs)
{0: A B C
0 0 a h
1 0 b i
2 0 c j, 1: A B C
0 1 d k
1 1 e l, 2: A B C
0 2 f m
1 2 g n}
print (dfs[1])
A B C
0 1 d k
1 1 e l
print (dfs[2])
A B C
0 2 f m
1 2 g n

Get first version of a line with duplicate values versus one column

Hello I'm looking for a way to get from this dataframe df::
df = pd.DataFrame(dict(X=list('abbcccddef'),
Y=list('ABCDEFGHIJ'),
Z=list('1234123412')))
df
# X Y Z
# 0 a A 1
# 1 b B 2
# 2 b C 3
# 3 c D 4
# 4 c E 1
# 5 c F 2
# 6 d G 3
# 7 d H 4
# 8 e I 1
# 9 f J 2
Only the first lines for each X value, so this one::
# X Y Z
# 0 a A 1
# 1 b B 2
# 3 c D 4
# 6 d G 3
# 8 e I 1
# 9 f J 2
I'm looking for a more elegant way than this::
x_unique = df.X.unique()
x_unique
# array(['a', 'b', 'c', 'd', 'e', 'f'], dtype=object)
res = df[df.X == x_unique[0]].iloc[0]
for u in x_unique[1:]:
res = pd.concat([res, df[df.X==u].iloc[0]], axis=1)
res
# 0 1 3 6 8 9
# X a b c d e f
# Y A B D G I J
# Z 1 2 4 3 1 2
res = res.transpose()
res
# X Y Z
# 0 a A 1
# 1 b B 2
# 3 c D 4
# 6 d G 3
# 8 e I 1
# 9 f J 2
You could use drop_duplicates() method on X
In [60]: df.drop_duplicates('X')
Out[60]:
X Y Z
0 a A 1
1 b B 2
3 c D 4
6 d G 3
8 e I 1
9 f J 2
You can also do:
In [3]: import pandas as pd
In [4]: df = pd.DataFrame(dict(X=list('abbcccddef'),
Y=list('ABCDEFGHIJ'),
Z=list('1234123412')))
In [5]: df.groupby('X').first()
Out[5]:
Y Z
X
a A 1
b B 2
c D 4
d G 3
e I 1
f J 2

Pivoting a table with hierarchical index

This is a simple problem but for some reason I am not able to find an easy solution.
I have a hierarchically indexed Series, for example:
s = pd.Series(data=randint(0, 3, 45),
index=pd.MultiIndex.from_tuples(list(itertools.product('pqr',[0,1,2],'abcde')),
names=['Index1', 'Index2', 'Index3']), name='P')
s = s.map({0:'A', 1:'B', 2:'C'})
So it looks like
Index1 Index2 Index3
p 0 a A
b A
c C
d B
e C
1 a B
b C
c C
d B
e B
q 0 a B
b C
c C
d C
e C
1 a A
b A
c B
d C
e A
I want to do a frequency count by value so that the output looks like
Index1 Index2 P
p 0 A 2
B 1
C 2
1 A 0
B 3
C 2
q 0 A 0
B 1
C 4
1 A 3
B 1
C 1
You can apply value_counts to the Series groupby:
In [11]: s.groupby(level=[0, 1]).value_counts() # equiv .apply(pd.value_counts)
Out[11]:
Index1 Index2
p 0 C 2
A 2
B 1
1 B 3
A 2
2 A 3
B 1
C 1
q 0 A 3
B 1
C 1
1 B 2
C 2
A 1
2 C 3
B 1
A 1
r 0 A 3
B 1
C 1
1 B 3
C 2
2 B 3
C 1
A 1
dtype: int64
If you want to include the 0s (which the above won't) you could use cross_tab:
In [21]: ct = pd.crosstab(rows=[s.index.get_level_values(0), s.index.get_level_values(1)],
cols=s.values,
aggfunc=len,
rownames=s.index.names[:2],
colnames=s.index.names[2:3])
In [22]: ct
Out[22]:
Index3 A B C
Index1 Index2
p 0 2 1 2
1 2 3 0
2 3 1 1
q 0 3 1 1
1 1 2 2
2 1 1 3
r 0 3 1 1
1 0 3 2
2 1 3 1
In [23]: ct.stack()
Out[23]:
Index1 Index2 Index3
p 0 A 2
B 1
C 2
1 A 2
B 3
C 0
2 A 3
B 1
C 1
q 0 A 3
B 1
C 1
1 A 1
B 2
C 2
2 A 1
B 1
C 3
r 0 A 3
B 1
C 1
1 A 0
B 3
C 2
2 A 1
B 3
C 1
dtype: int64
Which may be slightly faster...

Categories

Resources