How to split dataframe in pandas - python

I have the dataframe below:
A B C
0 a h
0 b i
0 c j
1 d k
1 e l
2 f m
2 g n
I would like to split the dataframe by df.A:
A B C
0 a h
0 b i
0 c j
and
A B C
1 d k
1 e l
and
A B C
2 f m
2 g n
I tried groupby, but it didn't work well. How can I split the dataframe into multiple dataframes?

You can create a dictionary of DataFrames with a dict comprehension:
dfs = {k:v for k, v in df.groupby('A')}
print (dfs)
{0: A B C
0 0 a h
1 0 b i
2 0 c j, 1: A B C
3 1 d k
4 1 e l, 2: A B C
5 2 f m
6 2 g n}
print (dfs[0])
A B C
0 0 a h
1 0 b i
2 0 c j
print (dfs[1])
A B C
3 1 d k
4 1 e l
If necessary, you can reset the index:
dfs = {k:v.reset_index(drop=True) for k, v in df.groupby('A')}
print (dfs)
{0: A B C
0 0 a h
1 0 b i
2 0 c j, 1: A B C
0 1 d k
1 1 e l, 2: A B C
0 2 f m
1 2 g n}
print (dfs[1])
A B C
0 1 d k
1 1 e l
print (dfs[2])
A B C
0 2 f m
1 2 g n
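If the keys themselves are not needed, a plain list works just as well as the dictionary above — a minimal, self-contained sketch of both on the question's data:

```python
import pandas as pd

# The sample frame from the question
df = pd.DataFrame({'A': [0, 0, 0, 1, 1, 2, 2],
                   'B': list('abcdefg'),
                   'C': list('hijklmn')})

# Dictionary of sub-frames, one per value of A, each with a fresh index
dfs = {k: v.reset_index(drop=True) for k, v in df.groupby('A')}

# Or, when the group keys don't matter, a simple list of sub-frames
parts = [g.reset_index(drop=True) for _, g in df.groupby('A')]

print(dfs[0])
print(parts[1])
```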

Related

How do I explode equal length strings into Pandas DataFrame columns (without empty column)

I have a data frame with strings of equal length (10). I want to "explode" this column into 10 columns. No matter which existing solution I try, I end up with a leading empty column, so needless to say the existing answers to this question are not satisfactory.
import pandas as pd
df = pd.DataFrame(['tenletters', 'alsotenten', 'letter10!!', 'ten10lette'],
                  columns=['col1'])
df2 = pd.DataFrame(df['col1'].str.split('').tolist())
0 1 2 3 4 5 6 7 8 9 10 11
0 t e n l e t t e r s
1 a l s o t e n t e n
2 l e t t e r 1 0 ! !
3 t e n 1 0 l e t t e
How can I do this the proper way (i.e., without a leading empty column)?
Use map:
df_final = pd.DataFrame(df['col1'].map(list).tolist())
Out[44]:
0 1 2 3 4 5 6 7 8 9
0 t e n l e t t e r s
1 a l s o t e n t e n
2 l e t t e r 1 0 ! !
3 t e n 1 0 l e t t e
>>> pd.DataFrame(df['col1'].apply(list).tolist())
0 1 2 3 4 5 6 7 8 9
0 t e n l e t t e r s
1 a l s o t e n t e n
2 l e t t e r 1 0 ! !
3 t e n 1 0 l e t t e
You can use pd.Series.apply:
df.col1.apply(lambda x: pd.Series(list(x)))
0 1 2 3 4 5 6 7 8 9
0 t e n l e t t e r s
1 a l s o t e n t e n
2 l e t t e r 1 0 ! !
3 t e n 1 0 l e t t e
You can try this for fun (not a performant solution), using pd.Series.str.extractall:
df.col1.str.extractall(r'(.)').unstack()
0
match 0 1 2 3 4 5 6 7 8 9
0 t e n l e t t e r s
1 a l s o t e n t e n
2 l e t t e r 1 0 ! !
3 t e n 1 0 l e t t e
Note: the result's columns are a MultiIndex; to flatten them to a single level, assign the result to df and use df.columns = df.columns.get_level_values(1)
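As a self-contained check of the map-based approach above, on the question's data:

```python
import pandas as pd

df = pd.DataFrame(['tenletters', 'alsotenten', 'letter10!!', 'ten10lette'],
                  columns=['col1'])

# list() splits each string into single characters with no leading empty
# element, which avoids the empty first column str.split('') produced
df_final = pd.DataFrame(df['col1'].map(list).tolist())
print(df_final)
```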

Adding a row to each index on a multi-indexed dataframe

I have a multi-indexed dataframe, and I want to add, for each value of the outermost index, one more row in which the two inner indices are marked with a specific string (the same string for every value). The other values of that row can be empty or anything else.
I tried creating a different dataframe using groupby and appending them but I can't get the indices to work.
For example, for the dataframe:
Index1 Index2 Index3 val
A d 1 a
A d 2 b
A e 3 c
A e 4 d
B f 5 e
B f 6 f
B g 7 g
C h 8 h
C h 9 i
C i 10 j
I would like to get:
Index1 Index2 Index3 val
A d 1 a
A d 2 b
A e 3 c
A e 4 d
A StringA StringA <any value>
B f 5 e
B f 6 f
B g 7 g
B StringA StringA <any value>
C h 8 h
C h 9 i
C i 10 j
C StringA StringA <any value>
IIUC
s = pd.DataFrame({'Index1': df.Index1.unique(),
                  'Index2': df.Index1.radd('String').unique(),
                  'Index3': df.Index1.radd('String').unique(),
                  'val': [1]*df.Index1.nunique()})
pd.concat([df.reset_index(),s]).sort_values('Index1').set_index(['Index1','Index2','Index3'])
Out[301]:
Index1 Index2 Index3 val
0 A d 1 a
1 A d 2 b
2 A e 3 c
3 A e 4 d
0 A StringA StringA 1
4 B f 5 e
5 B f 6 f
6 B g 7 g
1 B StringB StringB 1
7 C h 8 h
8 C h 9 i
9 C i 10 j
2 C StringC StringC 1
You can unstack, assign, stack:
new_df = df.unstack(level=(-1,-2))
# you can pass a series here
new_df[('val','StringA','StringA')] = 'ABC'
new_df.stack(level=(-1,-2))
Output:
val
Index1 Index2 Index3
A d 1 a
2 b
e 3 c
4 d
StringA StringA ABC
B f 5 e
6 f
g 7 g
StringA StringA ABC
C h 8 h
9 i
i 10 j
StringA StringA ABC
Or try using:
import numpy as np

groupby = df.groupby(df['Index1'], as_index=False).last()
groupby[['Index2', 'Index3', 'val']] = ['StringA', 'StringA', np.nan]
df = pd.concat([df, groupby]).sort_values(['Index1', 'Index3']).reset_index()
print(df)
Output:
index Index1 Index2 Index3 val
0 0 A d 1 a
1 1 A d 2 b
2 2 A e 3 c
3 3 A e 4 d
4 0 A StringA StringA NaN
5 4 B f 5 e
6 5 B f 6 f
7 6 B g 7 g
8 1 B StringA StringA NaN
9 7 C h 8 h
10 8 C h 9 i
11 9 C i 10 j
12 2 C StringA StringA NaN
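Not taken verbatim from the answers above, but as a rough sketch of the same idea: build all the marker rows in one small frame, then rely on a stable sort to slot each one in after its group (the marker string 'StringA' and the NaN filler are assumptions matching the question's example):

```python
import pandas as pd
import numpy as np

# The sample frame from the question, with the index levels as columns
df = pd.DataFrame({'Index1': list('AAAABBBCCC'),
                   'Index2': list('ddeeffghhi'),
                   'Index3': range(1, 11),
                   'val':    list('abcdefghij')})

# One extra row per outer index value; inner indices marked with 'StringA'
extra = pd.DataFrame({'Index1': df['Index1'].unique(),
                      'Index2': 'StringA',
                      'Index3': 'StringA',
                      'val': np.nan})

# A stable sort on Index1 keeps original rows first and each marker row last
out = (pd.concat([df, extra], ignore_index=True)
         .sort_values('Index1', kind='stable')
         .reset_index(drop=True))
print(out)
```

Calling set_index(['Index1', 'Index2', 'Index3']) on the result restores the multi-index layout if needed.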

How to reshape a multi-column dataframe by index?

Following from here. That solution works for only one column. How can it be improved for multiple columns? I.e., if I have a dataframe like
df = pd.DataFrame([['a','b'],['b','c'],['c','z'],['d','b']], index=[0,0,1,1])
0 1
0 a b
0 b c
1 c z
1 d b
How to reshape them like
0 1 2 3
0 a b b c
1 c z d b
If df is
0 1
0 a b
1 c z
1 d b
Then
0 1 2 3
0 a b NaN NaN
1 c z d b
Use flatten/ravel
In [4401]: df.groupby(level=0).apply(lambda x: pd.Series(x.values.flatten()))
Out[4401]:
0 1 2 3
0 a b b c
1 c z d b
Or, stack
In [4413]: df.groupby(level=0).apply(lambda x: pd.Series(x.stack().values))
Out[4413]:
0 1 2 3
0 a b b c
1 c z d b
Also, with unequal indices
In [4435]: df.groupby(level=0).apply(lambda x: x.values.ravel()).apply(pd.Series)
Out[4435]:
0 1 2 3
0 a b NaN NaN
1 c z d b
Use groupby + pd.Series + np.reshape:
df.groupby(level=0).apply(lambda x: pd.Series(x.values.reshape(-1, )))
0 1 2 3
0 a b b c
1 c z d b
Solution for unequal number of indices - call the pd.DataFrame constructor instead.
df
0 1
0 a b
1 c z
1 d b
(df.groupby(level=0)
   .apply(lambda x: pd.DataFrame(x.values.reshape(1, -1)))
   .reset_index(drop=True))
0 1 2 3
0 a b NaN NaN
1 c z d b
pd.DataFrame({n: g.values.ravel() for n, g in df.groupby(level=0)}).T
0 1 2 3
0 a b b c
1 c z d b
This is all over the place and I'm too tired to make it pretty
v = df.values
cc = df.groupby(level=0).cumcount().values
i0, r = pd.factorize(df.index.values)
n, m = v.shape
j0 = np.tile(np.arange(m), n)
j = np.arange(r.size * m).reshape(-1, m)[cc].ravel()
i = i0.repeat(m)
e = np.empty((r.size, m * r.size), dtype=object)
e[i, j] = v.ravel()
pd.DataFrame(e, r)
0 1 2 3
0 a b None None
1 c z d b
Let's try
df1 = df.set_index(df.groupby(level=0).cumcount(), append=True).unstack()
df1.set_axis(range(len(df1.columns)), axis=1)
Output:
0 1 2 3
0 a b b c
1 c d z b
Output for df with NaN:
0 1 2 3
0 a None b None
1 c d z b
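The flatten approach above can be checked end to end on the question's equal-sized input — a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame([['a', 'b'], ['b', 'c'], ['c', 'z'], ['d', 'b']],
                  index=[0, 0, 1, 1])

# Each group's 2x2 block of values is flattened row-wise into one row of 4
wide = df.groupby(level=0).apply(lambda x: pd.Series(x.values.flatten()))
print(wide)
```

For ragged groups, the pd.DataFrame-constructor variant shown above is the one that fills the short rows with NaN.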

Get first version of a line with duplicate values versus one column

Hello, I'm looking for a way to get, from this dataframe df:
df = pd.DataFrame(dict(X=list('abbcccddef'),
                       Y=list('ABCDEFGHIJ'),
                       Z=list('1234123412')))
df
# X Y Z
# 0 a A 1
# 1 b B 2
# 2 b C 3
# 3 c D 4
# 4 c E 1
# 5 c F 2
# 6 d G 3
# 7 d H 4
# 8 e I 1
# 9 f J 2
only the first line for each X value, so this one:
# X Y Z
# 0 a A 1
# 1 b B 2
# 3 c D 4
# 6 d G 3
# 8 e I 1
# 9 f J 2
I'm looking for a more elegant way than this:
x_unique = df.X.unique()
x_unique
# array(['a', 'b', 'c', 'd', 'e', 'f'], dtype=object)
res = df[df.X == x_unique[0]].iloc[0]
for u in x_unique[1:]:
res = pd.concat([res, df[df.X==u].iloc[0]], axis=1)
res
# 0 1 3 6 8 9
# X a b c d e f
# Y A B D G I J
# Z 1 2 4 3 1 2
res = res.transpose()
res
# X Y Z
# 0 a A 1
# 1 b B 2
# 3 c D 4
# 6 d G 3
# 8 e I 1
# 9 f J 2
You could use the drop_duplicates() method on X:
In [60]: df.drop_duplicates('X')
Out[60]:
X Y Z
0 a A 1
1 b B 2
3 c D 4
6 d G 3
8 e I 1
9 f J 2
You can also do:
In [3]: import pandas as pd
In [4]: df = pd.DataFrame(dict(X=list('abbcccddef'),
                               Y=list('ABCDEFGHIJ'),
                               Z=list('1234123412')))
In [5]: df.groupby('X').first()
Out[5]:
Y Z
X
a A 1
b B 2
c D 4
d G 3
e I 1
f J 2
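For completeness, drop_duplicates also takes a keep parameter, so the same call can return the last occurrence per X instead of the first — a quick sketch on the question's data:

```python
import pandas as pd

df = pd.DataFrame(dict(X=list('abbcccddef'),
                       Y=list('ABCDEFGHIJ'),
                       Z=list('1234123412')))

first = df.drop_duplicates('X')               # first row per X (the default)
last = df.drop_duplicates('X', keep='last')   # last row per X instead
print(first)
```

Unlike groupby('X').first(), drop_duplicates keeps X as an ordinary column and preserves the original row index.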

Pivoting a table with hierarchical index

This is a simple problem but for some reason I am not able to find an easy solution.
I have a hierarchically indexed Series, for example:
import itertools
import numpy as np
import pandas as pd

s = pd.Series(data=np.random.randint(0, 3, 45),
              index=pd.MultiIndex.from_tuples(list(itertools.product('pqr', [0, 1, 2], 'abcde')),
                                              names=['Index1', 'Index2', 'Index3']), name='P')
s = s.map({0: 'A', 1: 'B', 2: 'C'})
So it looks like
Index1 Index2 Index3
p 0 a A
b A
c C
d B
e C
1 a B
b C
c C
d B
e B
q 0 a B
b C
c C
d C
e C
1 a A
b A
c B
d C
e A
I want to do a frequency count by value so that the output looks like
Index1 Index2 P
p 0 A 2
B 1
C 2
1 A 0
B 3
C 2
q 0 A 0
B 1
C 4
1 A 3
B 1
C 1
You can apply value_counts to the Series groupby:
In [11]: s.groupby(level=[0, 1]).value_counts() # equiv .apply(pd.value_counts)
Out[11]:
Index1 Index2
p 0 C 2
A 2
B 1
1 B 3
A 2
2 A 3
B 1
C 1
q 0 A 3
B 1
C 1
1 B 2
C 2
A 1
2 C 3
B 1
A 1
r 0 A 3
B 1
C 1
1 B 3
C 2
2 B 3
C 1
A 1
dtype: int64
If you want to include the 0s (which the above won't), you could use crosstab:
In [21]: ct = pd.crosstab(index=[s.index.get_level_values(0), s.index.get_level_values(1)],
                          columns=s.values,
                          rownames=s.index.names[:2],
                          colnames=s.index.names[2:3])
In [22]: ct
Out[22]:
Index3 A B C
Index1 Index2
p 0 2 1 2
1 2 3 0
2 3 1 1
q 0 3 1 1
1 1 2 2
2 1 1 3
r 0 3 1 1
1 0 3 2
2 1 3 1
In [23]: ct.stack()
Out[23]:
Index1 Index2 Index3
p 0 A 2
B 1
C 2
1 A 2
B 3
C 0
2 A 3
B 1
C 1
q 0 A 3
B 1
C 1
1 A 1
B 2
C 2
2 A 1
B 1
C 3
r 0 A 3
B 1
C 1
1 A 0
B 3
C 2
2 A 1
B 3
C 1
dtype: int64
Which may be slightly faster...
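Another way to get the 0s is to stay with the first answer's groupby and round-trip through unstack(fill_value=0). A minimal sketch on a small deterministic stand-in for the question's random Series (the data below is my own, not the question's):

```python
import pandas as pd

# A tiny deterministic stand-in for the random Series in the question
idx = pd.MultiIndex.from_tuples(
    [('p', 0, 'a'), ('p', 0, 'b'), ('p', 0, 'c'),
     ('p', 1, 'a'), ('p', 1, 'b'), ('p', 1, 'c')],
    names=['Index1', 'Index2', 'Index3'])
s = pd.Series(['A', 'A', 'C', 'B', 'B', 'B'], index=idx, name='P')

# unstack(fill_value=0) materialises the missing categories as 0,
# then stack() returns to the long layout the question asked for
counts = (s.groupby(level=[0, 1]).value_counts()
            .unstack(fill_value=0)
            .stack())
print(counts)
```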
