Pivoting a table with hierarchical index - python

This is a simple problem but for some reason I am not able to find an easy solution.
I have a hierarchically indexed Series, for example:
import itertools
import numpy as np
import pandas as pd

s = pd.Series(data=np.random.randint(0, 3, 45),
              index=pd.MultiIndex.from_tuples(list(itertools.product('pqr', [0, 1, 2], 'abcde')),
                                              names=['Index1', 'Index2', 'Index3']),
              name='P')
s = s.map({0: 'A', 1: 'B', 2: 'C'})
So it looks like
Index1  Index2  Index3
p       0       a         A
                b         A
                c         C
                d         B
                e         C
        1       a         B
                b         C
                c         C
                d         B
                e         B
q       0       a         B
                b         C
                c         C
                d         C
                e         C
        1       a         A
                b         A
                c         B
                d         C
                e         A
I want to do a frequency count by value so that the output looks like
Index1  Index2     P
p       0       A  2
                B  1
                C  2
        1       A  0
                B  3
                C  2
q       0       A  0
                B  1
                C  4
        1       A  3
                B  1
                C  1

You can apply value_counts to the Series after grouping by the first two index levels:
In [11]: s.groupby(level=[0, 1]).value_counts() # equiv .apply(pd.value_counts)
Out[11]:
Index1  Index2
p       0       C    2
                A    2
                B    1
        1       B    3
                A    2
        2       A    3
                B    1
                C    1
q       0       A    3
                B    1
                C    1
        1       B    2
                C    2
                A    1
        2       C    3
                B    1
                A    1
r       0       A    3
                B    1
                C    1
        1       B    3
                C    2
        2       B    3
                C    1
                A    1
dtype: int64
If you want to include the 0s (which the above won't), you can use crosstab:
In [21]: ct = pd.crosstab(index=[s.index.get_level_values(0), s.index.get_level_values(1)],
                          columns=s.values,
                          rownames=s.index.names[:2],
                          colnames=s.index.names[2:3])
In [22]: ct
Out[22]:
Index3          A  B  C
Index1  Index2
p       0       2  1  2
        1       2  3  0
        2       3  1  1
q       0       3  1  1
        1       1  2  2
        2       1  1  3
r       0       3  1  1
        1       0  3  2
        2       1  3  1
In [23]: ct.stack()
Out[23]:
Index1  Index2  Index3
p       0       A         2
                B         1
                C         2
        1       A         2
                B         3
                C         0
        2       A         3
                B         1
                C         1
q       0       A         3
                B         1
                C         1
        1       A         1
                B         2
                C         2
        2       A         1
                B         1
                C         3
r       0       A         3
                B         1
                C         1
        1       A         0
                B         3
                C         2
        2       A         1
                B         3
                C         1
dtype: int64
This may also be slightly faster.
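Alternatively (a sketch of my own, not from the answer above), the zeros can be recovered from the groupby result itself by unstacking with a fill value and stacking back:
# value_counts drops absent categories; unstack(fill_value=0) reinstates
# them as zero-filled columns, and stack() returns to the long format.
counts = (s.groupby(level=[0, 1])
           .value_counts()
           .unstack(fill_value=0)
           .stack())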

How to reshape a multi-column dataframe by index?

Following on from here. The solution there works for only one column. How can it be improved for multiple columns? I.e., if I have a dataframe like
df = pd.DataFrame([['a', 'b'], ['b', 'c'], ['c', 'z'], ['d', 'b']], index=[0, 0, 1, 1])
   0  1
0  a  b
0  b  c
1  c  z
1  d  b
how can I reshape it like
   0  1  2  3
0  a  b  b  c
1  c  z  d  b
If df is
   0  1
0  a  b
1  c  z
1  d  b
then
   0  1    2    3
0  a  b  NaN  NaN
1  c  z    d    b
Use flatten/ravel
In [4401]: df.groupby(level=0).apply(lambda x: pd.Series(x.values.flatten()))
Out[4401]:
   0  1  2  3
0  a  b  b  c
1  c  z  d  b
Or, stack
In [4413]: df.groupby(level=0).apply(lambda x: pd.Series(x.stack().values))
Out[4413]:
   0  1  2  3
0  a  b  b  c
1  c  z  d  b
Also, with unequal indices
In [4435]: df.groupby(level=0).apply(lambda x: x.values.ravel()).apply(pd.Series)
Out[4435]:
   0  1    2    3
0  a  b  NaN  NaN
1  c  z    d    b
Use groupby + pd.Series + np.reshape:
df.groupby(level=0).apply(lambda x: pd.Series(x.values.reshape(-1)))
   0  1  2  3
0  a  b  b  c
1  c  z  d  b
Solution for unequal number of indices - call the pd.DataFrame constructor instead.
df
   0  1
0  a  b
1  c  z
1  d  b
df.groupby(level=0).apply(
    lambda x: pd.DataFrame(x.values.reshape(1, -1))).reset_index(drop=True)
   0  1    2    3
0  a  b  NaN  NaN
1  c  z    d    b
pd.DataFrame({n: g.values.ravel() for n, g in df.groupby(level=0)}).T
   0  1  2  3
0  a  b  b  c
1  c  z  d  b
This is all over the place and I'm too tired to make it pretty
v = df.values
# position of each row within its own index group
cc = df.groupby(level=0).cumcount().values
# i0: group number for each row; r: the unique index values
i0, r = pd.factorize(df.index.values)
m = v.shape[1]
# target columns: row k of a group fills columns k*m .. k*m + m - 1
j = np.arange(r.size * m).reshape(-1, m)[cc].ravel()
i = i0.repeat(m)
# empty object array, wide enough for the largest group in this example
e = np.empty((r.size, m * r.size), dtype=object)
e[i, j] = v.ravel()
pd.DataFrame(e, r)
   0  1     2     3
0  a  b  None  None
1  c  z     d     b
Let's try
df1 = df.set_index(df.groupby(level=0).cumcount(), append=True).unstack()
df1.set_axis(range(len(df1.columns)), axis=1)
Output:
   0  1  2  3
0  a  b  b  c
1  c  d  z  b
Output for df with NaN:
   0     1  2     3
0  a  None  b  None
1  c     d  z     b
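Note that unstack orders the new columns original-column-first, which is why row 1 above reads c d z b rather than c z d b. If the row-major order of the earlier answers is wanted, a small sketch (reusing df1 from above) swaps and sorts the column levels before flattening:
# Reorder to (position, original column) so values keep their row order.
df1 = df.set_index(df.groupby(level=0).cumcount(), append=True).unstack()
df1 = df1.swaplevel(axis=1).sort_index(axis=1)
df1.columns = range(len(df1.columns))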

rename index using index and name column

I have the dataframe df
import numpy as np
import pandas as pd

b = np.array([0, 1, 2, 2, 0, 1, 2, 2, 3, 4, 4, 4, 5, 6, 0, 1, 0, 0]).reshape(-1, 1)
c = np.array(['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'c', 'c', 'd', 'e']).reshape(-1, 1)
df = pd.DataFrame(np.hstack([b, c]), columns=['Start', 'File'])
df
Out[22]:
   Start File
0      0    a
1      1    a
2      2    a
3      2    a
4      0    b
5      1    b
6      2    b
7      2    b
8      3    b
9      4    b
10     4    b
11     4    b
12     5    b
13     6    b
14     0    c
15     1    c
16     0    d
17     0    e
I would like to rename the index using index_File, in order to have 0_a, 1_a, ... 17_e as indices.
You can use set_index, with or without inplace=True:
df.set_index(df.File.radd(df.index.astype(str) + '_'))
      Start File
File
0_a       0    a
1_a       1    a
2_a       2    a
3_a       2    a
4_b       0    b
5_b       1    b
6_b       2    b
7_b       2    b
8_b       3    b
9_b       4    b
10_b      4    b
11_b      4    b
12_b      5    b
13_b      6    b
14_c      0    c
15_c      1    c
16_d      0    d
17_e      0    e
At the expense of a few more code characters, we can speed this up and take care of the unnecessary index name:
df.set_index(df.File.values.__radd__(df.index.astype(str) + '_'))
      Start File
0_a       0    a
1_a       1    a
2_a       2    a
3_a       2    a
4_b       0    b
5_b       1    b
6_b       2    b
7_b       2    b
8_b       3    b
9_b       4    b
10_b      4    b
11_b      4    b
12_b      5    b
13_b      6    b
14_c      0    c
15_c      1    c
16_d      0    d
17_e      0    e
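The same effect without calling the dunder method directly (a sketch, assuming pandas >= 0.24 for to_numpy): build the labels as a plain array, which carries no name into set_index:
# A plain numpy array of strings has no name, so the index stays unnamed.
labels = df.index.astype(str) + '_' + df['File'].to_numpy()
df.set_index(labels)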
You can directly assign to the index, first converting the default index to str using astype and then concatenating the strings as usual:
In[41]:
df.index = df.index.astype(str) + '_' + df['File']
df
Out[41]:
      Start File
File
0_a       0    a
1_a       1    a
2_a       2    a
3_a       2    a
4_b       0    b
5_b       1    b
6_b       2    b
7_b       2    b
8_b       3    b
9_b       4    b
10_b      4    b
11_b      4    b
12_b      5    b
13_b      6    b
14_c      0    c
15_c      1    c
16_d      0    d
17_e      0    e
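One cosmetic difference: direct assignment keeps File as the index name, as the output above shows. If that is unwanted, it can be cleared afterwards (my own addition):
df.index = df.index.astype(str) + '_' + df['File']
df.index.name = None  # drop the 'File' name inherited from the column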

Pandas count over groups

I have a pandas dataframe that looks as follows:
ID round player1 player2
1 1 A B
1 2 A C
1 3 B D
2 1 B C
2 2 C D
2 3 C E
3 1 B C
3 2 C D
3 3 C A
The dataframe contains sport match results, where the ID column denotes one tournament, the round column denotes the round for each tournament, and the player1 and player2 columns contain the names of the players that played against each other in the respective round.
I now want to cumulatively count the tournament participations for, say, player A. In pseudocode this means: If the player with name A comes up in either the player1 or player2 column per tournament ID, increment the counter by 1.
The result should look like this (note: in my example player A did participate in tournaments with the IDs 1 and 3):
ID round player1 player2 playerAparticipated
1 1 A B 1
1 2 A C 1
1 3 B D 1
2 1 B C 0
2 2 C D 0
2 3 C E 0
3 1 B C 2
3 2 C D 2
3 3 C A 2
My current status is that I added a "helper" column containing the values 1 or 0, denoting whether the respective player participated in the tournament:
ID round player1 player2 helper
1 1 A B 1
1 2 A C 1
1 3 B D 1
2 1 B C 0
2 2 C D 0
2 3 C E 0
3 1 B C 1
3 2 C D 1
3 3 C A 1
I think that I just need one final step, e.g., a smart use of cumsum() that counts the helper column in the desired way. However, I have not been able to come up with the solution yet.
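For reference, the helper column itself can be built like this (a sketch, assuming the player of interest is 'A'):
# True where A appears in either player column, broadcast to the whole ID.
played = df[['player1', 'player2']].eq('A').any(axis=1)
df['helper'] = played.groupby(df['ID']).transform('max').astype(int)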
I think you need to:
drop_duplicates by column ID first, and then set_index
filter out 0 values with boolean indexing, take the cumsum, and finally reindex to add 0 for the missing index values
create the new column with map
df1 = df.drop_duplicates('ID').set_index('ID')
s = df1.loc[df1['helper'] != 0, 'helper'].cumsum().reindex(index=df1.index, fill_value=0)
df['playerAparticipated'] = df['ID'].map(s)
print(df)
   ID  round player1 player2  helper  playerAparticipated
0   1      1       A       B       1                    1
1   1      2       A       C       1                    1
2   1      3       B       D       1                    1
3   2      1       B       C       0                    0
4   2      2       C       D       0                    0
5   2      3       C       E       0                    0
6   3      1       B       C       1                    2
7   3      2       C       D       1                    2
8   3      3       C       A       1                    2
Instead of map, it is possible to use join with rename:
df = df.join(s.rename('playerAparticipated'), on='ID')
print(df)
   ID  round player1 player2  helper  playerAparticipated
0   1      1       A       B       1                    1
1   1      2       A       C       1                    1
2   1      3       B       D       1                    1
3   2      1       B       C       0                    0
4   2      2       C       D       0                    0
5   2      3       C       E       0                    0
6   3      1       B       C       1                    2
7   3      2       C       D       1                    2
8   3      3       C       A       1                    2
A similar approach to @jezrael's that I cooked up a little slower :).
First, move ID into your index:
df = df.reset_index().set_index(['index', 'ID'])
#           round player1 player2  helper
# index ID
# 0     1       1       A       B       1
# 1     1       2       A       C       1
# 2     1       3       B       D       1
# 3     2       1       B       C       0
# 4     2       2       C       D       0
# 5     2       3       C       E       0
# 6     3       1       B       C       1
# 7     3       2       C       D       1
# 8     3       3       C       A       1
Next, filter out rows where helper is 0 and get a cumulative sum of tournaments by ID, and assign the result to a variable:
tournament_count = (df[df['helper'] > 0]
                    .groupby(['ID', 'helper']).first()
                    .reset_index(level=1)['helper']
                    .cumsum()
                    .rename("playerAparticipated"))
# ID
# 1    1
# 3    2
Finally, join tournament_count back onto df:
df.join(tournament_count, how="left").fillna(0)
#           round player1 player2  helper  playerAparticipated
# index ID
# 0     1       1       A       B       1                  1.0
# 1     1       2       A       C       1                  1.0
# 2     1       3       B       D       1                  1.0
# 3     2       1       B       C       0                  0.0
# 4     2       2       C       D       0                  0.0
# 5     2       3       C       E       0                  0.0
# 6     3       1       B       C       1                  2.0
# 7     3       2       C       D       1                  2.0
# 8     3       3       C       A       1                  2.0
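For comparison, a shorter sketch of my own (assuming the question's original flat df with the helper column): reduce helper to one flag per ID, number the participations cumulatively, and map the result back, masking the non-participation IDs:
# One participation flag per tournament ID.
flag = df.groupby('ID')['helper'].max()
# Running participation number where A played, 0 elsewhere.
df['playerAparticipated'] = df['ID'].map(flag.cumsum() * flag)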

Python - Divide the column into multiple columns using Split

I have a Pandas dataframe like this (each row in B is a string of values joined with the | symbol):
A B
a 1|2|3
b 2|4|5
c 3|2|5
I want to create columns which indicate whether each value is present in that row of column B:
A      B  1  2  3  4  5
a  1|2|3  1  1  1  0  0
b  2|4|5  0  1  0  1  1
c  3|2|5  0  1  1  0  1
I have tried this by looping over the column. But can it be done using a lambda or comprehensions?
You can try get_dummies:
print(df)
   A      B
0  a  1|2|3
1  b  2|4|5
2  c  3|2|5
print(df.B.str.get_dummies(sep='|'))
   1  2  3  4  5
0  1  1  1  0  0
1  0  1  0  1  1
2  0  1  1  0  1
And if you need the old column B, use join:
print(df.join(df.B.str.get_dummies(sep='|')))
   A      B  1  2  3  4  5
0  a  1|2|3  1  1  1  0  0
1  b  2|4|5  0  1  0  1  1
2  c  3|2|5  0  1  1  0  1
Hope this helps.
In [19]: df
Out[19]:
   A      B
0  a  1|2|3
1  b  2|4|5
2  c  3|2|5
In [20]: op = df.merge(df.B.apply(lambda s: pd.Series({col: 1 for col in s.split('|')})),
                       left_index=True, right_index=True).fillna(0)
In [21]: op
Out[21]:
   A      B  1  2  3  4  5
0  a  1|2|3  1  1  1  0  0
1  b  2|4|5  0  1  0  1  1
2  c  3|2|5  0  1  1  0  1
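Another route (a sketch of my own, assuming pandas >= 0.25 for explode): split B into lists, explode to one row per value, one-hot encode, and collapse back to one row per input row:
dummies = (df['B'].str.split('|')          # lists like ['1', '2', '3']
                  .explode()               # one row per value, index repeated
                  .pipe(pd.get_dummies)    # one-hot encode the values
                  .groupby(level=0).max()) # back to one row per input row
df.join(dummies)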

cumulative number of unique elements for pandas dataframe

I have a pandas dataframe:
id tag
1 A
1 A
1 B
1 C
1 A
2 B
2 C
2 B
I want to add a column which computes the cumulative number of unique tags at the id level. More specifically, I would like to have
id tag count
1 A 1
1 A 1
1 B 2
1 C 3
1 A 3
2 B 1
2 C 2
2 B 2
For a given id, count will be non-decreasing. Thanks for your help!
I think this does what you want:
unique_count = df.drop_duplicates().groupby('id').cumcount() + 1
unique_count.reindex(df.index).ffill()
The +1 is because the count starts at zero. This only works if the dataframe is sorted by id. Was that intended? You can always sort beforehand.
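Put together as an assignment (a sketch, assuming the default RangeIndex and a frame sorted by id, as noted):
unique_count = df.drop_duplicates().groupby('id').cumcount() + 1
# Rows dropped as duplicates come back as NaN after reindex; forward-filling
# is valid because a duplicate cannot raise the number of unique tags.
df['count'] = unique_count.reindex(df.index).ffill().astype(int)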
You can find some other approaches in R and Python here
df = pd.DataFrame({'id': [1, 1, 1, 1, 1, 2, 2, 2], 'tag': ["A", "A", "B", "C", "A", "B", "C", "B"]})
df['count'] = df.groupby('id')['tag'].transform(lambda x: (~x.duplicated()).cumsum())
   id tag  count
0   1   A      1
1   1   A      1
2   1   B      2
3   1   C      3
4   1   A      3
5   2   B      1
6   2   C      2
7   2   B      2
How about this:
d['X'] = 1
d.groupby("Col").X.cumsum()
Note that this is a cumulative count of all rows per group, not of unique tags, so duplicates would have to be dropped first.
idt = [1, 1, 1, 1, 1, 2, 2, 2]
tag = ['A', 'A', 'B', 'C', 'A', 'B', 'C', 'B']
df = pd.DataFrame(tag, index=idt, columns=['tag'])
df = df.reset_index()
print(df)
   index tag
0      1   A
1      1   A
2      1   B
3      1   C
4      1   A
5      2   B
6      2   C
7      2   B
# running occurrence number of each (index, tag) pair
df['uCnt'] = df.groupby(['index', 'tag']).cumcount() + 1
print(df)
   index tag  uCnt
0      1   A     1
1      1   A     2
2      1   B     1
3      1   C     1
4      1   A     3
5      2   B     1
6      2   C     1
7      2   B     2
# integer trick: 1 // 1**2 = 1, n // n**2 = 0 for n > 1,
# i.e. keep a 1 only for the first occurrence of each pair
df['uCnt'] = df['uCnt'] // df['uCnt']**2
print(df)
   index tag  uCnt
0      1   A     1
1      1   A     0
2      1   B     1
3      1   C     1
4      1   A     0
5      2   B     1
6      2   C     1
7      2   B     0
# cumulative sum of the first-occurrence flags per id
df['uCnt'] = df.groupby(['index'])['uCnt'].cumsum()
print(df)
   index tag  uCnt
0      1   A     1
1      1   A     1
2      1   B     2
3      1   C     3
4      1   A     3
5      2   B     1
6      2   C     2
7      2   B     2
df = df.set_index('index')
print(df)
      tag  uCnt
index
1       A     1
1       A     1
1       B     2
1       C     3
1       A     3
2       B     1
2       C     2
2       B     2
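All of this condenses to a one-liner (my own sketch of the same first-occurrence idea, assuming the question's df with columns id and tag):
# Flag the first appearance of each (id, tag) pair, then cumsum per id.
df['count'] = (~df.duplicated(['id', 'tag'])).groupby(df['id']).cumsum()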
