Aggregate sets of Pandas DataFrame columns - python

I have a pandas DataFrame with some independent columns, and I'm looking for an efficient way to unwind / aggregate them.
So, let's say I have the table:
+-----+-----+-------+------+-------+
| One | Two | Three | Four | Count |
+-----+-----+-------+------+-------+
| a | x | y | y | 3 |
+-----+-----+-------+------+-------+
| b | z | x | x | 5 |
+-----+-----+-------+------+-------+
| c | y | x | y | 1 |
+-----+-----+-------+------+-------+
Where columns Two, Three and Four are independent.
I would like to end up with the table:
+-----+-------+-------+
| One | Other | Count |
+-----+-------+-------+
| a | x | 3 |
+-----+-------+-------+
| a | y | 6 |
+-----+-------+-------+
| b | x | 10 |
+-----+-------+-------+
| b | z | 5 |
+-----+-------+-------+
| c | x | 1 |
+-----+-------+-------+
| c | y | 2 |
+-----+-------+-------+
What would be the best way to achieve this?

You can use the melt function from pandas to reshape your data frame from wide to long format, then group by the One and Other columns and sum the Count column:
import pandas as pd

(pd.melt(df, id_vars=['One', 'Count'], value_name='Other')  # wide -> long
   .groupby(['One', 'Other'])['Count'].sum()                # sum Count per (One, Other) pair
   .reset_index())
  One Other  Count
0   a     x      3
1   a     y      6
2   b     x     10
3   b     z      5
4   c     x      1
5   c     y      2
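For a self-contained run, the frame from the question can be rebuilt like this (a minimal sketch; plain string columns are assumed):
import pandas as pd

# Rebuild the example frame from the question
df = pd.DataFrame({
    'One':   ['a', 'b', 'c'],
    'Two':   ['x', 'z', 'y'],
    'Three': ['y', 'x', 'x'],
    'Four':  ['y', 'x', 'y'],
    'Count': [3, 5, 1],
})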

Related

Use Python Pandas For Loop to create pivot tables for each column in dataframe

df1:
| A | B | C |ID |
|---|---|---|---|
| 1 | 5 | 2 | Y |
| 4 | 6 | 4 | Z |
df2:
| A | B | C |ID |
|---|---|---|---|
| 2 | 1 | 2 | Y |
| 4 | 6 | 4 | Z |
Merged:
| case | A | B | C |ID |
|------|---|---|---|---|
|before| 1 | 5 | 2 | Y |
|before| 4 | 6 | 4 | Z |
|after | 2 | 1 | 2 | Y |
|after | 4 | 6 | 4 | Z |
desired pivot for column A:
|ID |before|after|
|- |------|-----|
| Y | 1 | 2|
| Z | 4 | 4|
I want to use a for loop to create a pivot table for each column in df1 and df2. The rows of these pivots will be the ID and the columns will be 'case'.
I would like to create a new df for each column's pivot table using a for loop.
Later, I will drop the rows in each pivot table where the before and after values are the same (only keeping the rows where they are different).
I hope I've understood you correctly:
df1["case"] = "before"
df2["case"] = "after"
df = pd.concat([df1, df2]).pivot(index="ID", columns="case")
for c in df.columns.get_level_values(0).unique():
print(df.xs(c, axis=1))
print("-" * 80)
Prints:
case after before
ID
Y 2 1
Z 4 4
--------------------------------------------------------------------------------
case after before
ID
Y 1 5
Z 6 6
--------------------------------------------------------------------------------
case after before
ID
Y 2 2
Z 4 4
--------------------------------------------------------------------------------
EDIT: To add dataframes into a dictionary:
df1["case"] = "before"
df2["case"] = "after"
df = pd.concat([df1, df2]).pivot(index="ID", columns="case")
all_dataframes = {}
for c in df.columns.get_level_values(0).unique():
all_dataframes[c] = df.xs(c, axis=1).copy()
for key, dataframe in all_dataframes.items():
print("Key:", key)
print("DataFrame:", dataframe)
print("-" * 80)

How to group by certain column then take the count of multiple columns where it is not NA and add them in Pandas Python?

I would like to group by ID, count the values in A and B that are not NA, and then add the counts for A and B together. On top of that, what if I want to count only the y values in A?
+----+---+---+
| ID | A | B |
+----+---+---+
| 1 | x | x |
| 1 | x | x |
| 1 | y | |
| 2 | y | x |
| 2 | y | |
| 2 | y | x |
| 2 | x | x |
| 3 | x | x |
| 3 | | x |
| 3 | y | x |
+----+---+---+
+----+--------+
| ID | Output |
+----+--------+
| 1 | 3 |
| 2 | 6 |
| 3 | 4 |
+----+--------+
Here's a way to do it:
df = (df.groupby('ID')
        .agg(lambda x: sum(pd.notna(x)))  # non-NA count per column
        .sum(axis=1)                      # add the A and B counts
        .reset_index(name='Output'))
print(df)
   ID  Output
0   1     5.0
1   2     7.0
2   3     5.0
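Note that this counts every non-NA value (5, 7, 5), whereas the Output column in the question (3, 6, 4) corresponds to counting only the y values in A plus the non-NA values in B. A sketch for that variant, assuming the blank cells are NA:
out = (df.groupby('ID')
         .apply(lambda g: (g['A'] == 'y').sum() + g['B'].notna().sum())
         .reset_index(name='Output'))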

Pandas Groupby using multiple criteria on different axis

I have a df DataFrame like:
|   | A | B | A_ | B_ | COMMON |
|---|---|---|----|----|--------|
| 0 | 1 | 3 | 0  | 1  | a      |
| 1 | 8 | 5 | 4  | 0  | a      |
| 2 | 3 | 6 | 2  | 4  | b      |
| 3 | 9 | 9 | 1  | 7  | b      |
And I want to group each column X together with X_, for all letters A, B, ... (let's say the group is called X as well), and also group by COMMON. I would then like to apply a function like std() to all the grouped values.
So the result would look like:
| COMMON | A        | B        |
|--------|----------|----------|
| a      | std(...) | std(...) |
| b      | std(...) | std(...) |
I have been able to group by one criterion or the other, using df.groupby(['COMMON']) for the first and .groupby(mapping_function, axis=1) for the second, but how do I use them together?
Another alternative for an intermediate step would be to concatenate individual columns so that I would get:
|   | A | B | COMMON |
|---|---|---|--------|
| 0 | 1 | 3 | a      |
| 1 | 8 | 5 | a      |
| 2 | 3 | 6 | b      |
| 3 | 9 | 9 | b      |
| 0 | 0 | 1 | a      |
| 1 | 4 | 0 | a      |
| 2 | 2 | 4 | b      |
| 3 | 1 | 7 | b      |
But I also don't know how to do that.
Also as you might see, I don't really care about the index.
Thank you for your help!
You can first rename the columns to strip the trailing _ (stripping a handful of column labels is cheaper than stripping many values later), then reshape with melt and aggregate with pivot_table:
df = (df.rename(columns=lambda x: x.strip('_'))
        .melt('COMMON')
        .pivot_table(index='COMMON', columns='variable',
                     values='value', aggfunc='std'))
print(df)
variable A B
COMMON
a 3.593976 2.217356
b 3.593976 2.081666
IIUC
(df.melt('COMMON')
   .assign(variable=lambda x: x['variable'].str.rstrip('_'))
   .groupby(['COMMON', 'variable']).value.std()
   .unstack())
Out[18]:
variable A B
COMMON
a 3.593976 2.217356
b 3.593976 2.081666
Just groupby:
# each stacked index label is a tuple; its last element is the column name,
# and taking the first character maps both 'A' and 'A_' to 'A'
h = lambda x: x[-1][0]
df.set_index('COMMON', append=True).stack().groupby(['COMMON', h]).std().unstack()
A B
COMMON
a 3.593976 2.217356
b 3.593976 2.081666
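All three answers start from the frame in the question; for reference, it can be rebuilt like this (a minimal sketch with the values from the question):
import pandas as pd

df = pd.DataFrame({
    'A':      [1, 8, 3, 9],
    'B':      [3, 5, 6, 9],
    'A_':     [0, 4, 2, 1],
    'B_':     [1, 0, 4, 7],
    'COMMON': ['a', 'a', 'b', 'b'],
})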

How can I filter a pandas dataframe within a nested column set?

I have the following pandas dataframe:
+---+-------------+-------------+
|   |    Col1     |             |
+---+------+------+------+------+
|   | Sub1 | Sub2 | SubX | SubY |
+---+------+------+------+------+
| 0 | N    | A    | 1    | Z    |
| 1 | N    | B    | 1    | Z    |
| 2 | N    | C    | 2    | Z    |
| 3 | N    | D    | 2    | Z    |
| 4 | N    | E    | 3    | Z    |
| 5 | N    | F    | 3    | Z    |
| 6 | N    | G    | 4    | Z    |
| 7 | N    | H    | 4    | Z    |
+---+------+------+------+------+
I would like to filter the dataframe by column SubX, the selected rows should have the value 3, like this:
+---+-------------+-------------+
|   |    Col1     |             |
+---+------+------+------+------+
|   | Sub1 | Sub2 | SubX | SubY |
+---+------+------+------+------+
| 4 | N    | E    | 3    | Z    |
| 5 | N    | F    | 3    | Z    |
+---+------+------+------+------+
Could you help to find the right pandas query? It's pretty hard for me, because of the nested column structure. Thanks a lot!
I extended the MultiIndex hierarchy because it wasn't clear to me what the blank space should be.
df
Col1 Col2
Sub1 Sub2 SubX SubY
0 N A 1 Z
1 N B 1 Z
2 N C 2 Z
3 N D 2 Z
4 N E 3 Z
5 N F 3 Z
6 N G 4 Z
7 N H 4 Z
Now do the following:
df[df[('Col2', 'SubX')] == 3]
Output
Col1 Col2
Sub1 Sub2 SubX SubY
4 N E 3 Z
5 N F 3 Z
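To reproduce this setup, the two-level columns can be built explicitly (a sketch; the group name Col2 for the blank header is the assumption made above):
import pandas as pd

cols = pd.MultiIndex.from_tuples(
    [('Col1', 'Sub1'), ('Col1', 'Sub2'), ('Col2', 'SubX'), ('Col2', 'SubY')])
df = pd.DataFrame(
    [['N', s, x, 'Z'] for s, x in zip('ABCDEFGH', [1, 1, 2, 2, 3, 3, 4, 4])],
    columns=cols)
df[df[('Col2', 'SubX')] == 3]  # rows 4 and 5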

Retrieve the index of a cross section of a dataframe in pandas

Let's say we have a multi-index dataframe df like this:
| | | one | two
| A | B |-------|-------
|----|----|-------|-------
| x | p | 1 | 2
| x | q | 3 | 4
| y | p | 5 | 6
| y | q | 7 | 8
Now, I can get the index of cross section of the criteria one > 4 like this:
idx = df[df['one'] > 4].index
And then, using it with .ix:
df.ix[idx]
yields a slice of the frame:
| | | one | two
| A | B |-------|-------
|----|----|-------|-------
| y | p | 5 | 6
| y | q | 7 | 8
Now, I want to do the same but with a cross section by one level of the multi-index. The .xs method is useful in this manner:
df.xs('p', level='B')
returns:
| | one | two
| A |-------|-------
|----|-------|-------
| x | 1 | 2
| y | 5 | 6
But this dataframe has a different index structure, and its index is not a slice of the df index.
So, my question is: what should go in the place of idx so that the following expression
df.ix[idx]
yields
| | | one | two
| A | B |-------|-------
|----|----|-------|-------
| x | p | 1 | 2
| y | p | 5 | 6
You need to use the argument drop_level and set it to False to keep the index:
In [9]: df.xs('p', level='B', drop_level=False)
Out[9]:
one two
A B
x p 1 2
y p 5 6
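Note that .ix has since been removed from pandas; with a modern version, the same round trip works with .loc:
idx = df.xs('p', level='B', drop_level=False).index
df.loc[idx]  # same two rows, with both index levels intact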
