Pandas Groupby using multiple criteria on different axis - python

I have a DataFrame df like:

|   | A | B | A_ | B_ | COMMON |
|---|---|---|----|----|--------|
| 0 | 1 | 3 | 0  | 1  | a      |
| 1 | 8 | 5 | 4  | 0  | a      |
| 2 | 3 | 6 | 2  | 4  | b      |
| 3 | 9 | 9 | 1  | 7  | b      |
And I want to group every column X together with its counterpart X_, for all letters A, B, ... (let's say the resulting group is called X as well), and also group by COMMON. I would then like to apply a function like std() to all the grouped values.
So the result would look like:

| COMMON | A        | B        |
|--------|----------|----------|
| a      | std(...) | std(...) |
| b      | std(...) | std(...) |
I have been able to group by either one criterion or the other, using df.groupby(['COMMON']) for the first and .groupby(mapping_function, axis=1) for the second, but how do I use them together?
Another alternative for an intermediate step would be to concatenate individual columns so that I would get:
|   | A | B | COMMON |
|---|---|---|--------|
| 0 | 1 | 3 | a      |
| 1 | 8 | 5 | a      |
| 2 | 3 | 6 | b      |
| 3 | 9 | 9 | b      |
| 0 | 0 | 1 | a      |
| 1 | 4 | 0 | a      |
| 2 | 2 | 4 | b      |
| 3 | 1 | 7 | b      |
But I also don't know how to do that.
Also as you might see, I don't really care about the index.
Thank you for your help!
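For reference, a minimal construction of the example df (assuming import pandas as pd; values taken from the table above):

import pandas as pd

df = pd.DataFrame({
    'A': [1, 8, 3, 9],
    'B': [3, 5, 6, 9],
    'A_': [0, 4, 2, 1],
    'B_': [1, 0, 4, 7],
    'COMMON': ['a', 'a', 'b', 'b'],
})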

You can first rename the columns to strip the trailing _ (renaming is faster here, because you strip only a few column names rather than many values), then reshape with melt and aggregate with pivot_table:
df = (df.rename(columns=lambda x: x.strip('_'))
        .melt('COMMON')
        .pivot_table(index='COMMON', columns='variable', values='value', aggfunc='std'))
print(df)

variable         A         B
COMMON
a         3.593976  2.217356
b         3.593976  2.081666

IIUC (if I understand correctly):
(df.melt('COMMON')
   .assign(variable=lambda x: x['variable'].str.rstrip('_'))
   .groupby(['COMMON', 'variable']).value.std()
   .unstack())

Out[18]:
variable         A         B
COMMON
a         3.593976  2.217356
b         3.593976  2.081666

Just groupby:
# h maps each stacked index tuple to the first character of the column name,
# so 'A' and 'A_' end up in the same group
h = lambda x: x[-1][0]
df.set_index('COMMON', append=True).stack().groupby(['COMMON', h]).std().unstack()

               A         B
COMMON
a       3.593976  2.217356
b       3.593976  2.081666
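Regarding the intermediate concatenation step the question asks about, a minimal sketch (column names taken from the example; a plain groupby then finishes the job):

part1 = df[['A', 'B', 'COMMON']]
part2 = df[['A_', 'B_', 'COMMON']].rename(columns={'A_': 'A', 'B_': 'B'})
stacked = pd.concat([part1, part2])  # the intermediate table from the question
stacked.groupby('COMMON').std()      # same std values as the answers above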

Related

Use Python Pandas For Loop to create pivot tables for each column in dataframe

df1:
| A | B | C | ID |
|---|---|---|----|
| 1 | 5 | 2 | Y  |
| 4 | 6 | 4 | Z  |

df2:
| A | B | C | ID |
|---|---|---|----|
| 2 | 1 | 2 | Y  |
| 4 | 6 | 4 | Z  |

Merged:
| case   | A | B | C | ID |
|--------|---|---|---|----|
| before | 1 | 5 | 2 | Y  |
| before | 4 | 6 | 4 | Z  |
| after  | 2 | 1 | 2 | Y  |
| after  | 4 | 6 | 4 | Z  |

desired pivot for column A:
| ID | before | after |
|----|--------|-------|
| Y  | 1      | 2     |
| Z  | 4      | 4     |
I want to use a for loop to create a pivot table for each column in df1 and df2. The rows of these pivots will be the ID, the columns will be 'case'.
I would like to create a new df for each column's pivot table using a for loop.
Later, I will drop the rows in each pivot table where the before and after values are the same (only keeping the rows where they are different).
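A minimal construction of the two frames (assuming import pandas as pd; values from the tables above):

import pandas as pd

df1 = pd.DataFrame({'A': [1, 4], 'B': [5, 6], 'C': [2, 4], 'ID': ['Y', 'Z']})
df2 = pd.DataFrame({'A': [2, 4], 'B': [1, 6], 'C': [2, 4], 'ID': ['Y', 'Z']})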
I hope I've understood you correctly:
df1["case"] = "before"
df2["case"] = "after"
df = pd.concat([df1, df2]).pivot(index="ID", columns="case")
for c in df.columns.get_level_values(0).unique():
print(df.xs(c, axis=1))
print("-" * 80)
Prints:
case after before
ID
Y 2 1
Z 4 4
--------------------------------------------------------------------------------
case after before
ID
Y 1 5
Z 6 6
--------------------------------------------------------------------------------
case after before
ID
Y 2 2
Z 4 4
--------------------------------------------------------------------------------
EDIT: To add the dataframes into a dictionary:

df1["case"] = "before"
df2["case"] = "after"

df = pd.concat([df1, df2]).pivot(index="ID", columns="case")

all_dataframes = {}
for c in df.columns.get_level_values(0).unique():
    all_dataframes[c] = df.xs(c, axis=1).copy()

for key, dataframe in all_dataframes.items():
    print("Key:", key)
    print("DataFrame:", dataframe)
    print("-" * 80)

Create a column from a choice of other columns using IF style statement

Given the following table:
+---------+---------+-------------+
| field_a | field_b | which_field |
+---------+---------+-------------+
|       1 |       2 | a           |
|       1 |       2 | b           |
|       3 |       4 | a           |
|       3 |       4 | b           |
+---------+---------+-------------+
I'd like to create a column called output where the value for each row is taken from either field_a or field_b based upon the value in which_field. So the resulting table would look like this:
+---------+---------+-------------+--------+
| field_a | field_b | which_field | output |
+---------+---------+-------------+--------+
|       1 |       2 | a           |      1 |
|       1 |       2 | b           |      2 |
|       3 |       4 | a           |      3 |
|       3 |       4 | b           |      4 |
+---------+---------+-------------+--------+
I've reviewed a number of examples using loc and np.where, but these only seem to be able to handle assigning a fixed value rather than a value taken from a choice of columns.
This is an MRE (minimal reproducible example) - in reality there could be multiple which_field fields, so it would be great to get an answer that can cope with multiple conditions.
Thanks in advance!
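A minimal construction of the table (assuming import pandas as pd):

import pandas as pd

df = pd.DataFrame({
    'field_a': [1, 1, 3, 3],
    'field_b': [2, 2, 4, 4],
    'which_field': ['a', 'b', 'a', 'b'],
})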
Use DataFrame.melt with DataFrame.loc:
# long format: one row per (which_field, variable) pair, original index kept
df1 = df.melt('which_field', ignore_index=False)
# keep only the rows where the melted column name matches which_field,
# then align the values back to df via the preserved index
df['output'] = df1.loc[('field_' + df1['which_field']).eq(df1['variable']), 'value']
print(df)

   field_a  field_b which_field  output
0        1        2           a       1
1        1        2           b       2
2        3        4           a       3
3        3        4           b       4
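For exactly two candidate columns, numpy.where (which the question mentions) also works; a minimal sketch, not part of the answer above:

import numpy as np

df['output'] = np.where(df['which_field'].eq('a'), df['field_a'], df['field_b'])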

Pandas DataFrame: replace one column with another DataFrame

I have two pandas DataFrames (Python 3):

df1:
| A | B | C |
|---|---|---|
| 1 | 2 | 3 |
| 4 | 5 | 6 |
| ......... |

df2:
| D | E | F |
|---|---|---|
| 3 | 4 | 6 |
| 8 | 7 | 9 |
| ......... |
I want to get this expected result:
| A | D | E | F | C |
|---|---|---|---|---|
| 1 | 3 | 4 | 6 | 3 |
| 4 | 8 | 7 | 9 | 6 |
| ................. |
That is, df1['B'] is replaced by the columns of df2.
I have tried
pd.concat([df1, df2], axis=1, sort=False)
and then dropping column 'B', but it doesn't seem to be very efficient.
Could it be solved by using insert() or another method?
I think your method is good; you can also remove the column before the concat:
pd.concat([df1.drop('B', axis=1), df2], axis=1, sort=False)
Another method with DataFrame.join:
df1.drop('B', axis=1).join(df2)
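Since the question asks about insert(), a minimal sketch of that route (the positions assume the column layout above):

out = df1.drop(columns='B')
for i, col in enumerate(df2.columns):
    # place D, E, F where B used to be, i.e. between A and C
    out.insert(1 + i, col, df2[col])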

Intersect two dataframes in Pandas with respect to first dataframe?

I want to intersect two Pandas dataframes (1 and 2) based on two columns (A and B) present in both dataframes. However, I would like to return a dataframe that only has data with respect to the data in the first dataframe, omitting anything that is not found in the second dataframe.
So for example:
Dataframe 1:
A | B | Extra | Columns | In | 1 |
----------------------------------
1 | 2 | Extra | Columns | In | 1 |
1 | 3 | Extra | Columns | In | 1 |
1 | 5 | Extra | Columns | In | 1 |
Dataframe 2:
A | B | Extra | Columns | In | 2 |
----------------------------------
1 | 3 | Extra | Columns | In | 2 |
1 | 4 | Extra | Columns | In | 2 |
1 | 5 | Extra | Columns | In | 2 |
should return:
A | B | Extra | Columns | In | 1 |
----------------------------------
1 | 3 | Extra | Columns | In | 1 |
1 | 5 | Extra | Columns | In | 1 |
Is there a way I can do this simply?
You can use df.merge:
df = df1.merge(df2, on=['A','B'], how='inner').drop('2', axis=1)
how='inner' is the default; it is written out here just to make it clear how df.merge works.
As #piRSquared suggested, you can do:
df1.merge(df2[['A', 'B']], how='inner')
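An index-based alternative (a sketch; it avoids merge suffixes on the extra columns and leaves df1's rows untouched):

mask = df1.set_index(['A', 'B']).index.isin(df2.set_index(['A', 'B']).index)
df1[mask]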

Retrieve the index of a cross section of a dataframe in pandas

Let's say we have a multi-index dataframe df like this:
| A | B | one | two |
|---|---|-----|-----|
| x | p | 1   | 2   |
| x | q | 3   | 4   |
| y | p | 5   | 6   |
| y | q | 7   | 8   |
Now, I can get the index of the cross section matching the criterion one > 4 like this:
idx = df[df['one'] > 4].index
And then, using it with .ix:
df.ix[idx]
yields a slice of the frame:
| A | B | one | two |
|---|---|-----|-----|
| y | p | 5   | 6   |
| y | q | 7   | 8   |
Now, I want to do the same but with a cross section by one level of the multi-index. The .xs method is useful in this manner:
df.xs('p', level='B')
returns:
| A | one | two |
|---|-----|-----|
| x | 1   | 2   |
| y | 5   | 6   |
But this dataframe has a different index structure, and its index is not a slice of df's index.
So, my question is: what should go in the place of idx, so that the following expression
df.ix[idx]
to yield
| A | B | one | two |
|---|---|-----|-----|
| x | p | 1   | 2   |
| y | p | 5   | 6   |
You need to use the argument drop_level and set it to False to keep the index:
In [9]: df.xs('p', level='B', drop_level=False)
Out[9]:
     one  two
A B
x p    1    2
y p    5    6
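So the expression to use in place of idx is the index of that cross section; a minimal sketch (using .loc, the modern replacement for the now-removed .ix):

idx = df.xs('p', level='B', drop_level=False).index
df.loc[idx]  # same rows, original two-level MultiIndex preserved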
