How to delete "heading above headings" in pandas dataframe

If I've got a multi-level column index:
>>> cols = pd.MultiIndex.from_tuples([("a", "b"), ("a", "c")])
>>> pd.DataFrame([[1,2], [3,4]], columns=cols)
a
---+--
b | c
--+---+--
0 | 1 | 2
1 | 3 | 4
How can I drop the "a" level of that index, so I end up with:
b | c
--+---+--
0 | 1 | 2
1 | 3 | 4

You can use MultiIndex.droplevel:
>>> cols = pd.MultiIndex.from_tuples([("a", "b"), ("a", "c")])
>>> df = pd.DataFrame([[1,2], [3,4]], columns=cols)
>>> df
a
b c
0 1 2
1 3 4
[2 rows x 2 columns]
>>> df.columns = df.columns.droplevel()
>>> df
b c
0 1 2
1 3 4
[2 rows x 2 columns]

As of Pandas 0.24.0, we can now use DataFrame.droplevel():
cols = pd.MultiIndex.from_tuples([("a", "b"), ("a", "c")])
df = pd.DataFrame([[1,2], [3,4]], columns=cols)
df.droplevel(0, axis=1)
# b c
#0 1 2
#1 3 4
This is very useful if you want to keep your DataFrame method-chain rolling.
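To illustrate the method-chaining point, here is a minimal sketch. The "total" column is a made-up name for the example, not something from the question.

```python
import pandas as pd

cols = pd.MultiIndex.from_tuples([("a", "b"), ("a", "c")])
df = pd.DataFrame([[1, 2], [3, 4]], columns=cols)

# Drop the top column level and keep chaining, with no intermediate assignment.
result = (
    df.droplevel(0, axis=1)
      .assign(total=lambda d: d["b"] + d["c"])  # "total" is a hypothetical column
)
print(result)
```

Because droplevel returns a new DataFrame, the original df keeps its two-level columns.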

Another way to drop the top level is to use a list comprehension:
df.columns = [col[1] for col in df.columns]
b c
0 1 2
1 3 4
This strategy is also useful if you want to combine the names from both levels like in the example below where the bottom level contains two 'y's:
cols = pd.MultiIndex.from_tuples([("A", "x"), ("A", "y"), ("B", "y")])
df = pd.DataFrame([[1,2, 8 ], [3,4, 9]], columns=cols)
A B
x y y
0 1 2 8
1 3 4 9
Dropping the top level would leave two columns named 'y'. That can be avoided by joining the names with the list comprehension.
df.columns = ['_'.join(col) for col in df.columns]
A_x A_y B_y
0 1 2 8
1 3 4 9
That's a problem I had after doing a groupby and it took a while to find this other question that solved it. I adapted that solution to the specific case here.
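For the groupby case mentioned above, a sketch of how the two-level columns typically arise and how the join flattens them. The frame and its column names ("key", "val", "other") are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "key": ["x", "x", "y"],
    "val": [1, 2, 3],
    "other": [10, 20, 30],
})

# Aggregating with several functions yields MultiIndex columns:
# ('val', 'min'), ('val', 'max'), ('other', 'sum')
agg = df.groupby("key").agg({"val": ["min", "max"], "other": ["sum"]})

# Join both levels into a single flat name per column.
agg.columns = ["_".join(col) for col in agg.columns]
print(agg)
```

Note that '_'.join only works when every level value is a string; with numeric labels you would need something like "_".join(map(str, col)).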

Another way to do this is to reassign df based on a cross section of df, using the .xs method.
>>> df
a
b c
0 1 2
1 3 4
>>> df = df.xs('a', axis=1, drop_level=True)
# 'a' : key on which to get cross section
# axis=1 : get cross section of column
# drop_level=True : returns cross section without the multilevel index
>>> df
b c
0 1 2
1 3 4

A small trick using sum with level=1 (works only when the level-1 labels are all unique, and only on pandas versions that still accept the level argument):
df.sum(level=1,axis=1)
Out[202]:
b c
0 1 2
1 3 4
A more common solution is get_level_values:
df.columns=df.columns.get_level_values(1)
df
Out[206]:
b c
0 1 2
1 3 4
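As far as I know, the level argument used in sum(level=1, axis=1) was deprecated in pandas 1.3 and removed in 2.0. A rough equivalent that should also run on newer versions is to transpose, group on the level, and transpose back. This is only a sketch of one possible replacement.

```python
import pandas as pd

cols = pd.MultiIndex.from_tuples([("a", "b"), ("a", "c")])
df = pd.DataFrame([[1, 2], [3, 4]], columns=cols)

# Transpose so the column levels become row levels, group on level 1,
# sum each group, then transpose back to the original orientation.
result = df.T.groupby(level=1).sum().T
print(result)
```

As with the original trick, columns that share a level-1 label are summed together, so this only "drops" the top level when those labels are unique.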

You could also achieve that by renaming the columns:
df.columns = ['b', 'c']
This involves a manual step, but it could be an option, especially if you were going to rename the columns anyway.

I struggled with this problem because I could not work out why droplevel() did not work for me. After trying several approaches, I learned that in a table like yours 'a' can be the name of the columns index rather than a level, with 'b' and 'c' as its labels. In that case this will help:
df.columns.name = None
df = df.reset_index()  # turn the index into a regular column
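To show how a "heading above headings" can be a columns name rather than a second level, here is a small sketch using pivot. The long_df frame and its column names are assumptions made up for the example.

```python
import pandas as pd

long_df = pd.DataFrame({
    "row": [0, 0, 1, 1],
    "a": ["b", "c", "b", "c"],
    "val": [1, 2, 3, 4],
})

# pivot names the resulting columns index after the pivoted column ("a").
wide = long_df.pivot(index="row", columns="a", values="val")
print(wide.columns.nlevels)  # a single level, so there is nothing for droplevel to drop
print(wide.columns.name)

# Clear the name and move the row index back into a regular column.
flat = wide.rename_axis(columns=None).reset_index()
print(flat)
```

Checking df.columns.nlevels first tells you whether you are dealing with a real MultiIndex or just a named single-level index.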

Related

How do I replace pandas rows with values of another dataframe for all instances of the value in the first df?

I have two dataframes:
df1=
A B C
a 1 3
b 2 3
c 2 2
a 1 4
df2=
A B C
a 1 3.5
Now I need to replace all occurrences of a in df1 (2 in this case) with a in df2, leaving b and c unchanged. The final dataframe should be:
df_final=
A B C
b 2 3
c 2 2
a 1 3.5
Do you mean:
df_final = pd.concat((df1[df1['A'].ne('a')], df2))
Or if you have several values like a:
list_special = ['a']
df_final = pd.concat((df1[~df1['A'].isin(list_special)], df2))
If df2 just has the average of duplicated values, you can do df1.groupby(["A", "B"]).mean().reset_index()
Otherwise, you can do something like this:
In [27]: df = df1.groupby(["A", "B"]).first().merge(df2, how="left", on=["A", "B"])
...: df["C"] = df["C_y"].fillna(df["C_x"])
...: df = df[["A", "B", "C"]]
...: df
Out[27]:
A B C
0 a 1 3.5
1 b 2 3.0
2 c 2 2.0
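A self-contained sketch of the concat approach, with the special keys taken from df2 itself instead of a hand-written list. This assumes column A is the key that identifies the rows to replace.

```python
import pandas as pd

df1 = pd.DataFrame({
    "A": ["a", "b", "c", "a"],
    "B": [1, 2, 2, 1],
    "C": [3, 3, 2, 4],
})
df2 = pd.DataFrame({"A": ["a"], "B": [1], "C": [3.5]})

# Keep only the rows of df1 whose key does not appear in df2,
# then append the replacement rows from df2.
keep = ~df1["A"].isin(df2["A"])
df_final = pd.concat([df1[keep], df2], ignore_index=True)
print(df_final)
```

The replaced rows end up at the bottom, which matches the df_final shown in the question.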

Replace column of pandas multi-index DataFrame with another DataFrame

I have a pandas DataFrame like this:
import pandas as pd
import numpy as np
data1 = np.repeat(np.array(range(3), ndmin=2), 3, axis=0)
columns1 = pd.MultiIndex.from_tuples([('foo', 'a'), ('foo', 'b'), ('bar', 'c')])
df1 = pd.DataFrame(data1, columns=columns1)
print(df1)
foo bar
a b c
0 0 1 2
1 0 1 2
2 0 1 2
And another one like this:
data2 = np.repeat(np.array(range(3, 5), ndmin=2), 3, axis=0)
columns2 = ['d', 'e']
df2 = pd.DataFrame(data2, columns=columns2)
print(df2)
d e
0 3 4
1 3 4
2 3 4
Now, I would like to replace 'bar' of df1 with df2, but the regular syntax of single-level indexing doesn't seem to work:
df1['bar'] = df2
print(df1)
foo bar
a b c
0 0 1 NaN
1 0 1 NaN
2 0 1 NaN
When what I would like to get is:
foo bar
a b d e
0 0 1 3 4
1 0 1 3 4
2 0 1 3 4
I'm not sure if I'm missing something on the syntax or if this is related to the issues described here and here. Could someone explain why this doesn't work and how to get the desired outcome?
I'm using python 2.7 and pandas 0.24, if it makes a difference.
For lack of better alternative, I'm currently doing this:
df2.columns = pd.MultiIndex.from_product([['bar'], df2.columns])
df1.drop(columns='bar', level=0, inplace=True)
df1 = df1.join(df2)
Which gives the desired result. One needs to be cautious though if the order of columns is important, as this approach will likely change it.
Reading further the mentioned issues on Github, I think the reason the approach in the question doesn't work is indeed related to an inconsistency in the pandas API that hasn't been fixed yet.
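The workaround can also be written without mutating df1 or df2 in place. This is a sketch of the same idea using a boolean column mask and pd.concat; the names new_bar and others are just for illustration.

```python
import numpy as np
import pandas as pd

data1 = np.repeat(np.array(range(3), ndmin=2), 3, axis=0)
columns1 = pd.MultiIndex.from_tuples([("foo", "a"), ("foo", "b"), ("bar", "c")])
df1 = pd.DataFrame(data1, columns=columns1)

data2 = np.repeat(np.array(range(3, 5), ndmin=2), 3, axis=0)
df2 = pd.DataFrame(data2, columns=["d", "e"])

# Give df2 a matching top level ("bar") without modifying df2 itself.
new_bar = df2.set_axis(pd.MultiIndex.from_product([["bar"], df2.columns]), axis=1)

# Keep every column whose top level is not "bar", then attach the replacement.
others = df1.loc[:, df1.columns.get_level_values(0) != "bar"]
result = pd.concat([others, new_bar], axis=1)
print(result)
```

As with the join version, the new 'bar' group is appended at the end, so the column order changes if 'bar' was not already last.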

How to simply add a column level to a pandas dataframe

Let's say I have a dataframe that looks like this:
df = pd.DataFrame(index=list('abcde'), data={'A': range(5), 'B': range(5)})
df
Out[92]:
A B
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
Assuming that this dataframe already exists, how can I simply add a level 'C' to the column index so I get this:
df
Out[92]:
A B
C C
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
I saw SO answers like "python/pandas: how to combine two dataframes into one with hierarchical column index?", but that concatenates different dataframes instead of adding a column level to an already existing dataframe.
-
As suggested by @StevenG himself, a better answer:
df.columns = pd.MultiIndex.from_product([df.columns, ['C']])
print(df)
# A B
# C C
# a 0 0
# b 1 1
# c 2 2
# d 3 3
# e 4 4
option 1
set_index and T
df.T.set_index(np.repeat('C', df.shape[1]), append=True).T
option 2
pd.concat, keys, and swaplevel
pd.concat([df], axis=1, keys=['C']).swaplevel(0, 1, 1)
A solution which adds a name to the new level and is easier on the eyes than other answers already presented:
df['newlevel'] = 'C'
df = df.set_index('newlevel', append=True).unstack('newlevel')
print(df)
# A B
# newlevel C C
# a 0 0
# b 1 1
# c 2 2
# d 3 3
# e 4 4
You could just assign the columns like:
>>> df.columns = [df.columns, ['C', 'C']]
>>> df
A B
C C
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
>>>
Or for unknown length of columns:
>>> df.columns = [df.columns.get_level_values(0), np.repeat('C', df.shape[1])]
>>> df
A B
C C
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
>>>
Another way for a MultiIndex (appending 'E'):
df.columns = pd.MultiIndex.from_tuples(map(lambda x: (x[0], 'E', x[1]), df.columns))
A B
E E
C D
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
I like it explicit (using MultiIndex) and chain-friendly (.set_axis):
df.set_axis(pd.MultiIndex.from_product([df.columns, ['C']]), axis=1)
This is particularly convenient when merging DataFrames with different column level numbers, where Pandas (1.4.2) raises a FutureWarning (FutureWarning: merging between different levels is deprecated and will be removed ... ):
import pandas as pd
df1 = pd.DataFrame(index=list('abcde'), data={'A': range(5), 'B': range(5)})
df2 = pd.DataFrame(index=list('abcde'), data=range(10, 15), columns=pd.MultiIndex.from_tuples([("C", "x")]))
# df1:
A B
a 0 0
b 1 1
# df2:
C
x
a 10
b 11
# merge while giving df1 another column level:
pd.merge(df1.set_axis(pd.MultiIndex.from_product([df1.columns, ['']]), axis=1),
df2,
left_index=True, right_index=True)
# result:
A B C
x
a 0 0 10
b 1 1 11
Another method, but using a list comprehension of tuples as the arg to pandas.MultiIndex.from_tuples():
df.columns = pd.MultiIndex.from_tuples([(col, 'C') for col in df.columns])
df
A B
C C
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
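The approaches above can be wrapped in a small helper. This is a sketch only: add_column_level is a hypothetical function name, and it assumes the frame starts with single-level columns.

```python
import pandas as pd

def add_column_level(df, label, name=None):
    """Return a new DataFrame with `label` added as an innermost column level.

    Hypothetical helper; assumes df.columns is a plain (single-level) Index.
    """
    new_cols = pd.MultiIndex.from_arrays(
        [df.columns, [label] * len(df.columns)],
        names=[df.columns.name, name],
    )
    return df.set_axis(new_cols, axis=1)

df = pd.DataFrame(index=list("abcde"), data={"A": range(5), "B": range(5)})
out = add_column_level(df, "C", name="newlevel")
print(out)
```

Since set_axis returns a new object, df itself keeps its original columns and the helper can be used in a method chain via df.pipe(add_column_level, "C").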

Pandas: set the value of a column in a row to be the value stored in a different df at the index of its other rows

>>> df
0 1
0 0 0
1 1 1
2 2 1
>>> df1
0 1 2
0 A B C
1 D E F
>>> crazy_magic()
>>> df
0 1 3
0 0 0 A #df1[0][0]
1 1 1 E #df1[1][1]
2 2 1 F #df1[2][1]
Is there a way to achieve this without for?
import pandas as pd
df = pd.DataFrame([[0,0],[1,1],[2,1]])
df1 = pd.DataFrame([['A', 'B', 'C'],['D', 'E', 'F']])
df2 = df1.reset_index(drop=False)
# index 0 1 2
# 0 0 A B C
# 1 1 D E F
df3 = pd.melt(df2, id_vars=['index'])
# index variable value
# 0 0 0 A
# 1 1 0 D
# 2 0 1 B
# 3 1 1 E
# 4 0 2 C
# 5 1 2 F
result = pd.merge(df, df3, left_on=[0,1], right_on=['variable', 'index'])
result = result[[0, 1, 'value']]
print(result)
yields
0 1 value
0 0 0 A
1 1 1 E
2 2 1 F
My reasoning goes as follows:
We want to use two columns of df as coordinates.
The word "coordinates" reminds me of pivot, since
if you have two columns whose values represent "coordinates" and a third
column representing values, and you want to convert that to a grid, then
pivot is the tool to use.
But df does not have a third column of values. The values are in df1. In fact df1 looks like the result of a pivot operation. So instead of pivoting df, we want to unpivot df1.
pd.melt is the function to use when you want to unpivot.
So I tried melting df1. Comparison with other uses of pd.melt led me to conclude df1 needed the index as a column. That's the reason for defining df2. So we melt df2.
Once you get that far, visually comparing df3 to df leads you naturally to the use of pd.merge.
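If the values in df are plain positions into df1 (as they are here, with default integer labels), the same lookup can be sketched with NumPy integer-array indexing instead of melt and merge. That assumption is mine; with non-default labels the melt/merge route is safer.

```python
import pandas as pd

df = pd.DataFrame([[0, 0], [1, 1], [2, 1]])
df1 = pd.DataFrame([["A", "B", "C"], ["D", "E", "F"]])

# Column 0 of df holds df1's column position, column 1 holds df1's row position.
rows = df[1].to_numpy()
cols = df[0].to_numpy()

# Pair up rows[i] and cols[i] to pick one element of df1 per row of df.
df["value"] = df1.to_numpy()[rows, cols]
print(df)
```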

Find the max of two or more columns with pandas

I have a dataframe with columns A,B. I need to create a column C such that for every record / row:
C = max(A, B).
How should I go about doing this?
You can get the maximum like this:
>>> import pandas as pd
>>> df = pd.DataFrame({"A": [1,2,3], "B": [-2, 8, 1]})
>>> df
A B
0 1 -2
1 2 8
2 3 1
>>> df[["A", "B"]]
A B
0 1 -2
1 2 8
2 3 1
>>> df[["A", "B"]].max(axis=1)
0 1
1 8
2 3
and so:
>>> df["C"] = df[["A", "B"]].max(axis=1)
>>> df
A B C
0 1 -2 1
1 2 8 8
2 3 1 3
If you know that "A" and "B" are the only columns, you could even get away with
>>> df["C"] = df.max(axis=1)
And you could use .apply(max, axis=1) too, I guess.
@DSM's answer is perfectly fine in almost any normal scenario. But if you're the type of programmer who wants to go a little deeper than the surface level, you might be interested to know that it is a little faster to call numpy functions on the underlying .to_numpy() (or .values for <0.24) array instead of directly calling the (cythonized) functions defined on the DataFrame/Series objects.
For example, you can use ndarray.max() along axis 1 (that is, across the columns of each row).
# Data borrowed from @DSM's post.
df = pd.DataFrame({"A": [1,2,3], "B": [-2, 8, 1]})
df
A B
0 1 -2
1 2 8
2 3 1
df['C'] = df[['A', 'B']].values.max(1)
# Or, assuming "A" and "B" are the only columns,
# df['C'] = df.values.max(1)
df
A B C
0 1 -2 1
1 2 8 8
2 3 1 3
If your data has NaNs, you will need numpy.nanmax:
df['C'] = np.nanmax(df.values, axis=1)
df
A B C
0 1 -2 1
1 2 8 8
2 3 1 3
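The example data above contains no NaNs, so the difference is not visible there. A small sketch with made-up missing values shows how the three variants behave:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [-2.0, 8.0, np.nan]})
arr = df[["A", "B"]].to_numpy()

plain = arr.max(axis=1)                  # NaN propagates into the result
nan_aware = np.nanmax(arr, axis=1)       # NaN is ignored
pandas_max = df[["A", "B"]].max(axis=1)  # pandas skips NaN by default (skipna=True)

print(plain)
print(nan_aware)
print(pandas_max.tolist())
```

So ndarray.max() is only a drop-in replacement for DataFrame.max when the data has no missing values.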
You can also use numpy.maximum.reduce. numpy.maximum is a ufunc (Universal Function), and every ufunc has a reduce:
df['C'] = np.maximum.reduce(df[['A', 'B']].values, axis=1)
# df['C'] = np.maximum.reduce(df[['A', 'B']], axis=1)
# df['C'] = np.maximum.reduce(df, axis=1)
df
A B C
0 1 -2 1
1 2 8 8
2 3 1 3
np.maximum.reduce and np.max appear to be more or less the same (for most normal sized DataFrames), and happen to be a shade faster than DataFrame.max. I imagine this difference roughly remains constant, and is due to internal overhead (indexing alignment, handling NaNs, etc).
The graph was generated using perfplot. Benchmarking code, for reference:
import numpy as np
import pandas as pd
import perfplot
np.random.seed(0)
df_ = pd.DataFrame(np.random.randn(5, 1000))
perfplot.show(
setup=lambda n: pd.concat([df_] * n, ignore_index=True),
kernels=[
lambda df: df.assign(new=df.max(axis=1)),
lambda df: df.assign(new=df.values.max(1)),
lambda df: df.assign(new=np.nanmax(df.values, axis=1)),
lambda df: df.assign(new=np.maximum.reduce(df.values, axis=1)),
],
labels=['df.max', 'np.max', 'np.nanmax', 'np.maximum.reduce'],
n_range=[2**k for k in range(0, 15)],
xlabel='N (* len(df))',
logx=True,
logy=True)
To find the single largest value across multiple columns (rather than one maximum per row), you can use:
df[['A','B']].max(axis=1).max(axis=0)
Example:
df =
A B
timestamp
2019-11-20 07:00:16 14.037880 15.217879
2019-11-20 07:01:03 14.515359 15.878632
2019-11-20 07:01:33 15.056502 16.309152
2019-11-20 07:02:03 15.533981 16.740607
2019-11-20 07:02:34 17.221073 17.195145
print(df[['A','B']].max(axis=1).max(axis=0))
17.221073
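To make the difference between the per-row maximum and the overall maximum explicit, here is a short sketch using two of the rows above (the timestamps are dropped for brevity):

```python
import pandas as pd

df = pd.DataFrame({
    "A": [14.037880, 17.221073],
    "B": [15.217879, 17.195145],
})

row_max = df[["A", "B"]].max(axis=1)     # one value per row
overall = row_max.max()                  # single scalar over both columns
same = df[["A", "B"]].to_numpy().max()   # same scalar, computed on the raw array

print(row_max.tolist())
print(overall)
```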
