I have a pandas DataFrame like this:
import pandas as pd
import numpy as np
data1 = np.repeat(np.array(range(3), ndmin=2), 3, axis=0)
columns1 = pd.MultiIndex.from_tuples([('foo', 'a'), ('foo', 'b'), ('bar', 'c')])
df1 = pd.DataFrame(data1, columns=columns1)
print(df1)
  foo    bar
    a  b   c
0   0  1   2
1   0  1   2
2   0  1   2
And another one like this:
data2 = np.repeat(np.array(range(3, 5), ndmin=2), 3, axis=0)
columns2 = ['d', 'e']
df2 = pd.DataFrame(data2, columns=columns2)
print(df2)
   d  e
0  3  4
1  3  4
2  3  4
Now, I would like to replace 'bar' of df1 with df2, but the regular syntax of single-level indexing doesn't seem to work:
df1['bar'] = df2
print(df1)
  foo    bar
    a  b    c
0   0  1  NaN
1   0  1  NaN
2   0  1  NaN
When what I would like to get is:
  foo    bar
    a  b   d  e
0   0  1   3  4
1   0  1   3  4
2   0  1   3  4
I'm not sure if I'm missing something in the syntax or if this is related to the issues described here and here. Could someone explain why this doesn't work and how to get the desired outcome?
I'm using Python 2.7 and pandas 0.24, if it makes a difference.
For lack of better alternative, I'm currently doing this:
df2.columns = pd.MultiIndex.from_product([['bar'], df2.columns])
df1.drop(columns='bar', level=0, inplace=True)
df1 = df1.join(df2)
This gives the desired result. Be cautious, though, if the order of columns is important, as this approach will likely change it.
Reading further into the mentioned GitHub issues, I think the reason the approach in the question doesn't work is indeed an inconsistency in the pandas API that hasn't been fixed yet.
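For completeness, a minimal alternative sketch (assuming df1 and df2 as defined above) that assigns df2's columns one by one under the 'bar' level using tuple keys; like the join, it appends the new columns at the end, so the column-order caveat still applies:
df1 = df1.drop(columns='bar', level=0)
for col in df2.columns:
    df1[('bar', col)] = df2[col]  # a tuple key creates the ('bar', col) column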
Related
If I've got a multi-level column index:
>>> cols = pd.MultiIndex.from_tuples([("a", "b"), ("a", "c")])
>>> pd.DataFrame([[1,2], [3,4]], columns=cols)
     a
   ---+--
    b | c
 --+---+--
 0 | 1 | 2
 1 | 3 | 4
How can I drop the "a" level of that index, so I end up with:
    b | c
 --+---+--
 0 | 1 | 2
 1 | 3 | 4
You can use MultiIndex.droplevel:
>>> cols = pd.MultiIndex.from_tuples([("a", "b"), ("a", "c")])
>>> df = pd.DataFrame([[1,2], [3,4]], columns=cols)
>>> df
   a
   b  c
0  1  2
1  3  4
[2 rows x 2 columns]
>>> df.columns = df.columns.droplevel()
>>> df
   b  c
0  1  2
1  3  4
[2 rows x 2 columns]
As of Pandas 0.24.0, we can now use DataFrame.droplevel():
cols = pd.MultiIndex.from_tuples([("a", "b"), ("a", "c")])
df = pd.DataFrame([[1,2], [3,4]], columns=cols)
df.droplevel(0, axis=1)
#    b  c
# 0  1  2
# 1  3  4
This is very useful if you want to keep your DataFrame method-chain rolling.
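For instance, a small sketch of such a chain (the rename step is purely illustrative):
import pandas as pd

cols = pd.MultiIndex.from_tuples([("a", "b"), ("a", "c")])
df = pd.DataFrame([[1, 2], [3, 4]], columns=cols)
out = (
    df.droplevel(0, axis=1)       # flatten the columns: ('a', 'b') -> 'b'
      .rename(columns=str.upper)  # keep chaining on the flattened frame
)
# out now has columns 'B' and 'C'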
Another way to drop the index is to use a list comprehension:
df.columns = [col[1] for col in df.columns]
   b  c
0  1  2
1  3  4
This strategy is also useful if you want to combine the names from both levels like in the example below where the bottom level contains two 'y's:
cols = pd.MultiIndex.from_tuples([("A", "x"), ("A", "y"), ("B", "y")])
df = pd.DataFrame([[1, 2, 8], [3, 4, 9]], columns=cols)
   A     B
   x  y  y
0  1  2  8
1  3  4  9
Dropping the top level would leave two columns with the index 'y'. That can be avoided by joining the names with the list comprehension.
df.columns = ['_'.join(col) for col in df.columns]
   A_x  A_y  B_y
0    1    2    8
1    3    4    9
That's a problem I had after doing a groupby and it took a while to find this other question that solved it. I adapted that solution to the specific case here.
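As a side note, if some tuples contain empty strings (common when single-level and multi-level columns get mixed), a small variant of the comprehension that skips the empty parts might look like this:
# col is a tuple such as ('A', 'x') or ('B', ''); empty parts are dropped
df.columns = ['_'.join(part for part in col if part) for col in df.columns]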
Another way to do this is to reassign df based on a cross section of df, using the .xs method.
>>> df
   a
   b  c
0  1  2
1  3  4
>>> # 'a'             : the key on which to take the cross section
>>> # axis=1          : take the cross section along the columns
>>> # drop_level=True : return the result without the outer level
>>> df = df.xs('a', axis=1, drop_level=True)
>>> df
   b  c
0  1  2
1  3  4
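For reference, .xs can also cut along the inner level by passing level=; a minimal sketch, assuming the original two-level df from before the reassignment:
>>> df.xs('b', axis=1, level=1)  # select by the inner label instead
   a
0  1
1  3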
A small trick: use sum with level=1 (this works when the labels in level 1 are all unique):
df.sum(level=1, axis=1)
Out[202]:
   b  c
0  1  2
1  3  4
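Note that newer pandas versions deprecate (and 2.0 removes) the level= argument to sum; an equivalent groupby form (a sketch) is:
df.groupby(level=1, axis=1).sum()
# pandas 2.x deprecates axis=1 grouping as well; a longer-lived spelling:
df.T.groupby(level=1).sum().T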
A more common solution is get_level_values:
df.columns = df.columns.get_level_values(1)
df
Out[206]:
   b  c
0  1  2
1  3  4
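If you prefer a non-mutating, chain-friendly form, a small sketch using set_axis:
df = df.set_axis(df.columns.get_level_values(1), axis=1)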
You could also achieve this by renaming the columns:
df.columns = ['b', 'c']
This involves a manual step, but it can be an option, especially if you would eventually rename your columns anyway.
I struggled with this problem because I couldn't work out why droplevel() wasn't doing anything. Working through several examples, I learned that in my table 'a' was merely the name of the columns index (not a level with data), while 'b' and 'c' were the actual column labels. This is what helped:
df.columns.name = None  # remove the stray columns name
df = df.reset_index()   # turn the index back into a regular column
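To make that concrete, here is a minimal sketch (the data is made up) reproducing the situation where 'a' is only the name of the columns index rather than a droppable level:
import pandas as pd

df = pd.DataFrame({'a': ['x', 'x', 'y'], 'b': [1, 2, 3]})
table = pd.crosstab(df['b'], df['a'])  # columns named 'a', index named 'b'
print(table.columns.name)    # -> 'a'
table.columns.name = None    # remove the stray columns name
table = table.reset_index()  # turn the 'b' index back into a regular column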
I want to match two pandas Dataframes by the name of their columns.
import pandas as pd
df1 = pd.DataFrame([[0,2,1],[1,3,0],[0,4,0]], columns=['A', 'B', 'C'])
   A  B  C
0  0  2  1
1  1  3  0
2  0  4  0
df2 = pd.DataFrame([[0,0,1],[1,5,0],[0,7,0]], columns=['A', 'B', 'D'])
   A  B  D
0  0  0  1
1  1  5  0
2  0  7  0
If the names match, do nothing. (Keep the column of df2)
If a column is in Dataframe 1 but not in Dataframe 2, add the column in Dataframe 2 as a vector of zeros.
If a column is in Dataframe 2 but not in Dataframe 1, drop it.
The output should look like this:
   A  B  C
0  0  0  0
1  1  5  0
2  0  7  0
I know if I do:
df2 = df2[df1.columns]
I get:
KeyError: "['C'] not in index"
I could also add the vectors of zeros manually, but of course this is a toy example of a much longer dataset. Is there any smarter/pythonic way of doing this?
After this operation, df2's columns should match df1's exactly: columns that are in df1 but not in df2 are added (filled with zeros), and columns that are only in df2 are dropped. We can simply reindex df2 against df1's columns with fill_value=0 (the safe equivalent of df2 = df2[df1.columns] that also fills in the newly added columns):
df2 = df2.reindex(columns=df1.columns, fill_value=0)
df2:
   A  B  C
0  0  0  0
1  1  5  0
2  0  7  0
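An alternative sketch, assuming the same df1 and df2, uses DataFrame.align, which conforms both frames in a single call:
# join='left' keeps df1's column set; fill_value=0 fills the columns df2 lacks
_, df2 = df1.align(df2, join='left', axis=1, fill_value=0)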
Consider a dictionary like the following:
>>> dict_temp = {'a': np.array([[0, 1, 2], [3, 4, 5]]),
...              'b': np.array([[3, 4, 5], [2, 5, 1], [5, 3, 7]])}
How can I build a pandas DataFrame out of this, using a multi-index with level 0 and 1 as follows:
level_0 = ['a', 'b']
level_1 = [[0,1], [0,1,2]]
I expect the code to build the MultiIndex levels itself; I don't care about the column names for now.
Any comments are appreciated.
Try concat:
pd.concat({k:pd.DataFrame(d) for k, d in dict_temp.items()})
Output:
     0  1  2
a 0  0  1  2
  1  3  4  5
b 0  3  4  5
  1  2  5  1
  2  5  3  7
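If you also want the two index levels labelled, concat accepts a names= argument (the level names below are the ones from the question):
out = pd.concat({k: pd.DataFrame(v) for k, v in dict_temp.items()},
                names=['level_0', 'level_1'])
print(out.index.names)  # ['level_0', 'level_1']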
>>> df
   0  1
0  0  0
1  1  1
2  2  1
>>> df1
   0  1  2
0  A  B  C
1  D  E  F
>>> crazy_magic()
>>> df
   0  1  3
0  0  0  A  # df1[0][0]
1  1  1  E  # df1[1][1]
2  2  1  F  # df1[2][1]
Is there a way to achieve this without a for loop?
import pandas as pd
df = pd.DataFrame([[0,0],[1,1],[2,1]])
df1 = pd.DataFrame([['A', 'B', 'C'],['D', 'E', 'F']])
df2 = df1.reset_index(drop=False)
#    index  0  1  2
# 0      0  A  B  C
# 1      1  D  E  F
df3 = pd.melt(df2, id_vars=['index'])
#    index variable value
# 0      0        0     A
# 1      1        0     D
# 2      0        1     B
# 3      1        1     E
# 4      0        2     C
# 5      1        2     F
result = pd.merge(df, df3, left_on=[0,1], right_on=['variable', 'index'])
result = result[[0, 1, 'value']]
print(result)
yields
   0  1 value
0  0  0     A
1  1  1     E
2  2  1     F
My reasoning goes as follows:
We want to use two columns of df as coordinates.
The word "coordinates" reminds me of pivot, since
if you have two columns whose values represent "coordinates" and a third
column representing values, and you want to convert that to a grid, then
pivot is the tool to use.
But df does not have a third column of values. The values are in df1. In fact df1 looks like the result of a pivot operation. So instead of pivoting df, we want to unpivot df1.
pd.melt is the function to use when you want to unpivot.
So I tried melting df1. Comparison with other uses of pd.melt led me to conclude df1 needed the index as a column. That's the reason for defining df2. So we melt df2.
Once you get that far, visually comparing df3 to df leads you naturally to the use of pd.merge.
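For reference, a shorter alternative that skips melt/merge entirely, using NumPy fancy indexing (a sketch assuming, as above, that df's two columns hold valid (column, row) coordinates into df1):
# the value for row i is df1.iloc[df[1][i], df[0][i]]
df[3] = df1.to_numpy()[df[1].to_numpy(), df[0].to_numpy()]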
I have a pandas dataframe and I want to create a new column, that is computed differently for different groups of rows. Here is a quick example:
import pandas as pd
data = {'foo': list('aaade'), 'bar': range(5)}
df = pd.DataFrame(data)
The dataframe looks like this:
   bar foo
0    0   a
1    1   a
2    2   a
3    3   d
4    4   e
Now I am adding a new column and try to assign some values to selected rows:
df['xyz'] = 0
df.loc[(df['foo'] == 'a'), 'xyz'] = df.loc[(df['foo'] == 'a')].apply(lambda x: x['bar'] * 2, axis=1)
The dataframe has not changed. What I would expect is the dataframe to look like this:
   bar foo  xyz
0    0   a    0
1    1   a    2
2    2   a    4
3    3   d    0
4    4   e    0
In my real-world problem, the 'xyz' column is also computed for the other rows, but using a different function. In fact, I am also using different columns for the computation. So my questions:
Why does the assignment in the above example not work?
Is it necessary to write df.loc[df['foo'] == 'a'] twice (as I am doing now)?
You're operating on a copy of df: indexing with a boolean mask returns a copy, not a view (see the docs on returning a view versus a copy).
Another way to achieve the desired result is as follows:
In [11]: df.apply(lambda row: row['bar'] * 2 if row['foo'] == 'a' else row['xyz'], axis=1)
Out[11]:
0    0
1    2
2    4
3    0
4    0
dtype: int64
In [12]: df['xyz'] = df.apply(lambda row: (row['bar']*2 if row['foo'] == 'a' else row['xyz']), axis=1)
In [13]: df
Out[13]:
   bar foo  xyz
0    0   a    0
1    1   a    2
2    2   a    4
3    3   d    0
4    4   e    0
Perhaps a neater way is just:
In [21]: 2 * df.bar * (df.foo == 'a')
Out[21]:
0    0
1    2
2    4
3    0
4    0
dtype: int64
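Since the real-world problem computes 'xyz' differently for different groups, one generalization worth sketching is numpy.select, with one condition and one formula per group (the rules for the other groups below are made up):
import numpy as np

conditions = [df['foo'] == 'a', df['foo'] == 'd']
choices = [df['bar'] * 2, df['bar'] + 10]  # illustrative per-group formulas
df['xyz'] = np.select(conditions, choices, default=0)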