How to deal with multi-index in pandas dataframe? - python

I have the following dataframe:
[1]: https://i.stack.imgur.com/3gvRa.png
which has three sub-levels of indices. The first one is a number of 4 digits, the second is a date, and the last one is an index from the original dataframe.
I want to name the first two levels ['District_ID', 'Month'] and drop the third level. I already tried to drop the last one using:
DF.index = DF.index.droplevel(2)
As a result, the third level is gone, but the second one now appears duplicated on every row. In addition, I want to rename the column indices. How can I accomplish these tasks?

I think you have a mistaken impression of what is going on. Take this simple sample frame
midx = pd.MultiIndex.from_tuples(
    (lev_0, lev_1, lev_2) for lev_0, lev_1 in zip("ab", "xy") for lev_2 in range(2)
)
df = pd.DataFrame({"col": range(4)}, index=midx)
       col
a x 0    0
    1    1
b y 0    2
    1    3
and look at the result of
print(df.index)
df = df.droplevel(2)
print(df.index)
MultiIndex([('a', 'x', 0),
            ('a', 'x', 1),
            ('b', 'y', 0),
            ('b', 'y', 1)],
           )
MultiIndex([('a', 'x'),
            ('a', 'x'),
            ('b', 'y'),
            ('b', 'y')],
           )
This should be exactly what you want. If you print df after the droplevel, it looks as if something strange is happening with the first level, but that is only pandas hiding repeated labels to make the printout clearer.
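To check that the printout is only a display effect, you can look at the level values directly. This is a small sketch that reuses the sample frame from above; nothing in it is specific to the asker's data.

```python
import pandas as pd

# Same sample frame as above, with the third level already dropped.
midx = pd.MultiIndex.from_tuples(
    [(lev_0, lev_1, lev_2) for lev_0, lev_1 in zip("ab", "xy") for lev_2 in range(2)]
)
df = pd.DataFrame({"col": range(4)}, index=midx).droplevel(2)

# Every row still carries its own first-level label, even where the printout hides it.
first_level = df.index.get_level_values(0).tolist()
print(first_level)

# After the drop, the remaining (level 0, level 1) pairs repeat, so the index is not unique.
print(df.index.is_unique)
```

So the repeated second-level labels are real duplicates in the index, not a bug in droplevel.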
As for the renaming:
df.index.names = ["lev_0", "lev_1"]
or
df.index.set_names(["lev_0", "lev_1"], inplace=True)
both lead to
             col
lev_0 lev_1
a     x        0
      x        1
b     y        2
      y        3
Or if you want to rename the columns (not clear to me what you are looking for), then you could do
df.columns = ["new_col"]
or
df = df.rename(columns={"col": "new_col"})
     new_col
a x        0
  x        1
b y        2
  y        3
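Putting the two steps together for the names in the question might look like the sketch below. The frame DF here is hypothetical stand-in data (the real one is only shown as an image), so the district numbers, months and the "value" column are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for the asker's frame: (district, month, original row index).
idx = pd.MultiIndex.from_tuples(
    [(1001, "2020-01", 0), (1001, "2020-01", 1), (1002, "2020-02", 2)]
)
DF = pd.DataFrame({"value": [10, 20, 30]}, index=idx)

# Drop the third level, then name the two levels that remain.
result = DF.droplevel(2)
result.index = result.index.set_names(["District_ID", "Month"])

print(result)
```

The data rows are untouched; only the index loses a level and gains names.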

Related

How to replace one of the levels of a MultiIndex dataframe with one of its columns

I have a dataframe such as
multiindex1 = pd.MultiIndex.from_product([['a'], np.arange(3, 8)])
df1 = pd.DataFrame(np.random.randn(5, 3), index=multiindex1)
multiindex2 = pd.MultiIndex.from_product([['s'], np.arange(1, 6)])
df2 = pd.DataFrame(np.random.randn(5, 3), index=multiindex2)
multiindex3 = pd.MultiIndex.from_product([['d'], np.arange(2, 7)])
df3 = pd.DataFrame(np.random.randn(5, 3), index=multiindex3)
df = pd.concat([df1, df2, df3])
df.index.names = ['contract', 'index']
df.columns = ['z', 'x', 'c']
>>>
z x c
contract index
a 3 0.354879 0.206557 0.308081
4 0.822102 -0.425685 1.973288
5 -0.801313 -2.101411 -0.707400
6 -0.740651 -0.564597 -0.975532
7 -0.310679 0.515918 -1.213565
s 1 -0.175135 0.777495 0.100466
2 2.295485 0.381226 -0.242292
3 -0.753414 1.172924 0.679314
4 -0.029526 -0.020714 1.546317
5 0.250066 -1.673020 -0.773842
d 2 -0.602578 -0.761066 -1.117238
3 -0.935758 0.448322 -2.135439
4 0.808704 -0.604837 -0.319351
5 0.321139 0.584896 -0.055951
6 0.041849 -1.660013 -2.157992
Now I want to replace the index level named index with the column c. That is to say, I want the result to be
z x
contract c
a 0.308081 0.354879 0.206557
1.973288 0.822102 -0.425685
-0.707400 -0.801313 -2.101411
-0.975532 -0.740651 -0.564597
-1.213565 -0.310679 0.515918
s 0.100466 -0.175135 0.777495
-0.242292 2.295485 0.381226
0.679314 -0.753414 1.172924
1.546317 -0.029526 -0.020714
-0.773842 0.250066 -1.673020
d -1.117238 -0.602578 -0.761066
-2.135439 -0.935758 0.448322
-0.319351 0.808704 -0.604837
-0.055951 0.321139 0.584896
-2.157992 0.041849 -1.660013
I implemented it this way:
df.reset_index().set_index(['contract', 'c']).drop(['index'], axis=1)
But this seems to involve redundant steps, because I manipulate the index three times. Is there a more elegant way to achieve this?
Try this
# convert column "c" into an index and remove "index" from index
df.set_index('c', append=True).droplevel('index')
Explanation:
The set_index method has an append argument that controls whether the given columns are appended to the existing index; setting it to True adds column "c" as a new index level. The droplevel method then removes a level; it acts on the index by default but can also remove a column level.
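As a quick check, the short form can be compared with the longer chain from the question. This is a sketch on a small made-up frame (the values are assumptions, only the level and column names follow the question).

```python
import pandas as pd

# Small frame with the same level names and columns as in the question.
idx = pd.MultiIndex.from_product([["a", "s"], [1, 2]], names=["contract", "index"])
df = pd.DataFrame(
    {"z": [0.1, 0.2, 0.3, 0.4], "x": [1.0, 2.0, 3.0, 4.0], "c": [9.0, 8.0, 7.0, 6.0]},
    index=idx,
)

# Answer's version: append "c" to the index, then drop the "index" level.
short = df.set_index("c", append=True).droplevel("index")

# Question's version: reset everything, rebuild the index, drop the leftover column.
long = df.reset_index().set_index(["contract", "c"]).drop(["index"], axis=1)

print(list(short.index.names))
print(short.index.tolist() == long.index.tolist())
```

Both routes end with the same index entries and the same remaining columns; the short form just avoids the round trip through reset_index.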

Pandas: Assign MultiIndex Column from DataFrame

I have a DataFrame with multiIndex columns. Suppose it is this:
index = pd.MultiIndex.from_tuples([('one', 'a'), ('one', 'b'),
('two', 'a'), ('two', 'b')])
df = pd.DataFrame({'col': np.arange(1.0, 5.0)}, index=index)
df = df.unstack(1)
(I know this definition could be more direct). I now want to set a new level 0 column based on a DataFrame. For example
df['col2'] = df['col'].applymap(lambda x: int(x < 3))
This does not work. The only methods I have found so far are to add each column separately (see "Pandas: add a column to a multiindex column dataframe") or to use some sort of convoluted joining process.
The desired result is a new level 0 column 'col2' with two level 1 subcolumns: 'a' and 'b'
Any help would be much appreciated, Thank you.
I believe you need a solution without unstack and stack (applied to df before the unstack call): filter by boolean indexing, rename the level values to avoid duplicates, and finally append the result with pd.concat (DataFrame.append was removed in pandas 2.0):
df2 = df[df['col'] < 3].rename({'one':'one1', 'two':'two1'}, level=0)
print (df2)
        col
one1 a  1.0
     b  2.0
df = pd.concat([df, df2])
print (df)
        col
one  a  1.0
     b  2.0
two  a  3.0
     b  4.0
one1 a  1.0
     b  2.0
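If the goal is literally the layout asked for in the question (a new level 0 column 'col2' with subcolumns 'a' and 'b' on the unstacked frame), one possible alternative is to build the new block separately and concatenate it along the columns. This is a sketch, not the approach of the answer above; the names flags and wide are made up for illustration.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_tuples([('one', 'a'), ('one', 'b'),
                                   ('two', 'a'), ('two', 'b')])
df = pd.DataFrame({'col': np.arange(1.0, 5.0)}, index=index)
df = df.unstack(1)

# df['col'] is a plain frame with columns 'a' and 'b'; compute the new values on it.
flags = (df['col'] < 3).astype(int)

# Wrapping the block in a dict gives it the outer column label 'col2',
# then the two blocks are joined side by side.
wide = pd.concat([df, pd.concat({'col2': flags}, axis=1)], axis=1)

print(wide)
```

The dict key becomes the level 0 label, so no per-column assignment is needed.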

Dataframe empty when passing column names

I am facing an issue where passing a NumPy array to DataFrame without column names initializes it properly, whereas if I pass column names, the result is empty.
x = np.array([(1, '1'), (2, '2')], dtype = 'i4,S1')
df = pd.DataFrame(x)
In []: df
Out[]:
f0 f1
0 1 1
1 2 2
df2 = pd.DataFrame(x, columns=['a', 'b'])
In []: df2
Out[]:
Empty DataFrame
Columns: [a, b]
Index: []
I think you need to specify the column names in the dtype parameter; see DataFrame from structured or record array:
x = np.array([(1, '1'), (2, '2')], dtype=[('a', 'i4'),('b', 'S1')])
df2 = pd.DataFrame(x)
print (df2)
a b
0 1 b'1'
1 2 b'2'
Another solution without parameter dtype:
x = np.array([(1, '1'), (2, '2')])
df2 = pd.DataFrame(x, columns=['a', 'b'])
print (df2)
a b
0 1 1
1 2 2
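A third option, if the structured dtype has to stay as in the question, is to let pandas take the field names and rename afterwards. This is a sketch; the decode step is an extra assumption (that the byte strings are plain ASCII) to turn values like b'1' into ordinary strings.

```python
import numpy as np
import pandas as pd

# Structured array as in the question; NumPy names the fields f0 and f1 by default.
x = np.array([(1, '1'), (2, '2')], dtype='i4,S1')

df = pd.DataFrame(x)       # columns are taken from the field names
df.columns = ['a', 'b']    # rename after construction instead of passing columns=

# Assumption: the S1 field holds ASCII bytes, so it can be decoded to str.
df['b'] = df['b'].str.decode('ascii')

print(df)
```

Passing columns=['a', 'b'] at construction time does not help here because, for a structured array, the columns are matched against the field names rather than used as new labels.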
It's the dtype parameter; without specifying it, construction works as expected.
See the example in the DataFrame documentation:
import numpy as np
import pandas as pd
x = np.array([(1, "11"), (2, "22")])
df = pd.DataFrame(x)
print(df)
df2 = pd.DataFrame(x, columns=['a', 'b'])
print(df2)

Get column names from Pivoted Panda Dataframe, without the column name of the original list

I have a dataset:
a b c
99-01-11 8 367235
99-01-11 5 419895
99-01-11 1 992194
99-03-23 4 419895
99-04-30 1 992194
99-06-02 9 419895
99-08-08 2 367235
99-08-12 3 419895
99-08-17 10 992194
99-10-22 3 419895
99-12-04 4 992194
00-03-04 2 367235
00-09-29 9 367235
00-09-30 9 367235
I changed it to a pivot table using the following code:
df = (pd.read_csv('orcs.csv'))
df_wanted = pd.pivot_table(df, index=['c'], columns=['a'], values=['b'])
My goal: I am trying to get a list of the column names in the pivot table. In other words, I am trying to get this:
['1999-01-11','1999-01-11','1999-01-11','1999-03-23','1999-04-30','1999-06-02','1999-08-08']
I tried to use this piece of code:
y= df_wanted.columns.tolist()
But this gives me a list with both the original column name and the pivot's new column name:
[('c', '00-03-04'), ('c', '00-09-29'), ('c', '00-09-30'), ('c', '99-01-11'), ('c', '99-03-23'), ('c', '99-04-30'), ('c', '99-06-02'), ('c', '99-08-08'), ('c', '99-08-12'), ('c', '99-08-17'), ('c', '99-10-22'), ('c', '99-12-04')]
I tried deleting the 'c' in various ways, such as
def remove_values_from_list(the_list, val):
    while val in the_list:
        the_list.remove(val)
remove_values_from_list(y, 'c')
but have had no luck. Does anyone know how to fix this problem? PS. retaining the order of the list is important, as I am going to use it as an array of y values for a line graph.
Many thanks.
The best approach is to first omit the [] in pivot_table to avoid a MultiIndex in the columns, and then use tolist() with a cast to string:
df_wanted = pd.pivot_table(df,index='c',columns='a',values='b')
#print (df_wanted)
print (df_wanted.columns.astype(str).tolist())
['1999-01-11', '1999-03-23', '1999-04-30', '1999-06-02', '1999-08-08',
'1999-08-12', '1999-08-17', '1999-10-22', '1999-12-04',
'2000-03-04', '2000-09-29', '2000-09-30']
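The difference between the two calls can be seen on a few rows typed in by hand. This is a sketch: the inline frame stands in for orcs.csv, and the dates are left as plain strings, so they keep their two-digit years.

```python
import pandas as pd

# A few rows from the question's data, typed in directly instead of read from orcs.csv.
df = pd.DataFrame({
    'a': ['99-01-11', '99-01-11', '99-03-23'],
    'b': [8, 5, 4],
    'c': [367235, 419895, 419895],
})

# Scalar arguments: the columns are a flat Index of dates.
flat = pd.pivot_table(df, index='c', columns='a', values='b')
print(flat.columns.tolist())

# List arguments: the columns are a MultiIndex, so pick out the date level.
nested = pd.pivot_table(df, index=['c'], columns=['a'], values=['b'])
print(nested.columns.get_level_values(1).tolist())
```

If the list form has to stay, get_level_values is a way to pull out only the date level without editing the tuples by hand.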

Map List of Tuples to New Column

Suppose I have a pandas.DataFrame:
In [76]: df
Out[76]:
a b c
0 -0.685397 0.845976 w
1 0.065439 2.642052 x
2 -0.220823 -2.040816 y
3 -1.331632 -0.162705 z
Suppose I have a list of tuples:
In [78]: tp
Out[78]: [('z', 0.25), ('y', 0.33), ('x', 0.5), ('w', 0.75)]
I would like to map tp to df such that the second element in each tuple lands in a new column, in the row matching the first element of that tuple.
The end result would look like this:
In [87]: df2
Out[87]:
a b c new
0 -0.685397 0.845976 w 0.75
1 0.065439 2.642052 x 0.50
2 -0.220823 -2.040816 y 0.33
3 -1.331632 -0.162705 z 0.25
I've tried using lambdas, pandas.applymap, pandas.map, etc. but cannot seem to crack this one. So, for those who will point out that I have not actually asked a question: how would I map tp to df such that the second element in each tuple lands in a new column, in the row matching the first element of that tuple?
You need to turn your list of tuples into a dict, which is ridiculously easy to do in Python, then pass it to map:
In [4]:
df['new'] = df['c'].map(dict(tp))
df
Out[4]:
a b c new
index
0 -0.685397 0.845976 w 0.75
1 0.065439 2.642052 x 0.50
2 -0.220823 -2.040816 y 0.33
3 -1.331632 -0.162705 z 0.25
The docs for map show that it accepts a dict, a Series, or a function as its argument.
applymap takes a function as its argument but operates element-wise on the whole DataFrame, which is not what you want in this case.
The online docs show how to apply an operation element wise, as does the excellent book
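A self-contained version of the dict-based map described above might look like this. It is a sketch that keeps only column c from the question; the extra Series with the key 'q' is made up to show what happens when a value has no entry in the dict.

```python
import pandas as pd

df = pd.DataFrame({'c': ['w', 'x', 'y', 'z']})
tp = [('z', 0.25), ('y', 0.33), ('x', 0.5), ('w', 0.75)]

# dict() turns the (key, value) tuples into a lookup table; map applies it per row.
lookup = dict(tp)
df['new'] = df['c'].map(lookup)
print(df)

# Values without a matching key come back as NaN instead of raising an error.
unmatched = pd.Series(['w', 'q']).map(lookup)
print(unmatched)
```

The order of the tuples in tp does not matter, because each row is looked up by its own value in c.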
Does this example help?
class pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)
>>> d = {'col1': ts1, 'col2': ts2}
>>> df = DataFrame(data=d, index=index)
>>> df2 = DataFrame(np.random.randn(10, 5))
>>> df3 = DataFrame(np.random.randn(10, 5),
... columns=['a', 'b', 'c', 'd', 'e'])
