Pandas: Change repeated index to hierarchical index - python

See the example below.
Given a dataframe whose index has values repeated, how can I get a new dataframe with a hierarchical index whose first level is the original index and whose second level is 0, 1, 2, ..., n?
Example:
>>> df
   0  1
a  2  4
a  4  6
b  7  8
b  2  4
c  3  7
>>> df2 = df.some_operation()
>>> df2
     0  1
a 0  2  4
  1  4  6
b 0  7  8
  1  2  4
c 0  3  7

You can use cumcount:
df.assign(level2=df.groupby(level=0).cumcount()).set_index('level2',append=True)
Out[366]:
          0  1
  level2
a 0       2  4
  1       4  6
b 0       7  8
  1       2  4
c 0       3  7
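For reference, here is a minimal self-contained sketch of the same idea, assuming a frame shaped like the one in the question. The name level2 is only an illustrative label for the new level; it passes the counter straight to set_index instead of going through assign.

```python
import pandas as pd

# Example frame with a repeated index, mirroring the question's data.
df = pd.DataFrame([[2, 4], [4, 6], [7, 8], [2, 4], [3, 7]],
                  index=list("aabbc"))

# cumcount numbers the rows within each index group, starting at 0.
counter = df.groupby(level=0).cumcount()

# Append the counter as a second index level ("level2" is an illustrative name).
df2 = df.set_index(counter.rename("level2"), append=True)
print(df2)
```

With a real MultiIndex you can then select a single row with df2.loc[('b', 1)].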

There is also a fake way (totally not recommended, don't use this):
>>> df.index=[v if i%2 else '' for i,v in enumerate(df.index)]
>>> df.insert(0,'',([0,1]*3)[:-1])
>>> df
      0  1
   0  2  4
a  1  4  6
   0  7  8
b  1  2  4
   0  3  7
>>>
This changes the index labels and inserts a column whose name is '' (an empty string), so the result only looks like a hierarchical index.

Related

On DataFrame.pivot(), different result with what I expected

I'm referring to
https://github.com/pandas-dev/pandas/tree/main/doc/cheatsheet.
According to the cheat sheet, after pivot() all values end up in rows 0 and 1.
But when I actually use pivot(), the result is different, as shown below.
DataFrame before pivot():
DataFrame after pivot():
Is this result intended?
In your data, the grey column (index of the row) is missing:
df = pd.DataFrame({'variable': list('aaabbbccc'), 'value': range(9)})
print(df)
# Output
variable value
0 a 0
1 a 1
2 a 2
3 b 3
4 b 4
5 b 5
6 c 6
7 c 7
8 c 8
Add the grey column:
df['grey'] = df.groupby('variable').cumcount()
print(df)
# Output
variable value grey
0 a 0 0
1 a 1 1
2 a 2 2
3 b 3 0
4 b 4 1
5 b 5 2
6 c 6 0
7 c 7 1
8 c 8 2
Now you can pivot:
df = df.pivot(index='grey', columns='variable', values='value')
print(df)
# Output
variable a b c
grey
0 0 3 6
1 1 4 7
2 2 5 8
Take the time to read How can I pivot a dataframe?
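To make the difference concrete, here is a small sketch comparing pivot with and without the helper column. It assumes the same example data as above; the names sparse and dense are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"variable": list("aaabbbccc"), "value": range(9)})

# Without a per-group counter, pivot keeps the original row labels 0..8,
# so each column holds only its own group's values and NaN everywhere else.
sparse = df.pivot(columns="variable", values="value")

# With the counter as the index, the three groups line up in rows 0, 1, 2.
dense = (
    df.assign(grey=df.groupby("variable").cumcount())
      .pivot(index="grey", columns="variable", values="value")
)

print(sparse.shape)  # 9 rows, mostly NaN
print(dense.shape)   # 3 rows, no NaN
```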

Pandas: drop duplicates based on 2 columns that have different values

How can I drop duplicates in this specific way?
Index B C
1 2 1
2 2 0
3 3 1
4 3 1
5 4 0
6 4 0
7 4 0
8 5 1
9 5 0
10 5 1
Desired output :
Index B C
3 3 1
5 4 0
So I want to drop duplicates on B, but only when C is the same on every row of the group, keeping one sample record per group.
For example, B = 3 for index 3/4, and since C = 1 for both, I do not drop them all; one row is kept.
But B = 5 for index 8/9/10, and since C is both 1 and 0 there, the whole group gets dropped.
Try this, using transform with nunique and drop_duplicates:
df[df.groupby('B')['C'].transform('nunique') == 1].drop_duplicates(subset='B')
Output:
B C
Index
3 3 1
5 4 0
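A self-contained sketch of the same one-liner, split into two steps so the mask is visible. It assumes the data from the question; constant_c is just an illustrative name.

```python
import pandas as pd

df = pd.DataFrame(
    {"B": [2, 2, 3, 3, 4, 4, 4, 5, 5, 5],
     "C": [1, 0, 1, 1, 0, 0, 0, 1, 0, 1]},
    index=pd.Index(range(1, 11), name="Index"),
)

# True for rows whose B group contains exactly one distinct C value.
constant_c = df.groupby("B")["C"].transform("nunique") == 1

# Keep only those groups, then keep the first row of each remaining B.
result = df[constant_c].drop_duplicates(subset="B")
print(result)
```

drop_duplicates keeps the first occurrence by default; pass keep='last' if you would rather keep the last row of each group.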

How do I get dataframe values with multiindex where some value is NOT in multiindex?

Here is example of my df (for example):
                 2000-02-01              2000-03-01              ...
                sub_col_one sub_col_two sub_col_one sub_col_two  ...
idx_one idx_two
2       a                 5           2           3           3
0       b                 0           5           8           1
2       x                 0           0           6           1
0       d                 8           3           5           5
3       x                 5           6           5           9
2       e                 2           5           0           5
3       x                 1           7           4           4
The question:
How could I get all rows of that df, where idx_two is not equal to x?
I've tried get_level_values, but can't get what I need.
Use Index.get_level_values with name of level with boolean indexing:
df1 = df[df.index.get_level_values('idx_two') != 'x']
Or with the position of the level, here 1, because Python counts from 0:
df1 = df[df.index.get_level_values(1) != 'x']
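Here is a runnable sketch of that filter on a cut-down frame. It assumes a single illustrative column named value instead of the two-level date columns from the question; the index level names are the ones from the question.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [(2, "a"), (0, "b"), (2, "x"), (0, "d"), (3, "x")],
    names=["idx_one", "idx_two"],
)
# "value" is a stand-in for the real columns.
df = pd.DataFrame({"value": [5, 0, 0, 8, 5]}, index=index)

# Boolean mask built from one level of the MultiIndex.
mask = df.index.get_level_values("idx_two") != "x"
df1 = df[mask]

# To exclude several labels at once, isin on the level values also works.
df2 = df[~df.index.get_level_values("idx_two").isin(["x", "d"])]
print(df1)
```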

How to add rows into existing dataframe in pandas? - python

df = pd.DataFrame({'a':[1,2,3,4],'b':[5,6,7,8],'c':[9,10,11,12]})
How can I insert a new row of zeros at index 0 in one single line?
I tried pd.concat([pd.DataFrame([[0,0,0]]),df]) but it did not work.
The desired output:
a b c
0 0 0 0
1 1 5 9
2 2 6 10
3 3 7 11
4 4 8 12
You can concat a temporary one-row DataFrame with the original df, but you need to give it the same column names so that the columns align in the concatenated result. To get the index you want, call reset_index with drop=True.
In [87]:
pd.concat([pd.DataFrame([[0,0,0]], columns=df.columns),df]).reset_index(drop=True)
Out[87]:
a b c
0 0 0 0
1 1 5 9
2 2 6 10
3 3 7 11
4 4 8 12
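Alternatively, on pandas versions before 2.0 you can use DataFrame.append (it was deprecated in 1.4 and removed in 2.0, so prefer pd.concat on current versions):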
In [163]: pd.DataFrame([[0,0,0]], columns=df.columns).append(df, ignore_index=True)
Out[163]:
a b c
0 0 0 0
1 1 5 9
2 2 6 10
3 3 7 11
4 4 8 12
An answer that builds the zero row from the dataframe being prepended to, so the columns and dtypes match automatically:
pd.concat([df.iloc[[0], :] * 0, df]).reset_index(drop=True)
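Putting the concat approach together as a self-contained sketch, assuming the df from the question. The names zeros and result are illustrative, and ignore_index=True is used here as a shorter equivalent of reset_index(drop=True).

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8], "c": [9, 10, 11, 12]})

# One-row frame with the same column names, so concat aligns the columns.
zeros = pd.DataFrame([[0, 0, 0]], columns=df.columns)

# ignore_index=True renumbers the rows 0..n in the result.
result = pd.concat([zeros, df], ignore_index=True)
print(result)
```

Without matching column names, concat would create extra columns 0, 1, 2 filled with NaN, which is why the attempt in the question did not give the desired output.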

Sort pandas DataFrame by multiple columns and duplicated index

I have a pandas DataFrame with duplicated indices. There are 3 rows with each index, and they correspond to a group of items. There are two columns, a and b.
df = pandas.DataFrame([{'i': b % 4, 'a': abs(b - 6) , 'b': b}
for b in range(12)]).set_index('i')
I want to sort the DataFrame so that:
All of the rows with the same indices are adjacent. (all of the groups are together)
The groups are in reverse order by the lowest value of a within the group.
For example, in the above df, the first three items should be the ones with index 0, because the lowest a value for those three rows is 2, and all of the other groups have at least one row with an a value lower than 2. The second three items could be either group 3 or group 1, because the lowest a value in both of those groups is 1. The last group of items should be group 2, because it has a row with an a value of 0.
Within each group, the items are sorted in ascending order by b.
Desired output:
a b
i
0 6 0
0 2 4
0 2 8
3 3 3
3 1 7
3 5 11
1 5 1
1 1 5
1 3 9
2 4 2
2 0 6
2 4 10
I've been trying something like:
df.groupby('i')[['a']].transform(min).sort(['a', 'b'], ascending=[0, 1])
But it gives me a KeyError, and it only gets that far if I make i a column instead of an index anyway.
The most straightforward way I see is moving your index to a column, and calculating a new column with the group min.
In [43]: df = df.reset_index()
In [45]: df['group_min'] = df.groupby('i')['a'].transform('min')
Then you can sort by your conditions:
In [49]: df.sort_values(['group_min', 'i', 'b'], ascending=[False, False, True])
Out[49]:
i a b group_min
0 0 6 0 2
4 0 2 4 2
8 0 2 8 2
3 3 3 3 1
7 3 1 7 1
11 3 5 11 1
1 1 5 1 1
5 1 1 5 1
9 1 3 9 1
2 2 4 2 0
6 2 0 6 0
10 2 4 10 0
To get back to your desired frame, drop the tracking column and set i as the index again.
In [50]: df.sort_values(['group_min', 'i', 'b'], ascending=[False, False, True]).drop('group_min', axis=1).set_index('i')
Out[50]:
a b
i
0 6 0
0 2 4
0 2 8
3 3 3
3 1 7
3 5 11
1 5 1
1 1 5
1 3 9
2 4 2
2 0 6
2 4 10
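The steps above can also be written as one chain. This is only a sketch under the same assumptions as the answer: group_min is a temporary helper column, and ties between groups are broken by descending i.

```python
import pandas as pd

df = pd.DataFrame(
    [{"i": b % 4, "a": abs(b - 6), "b": b} for b in range(12)]
).set_index("i")

result = (
    df.reset_index()
      # Lowest "a" within each group, broadcast to every row of the group.
      .assign(group_min=lambda d: d.groupby("i")["a"].transform("min"))
      # Groups by descending minimum, ties by descending i, rows by ascending b.
      .sort_values(["group_min", "i", "b"], ascending=[False, False, True])
      .drop(columns="group_min")
      .set_index("i")
)
print(result)
```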
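You can first sort by a in descending order and then sort by the index with a stable sort (DataFrame.sort was removed from pandas, so use sort_values). Note that this orders the groups by index value rather than by group minimum, and the rows within each group by a rather than b:
>>> df.sort_values(['a', 'b'], ascending=[False, True]).sort_index(kind='mergesort')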
a b
i
0 6 0
0 2 4
0 2 8
1 5 1
1 3 9
1 1 5
2 4 2
2 4 10
2 0 6
3 5 11
3 3 3
3 1 7
