Sort pandas DataFrame by multiple columns and duplicated index - python

I have a pandas DataFrame with duplicated indices. There are three rows for each index value, and together they correspond to a group of items. There are two columns, a and b.
import pandas

df = pandas.DataFrame([{'i': b % 4, 'a': abs(b - 6), 'b': b}
                       for b in range(12)]).set_index('i')
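For reference, the constructed (unsorted) frame looks like this:
   a   b
i
0  6   0
1  5   1
2  4   2
3  3   3
0  2   4
1  1   5
2  0   6
3  1   7
0  2   8
1  3   9
2  4  10
3  5  11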
I want to sort the DataFrame so that:
- All of the rows with the same index are adjacent (each group stays together).
- The groups are in descending order by the lowest value of a within the group. For example, in the df above, the first three rows should be the ones with index 0, because the lowest a value in that group is 2, and every other group has at least one row with an a value lower than 2. The next three rows could be either group 3 or group 1, because the lowest a value in both of those groups is 1. The last group should be group 2, because it has a row with an a value of 0.
- Within each group, the rows are sorted in ascending order by b.
Desired output:
a b
i
0 6 0
0 2 4
0 2 8
3 3 3
3 1 7
3 5 11
1 5 1
1 1 5
1 3 9
2 4 2
2 0 6
2 4 10
I've been trying something like:
df.groupby('i')[['a']].transform(min).sort(['a', 'b'], ascending=[0, 1])
But it gives me a KeyError, and it only gets that far if I make i a column instead of an index anyway.

The most straightforward way I see is to move your index into a column and compute a new column holding each group's minimum.
In [43]: df = df.reset_index()
In [45]: df['group_min'] = df.groupby('i')['a'].transform('min')
Then you can sort by your conditions:
In [49]: df.sort_values(['group_min', 'i', 'b'], ascending=[False, False, True])
Out[49]:
i a b group_min
0 0 6 0 2
4 0 2 4 2
8 0 2 8 2
3 3 3 3 1
7 3 1 7 1
11 3 5 11 1
1 1 5 1 1
5 1 1 5 1
9 1 3 9 1
2 2 4 2 0
6 2 0 6 0
10 2 4 10 0
To get back to your desired frame, drop the helper column and move i back into the index.
In [50]: df.sort_values(['group_min', 'i', 'b'], ascending=[False, False, True]).drop('group_min', axis=1).set_index('i')
Out[50]:
a b
i
0 6 0
0 2 4
0 2 8
3 3 3
3 1 7
3 5 11
1 5 1
1 1 5
1 3 9
2 4 2
2 0 6
2 4 10
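If you are on pandas >= 0.23, where sort_values also accepts index level names, the same result can be had without the reset_index round trip. A minimal sketch (out is just an illustrative name):
out = (df.assign(group_min=df.groupby(level='i')['a'].transform('min'))
         .sort_values(['group_min', 'i', 'b'], ascending=[False, False, True])
         .drop(columns='group_min'))  # drop the helper column again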

You can first sort by a in descending order (breaking ties by b ascending) and then do a stable sort on the index:
>>> df.sort_values(['a', 'b'], ascending=[False, True]).sort_index(kind='mergesort')
a b
i
0 6 0
0 2 4
0 2 8
1 5 1
1 3 9
1 1 5
2 4 2
2 4 10
2 0 6
3 5 11
3 3 3
3 1 7

Related

Filtering pandas dataframe groups based on groups comparison

I am trying to remove corrupted data from my pandas dataframe. I want to remove groups whose value is larger than the last group's value by more than one. Here is an example:
Value
0 1
1 1
2 1
3 2
4 2
5 2
6 8 <- if I group by Value, this group's number is larger than the
7 8    last group's number by 6, so I want to remove this group
8 3    from the dataframe
9 3
Expected result:
Value
0 1
1 1
2 1
3 2
4 2
5 2
6 3
7 3
Edit:
jezrael's solution is great, but in my case it is possible that there will be duplicate group values:
Value
0 1
1 1
2 1
3 3
4 3
5 3
6 1
7 1
Sorry if I was not clear about this.
First drop duplicates to get the unique values, then compare the differences with the shifted values, and finally filter by boolean indexing:
s = df['Value'].drop_duplicates()
v = s[s.diff().gt(s.shift())]
df = df[~df['Value'].isin(v)]
print(df)
Value
0 1
1 1
2 1
3 2
4 2
5 2
8 3
9 3
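To see why this works on the example above, here is an annotated trace (a sketch; the intermediate values appear as comments):
import pandas as pd

df = pd.DataFrame({'Value': [1, 1, 1, 2, 2, 2, 8, 8, 3, 3]})

s = df['Value'].drop_duplicates()  # first occurrence of each value: 1, 2, 8, 3
# s.diff()  -> NaN, 1, 6, -5
# s.shift() -> NaN, 1, 2,  8
v = s[s.diff().gt(s.shift())]      # only 8 qualifies: its jump (6) exceeds the last value (2)
df = df[~df['Value'].isin(v)]      # drop every row whose Value is 8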
Maybe:
df2 = df.drop_duplicates()
print(df[df['Value'].isin(df2.loc[~df2['Value'].gt(df2['Value'].shift(-1)), 'Value'].tolist())])
Output:
Value
0 1
1 1
2 1
3 2
4 2
5 2
8 3
9 3
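Unpacked step by step, that one-liner amounts to the following sketch (intermediate values as comments):
df2 = df.drop_duplicates()                     # one row per run: 1, 2, 8, 3
ok = ~df2['Value'].gt(df2['Value'].shift(-1))  # False only where a value exceeds the next unique value
keep = df2.loc[ok, 'Value'].tolist()           # [1, 2, 3]
print(df[df['Value'].isin(keep)])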
We can keep the rows where the difference from the previous value is less than or equal to 5, or is NaN. Afterwards we check for duplicated rows and keep only those:
s = df[df['Value'].diff().le(5) | df['Value'].diff().isna()]
s[s.duplicated(keep=False)]
Value
0 1
1 1
2 1
3 2
4 2
5 2
8 3
9 3

Pandas: Change repeated index to hierarchical index

See the example below.
Given a dataframe whose index has values repeated, how can I get a new dataframe with a hierarchical index whose first level is the original index and whose second level is 0, 1, 2, ..., n?
Example:
>>> df
0 1
a 2 4
a 4 6
b 7 8
b 2 4
c 3 7
>>> df2 = df.some_operation()
>>> df2
0 1
a 0 2 4
1 4 6
b 0 7 8
1 2 4
c 0 3 7
You can use cumcount with groupby:
df.assign(level2=df.groupby(level=0).cumcount()).set_index('level2', append=True)
Out[366]:
0 1
level2
a 0 2 4
1 4 6
b 0 7 8
1 2 4
c 0 3 7
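An equivalent sketch that builds the MultiIndex explicitly with pd.MultiIndex.from_arrays (the counter name is just for illustration):
import pandas as pd

df = pd.DataFrame([[2, 4], [4, 6], [7, 8], [2, 4], [3, 7]],
                  index=list('aabbc'))
counter = df.groupby(level=0).cumcount()  # 0, 1, 0, 1, 0
df.index = pd.MultiIndex.from_arrays([df.index, counter])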
You can also do it the fake way (totally not recommended, don't use this):
>>> df.index=[v if i%2 else '' for i,v in enumerate(df.index)]
>>> df.insert(0,'',([0,1]*3)[:-1])
>>> df
0 1
0 2 4
a 1 4 6
0 7 8
b 1 2 4
0 3 7
>>>
This just blanks out alternating index labels and inserts a column whose name is '' (the empty string), so the printout merely looks hierarchical; it does not build a real MultiIndex.

Pandas: how to add row values by index value

I'm having trouble working out how to add the index value of a pandas dataframe to each value at that index. For example, if I have a dataframe of zeroes, the row with index 1 should have a value of 1 for all columns. The row at index 2 should have values of 2 for each column, and so on.
Can someone enlighten me please?
You can use pd.DataFrame.add with axis=0. Just remember, as below, to convert your index to a series first.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 10, (5, 5)))
print(df)
0 1 2 3 4
0 3 4 2 2 2
1 9 6 1 8 0
2 2 9 0 5 3
3 3 1 1 7 0
4 2 6 3 6 6
df = df.add(df.index.to_series(), axis=0)
print(df)
0 1 2 3 4
0 3 4 2 2 2
1 10 7 2 9 1
2 4 11 2 7 5
3 6 4 4 10 3
4 6 10 7 10 10
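A broadcasting alternative, as a sketch (assumes a numeric index and pandas >= 0.24 for Index.to_numpy): reshape the index values into a column vector so NumPy broadcasting spreads them across every column.
import numpy as np

df + df.index.to_numpy()[:, None]  # shape (n, 1) broadcasts across all columns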

Pandas DataFrame get column combined max values

I have a pandas DataFrame like following.
df = pd.DataFrame({"A": [3,1,2,4,5,3,4,10], "B": [1,3,2,4,0,0,1,0]})
The values 0 to 10 are recommendations (10 is best). Each DataFrame column is a category (A, B, etc.) that the 0-to-10 recommendation relates to. All categories have the same weight, and each row corresponds to one item.
I want the DataFrame sorted so that the items with the highest values combined across both (or more) categories come first. So if an item has a value of 10 in category A but 0 in category B, it would not be the highest-rated item. In the example above, the row with values [4, 4] would be the best choice.
My groupby solution does not give the expected result.
grouped = df.groupby(['A', 'B'])
grouped[["A", "B"]].max().sort(ascending=False)
result:
A B
A B
       A  B
A  B
10 0  10  0
5  0   5  0
4  4   4  4
   1   4  1
3  1   3  1
   0   3  0
2  2   2  2
1  3   1  3
A row based total sum would also not yield the expected result since it does not differentiate between categories.
df = pd.DataFrame({"A": [3,1,2,4,5,3,4,10], "B": [1,3,2,4,0,0,1,0]})
Then calculate the rank for each column in the data frame:
rank = df.rank(method = "dense")
rank
Out[44]:
A B
0 3 2
1 1 4
2 2 3
3 4 5
4 5 1
5 3 1
6 4 2
7 6 1
Add a new column to the data frame which is the total rank based on all categories:
df['total_rank'] = rank.sum(axis = 1)
df
Out[46]:
A B total_rank
0 3 1 5
1 1 3 5
2 2 2 5
3 4 4 9
4 5 0 6
5 3 0 4
6 4 1 6
7 10 0 7
And finally sort your data frame by total rank:
df.sort_values('total_rank', ascending=False)
Out[49]:
A B total_rank
3 4 4 9
7 10 0 7
4 5 0 6
6 4 1 6
0 3 1 5
1 1 3 5
2 2 2 5
5 3 0 4
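The same idea can be written as a single expression, a sketch that reorders the original frame by the summed dense ranks without keeping a helper column:
import pandas as pd

df = pd.DataFrame({"A": [3,1,2,4,5,3,4,10], "B": [1,3,2,4,0,0,1,0]})
df.loc[df.rank(method='dense').sum(axis=1).sort_values(ascending=False).index]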
How about this
df['pos'] = df.A/df.A.mean() + df.B/df.B.mean()
df.sort_values('pos', ascending=False)
# A B pos
#3 4 4 3.909091
#7 10 0 2.500000
#1 1 3 2.431818
#2 2 2 1.954545
#6 4 1 1.727273
#0 3 1 1.477273
#4 5 0 1.250000
#5 3 0 0.750000
If you have more columns you want to rank, e.g. ['A', 'B', 'C', ...]:
import numpy as np
cols = ['A', 'B']  # , 'C', 'D', ... ]
df['pos'] = np.sum([df[col]/df[col].mean() for col in cols], axis=0)
Update
Because 0 is itself a quality value (the lowest), I would amend my answer as follows (not sure it makes a huge difference):
df['pos'] = (df.A+1)/(df.A.max()+1) + (df.B+1)/(df.B.max()+1)
df.sort_values('pos', ascending=False)
# A B pos
#3 4 4 1.454545
#7 10 0 1.200000
#1 1 3 0.981818
#2 2 2 0.872727
#6 4 1 0.854545
#0 3 1 0.763636
#4 5 0 0.745455
#5 3 0 0.563636
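The amended score generalizes to an arbitrary column list the same way, a sketch reusing the cols list from above:
df['pos'] = np.sum([(df[col] + 1)/(df[col].max() + 1) for col in cols], axis=0)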

Efficient Partitioning of Pandas DataFrame rows between sandwiched indicator variables

Suppose I have a pandas df with an indicator column that closes off ("sandwiches") each period. Example:
In [9]: pd.DataFrame({'col1':np.arange(1,11),'indicator':[0,1,0,0,0,1,0,0,1,1]})
Out[9]:
col1 indicator
0 1 0
1 2 1
2 3 0
3 4 0
4 5 0
5 6 1
6 7 0
7 8 0
8 9 1
9 10 1
What I want to do, is to use groupby to select the partitions separated by the indicators.
ex.
Group 1
col1 indicator
0 1 0
1 2 1
Group 2
2 3 0
3 4 0
4 5 0
5 6 1
Group 3
6 7 0
7 8 0
8 9 1
Group 4
9 10 1
The naive solution would be to pull the indicator column out as a list, run a for-loop through it, and label each part. But suppose the dataset is really big and you want to avoid the for-loop. Is there something cleverer that can be done here to separate out the different groups?
Thanks!
Just assign another column as the cumulative sum of indicator, then group by it; this should do the trick:
# reverse the order as you have indicator at end of group, then reverse back
df['grouped'] = df['indicator'].loc[::-1].cumsum().loc[::-1]
for g in df.groupby('grouped', sort=False):
    print(g)
(4, col1 indicator grouped
0 1 0 4
1 2 1 4)
(3, col1 indicator grouped
2 3 0 3
3 4 0 3
4 5 0 3
5 6 1 3)
(2, col1 indicator grouped
6 7 0 2
7 8 0 2
8 9 1 2)
(1, col1 indicator grouped
9 10 1 1)
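An equivalent forward pass, as a sketch (assumes pandas >= 0.24 for shift's fill_value): shift the plain cumsum down one row so each indicator row stays with the rows before it, yielding ascending group labels without the double reversal.
df['grouped'] = df['indicator'].cumsum().shift(fill_value=0)
# indicator: 0 1 0 0 0 1 0 0 1 1
# cumsum:    0 1 1 1 1 2 2 2 3 4
# shifted:   0 0 1 1 1 1 2 2 2 3  -> groups {0,1}, {2..5}, {6..8}, {9}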
