How to sort a dataframe based on idxmax? - python

I have a dataframe like this:
    A  B   C
0   1  2   1
1   3 -8  10
2  10  3 -20
3  50  7   1
I would like to rearrange its columns based on the row index of the maximal absolute value in each column. In column A the maximal absolute value is in row 3, in B it is in row 1, and in C it is in row 2, which means the new column order should be B C A.
Currently I do this as follows:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 3, 10, 50], 'B': [2, -8, 3, 7], 'C': [1, 10, -20, 1]})
indMax = abs(df).idxmax(axis=0)
df = df.iloc[:, np.argsort(indMax)]
So I first determine the indices of the maximal absolute value per column, which are stored in indMax; then I sort them and rearrange the dataframe accordingly, which gives me the desired output:
   B   C   A
0  2   1   1
1 -8  10   3
2  3 -20  10
3  7   1  50
My question is whether it is possible to pass the function idxmax directly to a sort function and change the dataframe in place.

IIUC the following does what you want:
In [69]:
df.loc[:, df.abs().idxmax().sort_values().index]
Out[69]:
   B   C   A
0  2   1   1
1 -8  10   3
2  3 -20  10
3  7   1  50
Here we take the idxmax of the absolute values, sort the result, and pass its index to select the columns of the df.
As to sorting in place: column reordering always produces a new object, so you can just assign the result back to df.
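For example, a minimal sketch of the assign-back (names and data taken from the question):
import pandas as pd

df = pd.DataFrame({'A': [1, 3, 10, 50], 'B': [2, -8, 3, 7], 'C': [1, 10, -20, 1]})
df = df.loc[:, df.abs().idxmax().sort_values().index]
print(df.columns.tolist())  # ['B', 'C', 'A']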
For pandas versions before 0.17.0 (where sort_values does not exist) the following works:
In [75]:
df.ix[:,df.abs().idxmax().sort(inplace=False).index]
Out[75]:
   B   C   A
0  2   1   1
1 -8  10   3
2  3 -20  10
3  7   1  50

This is ugly, but it seems to work using reindex_axis:
import numpy as np
>>> df.reindex_axis(df.columns[list(np.argsort(abs(df).idxmax(axis=0)))], axis=1)
   B   C   A
0  2   1   1
1 -8  10   3
2  3 -20  10
3  7   1  50
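Note that reindex_axis was deprecated and later removed from pandas; on current versions the same idea can be written with reindex. A minimal sketch, reusing the question's df:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 3, 10, 50], 'B': [2, -8, 3, 7], 'C': [1, 10, -20, 1]})
# Reorder columns by the row position of each column's absolute maximum.
order = np.argsort(df.abs().idxmax().to_numpy())
df = df.reindex(columns=df.columns[order])
print(df)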

Related

Get the index of n maximum values in a column in dataframe

I have a data frame and I want to get the index and value of the four maximum values in a column. For example, in the following df, the four maximum values in column a are 10, 8, 7 and 6.
import pandas as pd

df = pd.DataFrame()
df['a'] = [10, 2, 3, -1, 4, 5, 6, 7, 8]
df['id'] = [100, 2, 3, -1, 4, 5, 0, 1, 2]
df
The output I want is the index, a, and id of those four rows.
Try nlargest:
df.nlargest(4, 'a').reset_index()
Output:
   index   a   id
0      0  10  100
1      8   8    2
2      7   7    1
3      6   6    0
You can also sort the a column:
out = (df.sort_values('a', ascending=False).iloc[:4]
         .sort_index(ascending=True)
         .reset_index())
print(out)
   index   a   id
0      0  10  100
1      6   6    0
2      7   7    1
3      8   8    2
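If you only need the index/value pairs from that single column, Series.nlargest returns them directly. A small sketch with the question's data:
import pandas as pd

df = pd.DataFrame({'a': [10, 2, 3, -1, 4, 5, 6, 7, 8],
                   'id': [100, 2, 3, -1, 4, 5, 0, 1, 2]})
top4 = df['a'].nlargest(4)    # index -> value for the four largest entries
print(top4.index.tolist())    # [0, 8, 7, 6]
print(top4.tolist())          # [10, 8, 7, 6]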

Pandas update values using loc with repeated indices

I have a dataframe with repeated index values. I'm trying to update the values, using the index, for all rows with that index. Here is an example of what I have:
  name  x
t
0    A  5
0    B  2
1    A  7
2    A  5
2    B  9
2    C  3
"A" is present at every time. I want to replace "x" with the current value of "x", minus the value of "x" for "A" at that time. The tricky part is to get with an array or dataframe that is, in this case
array([5, 5, 7, 5, 5, 5])
which is the value for "A", but repeated for each timestamp. I can then subtract this from df['x']. My working solution is below.
temp = df[df['name'] == 'A']
d = dict(zip(temp.index, temp['x']))
df['x'] = df['x'] - df.index.to_frame()['t'].replace(d)
  name  x
t
0    A  0
0    B -3
1    A  0
2    A  0
2    B  4
2    C -2
This works, but it feels a bit hacky, and I can't help but think there is a better (and much faster) solution...
I would use reindex:
df.x -= df.loc[df.name == 'A', 'x'].reindex(df.index).values
df
Out[362]:
  name  x
t
0    A  0
0    B -3
1    A  0
2    A  0
2    B  4
2    C -2
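For reference, a self-contained version of this reindex approach, reconstructing the example frame with its index named t:
import pandas as pd

df = pd.DataFrame({'name': ['A', 'B', 'A', 'A', 'B', 'C'],
                   'x': [5, 2, 7, 5, 9, 3]},
                  index=pd.Index([0, 0, 1, 2, 2, 2], name='t'))

# The 'A' rows have a unique index, so reindexing them onto the full
# (repeated) index broadcasts A's x value to every row at that t.
baseline = df.loc[df['name'] == 'A', 'x'].reindex(df.index)
df['x'] = df['x'] - baseline.values
print(df)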
Group by the .cumsum() of (name == 'A') and subtract the first value in each group from the rest:
df['x'] = df.groupby((df.name == 'A').cumsum())['x'].apply(lambda s: s.sub(s.iloc[0]))
  name  x
t
0    A  0
0    B -3
1    A  0
2    A  0
2    B  4
2    C -2

Append Data to Pandas Dataframe

I have the following pandas dataframe:
df = pd.DataFrame({'A': [1, 2, 3, 7], 'B': [4, 5, 6, 29]})
I'm working on a for loop that grabs an index and then appends data to the end of that row.
How do I append columns C, D, E for a given index of the table? Let's say on iteration one, the index is 2:
   A   B   C   D   E
0  1   4   0   0   0
1  2   5   0   0   0
2  3   6  34  12  23
3  7  29   0   0   0
On the next iteration of the for loop, the index might be 1. Then the dataframe would be:
   A   B   C   D   E
0  1   4   0   0   0
1  2   5   8  11   4
2  3   6  34  12  23
3  7  29   0   0   0
How do I do this?
You can target a specific cell by using loc and providing the row index and column label.
For example:
df.loc[5, 'D'] = 10
This sets the value 10 in column D of the row with index 5.
Your question states that you want to add new columns depending on the row index. That doesn't quite work, because a dataframe is not like a NoSQL document where you can add fields to one record independently of the others; every column exists for every row.
What you should do is add all of your columns to the dataframe up front, then fill in values as you go.
To set several columns at once:
df.loc[5, ['D', 'B']] = 10
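Putting this together, a minimal sketch (the column names C, D, E, the zero fill, and the per-iteration values are taken from the question):
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 7], 'B': [4, 5, 6, 29]})

# Create all the new columns up front, filled with zeros.
for col in ['C', 'D', 'E']:
    df[col] = 0

# Then, on each loop iteration, fill one row by its index.
df.loc[2, ['C', 'D', 'E']] = [34, 12, 23]
df.loc[1, ['C', 'D', 'E']] = [8, 11, 4]
print(df)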

How to perform an IF statement for duplicate values within the same column

I have a DataFrame and want to find duplicate values within a column. If found, I want to create a new column whose value gains an extra zero each time the value reappears, leaving the first occurrence unchanged.
Original DataFrame:
Code1
1
2
3
4
5
1
2
1
1
New DataFrame:
Code1  Code2
    1      1
    2      2
    3      3
    4      4
    5      5
    1     10
    2     20
    1    100
    1   1000
Use groupby and cumcount:
df.assign(counts=df.groupby("Code1").cumcount(),
          Code2=lambda x: x["Code1"] * 10 ** x["counts"]
          ).drop("counts", axis=1)
   Code1  Code2
0      1      1
1      2      2
2      3      3
3      4      4
4      5      5
5      1     10
6      2     20
7      1    100
8      1   1000
There might be a solution using transform (I just don't have time right now to investigate). However, this version is really explicit about what is happening:
import pandas as pd

data = [1, 2, 3, 4, 5, 1, 2, 1, 1]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Code1'])

code2 = []
x = {}
for d in data:
    if d not in x:
        x[d] = d
    else:
        x[d] = x[d] * 10
    code2.append(x[d])

df['Code2'] = code2
print(df)
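Following up on the transform remark above: groupby(...).cumcount() gets there in one vectorised line, with no helper column. A sketch against the question's data:
import pandas as pd

df = pd.DataFrame({'Code1': [1, 2, 3, 4, 5, 1, 2, 1, 1]})

# cumcount() numbers the repeats 0, 1, 2, ... within each Code1 group,
# so multiplying by 10**cumcount appends one zero per earlier duplicate.
df['Code2'] = df['Code1'] * 10 ** df.groupby('Code1').cumcount()
print(df)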

Pandas GroupBy and select rows with the minimum value in a specific column

I have a DataFrame with columns A, B, and C. For each value of A, I would like to select the row with the minimum value in column B.
That is, from this:
df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
                   'B': [4, 5, 2, 7, 4, 6],
                   'C': [3, 4, 10, 2, 4, 6]})
   A  B   C
0  1  4   3
1  1  5   4
2  1  2  10
3  2  7   2
4  2  4   4
5  2  6   6
I would like to get:
   A  B   C
0  1  2  10
1  2  4   4
For the moment I am grouping by column A, then creating a value that indicates to me the rows I will keep:
a = df.groupby('A').min()
a['A'] = a.index
to_keep = [str(x[0]) + str(x[1]) for x in a[['A', 'B']].values]
df['id'] = df['A'].astype(str) + df['B'].astype(str)
df[df['id'].isin(to_keep)]
I am sure that there is a much more straightforward way to do this.
I have seen many answers here that use MultiIndex, which I would prefer to avoid.
Thank you for your help.
I feel like you're overthinking this. Just use groupby and idxmin:
df.loc[df.groupby('A').B.idxmin()]
   A  B   C
2  1  2  10
4  2  4   4
df.loc[df.groupby('A').B.idxmin()].reset_index(drop=True)
   A  B   C
0  1  2  10
1  2  4   4
Had a similar situation but with a more complex column heading (e.g. "B val") in which case this is needed:
df.loc[df.groupby('A')['B val'].idxmin()]
The accepted answer (suggesting idxmin) cannot be used with the pipe pattern. A pipe-friendly alternative is to first sort values and then use groupby with DataFrame.head:
df.sort_values('B').groupby('A').apply(pd.DataFrame.head, n=1)
This is possible because by default groupby preserves the order of rows within each group, which is stable and documented behaviour (see pandas.DataFrame.groupby).
This approach has additional benefits:
it can be easily expanded to select n rows with smallest values in specific column
it can break ties by providing another column (as a list) to .sort_values(), e.g.:
df.sort_values(['final_score', 'midterm_score']).groupby('year').apply(pd.DataFrame.head, n=1)
As with other answers, to exactly match the result desired in the question, .reset_index(drop=True) is needed, making the final snippet:
df.sort_values('B').groupby('A').apply(pd.DataFrame.head, n=1).reset_index(drop=True)
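A shorter pipe-friendly equivalent, since GroupBy has its own head method (same assumption about group ordering, using the question's df):
df.sort_values('B').groupby('A').head(1).reset_index(drop=True)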
I found an answer that is a little more wordy, but a lot more efficient.
This is the example dataset:
data = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
                     'B': [4, 5, 2, 7, 4, 6],
                     'C': [3, 4, 10, 2, 4, 6]})
data
Out:
   A  B   C
0  1  4   3
1  1  5   4
2  1  2  10
3  2  7   2
4  2  4   4
5  2  6   6
First we will get the min values on a Series from a groupby operation:
min_value = data.groupby('A').B.min()
min_value
Out:
A
1    2
2    4
Name: B, dtype: int64
Then we merge this series result onto the original data frame:
data = data.merge(min_value, on='A', suffixes=('', '_min'))
data
Out:
   A  B   C  B_min
0  1  4   3      2
1  1  5   4      2
2  1  2  10      2
3  2  7   2      4
4  2  4   4      4
5  2  6   6      4
Finally, we keep only the lines where B is equal to B_min and drop B_min, since we don't need it anymore:
data = data[data.B == data.B_min].drop('B_min', axis=1)
data
Out:
   A  B   C
2  1  2  10
4  2  4   4
I have tested it on very large datasets and this was the only way I could make it work in a reasonable time.
You can use sort_values and drop_duplicates:
df.sort_values('B').drop_duplicates('A')
Output:
   A  B   C
2  1  2  10
4  2  4   4
The solution, as written before, is:
df.loc[df.groupby('A')['B'].idxmin()]
But if you then get an error like:
"Passing list-likes to .loc or [] with any missing labels is no longer supported.
The following labels were missing: Float64Index([nan], dtype='float64').
See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"
it is probably because of NaN values in column B. In my case there were, so I used dropna() and then it worked:
df.loc[df.groupby('A')['B'].idxmin().dropna()]
You can also use boolean indexing to keep the rows where column B equals its group-wise minimum:
out = df[df['B'] == df.groupby('A')['B'].transform('min')]
print(out)
   A  B   C
2  1  2  10
4  2  4   4
