Sorting pandas dataframe to get min value along diagonal - python

I have a pandas DataFrame that is used for a heatmap. I would like the minimal value of each column to lie along the diagonal.
I've sorted the columns using
data = data.loc[:, data.min().sort_values().index]
This works. Now I just need to sort the rows such that the min value of the first column is in row 0, the min value of the second column is in row 1, and so on.
Example
import seaborn as sns
import pandas as pd
data = [[5, 1, 9],
        [7, 8, 6],
        [5, 3, 2]]
data = pd.DataFrame(data)
#sns.heatmap(data)
data = data.loc[:, data.min().sort_values().index]
#sns.heatmap(data)  # Gives result in step 1
# Step 1: columns sorted by min value: 1, 2, 5
data = [[1, 9, 5],
        [8, 6, 7],
        [3, 2, 5]]
data = pd.DataFrame(data)
#sns.heatmap(data)
# How do I perform step two, maintaining column order?
# Step 2: rows sorted by min value: 1, 2, 7
data = [[1, 9, 5],
        [3, 2, 5],
        [8, 6, 7]]
data = pd.DataFrame(data)
sns.heatmap(data)
Is this possible in pandas in a clever way?

Setup
data = pd.DataFrame([[5, 1, 9], [7, 8, 6], [5, 3, 2]])
You can accomplish this by using argsort of the diagonal elements of your sorted DataFrame, then indexing the DataFrame using these values.
Step 1
Use your initial sort:
data = data.loc[:, data.min().sort_values().index]
   1  2  0
0  1  9  5
1  8  6  7
2  3  2  5
Step 2
Use np.argsort with np.diag:
import numpy as np
data.iloc[np.argsort(np.diag(data))]
   1  2  0
0  1  9  5
2  3  2  5
1  8  6  7
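Putting the two steps together on the original data, a minimal end-to-end sketch:
import numpy as np
import pandas as pd

data = pd.DataFrame([[5, 1, 9], [7, 8, 6], [5, 3, 2]])

# Step 1: order the columns by their minimum value
data = data.loc[:, data.min().sort_values().index]

# Step 2: reorder the rows by the argsort of the current diagonal
data = data.iloc[np.argsort(np.diag(data))]
This reproduces the frame shown above, with the smallest diagonal values first.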

If I understand correctly, you've already done the following to sort the columns:
data = data.loc[:, data.min().sort_values().index]
The same trick can also be applied to sort the rows:
data = data.loc[data.min(axis=1).sort_values().index, :]

To move some values around so that the min value within each column is placed along the diagonal, you could try something like this:
for i in range(len(data)):
    min_index = data.iloc[:, i].idxmin()   # label of the column minimum (labels and positions coincide here)
    if data.iloc[i, i] != data.iloc[min_index, i]:
        data.iloc[i, i], data.iloc[min_index, i] = data.iloc[min_index, i], data.iloc[i, i]
Basically just swap the min with the diagonal.
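On the column-sorted example, this swap-based approach puts each column's minimum (1, 2, 5) on the diagonal by moving individual cells rather than reordering whole rows (a quick check):
import pandas as pd

data = pd.DataFrame([[5, 1, 9], [7, 8, 6], [5, 3, 2]])
data = data.loc[:, data.min().sort_values().index]
for i in range(len(data)):
    min_index = data.iloc[:, i].idxmin()
    if data.iloc[i, i] != data.iloc[min_index, i]:
        data.iloc[i, i], data.iloc[min_index, i] = data.iloc[min_index, i], data.iloc[i, i]
which gives:
   1  2  0
0  1  9  5
1  8  2  7
2  3  6  5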

Related

Creating a DataFrame from a dictionary of Series results in lost indices and NaNs

dict_with_series = {'Even':pd.Series([2,4,6,8,10]),'Odd':pd.Series([1,3,5,7,9])}
Data_frame_using_dic_Series = pd.DataFrame(dict_with_series)
# Data_frame_using_dic_Series = pd.DataFrame(dict_with_series, index=[1,2,3,4,5]) gives a NaN value, I don't know why
display(Data_frame_using_dic_Series)
I tried labeling the index, but when I did, it eliminates the first row and instead prints an extra row at the bottom with NaN values. Can anyone explain why it behaves like this? Have I done something wrong?
If I don't use the index labeling argument, it works fine.
When you run:
Data_frame_using_dic_Series = pd.DataFrame(dict_with_series,index=[1,2,3,4,5])
You request to use only the labels 1-5 from the provided Series, but a Series is indexed from 0 by default, so the constructor reindexes: label 0 is dropped and label 5, which does not exist in the original Series, is filled with NaN.
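The same alignment can be seen on a single Series: reindexing the default 0-4 labels with 1-5 drops the value at label 0 and introduces a NaN at label 5 (a small illustration):
import pandas as pd

s = pd.Series([2, 4, 6, 8, 10])   # default index 0..4
print(s.reindex([1, 2, 3, 4, 5]))
which gives:
1     4.0
2     6.0
3     8.0
4    10.0
5     NaN
dtype: float64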
If you want to change the index, do it afterwards:
Data_frame_using_dic_Series = (
    pd.DataFrame(dict_with_series)
      .set_axis([1, 2, 3, 4, 5])
)
Output:
   Even  Odd
1     2    1
2     4    3
3     6    5
4     8    7
5    10    9
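Alternatively, a sketch of the same fix: build the frame first and then overwrite the index labels in place.
Data_frame_using_dic_Series = pd.DataFrame(dict_with_series)
Data_frame_using_dic_Series.index = [1, 2, 3, 4, 5]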

Counting number of rows between min and max in pandas

I have a simple question in pandas.
Let's say I have the following data:
d = {'a': [1, 3, 5, 2, 10, 3, 5, 4, 2]}
df = pd.DataFrame(data=d)
df
How do I count the number of rows that lie between the minimum and maximum values in column a? In this particular case, the rows between 1 and 10, so the answer is 3.
Thanks
You can use this:
import numpy as np
diff = np.abs(df['a'].idxmin() - df['a'].idxmax()) - 1
IIUC, you could get the index of the min and max, and subtract 2:
out = len(df.loc[df['a'].idxmin():df['a'].idxmax()])-2
output: 3
If there is a chance that the max comes before the min:
out = max(len(df.loc[df['a'].idxmin():df['a'].idxmax()])-2, 0)
Alternatively, if the order of the min/max does not matter:
import numpy as np
out = np.ptp(df.reset_index()['a'].agg(['idxmin', 'idxmax']))-1
Update: index of the 2nd largest/smallest:
# second largest
df['a'].nlargest(2).idxmin()
# 2
# second smallest
df['a'].nsmallest(2).idxmax()
# 3
Since they are numbers, we can drop down into numpy and compute the result:
arr = df.a.to_numpy()
np.abs(np.argmax(arr) - np.argmin(arr)) - 1
3
np.abs(df.loc[df.a == df.a.min()].index - df.loc[df.a == df.a.max()].index) - 1
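Pulling the numpy idea into a small helper (a sketch; the function name is mine, and it counts the rows strictly between the first occurrences of the min and max):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 3, 5, 2, 10, 3, 5, 4, 2]})

def rows_between_min_and_max(s):
    # Positional distance between argmin and argmax, excluding both endpoints
    arr = s.to_numpy()
    return max(abs(int(np.argmax(arr)) - int(np.argmin(arr))) - 1, 0)

rows_between_min_and_max(df['a'])  # 3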

Removing duplicates in dataframe via creating a list of their indices pandas

I have a dataframe (used_dataframe) that contains duplicates. I am required to create a list that contains the indices of those duplicates.
For this I used a function I found here:
Find indices of duplicate rows in pandas DataFrame
def duplicates(x):
    #dataframe = pd.read_csv(x)
    #df = dataframe.iloc[: , 1:]
    df = x
    duplicateRowsDF = df[df.duplicated()]
    df = df[df.duplicated(keep=False)]
    tuppl = df.groupby(list(df)).apply(lambda x: tuple(x.index)).tolist()  # this is the function!
    n = 1
    indicees = [x[n] for x in tuppl]
    return indicees
duplicates(used_df)
The next function I need is one where I remove the duplicates from the dataset, which I did like this:
def handling_duplicate_entries(mn):
    x = tidy(mn)
    indices = duplicates(tidy(mn))
    used_df = x
    used_df['indexcol'] = range(0, len(tidy(mn)))
    dropped = used_df[~used_df['indexcol'].isin(indices)]
    finito = dropped.drop(columns=['indexcol'])
    return finito
handling_duplicate_entries(used_df)
And it works - but when I want to check my solution (to assess that all duplicates have been removed), which I do by calling duplicates(handling_duplicate_entries(used_df)) and which should return an empty result to show that there are no duplicates, it returns the error 'DataFrame' object has no attribute 'tolist'.
In the question linked above this has also been raised as a comment but not solved. To be quite frank, I would love to find a different solution for the duplicates function because I don't quite understand it, but so far I haven't found one.
OK, I'll try to do my best.
If you are trying to find the duplicate indices and want to store those values in a list, you can use the following code. I have also included a small example that builds a dataframe containing the duplicated rows (duplicates) and one with the duplicates removed (no_duplicates).
import pandas as pd

# Toy dataset
data = {
    'A': [0, 0, 3, 0, 3, 0],
    'B': [0, 1, 3, 2, 3, 0],
    'C': [0, 1, 3, 2, 3, 0]
}
df = pd.DataFrame(data)

group = df.groupby(list(df.columns)).size()
group = group[group > 1].reset_index(name='count')
group = group.drop(columns=['count']).reset_index().rename(columns={'index': 'count'})
idxs = df.reset_index().merge(group, how='right')['index'].values
duplicates = df.loc[idxs]
no_duplicates = df.loc[~df.index.isin(idxs)]
duplicates
   A  B  C
0  0  0  0
5  0  0  0
2  3  3  3
4  3  3  3
no_duplicates
   A  B  C
1  0  1  1
3  0  2  2
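If you only need the indices of the repeated rows (keeping the first occurrence) and a de-duplicated frame, the built-in duplicated/drop_duplicates methods cover both directly; a minimal sketch on the same toy data:
import pandas as pd

df = pd.DataFrame({'A': [0, 0, 3, 0, 3, 0],
                   'B': [0, 1, 3, 2, 3, 0],
                   'C': [0, 1, 3, 2, 3, 0]})

dup_indices = df.index[df.duplicated()].tolist()   # [4, 5]
no_duplicates = df.drop_duplicates()               # keeps rows 0, 1, 2, 3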

How to avoid iterrows for this pandas dataframe processing

I need some help in converting the following code to a more efficient one without using iterrows().
for index, row in df.iterrows():
    alist = row['index_vec'].strip("[] ").split(",")
    blist = [int(i) for i in alist]
    for col in blist:
        df.loc[index, str(col)] = df.loc[index, str(col)] + 1
The above code reads the string in the 'index_vec' column, parses it into integers, and then increments the corresponding column by one for each integer. Take the 0th row as an example: its string value is "[370, 370, -1]", so the code increments column "370" by 2 and column "-1" by 1.
Using iterrows() is very slow on a large dataframe. I'd like some help speeding it up. Thank you.
You can also use apply with axis=1 to go row-wise, passing a custom function:
Example starting df:
       index_vec  1201  370  -1
0  [370, -1, -1]     0    0   1
1   [1201, 1201]     0    1   1
import pandas as pd

df = pd.DataFrame({'index_vec': ["[370, -1, -1]", "[1201, 1201]"],
                   '1201': [0, 0], '370': [0, 1], '-1': [1, 1]})

def add_counts(x):
    counts = pd.Series(x['index_vec'].strip("[]").split(", ")).value_counts()
    x[counts.index] = x[counts.index] + counts
    return x

df = df.apply(add_counts, axis=1)
print(df)
Outputs:
       index_vec  1201  370  -1
0  [370, -1, -1]     0    1   3
1   [1201, 1201]     2    1   1
Let us do
a = df['index_vec'].str.strip("[] ").str.split(",").explode().str.strip()
s = pd.crosstab(a.index, a)
df[s.columns] = df[s.columns].add(s, fill_value=0).astype(int)
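Applied to the same starting frame, this matches the apply-based output above (a quick check):
print(df)
which gives:
       index_vec  1201  370  -1
0  [370, -1, -1]     0    1   3
1   [1201, 1201]     2    1   1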

Python pandas: Return indices of all rows like another row

Suppose we have a toy example like below.
import numpy as np
import pandas as pd

np.random.seed(seed=1)
df = pd.DataFrame(np.random.randint(low=0,
                                    high=2,
                                    size=(5, 2)))
df
   0  1
0  1  1
1  0  0
2  1  1
3  1  1
4  1  0
We want to return the indices of all rows like a certain row. Suppose I want the indices of all rows like row 0, which has a 1 in both column 0 and column 1.
I would want a data structure that has: (0, 2, 3).
I think you can do it like this
df.index[df.eq(df.iloc[0]).all(1)].tolist()
[0, 2, 3]
One way may be to use lambda:
df.index[df.apply(lambda row: all(row == df.iloc[0]), axis=1)].tolist()
Another way may be to use a mask:
df.index[df[df == df.iloc[0].values].notnull().all(axis=1)].tolist()
Result:
[0, 2, 3]
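If speed matters, one more option is to compare against the underlying numpy array and read the matching positions off with np.where (a sketch; it returns positional indices, which coincide with the labels here):
import numpy as np
import pandas as pd

np.random.seed(seed=1)
df = pd.DataFrame(np.random.randint(low=0, high=2, size=(5, 2)))

# Compare every row against row 0 on the raw array, keep positions where all columns match
matches = np.where((df.to_numpy() == df.iloc[0].to_numpy()).all(axis=1))[0]
matches.tolist()  # [0, 2, 3]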
