I've been browsing for an answer to my issue but I can't seem to find a suitable solution. I have a dataframe with distances (NxN cells) and I find the minimum distance of the whole dataframe with:
min_distance = distances.values.min()
Now I need to find the location (which row and which column of the dataframe) of the min_distance. Any ideas?
EDIT
Minimal code
import numpy as np
import pandas as pd
distances = []
for i in range(5):
    distances.append([])
    for j in range(5):
        distances[i].append(np.random.randint(10))
distances = pd.DataFrame(distances)
min_distance = distances.values.min()
print("Minimum=", min_distance)
print("Location of minimum value=")
It depends on what form you want your result in, but a very straightforward approach is to use stack and idxmin.
Like so:
Setup
import pandas as pd
df = pd.DataFrame([[2, 2, 2], [2, 1, 2], [2, 2, 2]],
                  columns=list('ABC'), index=list('abc'))
print(df)
   A  B  C
a  2  2  2
b  2  1  2
c  2  2  2
We should expect the min to be 1 and the location to be row b, column B.
Solution
df.stack().idxmin()
('b', 'B')
This returns a (row label, column label) tuple, which you can convert to whatever other form you need.
Generate example:
N = 4
df = pd.DataFrame(np.random.rand(N,N))
Find minimal index of flattened dataframe:
idx_min = df.values.flatten().argmin()
Simple arithmetic to get the row and column numbers back:
row = idx_min // N
column = idx_min % N
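np.unravel_index converts a flat position back into row and column positions for any array shape, which avoids doing the arithmetic by hand. A minimal sketch, assuming a small hand-written frame (the values are illustrative and not taken from the question):

```python
import numpy as np
import pandas as pd

# Illustrative data; the minimum (1) sits at row position 1, column position 2.
df = pd.DataFrame([[5, 3, 9],
                   [4, 7, 1],
                   [8, 6, 2]])

values = df.to_numpy()
# argmin works on the flattened array; unravel_index maps it back to (row, col).
row_pos, col_pos = np.unravel_index(values.argmin(), values.shape)

# Translate positions into labels in case the frame has a non-default index.
location = (df.index[row_pos], df.columns[col_pos])
print(location)
```

If the minimum occurs more than once, argmin reports only the first occurrence in row-major order.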
Related
I have a simple question in pandas.
Let's say I have the following data:
d = {'a': [1, 3, 5, 2, 10, 3, 5, 4, 2]}
df = pd.DataFrame(data=d)
df
How do I count the number of rows that lie between the minimum and the maximum value in column a? In this case the minimum is 1 and the maximum is 10, and there are 3 rows between them.
Thanks
You can use this:
diff = np.abs(df['a'].idxmin() - df['a'].idxmax()) - 1
IIUC, you could get the index of the min and max, and subtract 2:
out = len(df.loc[df['a'].idxmin():df['a'].idxmax()])-2
output: 3
If there is a chance that the max is before the min:
out = max(len(df.loc[df['a'].idxmin():df['a'].idxmax()])-2, 0)
Alternatively, if the order of the min/max does not matter:
import numpy as np
out = np.ptp(df.reset_index()['a'].agg(['idxmin', 'idxmax']))-1
update: index of the 2nd largest/smallest:
# second largest
df['a'].nlargest(2).idxmin()
# 2
# second smallest
df['a'].nsmallest(2).idxmax()
# 3
Since they are numbers, we can drop down into numpy and compute the result:
arr = df.a.to_numpy()
np.abs(np.argmax(arr) - np.argmin(arr)) - 1
3
np.abs(df.loc[df.a==df.a.min()].index-df.loc[df.a==df.a.max()].index)-1
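The positional approach can be wrapped in a small helper that works regardless of which extreme comes first. This is only a sketch: rows_between_extremes is a hypothetical name, and it assumes that only the first occurrence of the minimum and of the maximum matters.

```python
import pandas as pd

def rows_between_extremes(series):
    """Count rows strictly between the first minimum and the first maximum."""
    values = series.to_numpy()
    # Positions, not labels, so the result does not depend on the index.
    gap = abs(int(values.argmax()) - int(values.argmin())) - 1
    # A constant series has min and max at the same position; report 0 then.
    return max(gap, 0)

s = pd.Series([1, 3, 5, 2, 10, 3, 5, 4, 2])
print(rows_between_extremes(s))        # min before max
print(rows_between_extremes(s[::-1]))  # max before min
```

Working on positions rather than index labels keeps the count correct even when the index is not the default 0..n-1 range.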
So I want to multiply each row of a dataframe with a multiplier vector, and I am managing, but it looks ugly. Can this be improved?
import pandas as pd
import numpy as np
# original data
df_a = pd.DataFrame([[1,2,3],[4,5,6]])
print(df_a, '\n')
# multiplier vector
df_b = pd.DataFrame([2,2,1])
print(df_b, '\n')
# multiply by a list - it works
df_c = df_a*[2,2,1]
print(df_c, '\n')
# multiply by the dataframe - it works
df_c = df_a*df_b.T.to_numpy()
print(df_c, '\n')
"It looks ugly" is subjective. That said, to multiply all rows of a dataframe by something else you need one of the following:
a dataframe of a compatible shape and with compatible indices (pandas aligns indices before operating, which is why df_a*df_b.T would only produce values for the common index: 0)
a 1D vector, which in pandas is a Series
Using a Series:
df_a*df_b[0]
output:
   0   1  2
0  2   4  3
1  8  10  6
Of course, it is better to define a Series directly if you don't really need a 2D container:
s = pd.Series([2,2,1])
df_a*s
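When the direction of the multiplication should be explicit, DataFrame.mul takes an axis argument. A short sketch, where col_factors and row_factors are illustrative names and values that do not appear in the question:

```python
import pandas as pd

df_a = pd.DataFrame([[1, 2, 3], [4, 5, 6]])

col_factors = pd.Series([2, 2, 1])   # one factor per column (labels 0, 1, 2)
row_factors = pd.Series([10, 100])   # one factor per row (labels 0, 1)

# Series labels are matched against the column labels.
by_column = df_a.mul(col_factors, axis="columns")
# Series labels are matched against the row index instead.
by_row = df_a.mul(row_factors, axis="index")

print(by_column)
print(by_row)
```

In both cases the Series is aligned by label, so labels missing on either side produce NaN rather than an error.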
Just for the beauty, you can use Einstein summation:
>>> np.einsum('ij,ji->ij', df_a, df_b)
array([[ 2,  4,  3],
       [ 8, 10,  6]])
I have a dataframe (used_dataframe) that contains duplicates. I need to create a list that contains the indices of those duplicates.
For this I used a function I found here:
Find indices of duplicate rows in pandas DataFrame
def duplicates(x):
    #dataframe = pd.read_csv(x)
    #df = dataframe.iloc[: , 1:]
    df = x
    duplicateRowsDF = df[df.duplicated()]
    df = df[df.duplicated(keep=False)]
    tuppl = df.groupby(list(df)).apply(lambda x: tuple(x.index)).tolist() #this is the function!
    n = 1 # N. . .
    indicees = [x[n] for x in tuppl]
    return indicees
duplicates(used_df)
The next function I need is one that removes the duplicates from the dataset, which I did like this:
def handling_duplicate_entries(mn):
    x = tidy(mn)
    indices = duplicates(tidy(mn))
    used_df = x
    used_df['indexcol'] = range(0, len(tidy(mn)))
    dropped = used_df[~used_df['indexcol'].isin(indices)]
    finito = dropped.drop(columns=['indexcol'])
    return finito
handling_duplicate_entries(used_df)
And it works. But when I check my solution (to verify that all duplicates have been removed) with duplicates(handling_duplicate_entries(used_df)), which should return an empty dataframe to show that there are no duplicates, it raises the error 'DataFrame' object has no attribute 'tolist'.
In the question of the link above, this has also been added as a comment but not solved - and to be quite frank I would love to find a different solution for the duplicates function because I don't quite understand it but so far I haven't.
Ok. I'll try to do my best.
So if you are trying to find the duplicate indices, and want to store those values in a list you can use the following code. Also I have included a small example to create a dataframe containing the duplicated values (original), and the data without any duplicated data.
import pandas as pd
# Toy dataset
data = {
    'A': [0, 0, 3, 0, 3, 0],
    'B': [0, 1, 3, 2, 3, 0],
    'C': [0, 1, 3, 2, 3, 0]
}
df = pd.DataFrame(data)
group = df.groupby(list(df.columns)).size()
group = group[group>1].reset_index(name = 'count')
group = group.drop(columns=['count']).reset_index().rename(columns={'index':'count'})
idxs = df.reset_index().merge(group, how = 'right')['index'].values
duplicates = df.loc[idxs]
no_duplicates = df.loc[~df.index.isin(idxs)]
duplicates
   A  B  C
0  0  0  0
5  0  0  0
2  3  3  3
4  3  3  3
no_duplicates
   A  B  C
1  0  1  1
3  0  2  2
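If the goal is to keep one copy of each duplicated row, pandas' built-in duplicated and drop_duplicates may be enough on their own. The sketch below reuses the toy dataset; note that, unlike the code above, it keeps the first occurrence of each group instead of removing every copy, and the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [0, 0, 3, 0, 3, 0],
    'B': [0, 1, 3, 2, 3, 0],
    'C': [0, 1, 3, 2, 3, 0],
})

# Index labels of every row that repeats an earlier row.
repeat_idx = df.index[df.duplicated(keep='first')].tolist()

# Keep the first occurrence of each row and drop the repeats.
deduplicated = df.drop_duplicates(keep='first')

# Re-running the check on the cleaned frame gives an empty list, not an error.
remaining = deduplicated.index[deduplicated.duplicated()].tolist()

print(repeat_idx)
print(remaining)
```

Because the index is filtered with a boolean mask, the check still returns a plain list when there are no duplicates at all.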
So, I have the following sample dataframe (included only one row for clarity/simplicity):
df = pd.DataFrame({'base_number': [2],
'std_dev': [1]})
df['amount_needed'] = 5
df['upper_bound'] = df['base_number'] + df['std_dev']
df['lower_bound'] = df['base_number'] - df['std_dev']
For each row, I would like to generate as many new rows as the number given by df['amount_needed'] (so 5, in this example). I would like those 5 new rows to be spread evenly across the range given by df['lower_bound'] and df['upper_bound']. So for the example above, I would like the following result as an output:
df_new = pd.DataFrame({'base_number': [1, 1.5, 2, 2.5, 3]})
Of course, this process will be done for all rows in a much larger dataframe, with many other columns which aren't relevant to this particular issue, which is why I'm trying to find a way to automate this process.
One row of df will create one series (or one data frame). Here's one way to iterate over df and create the series with the values you specified:
for row in df.itertuples():
    arr = np.linspace(row.lower_bound,
                      row.upper_bound,
                      row.amount_needed)
    s = pd.Series(arr).rename('base_number')
    print(s)
0    1.0
1    1.5
2    2.0
3    2.5
4    3.0
Name: base_number, dtype: float64
Ended up using jsmart's contribution and working on it to generate a new dataframe, conserving original id's in order to merge the other columns from the old one onto this new one according to id as needed (whole process shown below):
import itertools
import numpy as np
import pandas as pd

amount_needed = 5
df = pd.DataFrame({'base_number': [2, 4, 8, 0],
                   'std_dev': [1, 2, 3, 0]})
df['amount_needed'] = amount_needed
df['upper_bound'] = df['base_number'] + df['std_dev']
df['lower_bound'] = df['base_number'] - df['std_dev']
# Start from an empty float series, since np.linspace returns floats
s1 = pd.Series([], dtype=float)
for row in df.itertuples():
    arr = np.linspace(row.lower_bound,
                      row.upper_bound,
                      row.amount_needed)
    s = pd.Series(arr).rename('base_number')
    s1 = pd.concat([s1, s])
df_new = pd.DataFrame({'base_number': s1})
# Repeat each original id (1-based) amount_needed times, in row order
ids_og = list(range(1, len(df) + 1))
ids_og = [ids_og] * amount_needed
ids_og = sorted(list(itertools.chain.from_iterable(ids_og)))
df_new['id'] = ids_og
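When amount_needed is the same for every row, as it is here, the loop can probably be replaced by a single np.linspace call, because np.linspace accepts arrays for its start and stop arguments. A sketch under that assumption (lower, upper and grid are illustrative names):

```python
import numpy as np
import pandas as pd

amount_needed = 5
df = pd.DataFrame({'base_number': [2, 4, 8, 0],
                   'std_dev': [1, 2, 3, 0]})

lower = (df['base_number'] - df['std_dev']).to_numpy()
upper = (df['base_number'] + df['std_dev']).to_numpy()

# Shape (len(df), amount_needed): one row of evenly spaced values per input row.
grid = np.linspace(lower, upper, amount_needed, axis=1)

df_new = pd.DataFrame({
    'base_number': grid.ravel(),
    # 1-based id of the source row, repeated once per generated value.
    'id': np.repeat(np.arange(1, len(df) + 1), amount_needed),
})
print(df_new.head(10))
```

This does not cover rows with different amount_needed values; for that case the per-row loop is still needed.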
I have a pandas dataframe that is used for a heatmap. I would like the minimal value of each column to be along the diagonal.
I've sorted the columns using
data = data.loc[:, data.min().sort_values().index]
This works. Now I just need to sort the values such that the index of the min value in the first column is row 0, then the min value of second column is row 1, and so on.
Example
import seaborn as sns
import pandas as pd
data = [[5,1,9],
[7,8,6],
[5,3,2]]
data = pd.DataFrame(data)
#sns.heatmap(data)
data = data.loc[:, data.min().sort_values().index]
#sns.heatmap(data) # Gives result in step 1
# Step 1, columns sorted by min value: 1, 2, 5
data = [[1,9,5],
[8,6,7],
[3,2,5]]
data = pd.DataFrame(data)
#sns.heatmap(data)
# How do I perform step two, maintaining column order?
# Step 2, rows sorted by min value: 1, 2, 7
data = [[1,9,5],
[3,2,5],
[8,6,7]]
data = pd.DataFrame(data)
sns.heatmap(data)
Is this possible in pandas in a clever way?
Setup
data = pd.DataFrame([[5, 1, 9], [7, 8, 6], [5, 3, 2]])
You can accomplish this by using argsort of the diagonal elements of your sorted DataFrame, then indexing the DataFrame using these values.
Step 1
Use your initial sort:
data = data.loc[:, data.min().sort_values().index]
   1  2  0
0  1  9  5
1  8  6  7
2  3  2  5
Step 2
Use np.argsort with np.diag:
data.iloc[np.argsort(np.diag(data))]
   1  2  0
0  1  9  5
2  3  2  5
1  8  6  7
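Both steps can be combined into one helper. This is a sketch only: order_for_diagonal is a hypothetical name, and the function simply reproduces the two steps, so it orders rows by the current diagonal and does not guarantee that every column minimum ends up on the diagonal.

```python
import numpy as np
import pandas as pd

def order_for_diagonal(frame):
    # Step 1: order columns by their minimum value.
    by_col = frame.loc[:, frame.min().sort_values().index]
    # Step 2: order rows by the diagonal of the column-sorted frame.
    row_order = np.argsort(np.diag(by_col.to_numpy()))
    return by_col.iloc[row_order]

data = pd.DataFrame([[5, 1, 9], [7, 8, 6], [5, 3, 2]])
result = order_for_diagonal(data)
print(result)
```

The original row and column labels are preserved, so the heatmap axes still show which row and column each cell came from.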
I'm not quite sure this is what you want, but you've already done the following to sort the columns:
data = data.loc[:, data.min().sort_values().index]
The same trick can also be applied to sort the rows:
data = data.loc[data.min(axis=1).sort_values().index, :]
To move some values around so that the min value within each column is placed along the diagonal you could try something like this:
for i in range(len(data)):
    min_index = data.iloc[:, i].idxmin()
    if data.iloc[i,i] != data.iloc[min_index, i]:
        data.iloc[i,i], data.iloc[min_index,i] = data.iloc[min_index, i], data.iloc[i,i]
Basically just swap the min with the diagonal.