Counting number of rows between min and max in pandas - python

I have a simple question in pandas.
Let's say I have the following data:
d = {'a': [1, 3, 5, 2, 10, 3, 5, 4, 2]}
df = pd.DataFrame(data=d)
df
How do I count the number of rows that lie between the minimum and maximum values in column a? In this particular case, that is the number of rows between 1 and 10, which is 3.
Thanks

You can use this:
diff = np.abs(df['a'].idxmin() - df['a'].idxmax()) - 1

IIUC, you could get the index of the min and max, and subtract 2:
out = len(df.loc[df['a'].idxmin():df['a'].idxmax()])-2
output: 3
If there is a chance that the max is before the min:
out = max(len(df.loc[df['a'].idxmin():df['a'].idxmax()])-2, 0)
Alternatively, if the order of the min/max does not matter:
import numpy as np
out = np.ptp(df.reset_index()['a'].agg(['idxmin', 'idxmax']))-1
update: index of the 2nd largest/smallest:
# second largest
df['a'].nlargest(2).idxmin()
# 2
# second smallest
df['a'].nsmallest(2).idxmax()
# 3

Since they are numbers, we can drop down into numpy and compute the result:
arr = df.a.to_numpy()
np.abs(np.argmax(arr) - np.argmin(arr)) - 1
3
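The positional idea above can be wrapped in a small helper. This is a minimal sketch, not code from any of the answers: the function name rows_between_extremes and the letter index are made up for illustration. It works on positions rather than index labels, so it does not assume a default RangeIndex, and it clamps the result at 0 for the case where the min and max fall on the same row.

```python
import numpy as np
import pandas as pd

def rows_between_extremes(s):
    """Count rows strictly between the first min and the first max of a Series."""
    arr = s.to_numpy()
    # argmin/argmax return positions, so index labels do not matter
    gap = abs(int(np.argmax(arr)) - int(np.argmin(arr))) - 1
    # a constant Series has min and max at the same position, giving -1
    return max(gap, 0)

# Hypothetical non-numeric index to show the result does not depend on labels
df = pd.DataFrame({'a': [1, 3, 5, 2, 10, 3, 5, 4, 2]}, index=list('abcdefghi'))
count = rows_between_extremes(df['a'])
print(count)
```

With ties, argmin and argmax pick the first occurrence, which matches the behaviour of idxmin and idxmax.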

np.abs(df.loc[df.a==df.a.min()].index-df.loc[df.a==df.a.max()].index)-1

Related

Removing duplicates in dataframe via creating a list of their indices pandas

I have a dataframe (=used_dataframe) that contains duplicates. I need to create a list that contains the indices of those duplicates.
For this I used a function I found here:
Find indices of duplicate rows in pandas DataFrame
def duplicates(x):
    #dataframe = pd.read_csv(x)
    #df = dataframe.iloc[: , 1:]
    df = x
    duplicateRowsDF = df[df.duplicated()]
    df = df[df.duplicated(keep=False)]
    tuppl = df.groupby(list(df)).apply(lambda x: tuple(x.index)).tolist() #this is the function!
    n = 1 # N. . .
    indicees = [x[n] for x in tuppl]
    return indicees
duplicates(used_df)
The next function I need removes the duplicates from the dataset, which I did like this:
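# Assumed signature: the def line is missing from the post as captured.
def handling_duplicate_entries(mn):
    x = tidy(mn)
    indices = duplicates(tidy(mn))
    used_df = x
    used_df['indexcol'] = range(0, len(tidy(mn)))
    dropped = used_df[~used_df['indexcol'].isin(indices)]
    finito = dropped.drop(columns=['indexcol'])
    return finito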
handling_duplicate_entries(used_df)
And it works. But when I want to check my solution (to verify that all duplicates have been removed), which I do with duplicates(handling_duplicate_entries(used_df)) and which should return an empty dataframe to show that there are no duplicates, it returns the error 'DataFrame' object has no attribute 'tolist'.
In the question linked above, this was also raised in a comment but not solved. To be quite frank, I would love to find a different solution for the duplicates function because I don't quite understand it, but so far I haven't.
Ok. I'll try to do my best.
So if you are trying to find the duplicate indices and want to store those values in a list, you can use the following code. I have also included a small example that builds one dataframe containing the duplicated rows and another containing the rows without any duplicates.
import pandas as pd
# Toy dataset
data = {
'A': [0, 0, 3, 0, 3, 0],
'B': [0, 1, 3, 2, 3, 0],
'C': [0, 1, 3, 2, 3, 0]
}
df = pd.DataFrame(data)
group = df.groupby(list(df.columns)).size()
group = group[group>1].reset_index(name = 'count')
group = group.drop(columns=['count']).reset_index().rename(columns={'index':'count'})
idxs = df.reset_index().merge(group, how = 'right')['index'].values
duplicates = df.loc[idxs]
no_duplicates = df.loc[~df.index.isin(idxs)]
duplicates
   A  B  C
0  0  0  0
5  0  0  0
2  3  3  3
4  3  3  3
no_duplicates
   A  B  C
1  0  1  1
3  0  2  2
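A shorter route to the same index lists is pandas' built-in DataFrame.duplicated. The sketch below reuses the toy dataset from this answer; the variable names all_dup_idx, repeat_idx and deduped are made up for illustration. It returns plain lists, so an input without duplicates simply yields an empty list.

```python
import pandas as pd

# Same toy dataset as above
df = pd.DataFrame({
    'A': [0, 0, 3, 0, 3, 0],
    'B': [0, 1, 3, 2, 3, 0],
    'C': [0, 1, 3, 2, 3, 0],
})

# keep=False flags every row that belongs to a group of identical rows
all_dup_idx = df.index[df.duplicated(keep=False)].tolist()

# the default keep='first' flags only the repeats, not the first occurrence
repeat_idx = df.index[df.duplicated()].tolist()

# drop the repeats, keeping the first occurrence of each row
deduped = df.drop_duplicates()

print(all_dup_idx)
print(repeat_idx)
print(deduped.duplicated().any())
```

Note the difference from no_duplicates above: drop_duplicates keeps one copy of each duplicated row, whereas no_duplicates discards every row that has a duplicate.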

Element-wise division with accumulated numbers in Python?

The title may come across as confusing (honestly, not quite sure how to summarize it in a sentence), so here is a much better explanation:
I'm currently handling a dataFrame A regarding different attributes, and I used a .groupby[].count() function on a data column age to create a list of occurrences:
A_sub = A.groupby(['age'])['age'].count()
A_sub returns a Series similar to the following (the values are randomly modified):
age
1 316
2 249
3 221
4 219
5 262
...
59 1
61 2
65 1
70 1
80 1
Name: age, dtype: int64
I would like to plot a list of values from element-wise division. The division I would like to perform is an element's value divided by the sum of all the elements whose index is greater than or equal to that element's index. In other words, for example, for age 3, it should return
221/(221+219+262+...+1+2+1+1+1)
The same calculation should apply to all the elements. Ideally, the outcome should be in the similar type/format so that it can be plotted.
Here is a quick example using numpy. A similar approach can be used with pandas. The for loop can most likely be replaced by something smarter and more efficient to compute the coefficients.
import numpy as np
ages = np.asarray([316, 249, 221, 219, 262])
coefficients = np.zeros(ages.shape)
for k, a in enumerate(ages):
    coefficients[k] = sum(ages[k:])
output = ages / coefficients
Output:
array([0.24940805, 0.26182965, 0.31481481, 0.45530146, 1. ])
EDIT: The initialization of coefficients to zeros and the for loop can be replaced with:
coefficients = np.flip(np.cumsum(np.flip(ages)))
You can use the function cumsum() in pandas to get accumulated sums:
A_sub = A['age'].value_counts().sort_index(ascending=False)
(A_sub / A_sub.cumsum()).iloc[::-1]
No reason to use numpy, pandas already includes everything we need.
A_sub seems to return a Series where age is the index. That's not ideal, but it should be fine. The code below therefore operates on a Series, but can easily be modified to work with DataFrames.
import numpy as np  # only used to generate the random sample data
import pandas as pd
s = pd.Series(data=np.random.randint(low=1, high=10, size=10), index=[0, 1, 3, 4, 5, 8, 9, 10, 11, 13], name="age")
print(s)
res = s / s[::-1].cumsum()[::-1]
res = res.rename("cumsum div")
I saw your comment about missing ages in the index. Here is how you would add the missing indexes in the range from min to max index, and then perform the division.
import numpy as np  # only used to generate the random sample data
import pandas as pd
s = pd.Series(data=np.random.randint(low=1, high=10, size=10), index=[0, 1, 3, 4, 5, 8, 9, 10, 11, 13], name="age")
s_all_idx = s.reindex(index=range(s.index.min(), s.index.max() + 1), fill_value=0)
print(s_all_idx)
res = s_all_idx / s_all_idx[::-1].cumsum()[::-1]
res = res.rename("all idx cumsum div")
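As a quick check of the reversed cumulative sum, here is a small sketch that uses the first five counts from the question instead of random data. The names counts, tail_sums and ratio are made up for illustration.

```python
import pandas as pd

# First five counts from the question, indexed by age
counts = pd.Series([316, 249, 221, 219, 262], index=[1, 2, 3, 4, 5], name="age")

# Reverse, accumulate, reverse back: each entry is the sum of itself and everything after it
tail_sums = counts[::-1].cumsum()[::-1]

# Division aligns on the index, so each age is divided by its own tail sum
ratio = counts / tail_sums

print(tail_sums.tolist())
print(ratio)
```

The resulting values agree with the numpy output shown earlier for the same five numbers, and the result is still a Series indexed by age, so it can be plotted directly.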

Find consecutive index pandas

I have a dataframe-
cols = ['hops','frequency']
data = [[-13,2],[-8,2],[-5,1],[0,2],[2,1],[4,1],[7,1]]
data = np.asarray(data)
indices = np.arange(0,len(data))
df= pd.DataFrame(data, index=indices, columns=cols)
Now I want to check whether the indices of the rows with the maximum frequency are consecutive.
For example, here the max frequency is 2 and the indices having it are 0, 1, 3. We need to check whether all of these are consecutive. In this case they are not, as the indices would have to be 0, 1, 2 to be consecutive.
Break your logic into parts and you will find constructing a solution easier.
First calculate the indices using Boolean indexing:
idx = df.index[df['frequency'] == df['frequency'].max()]
# Int64Index([0, 1, 3], dtype='int64')
Then calculate the differences between consecutive values:
diffs = np.diff(idx)
# array([1, 2], dtype=int64)
Finally, check if all the differences are equal to 1:
diff_one_check = (diffs == 1).all()
# False
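The three steps can be combined into one helper. This is only a sketch: the function name max_positions_consecutive is made up, and it checks row positions (via np.flatnonzero) rather than index labels, which gives the same answer for the default integer index used in the question.

```python
import numpy as np
import pandas as pd

def max_positions_consecutive(df, col):
    """Return True if the rows holding the column maximum sit next to each other."""
    mask = (df[col] == df[col].max()).to_numpy()
    positions = np.flatnonzero(mask)          # row positions of the maximum
    # a single maximum yields an empty diff, for which all() is True
    return bool((np.diff(positions) == 1).all())

df = pd.DataFrame({
    'hops': [-13, -8, -5, 0, 2, 4, 7],
    'frequency': [2, 2, 1, 2, 1, 1, 1],
})
result = max_positions_consecutive(df, 'frequency')
print(result)
```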

Sorting pandas dataframe to get min value along diagonal

I have a pandas dataframe that is used for a heatmap. I would like the minimal value of each column to be along the diagonal.
I've sorted the columns using
data = data.loc[:, data.min().sort_values().index]
This works. Now I just need to sort the values such that the index of the min value in the first column is row 0, then the min value of second column is row 1, and so on.
Example
import seaborn as sns
import pandas as pd
data = [[5,1,9],
[7,8,6],
[5,3,2]]
data = pd.DataFrame(data)
#sns.heatmap(data)
data = data.loc[:, data.min().sort_values().index]
#sns.heatmap(data) # Gives result in step 1
# Step 1, columns sorted by min value: 1, 2, 5
data = [[1,9,5],
[8,6,7],
[3,2,5]]
data = pd.DataFrame(data)
#sns.heatmap(data)
# How do I perform step two, maintaining column order?
# Step 2, rows sorted so the diagonal reads 1, 2, 7
data = [[1,9,5],
[3,2,5],
[8,6,7]]
data = pd.DataFrame(data)
sns.heatmap(data)
Is this possible in pandas in a clever way?
Setup
data = pd.DataFrame([[5, 1, 9], [7, 8, 6], [5, 3, 2]])
You can accomplish this by using argsort of the diagonal elements of your sorted DataFrame, then indexing the DataFrame using these values.
Step 1
Use your initial sort:
data = data.loc[:, data.min().sort_values().index]
1 2 0
0 1 9 5
1 8 6 7
2 3 2 5
Step 2
Use np.argsort with np.diag:
data.iloc[np.argsort(np.diag(data))]
1 2 0
0 1 9 5
2 3 2 5
1 8 6 7
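Put together with the imports, the two steps look like the sketch below. The names by_col, diag and by_row are made up for illustration. One caveat worth stating: sorting rows by the diagonal puts the diagonal in ascending order, which reproduces the layout asked for in the question, but it does not guarantee in general that every column's minimum lands on the diagonal (here the last column has 7 on the diagonal while its minimum is 5).

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[5, 1, 9], [7, 8, 6], [5, 3, 2]])

# Step 1: order the columns by their minimum value
by_col = data.loc[:, data.min().sort_values().index]

# Step 2: order the rows by the current diagonal values
diag = np.diag(by_col.to_numpy())
by_row = by_col.iloc[np.argsort(diag)]

print(by_row)
```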
I'm not quite sure, but you've already done the following to sort the columns:
data = data.loc[:, data.min().sort_values().index]
The same trick could also be applied to sort the rows:
data = data.loc[data.min(axis=1).sort_values().index, :]
To move some values around so that the min value within each column is placed along the diagonal you could try something like this:
for i in range(len(data)):
    min_index = data.iloc[:, i].idxmin()
    if data.iloc[i,i] != data.iloc[min_index, i]:
        data.iloc[i,i], data.iloc[min_index,i] = data.iloc[min_index, i], data.iloc[i,i]
Basically just swap the min with the diagonal.

Find location of specific value in dataframe of distances

I've been browsing for an answer to my issue but I can't seem to find a suitable solution. I have a dataframe with distances (NxN cells) and I find the minimum distance of the whole dataframe with:
min_distance = distances.values.min()
Now I need to find the location (which row and which column of the dataframe) of the min_distance. Any ideas?
EDIT
Minimal code
import numpy as np
import pandas as pd
distances=[]
for i in range(5):
    distances.append([])
    for j in range(5):
        distances[i].append(np.random.randint(10))
distances=pd.DataFrame(distances)
min_distance = distances.values.min()
print("Minimum=", min_distance)
print("Location of minimum value=")
It depends on what form you want your result in. But a very straightforward approach would be to use stack and idxmin.
Like so:
Setup
import pandas as pd
df = pd.DataFrame([[2, 2, 2], [2, 1, 2], [2, 2, 2]],
columns=list('ABC'), index=list('abc'))
print(df)
   A  B  C
a  2  2  2
b  2  1  2
c  2  2  2
We should expect the min to be 1 and the location to be row b, column B.
Solution
df.stack().idxmin()
('b', 'B')
You could reshape this result into any other form you need. It just happens to be delivered as a tuple.
Generate example:
N = 4
df = pd.DataFrame(np.random.rand(N,N))
Find minimal index of flattened dataframe:
idx_min = df.values.flatten().argmin()
Simple arithmetic to get the row and column numbers back:
row = idx_min // N
column = idx_min % N
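NumPy's np.unravel_index performs the same conversion from a flat index to a (row, column) position and also works for non-square frames. The sketch below reuses the labelled 3x3 example from the earlier answer; the names values and location are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[2, 2, 2], [2, 1, 2], [2, 2, 2]],
                  columns=list('ABC'), index=list('abc'))

values = df.to_numpy()

# argmin works on the flattened array; unravel_index maps it back to 2-D positions
row, col = np.unravel_index(values.argmin(), values.shape)

# translate positions into the frame's own labels
location = (df.index[row], df.columns[col])

print(row, col)
print(location)
```

If the minimum occurs more than once, argmin reports only the first occurrence in row-major order.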
