I'm trying to get the row of the median value for a column.
I'm using data.median() to get the median value for 'column'.
id 30444.5
someProperty 3.0
numberOfItems 0.0
column 70.0
Selecting that column (named 'performance' in my actual data) from the result gives:
data.median()['performance']
>>> 70.0
How can I get the row or index of the median value?
Is there anything similar to idxmax / idxmin?
I tried filtering, but it's not reliable when multiple rows have the same value.
Thanks!
You can use rank and idxmin and apply it to each column:
import numpy as np
import pandas as pd
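def get_median_index(d):
    # Percentile rank (between 0 and 1) of every value
    ranks = d.rank(pct=True)
    # Distance of each rank from the 50th percentile
    close_to_median = abs(ranks - 0.5)
    # Label of the first value whose rank is closest to 0.5
    return close_to_median.idxmin()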
df = pd.DataFrame(np.random.randn(13, 4))
df
0 1 2 3
0 0.919681 -0.934712 1.636177 -1.241359
1 -1.198866 1.168437 1.044017 -2.487849
2 1.159440 -1.764668 -0.470982 1.173863
3 -0.055529 0.406662 0.272882 -0.318382
4 -0.632588 0.451147 -0.181522 -0.145296
5 1.180336 -0.768991 0.708926 -1.023846
6 -0.059708 0.605231 1.102273 1.201167
7 0.017064 -0.091870 0.256800 -0.219130
8 -0.333725 -0.170327 -1.725664 -0.295963
9 0.802023 0.163209 1.853383 -0.122511
10 0.650980 -0.386218 -0.170424 1.569529
11 0.678288 -0.006816 0.388679 -0.117963
12 1.640222 1.608097 1.779814 1.028625
df.apply(get_median_index, 0)
0 7
1 7
2 3
3 4
Maybe just: data[data.performance == data.median()['performance']].
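Two caveats on the answers above, with a small sketch. The exact-match filter returns nothing when no row equals the median (for instance with an even number of rows). With an odd number of distinct values, the median's percentile rank and the rank just below it are equally far from 0.5, so the rank-based function may return the value just below the median. One alternative is to measure the distance to the median value itself. The sample data and the names exact and median_idx are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; 'performance' mirrors the column name in the question.
data = pd.DataFrame({"performance": [10.0, 70.0, 30.0, 90.0, 70.0, 50.0]})

median = data["performance"].median()

# Exact matching finds no row here because the median is not one of the values.
exact = data[data["performance"] == median]

# Label of the first row whose value is closest to the median.
median_idx = (data["performance"] - median).abs().idxmin()

print(median, len(exact), median_idx)
```

When several rows are equally close, idxmin returns the first of them, so the result is deterministic but not unique.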
I have a pd.DataFrame of floats:
import numpy as np
import pandas as pd
pd.DataFrame(np.random.rand(5, 5))
0 1 2 3 4
0 0.795329 0.125540 0.035918 0.645191 0.819097
1 0.755365 0.333681 0.466814 0.892638 0.610947
2 0.545204 0.313946 0.538049 0.237857 0.365573
3 0.026813 0.867013 0.843798 0.923359 0.464778
4 0.514968 0.201144 0.853341 0.951981 0.214948
I'm trying to format all of them as percentages:
0 1 2 3 4
0 "25.60%" "11.55%" "98.62%" "73.16%" "38.85%"
1 "26.01%" "28.57%" "65.21%" "32.55%" "93.45%"
2 "19.99%" "41.97%" "57.21%" "61.26%" "83.34%"
3 "41.54%" "71.02%" "52.93%" "42.78%" "49.77%"
4 "33.77%" "70.48%" "36.64%" "97.42%" "83.19%"
or
0 1 2 3 4
0 25.60% 11.55% 98.62% 73.16% 38.85%
1 26.01% 8.57% 65.21% 32.55% 93.45%
2 19.99% 41.97% 57.21% 61.26% 83.34%
3 41.54% 1.02% 52.93% 42.78% 49.77%
4 33.77% 70.48% 36.64% 97.42% 83.19%
Many solutions exist, but only for a single column (for example here). I'm trying to edit the values themselves, so I'm not interested in changing the float display format.
How can I proceed?
Try this:
df = df.applymap('{:,.2%}'.format)
Or np.vectorize:
df[:] = np.vectorize('{:,.2%}'.format)(df)
If you want to display numeric data as percents, while keeping the underlying data numeric and not strings, then you can use dataframe styling:
https://stackoverflow.com/a/55624414
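The string-formatting approach can be sketched as follows. DataFrame.applymap is deprecated since pandas 2.1 in favour of DataFrame.map, so this sketch uses whichever one the installed version provides. The sample values and the names fmt, elementwise and as_text are illustrative assumptions, not part of the original question.

```python
import pandas as pd

# Illustrative sample values, not the data from the question.
df = pd.DataFrame({"a": [0.256, 0.2601], "b": [0.1155, 0.2857]})

fmt = "{:.2%}".format

# Use DataFrame.map where available (pandas 2.1+), else fall back to applymap.
elementwise = df.map if hasattr(pd.DataFrame, "map") else df.applymap

# A new frame of strings; df itself keeps its float values.
as_text = elementwise(fmt)
print(as_text)
```

Keeping the result in a separate frame leaves the numeric data available for further calculations.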
The objective is to get the average of the n preceding and n succeeding rows for a given index row.
That is, for each index, take the average over a list of neighbouring indices (including the row itself).
For example:
index   indices averaged
0 0,1,2
1 0,1,2,3
2 0,1,2,3,4
...
9 7,8,9,10,11
10 8,9,10,11,12
This can be achieved as below:
import pandas as pd
arr=[[6772],
[7182],
[8570],
[11078],
[11646],
[13426],
[16996],
[17514],
[18408],
[22128],
[22520],
[23532],
[26164],
[26590],
[30636],
[3119],
[32166],
[34774]]
df=pd.DataFrame(arr,columns=['a'])
df['cal']=0
idx_c=2
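for idx in range(len(df)):
    # Window bounds around the current row, clamped to the valid range
    idx_l=idx-idx_c
    idx_t=idx+idx_c
    idx_l=0 if idx_l<0 else idx_l
    idx_t=len(df) if idx_t>len(df) else idx_t
    # Mean of column 'a' over the rows inside the window
    df.loc[idx,'cal']=df['a'][df.index.isin(range(idx_l,idx_t+1))].mean()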
However, I wonder whether there is a more efficient way of achieving the above task.
Series.rolling
The trick here is to use a rolling window of size 2*w + 1 with the optional parameter center=True, which centers the result of the rolling computation. For example, if w=2, the window size is 2*w + 1 = 5 and the result of each rolling computation is stored at the third (middle) position of the window. Setting min_periods=1 lets the truncated windows at both ends of the series still produce a value.
w = 2
df['avg'] = df['a'].rolling(2 * w + 1, center=True, min_periods=1).mean()
print(df)
a avg
0 6772 7508.00
1 7182 8400.50
2 8570 9049.60
3 11078 10380.40
4 11646 12343.20
5 13426 14132.00
6 16996 15598.00
7 17514 17694.40
8 18408 19513.20
9 22128 20820.40
10 22520 22550.40
11 23532 24186.80
12 26164 25888.40
13 26590 22008.20
14 30636 23735.00
15 3119 25457.00
16 32166 25173.75
17 34774 23353.00
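As a quick check, here is a minimal sketch comparing the centered rolling mean with an explicit slice-based average on the first five values of column a. The names rolled and manual are illustrative only.

```python
import numpy as np
import pandas as pd

# First five values of column 'a' from the example above.
a = pd.Series([6772, 7182, 8570, 11078, 11646])
w = 2

rolled = a.rolling(2 * w + 1, center=True, min_periods=1).mean()

# Explicit version: average positions max(0, i - w) .. i + w.
# Slices past the end of the series are clipped automatically.
manual = pd.Series([a.iloc[max(0, i - w): i + w + 1].mean() for i in range(len(a))])

print(np.allclose(rolled, manual))
```

The explicit version is only there to make the window boundaries visible; the rolling call is the one to use on real data.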
I know that there is a method .argmax() that returns the indexes of the maximum values across an axis.
But what if we want to get the indexes of the 10 highest values across an axis?
How could this be accomplished?
E.g.:
data = pd.DataFrame(np.random.random_sample((50, 40)))
You can use argsort:
s = pd.Series(np.random.permutation(30))
sorted_indices = s.argsort()
top_10 = sorted_indices[sorted_indices < 10]
print(top_10)
Output:
3 9
4 1
6 0
8 7
13 4
14 2
15 3
19 8
20 5
24 6
dtype: int64
IIUC, if you want the index of the 10 largest values in column col:
data[col].nlargest(10).index
Give this a try. This will take the 10 largest values across each row and put them into a dataframe.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.random_sample((50, 40)))
df2 = pd.DataFrame(np.sort(df.values)[:,-10:])
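Since the question asks for the indexes of the largest values and not the values themselves, here is a hedged sketch that applies np.argsort along each row and keeps the last k positions. The seeded generator and the names order, top_k_cols and top_k_labels are assumptions made for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only to make the example repeatable
data = pd.DataFrame(rng.random((50, 40)))
k = 10

# Column positions of each row, sorted by ascending value.
order = np.argsort(data.to_numpy(), axis=1)

# Keep the last k positions and reverse them so the largest comes first.
top_k_cols = order[:, -k:][:, ::-1]

# Translate positions into column labels, one row of k labels per input row.
top_k_labels = pd.DataFrame(data.columns.to_numpy()[top_k_cols], index=data.index)
print(top_k_labels.head())
```

Sorting along axis=0 instead would give the row positions of the largest values in each column.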
If I have a dataframe, say
df = {'carx' : ['merc','rari','merc','hond','fia','merc'],
'cary' : ['bent','maz','ben','merc','fia','fia'],
'milesx' : [0,100,2,22,5,6],
'milesy' : [10,3,18,2,19,2]}
I then would like to plot the value from column milesx if the corresponding row of carx has the value 'merc'. The same criterion applies to cary and milesy; otherwise nothing should be plotted. How can I do this?
milesy and milesx should be plotted on the x-axis. The y-axis should just be some continuous values (1,2...).
IIUC, assuming you have the following dataframe:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# input dictionary
df = {'carx' : ['merc','rari','merc','hond','fia','merc'],
'cary' : ['bent','maz','ben','merc','fia','fia'],
'milesx' : [0,100,2,22,5,6],
'milesy' : [10,3,18,2,19,2]}
# creating input dataframe
dataframe = pd.DataFrame(df)
print(dataframe)
Result:
carx cary milesx milesy
0 merc bent 0 10
1 rari maz 100 3
2 merc ben 2 18
3 hond merc 22 2
4 fia fia 5 19
5 merc fia 6 2
Then you can select the values to plot with a function that is applied to each row using apply:
def my_function(row):
    # milesx when carx is 'merc', otherwise milesy when cary is 'merc'
    if row['carx'] == 'merc':
        return row['milesx']
    if row['cary'] == 'merc':
        return row['milesy']
    return None
# keep only the values belonging to 'merc'
filtered = dataframe.apply(my_function, axis=1)
print(filtered)
Result:
0 0.0
1 NaN
2 2.0
3 2.0
4 NaN
5 6.0
dtype: float64
Rows where neither column is 'merc' come out as NaN and should not be plotted, so drop them with dropna():
# plotting
filtered.dropna().plot(kind='bar', legend=None);
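For larger frames, the row-wise apply can be replaced by a vectorized selection. The following is a sketch using np.select, which is a swapped-in technique and not part of the answer above; the names conditions and choices are illustrative.

```python
import numpy as np
import pandas as pd

dataframe = pd.DataFrame({
    'carx': ['merc', 'rari', 'merc', 'hond', 'fia', 'merc'],
    'cary': ['bent', 'maz', 'ben', 'merc', 'fia', 'fia'],
    'milesx': [0, 100, 2, 22, 5, 6],
    'milesy': [10, 3, 18, 2, 19, 2],
})

# Conditions are checked in order, so carx takes priority, as in my_function.
conditions = [
    (dataframe['carx'] == 'merc').to_numpy(),
    (dataframe['cary'] == 'merc').to_numpy(),
]
choices = [dataframe['milesx'].to_numpy(), dataframe['milesy'].to_numpy()]

# Rows matching neither condition get NaN and can be dropped before plotting.
filtered = pd.Series(np.select(conditions, choices, default=np.nan),
                     index=dataframe.index)
print(filtered.dropna())
```

The resulting series can be plotted in the same way as before, with dropna() followed by plot.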
I have a pandas DataFrame like the following.
df = pandas.DataFrame(np.random.randn(5,5),columns=['1','2','3','4','5'])
1 2 3 4 5
0 0.877455 -1.215212 -0.453038 -1.825135 0.440646
1 1.640132 -0.031353 1.159319 -0.615796 0.763137
2 0.132355 -0.762932 -0.909496 -1.012265 -0.695623
3 -0.257547 -0.844019 0.143689 -2.079521 0.796985
4 2.536062 -0.730392 1.830385 0.694539 -0.654924
I need to get the row and column indexes for the following three groups. (In my original dataset there are no negative values.)
value is greater than 2.0
value is between 1.0 - 2.0
value is less than 1.0
For example, for "value is greater than 2.0" it should return [1,4]. I have tried the following, which gives a boolean result.
df.values > 2
You can use np.where on the boolean result to extract the indices:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(5,5),columns=['1','2','3','4','5'])
condition = df.values > 2
print(np.column_stack(np.where(condition)))
For a df like this,
1 2 3 4 5
0 0.057347 0.722251 0.263292 -0.168865 -0.111831
1 -0.765375 1.040659 0.272883 -0.834273 -0.126997
2 -0.023589 0.046002 1.206445 0.381532 -1.219399
3 2.290187 2.362249 -0.748805 -1.217048 -0.973749
4 0.100084 0.671120 -0.211070 0.903264 -0.312815
Output:
[[3 0]
[3 1]]
Or get a list of row-column index pairs if necessary:
print(np.column_stack(np.where(condition)).tolist())
Output:
[[3, 0], [3, 1]]
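The same idea extends to all three groups from the question. This is a minimal sketch with made-up sample values; treating the boundaries 1.0 and 2.0 as part of the middle group is an assumption, since the question does not say which group they belong to. The names masks and groups are illustrative.

```python
import numpy as np
import pandas as pd

# Small made-up frame with non-negative values, as in the asker's real data.
df = pd.DataFrame([[0.5, 2.5, 1.5],
                   [1.0, 0.2, 3.0]], columns=['1', '2', '3'])
values = df.to_numpy()

# One boolean mask per group (boundary handling is an assumption).
masks = {
    'greater than 2.0': values > 2.0,
    'between 1.0 and 2.0': (values >= 1.0) & (values <= 2.0),
    'less than 1.0': values < 1.0,
}

# np.argwhere returns one [row, column] position pair per True entry.
groups = {name: np.argwhere(mask).tolist() for name, mask in masks.items()}
print(groups)
```

The pairs are positions, so use df.index and df.columns to translate them into labels if the frame does not use a default index.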