pandas: find percentile stats of a given column - python

I have a pandas DataFrame my_df, where I can find the mean(), median(), and mode() of a given column:
my_df['field_A'].mean()
my_df['field_A'].median()
my_df['field_A'].mode()
I am wondering whether it is possible to find more detailed stats, such as the 90th percentile? Thanks!

You can use the quantile() method (available on both Series and DataFrame), as shown below.
import pandas as pd
import random
A = [ random.randint(0,100) for i in range(10) ]
B = [ random.randint(0,100) for i in range(10) ]
df = pd.DataFrame({ 'field_A': A, 'field_B': B })
df
# field_A field_B
# 0 90 72
# 1 63 84
# 2 11 74
# 3 61 66
# 4 78 80
# 5 67 75
# 6 89 47
# 7 12 22
# 8 43 5
# 9 30 64
df.field_A.mean() # Same as df['field_A'].mean()
# 54.399999999999999
df.field_A.median()
# 62.0
# You can call `quantile(i)` to get the i'th quantile,
# where `i` is a fraction between 0 and 1.
df.field_A.quantile(0.1) # 10th percentile
# 11.9
df.field_A.quantile(0.5) # same as median
# 62.0
df.field_A.quantile(0.9) # 90th percentile
# 89.10000000000001
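For completeness, quantile() can also be called on the whole DataFrame; a small sketch reusing the sample values above (hard-coded here rather than randomly drawn, so the result is reproducible):

```python
import pandas as pd

# The same sample values as above, hard-coded for reproducibility.
df = pd.DataFrame({'field_A': [90, 63, 11, 61, 78, 67, 89, 12, 43, 30],
                   'field_B': [72, 84, 74, 66, 80, 75, 47, 22, 5, 64]})

# Called on the whole DataFrame, quantile() returns one value per column.
print(df.quantile(0.9))
# field_A comes out around 89.1 and field_B around 80.4
```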

Assume a Series s:
import numpy as np
s = pd.Series(np.arange(100))
Get the quantiles for [.1, .2, .3, .4, .5, .6, .7, .8, .9] (the fourth positional argument to np.linspace is endpoint=False, so 1.0 itself is excluded):
s.quantile(np.linspace(.1, 1, 9, 0))
0.1 9.9
0.2 19.8
0.3 29.7
0.4 39.6
0.5 49.5
0.6 59.4
0.7 69.3
0.8 79.2
0.9 89.1
dtype: float64
Or, snapping to actual data points with the interpolation argument:
s.quantile(np.linspace(.1, 1, 9, 0), interpolation='lower')
0.1 9
0.2 19
0.3 29
0.4 39
0.5 49
0.6 59
0.7 69
0.8 79
0.9 89
dtype: int32
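For reference, the 'lower' argument above is quantile's interpolation parameter; a minimal sketch of how the common choices differ:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(100))  # values 0..99

# 'linear' (the default) interpolates between the two nearest points;
# 'lower', 'higher' and 'nearest' snap to actual data values.
print(s.quantile(0.1))                          # linear -> 9.9
print(s.quantile(0.1, interpolation='lower'))   # -> 9
print(s.quantile(0.1, interpolation='higher'))  # -> 10
```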

I figured out the following would work:
my_df.dropna().quantile([0.0, .9])

You can even pass multiple columns containing null values and get multiple quantile values (I use the 95th percentile for outlier treatment):
my_df[['field_A','field_B']].dropna().quantile([0.0, .5, .90, .95])
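A runnable sketch of that idea on toy data (column names as in the question; note that quantile() already skips NaN per column, so dropna() only matters if you want rows that are complete in every column):

```python
import numpy as np
import pandas as pd

my_df = pd.DataFrame({
    'field_A': [1.0, 2.0, np.nan, 4.0, 100.0],
    'field_B': [10.0, np.nan, 30.0, 40.0, 50.0],
})

# quantile() ignores NaN per column; calling dropna() first would instead
# keep only the rows that are complete in both columns.
print(my_df[['field_A', 'field_B']].quantile([0.0, .5, .90, .95]))
```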

A very easy and efficient way is to call the describe() function on the particular column:
df['field_A'].describe()
This will give you the count, mean, std, min, max, and the 25th, 50th (median), and 75th percentiles.

describe() gives you quartiles by default; if you want other percentiles, you can pass them explicitly:
df['YOUR_COLUMN_HERE'].describe(percentiles=[.1, .2, .3, .4, .5, .6 , .7, .8, .9, 1])
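A small sketch of what that returns (note that describe() always adds the median, 50%, even if you don't request it):

```python
import pandas as pd

df = pd.DataFrame({'field_A': range(1, 101)})  # values 1..100

desc = df['field_A'].describe(percentiles=[.1, .9])
print(desc.index.tolist())
# ['count', 'mean', 'std', 'min', '10%', '50%', '90%', 'max']
```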


how to make a nested boxplot with groupby

I have a dataset of more than 50 features that correspond to the specific movement during leg rehabilitation. I compare the group that used our rehabilitation device with the group recovering without using it. The group includes patients with 3 diagnoses and I want to compare boxplots of before (red boxplot) and after (blue boxplot) for each diagnosis.
Control group data:
dataKONTR
Row DG DKK ... LOS_DCL_LB LOS_DCL_L LOS_DCL_LF
0 Williams1 distorze 0.0 ... 63 57 78
1 Williams2 distorze 0.0 ... 91 68 67
2 Norton1 LCA 1.0 ... 58 90 64
3 Norton2 LCA 1.0 ... 29 91 87
4 Chavender1 distorze 1.0 ... 61 56 75
5 Chavender2 distorze 1.0 ... 54 74 80
6 Bendis1 distorze 1.0 ... 32 57 97
7 Bendis2 distorze 1.0 ... 55 69 79
8 Shawn1 AS 1.0 ... 15 74 75
9 Shawn2 AS 1.0 ... 67 86 79
10 Cichy1 LCA 0.0 ... 45 83 80
This is the snippet I was using and the output I am getting.
temp = "c:/Users/novos/ŠKOLA/Statistika/data Mariana/%s.xlsx"
dataKU = pd.read_excel(temp % "VestlabEXP_KU", engine = "openpyxl", skipfooter= 85) # patients using our rehabilitation tool
dataKONTR = pd.read_excel(temp % "VestlabEXP_kontr", engine = "openpyxl", skipfooter=51) # control group
dataKU_diag = dataKU.dropna()
dataKONTR_diag = dataKONTR.dropna()
dataKUBefore = dataKU_diag[dataKU_diag['Row'].str.contains("1")] # Patient rows ending with 1 are before rehab
dataKUAfter = dataKU_diag[dataKU_diag['Row'].str.contains("2")] # Patient rows ending with 2 are after rehab
dataKONTRBefore = dataKONTR_diag[dataKONTR_diag['Row'].str.contains("1")]
dataKONTRAfter = dataKONTR_diag[dataKONTR_diag['Row'].str.contains("2")]
b1 = dataKUBefore.boxplot(column=list(dataKUBefore.filter(regex='LOS_RT')), by="DG", rot = 45, color=dict(boxes='r', whiskers='r', medians='r', caps='r'),layout=(2,4),return_type='axes')
plt.ylim(0.5, 1.5)
plt.suptitle("")
plt.suptitle("Before, KU")
b2 = dataKUAfter.boxplot(column=list(dataKUAfter.filter(regex='LOS_RT')), by="DG", rot = 45, color=dict(boxes='b', whiskers='b', medians='b', caps='b'),layout=(2,4),return_type='axes')
# dataKUPredP
plt.suptitle("")
plt.suptitle("After, KU")
plt.ylim(0.5, 1.5)
plt.show()
Output is in two figures (red boxplot is all the "before rehab" data and blue boxplot is all the "after rehab")
Can you help me place the red and blue boxplots next to each other for each diagnosis?
Thank you for any help.
You can try to concatenate the data into a single dataframe:
dataKUPlot = pd.concat({
    'Before': dataKUBefore,
    'After': dataKUAfter,
}, names=['When'])
You should see an additional index level named When in the output.
Using the example data you posted it looks like this:
>>> pd.concat({'Before': df, 'After': df}, names=['When'])
Row DG DKK ... LOS_DCL_LB LOS_DCL_L LOS_DCL_LF
When
Before 0 Williams1 distorze 0.0 ... 63 57 78
1 Williams2 distorze 0.0 ... 91 68 67
2 Norton1 LCA 1.0 ... 58 90 64
3 Norton2 LCA 1.0 ... 29 91 87
4 Chavender1 distorze 1.0 ... 61 56 75
After 0 Williams1 distorze 0.0 ... 63 57 78
1 Williams2 distorze 0.0 ... 91 68 67
2 Norton1 LCA 1.0 ... 58 90 64
3 Norton2 LCA 1.0 ... 29 91 87
4 Chavender1 distorze 1.0 ... 61 56 75
Then you can plot all of the boxes with a single command and thus on the same plots, by modifying the by grouper:
dataKUPlot.boxplot(column=dataKUPlot.filter(regex='LOS_RT').columns.to_list(), by=['DG', 'When'], rot=45, layout=(2,4), return_type='axes')
I believe that’s the only “simple” way, though I’m afraid the result looks a little cluttered.
Any other way implies manual plotting with matplotlib − and thus better control. For example, iterate over all desired columns:
fig, axes = plt.subplots(nrows=2, ncols=3, sharey=True, sharex=True)
pos = 1 + np.arange(max(dataKUBefore['DG'].nunique(), dataKUAfter['DG'].nunique()))
redboxes = {f'{x}props': dict(color='r') for x in ['box', 'whisker', 'median', 'cap']}
blueboxes = {f'{x}props': dict(color='b') for x in ['box', 'whisker', 'median', 'cap']}
ax_it = axes.flat
for colname, ax in zip(dataKUBefore.filter(regex='LOS_RT').columns, ax_it):
    # Making a dataframe here to ensure the same ordering
    show = pd.DataFrame({
        'before': dataKUBefore[colname].groupby(dataKUBefore['DG']).agg(list),
        'after': dataKUAfter[colname].groupby(dataKUAfter['DG']).agg(list),
    })
    ax.boxplot(show['before'].values, positions=pos - .15, **redboxes)
    ax.boxplot(show['after'].values, positions=pos + .15, **blueboxes)
    ax.set_xticks(pos)
    ax.set_xticklabels(show.index, rotation=45)
    ax.set_title(colname)
    ax.grid(axis='both')
# Hide remaining axes:
for ax in ax_it:
    ax.axis('off')
You could add a new column to separate 'Before' and 'After'. Seaborn's boxplots can use that new column as hue. sns.catplot(kind='box', ...) creates a grid of boxplots:
import seaborn as sns
import pandas as pd
import numpy as np
names = ['Adams', 'Arthur', 'Buchanan', 'Buren', 'Bush', 'Carter', 'Cleveland', 'Clinton', 'Coolidge', 'Eisenhower', 'Fillmore', 'Ford', 'Garfield', 'Grant', 'Harding', 'Harrison', 'Hayes', 'Hoover', 'Jackson', 'Jefferson', 'Johnson', 'Kennedy', 'Lincoln', 'Madison', 'McKinley', 'Monroe', 'Nixon', 'Obama', 'Pierce', 'Polk', 'Reagan', 'Roosevelt', 'Taft', 'Taylor', 'Truman', 'Trump', 'Tyler', 'Washington', 'Wilson']
rows = np.array([(name + '1', name + '2') for name in names]).flatten()
dataKONTR = pd.DataFrame({'Row': rows,
                          'DG': np.random.choice(['AS', 'Distorze', 'LCA'], len(rows)),
                          'LOS_RT_A': np.random.randint(15, 100, len(rows)),
                          'LOS_RT_B': np.random.randint(15, 100, len(rows)),
                          'LOS_RT_C': np.random.randint(15, 100, len(rows)),
                          'LOS_RT_D': np.random.randint(15, 100, len(rows)),
                          'LOS_RT_E': np.random.randint(15, 100, len(rows)),
                          'LOS_RT_F': np.random.randint(15, 100, len(rows))})
dataKONTR = dataKONTR.dropna()
dataKONTR['When'] = ['Before' if r[-1] == '1' else 'After' for r in dataKONTR['Row']]
cols = [c for c in dataKONTR.columns if 'LOS_RT' in c]
df_long = dataKONTR.melt(value_vars=cols, var_name='Which', value_name='Value', id_vars=['When', 'DG'])
g = sns.catplot(kind='box', data=df_long, x='DG', col='Which', col_wrap=3, y='Value', hue='When')
g.set_axis_labels('', '') # remove the x and y labels

add a new column based on other row values in dataframe

I have this dataframe named "test" and a list of words list_w = ['monthly', 'moon']. I want to add a new column "revised cosine" such that: for each word in list_w whose 'weak'-condition row has cosine == 'Na', the revised cosine of the corresponding 'unrel_weak' row will also be 'Na'; similarly, for each word in list_w whose 'strong'-condition row has cosine == 'Na', the revised cosine of the corresponding 'unrel_strong' row will also be 'Na'.
isi prime target condition meanRT cosine
0 50 weekly monthly strong 676.2 0.9
1 1050 weekly monthly strong 643.5 0.9
2 50 daily monthly weak 737.2 Na
3 1050 daily monthly weak 670.6 Na
4 50 bathtub monthly unrel_strong 692.2 0.1
5 1050 bathtub monthly unrel_strong 719.1 0.1
6 50 sponge monthly unrel_weak 805.8 0.3
7 1050 sponge monthly unrel_weak 685.7 0.3
8 50 crescent moon strong 625.0 Na
9 1050 crescent moon strong 537.2 Na
10 50 sunset moon weak 698.4 0.2
11 1050 sunset moon weak 704.3 0.2
12 50 premises moon unrel_strong 779.2 0.7
13 1050 premises moon unrel_strong 647.6 0.7
14 50 descent moon unrel_weak 686.0 0.5
15 1050 descent moon unrel_weak 725.4 0.5
My code is as below:
for w in list_w:
    if test.loc[(test['target']==w) & (test['condition']=='strong'), 'cosine']=='Na':
        test.loc[(test['target']==w) & (test['condition']=='unrel_strong'), 'cosine'] ='Na'
My code returns error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
My expected output should be like in the dataframe below (with "revised cosine" column added)
isi prime target condition meanRT cosine revised cosine
0 50 weekly monthly strong 676.2 0.9 0.9
1 1050 weekly monthly strong 643.5 0.9 0.9
2 50 daily monthly weak 737.2 Na Na
3 1050 daily monthly weak 670.6 Na Na
4 50 bathtub monthly unrel_strong 692.2 0.1 0.1
5 1050 bathtub monthly unrel_strong 719.1 0.1 0.1
6 50 sponge monthly unrel_weak 805.8 0.3 Na
7 1050 sponge monthly unrel_weak 685.7 0.3 Na
8 50 crescent moon strong 625.0 Na Na
9 1050 crescent moon strong 537.2 Na Na
10 50 sunset moon weak 698.4 0.2 0.2
11 1050 sunset moon weak 704.3 0.2 0.2
12 50 premises moon unrel_strong 779.2 0.7 Na
13 1050 premises moon unrel_strong 647.6 0.7 Na
14 50 descent moon unrel_weak 686.0 0.5 0.5
15 1050 descent moon unrel_weak 725.4 0.5 0.5
Any ideas to help me out? I checked logical_and but it seems to work only with 2 conditions.
Overwriting the cosine column is also fine, as long as the output is like the revised cosine. Thanks in advance!
This error message comes from the fact that you can't use a whole boolean Series as the condition of an if statement in pandas.
Try like this:
test["revised cosine"] = test["cosine"]
for w in list_w:
    for c in ["weak", "strong"]:
        has_na = (
            (test["target"] == w) & (test["condition"] == c) & (test["cosine"] == "Na")
        ).any()
        if has_na:
            unrel = (test["target"] == w) & (test["condition"] == "unrel_" + c)
            test.loc[unrel, "revised cosine"] = "Na"
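To see where the original ValueError comes from, a minimal sketch: the if statement forces a whole boolean Series into a single True/False, which pandas refuses to do.

```python
import pandas as pd

mask = pd.Series([True, False])
try:
    if mask:  # bool() of a multi-element Series is ambiguous: any()? all()?
        pass
except ValueError as err:
    print(err)  # suggests a.empty, a.bool(), a.item(), a.any() or a.all()
```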
Solution
m = test['cosine'].eq('Na') & \
    test['target'].isin(list_w) & \
    test['condition'].isin(['weak', 'strong'])
i1 = test.set_index(['isi', 'target', 'condition']).index
i2 = test[m].set_index(['isi', 'target', test.loc[m, 'condition'].radd('unrel_')]).index
test['revised_cosine'] = test['cosine'].mask(i1.isin(i2), 'Na')
Explanations
Let us create a boolean mask m which holds True when the cosine column contains 'Na' and, at the same time, the target column contains one of the words from list_w and the condition column is either 'weak' or 'strong':
>>> m
0 False
1 False
2 True
3 True
4 False
5 False
6 False
7 False
8 True
9 True
10 False
11 False
12 False
13 False
14 False
15 False
dtype: bool
Create a MultiIndex based on the columns isi, target and condition; let's call it i1. Filter the rows of the test dataframe using mask m, add the prefix unrel_ to the filtered rows' condition column, and create another MultiIndex i2 in a similar way.
>>> i1
MultiIndex([( 50, 'monthly', 'strong'),
(1050, 'monthly', 'strong'),
( 50, 'monthly', 'weak'),
(1050, 'monthly', 'weak'),
( 50, 'monthly', 'unrel_strong'),
(1050, 'monthly', 'unrel_strong'),
( 50, 'monthly', 'unrel_weak'),
(1050, 'monthly', 'unrel_weak'),
( 50, 'moon', 'strong'),
(1050, 'moon', 'strong'),
( 50, 'moon', 'weak'),
(1050, 'moon', 'weak'),
( 50, 'moon', 'unrel_strong'),
(1050, 'moon', 'unrel_strong'),
( 50, 'moon', 'unrel_weak'),
(1050, 'moon', 'unrel_weak')],
names=['isi', 'target', 'condition'])
>>> i2
MultiIndex([( 50, 'monthly', 'unrel_weak'),
(1050, 'monthly', 'unrel_weak'),
( 50, 'moon', 'unrel_strong'),
(1050, 'moon', 'unrel_strong')],
names=['isi', 'target', 'condition'])
Finally, mask the values in the cosine column using a boolean mask created by testing the membership of i1 in i2:
isi prime target condition meanRT cosine revised_cosine
0 50 weekly monthly strong 676.2 0.9 0.9
1 1050 weekly monthly strong 643.5 0.9 0.9
2 50 daily monthly weak 737.2 Na Na
3 1050 daily monthly weak 670.6 Na Na
4 50 bathtub monthly unrel_strong 692.2 0.1 0.1
5 1050 bathtub monthly unrel_strong 719.1 0.1 0.1
6 50 sponge monthly unrel_weak 805.8 0.3 Na
7 1050 sponge monthly unrel_weak 685.7 0.3 Na
8 50 crescent moon strong 625.0 Na Na
9 1050 crescent moon strong 537.2 Na Na
10 50 sunset moon weak 698.4 0.2 0.2
11 1050 sunset moon weak 704.3 0.2 0.2
12 50 premises moon unrel_strong 779.2 0.7 Na
13 1050 premises moon unrel_strong 647.6 0.7 Na
14 50 descent moon unrel_weak 686.0 0.5 0.5
15 1050 descent moon unrel_weak 725.4 0.5 0.5
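A self-contained check of this solution on a reduced version of the example data (same column names; 'Na' kept as a literal string, as in the post):

```python
import pandas as pd

# Reduced version of the example frame: only the 'monthly' rows.
test = pd.DataFrame({
    'isi': [50, 1050] * 4,
    'target': ['monthly'] * 8,
    'condition': ['strong', 'strong', 'weak', 'weak',
                  'unrel_strong', 'unrel_strong', 'unrel_weak', 'unrel_weak'],
    'cosine': [0.9, 0.9, 'Na', 'Na', 0.1, 0.1, 0.3, 0.3],
})
list_w = ['monthly', 'moon']

m = test['cosine'].eq('Na') & \
    test['target'].isin(list_w) & \
    test['condition'].isin(['weak', 'strong'])
i1 = test.set_index(['isi', 'target', 'condition']).index
i2 = test[m].set_index(['isi', 'target', test.loc[m, 'condition'].radd('unrel_')]).index
test['revised_cosine'] = test['cosine'].mask(i1.isin(i2), 'Na')

# The 'weak' rows have cosine 'Na', so the 'unrel_weak' rows get 'Na';
# the 'unrel_strong' rows keep their original cosine.
print(test)
```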

Get the average value of the pixel across all the frames

I have 14400 values saved in a list, representing a patch 4 pixels high by 6 pixels wide over 600 frames.
Here are the values if anyone is interested
# len of blurry values 14400
data = np.array(blurry_values)
#shape of data (4, 6, 600)
shape = ( 4,6,600 )
data= data.reshape(shape)
#print(np.round(np.mean(data, axis=2),2))
[[0.89 0.37 0.45 0.44 0.51 0.52]
[0.5 0.47 0.53 0.48 0.48 0.53]
[0.49 0.5 0.5 0.53 0.48 0.54]
[0.48 0.51 0.45 0.55 0.5 0.49]]
However, when I confirm the sanity of the first average by doing the following
list1 = blurry_values[::23]
np.round(np.mean(list1),2)
I get 0.51 instead of 0.89
I am trying to get the average value of the pixel across all the frames. Why are these values different?
I don't know exactly why, but:
list1 = blurry_values[:600]
gives 0.89, and
list1 = blurry_values[600:1200]
gives 0.37.
NumPy reshapes by filling the last dimension first (row-major, C order), I believe.
Let us tackle this with a smaller array:
import numpy as np
np.random.seed(42)
values = np.random.randint(low=0, high=100, size=48)
shape = (2,4,6)
data = values.reshape(shape) # 2 frames of 4 pixels by 6 pixels each
print(data, '\n')
print(np.round(np.mean(data, axis=0),2), '\n') # average values across frames
list1 = values[::24]
print(np.round(np.mean(list1),2)) # average of first pixel across frames
Output:
[[[51 92 14 71 60 20]
[82 86 74 74 87 99]
[23 2 21 52 1 87]
[29 37 1 63 59 20]]
[[32 75 57 21 88 48]
[90 58 41 91 59 79]
[14 61 61 46 61 50]
[54 63 2 50 6 20]]]
[[41.5 83.5 35.5 46. 74. 34. ]
[86. 72. 57.5 82.5 73. 89. ]
[18.5 31.5 41. 49. 31. 68.5]
[41.5 50. 1.5 56.5 32.5 20. ]]
41.5
Since I haven't seen the code that produced blurry_values, I can't be 100% sure, but I'm guessing that you're re-shaping blurry_values wrongly.
In most programming scenarios, I would expect the pixel-height and pixel-width to be represented by the last two axes, and the frame to be represented by an axis preceding these two.
So, I'm guessing that your shape should have been shape = (600, 4, 6) instead of shape = (4, 6, 600).
In that case, you should be doing np.round(np.mean(data, axis=0),2) rather than np.round(np.mean(data, axis=2),2). BTW, that would also produce a shape of (4, 6).
Then, for your sanity check, you should be doing this:
list1 = blurry_values[::24] # Note that it's 24, not 23
np.round(np.mean(list1),2)
You should then compare the first value of np.round(np.mean(data, axis=0), 2) with np.round(np.mean(list1), 2). (I haven't tested it myself, though.)
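A quick self-check of that guess on synthetic data: with shape (600, 4, 6), averaging over axis 0 for the first pixel agrees exactly with striding the flat list by 24.

```python
import numpy as np

flat = np.arange(600 * 4 * 6, dtype=float)  # stand-in for blurry_values
data = flat.reshape(600, 4, 6)              # (frames, height, width)

per_pixel = data.mean(axis=0)               # shape (4, 6): one mean per pixel
strided = flat[::24].mean()                 # every 24th value is pixel (0, 0)

print(per_pixel[0, 0] == strided)  # True
```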

average of binned values

I have 2 separate dataframes and want to do correlation between them
Time temperature | Time ratio
0 32 | 0 0.02
1 35 | 1 0.1
2 30 | 2 0.25
3 31 | 3 0.17
4 34 | 4 0.22
5 34 | 5 0.07
I want to bin my data every 0.05 (from ratio), with time as index and do an average in each bin on all the temperature values that correspond to that bin.
I will therefore obtain one averaged value for each 0.05 point
anyone could help out please? Thanks!
**** Edit: what the data look like **** (df1 on the left, df2 on the right)
Time device-1 device-2... | Time device-1 device-2...
0 32 34 | 0 0.02 0.01
1 35 31 | 1 0.1 0.23
2 30 30 | 2 0.25 0.15
3 31 32 | 3 0.17 0.21
4 34 35 | 4 0.22 0.13
5 34 31 | 5 0.07 0.06
This could work with the pandas library:
import pandas as pd
import numpy as np
temp = [32,35,30,31,34,34]
ratio = [0.02,0.1,0.25,0.17,0.22,0.07]
times = range(6)
# Create your dataframe
df = pd.DataFrame({'Time': times, 'Temperature': temp, 'Ratio': ratio})
# Bins every 0.05. Note that np.arange(0, 0.25, 0.05) excludes 0.25, so the last
# edge is 0.2 and ratios above 0.2 (0.22 and 0.25 here) fall outside all bins;
# use np.arange(0, 0.30, 0.05) if you want to keep them.
bins = pd.cut(df.Ratio, np.arange(0, 0.25, 0.05))
# get the mean temperature of each group and the list of each time
df.groupby(bins).agg({"Temperature": "mean", "Time": list})
Output:
Temperature Time
Ratio
(0.0, 0.05] 32.0 [0]
(0.05, 0.1] 34.5 [1, 5]
(0.1, 0.15] NaN []
(0.15, 0.2] 31.0 [3]
You can discard the empty bins with .dropna() like this:
df.groupby(bins).agg({"Temperature": "mean", "Time": list}).dropna()
Temperature Time
Ratio
(0.0, 0.05] 32.0 [0]
(0.05, 0.1] 34.5 [1, 5]
(0.15, 0.2] 31.0 [3]
EDIT: In the case of multiple machines, here is a solution:
import pandas as pd
import numpy as np
n_machines = 3
# Generate random data for temperature and ratios
temperature_df = pd.DataFrame( {'Machine_{}'.format(i):
pd.Series(np.random.randint(30,40,10))
for i in range(n_machines)} )
ratio_df = pd.DataFrame( {'Machine_{}'.format(i):
pd.Series(np.random.uniform(0.01,0.5,10))
for i in range(n_machines)} )
# If ratio is between 0 and 1, we get the bins spaced by .05
def get_bins(s):
return pd.cut(s,np.arange(0,1,0.05))
# Get bin assignments for each machine
bins = ratio_df.apply(get_bins,axis=1)
# Get the mean of each group for each machine
df = temperature_df.apply(lambda x: x.groupby(bins[x.name]).agg("mean"))
Then if you want to display the result, you could use the seaborn package:
import matplotlib.pyplot as plt
import seaborn as sns
df_reshaped = df.reset_index().melt(id_vars='index')
df_reshaped.columns = [ 'Ratio bin','Machine','Mean temperature' ]
sns.barplot(data=df_reshaped,x="Ratio bin",y="Mean temperature",hue="Machine")
plt.show()

Pandas bar plot with binned range

Is there a way to create a bar plot from continuous data binned into predefined intervals? For example,
In[1]: df
Out[1]:
0 0.729630
1 0.699620
2 0.710526
3 0.000000
4 0.831325
5 0.945312
6 0.665428
7 0.871845
8 0.848148
9 0.262500
10 0.694030
11 0.503759
12 0.985437
13 0.576271
14 0.819742
15 0.957627
16 0.814394
17 0.944649
18 0.911111
19 0.113333
20 0.585821
21 0.930131
22 0.347222
23 0.000000
24 0.987805
25 0.950570
26 0.341317
27 0.192771
28 0.320988
29 0.513834
231 0.342541
232 0.866279
233 0.900000
234 0.615385
235 0.880597
236 0.620690
237 0.984375
238 0.171429
239 0.792683
240 0.344828
241 0.288889
242 0.961686
243 0.094402
244 0.960526
245 1.000000
246 0.166667
247 0.373494
248 0.000000
249 0.839416
250 0.862745
251 0.589873
252 0.983871
253 0.751938
254 0.000000
255 0.594937
256 0.259615
257 0.459916
258 0.935065
259 0.969231
260 0.755814
and instead of a simple histogram:
df.hist()
I need to create a bar plot, where each bar will count a number of instances within a predefined range.
For example, the following plot should have three bars with the number of points which fall into [0, 0.35], [0.35, 0.7], and [0.7, 1.0].
EDIT
Many thanks for your answers. Another question: how do I order the bins?
For example, I get the following result:
In[349]: out.value_counts()
Out[349]:
[0, 0.001] 104
(0.001, 0.1] 61
(0.1, 0.2] 32
(0.2, 0.3] 20
(0.3, 0.4] 18
(0.7, 0.8] 6
(0.4, 0.5] 6
(0.5, 0.6] 5
(0.6, 0.7] 4
(0.9, 1] 3
(0.8, 0.9] 2
(1, 1.001] 0
As you can see, some of the bins are not in order. How do I sort the result based on the categories, i.e. my bins?
EDIT 2
Just found how to solve it, simply with 'reindex()':
In[355]: out.value_counts().reindex(out.cat.categories)
Out[355]:
[0, 0.001] 104
(0.001, 0.1] 61
(0.1, 0.2] 32
(0.2, 0.3] 20
(0.3, 0.4] 18
(0.4, 0.5] 6
(0.5, 0.6] 5
(0.6, 0.7] 4
(0.7, 0.8] 6
(0.8, 0.9] 2
(0.9, 1] 3
(1, 1.001] 0
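An alternative to reindex(): since pd.cut returns an ordered categorical, sorting the index also yields the bins in interval order. A sketch on toy data:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
s = pd.Series(np.random.rand(100))
out = pd.cut(s, bins=np.linspace(0, 1, 11), include_lowest=True)

counts = out.value_counts().sort_index()  # bins in interval order
print(counts)
```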
You can make use of pd.cut to partition the values into bins corresponding to each interval, then take each interval's total count using pd.value_counts. Then plot a bar graph and replace the X-axis tick labels with the name of the category each tick belongs to.
out = pd.cut(s, bins=[0, 0.35, 0.7, 1], include_lowest=True)  # s is your Series of values
ax = out.value_counts(sort=False).plot.bar(rot=0, color="b", figsize=(6,4))
ax.set_xticklabels([c[1:-1].replace(","," to") for c in out.cat.categories])
plt.show()
If you want the Y-axis to be displayed as relative percentages, normalize the frequency counts and multiply that result with 100.
out = pd.cut(s, bins=[0, 0.35, 0.7, 1], include_lowest=True)
out_norm = out.value_counts(sort=False, normalize=True).mul(100)
ax = out_norm.plot.bar(rot=0, color="b", figsize=(6,4))
ax.set_xticklabels([c[1:-1].replace(","," to") for c in out.cat.categories])
plt.ylabel("pct")
plt.show()
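A quick check of the normalized variant on toy data, assuming the same bin edges; the percentages always total 100:

```python
import pandas as pd

s = pd.Series([0.1, 0.2, 0.4, 0.5, 0.8, 0.9])
out = pd.cut(s, bins=[0, 0.35, 0.7, 1], include_lowest=True)
out_norm = out.value_counts(sort=False, normalize=True).mul(100)

# Two values land in each bin, so each bin holds one third of the data.
print(out_norm.round(1).tolist())
```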
You may consider using matplotlib to plot the histogram; matplotlib.pyplot.hist accepts an array of bin edges as input for the bins.
import numpy as np; np.random.seed(0)
import matplotlib.pyplot as plt
import pandas as pd
x = np.random.rand(120)
df = pd.DataFrame({"x":x})
bins= [0,0.35,0.7,1]
plt.hist(df.values, bins=bins, edgecolor="k")
plt.xticks(bins)
plt.show()
You can use pd.cut (this assumes your column is named val):
bins = [0,0.35,0.7,1]
df = df.groupby(pd.cut(df['val'], bins=bins)).val.count()
df.plot(kind='bar')
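A shorter equivalent worth knowing (again assuming the column is named val): Series.value_counts accepts a bins argument that cuts and counts in one step.

```python
import pandas as pd

df = pd.DataFrame({'val': [0.1, 0.2, 0.4, 0.5, 0.8, 0.9]})
bins = [0, 0.35, 0.7, 1]

# bins= does the pd.cut internally; sort=False keeps the intervals in order.
counts = df['val'].value_counts(bins=bins, sort=False)
print(counts.tolist())  # [2, 2, 2]
```

counts.plot(kind='bar') then draws the same three ordered bars.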
