Binning a column in a DataFrame into 10 percentiles - python

I am looking to qcut or cut my "Amount" column into 10 percentile bins. Basically the describe() feature, but with 0-10%, 11-20%, 21-30%, 31-40%, 41-50%, 51-60%, 61-70%, 71-80%, 81-90%, 91-100% instead.
After the binning I'd like to create a column showing 1-10, indicating the bin that particular amount is a part of.
I've tried the code below, but I don't believe it's achieving what I want:
groups = df.groupby(pd.cut(df['Amount'], 10)).size()
Here is my DataFrame:
df.shape
Out[5]: (1385, 2)
df.head(10)
Out[6]:
Amount New or Repeat Customer
0 23044 New
1 15509 New
2 6184 New
3 6184 New
4 5828 New
5 5461 New
6 5143 New
7 5027 New
8 4992 New
9 4698 Repeat

Use pd.qcut:
# Sample data (imports added so the example is self-contained)
import numpy as np
import pandas as pd

size = 100
df = pd.DataFrame({
    'Amount': np.random.randint(5000, 20000, size),
    'CustomerType': np.random.choice(['New', 'Repeat'], size)
})

# Bin into 10 equal-frequency buckets with percentage-range labels
labels = ['0% to 10%'] + [f'{i+1}% to {i+10}%' for i in range(10, 100, 10)]
df['Bin'] = pd.qcut(df['Amount'], 10, labels=labels)
Result:
Amount CustomerType Bin
0 15597 Repeat 61% to 70%
1 14498 New 51% to 60%
2 6373 Repeat 0% to 10%
3 9901 Repeat 21% to 30%
4 18450 Repeat 91% to 100%
5 9337 Repeat 21% to 30%
6 19310 Repeat 91% to 100%
7 11198 New 31% to 40%
8 12485 New 41% to 50%
9 11130 New 31% to 40%
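Since you also want a column showing 1-10, a small variation is to pass numeric labels to pd.qcut (same df as above):
# Decile numbers 1-10 instead of percentage strings;
# labels=False would return 0-9, so an explicit list maps to 1-10
df['Decile'] = pd.qcut(df['Amount'], 10, labels=list(range(1, 11)))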

Related

How to get n previous rows in pandas, after using loc?

We have data representing workers' billing history of payments and penalties after their work shifts. Sometimes a worker's penalty is wrong, because he had technical problems with his mobile app and in reality he attended the job. Later he gets the penalty refunded, which appears with the description 'balance_correction'.
The goal is to show the n previous rows in the data, to find a pattern for why he got the penalty.
So here is the data:
import pandas as pd

d = {'balance_id': [70775, 70775, 70775, 70775, 70775],
     'amount': [2500, 2450, -500, 500, 2700],
     'description': ['payment_for_job_order_080ecd',
                     'payment_for_job_order_180eca',
                     'penalty_for_being_absent_at_job',
                     'balance_correction',
                     'payment_for_job_order_120ecq']}
df1 = pd.DataFrame(data=d)
df1
balance_id amount description
0 70775 2500 payment_for_job_order_080ecd
1 70775 2450 payment_for_job_order_180eca
2 70775 -500 penalty_for_being_absent_at_job
3 70775 500 balance_correction
4 70775 2700 payment_for_job_order_120ecq
I try this:
df1.loc[df1['description']=='balance_correction'].iloc[:-2]
and get nothing. Using shift doesn't help either. If we need to show 2 rows, the result should be
balance_id amount description
1 70775 2450 payment_for_job_order_180eca
2 70775 -500 penalty_for_being_absent_at_job
What can solve the issue?
If the index on your data frame is sequential (0, 1, 2, 3, ...), you can filter by the index:
idx = df1.loc[df1['description'] == 'balance_correction'].index
df1.loc[(idx - 2).append(idx - 1)]
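If you need the n previous rows for an arbitrary n, a minimal sketch along the same lines (n and prev are names introduced here, not from the original answer):
import numpy as np
n = 2  # number of preceding rows to show
idx = df1.index[df1['description'] == 'balance_correction']
# union of the n positions before each match, clipped to the valid range
prev = np.unique(np.concatenate([idx - k for k in range(1, n + 1)]))
df1.loc[prev[prev >= 0]]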

Discretization : converting continuous values into a certain number of categories

1. Create a column Usage_Per_Year from Miles_Driven_Per_Year by discretizing the values into three equally sized categories. The names of the categories should be Low, Medium, and High.
2. Group by Usage_Per_Year and print the group sizes as well as the ranges of each.
3. Do the same as in #1, but instead of equally sized categories, create categories that have the same number of points per category.
4. Group by Usage_Per_Year and print the group sizes as well as the ranges of each.
My code is below:
df["Usage_Per_Year "], bins = pd.cut(df["Miles_Driven_Per_Year"], 3, precision=2, retbins=True)
group_label = pd.Series(["Low", "Medium", "High"])
#3.3.2
group_size = df.groupby("Usage_Per_Year").size()
#print(group_size)
print(group_size.reset_index().set_index(group_label))
#3.3.3
Year2 = pd.cut(df["Miles_Driven_Per_Year"], 3, precision=2)
group_label = pd.Series(["Low", "Medium", "High"])
#3.3.4
group_size = df.groupby("Usage_Per_Year").size()
#print(group_size)
print(group_size.reset_index().set_index(group_label))
The output is below:
                   Usage_Per_Year     0
Low       (-1925.883, 663476.235]  6018
Medium  (663476.235, 1326888.118]     0
High     (1326888.118, 1990300.0]     1
                   Usage_Per_Year     0
Low       (-1925.883, 663476.235]  6018
Medium  (663476.235, 1326888.118]     0
High     (1326888.118, 1990300.0]     1
But -1925 is wrong...
The right answer should be like this.
How can I do this?
Maybe a typo on line 1: df["Usage_Per_Year "]? There is a space at the end of the column name.
pd.cut bins values into equal-width intervals; that's why all of your bins span the same range of values. It seems that you should compute the min and max of each group after binning.
Also, to bin values into equal-frequency groups, you should use pd.qcut.
Example input:
import numpy as np
import pandas as pd

rng = np.random.default_rng(20210514)
df = pd.DataFrame({
    'Miles_Driven_Per_Year': rng.gamma(1.05, 10000, (1000,)).astype(int)
})

# 1
group_label = ['Low', 'Medium', 'High']
df['Usage_Per_Year'] = pd.cut(df['Miles_Driven_Per_Year'],
                              bins=3, labels=group_label)
# 2
print(df.groupby('Usage_Per_Year').agg(['count', 'min', 'max']))
# 3
df['Usage_Per_Year'] = pd.qcut(df['Miles_Driven_Per_Year'],
                               q=3, labels=group_label)
# 4
print(df.groupby('Usage_Per_Year').agg(['count', 'min', 'max']))
Example output:
Miles_Driven_Per_Year
count min max
Usage_Per_Year
Low 878 31 20905
Medium 107 20955 41196
High 15 41991 62668
Miles_Driven_Per_Year
count min max
Usage_Per_Year
Low 334 31 4378
Medium 333 4449 11424
High 333 11442 62668

pandas qcut and pandas cut functions do not distribute the number of items uniformly

I'm using pandas.qcut and pandas.cut to distribute the number of items uniformly across deciles, based on a calculated probability. However, the items do not get distributed uniformly across deciles in either case. Below is my code for each:
pd.qcut(df['prob'], q=10, labels=False, duplicates='drop')
pd.cut(df['prob'], bins=10, labels=False)
Below is what I get in each case:
for pd.qcut:
Decile Count of Items
0 20300
1 7000
2 13800
3 14000
4 13000
5 13800
6 13700
7 14600
8 19000
9 70000
for pd.cut:
Decile Count of Items
0 1700
1 19000
2 39000
3 39000
4 32000
5 3100
6 3000
7 100
8 20
9 25
I didn't put the exact numbers, but the magnitude should give an idea. The probability ranges from 0.01 to 0.15.
How can I distribute the items evenly across deciles?
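The thread left this unanswered, but the symptoms suggest heavy ties in prob: when many values are equal, several quantile edges coincide and duplicates='drop' merges those bins, which is how one "decile" ends up with 70000 items. A minimal sketch of a standard workaround, assuming it is acceptable to split tied values across bins, is to rank before binning:
# rank(method='first') gives tied values distinct ranks, so all ten
# quantile edges are distinct and the deciles come out (near-)equal in size
df['decile'] = pd.qcut(df['prob'].rank(method='first'), q=10, labels=False)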

Summarising features with multiple values in Python for Machine Learning model

I have a data file containing different foetal ultrasound measurements. The measurements are collected at different points during pregnancy, like so:
PregnancyID MotherID gestationalAgeInWeeks abdomCirc
0 0 14 150
0 0 21 200
1 1 20 294
1 1 25 315
1 1 30 350
2 2 8 170
2 2 9 180
2 2 18 NaN
As you can see from the table above, I have multiple measurements per pregnancy (between 1 and 26 observations each).
I want to summarise the ultrasound measurements somehow such that I can replace the multiple measurements with a fixed amount of features per pregnancy. So I thought of creating 3 new features, one for each trimester of pregnancy that would hold the maximum measurement recorded during that trimester:
abdomCirc1st: this feature would hold the maximum value of all abdominal circumference measurements measured between 0 to 13 Weeks
abdomCirc2nd: this feature would hold the maximum value of all abdominal circumference measurements measured between 14 to 26 Weeks
abdomCirc3rd: this feature would hold the maximum value of all abdominal circumference measurements measured between 27 to 40 Weeks
So my final dataset would look like this:
PregnancyID MotherID abdomCirc1st abdomCirc2nd abdomCirc3rd
0 0 NaN 200 NaN
1 1 NaN 315 350
2 2 180 NaN NaN
The reason for using the maximum here is that a larger abdominal circumference is associated with the adverse outcome I am trying to predict.
But I am quite confused about how to go about this. I have used the groupby function previously to derive certain statistical features from the multiple measurements; however, this is a more complex task.
What I want to do is the following:
Group all abdominal circumference measurements that belong to the same pregnancy into 3 trimesters based on gestationalAgeInWeeks value
Compute the maximum value of all abdominal circumference measurements within each trimester, and assign this value to the relevant feature: abdomCirc1st, abdomCirc2nd or abdomCirc3rd.
I think I have to do something along the lines of:
df["abdomCirc1st"] = df.groupby(['MotherID', 'PregnancyID', 'gestationalAgeInWeeks'])["abdomCirc"].transform('max')
But this code does not check what trimester the measurement was taken in (gestationalAgeInWeeks). I would appreciate some help with this task.
You can try this. It's a bit of a complicated chain, but it seems to work:
(df.groupby(['MotherID', 'PregnancyID'])
   .apply(lambda d: d.assign(tm=(d['gestationalAgeInWeeks'] + 13 - 1) // 13)
                     .groupby('tm')['abdomCirc']
                     .apply(max))
   .unstack()
)
produces
tm                       1      2      3
MotherID PregnancyID
0        0             NaN  200.0    NaN
1        1             NaN  315.0  350.0
2        2           180.0    NaN    NaN
Let's unpick this a bit. First we groupby on MotherID, PregnancyID. Then we apply a function to each grouped dataframe (d).
For each d, we create a 'trimester' column 'tm' via assign (I assume I got the math right here, but correct it if it is wrong!), then we groupby 'tm' and apply max. For each sub-dataframe d we then obtain a Series which is tm:max(abdomCirc).
Then we unstack(), which moves tm to the column names.
You may want to rename these columns later, but I did not bother.
Solution 2
Come to think of it you can simplify the above a bit:
(df.assign(tm=(df['gestationalAgeInWeeks'] + 13 - 1) // 13)
   .drop(columns='gestationalAgeInWeeks')
   .groupby(['MotherID', 'PregnancyID', 'tm'])
   .agg('max')
   .unstack()
)
similar idea, same output.
There is a magic command called query. This should do your work for now:
abdomCirc1st = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks <= 13')['abdomCirc'].max()
abdomCirc2nd = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks >= 14 and gestationalAgeInWeeks <= 26')['abdomCirc'].max()
abdomCirc3rd = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks >= 27 and gestationalAgeInWeeks <= 40')['abdomCirc'].max()
If you want something more automatic (and not manually changing the values of your ID's: MotherID and PregnancyID, every time for each different group of rows), you have to combine it with groupby (as you did on your own)
Check this as well: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html
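For a fully automatic version, a minimal sketch (not from the original answers; the trimester edges and the trimester column name are assumptions here) that combines pd.cut with pivot_table:
# Bin weeks into the three trimesters, then spread each trimester's
# maximum abdominal circumference into its own column
tm = pd.cut(df['gestationalAgeInWeeks'], bins=[0, 13, 26, 40],
            labels=['abdomCirc1st', 'abdomCirc2nd', 'abdomCirc3rd'])
out = (df.assign(trimester=tm)
         .pivot_table(index=['MotherID', 'PregnancyID'],
                      columns='trimester', values='abdomCirc',
                      aggfunc='max'))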

Create subcolumns in pandas dataframe python

I have a dataframe with multiple columns
import pandas as pd

df = pd.DataFrame({"cylinders": [2, 2, 1, 1],
                   "horsepower": [120, 100, 80, 70],
                   "weight": [5400, 6200, 7200, 1200]})
cylinders horsepower weight
0 2 120 5400
1 2 100 6200
2 1 80 7200
3 1 70 1200
I would like to create a new dataframe and make two subcolumns of weight with the median and mean while grouping by cylinders.
example:
weight
cylinders horsepower median mean
0 1 100 5299 5000
1 1 120 5100 5200
2 2 70 7200 6500
3 2 80 1200 1000
For my example tables I have used random values. I can't manage to achieve that.
I know how to get the median and mean; it's described in this Stack Overflow question:
df.weight.median()
df.weight.mean()
df.groupby('cylinders') #groupby cylinders
But how to create this subcolumn?
The following code fragment adds the two requested columns. It groups the rows by cylinders, calculates the mean and median of weight, and joins the result back onto the original dataframe via the cylinders column:
result = df.join(df.groupby('cylinders')['weight']
                   .agg(['mean', 'median']),
                 on='cylinders')
#    cylinders  horsepower  weight    mean  median
# 0          2         120    5400  5800.0  5800.0
# 1          2         100    6200  5800.0  5800.0
# 2          1          80    7200  4200.0  4200.0
# 3          1          70    1200  4200.0  4200.0
You cannot have "subcolumns" for only some columns in pandas. If one column has "subcolumns," all other columns must have "subcolumns" too. This is called multiindexing.
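That said, a summary table with the layout sketched in the question can be built by putting a MultiIndex on the columns. A minimal sketch (stats is a name introduced here):
# Aggregate per cylinder group, then lift the stat names
# under a top-level 'weight' header
stats = df.groupby('cylinders')['weight'].agg(['median', 'mean'])
stats.columns = pd.MultiIndex.from_product([['weight'], stats.columns])
print(stats)
#            weight
#            median    mean
# cylinders
# 1          4200.0  4200.0
# 2          5800.0  5800.0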
