Generate buckets of a numerical variable using interquartile range in pandas - python

I have some data stored in my pandas dataframe that shows salary for a bunch of users and their category.
| category | user_id | salary |
|----------|-----------|--------|
| A | 546457568 | 49203 |
| C | 356835679 | 49694 |
| A | 356785637 | 48766 |
| B | 45668758 | 36627 |
| C | 686794 | 59508 |
| C | 234232376 | 32765 |
| C | 4356345 | 44058 |
| A | 9878987 | 9999999 |
What I would like to do is generate a new column salary_bucket that shows a bucket for salary, determined from the upper/lower limits of the interquartile range for salary.
e.g. calculate the upper/lower limits as q1 - 1.5 x iqr and q3 + 1.5 x iqr, then split this range into 10 equal buckets and assign each row to the relevant bucket based on salary. I know from exploration that there is no data below the lower limit, but for data above the upper limit I would like a separate bucket such as outside_iqr.
In the end I would like to get something like so:
| category | user_id | salary | salary_bucket |
|----------|-----------|--------|---------------|
| A | 546457568 | 49203 | 7 |
| C | 356835679 | 49694 | 7 |
| A | 356785637 | 48766 | 7 |
| B | 45668758 | 36627 | 3 |
| C | 686794 | 59508 | 5 |
| C | 234232376 | 32765 | 3 |
| C | 4356345 | 44058 | 4 |
| A | 9878987 | 9999999 | outside_iqr |
(these bucket values are not actually calculated, they are just for illustration's sake)
Is something like qcut useful here?

You can use pandas.cut to turn continuous data into categorical data.
import pandas as pd

# First, we need to calculate our IQR.
q1 = df.salary.quantile(0.25)
q3 = df.salary.quantile(0.75)
iqr = q3 - q1
# Now let's calculate the upper and lower bounds.
lower = q1 - 1.5*iqr
upper = q3 + 1.5*iqr
# Let us create our bins:
num_bins = 10
bin_width = (upper - lower) / num_bins
bins = [lower + i*bin_width for i in range(num_bins)]
bins += [upper, float('inf')]  # The last bin catches any value greater than the upper bound of the IQR.
# Let us create our labels:
labels = [f'Bucket {i}' for i in range(1, num_bins+1)]
labels.append('Outside IQR')
# Finally, we add a new column to the df:
df['salary_bucket'] = pd.cut(df.salary, bins=bins, labels=labels)
So basically, you'll need to generate your own list of buckets and labels according to what you require, and then pass those as arguments to pandas.cut.
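If you want labels that match the desired output in the question (bucket numbers plus an outside_iqr label), here is a slightly more compact sketch of the same idea using numpy.linspace to build the edges. It assumes df has a salary column as in the question and is equivalent to the code above, not a different method:

import numpy as np
import pandas as pd

q1, q3 = df.salary.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

num_bins = 10
# num_bins equal-width edges between the limits, plus an open-ended edge for values above the upper limit
bins = list(np.linspace(lower, upper, num_bins + 1)) + [float('inf')]
labels = [str(i) for i in range(1, num_bins + 1)] + ['outside_iqr']

df['salary_bucket'] = pd.cut(df.salary, bins=bins, labels=labels)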

Related

PySpark - How to group by rows and then map them using custom function

Let's say I have a table which looks like this:
| id | value_one | type | value_two |
|----|-----------|------|-----------|
| 1 | 2 | A | 1 |
| 1 | 4 | B | 1 |
| 2 | 3 | A | 2 |
| 2 | 1 | B | 3 |
I know that there are only A and B types for a given id. What I want to achieve is to group those two rows and calculate a new type using the formula A/B; it should be applied to value_one and value_two, so the table afterwards should look like:
| id | value_one | type | value_two|
|----|-----------| -----|----------|
| 1 | 0.5 | C | 1 |
| 2 | 3 | C | 0.66 |
I am new to PySpark and so far I haven't been able to achieve the described result; I would appreciate any tips/solutions.
You can consider dividing the original dataframe into two parts according to type, and then use SQL statements to implement the calculation logic.
df.filter('type = "A"').createOrReplaceTempView('tmp1')
df.filter('type = "B"').createOrReplaceTempView('tmp2')
sql = """
select
tmp1.id
,tmp1.value_one / tmp2.value_one as value_one
,'C' as type
,tmp1.value_two / tmp2.value_two as value_two
from tmp1 join tmp2 using (id)
"""
reuslt_df = spark.sql(sql)
reuslt_df.show(truncate=False)
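If you prefer to stay in the DataFrame API rather than SQL, a roughly equivalent sketch (assuming the same df with columns id, value_one, type and value_two) could look like this:

from pyspark.sql import functions as F

df_a = df.filter(F.col('type') == 'A').select(
    'id',
    F.col('value_one').alias('a_one'),
    F.col('value_two').alias('a_two'),
)
df_b = df.filter(F.col('type') == 'B').select(
    'id',
    F.col('value_one').alias('b_one'),
    F.col('value_two').alias('b_two'),
)

# Join the A and B rows per id, then divide column-wise
result_df = df_a.join(df_b, on='id').select(
    'id',
    (F.col('a_one') / F.col('b_one')).alias('value_one'),
    F.lit('C').alias('type'),
    (F.col('a_two') / F.col('b_two')).alias('value_two'),
)
result_df.show(truncate=False)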

Generating means of combinations in multiindex dataframe (Pandas)

I have a multiindexed dataframe where the index levels have multiple categories, something like this:
| Level1 | Level2 | Level3 | Var1 | Var2 | Var3 |
|--------|--------|--------|------|------|------|
| A      | A      | A      |      |      |      |
| A      | A      | B      |      |      |      |
| A      | B      | A      |      |      |      |
| A      | B      | B      |      |      |      |
| B      | A      | A      |      |      |      |
| B      | A      | B      |      |      |      |
| B      | B      | A      |      |      |      |
| B      | B      | B      |      |      |      |
In summary, and specifically in my case, Level1 has 2 values, Level2 has 24, Level3 has 6, and there are also Level4 (674 values) and Level5 (9 values), with some minor variation depending on the specific higher-level values (Level1 == 1 actually has 24 Level2s, but Level1 == 2 has 23).
I need to generate all possible combinations of 3 at Level 5, then calculate their means for Vars 1-3.
I am trying something like this:
import itertools
import pandas as pd

# Resulting df to be populated
df_result = pd.DataFrame([])
# Retrieving values at Level1
lev1s = df.index.get_level_values("Level1").unique()
# Looping through each Level1 value
for lev1 in lev1s:
    # Filtering df based on Level1 value
    df_lev1 = df.query('Level1 == ' + str(lev1))
    # Repeating...
    lev2s = df_lev1.index.get_level_values("Level2").unique()
    for lev2 in lev2s:
        df_lev2 = df_lev1.query('Level2 == ' + str(lev2))
        # ... until Level3
        lev3s = df_lev2.index.get_level_values("Level3").unique()
        # Creating all combinations
        combs = itertools.combinations(lev3s, 3)
        # Looping through each combination
        for comb in combs:
            # Filtering values in combination
            df_comb = df_lev2.query('Level3 in ' + str(comb))
            # Calculating means using groupby (groupby might not be necessary,
            # but I don't believe it has much of an impact)
            df_means = df_comb.reset_index().groupby(['Level1', 'Level2']).mean()
            # Extending resulting dataframe
            df_result = df_result.append(df_means)
The thing is, after a little while this process gets really slow. Since I have around 2 * 24 * 6 * 674 ≈ 194,000 index combinations and 84 combinations (choosing 3 out of 9 elements), I am expecting more than 16 million df_means frames to be calculated.
Is there any more efficient way to do this?
Thank you.
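As a side note, one common cause of this kind of slowdown is growing the result with DataFrame.append inside the loop: every call copies all of the data accumulated so far (and DataFrame.append is deprecated in recent pandas). A hedged sketch of a simple speed-up, reusing the question's own variable names, is to collect the per-combination means in a plain list and concatenate once at the end:

results = []  # collect each df_means here instead of appending to a DataFrame
for lev1 in lev1s:
    # ... same filtering and combination loops as above ...
    for comb in combs:
        df_comb = df_lev2.query('Level3 in ' + str(comb))
        results.append(df_comb.reset_index().groupby(['Level1', 'Level2']).mean())
# Concatenate once, after all loops have finished
df_result = pd.concat(results)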

Drop duplicates based on first level column in MultiIndex DataFrame

I have a MultiIndex Pandas DataFrame like so:
+---+------------------+----------+--------+--------+-----------+--------+---------+-----------+
|   | VECTOR           | SEGMENTS | OVERALL                    | INDIVIDUAL                  |
|   |                  |          | TIP X  | TIP Y  | CURVATURE | TIP X  | TIP Y   | CURVATURE |
+---+------------------+----------+--------+--------+-----------+--------+---------+-----------+
| 0 | (TOP, TOP)       | 2        | 3.24   | 1.309  | 44        | 1.62   | 0.6545  | 22        |
| 1 | (TOP, BOTTOM)    | 2        | 3.495  | 0.679  | 22        | 1.7475 | 0.3395  | 11        |
| 2 | (BOTTOM, TOP)    | 2        | 3.495  | -0.679 | -22       | 1.7475 | -0.3395 | -11       |
| 3 | (BOTTOM, BOTTOM) | 2        | 3.24   | -1.309 | -44       | 1.62   | -0.6545 | -22       |
+---+------------------+----------+--------+--------+-----------+--------+---------+-----------+
How can I drop duplicates based on all columns contained under 'OVERALL' or 'INDIVIDUAL'? So if I choose 'INDIVIDUAL' to drop duplicates from the values of TIP X, TIP Y, and CURVATURE under INDIVIDUAL must all match for it to be a duplicate?
And further, as you can see from the table 1 and 2 are duplicates that are simply mirrored about the x-axis. These must also be dropped.
Also, can I center the OVERALL and INDIVIDUAL headings?
EDIT: frame.drop_duplicates(subset=['INDIVIDUAL'], inplace=True) produces KeyError: Index(['INDIVIDUAL'], dtype='object')
You can pass pandas' .drop_duplicates a subset of tuples for MultiIndexed columns:
df.drop_duplicates(subset=[
    ('INDIVIDUAL', 'TIP X'),
    ('INDIVIDUAL', 'TIP Y'),
    ('INDIVIDUAL', 'CURVATURE')
])
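If you'd rather not spell out each tuple, a small hedged convenience (assuming the same column layout) is to build the subset list from the sub-frame's columns:

# Every ('INDIVIDUAL', <second-level column>) tuple, built programmatically
subset = [('INDIVIDUAL', col) for col in df['INDIVIDUAL'].columns]
df.drop_duplicates(subset=subset)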
Or, if your row indices are unique, you could use the following approach that saves some typing:
df.loc[df['INDIVIDUAL'].drop_duplicates().index]
Update:
As you suggested in the comments, if you want to do operations on the dataframe you can do that in-line:
df.loc[df['INDIVIDUAL'].abs().drop_duplicates().index]
Or for non-pandas functions, you can use .transform:
df.loc[df['INDIVIDUAL'].transform(np.abs).drop_duplicates().index]

Pandas inter-column referencing

I have some data as follows:
+--------+------+
| Reason | Keys |
+--------+------+
| x | a |
| y | a |
| z | a |
| y | b |
| z | b |
| x | c |
| w | d |
| x | d |
| w | d |
+--------+------+
I want to get the Reason corresponding to the first occurrence of each Key. Here, I should get Reasons x, y, x, w for Keys a, b, c, d respectively. After that, I want to compute the percentage of each Reason, i.e. a metric for how often each Reason occurs: thus x = 2/4 = 50%, and w and y = 25% each.
For the percentage, I think I can use something like value_counts(normalize=True) * 100, based on the previous step. What is a good way to proceed?
You are right about the second step, and the first step can be achieved with
summary = df.groupby("Keys").first()
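Putting both steps together, a small hedged sketch (reconstructing the example data from the question) might look like this:

import pandas as pd

df = pd.DataFrame({
    'Reason': ['x', 'y', 'z', 'y', 'z', 'x', 'w', 'x', 'w'],
    'Keys':   ['a', 'a', 'a', 'b', 'b', 'c', 'd', 'd', 'd'],
})

# First Reason per Key: a -> x, b -> y, c -> x, d -> w
first_reasons = df.groupby('Keys')['Reason'].first()
# Percentage of each first Reason: x 50.0, y 25.0, w 25.0
percentages = first_reasons.value_counts(normalize=True) * 100
print(percentages)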
You can use drop_duplicates:
df.drop_duplicates(['Reason'])
Out[207]:
  Reason Keys
0      x    a
1      y    a
2      z    a
6      w    d

error creating python table using Agate library

I am using the Agate library to create a table.
Using the command:
table = agate.Table(cpi_rows, cpi_types, cpi_titles)
Sample values are as below :
cpi_rows[0]
[1.0,'Denmark','DNK',128.0,'EU',1.0,91.0,7.0,2.2,87.0,95.0,83.0,98.0,0.0,97.0,0.0,96.0,98.0,0.0,87.0,89.0,88.0,83.0,0.0,0.0,0.0]
cpi_titles
['Country Rank','Country / Territory','WB Code','IFS Code','Region','Country Rank','CPI 2013 Score', 'Surveys Used','Standard Error', '90% Confidence interval Lower', 'Upper','Scores range MIN','MAX','Data sources AFDB','BF (SGI)','BF (BTI)','IMD','ICRG','WB','WEF','WJP','EIU','GI','PERC','TI','FH']
When I run the command, I am getting the error as :
ValueError: Column names must be strings or None.
Though all the names in cpi_titles are strings, I am unable to find the cause of the error.
I just tried your code, and apart from a few corrections to names it worked without a problem:
cpi_rows = [[]]
cpi_rows[0] =[1.0,'Denmark','DNK',128.0,'EU',1.0,91.0,7.0,2.2,87.0,95.0,83.0,98.0,0.0,97.0,0.0,96.0,98.0,0.0,87.0,89.0,88.0,83.0,0.0,0.0,0.0]
cpi_titles = ['Country Rank','Country / Territory','WB Code','IFS Code','Region','Country Rank','CPI 2013 Score', 'Surveys Used','Standard Error', '90% Confidence interval Lower', 'Upper','Scores range MIN','MAX','Data sources AFDB','BF (SGI)','BF (BTI)','IMD','ICRG','WB','WEF','WJP','EIU','GI','PERC','TI','FH']
table = agate.Table(cpi_rows, cpi_titles)
table.print_structure()
The output is
| column | data_type |
| ----------------------------- | --------- |
| Country Rank | Number |
| Country / Territory | Text |
| WB Code | Text |
| IFS Code | Number |
| Region | Text |
| Country Rank_2 | Number |
| CPI 2013 Score | Number |
| Surveys Used | Number |
| Standard Error | Number |
| 90% Confidence interval Lower | Number |
| Upper | Number |
| Scores range MIN | Number |
| MAX | Number |
| Data sources AFDB | Number |
| BF (SGI) | Number |
| BF (BTI) | Number |
| IMD | Number |
| ICRG | Number |
| WB | Number |
| WEF | Number |
| WJP | Number |
| EIU | Number |
| GI | Number |
| PERC | Number |
| TI | Number |
| FH | Number |
Obviously, I don't have your definition of the types you want to apply to this data. The only other thing to note is that you have defined Country Rank twice in your column titles, so Agate warns you about this and relabels it.
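For what it's worth, the original error is most likely an argument-order issue: if I remember the agate signature correctly, agate.Table expects the column names before the column types, so passing cpi_types as the second positional argument makes agate try to use the types as names. A hedged sketch of a call that keeps your type definitions (assuming cpi_types is a list of agate type objects):

# Pass names and types explicitly by keyword to avoid positional mix-ups
table = agate.Table(cpi_rows, column_names=cpi_titles, column_types=cpi_types)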
