I have one df that categorizes income into tiers across males and females and thousands of zip codes. I need to add a column to df2 that maps each person's income level by zip code (average, above average, etc.).
The idea is to assign the highest tier whose cutoff is exceeded by a given person's income, or to assign the lowest tier by default.
The income level for each tier also varies by zip code. For certain zip codes there are a limited number of tiers (e.g. no very high incomes). There are also separate tiers for males by zip code, not shown here for space.
I think I need to create some sort of dictionary, but I'm not sure how to handle this. Any help would go a long way, thanks.
Edit: The first df acts as a key, and I am looking to use it to assign the corresponding row value from the column 'Income Level' to df2.
E.g. for a unique id in df2, compare df2['Annual Income'] to the cutoffs for the matching id in df['Annual Income cutoff'], then assign the highest qualifying income level from df as a new column value in df2.
import pandas as pd
import numpy as np
data = [['female', 10009, 'very high', 10000000],
        ['female', 10009, 'high', 100000],
        ['female', 10009, 'above average', 75000],
        ['female', 10009, 'average', 50000]]
df = pd.DataFrame(data, columns=['Sex', 'Area Code', 'Income level', 'Annual Income cutoff'])
print(df)
Sex Area Code Income level Annual Income cutoff
0 female 10009 very high 10000000
1 female 10009 high 100000
2 female 10009 above average 75000
3 female 10009 average 50000
data_2 = [['female', 10009, 98000], ['female', 10009, 56000]]
df2 = pd.DataFrame(data_2, columns=['Sex', 'Area Code', 'Annual Income'])
print(df2)
Sex Area Code Annual Income
0 female 10009 98000
1 female 10009 56000
output_data = [['female', 10009, 98000, 'above average'], ['female', 10009, 56000, 'average']]
final_output = pd.DataFrame(output_data, columns=['Sex', 'Area Code', 'Annual Income', 'Income Level'])
print(final_output)
Sex Area Code Annual Income Income Level
0 female 10009 98000 above average
1 female 10009 56000 average
One way to do this is to use pd.merge_asof:
pd.merge_asof(df2.sort_values('Annual Income'),
              df.sort_values('Annual Income cutoff'),
              left_on='Annual Income',
              right_on='Annual Income cutoff',
              by=['Sex', 'Area Code'],
              direction='backward')
Output:
      Sex  Area Code  Annual Income   Income level  Annual Income cutoff
0  female      10009          56000        average                 50000
1  female      10009          98000  above average                 75000
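One caveat: with direction='backward', an income below a group's lowest cutoff gets no match, so 'Income level' comes back NaN. Per the question's rule of assigning the lowest tier by default, those rows can be backfilled. A minimal sketch, assuming the same df and df2 as above (result and lowest are illustrative names):
result = pd.merge_asof(df2.sort_values('Annual Income'),
                       df.sort_values('Annual Income cutoff'),
                       left_on='Annual Income',
                       right_on='Annual Income cutoff',
                       by=['Sex', 'Area Code'],
                       direction='backward')

# Lowest tier per (Sex, Area Code), i.e. the row with the smallest cutoff.
lowest = (df.sort_values('Annual Income cutoff')
            .groupby(['Sex', 'Area Code'])['Income level']
            .first())

# Backfill unmatched rows with their group's lowest tier.
mask = result['Income level'].isna()
result.loc[mask, 'Income level'] = (result.loc[mask, ['Sex', 'Area Code']]
                                          .apply(tuple, axis=1)
                                          .map(lowest))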
I want to extract the specific rows whose names contain specific keywords. Below you can see my data.
import numpy as np
import pandas as pd
data = {
'Names': ['Store (007) Total amount of Sales ',
'Store perc (65) Total amount of sales ',
'Mall store, aid (005) Total amount of sales',
'Increase in the value of sales / Additional seling (22) Total amount of sales',
'Dividends (0233) Amount of income tax',
'Other income (098) Total amount of Sales',
'Other income (0245) Amount of Income Tax',
],
'Sales':[10,10,9,7,5,5,5],
}
df = pd.DataFrame(data, columns=['Names', 'Sales'])
df
This data has some specific rows that I need selected into a separate data frame. The keywords for this selection are the phrases Total amount of Sales or Total amount of sales. These words are placed after the closing bracket ). Also please take into account that the text is not trimmed, so extra spaces are possible.
Can anybody help me solve this?
Use Series.str.contains with case=False (case-insensitive matching) in boolean indexing:
df1 = df[df['Names'].str.contains('Total amount of Sales', case=False)]
print (df1)
Names Sales
0 Store (007) Total amount of Sales 10
1 Store perc (65) Total amount of sales 10
2 Mall store, aid (005) Total amount of sales 9
3 Increase in the value of sales / Additional se... 7
5 Other income (098) Total amount of Sales 5
Or, if you need to match only sales or Sales exactly, use:
df2 = df[df['Names'].str.contains('Total amount of [Ss]ales')]
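The question also notes the keywords sit after the closing bracket ) and that the text is not trimmed. A hedged sketch tightening the pattern accordingly (pattern and df3 are illustrative names):
# Require the phrase to follow ")" (with optional spaces) at the end of the string.
pattern = r'\)\s*Total amount of [Ss]ales\s*$'
df3 = df[df['Names'].str.contains(pattern, regex=True)]
print(df3)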
I have two different dataframes, one containing the Net Revenue by SKU and Supplier, and another containing the stock of SKUs in each store. For each supplier, I need the average number of stores carrying the SKUs that compose up to 90% of the supplier's net revenue. It's a bit complicated, but I will give an example that I hope makes it clear. Please note that if 3 SKUs compose 89% of the revenue, we need to consider another one.
Example:
Dataframe 1 - Net Revenue

Supplier   SKU    Net Revenue
UNILEVER   1111   10000
UNILEVER   2222   50000
UNILEVER   3333   500
PEPSICO    1313   680
PEPSICO    2424   10000
PEPSICO    2323   450
Dataframe 2 - Stock

Store   SKU    Stock
1       1111   1
1       2222   2
1       3333   1
2       1111   1
2       2222   0
2       3333   1
In this case, for UNILEVER, we need to discard SKU 3333 because its net revenue is not relevant (1111 and 2222 already compose more than 90% of UNILEVER's total net revenue). Coverage in this case will be 1.5 (we have 1111 in 2 stores and 2222 in one store: (2+1)/2).
Result is something like this:

Supplier   Coverage
UNILEVER   1.5
PEPSICO    ...
Please note that the real dataset has a different number of SKUs per supplier and a large number of suppliers (around 150), so performance doesn't need to be the top priority, but it has to be considered.
Thanks in advance, guys.
Calculate the cumulative sum grouped by Supplier and divide it by the supplier's total revenue.
Then find each supplier's revenue threshold by taking the minimum cumulative revenue percentage at or above 90%.
Then you can get the list of SKUs by supplier and calculate the coverage.
import pandas as pd
df = pd.DataFrame([
['UNILEVER', '1111', 10000],
['UNILEVER', '2222', 50000],
['UNILEVER', '3333', 500],
['PEPSICO', '1313', 680],
['PEPSICO', '2424', 10000],
['PEPSICO', '2323', 450],
], columns=['Supplier', 'SKU', 'Net Revenue'])
# Sum only the revenue column (summing the string SKU column would fail or concatenate)
total_revenue_by_supplier = df.groupby('Supplier', as_index=False)['Net Revenue'].sum()
total_revenue_by_supplier.columns = ['Supplier', 'Total Revenue']
df = df.sort_values(['Supplier', 'Net Revenue'], ascending=[True, False])
df['cumsum'] = df.groupby('Supplier')['Net Revenue'].cumsum()
df = df.merge(total_revenue_by_supplier, on='Supplier')
df['cumpercentage'] = df['cumsum'] / df['Total Revenue']
min_before_threshold = df[df['cumpercentage'] >= 0.9][['Supplier', 'cumpercentage']].groupby('Supplier').min().reset_index()
min_before_threshold.columns = ['Supplier', 'Revenue Threshold']
df = df.merge(min_before_threshold, on='Supplier')
df = df[df['cumpercentage'] <= df['Revenue Threshold']][['Supplier', 'SKU', 'Net Revenue']]
df
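The snippet above stops at the list of relevant SKUs. A sketch of the remaining coverage step, assuming a stock frame shaped like "Dataframe 2 - Stock" in the question (df_stock is an illustrative name; its columns are taken from that table):
# df here is the filtered frame from the snippet above.
df_stock = pd.DataFrame([
    [1, '1111', 1],
    [1, '2222', 2],
    [1, '3333', 1],
    [2, '1111', 1],
    [2, '2222', 0],
    [2, '3333', 1],
], columns=['Store', 'SKU', 'Stock'])

# Keep only relevant SKUs in stores that actually hold stock, count the
# distinct stores per SKU, then average those counts per supplier.
relevant = df.merge(df_stock[df_stock['Stock'] > 0], on='SKU')
stores_per_sku = relevant.groupby(['Supplier', 'SKU'])['Store'].nunique()
coverage = stores_per_sku.groupby(level='Supplier').mean().reset_index(name='Coverage')
print(coverage)  # UNILEVER -> 1.5, matching the question's example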
I have a pandas dataframe where I want to find the percentage cutoff for the most prevalent brand a customer added to his cart (new column: 'Percentage Cutoff for brand').
customerID   Date       Brand            Brand Count   Percentage Cutoff for brand
1            1-1-2021   Tomy Hilfigure   3             75%
1            1-1-2021   Adidas           1             75%
2            2-1-2021   Club Room        2             66%
2            2-1-2021   Levis            1             66%
2            3-1-2021   Adidas           4             50%
2            3-1-2021   Polo             4             50%
For customer 1, the percentage cutoff will be 75%, as he added 3 items of the Tomy Hilfigure brand to his cart and 1 item of the Adidas brand (25%); hence the percentage cutoff for customer 1 is 75% for date 1-1-2021.
For customer 2, on date 2-1-2021, the percentage cutoff will be 66.67%, as he added 2 items of the Club Room brand (66.67%) and 1 item of the Levis brand (33.33%).
I am using the pandas groupby function but couldn't find the "Percentage Cutoff for brand". It would be great if you could give me a direction. Thank you.
Let me know if the cutoff calculation logic is not right; I used max_brand_count / total_brand_count.
# Grouping by customerID and Date to calculate max and total brand count
gdf = df.groupby(['customerID', 'Date']) \
.agg(
max_brand_count=('Brand Count', 'max'),
total_brand_count=('Brand Count', 'sum')
) \
.reset_index()
# Calculate Percentage Cutoff for brand by dividing max and total brand counts
gdf['Percentage Cutoff for brand'] = gdf['max_brand_count'] / gdf['total_brand_count']
# Formatting it to percentage
gdf['Percentage Cutoff for brand'] = ['{:,.2%}'.format(val) for val in gdf['Percentage Cutoff for brand']]
Output of this groupby:

   customerID      Date  max_brand_count  total_brand_count Percentage Cutoff for brand
0           1  1-1-2021                3                  4                      75.00%
1           2  2-1-2021                2                  3                      66.67%
2           2  3-1-2021                4                  8                      50.00%
You can merge this to your original df if you want to have it all together.
final_df = df.merge(gdf, how='left', on=['customerID', 'Date'])
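An alternative sketch that skips the merge step: compute the ratio with two groupby transforms and write it straight into the original df.
# Per-group max and sum aligned back to the original rows via transform.
grp = df.groupby(['customerID', 'Date'])['Brand Count']
df['Percentage Cutoff for brand'] = (grp.transform('max')
                                     / grp.transform('sum')).map('{:,.2%}'.format)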
Let's assume you are selling a product globally and you want to set up a sales office somewhere in a major city. Your decision will be based purely on sales numbers.
This will be your (simplified) sales data:
import pandas as pd

df = {
    'Product': 'Chair',
    'Country': ['USA', 'USA', 'China', 'China', 'China', 'China', 'India',
                'India', 'India', 'India', 'India', 'India', 'India'],
    'Region': ['USA_West', 'USA_East', 'China_West', 'China_East', 'China_South',
               'China_South', 'India_North', 'India_North', 'India_North',
               'India_West', 'India_West', 'India_East', 'India_South'],
    'City': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M'],
    'Sales': [1000, 1000, 1200, 200, 200, 200, 500, 350, 350, 100, 700, 50, 50],
}
dff = pd.DataFrame.from_dict(df)
dff
Based on the data you should go for City "G".
The logic should go like this:
1) Find country with Max(sales)
2) in that country, find region with Max(sales)
3) in that region, find city with Max(sales)
I tried dff.groupby(['Product', 'City']).apply(lambda x: x.nlargest(1)), but this doesn't work, because it would propose city "C". That is the city with the highest sales globally, but China is not the country with the highest sales.
I probably have to go through several rounds of groupby: based on the result, filter the original dataframe and do a groupby again on the next level.
To add to the complexity, you sell other products too (not just chairs, but other furniture as well). You would have to store the result of each iteration (like the country with max sales per product) somewhere and then use it in the next iteration of the groupby.
Do you have any ideas how I could implement this in pandas/python?
The idea is to aggregate the sum at each level and take the top value with Series.idxmax, which is then used to filter the next level via boolean indexing:
max_country = dff.groupby('Country')['Sales'].sum().idxmax()
max_region = dff[dff['Country'] == max_country].groupby('Region')['Sales'].sum().idxmax()
max_city = dff[dff['Region'] == max_region].groupby('City')['Sales'].sum().idxmax()
print (max_city)
G
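For the multi-product case raised in the question, the same drill-down can be wrapped in a loop over products; a sketch assuming the dff defined above:
results = {}
for product, grp in dff.groupby('Product'):
    # Drill down: top country, then top region within it, then top city.
    country = grp.groupby('Country')['Sales'].sum().idxmax()
    in_country = grp[grp['Country'] == country]
    region = in_country.groupby('Region')['Sales'].sum().idxmax()
    in_region = grp[grp['Region'] == region]
    results[product] = in_region.groupby('City')['Sales'].sum().idxmax()
print(results)  # {'Chair': 'G'} for the sample data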
One way is to add groupwise totals, then sort your dataframe. This goes beyond your requirement by ordering all your data using your preference logic:
df = pd.DataFrame.from_dict(df)
factors = ['Country', 'Region', 'City']
for factor in factors:
df[f'{factor}_Total'] = df.groupby(factor)['Sales'].transform('sum')
res = df.sort_values([f'{x}_Total' for x in factors], ascending=False)
print(res.head(5))
City Country Product Region Sales Country_Total Region_Total \
6 G India Chair India_North 500 2100 1200
7 H India Chair India_North 350 2100 1200
8 I India Chair India_North 350 2100 1200
10 K India Chair India_West 700 2100 800
9 J India Chair India_West 100 2100 800
City_Total
6 500
7 350
8 350
10 700
9 100
So for the most desirable you can use res.iloc[0], for the second res.iloc[1], etc.
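As a sketch of that last step, the first row per product after the preference sort gives the recommended city for each product (assuming the res from above):
# drop_duplicates keeps the first (i.e. highest-ranked) row per product.
top_per_product = res.drop_duplicates(subset='Product')[['Product', 'City']]
print(top_per_product)  # Chair -> G for the sample data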
I am populating a DataFrame with an ordered dictionary, but the pandas DataFrame is alphabetically organizing the columns.
code
from collections import OrderedDict
import pandas as pd

full_dict = OrderedDict()
# income_data and eachTicker come from an outer scraping loop (not shown)
labels = income_data[0:-1:4]
year1 = income_data[1:-1:4]
key = eachTicker
value = OrderedDict(zip(labels, year1))
full_dict[key] = value
df = pd.DataFrame(full_dict)
print(df)
As you can see below, full_dict is a dictionary built by zipping multiple lists, namely labels and year1.
output of full_dict
print(full_dict)
OrderedDict([('AAPL', OrderedDict([('Total Revenue', 182795000), ('Cost of Revenue', 112258000), ('Gross Profit', 70537000), ('Research Development', 6041000), ('Selling General and Administrative', 11993000), ('Non Recurring', 0), ('Others', 0), ('Total Operating Expenses', 0), ('Operating Income or Loss', 52503000), ('Total Other Income/Expenses Net', 980000), ('Earnings Before Interest And Taxes', 53483000), ('Interest Expense', 0), ('Income Before Tax', 53483000), ('Income Tax Expense', 13973000), ('Minority Interest', 0), ('Net Income From Continuing Ops', 39510000), ('Discontinued Operations', 0), ('Extraordinary Items', 0), ('Effect Of Accounting Changes', 0), ('Other Items', 0), ('Net Income', 39510000), ('Preferred Stock And Other Adjustments', 0), ('Net Income Applicable To Common Shares', 39510000)]))])
The output DataFrame is ordered alphabetically, and I do not know why. I want it ordered as in full_dict.
code output
AAPL AMZN LNKD
Cost of Revenue 112258000 62752000 293797
Discontinued Operations 0 0 0
Earnings Before Interest And Taxes 53483000 99000 31205
Effect Of Accounting Changes 0 0 0
Extraordinary Items 0 0 0
Gross Profit 70537000 26236000 1924970
Income Before Tax 53483000 -111000 31205
Income Tax Expense 13973000 167000 46525
Interest Expense 0 210000 0
Minority Interest 0 0 -427
Net Income 39510000 -241000 -15747
Net Income Applicable To Common Shares 39510000 -241000 -15747
Net Income From Continuing Ops 39510000 -241000 -15747
Non Recurring 0 0 0
Operating Income or Loss 52503000 178000 36135
Other Items 0 0 0
Others 0 0 236946
Preferred Stock And Other Adjustments 0 0 0
Research Development 6041000 0 536184
Selling General and Administrative 11993000 26058000 1115705
Total Operating Expenses 0 0 0
Total Other Income/Expenses Net 980000 -79000 -4930
Total Revenue 182795000 88988000 2218767
This looks like a bug in the DataFrame constructor, in that it's not respecting the key order when the orient is 'columns'. A workaround is to use from_dict with orient='index' and transpose the result:
In [31]:
df = pd.DataFrame.from_dict(d, orient='index').T
df
Out[31]:
AAPL
Total Revenue 182795000
Cost of Revenue 112258000
Gross Profit 70537000
Research Development 6041000
Selling General and Administrative 11993000
Non Recurring 0
Others 0
Total Operating Expenses 0
Operating Income or Loss 52503000
Total Other Income/Expenses Net 980000
Earnings Before Interest And Taxes 53483000
Interest Expense 0
Income Before Tax 53483000
Income Tax Expense 13973000
Minority Interest 0
Net Income From Continuing Ops 39510000
Discontinued Operations 0
Extraordinary Items 0
Effect Of Accounting Changes 0
Other Items 0
Net Income 39510000
Preferred Stock And Other Adjustments 0
Net Income Applicable To Common Shares 39510000
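Another workaround sketch, assuming all tickers share the same label order: let the constructor sort, then restore the original order by reindexing with the first key's labels (d here is a trimmed stand-in for full_dict):
from collections import OrderedDict
import pandas as pd

d = OrderedDict([
    ('AAPL', OrderedDict([('Total Revenue', 182795000),
                          ('Cost of Revenue', 112258000),
                          ('Gross Profit', 70537000)])),
])
df = pd.DataFrame(d)
# Reindex the rows using the insertion order of the first ticker's keys.
df = df.reindex(list(d[next(iter(d))]))
print(df)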
EDIT
The bug is due to line 5746 in index.py:
def _union_indexes(indexes):
if len(indexes) == 0:
raise AssertionError('Must have at least 1 Index to union')
if len(indexes) == 1:
result = indexes[0]
if isinstance(result, list):
result = Index(sorted(result)) # <------ culprit
return result
When it constructs the index, it extracts the keys using result = indexes[0], but it then checks whether the result is a list and, if so, sorts it: result = Index(sorted(result)). This is why you get this output.
The issue is reported here; there is also a duplicate issue.