Pandas - Sum for each unique word - python

Update: instead of dict data, I changed the input to a DataFrame.
I'm analyzing a DataFrame with approximately 10,000 rows and 2 columns.
The criteria of my analysis are based on whether certain words appear in a given cell.
I believe I will be more successful if I know which words are most relevant in terms of values...
Foo data to be used as an example:
import pandas as pd

data = {'product': ['Dell Notebook I7', 'Dell Notebook I3', 'Logitech mx keys', 'Logitech mx 2'],
        'cost': [1000, 1200, 300, 100]}
df_data = pd.DataFrame(data)
            product  cost
0  Dell Notebook I7  1000
1  Dell Notebook I3  1200
2  Logitech mx keys   300
3     Logitech mx 2   100
Basically, the column product shows the product description, and the column cost shows the product cost.
What I want:
I would like to create another dataframe like this:
Desired Output:
   unique_words  total_cost_for_unique_word
1  Dell                                2200
5  Notebook                            2200
2  I3                                  1200
3  I7                                  1000
4  Logitech                             400
7  mx                                   400
6  keys                                 300
0  2                                    100
The column unique_words lists each word that appears in the column product.
The column total_cost_for_unique_word holds the sum of the costs of the products whose description contains that word.
I've searched posts here on Stack Overflow and done some Google research, but I haven't found a solution. Maybe I still don't have the knowledge to find the answer.
If by any chance it has already been answered, please let me know and I will delete the post.
Thank you all.

You can split, explode, then groupby + agg:
df_data = pd.DataFrame(data)
new_df = (df_data
          .assign(unique_words=df_data['product'].str.split())
          .explode('unique_words')
          .groupby('unique_words', as_index=False)
          .agg(**{'total cost': ('cost', 'sum')})
          .sort_values('total cost', ascending=False, ignore_index=True)
          )
Output:
  unique_words  total cost
0         Dell        2200
1     Notebook        2200
2           I3        1200
3           I7        1000
4     Logitech         400
5           mx         400
6         keys         300
7            2         100
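A small variant of the same chain, if you want the exact column name from the desired output (named aggregation instead of the dict unpacking):
new_df = (df_data
          .assign(unique_words=df_data['product'].str.split())
          .explode('unique_words')
          .groupby('unique_words', as_index=False)
          .agg(total_cost_for_unique_word=('cost', 'sum'))
          .sort_values('total_cost_for_unique_word', ascending=False, ignore_index=True))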

If you first split the product into a list of all words (by default, str.split splits on whitespace):
df["product"] = df["product"].str.split()
You can then explode this (each item in the list becomes a new row), group the words together and sum the costs, then sort and rename the columns to match your desired outcome:
(df.explode("product")
   .groupby("product", as_index=False)
   .agg("sum")
   .sort_values("cost", ascending=False)
   .rename(columns={"product": "unique_words", "cost": "total_cost_for_unique_word"}))

Related

Pandas dataframe, how to order a column grouped by 2 other columns

I have a dataframe like this one, with many more rows:
zone  keyword   sales
nyc1  iphone       10
nyc1  smart tv      6
nyc1  iphone       12
nyc2  laptop       22
slc1  iphone        3
slc2  radio         5
la1   iphone       10
la1   tablet       22
la1   tablet        5
How can I get another dataframe where for each zone/keyword I get the sum of the sales column (grouped by zone/keyword) in descending order?
For this example it should look like this (I don't want to reorder based on the other 2 columns, only sales):
zone  keyword   sales
nyc1  iphone       22
nyc1  smart tv      6
nyc2  laptop       22
slc1  iphone        3
slc2  radio         5
la1   tablet       27
la1   iphone       10
I already grouped the columns using
df_sales = df_sales.groupby(['zone','keyword'])['sales'].sum()
But the result is a Series, with the summed sales not in order.
Using reset_index and sort_values does order by sales, but it discards the grouping and orders the whole dataframe:
.reset_index().sort_values('sales', ascending=False)
How can I get a dataframe like the one above?
After you complete your groupby, you can reset the index and then use sort_values:
df_sales = df_sales.groupby(['zone','keyword'])['sales'].sum().reset_index()
sorted_df = df_sales.sort_values(['zone', 'sales'], ascending=[True, False])
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html
df_sales.groupby(['zone','keyword'])['sales'].sum().reset_index().sort_values('sales', ascending=False)
reset_index converts the Series back into a DataFrame; after that you can sort the values.
Solution 1: using agg('sum')
To get a DataFrame object instead of a Series, use double square brackets around sales.
df_sales = df_sales.groupby(['zone','keyword'])[['sales']].agg('sum').reset_index()
Solution 2: using sum()
df_sales = df_sales.groupby(['zone','keyword'])['sales'].sum().reset_index()
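Note that sorting by zone reorders the zones alphabetically, while the desired output keeps them in their order of first appearance. A minimal sketch of that variant, assuming the raw frame is still df_sales with columns zone, keyword and sales:
# sum per zone/keyword, keeping groups in order of first appearance
summed = df_sales.groupby(['zone', 'keyword'], sort=False, as_index=False)['sales'].sum()
# within each zone, order rows by sales descending; group_keys=False keeps a flat index
result = (summed.groupby('zone', sort=False, group_keys=False)
                .apply(lambda g: g.sort_values('sales', ascending=False)))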

Python pandas to select a row value after group by

How can I select the row with the maximum count after grouping by a column?
Examples:
STATE COUNTY POPULATION
1 5571 1000
2 3421 2000
3 6781 3000
2 1234 4000
2 3344 6600
1 5566 9900
I want to find the STATE with the maximum count of counties, showing only STATE and the count, without POPULATION.
The answer should be as below, but I don't know how to code it in Python. Thanks for the help.
STATE COUNTY
2 3
Try:
u = df.groupby('STATE')['COUNTY'].size()
v = u[u.index == u.idxmax()].reset_index()
v:
   STATE  COUNTY
0      2       3
Approach:
Group by STATE, then use nunique (if you want to count distinct values) or size on the COUNTY column.
Get the index of the row where the count is the maximum, as in the sketch below.
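A short sketch of that approach, assuming the frame is named df:
counts = df.groupby('STATE')['COUNTY'].nunique()  # distinct counties per state; use .size() to count rows
result = counts.loc[[counts.idxmax()]].reset_index()  # keep only the state with the highest count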

Comparing two dataframes without duplicates

I have two similarly structured dataframes that represent two periods in time, say Jul 2020 and Aug 2020. The data in them is forecasted and/or realised revenue data from several company sources like CRM and accounting applications. The columns contain data on clients, product, quantity, price, revenue, period, etc. Now I want to see what happened between these two months by comparing the two dataframes.
I tried to do this by renaming some of the columns, like quantity, price and revenue, and then merging the two dataframes on client, product and period. After that I calculate the differences on quantity, price and revenue.
However I run into a problem... Suppose one specific customer has closed a contract with us to purchase two specific products (abc & xyz) every month for the next two years. That means that in our July forecast we can include these two items as revenue. In reality this list is much longer with other contracts and also expected revenue that is in the weighted pipeline.
This is a small extract from the total forecast for our specific client.
Client Product Period Stage Qty Price Rev
0 A abc 2020-07 contracted 1 100 100
1 A xyz 2020-07 contracted 1 50 50
Now suppose this client decides to purchase a second product xyz and we get another contract for it. Then it looks like this for July:
Client Product Period Stage Qty Price Rev
0 A abc 2020-07 contracted 1 100 100
1 A xyz 2020-07 contracted 1 50 50
2 A xyz 2020-07 contracted 1 50 50
Now suppose we are a month later, and from our accounting system we pulled the realised revenue, which looks like this (so what we forecasted became reality):
Client Product Period Stage Qty Price Rev
0 A abc 2020-07 realised 1 100 100
1 A xyz 2020-07 realised 2 50 100
And now I want to compare them by merging the two df's after renaming some of the columns.
def rename_column(df_name, col_name, first_forecast_period):
    df_name.rename(columns={col_name: col_name + '_' + first_forecast_period}, inplace=True)
    return df_name

rename_column(df_1, 'Stage', '1')
rename_column(df_1, 'Price', '1')
rename_column(df_1, 'Qty', '1')
rename_column(df_1, 'Rev', '1')
rename_column(df_2, 'Stage', '2')
rename_column(df_2, 'Price', '2')
rename_column(df_2, 'Qty', '2')
rename_column(df_2, 'Rev', '2')
result_1 = pd.merge(df_1, df_2, how ='outer')
And then some math to get the differences:
result_1['Qty_diff'] = result_1['Qty_2'] - result_1['Qty_1']
result_1['Price_diff'] = result_1['Price_2'] - result_1['Price_1']
result_1['Rev_diff'] = result_1['Rev_2'] - result_1['Rev_1']
This results in:
Client Product Period Stage_1 Qty_1 Price_1 Rev_1 Stage_2 Qty_2 Price_2 Rev_2 Qty_diff Price_diff Rev_diff
0 A abc 2020-07 contracted 1 100 100 realised 1 100 100 0 0 0
1 A xyz 2020-07 contracted 1 50 50 realised 2 50 100 1 0 50
2 A xyz 2020-07 contracted 1 50 50 realised 2 50 100 1 0 50
So, the problem is that in the third line the realised part is included a second time. Since the forecast and the reality are the same, the outcome should have been:
Client Product Period Stage_1 Qty_1 Price_1 Rev_1 Stage_2 Qty_2 Price_2 Rev_2 Qty_diff Price_diff Rev_diff
0 A abc 2020-07 contracted 1 100 100 realised 1 100 100 0 0 0
1 A xyz 2020-07 contracted 1 50 50 realised 2 50 100 1 0 50
2 A xyz 2020-07 contracted 1 50 50 realised 0 0 0 -1 0 -50
And therefore I get a total revenue difference of 100 (+50 and +50) instead of 0 (+50 and -50). Is there any way this can be solved by merging the two DFs, or do I need to start thinking in another direction? Any suggestions would be helpful! Thanks.
You should probably get the totals for client-product-period on both dfs to be safe. Assuming all rows in df_1 are 'contracted', you can do:
df_1 = (df_1.groupby(['Client', 'Product', 'Period'])
            .agg({'Stage': 'first', 'Qty': 'sum', 'Price': 'first', 'Rev': 'sum'})
            # if price can vary between rows of the same product-client:
            # .agg({'Stage': 'first', 'Qty': 'sum', 'Price': 'mean', 'Rev': 'sum'})
            .reset_index())  # so Client/Product/Period are regular columns for the merge
# same for df_2
Now you can merge both dfs with:
df_merged = df_1.merge(df_2, on=['Client', 'Product', 'Period'], how='outer')
The result will add suffixes to duplicate columns, _x and _y for df_1 and df_2 respectively.
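Putting it together, a hedged sketch of the full comparison; it assumes both frames were aggregated as above and uses explicit suffixes to match the OP's _1/_2 naming:
df_merged = df_1.merge(df_2, on=['Client', 'Product', 'Period'],
                       how='outer', suffixes=('_1', '_2'))
for col in ['Qty', 'Price', 'Rev']:
    # treat a missing side as zero, so dropped or added lines show up in the diff
    df_merged[[f'{col}_1', f'{col}_2']] = df_merged[[f'{col}_1', f'{col}_2']].fillna(0)
    df_merged[f'{col}_diff'] = df_merged[f'{col}_2'] - df_merged[f'{col}_1']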

How to run a groupby based on result of other/previous groupby?

Let's assume you are selling a product globally and you want to set up a sales office somewhere in a major city. Your decision will be based purely on sales numbers.
This will be your (simplified) sales data:
df = {
    'Product': 'Chair',
    'Country': ['USA', 'USA', 'China', 'China', 'China', 'China', 'India',
                'India', 'India', 'India', 'India', 'India', 'India'],
    'Region': ['USA_West', 'USA_East', 'China_West', 'China_East', 'China_South',
               'China_South', 'India_North', 'India_North', 'India_North',
               'India_West', 'India_West', 'India_East', 'India_South'],
    'City': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M'],
    'Sales': [1000, 1000, 1200, 200, 200, 200, 500, 350, 350, 100, 700, 50, 50],
}
dff = pd.DataFrame.from_dict(df)
dff
Based on the data you should go for City "G".
The logic should go like this:
1) Find country with Max(sales)
2) in that country, find region with Max(sales)
3) in that region, find city with Max(sales)
I tried: groupby('Product', 'City').apply(lambda x: x.nlargest(1)), but this doesn't work, because it would propose city "C". This is the city with highest sales globally, but China is not the Country with highest sales.
I probably have to go through several loops of groupby. Based on the result, filter the original dataframe and do a groupby again on the next level.
To add to the complexity, you sell other products too (not just 'Chairs', but also other furniture). You would have to store the results of each iteration (like country with Max(sales) per product) somewhere and then use it in the next iteration of the groupby.
Do you have any ideas, how I could implement this in pandas/python?
The idea is to aggregate the sum at each level and take the top value with Series.idxmax, which is then used to filter the next level by boolean indexing:
max_country = dff.groupby('Country')['Sales'].sum().idxmax()
max_region = dff[dff['Country'] == max_country].groupby('Region')['Sales'].sum().idxmax()
max_city = dff[dff['Region'] == max_region].groupby('City')['Sales'].sum().idxmax()
print (max_city)
G
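Since the question also mentions selling several products, a hedged sketch that wraps the same idxmax cascade in a function and applies it once per product:
def best_city(group):
    # cascade: best country -> best region within it -> best city within that
    country = group.groupby('Country')['Sales'].sum().idxmax()
    group = group[group['Country'] == country]
    region = group.groupby('Region')['Sales'].sum().idxmax()
    group = group[group['Region'] == region]
    return group.groupby('City')['Sales'].sum().idxmax()

best_per_product = dff.groupby('Product').apply(best_city)
print(best_per_product)  # for the sample data: Chair -> G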
One way is to add groupwise totals, then sort your dataframe. This goes beyond your requirement by ordering all your data using your preference logic:
df = pd.DataFrame.from_dict(df)
factors = ['Country', 'Region', 'City']
for factor in factors:
    df[f'{factor}_Total'] = df.groupby(factor)['Sales'].transform('sum')
res = df.sort_values([f'{x}_Total' for x in factors], ascending=False)
print(res.head(5))
City Country Product Region Sales Country_Total Region_Total \
6 G India Chair India_North 500 2100 1200
7 H India Chair India_North 350 2100 1200
8 I India Chair India_North 350 2100 1200
10 K India Chair India_West 700 2100 800
9 J India Chair India_West 100 2100 800
City_Total
6 500
7 350
8 350
10 700
9 100
So for the most desirable you can use res.iloc[0], for the second res.iloc[1], etc.

Creating a dictionary of categoricals in SQL and aggregating them in Python

I have a rather "cross platformed" question. I hope it is not too general.
One of my tables, say customers, consists of my customer IDs and their associated demographic information. Another table, say transaction, contains all purchases from the customers in the respective shops.
I am interested in analyzing basket compositions together with demographics in Python. Hence, I would like to have the shops as columns and, per customer, the summed amount spent at each shop in my dataframe.
For clarity,
select *
from customer
where id=1 or id=2
gives me
id age gender
1 35 MALE
2 57 FEMALE
and
select *
from transaction
where id=1 or id=2
gives me
customer_id shop amount
1 2 250
1 2 500
2 3 100
2 7 200
2 11 125
Which should end up in a (preferably) Pandas dataframe as
id age gender shop_2 shop_3 shop_7 shop_11
1 35 MALE 750 0 0 0
2 57 FEMALE 0 100 200 125
Such that the last columns is the aggregated baskets of the customers.
I have tried to create a python dictionary of the purchases and amounts for each customer in SQL in the following way:
select customer_id, array_agg(concat(cast(shop as varchar), ' : ', cast(amount as varchar))) as basket
from transaction
group by customer_id
Resulting in
id basket
1 ['2 : 250', '2 : 500']
2 ['3 : 100', '7 : 200', '11 : 125']
which could easily be joined on the customer table.
However, this solution is not optimal, because the values inside the [] are strings rather than integers. Hence, it takes a lot of manipulation and looping in Python to get it into the format I want.
Is there any way where I can aggregate the purchases in SQL making it easier for python to read and aggregate into columns?
One simple solution would be to do the aggregation in pandas using pivot_table on the second dataframe and then merge with the first:
df2 = df2.pivot_table(columns='shop', values='amount', index='customer_id', aggfunc='sum', fill_value=0.0).reset_index()
df = pd.merge(df1, df2, left_on='id', right_on='customer_id')
Resulting dataframe:
id age gender 2 3 7 11
1 35 MALE 750 0 0 0
2 57 FEMALE 0 100 200 125
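To match the shop_<id> column headers from the desired output, a hedged variant of the pivot step adds a prefix before merging (add_prefix is applied before reset_index so customer_id keeps its name):
df2 = (df2.pivot_table(index='customer_id', columns='shop', values='amount',
                       aggfunc='sum', fill_value=0)
          .add_prefix('shop_')
          .reset_index())
df = pd.merge(df1, df2, left_on='id', right_on='customer_id')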
