Below is an example of the data set I am working with. I am trying to do a group by on Location, but only over the top 3 records by KG within each location.
Location KG Dollars
BKK 7 2
BKK 5 3
BKK 4 2
BKK 3 3
BKK 2 2
HKG 8 3
HKG 6 2
HKG 4 3
HKG 3 2
HKG 2 3
The output would look like the below: grouped on Location, summing both KG and Dollars over the top 3 KG records for each location.
Location KG Dollars
BKK 16 7
HKG 18 8
I've tried different kinds of groupbys; the problem is restricting the groupby to only the top n KG records per location.
You could do
In [610]: df.groupby('Location').apply(lambda x:
x.nlargest(3, 'KG')[['KG', 'Dollars']].sum())
Out[610]:
KG Dollars
Location
BKK 16 7
HKG 18 8
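If you'd rather avoid the per-group lambda, a broadly equivalent sketch (assuming the same df with Location, KG and Dollars columns) is to sort, keep the three largest KG rows per location, and then aggregate:
top3 = (df.sort_values('KG', ascending=False)   # largest KG first
          .groupby('Location')
          .head(3))                              # keep the top 3 KG rows per location
top3.groupby('Location')[['KG', 'Dollars']].sum()
This produces the same 16/7 and 18/8 totals as the nlargest version.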
id zone price
0 0000001 1 33.0
1 0000001 2 24.0
2 0000001 3 34.0
3 0000001 4 45.0
4 0000001 5 51.0
I have the above pandas dataframe; there are multiple ids (only one id is shown here). The dataframe consists of ids, each with 5 zones and 5 prices. The prices should follow the pattern below:
p1 (price of zone 1) < p2 < p3 < p4 < p5
If anything is out of order, we should identify the anomalous records and print them to a file.
In this example p3 < p4 < p5, but p1 and p2 are erroneous (p1 > p2, whereas p1 < p2 is expected).
Therefore the first two records should be printed to a file.
Likewise this has to be done across the entire dataframe, for every unique id in it.
My dataframe is huge; what is the most efficient way to do this filtering and identify the erroneous records?
You can compute the diff per group after sorting the values to ensure the zones are increasing. If the diff is ≤ 0 the price is not strictly increasing and the rows should be flagged:
s = (df.sort_values(by=['id', 'zone'])   # sort rows
       .groupby('id')['price']           # group by id
       .diff()                           # compute the diff within each id
       .le(0))                           # flag diffs ≤ 0 (not strictly increasing)
df[s | s.shift(-1)]                      # slice flagged rows plus the row just before each one
Example input:
id zone price
0 1 1 33.0
1 1 2 24.0
2 1 3 34.0
3 1 4 45.0
4 1 5 51.0
5 2 1 20.0
6 2 2 24.0
7 2 3 34.0
8 2 4 45.0
9 2 5 51.0
Example output:
id zone price
0 1 1 33.0
1 1 2 24.0
Saving to a file:
df[s|s.shift(-1)].to_csv('incorrect_prices.csv')
Another way would be to first sort your dataframe by id and zone in ascending order and compare each price with the previous price using groupby.shift(), creating a new column. Then you can just print out the prices that have fallen in value:
import numpy as np
import pandas as pd

df = df.sort_values(by=['id', 'zone'], ascending=True)  # assign back: sort_values is not in-place
df['increase'] = np.where(df.zone.eq(1), 'no change',
                          np.where(df.groupby('id')['price'].shift(1) < df['price'], 'inc', 'dec'))
>>> df
id zone price increase
0 1 1 33 no change
1 1 2 24 dec
2 1 3 34 inc
3 1 4 45 inc
4 1 5 51 inc
5 2 1 34 no change
6 2 2 56 inc
7 2 3 22 dec
8 2 4 55 inc
9 2 5 77 inc
10 3 1 44 no change
11 3 2 55 inc
12 3 3 44 dec
13 3 4 66 inc
14 3 5 33 dec
>>> df.loc[df.increase.eq('dec')]
id zone price increase
1 1 2 24 dec
7 2 3 22 dec
12 3 3 44 dec
14 3 5 33 dec
I have added some extra IDs to try to mimic your real data.
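Since the original goal was to print the anomalies to a file, the 'dec' rows from this approach can be written out the same way as in the first answer (a sketch; the file name is only an example):
df.loc[df.increase.eq('dec')].to_csv('anomaly_records.csv', index=False)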
How do I merge similar data such as "recommendation" into one value?
df['Why you choose us'].str.lower().value_counts()
location 35
recommendation 23
recommedation 8
confort 7
availability 4
reconmmendation 3
facilities 3
print(df)
reason count
0 location 35
1 recommendation 23
2 recommedation 8
3 confort 7
4 availability 4
5 reconmmendation 3
6 facilities 3
Use .groupby() on a partial string of the reason (the first four characters) and .transform() to find the sum:
df['groupcount'] = df.groupby(df.reason.str[0:4])['count'].transform('sum')
reason count groupcount
0 location 35 35
1 recommendation 23 34
2 recommedation 8 34
3 confort 7 7
4 availability 4 4
5 reconmmendation 3 34
6 facilities 3 3
If you need to see the string and the partial string side by side, try:
df=df.assign(groupname=df.reason.str[0:4])
df['groupcount']=df.groupby(df.reason.str[0:4])['count'].transform('sum')
print(df)
reason count groupname groupcount
0 location 35 loca 35
1 recommendation 23 reco 34
2 recommedation 8 reco 34
3 confort 7 conf 7
4 availability 4 avai 4
5 reconmmendation 3 reco 34
6 facilities 3 faci 3
In case you have multiple items in a row, as you have in the csv, then:
#Read csv
df=pd.read_csv(r'path')
#Lower-case the column, fill missing comments, and split each row into a list of reasons
df['Why you choose us']=(df['Why you choose us'].str.lower().fillna('no comment given')).str.split(',')
#Explode so each unique reason is in its own row, with all the other attributes intact
df=df.explode('Why you choose us')
#Remove any whitespace around the values in the column, then count them
print(df['Why you choose us'].str.strip().value_counts())
location 48
no comment given 34
recommendation 25
confort 8
facilities 8
recommedation 8
price 7
availability 6
reputation 5
reconmmendation 3
internet 3
ac 3
breakfast 3
tranquility 2
cleanliness 2
aveilable 1
costumer service 1
pool 1
comfort 1
search engine 1
Name: Why you choose us, dtype: int64
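The four-character prefix works here because the misspellings all start with 'reco', but it will not merge pairs like 'confort' and 'comfort'. As a rough alternative sketch (my own suggestion, not part of the answer above, and the canonical list is assumed), you could map each response to its closest known reason with difflib before counting:
import difflib

canonical = ['location', 'recommendation', 'comfort', 'availability',
             'facilities', 'price', 'reputation']          # assumed list of valid reasons

def normalise(value):
    # return the closest canonical reason, or the original value if nothing is close enough
    match = difflib.get_close_matches(value, canonical, n=1, cutoff=0.7)
    return match[0] if match else value

df['Why you choose us'].str.strip().map(normalise).value_counts()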
This is my dataframe after pivoting:
Country London Shanghai
PriceRange 100-200 200-300 300-400 100-200 200-300 300-400
Code
A 1 1 1 2 2 2
B 10 10 10 20 20 20
Is it possible to add columns after every country to achieve the following:
Country London Shanghai All
PriceRange 100-200 200-300 300-400 SubTotal 100-200 200-300 300-400 SubTotal 100-200 200-300 300-400 SubTotal
Code
A 1 1 1 3 2 2 2 6 3 3 3 9
B 10 10 10 30 20 20 20 60 30 30 30 90
I know I can use margins=True; however, that just adds a final grand total.
Are there any options that I can use to achieve this? Thanks.
Let us use sum with join:
s = df.sum(level=0, axis=1)                                 # per-group subtotal over the first column level
s.columns = pd.MultiIndex.from_product([list(s), ['subgroup']])
df = df.join(s).sort_index(level=0, axis=1).assign(Group=df.sum(axis=1))
df
A B Group
1 2 3 subgroup 1 2 3 subgroup
Code
A 1 1 1 3 2 2 2 6 9
B 10 10 10 30 20 20 20 60 90
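Applied to the frame from the question, the same idea looks roughly like this (a sketch that rebuilds the pivoted frame by hand; summing via the transpose is just a variant of the per-country sum that also works on newer pandas, where sum(level=...) has been removed):
import pandas as pd

cols = pd.MultiIndex.from_product(
    [['London', 'Shanghai'], ['100-200', '200-300', '300-400']],
    names=['Country', 'PriceRange'])
df = pd.DataFrame([[1, 1, 1, 2, 2, 2],
                   [10, 10, 10, 20, 20, 20]],
                  index=pd.Index(['A', 'B'], name='Code'), columns=cols)

sub = df.T.groupby(level='Country').sum().T           # per-country subtotal
sub.columns = pd.MultiIndex.from_product([sub.columns, ['SubTotal']],
                                         names=['Country', 'PriceRange'])

df.join(sub).sort_index(axis=1, level='Country')
The 'All' block from the desired output could be built the same way, by summing over the Country level of the columns instead and labelling those columns 'All'.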
I have a pandas dataframe containing retail sales data which shows the total number of a product sold each week and the stock left at the end of the week. Unfortunately, the dataset only shows a row when a product has been sold and the stock left changes.
I would like to bulk out the dataset so that for each week there is a line for each product being sold. I've shown an example of this below - how can this be done?
As-Is:
Week Product Sold Stock
1 1 1 10
1 2 1 10
1 3 1 10
2 1 2 8
2 3 3 7
To-Be:
Week Product Sold Stock
1 1 1 10
1 2 1 10
1 3 1 10
2 1 2 8
2 2 0 10
2 3 3 7
Create a dataframe with all the combinations of the columns 'Week' and 'Product' using product from itertools, then merge it with your original data. Let's say your dataframe is called dfp:
from itertools import product
new_dfp = (pd.DataFrame(list(product(dfp.Week.unique(), dfp.Product.unique())),columns=['Week','Product'])
.merge(dfp,how='left'))
You get the missing row in new_dfp:
Week Product Sold Stock
0 1 1 1.0 10.0
1 1 2 1.0 10.0
2 1 3 1.0 10.0
3 2 1 2.0 8.0
4 2 2 NaN NaN
5 2 3 3.0 7.0
Now fillna on both columns, with different values:
new_dfp['Sold'] = new_dfp['Sold'].fillna(0).astype(int)  # nothing was sold in the missing rows
new_dfp['Stock'] = new_dfp.groupby('Product')['Stock'].fillna(method='ffill').astype(int)
To fill 'Stock', you need to group by product and use the method 'ffill' to carry forward the value from the previous week. At the end, you get:
Week Product Sold Stock
0 1 1 1 10
1 1 2 1 10
2 1 3 1 10
3 2 1 2 8
4 2 2 0 10
5 2 3 3 7
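An equivalent route (a sketch under the same assumptions about dfp) skips the merge and reindexes against the full Week x Product grid instead:
idx = pd.MultiIndex.from_product([dfp['Week'].unique(), dfp['Product'].unique()],
                                 names=['Week', 'Product'])
new_dfp = (dfp.set_index(['Week', 'Product'])
              .reindex(idx)                       # adds the missing Week/Product rows as NaN
              .reset_index())
new_dfp['Sold'] = new_dfp['Sold'].fillna(0).astype(int)
new_dfp['Stock'] = new_dfp.groupby('Product')['Stock'].ffill().astype(int)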
I am trying to create a column that does a cumulative sum using 2 columns; please see the example below of what I am trying to do:
index lodgement_year words sum cum_sum
0 2000 the 14 14
1 2000 australia 10 10
2 2000 word 12 12
3 2000 brand 8 8
4 2000 fresh 5 5
5 2001 the 8 22
6 2001 australia 3 13
7 2001 banana 1 1
8 2001 brand 7 15
9 2001 fresh 1 6
I have used the code below; however, my computer keeps crashing, and I am unsure if it is the code or the computer. Any help will be greatly appreciated:
df_2['cumsum']= df_2.groupby('lodgement_year')['words'].transform(pd.Series.cumsum)
Update: I have also used the code below; it worked and exited with code 0, although with some warnings.
df_2['cum_sum'] =df_2.groupby(['words'])['count'].cumsum()
You are almost there, Ian!
The cumsum() method calculates the cumulative sum of a pandas column. You are looking for that applied to the grouped words. Therefore:
In [303]: df_2['cumsum'] = df_2.groupby(['words'])['sum'].cumsum()
In [304]: df_2
Out[304]:
index lodgement_year words sum cum_sum cumsum
0 0 2000 the 14 14 14
1 1 2000 australia 10 10 10
2 2 2000 word 12 12 12
3 3 2000 brand 8 8 8
4 4 2000 fresh 5 5 5
5 5 2001 the 8 22 22
6 6 2001 australia 3 13 13
7 7 2001 banana 1 1 1
8 8 2001 brand 7 15 15
9 9 2001 fresh 1 6 6
Please comment if this fails on your bigger data set, and we'll work on a possibly more accurate version of this.
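If the crashes on the full data set are memory-related, one optional tweak (an assumption on my part, not something you mentioned) is to make the grouping key categorical before running the same grouped cumsum:
df_2['words'] = df_2['words'].astype('category')            # smaller key column on large frames
df_2['cumsum'] = df_2.groupby('words', observed=True)['sum'].cumsum()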
If we only need to consider the column 'words', we could loop through its unique values:
for unique_words in df_2.words.unique():
    if 'cum_sum' not in df_2:
        df_2['cum_sum'] = df_2.loc[df_2['words'] == unique_words]['sum'].cumsum()
    else:
        df_2.update(pd.DataFrame({'cum_sum': df_2.loc[df_2['words'] == unique_words]['sum'].cumsum()}))
The above will result in:
>>> print(df_2)
lodgement_year sum words cum_sum
0 2000 14 the 14.0
1 2000 10 australia 10.0
2 2000 12 word 12.0
3 2000 8 brand 8.0
4 2000 5 fresh 5.0
5 2001 8 the 22.0
6 2001 3 australia 13.0
7 2001 1 banana 1.0
8 2001 7 brand 15.0
9 2001 1 fresh 6.0