How to make a Pandas count pivot table with related child items? - python

I'm stuck on a small problem with Python and a pandas dataframe.
I want to make a pivot table that counts some related items.
I have a dataframe with this structure:
+-----+------------+-----------+-----------+
| ID | Item_Type | Basket_ID | OwnerName |
+-----+------------+-----------+-----------+
| 3 | Basket | | |
| 336 | ChickenEgg | 3 | Henk |
| 841 | SomeEgg | 3 | Henk |
| 671 | EasterEgg | 3 | Piet |
| 9 | Basket | | |
| 336 | Orange | 9 | Piet |
| 841 | Banana | 9 | Piet |
| 671 | Strawberry | 9 | Herman |
| 888 | Apple | 9 | Herman |
| 821 | Apricots | 9 | NaN |
+-----+------------+-----------+-----------+
I want to count how many items are related to the 'Basket' item (the parent) and how often each 'OwnerName' appears with that 'Basket' item.
I want my dataframe to look like the one below.
You can see the total count of items related to the parent Item_Type 'Basket', and the total count of how often each name appears.
You can also see how many 'Total_Owners' there are, as well as the number of items without an owner.
+----+-----------+-------------------+------------+------------+--------------+--------------+------------------+
| ID | Item_Type | Total_Items_Count | Henk_Count | Piet_Count | Herman_Count | Total_Owners | Total_NaN_Values |
+----+-----------+-------------------+------------+------------+--------------+--------------+------------------+
| 3 | Basket | 3 | 2 | 1 | 0 | 3 | |
| 9 | Basket | 5 | 0 | 2 | 2 | 4 | 1 |
+----+-----------+-------------------+------------+------------+--------------+--------------+------------------+

Answering your question takes multiple steps, but the core idea is that you should use pivot_table.
The table conceptually has a multi-level index: Basket_ID is the high-level index and ID is the more granular one. The first thing to do is remove the rows where Basket_ID is missing, so that the granularity of the table is consistent.
Let's say you named your dataframe df.
# Preparation steps
df = df[~df["Basket_ID"].isna()] # Remove the rows that shouldnt be counted.
df.loc[df["OwnerName"].isna(),"OwnerName"] = "unknown" # set missing to arbitrary value
# Make a pivot table
df = df.pivot_table(index=['Basket_ID'], columns=['OwnerName'], values=['Item_Type'], aggfunc='count').fillna(0)
From there on, you should be able to calculate the remaining columns.
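For example, a minimal sketch continuing from those steps (the DataFrame here is a reconstruction of the example data from the question, and "unknown" is the placeholder chosen above for missing owners):
import numpy as np
import pandas as pd

# Hypothetical reconstruction of the example data from the question.
df = pd.DataFrame({
    "ID": [3, 336, 841, 671, 9, 336, 841, 671, 888, 821],
    "Item_Type": ["Basket", "ChickenEgg", "SomeEgg", "EasterEgg",
                  "Basket", "Orange", "Banana", "Strawberry", "Apple", "Apricots"],
    "Basket_ID": [np.nan, 3, 3, 3, np.nan, 9, 9, 9, 9, 9],
    "OwnerName": [np.nan, "Henk", "Henk", "Piet",
                  np.nan, "Piet", "Piet", "Herman", "Herman", np.nan],
})

# Preparation: keep only child rows, replace missing owners with a placeholder.
items = df[df["Basket_ID"].notna()].copy()
items["OwnerName"] = items["OwnerName"].fillna("unknown")

# Pivot: one column per owner, counting items per basket.
counts = (items.pivot_table(index="Basket_ID", columns="OwnerName",
                            values="Item_Type", aggfunc="count")
               .fillna(0).astype(int))

# The remaining columns derived from the per-owner counts.
owner_cols = counts.columns.tolist()          # includes the "unknown" placeholder
counts["Total_Items_Count"] = counts[owner_cols].sum(axis=1)
counts["Total_NaN_Values"] = counts["unknown"]
counts["Total_Owners"] = counts["Total_Items_Count"] - counts["Total_NaN_Values"]
counts = counts.rename(columns={c: f"{c}_Count" for c in owner_cols}).reset_index()
print(counts)
Here Basket_ID plays the role of the parent ID in the desired output.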

Related

How do you append a set of Series with duplicate label names (index) to the end of a DataFrame as rows without using the deprecated append method?

I need to append a pandas Series as a row to the end of a pandas DataFrame. What makes this tricky is that I am using dates as my index, and they are not unique in my case. This is what I want the result to be, with the date values as the index.
+───────────+─────────+──────────────+──────────+────────+
| | counts | day of week | weekend | month |
+───────────+─────────+──────────────+──────────+────────+
| 8/5/2015 | 1111 | 2 | FALSE | 8 |
| 8/5/2015 | 1076 | 3 | FALSE | 8 |
| 8/5/2015 | 1060 | 4 | FALSE | 8 |
| 8/6/2015 | 1540 | 5 | TRUE | 8 |
| 8/7/2015 | 1493 | 6 | TRUE | 8 |
| 8/7/2015 | 1060 | 0 | FALSE | 8 |
| 8/7/2015 | 1113 | 1 | FALSE | 8 |
| 8/8/2015 | 1027 | 2 | FALSE | 8 |
| 8/8/2015 | 1053 | 3 | FALSE | 8 |
| 8/8/2015 | 1051 | 4 | FALSE | 8 |
| 8/8/2015 | 1278 | 5 | TRUE | 8 |
| 8/8/2015 | 1086 | 6 | TRUE | 8 |
+───────────+─────────+──────────────+──────────+────────+
While this was easily possible with the append method, it is being deprecated and I am not sure that concat can replicate all of its functionality. (On a side note, why does the pandas team keep deprecating great functionality?).
My solution involves the loc method:
df.loc[len(df)] = series_row
df = df.rename(index={label_name: series_row.name})
In case you don't follow: we insert a new row at the end of the DataFrame. If we stop there, the label name will be an int value, specifically the length of the DataFrame at the time of the assignment (that integer is the label_name used in the rename below).
df.loc[len(df)] = series_row
+───+─────────+──────────────+──────────+────────+
| | counts | day of week | weekend | month |
+───+─────────+──────────────+──────────+────────+
| 1 | 1111 | 2 | FALSE | 8 |
+───+─────────+──────────────+──────────+────────+
To keep the append method's functionality, we need to rename the label to whatever we want, which in this case is a date string.
df = df.rename(index={label_name: series_row.name})
+───────────+─────────+──────────────+──────────+────────+
| | counts | day of week | weekend | month |
+───────────+─────────+──────────────+──────────+────────+
| 8/5/2015 | 1111 | 2 | FALSE | 8 |
+───────────+─────────+──────────────+──────────+────────+
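Put together, a minimal runnable sketch of this approach might look like the following (the values are made up from the table above):
import pandas as pd

# Existing frame with non-unique date labels (made-up values).
df = pd.DataFrame({"counts": [1111, 1076], "day of week": [2, 3],
                   "weekend": [False, False], "month": [8, 8]},
                  index=["8/5/2015", "8/5/2015"])

# The row to append, with the date we want as its label.
series_row = pd.Series({"counts": 1060, "day of week": 4, "weekend": False, "month": 8},
                       name="8/5/2015")

new_label = len(df)                                   # temporary integer label used by loc
df.loc[new_label] = series_row
df = df.rename(index={new_label: series_row.name})    # swap it for the date label
print(df)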

Fuzzymatcher returns NaN for best_match_score

I'm observing odd behaviour while performing fuzzy_left_join from the fuzzymatcher library. I'm trying to join two DataFrames, the left one with 5217 records and the right one with 8734, but only 71 records end up with a best_match_score, which seems really odd. To get better results I even removed all the numbers and kept only alphabetical characters in the joining columns. In the merged table the id column from the right table is NaN, which is also a strange result.
Left table - the join column is "amazon_s3_name"; first item: limonig
+------+---------+-------+-----------+------------------------------------+
| id | product | price | category | amazon_s3_name |
+------+---------+-------+-----------+------------------------------------+
| 1 | A | 1.49 | fruits | limonig |
| 8964 | B | 1.39 | beverages | studencajfuzelimonilimonetatrevaml |
| 9659 | C | 2.79 | beverages | studencajfuzelimonilimtreval |
+------+---------+-------+-----------+------------------------------------+
Right table - the join column is "amazon_s3_name"; last item: limoni
+------+----------------------------------------------------------------------------------------------------------------------------+--------------------------------------------+
| id | picture | amazon_s3_name |
+------+----------------------------------------------------------------------------------------------------------------------------+--------------------------------------------+
| 191 | https://s3.eu-central-1.amazonaws.com/groceries.pictures/images/AhmadCajLimonIDjindjifil20X2G.jpg | ahmadcajlimonidjindjifilxg |
| 192 | https://s3.eu-central-1.amazonaws.com/groceries.pictures/images/AhmadCajLimonIDjindjifil20X2G40g.jpg | ahmadcajlimonidjindjifilxgg |
| 204 | https://s3.eu-central-1.amazonaws.com/groceries.pictures/images/Ahmadcajlimonidjindjifil20x2g40g00051265.jpg | ahmadcajlimonidjindjifilxgg |
| 1608 | https://s3.eu-central-1.amazonaws.com/groceries.pictures/images/Cajstudenfuzetealimonilimonovatreva15lpet.jpg | cajstudenfuzetealimonilimonovatrevalpet |
| 4689 | https://s3.eu-central-1.amazonaws.com/groceries.pictures/images/Lesieursalatensosslimonimaslinovomaslo.jpg | lesieursalatensosslimonimaslinovomaslo |
| 4690 | https://s3.eu-central-1.amazonaws.com/groceries.pictures/images/Lesieursalatensosslimonimaslinovomaslo05l500ml01301150.jpg | lesieursalatensosslimonimaslinovomaslolml |
| 4723 | https://s3.eu-central-1.amazonaws.com/groceries.pictures/images/Limoni.jpg | limoni |
+------+----------------------------------------------------------------------------------------------------------------------------+--------------------------------------------+
Merged table - as you can see, best_match_score is NaN
+----+------------------+-----------+------------+-------+----------+----------------------+------------+---------------------+-------------+----------------------+
| id | best_match_score | __id_left | __id_right | price | category | amazon_s3_name_left | image_left | amazon_s3_name_left | image_right | amazon_s3_name_right |
+----+------------------+-----------+------------+-------+----------+----------------------+------------+---------------------+-------------+----------------------+
| 0 | NaN | 0_left | None | 1.49 | Fruits | Limoni500g09700112 | NaN | limonig | NaN | NaN |
| 2 | NaN | 2_left | None | 1.69 | Bio | Morkovi1kgbr09700132 | NaN | morkovikgbr | NaN | NaN |
+----+------------------+-----------+------------+-------+----------+----------------------+------------+---------------------+-------------+----------------------+
You could give polyfuzz a try. Use the setup from its examples, for instance with TF-IDF or BERT as the matcher, then run:
model = PolyFuzz(matchers).match(df1["amazon_s3_name"].to_list(), df2["amazon_s3_name"].to_list())
df1['To'] = model.get_matches()['To']
then merge:
df1.merge(df2, left_on='To', right_on='amazon_s3_name')
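For example, a self-contained sketch with polyfuzz's TF-IDF matcher might look like this (the matcher settings and the two tiny frames are only illustrative, not tuned for this data):
import pandas as pd
from polyfuzz import PolyFuzz
from polyfuzz.models import TFIDF

# Tiny stand-ins for the real left and right tables.
df1 = pd.DataFrame({"amazon_s3_name": ["limonig", "studencajfuzelimonilimonetatrevaml"]})
df2 = pd.DataFrame({"amazon_s3_name": ["limoni", "cajstudenfuzetealimonilimonovatrevalpet"]})

# Character-trigram TF-IDF matcher; min_similarity=0 keeps even weak matches visible.
tfidf = TFIDF(n_gram_range=(3, 3), min_similarity=0)
model = PolyFuzz(tfidf).match(df1["amazon_s3_name"].to_list(),
                              df2["amazon_s3_name"].to_list())

df1["To"] = model.get_matches()["To"]
merged = df1.merge(df2, left_on="To", right_on="amazon_s3_name")
print(merged)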

Pandas sample() with conditions

I have this dataframe (shortened):
+-------+------------+--------+----------+-------+------+
| index | id_product | margin | supplier | price | seen |
+-------+------------+--------+----------+-------+------+
| 0 | 100000000 | 92.00 | 14 | 0.56 | 2 |
| 1 | 100000230 | 72.21 | 27 | 8.17 | 0 |
| 2 | 100001440 | 72.07 | 15 | 16.20 | 687 |
| 3 | 100002331 | 30.55 | 13 | 41.67 | 0 |
| 7 | 100001604 | 35.17 | 27 | 18.80 | 491 |
| ... | ... | ... | ... | ... | ... |
| 9830 | 100000320 | 77.78 | 18 | 13.33 | 0 |
| 9831 | 100000321 | 77.78 | 98 | 13.33 | 0 |
| 9832 | 100000443 | 77.78 | 17 | 13.33 | 4587 |
| 9834 | 100000292 | 88.13 | 3 | 10.56 | 0 |
| 9835 | 100000236 | 72.21 | 18 | 10.56 | 0 |
+-------+------------+--------+----------+-------+------+
What I am trying to do is to randomly extract 3 rows, maybe using df.sample(3), but with these conditions:
the 3 selected rows should have 3 different ecom_id values: (14, 27, 13) is good, (14, 27, 14) is not.
rows with higher margins should be favoured. I use weights='margin' and it works fine.
rows with a lower seen value should be favoured. Is it possible to reverse the weighting with sample() so that the lowest values are preferred?
the 3 selected rows should come from 3 different price slices: the first selected row should have a price < 20.0, the second a price between 30 and 50, and the third and last a price > 80.
Is this possible?
I have tried stuff like :
pr_1_pd = pr_pd.loc[pr_pd['price'] < 20]
pr_2_pd = pr_pd.loc[(pr_pd['price'] > 30) & (pr_pd['price'] < 50)]
pr_3_pd = pr_pd.loc[pr_pd['price'] > 80]
pr_1_pd = pr_1_pd.sort_values(by=['margin','seen'],ascending=[False,True])
pr_2_pd = pr_2_pd.sort_values(by=['margin','seen'],ascending=[False,True])
pr_3_pd = pr_3_pd.sort_values(by=['margin','seen'],ascending=[False,True])
But I'm not sure how to combine all the filters together.
the 3 selected rows should have 3 different ecom_id values: (14, 27, 13) is good, (14, 27, 14) is not.
Setting replace=False in df.sample should achieve this, as long as ecom_id is unique per row.
rows with a lower seen value should be favoured. Is it possible to reverse the weighting with sample() so that the lowest values are preferred?
You could invert the weights, e.g. new_weight = 1 / (seen + 1), to favour the lowest values (the + 1 avoids dividing by zero when seen is 0).
the 3 selected rows should come from 3 different price slices: the first selected row should have a price < 20.0, the second a price between 30 and 50, and the third and last a price > 80.
You'll have to sample from pr_1_pd, pr_2_pd, and pr_3_pd individually and then combine the results with pd.concat, for example as sketched below.
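A sketch combining the three pieces might look like this (the data is a small made-up stand-in, "supplier" stands in for the ecom_id mentioned in the question, and seen is shifted by 1 to avoid dividing by zero):
import pandas as pd

pr_pd = pd.DataFrame({                        # made-up stand-in for the real data
    "id_product": [100000000, 100002331, 100000321, 100000443, 100000292],
    "margin":     [92.00, 30.55, 77.78, 77.78, 88.13],
    "supplier":   [14, 13, 98, 17, 3],
    "price":      [0.56, 41.67, 13.33, 95.00, 85.50],
    "seen":       [2, 0, 0, 4587, 0],
})

slices = [pr_pd[pr_pd["price"] < 20],
          pr_pd[(pr_pd["price"] > 30) & (pr_pd["price"] < 50)],
          pr_pd[pr_pd["price"] > 80]]

chosen, used = [], set()
for part in slices:
    part = part[~part["supplier"].isin(used)]        # keep supplier ids distinct
    weights = part["margin"] / (part["seen"] + 1)     # favour high margin, low seen
    row = part.sample(1, weights=weights)
    used.add(row["supplier"].iloc[0])
    chosen.append(row)

result = pd.concat(chosen)
print(result)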

Is there any way to make a column for each category under each group?

There are two columns, both categorical. I want to group by the first column, age_group, and have my other column, Engagement_category, produce a new column for each of its values.
I did the groupby on the first column, age_group.
| age_group | Engagement_category |
|--------------|---------------------|
| 21-26 | Not Engaged |
| 27-32 | Disengaged |
| 33-38 | Engaged |
| 39-44 | Disengaged |
| 45-50 | Not Engaged |
| 50 and Above | Engaged |
group = df.groupby('age_group')
The below one is required output:
| age_group | Engaged | Nearly Engaged | Not Engaged | Disengaged |
|-----------|---------|----------------|-------------|------------|
| 21-26 | 3 | 4 | 1 | 1 |
| 27-32 | 4 | 0 | 4 | 0 |
| 33-38 | 2 | 0 | 1 | 1 |
Thank you.
You want to group by the two columns and then use the count function to aggregate across the dataframe. Try this:
df.groupby(['age_group', 'Engagement_category']).agg(['count'])
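If you then want one column per category, as in the required output, you could spread the counts into columns, for example with pd.crosstab (a sketch using a small made-up slice of the data):
import pandas as pd

# Made-up slice using the question's column names.
df = pd.DataFrame({
    "age_group": ["21-26", "21-26", "27-32", "27-32", "33-38"],
    "Engagement_category": ["Not Engaged", "Engaged", "Disengaged", "Engaged", "Engaged"],
})

wide = pd.crosstab(df["age_group"], df["Engagement_category"])
# Equivalent: df.groupby(["age_group", "Engagement_category"]).size().unstack(fill_value=0)
print(wide)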

How to Group Data with Matching Values

I am new to using Pandas and I am trying to restructure a dataframe to remove the duplicates in my first column, while also keeping a count of each duplicate and taking the sum of the values in the second column.
For example, I would like the conversion to look something like this:
[In]:
+---+------+-------+
| | Name | Value |
+---+------+-------+
| 0 | A | 5 |
| 1 | B | 5 |
| 2 | C | 10 |
| 3 | A | 15 |
| 4 | A | 5 |
| 5 | C | 10 |
+---+------+-------+
[Out]:
+---+------+--------+-------+
| | Name | Number | Total |
+---+------+--------+-------+
| 0 | A | 3 | 25 |
| 1 | B | 1 | 5 |
| 2 | C | 2 | 20 |
+---+------+--------+-------+
So far, I haven't been able to find an efficient method to do this. (Or even a working method.)
I will be working with several hundred thousand rows, so I will need to find a pretty efficient method.
The pandas agg function on a groupby is what you want.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.agg.html
Here is an example:
import pandas as pd
df = pd.DataFrame({'Name': ['A', 'B', 'C', 'A', 'A', 'C'],
                   'Value': [5, 5, 10, 15, 5, 10]})
df.groupby('Name').agg(['count', 'sum'])
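If you also want the Number and Total column names from the desired output, named aggregation is one way (a sketch, assuming pandas 0.25 or later):
out = (df.groupby('Name')['Value']
         .agg(Number='count', Total='sum')
         .reset_index())
print(out)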
Hope that helps.
