How to Group Data with Matching Values - python

I am new to Pandas and I am trying to restructure a dataframe: remove the duplicates in my first column while keeping a count of how many times each value appeared, and sum the corresponding values in the second column.
For example, I would like the conversion to look something like this:
[In]:
+---+------+-------+
|   | Name | Value |
+---+------+-------+
| 0 | A    | 5     |
| 1 | B    | 5     |
| 2 | C    | 10    |
| 3 | A    | 15    |
| 4 | A    | 5     |
| 5 | C    | 10    |
+---+------+-------+
[Out]:
+---+------+--------+-------+
|   | Name | Number | Total |
+---+------+--------+-------+
| 0 | A    | 3      | 25    |
| 1 | B    | 1      | 5     |
| 2 | C    | 2      | 20    |
+---+------+--------+-------+
So far, I haven't been able to find an efficient method to do this. (Or even a working method.)
I will be working with several hundred thousand rows, so I will need to find a pretty efficient method.

The pandas agg function on a groupby is what you want.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.agg.html
Here is an example:
import pandas as pd

df = pd.DataFrame({'Name': ['A', 'B', 'C', 'A', 'A', 'C'],
                   'Value': [5, 5, 10, 15, 5, 10]})

# count and sum of 'Value' within each 'Name' group
df.groupby('Name').agg(['count', 'sum'])
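The MultiIndex columns that agg(['count', 'sum']) returns can be flattened, but if you want the Number and Total column names from your expected output directly, named aggregation (available since pandas 0.25) is a tidy way to get them; a minimal sketch:

import pandas as pd

df = pd.DataFrame({'Name': ['A', 'B', 'C', 'A', 'A', 'C'],
                   'Value': [5, 5, 10, 15, 5, 10]})

# Each keyword becomes a result column: (source column, aggregation function)
out = df.groupby('Name', as_index=False).agg(
    Number=('Value', 'count'),
    Total=('Value', 'sum'),
)
print(out)
#   Name  Number  Total
# 0    A       3     25
# 1    B       1      5
# 2    C       2     20

Since groupby aggregations are vectorized, this comfortably handles several hundred thousand rows.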
Hope that helps.

Related

Intersect two dataframes in Pandas with respect to first dataframe?

I want to intersect two Pandas dataframes (1 and 2) based on two columns (A and B) present in both dataframes. However, I would like to return a dataframe that only has data with respect to the data in the first dataframe, omitting anything that is not found in the second dataframe.
So for example:
Dataframe 1:
A | B | Extra | Columns | In | 1 |
----------------------------------
1 | 2 | Extra | Columns | In | 1 |
1 | 3 | Extra | Columns | In | 1 |
1 | 5 | Extra | Columns | In | 1 |
Dataframe 2:
A | B | Extra | Columns | In | 2 |
----------------------------------
1 | 3 | Extra | Columns | In | 2 |
1 | 4 | Extra | Columns | In | 2 |
1 | 5 | Extra | Columns | In | 2 |
should return:
A | B | Extra | Columns | In | 1 |
----------------------------------
1 | 3 | Extra | Columns | In | 1 |
1 | 5 | Extra | Columns | In | 1 |
Is there a way I can do this simply?
You can use df.merge:
df = df1.merge(df2, on=['A', 'B'], how='inner')
how='inner' is the default; it is written out here to make the join type explicit. Note that this form also carries df2's extra columns into the result (with _x/_y suffixes where names collide), so, as @piRSquared suggested, it is cleaner to merge against the key columns only:
df1.merge(df2[['A', 'B']], how='inner')
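A quick sketch on made-up frames (the extra1/extra2 column names are illustrative) showing the effect of selecting only the key columns:

import pandas as pd

df1 = pd.DataFrame({'A': [1, 1, 1], 'B': [2, 3, 5], 'extra1': ['p', 'q', 'r']})
df2 = pd.DataFrame({'A': [1, 1, 1], 'B': [3, 4, 5], 'extra2': ['x', 'y', 'z']})

# Inner join on the keys: keeps the rows of df1 whose (A, B) pair also
# occurs in df2, without dragging df2's other columns along.
result = df1.merge(df2[['A', 'B']], on=['A', 'B'], how='inner')
print(result)
#    A  B extra1
# 0  1  3      q
# 1  1  5      r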

Is there any way make column for each category under each group?

I have two columns, both categorical. I want to group by the first column, age_group, and have each distinct value of the other column, Engagement_category, become its own column.
I have done the groupby on the first column, age_group.
| age_group    | Engagement_category |
|--------------|---------------------|
| 21-26        | Not Engaged         |
| 27-32        | Disengaged          |
| 33-38        | Engaged             |
| 39-44        | Disengaged          |
| 45-50        | Not Engaged         |
| 50 and Above | Engaged             |
group = df.groupby('age_group')
The below one is required output:
| age_group | Engaged | Nearly Engaged | Not Engaged | Disengaged |
|-----------|---------|----------------|-------------|------------|
| 21-26     | 3       | 4              | 1           | 1          |
| 27-32     | 4       | 0              | 4           | 0          |
| 33-38     | 2       | 0              | 1           | 1          |
Thank you.
You want to group by both columns, count the rows in each pair, and then pivot the Engagement_category values into the header row. Note that the grouping keys must match your actual column names (age_group, not Age). Try this:
df.groupby(['age_group', 'Engagement_category']).size().unstack(fill_value=0)
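Alternatively, pd.crosstab builds the same wide count table directly; a minimal sketch on made-up rows:

import pandas as pd

df = pd.DataFrame({
    'age_group': ['21-26', '21-26', '27-32', '27-32', '33-38'],
    'Engagement_category': ['Engaged', 'Not Engaged', 'Engaged', 'Engaged', 'Disengaged'],
})

# One row per age_group, one column per category, cells are counts
print(pd.crosstab(df['age_group'], df['Engagement_category']))
# Engagement_category  Disengaged  Engaged  Not Engaged
# age_group
# 21-26                         0        1            1
# 27-32                         0        2            0
# 33-38                         1        0            0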

Reshaping pandas data frame based on a split on column name

I have a dataframe like so:
ID  | Timestamp | x.val_1 | x.val_2 | y.val_1 | y.val_2 | ...
1js | 1pm       | 0       | 0       | 10      | 3       | ...
1js | 2pm       | 3       | 4       | 11      | 1       | ...
I would like to transform it into this data frame:
ID  | Timestamp | LABEL | x | y  |
1js | 1pm       | val_1 | 0 | 10 |
1js | 1pm       | val_2 | 0 | 3  |
1js | 2pm       | val_1 | 3 | 11 |
1js | 2pm       | val_2 | 4 | 1  |
...
I know that I can stack pandas dataframes, but here the columns have to be split to be able to do that, and I am wondering if there is an elegant way of handling this.
I looked at this question: How to melt 2 columns at the same time?
However, that does not address my question: the answer there assumes a small, known set of columns. In my case I have many columns in the format "label.value_type", and I can have multiple labels and multiple value types, so that structure is the only guarantee. I would like a systematic way of reshaping this data frame by splitting the column names on the ".".
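One systematic approach, sketched here under the assumption that every non-identifier column follows the "label.value_type" pattern, is to split the column names into a two-level MultiIndex and stack the value-type level:

import pandas as pd

df = pd.DataFrame({
    'ID': ['1js', '1js'],
    'Timestamp': ['1pm', '2pm'],
    'x.val_1': [0, 3], 'x.val_2': [0, 4],
    'y.val_1': [10, 11], 'y.val_2': [3, 1],
})

# Park the identifier columns in the index, then split each remaining
# column name on "." into a (label, value_type) pair.
wide = df.set_index(['ID', 'Timestamp'])
wide.columns = pd.MultiIndex.from_tuples(
    [tuple(c.split('.', 1)) for c in wide.columns])

# Stacking the value_type level moves val_1/val_2 into the rows,
# leaving one column per label (x, y).
long = (wide.stack(level=1)
            .rename_axis(index=['ID', 'Timestamp', 'LABEL'])
            .reset_index())
print(long)
#     ID Timestamp  LABEL  x   y
# 0  1js       1pm  val_1  0  10
# 1  1js       1pm  val_2  0   3
# 2  1js       2pm  val_1  3  11
# 3  1js       2pm  val_2  4   1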

How to make a Pandas count pivot table with related children items?

I'm stuck on a small problem with Python and a pandas dataframe.
I want to make a pivot table that counts related items.
I have a dataframe with this structure:
+-----+------------+-----------+-----------+
| ID  | Item_Type  | Basket_ID | OwnerName |
+-----+------------+-----------+-----------+
| 3   | Basket     |           |           |
| 336 | ChickenEgg | 3         | Henk      |
| 841 | SomeEgg    | 3         | Henk      |
| 671 | EasterEgg  | 3         | Piet      |
| 9   | Basket     |           |           |
| 336 | Orange     | 9         | Piet      |
| 841 | Banana     | 9         | Piet      |
| 671 | Strawberry | 9         | Herman    |
| 888 | Apple      | 9         | Herman    |
| 821 | Apricots   | 9         | NaN       |
+-----+------------+-----------+-----------+
I want to count how many items are related to each 'Basket' item (the parent) and how often each 'OwnerName' appears with that basket.
I want my dataframe to look like the one below.
It shows the total count of items related to the parent Item_Type 'Basket', the count of how often each name appears, the 'Total_Owners' count, and the number of items without an owner.
+----+-----------+-------------------+------------+------------+--------------+--------------+------------------+
| ID | Item_Type | Total_Items_Count | Henk_Count | Piet_Count | Herman_Count | Total_Owners | Total_NaN_Values |
+----+-----------+-------------------+------------+------------+--------------+--------------+------------------+
| 3  | Basket    | 3                 | 2          | 1          | 0            | 3            |                  |
| 9  | Basket    | 5                 | 0          | 2          | 2            | 4            | 1                |
+----+-----------+-------------------+------------+------------+--------------+--------------+------------------+
Answering your question takes a few steps, but the core idea is to use pivot_table.
The table conceptually has a multilevel index: Basket_ID is the high-level index and 'ID' is the more granular one. The first thing to do is remove the rows where Basket_ID is missing, so that the granularity of the table is consistent.
Let's say you named your dataframe df.
# Preparation steps
df = df[~df["Basket_ID"].isna()]  # remove the parent rows that shouldn't be counted
df.loc[df["OwnerName"].isna(), "OwnerName"] = "unknown"  # replace missing owners with a placeholder
# Make a pivot table: one row per basket, one column per owner, cells count the items
df = df.pivot_table(index=['Basket_ID'], columns=['OwnerName'], values=['Item_Type'], aggfunc='count').fillna(0)
From there onwards you should be able to calculate your remaining columns
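One way to finish, sketched here starting again from the original dataframe (before the "unknown" fill above), is named aggregation for the totals plus pd.crosstab for the per-owner columns:

import pandas as pd

items = df[df["Basket_ID"].notna()]  # keep only the child rows

# Totals per basket; count() skips NaN, so it gives the owner-entry total
summary = items.groupby("Basket_ID").agg(
    Total_Items_Count=("ID", "count"),
    Total_Owners=("OwnerName", "count"),
    Total_NaN_Values=("OwnerName", lambda s: s.isna().sum()),
)

# One column per owner name, e.g. Henk_Count, Piet_Count, Herman_Count
per_owner = pd.crosstab(items["Basket_ID"], items["OwnerName"]).add_suffix("_Count")

result = summary.join(per_owner).reset_index()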

How to merge rows with the same string, summing up the connected values

I have the following DataFrame
| name | number |
|------|--------|
| a    | 1      |
| a    | 1      |
| a    | 1      |
| b    | 2      |
| b    | 2      |
| b    | 2      |
| c    | 3      |
| c    | 3      |
| c    | 3      |
| d    | 4      |
| d    | 4      |
| d    | 4      |
I wish to merge all the rows with the same name, adding up their number values and keeping the total in line with the name.
Desired output:
| name | number |
|------|--------|
| a    | 3      |
| b    | 6      |
| c    | 9      |
| d    | 12     |
It seems you need groupby and aggregate sum:
df = df.groupby('name', as_index=False)['number'].sum()
#or
#df = df.groupby('name')['number'].sum().reset_index()
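A quick check of the one-liner on the data above, reconstructed as a small frame:

import pandas as pd

df = pd.DataFrame({'name': list('aaabbbcccddd'),
                   'number': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4]})

# as_index=False keeps 'name' as a regular column in the result
print(df.groupby('name', as_index=False)['number'].sum())
#   name  number
# 0    a       3
# 1    b       6
# 2    c       9
# 3    d      12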
In SQL, assuming DataFrame is your table name:
SELECT name, SUM(number) AS number FROM DataFrame GROUP BY name
Then insert the result back after deleting the original rows.
