Reshaping pandas data frame based on a split on column name - python

I have a dataframe like so:
ID | Timestamp | x.val_1 | x.val_2 | y.val_1 | y.val_2 | ...
1js | 1pm | 0 | 0 | 10 | 3 | ...
1js | 2pm | 3 | 4 | 11 | 1 | ...
I would like to transform it into this data frame:
ID | Timestamp | LABEL | x | y |
1js | 1pm | val_1 | 0 | 10 |
1js | 1pm | val_2 | 0 | 3 |
1js | 2pm | val_1 | 3 | 11 |
1js | 2pm | val_2 | 4 | 1 |
...
I know that I can stack pandas dataframes, but here the columns have to be split to be able to do that, and I am wondering if there is an elegant way of handling this.
I looked at this question: How to melt 2 columns at the same time?
However, that does not address my question. The answer there assumes a small, known set of columns. In my case I have many columns in the format "label.value_type", with multiple labels and multiple value types, and that structure is the only guarantee. I would like a systematic way of reshaping this data frame by splitting the column names on ".", which the linked question does not cover.
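For what it's worth, one possible approach (a sketch, not from the original thread, assuming every column other than ID and Timestamp follows the "label.value_type" pattern) is to split the column names into a two-level column index and then stack the value-type level:

import pandas as pd

df = df.set_index(['ID', 'Timestamp'])
# turn "x.val_1" into the tuple ('x', 'val_1'), giving a two-level column index
df.columns = pd.MultiIndex.from_tuples([tuple(c.split('.', 1)) for c in df.columns])
# stack the innermost column level (val_1, val_2, ...) into the rows
out = df.stack().rename_axis(['ID', 'Timestamp', 'LABEL']).reset_index()

This leaves one column per label (x, y, ...) and a LABEL column holding val_1, val_2, and so on.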

Related

Intersect two dataframes in Pandas with respect to first dataframe?

I want to intersect two pandas dataframes (1 and 2) based on two columns (A and B) present in both dataframes. However, I would like the returned dataframe to contain only the data from the first dataframe, omitting anything that is not found in the second dataframe.
So for example:
Dataframe 1:
A | B | Extra | Columns | In | 1 |
----------------------------------
1 | 2 | Extra | Columns | In | 1 |
1 | 3 | Extra | Columns | In | 1 |
1 | 5 | Extra | Columns | In | 1 |
Dataframe 2:
A | B | Extra | Columns | In | 2 |
----------------------------------
1 | 3 | Extra | Columns | In | 2 |
1 | 4 | Extra | Columns | In | 2 |
1 | 5 | Extra | Columns | In | 2 |
should return:
A | B | Extra | Columns | In | 1 |
----------------------------------
1 | 3 | Extra | Columns | In | 1 |
1 | 5 | Extra | Columns | In | 1 |
Is there a way I can do this simply?
You can use df.merge:
df = df1.merge(df2, on=['A','B'], how='inner').drop('2', axis=1)
how='inner' is the default; it is written out here only to make it clear how df.merge works.
As @piRSquared suggested, you can do:
df1.merge(df2[['A', 'B']], how='inner')
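For reference, a small runnable sketch of that second form, with hypothetical columns C and D standing in for the extra columns of each dataframe:

import pandas as pd

df1 = pd.DataFrame({'A': [1, 1, 1], 'B': [2, 3, 5], 'C': ['in1', 'in1', 'in1']})
df2 = pd.DataFrame({'A': [1, 1, 1], 'B': [3, 4, 5], 'D': ['in2', 'in2', 'in2']})

# keep only df1's columns, for the (A, B) pairs that also appear in df2
result = df1.merge(df2[['A', 'B']], how='inner')
#    A  B    C
# 0  1  3  in1
# 1  1  5  in1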

Is there any way to make a column for each category under each group?

There are two columns, both categorical. I want to group by the first column, age_group, and create a new column for each distinct value of the other column, Engagement_category.
I did the groupby on the first column, age_group.
| age_group | Engagement_category |
|--------------|---------------------|
| 21-26 | Not Engaged |
| 27-32 | Disengaged |
| 33-38 | Engaged |
| 39-44 | Disengaged |
| 45-50 | Not Engaged |
| 50 and Above | Engaged |
group = df.groupby('age_group')
The required output is below:
| age_group | Engaged | Nearly Engaged | Not Engaged | Disengaged |
|-----------|---------|----------------|-------------|------------|
| 21-26 | 3 | 4 | 1 | 1 |
| 27-32 | 4 | 0 | 4 | 0 |
| 33-38 | 2 | 0 | 1 | 1 |
Thank you.
You want to group by both columns and then count the rows in each group. Try this:
df.groupby(['age_group', 'Engagement_category']).size()
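If you also want the wide layout shown above, with one column per Engagement_category, a crosstab gives it directly (a minimal sketch, assuming the dataframe is named df):

import pandas as pd

# rows: age_group, columns: one per Engagement_category, values: counts
wide = pd.crosstab(df['age_group'], df['Engagement_category'])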

Splitting datasets of various size into separate dataframes

I am new to Python and pandas.
I have a .csv file exported from some measurement equipment that has given me all the measurements taken over the course of a day in a single .csv file.
I have already managed to produce a fairly tidy dataframe, but I cannot work out how to separate the measurements.
The dataframe is structured as follows:
+-------------------------+-------------------+--------------------+-----------+------+-------------+------+--------------+
| SetupTitle | measurement_type | nan | nan | nan | nan | nan | nan |
| MetaData | TestRecord | measurement number | nan | nan | nan | nan | nan |
| DataName | voltage 1 | voltage 2 | current 1 | ... | | | data name 8 |
| DataValues | data 1 | ... | | | | | data 8 |
| ... | | | | | | | |
| hundreds of data points | | | | | | | |
| ... | | | | | | | |
| SetupTitle | measurement type | nan | nan | nan | nan | nan | nan |
| etc... | | | | | | | |
+-------------------------+-------------------+--------------------+-----------+------+-------------+------+--------------+
I would like to split each measurement into individual dataframes by using the "SetupTitle" value as a start point, but I'm not sure how to iterate through the column or how to extract the rest of the columns from each.
I think once they are split up I will be able to remove the setup row and metadata row and use them to name the dataframe which will give me a nice dataset.
You can use cumsum to count the occurrences of a specific value and groupby to separate them:
s = df[name_of_column].eq('SetupTitle').cumsum()
The value of s will then increase by one every time "SetupTitle" appears in your column, so each block of rows gets its own number. You can access the blocks with:
# say we want to store them in a dict:
blocks = {}
for num_block, block in df.groupby(s):
    # do whatever you want with the group
    blocks[num_block] = block
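To illustrate on a toy frame (with a hypothetical column name 'col0' standing in for your first column):

import pandas as pd

df = pd.DataFrame({'col0': ['SetupTitle', 'MetaData', 'DataName', 'DataValues',
                            'SetupTitle', 'MetaData', 'DataName', 'DataValues']})
s = df['col0'].eq('SetupTitle').cumsum()   # 1, 1, 1, 1, 2, 2, 2, 2
blocks = {num: block for num, block in df.groupby(s)}
# blocks[1] and blocks[2] are the two measurements as separate dataframes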

How to make a Pandas count pivot table with related children items?

I'm stuck on a small problem with Python and a pandas dataframe.
I want to make a pivot table that counts some related items.
I have a dataframe with this structure.
+-----+------------+-----------+-----------+
| ID | Item_Type | Basket_ID | OwnerName |
+-----+------------+-----------+-----------+
| 3 | Basket | | |
| 336 | ChickenEgg | 3 | Henk |
| 841 | SomeEgg | 3 | Henk |
| 671 | EasterEgg | 3 | Piet |
| 9 | Basket | | |
| 336 | Orange | 9 | Piet |
| 841 | Banana | 9 | Piet |
| 671 | Strawberry | 9 | Herman |
| 888 | Apple | 9 | Herman |
| 821 | Apricots | 9 | NaN |
+-----+------------+-----------+-----------+
I want to count how many items are related to the ‘Basket’ item (Parent) and how often the ‘OwnerName’ appears with the related ‘Basket’ item.
I want my dataframe to look like the one below.
It shows the total count of items related to the parent Item_Type 'Basket', and how often each owner name appears with that basket.
It also shows how many owners there are in total ('Total_Owners'), as well as the items without an owner.
+----+-----------+-------------------+------------+------------+--------------+--------------+------------------+
| ID | Item_Type | Total_Items_Count | Henk_Count | Piet_Count | Herman_Count | Total_Owners | Total_NaN_Values |
+----+-----------+-------------------+------------+------------+--------------+--------------+------------------+
| 3 | Basket | 3 | 2 | 1 | 0 | 3 | |
| 9 | Basket | 5 | 0 | 2 | 2 | 4 | 1 |
+----+-----------+-------------------+------------+------------+--------------+--------------+------------------+
Answering your question requires multiple steps, but the core idea is to use pivot_table.
The table conceptually has a multilevel index: Basket_ID is the high-level index and ID is the more granular one. The first thing to do is remove the rows where Basket_ID is missing, so that the granularity of the table is consistent.
Let's say you named your dataframe df.
# Preparation steps
df = df[~df["Basket_ID"].isna()] # Remove the rows that shouldn't be counted.
df.loc[df["OwnerName"].isna(),"OwnerName"] = "unknown" # set missing to arbitrary value
# Make a pivot table
df = df.pivot_table(index=['Basket_ID'],columns=['OwnerName'],values=['Item_Type'],aggfunc='count').fillna(0)
From there onwards you should be able to calculate the remaining columns.
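For example, a rough sketch of those remaining columns, continuing from the pivoted frame above (assuming the missing owners were set to "unknown" as in the preparation step; the exact owner columns depend on your data):

df.columns = df.columns.droplevel(0)                  # drop the 'Item_Type' level, keep the owner names
owner_cols = [c for c in df.columns if c != 'unknown']
df['Total_Owners'] = df[owner_cols].sum(axis=1)       # items that have a named owner
df['Total_NaN_Values'] = df.get('unknown', 0)         # items that had no owner
df['Total_Items_Count'] = df['Total_Owners'] + df['Total_NaN_Values']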

How to Group Data with Matching Values

I am new to using pandas and I am trying to restructure a dataframe to remove the duplicates in my first column, while keeping a count of how many times each value occurred and taking the sum of the values in the second column.
For example, I would like the conversion to look something like this:
[In]:
+---+------+-------+
| | Name | Value |
+---+------+-------+
| 0 | A | 5 |
| 1 | B | 5 |
| 2 | C | 10 |
| 3 | A | 15 |
| 4 | A | 5 |
| 5 | C | 10 |
+---+------+-------+
[Out]:
+---+------+--------+-------+
| | Name | Number | Total |
+---+------+--------+-------+
| 0 | A | 3 | 25 |
| 1 | B | 1 | 5 |
| 2 | C | 2 | 20 |
+---+------+--------+-------+
So far, I haven't been able to find an efficient method to do this. (Or even a working method.)
I will be working with several hundred thousand rows, so I will need to find a pretty efficient method.
The pandas agg function on a groupby is what you want.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.agg.html
Here is an example:
import pandas as pd

df = pd.DataFrame({'Name': ['A', 'B', 'C', 'A', 'A', 'C'],
                   'Value': [5, 5, 10, 15, 5, 10]})
df.groupby('Name').agg(['count', 'sum'])
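If you want the columns named exactly Number and Total as in the desired output, named aggregation (available since pandas 0.25) can do it; a minimal sketch:

df.groupby('Name', as_index=False).agg(Number=('Value', 'count'), Total=('Value', 'sum'))
#   Name  Number  Total
# 0    A       3     25
# 1    B       1      5
# 2    C       2     20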
Hope that helps.
