Flatten a Dataframe that is pivoted - python

I have the following code that takes a single column and pivots it into multiple columns. The result contains blanks that I am trying to remove, but the wrong values end up applied to some rows.
task_df = task_df.pivot(index=pivot_cols, columns='Field')['Value'].reset_index()
task_df[['Color','Class']] = task_df[['Color','Class']].bfill()
task_df[['Color','Class']] = task_df[['Color','Class']].ffill()
task_df = task_df.drop_duplicates()
(The Start, Current, and Desired tables were shown as images in the original post.)

This is essentially merging together all rows that share the same name or ID. You can do it with this:
mergers = {'ID': 'first', 'Color': 'sum', 'Class': 'sum'}
task_df = (task_df.groupby('Name', as_index=False)
                  .aggregate(mergers)
                  .reindex(columns=task_df.columns)
                  .sort_values(by=['ID']))
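A runnable sketch of that answer on hypothetical data (the column names follow the question, but the values here are invented), showing how 'sum' on string columns concatenates the fragments while 'first' keeps one ID per name:

```python
import pandas as pd

# Hypothetical data mimicking the pivoted result, with blanks as empty strings
task_df = pd.DataFrame({
    'Name': ['A', 'A', 'B', 'B'],
    'ID': [1, 1, 2, 2],
    'Color': ['Red', '', '', 'Blue'],
    'Class': ['', 'X', 'Y', ''],
})

# 'sum' on object columns concatenates the string pieces; 'first' keeps the ID
mergers = {'ID': 'first', 'Color': 'sum', 'Class': 'sum'}
merged = (task_df.groupby('Name', as_index=False)
                 .aggregate(mergers)
                 .reindex(columns=task_df.columns)
                 .sort_values(by=['ID']))
print(merged)
```

Each name collapses to a single row with the blanks gone.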

Related

Apply separate formulas to separate columns but for the same row in Multi Index Dataframe?

I'm working with a multi-index dataframe and am trying to apply two different formulas, appending the result as a new row at the bottom ("Total").
However, because the dataframe has a multi-index, the calculation differs depending on the sub-header.
Dataframe:
import pandas as pd

data = {'ID1': [1.0, 2.0, 1.2, 1.7, 0.8, 0.9],
        'ID2': [5.0, 3.0, 7.0, 2.0, 1.0, 6.0],
        'Date': ['10-31-2022', '12-31-2022', '10-31-2022',
                 '12-31-2022', '10-31-2022', '12-31-2022'],
        'Name': ['John', 'John', 'Kat', 'Kat', 'Adam', 'Adam']}
df1 = pd.DataFrame(data=data)
Creating multi index:
df1 = df1.set_index(['Date', 'Name'])
df1 = df1.unstack().swaplevel(0, 1, axis=1).sort_index(axis=1, level=0, sort_remaining=False)
df: (shown as an image in the original post)
In the "Total" row, I am trying to apply cumprod to the ID1 columns, while the ID2 columns will simply be the mean.
I can calculate each set by filtering to ID1 or ID2 only; however, that would require creating separate rows for them.
(My output and the expected output were shown as images in the original post.)
Note: the numbers in the total row are random, just putting it there for visual reference.
Calculation for ID1 columns:
df_cumprod = (1 + df.filter(like='ID1')).cumprod(skipna=True) - 1
df_cumprod2 = df_cumprod.tail(1)
df_cumprod2 = df_cumprod2.rename({df_cumprod2.index[-1]: 'Total'})
df = pd.concat([df, df_cumprod2])  # DataFrame.append was removed in pandas 2.0
Calculation for ID2 columns:
df.loc['Total'] = df.iloc[:, df.columns.get_level_values(1) == 'ID2'].mean()
This creates two "Total" rows. Is there a way to merge them so the values appear side by side in the same row (as shown in the expected output above), or to apply two different formulas to the same row depending on the column name?
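One possible sketch (using the setup from the question; this is only one way to do it, not the canonical answer): compute each aggregation as a Series keyed by the relevant columns, concatenate them, and assign the combined Series as a single "Total" row, letting .loc align on the column MultiIndex.

```python
import pandas as pd

data = {'ID1': [1.0, 2.0, 1.2, 1.7, 0.8, 0.9],
        'ID2': [5.0, 3.0, 7.0, 2.0, 1.0, 6.0],
        'Date': ['10-31-2022', '12-31-2022', '10-31-2022',
                 '12-31-2022', '10-31-2022', '12-31-2022'],
        'Name': ['John', 'John', 'Kat', 'Kat', 'Adam', 'Adam']}
df = (pd.DataFrame(data)
        .set_index(['Date', 'Name'])
        .unstack()
        .swaplevel(0, 1, axis=1)
        .sort_index(axis=1, level=0, sort_remaining=False))

# Select the ID1 and ID2 column subsets from the second column level
id1 = df.columns[df.columns.get_level_values(1) == 'ID1']
id2 = df.columns[df.columns.get_level_values(1) == 'ID2']

# Compounded return for ID1 columns, plain mean for ID2 columns
id1_total = (1 + df[id1]).cumprod().iloc[-1] - 1
id2_total = df[id2].mean()

# One combined Series covers every column, so a single Total row is written
df.loc['Total'] = pd.concat([id1_total, id2_total])
```

Because the concatenated Series is indexed by the same column tuples as df, the assignment fills both column groups in one row.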

Groupby, sum, reset index & keep first all together

I am using the following code. My goal is to group by two columns (out of tens of them), keep the first value of all the other columns, and sum the values of two specific columns. It doesn't work, no matter what combination I have tried.
Code used:
df1 = df.groupby(['col_1', 'Col_2'], as_index=False)[['Age', 'Income']].apply(sum).first()
The error I am getting is the following, which leads me to believe this can be done with a slightly different version of the code I used:
TypeError: first() missing 1 required positional argument: 'offset'
Any suggestions would be more than appreciated!
You can use agg, configuring the appropriate function for each column.
group = ['col_1', 'col_2']
(df.groupby(group, as_index=False)
   .agg({
       **{x: 'first' for x in df.columns[~df.columns.isin(group)]},  # all columns other than the grouping columns
       **{'Age': 'sum', 'Income': 'sum'},  # overwrite the aggregation for specific columns
   })
)
The { **{...}, **{...} } part will generate:
{
    'Age': 'sum',
    'Income': 'sum',
    'othercol': 'first',
    'morecol': 'first'
}
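For example, on a small hypothetical frame (the Age/Income columns come from the question; the other column names and all values are invented):

```python
import pandas as pd

df = pd.DataFrame({
    'col_1': ['a', 'a', 'b'],
    'col_2': ['x', 'x', 'y'],
    'Age': [10, 20, 30],
    'Income': [100, 200, 300],
    'City': ['NY', 'LA', 'SF'],  # a "keep first" column
})

group = ['col_1', 'col_2']
# Dict-merge: 'first' for every non-grouping column, then overwrite Age/Income with 'sum'
out = (df.groupby(group, as_index=False)
         .agg({**{c: 'first' for c in df.columns[~df.columns.isin(group)]},
               **{'Age': 'sum', 'Income': 'sum'}}))
print(out)
```

The (a, x) group sums Age and Income across its two rows but keeps only the first City.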

Python Pandas - How to find rows where a column value differs between two data frames

I am trying to get the rows where the values in a column differ between two data frames.
For example, let's say we have the two dataframes below:
import pandas as pd

data1 = {'date': [20210701, 20210704, 20210703, 20210705, 20210705],
         'name': ['Dave', 'Dave', 'Sue', 'Sue', 'Ann'],
         'a': [1, 0, 1, 1, 0]}
data2 = {'date': [20210701, 20210702, 20210704, 20210703, 20210705, 20210705],
         'name': ['Dave', 'Dave', 'Dave', 'Sue', 'Sue', 'Ann'],
         'a': [1, 0, 1, 1, 0, 0]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
As you can see, Dave has different values in column 'a' on 20210704, and Sue has different values in column 'a' on 20210705. Therefore, my desired output should be something like:
output = {'date': [20210704, 20210705],
          'name': ['Dave', 'Sue'],
          'a_from_old': [0, 1]}
df_output = pd.DataFrame(output)
If I am not mistaken, what I am asking for is pretty much the same thing as the MINUS (EXCEPT) statement in SQL, unless I am missing some edge cases.
How can I find those rows with the same date and name but different values in a column?
Update
I found an edge case: some rows are not present in the other data frame at all. I want the rows that are in both data frames but whose value in column 'a' differs.
I edited the sample data set to take this edge case into account.
(Notice that Dave on 20210702 does not appear in the final output, because that row is not in the first data frame.)
Another option, with an inner merge: keep only rows where the a from df1 does not equal the a from df2:
df3 = (
    df1.merge(df2, on=['date', 'name'], suffixes=('_from_old', '_df2'))
       .query('a_from_old != a_df2')  # df1 `a` != df2 `a`
       .drop('a_df2', axis=1)         # remove the column with df2 `a` values
)
df3:
date name a_from_old
2 20210704 Dave 0
4 20210705 Sue 1
Try a left merge() with indicator=True, then filter the result with query(), drop the extra column with drop(), and rename 'a' to 'a_from_old' with rename():
out = (df1.merge(df2, on=['date', 'name', 'a'], how='left', indicator=True)
          .query("_merge == 'left_only'")
          .drop('_merge', axis=1)
          .rename(columns={'a': 'a_from_old'}))
output of out:
date name a_from_old
2 20210704 Dave 0
4 20210705 Sue 1
Note: if there are many more columns that you want renamed, pass suffixes=('_from_old', '') to merge() as a parameter.
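A sketch of that suffixes variant on the question's data: with suffixes=('_from_old', ''), df1's column becomes a_from_old while df2's keeps the bare name a, so no rename() call is needed (out is just an illustrative variable name):

```python
import pandas as pd

df1 = pd.DataFrame({'date': [20210701, 20210704, 20210703, 20210705, 20210705],
                    'name': ['Dave', 'Dave', 'Sue', 'Sue', 'Ann'],
                    'a': [1, 0, 1, 1, 0]})
df2 = pd.DataFrame({'date': [20210701, 20210702, 20210704, 20210703, 20210705, 20210705],
                    'name': ['Dave', 'Dave', 'Dave', 'Sue', 'Sue', 'Ann'],
                    'a': [1, 0, 1, 1, 0, 0]})

# Inner merge keeps only (date, name) pairs present in both frames;
# the empty right suffix leaves df2's column named plain 'a'
out = (df1.merge(df2, on=['date', 'name'], suffixes=('_from_old', ''))
          .query('a_from_old != a')
          .drop(columns='a'))
```

This reproduces the desired output, including the 20210702 edge case, because the inner merge silently drops rows missing from either frame.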

Joining dataframes based on values, pandas

I have two data frames, let's say A and B. A has the columns ['Name', 'Age', 'Mobile_number'] and B has the columns ['Cell_number', 'Blood_Group', 'Location'], with 'Mobile_number' and 'Cell_number' sharing common values. I want to join only the 'Location' column onto A, based on the common values in 'Mobile_number' and 'Cell_number', so the final DataFrame A would have the columns ['Name', 'Age', 'Mobile_number', 'Location'].
a = {'Name': ['Jake', 'Paul', 'Logan', 'King'], 'Age': [33, 43, 22, 45],
     'Mobile_number': [332, 554, 234, 832]}
A = pd.DataFrame(a)
b = {'Cell_number': [832, 554, 123, 333], 'Blood_group': ['O', 'A', 'AB', 'AB'],
     'Location': ['TX', 'AZ', 'MO', 'MN']}
B = pd.DataFrame(b)
Please suggest. A colleague suggested using pd.Join, but I don't understand how.
Thank you for your time.
The way I see it, you want to merge a dataframe with part of another dataframe, based on a common column.
First, make sure the common column has the same name in both frames:
B['Mobile_number'] = B['Cell_number']
Then create a dataframe that contains only the relevant columns (the key column and the relevant data column):
B1 = B[['Mobile_number', 'Location']]
Finally, merge them:
merged_df = pd.merge(A, B1, on='Mobile_number')
Note that this usage of pd.merge keeps only rows whose Mobile_number value exists in both dataframes.
You can look at the documentation of pd.merge to change how exactly the merge is done, what to include, etc.
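An equivalent sketch that skips copying the key column, using merge's left_on/right_on parameters instead (merged_df is just an illustrative name):

```python
import pandas as pd

A = pd.DataFrame({'Name': ['Jake', 'Paul', 'Logan', 'King'],
                  'Age': [33, 43, 22, 45],
                  'Mobile_number': [332, 554, 234, 832]})
B = pd.DataFrame({'Cell_number': [832, 554, 123, 333],
                  'Blood_group': ['O', 'A', 'AB', 'AB'],
                  'Location': ['TX', 'AZ', 'MO', 'MN']})

# Match Mobile_number against Cell_number, keeping only Location from B,
# then drop the now-redundant key column from the result
merged_df = (A.merge(B[['Cell_number', 'Location']],
                     left_on='Mobile_number', right_on='Cell_number')
               .drop(columns='Cell_number'))
```

Only Paul (554) and King (832) have matching numbers, so only those rows survive the inner merge.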

Subset of a Pandas Dataframe consisting of rows with specific column values

I'm having a problem with a single line of my code.
Here is what I'd like to achieve:
reading_now is a string of 3 characters.
df2 is a data frame that is a subset of df1.
I'd like df2 to consist of the rows from df1 where the first three characters of the value in column "Code" equal reading_now.
I tried using the following two lines with no success:
df2 = df1.loc[(df1['Code'])[0:3] == reading_now]
df2 = df1[(str(df1.Code)[0:3] == reading_now)]
Looks like you were really close with your 2nd attempt.
You could solve this a couple of different ways.
reading_now = 'AAA'
df1 = pd.DataFrame([{'Code': 'AAA'}, {'Code': 'BBB'}, {'Code': 'CCC'}])
solution:
df2 = df1[df1['Code'].str.startswith(reading_now)]
or
df2 = df1[df1['Code'].str[0:3] == reading_now]
The df2 dataframe will contain the rows that start with the reading_now string.
You could use
df2 = df1[df1['Code'].str[0:3] == reading_now]
For example:
data = ['abcd', 'cbdz', 'abcz', 'bdaz']
df1 = pd.DataFrame(data, columns=['Code'])
df2 = df1[df1['Code'].str[0:3] == 'abc']
df2 will be a dataframe whose 'Code' column contains 'abcd' and 'abcz'.
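Both answers select the same rows whenever the pattern is exactly three characters long (str.startswith would also match prefixes of other lengths, while the .str[0:3] slice compares exactly three characters). A quick check on the sample data:

```python
import pandas as pd

reading_now = 'abc'
df1 = pd.DataFrame({'Code': ['abcd', 'cbdz', 'abcz', 'bdaz']})

# Vectorized string prefix test vs. a fixed-length slice comparison
via_startswith = df1[df1['Code'].str.startswith(reading_now)]
via_slice = df1[df1['Code'].str[0:3] == reading_now]
```

Both frames hold the 'abcd' and 'abcz' rows.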
