Grouping DataFrame combining column values - python

d = {'Location':['Fridge','Fridge','Pantry','Pantry'], 'Food': ['Juice','Ham','Rice','Oil']}
df = pd.DataFrame(d)
I want to create a new DataFrame that groups all the Locations but combines their respective foods. So under 'Fridge' I would see 'Juice, Ham'. Groupby seems the logical function, but I can't get the foods combined.

Looks like a job for .groupby() & agg()!
import pandas as pd
df = pd.DataFrame(
{
"Location":["Fridge","Fridge","Pantry","Pantry"],
"Food": ["Juice","Ham","Rice","Oil"]
}
)
# Group by Location and join each group's Food values into one string
grouped = df.groupby("Location").agg({"Food": lambda x: ", ".join(x)})
# grouped = grouped.reset_index()
print(grouped)
This solution leaves Location as the index. Uncomment the reset_index() line if you want Location converted back into a regular column.
Hope this helps!

df_new = df.groupby(['Location'])['Food'].apply(', '.join).reset_index()
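For reference, here is a minimal self-contained sketch of the group-and-join approach, using the data from the question. The variable name combined and the sort=False argument are my own choices, not something the question asked for; sort=False just keeps the groups in the order they first appear.

```python
import pandas as pd

df = pd.DataFrame({
    "Location": ["Fridge", "Fridge", "Pantry", "Pantry"],
    "Food": ["Juice", "Ham", "Rice", "Oil"],
})

# One row per Location, with that Location's foods joined into one string.
combined = (
    df.groupby("Location", sort=False)["Food"]
      .agg(", ".join)
      .reset_index()
)
print(combined)
```

The result has one row for Fridge and one for Pantry, with Food holding the comma-separated string for each.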

Related

Group by all the columns except the first one, but aggregate the first column as a list

Let's say, I have this dataframe:
df = pd.DataFrame({'col_1': ['yes','no'], 'test_1':['a','b'], 'test_2':['a','b']})
What I want, is to group by all the columns except the first one and aggregate the results where the group by is the same.
This is what I'm trying:
col_names = df.columns.to_list()
df_out = df.groupby([col_names[1:]])[col_names[0]].agg(list)
This is my end data frame goal:
df = pd.DataFrame({'col_1': [['yes','no']], 'test_1':['a'], 'test_2':['b']})
If I have more rows, I want the same principle to apply: collect into a list the col_1 values of all rows whose remaining columns ([1:], from the second column to the last) are identical.
Use the pandas agg() method:
df = df.groupby(df.columns.difference(["col_1"]).tolist()).agg(
lambda x: x.tolist()).reset_index()
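A small sketch of the same idea on made-up data may help. The three rows below are hypothetical (the first two share the same test_1/test_2 values so that there is something to group), and key_cols, first_col and out are names I introduced for illustration.

```python
import pandas as pd

# Hypothetical data: the first two rows have identical test_1/test_2 values.
df = pd.DataFrame({
    "col_1": ["yes", "no", "maybe"],
    "test_1": ["a", "a", "b"],
    "test_2": ["b", "b", "c"],
})

key_cols = df.columns[1:].tolist()  # every column except the first
first_col = df.columns[0]

# Collect the first column's values into a list for each group of key columns.
out = df.groupby(key_cols, sort=False)[first_col].agg(list).reset_index()
out = out[df.columns.tolist()]  # restore the original column order
print(out)
```

Selecting the key columns by position (df.columns[1:]) keeps them in their original order, whereas columns.difference() returns them sorted by name.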

How to rank (in percent) each column in a dataframe in place?

The df is as shown below...
The below code can only rank one column in place. I would like to rank all columns and post the rank values in a separate df
df['rank_2020-06-23'] = df['2020-06-23'].rank(pct=True)
print(df)
Something like that should work:
df_ranks=pd.concat([pd.DataFrame(df[col].rank(pct=True)) for col in df.columns], axis=1)
It's simply using your function in a list comprehension, storing the results in dataframes to get a list of dataframes:
list_df_ranks=[pd.DataFrame(df[col].rank(pct=True)) for col in df.columns]
Then merging into one:
df_ranks=pd.concat(list_df_ranks, axis=1)
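As an aside, DataFrame.rank() can also be called on the whole frame, which ranks each column independently without the loop. The sketch below uses made-up numbers in place of the question's data, and the rank_ prefix is only there to mirror the column naming used in the question.

```python
import pandas as pd

# Hypothetical data standing in for the question's date columns.
df = pd.DataFrame({
    "2020-06-22": [10.0, 30.0, 20.0, 40.0],
    "2020-06-23": [4.0, 1.0, 3.0, 2.0],
})

df_ranks = df.rank(pct=True)              # percentile rank within each column
df_ranks = df_ranks.add_prefix("rank_")   # optional: rank_2020-06-22, ...
print(df_ranks)
```

The original df is left unchanged, so the ranks live in a separate DataFrame as requested.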

How to merge columns interspersing the data?

I'm new to Python and pandas and am trying to create a pandas MultiIndex from two independent variables, flow and head, for 27 different design points. The data is currently organized in a single DataFrame with a column for each variable and a row for each design point.
Here's how I created the MultiIndex:
flow = df.loc[0, ["Mass_Flow_Rate", "Mass_Flow_Rate.1",
"Mass_Flow_Rate.2"]]
dp = df.loc[:,"Design Point"]
index = pd.MultiIndex.from_product([dp, flow], names=
['DP','Flows'])
I then created three columns of data:
df0 = df.loc[:,"Head2D"]
df1 = df.loc[:,"Head2D.1"]
df2 = df.loc[:,"Head2D.2"]
And want to merge these into a single column of data such that I can use this command:
pc = pd.DataFrame(data, index=index)
Using the three columns with the same indexes for the rows (0-27), I want to merge the columns into a single column such that the data is interspersed. If I call the columns col1, col2 and col3 and I denote the index in parentheses such that col1(0) indicates column1 index 0, I want the data to look like:
col1(0)
col2(0)
col3(0)
col1(1)
col2(1)
col3(1)
col1(2)...
The question is a bit confusing, but as I understand it, you are trying to do this:
flow = df.loc[0, ["Mass_Flow_Rate", "Mass_Flow_Rate.1",
"Mass_Flow_Rate.2"]]
dp = df.loc[:,"Design Point"]
index = pd.MultiIndex.from_product([dp, flow], names=
['DP','Flows'])
df0 = df.loc[:,"Head2D"]
df1 = df.loc[:,"Head2D.1"]
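df2 = df.loc[:,"Head2D.2"]  # assuming the third head column is named Head2D.2
# Put the columns side by side, then flatten row by row to interleave the values
data = pd.concat([df0, df1, df2], axis=1).to_numpy().ravel()
pc = pd.DataFrame(data=data, index=index)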
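To show the interleaving on its own, here is a small sketch with made-up data. Everything in the sample frame is hypothetical: three design points instead of 27, invented flow and head values, a third head column assumed to be named Head2D.2, and an output column I called Head. The key step is flattening the three head columns row by row, which yields col1(0), col2(0), col3(0), col1(1), and so on, matching the order that MultiIndex.from_product produces.

```python
import pandas as pd

# Hypothetical stand-in for the question's data.
df = pd.DataFrame({
    "Design Point": ["DP0", "DP1", "DP2"],
    "Mass_Flow_Rate": [0.5, 0.5, 0.5],
    "Mass_Flow_Rate.1": [1.0, 1.0, 1.0],
    "Mass_Flow_Rate.2": [1.5, 1.5, 1.5],
    "Head2D": [10, 11, 12],
    "Head2D.1": [20, 21, 22],
    "Head2D.2": [30, 31, 32],
})

flow_cols = ["Mass_Flow_Rate", "Mass_Flow_Rate.1", "Mass_Flow_Rate.2"]
head_cols = ["Head2D", "Head2D.1", "Head2D.2"]

flow = df.loc[0, flow_cols].tolist()   # the three flow values from row 0
dp = df["Design Point"].tolist()
index = pd.MultiIndex.from_product([dp, flow], names=["DP", "Flows"])

# Row-major flattening: all heads of row 0, then all heads of row 1, ...
data = df[head_cols].to_numpy().ravel()

pc = pd.DataFrame({"Head": data}, index=index)
print(pc)
```

Converting to a NumPy array before building pc matters: passing a Series that carries its own index together with index= would make pandas try to align the two indexes instead of using the values in order.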

add selected columns from two pandas dfs

I have two pandas dataframes, a_df and b_df. Both have columns ID, atext, and var1-var25.
I want to add ONLY the corresponding vars from a_df and b_df and leave ID and atext alone.
The code below adds ALL the corresponding columns. Is there a way to get it to add just the columns of interest?
absum_df=a_df.add(b_df)
What could I do to achieve this?
Use filter:
absum_df = a_df.filter(like='var').add(b_df.filter(like='var'))
If you want to keep additional columns as-is, use concat after summing:
absum_df = pd.concat([a_df[['ID', 'atext']], absum_df], axis=1)
Alternatively, instead of subselecting columns from a_df, you could drop the summed columns from a_df, which keeps every a_df column that is not in absum_df:
absum_df = pd.concat([a_df.drop(absum_df.columns, axis=1), absum_df], axis=1)
You can subset a dataframe to particular columns:
var_columns = ['var{}'.format(i) for i in range(1,26)]
absum_df=a_df[var_columns].add(b_df[var_columns])
Note that this will result in a dataframe with only the var columns. If you want a dataframe with the non-var columns from a_df, and the var columns being the sum of a_df and b_df, you can do
absum_df = a_df.copy()
absum_df[var_columns] = a_df[var_columns].add(b_df[var_columns])
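One thing to watch for: .add() aligns on the row index, not on ID. If the rows of the two frames might be in a different order, a sketch like the following matches them on ID first. This assumes ID is unique in each frame and that both frames contain the same IDs, which the question does not state; the sample frames and their values are made up, with three var columns instead of 25.

```python
import pandas as pd

# Hypothetical frames: b_df's rows are deliberately in a different order.
a_df = pd.DataFrame({"ID": [1, 2], "atext": ["x", "y"],
                     "var1": [1, 2], "var2": [3, 4], "var3": [5, 6]})
b_df = pd.DataFrame({"ID": [2, 1], "atext": ["y", "x"],
                     "var1": [20, 10], "var2": [40, 30], "var3": [60, 50]})

var_columns = [c for c in a_df.columns if c.startswith("var")]

# Index both frames by ID so that .add() aligns matching IDs, not positions.
a = a_df.set_index("ID")
b = b_df.set_index("ID")
summed = a[var_columns].add(b[var_columns])

# Keep atext from a_df and attach the summed var columns.
absum_df = a[["atext"]].join(summed).reset_index()
print(absum_df)
```

If the two frames are already guaranteed to be in the same row order, the shorter versions above are enough.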

Drop columns that aren't common between two dataframes?

I have two dataframes that have many columns in common but a few that do not exist in both. I would like to create a dataframe that only has the columns that are in common between both dataframes. So for example:
list(df1)
['Survived', 'Age', 'Title_Mr', 'Title_Mrs', 'Title_Captain']
list(df2)
['Survived', 'Age', 'Title_Mr', 'Title_Mrs', 'Title_Countess']
And I would like to go to:
['Survived', 'Age', 'Title_Mr', 'Title_Mrs']
Since Title_Mr and Title_Mrs are in both df1 and df2. I've figured out how to do it by manually entering the column names like so:
df1 = df1.drop(['Title_Captain'], axis=1)
But I'd like to find a more robust solution where I don't have to manually enter the column names. Suggestions?
Using the comments of @linuxfan and @PadraicCunningham we can get a list of common columns:
common_cols = list(set(df1.columns).intersection(df2.columns))
Edit: @AdamHughes' answer made me consider preserving the column order. A set has no defined order, so if that is important you could iterate over df1's columns instead:
common_cols = [col for col in df1.columns if col in df2.columns]
To get another DataFrame with just those columns you use that list to select only those columns from df1:
df3 = df1[common_cols]
According to http://pandas.pydata.org/pandas-docs/stable/indexing.html:
You can pass a list of columns to [] to select columns in that order.
If a column is not contained in the DataFrame, an exception will be
raised.
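Putting those pieces together, a minimal runnable sketch using the column names from the question might look like this. The names df1_common and df2_common are mine, and the frames are empty because only the column labels matter here.

```python
import pandas as pd

df1 = pd.DataFrame(columns=["Survived", "Age", "Title_Mr", "Title_Mrs", "Title_Captain"])
df2 = pd.DataFrame(columns=["Survived", "Age", "Title_Mr", "Title_Mrs", "Title_Countess"])

# Walk df1's columns in order and keep those that also exist in df2.
common_cols = [col for col in df1.columns if col in df2.columns]

# Select the shared columns from each frame, in df1's column order.
df1_common = df1[common_cols]
df2_common = df2[common_cols]
print(common_cols)
```

Both resulting frames end up with the same columns in the same order, which is convenient if they are going to be concatenated or fed to the same model.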
df1 = df1.drop([col for col in df1.columns if col not in df2.columns], axis=1)
You don't necessarily need to drop the columns, just select the columns of interest:
In [204]:
df1 = pd.DataFrame(columns=['Survived', 'Age', 'Title_Mr', 'Title_Mrs', 'Title_Captain'])
df2 = pd.DataFrame(columns=['Survived', 'Age', 'Title_Mr', 'Title_Mrs', 'Title_Countess'])
# create a list of the common columns using set and intersection
common_cols=list(set.intersection(set(df1), set(df2)))
# use this list to perform column selection
df1[common_cols]
['Title_Mr', 'Age', 'Survived', 'Title_Mrs']
Out[204]:
Empty DataFrame
Columns: [Title_Mr, Age, Survived, Title_Mrs]
Index: []
