I have a dataframe with dimensions [1, 126], where each column corresponds to a specific economic variable, and these variables fall into one of 8 groups such as Output, Labor, Housing, etc. I have a separate dataframe where this group allocation is described.
Is it possible to aggregate the values of the initial dataframe into a new [1, 8] array according to the groups? I have no prior knowledge of the number of variables belonging to each group.
Here is the code for replication on a smaller scale:
import pandas as pd

data = {'RPI': [1], 'IP': [1], 'Labor1': [2], 'Labor2': [2], 'Housing1': [3], 'Housing2': [3]}
df = pd.DataFrame(data)

groups = {'Description': ['RPI', 'IP', 'Labor1', 'Labor2', 'Housing1', 'Housing2'],
          'Groups': ['Real', 'Real', 'Labor', 'Labor', 'Housing', 'Housing']}
groups = pd.DataFrame(groups)
The final version should look something like this:
aggregate = {'Real':[2],'Labor':[4],'Housing':[6]}
aggregate = pd.DataFrame(aggregate)
You can merge the group labels onto the descriptions, then groupby and sum.
(df.T
   .rename({0: 'value'}, axis=1)
   .merge(groups, left_index=True, right_on='Description')
   .groupby('Groups')['value'].sum())
returns
Groups
Housing 6
Labor 4
Real 2
Name: value, dtype: int64
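For reference, an equivalent shorter route (a sketch assuming, as in the example, that the column names of df exactly match the Description entries) is to turn groups into a mapping Series and group the single row of values by it:
# Map each variable name to its group, then group the row of values by that mapping
mapping = groups.set_index('Description')['Groups']
aggregate = df.iloc[0].groupby(mapping).sum()
# Housing    6
# Labor      4
# Real       2
# Use aggregate.to_frame().T if you need the [1, 8]-shaped layout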
I am cleaning a database of movies. It was formed by merging 4 CSVs (4 streaming services' movies) into one. There are some movies that exist on two or more streaming services, like both Prime and Hulu.
I was able to merge the rest of the columns with:
movies.groupby(compareColumns, group_keys=False)[allColumns].apply(lambda x: x.ffill().bfill())
But now I'm left with rows that are practically identical except for their onPrime/onHulu values (0 = not available on the service, 1 = available on the service).
For example, two rows I have are:
name     onPrime  onHulu  otherColumn
Movie 1  1        0       X
Movie 1  0        1       X
How do I systematically merge the two rows to get the desired output below? (I have other columns that I don't want to be impacted.)
Desired output:
name     onPrime  onHulu  otherColumn
Movie 1  1        1       X
Not sure how I could do it through sum, bfill, ffill, or any built-in function.
I tried filledgroups.fillna(value=0, axis=0, inplace=True, limit=1), where filledgroups is just a dataframe of two of the rows for trial, but it filled in 0s for other columns, whereas I only want to replace the 0s of onPrime/onHulu with 1s.
Grouping by name should do the trick:
df_grouped = df.groupby('name').max().reset_index()
With that approach, you group by name and aggregate using the max() function for all the columns.
If you wanted to apply different aggregations to other columns, you could use agg():
df_grouped = df.groupby('name').agg({'onPrime': 'max', 'onHulu': 'max', 'otherColumn': 'first'}).reset_index()
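To sanity-check against the two example rows (a minimal sketch that rebuilds them by hand):
import pandas as pd

df = pd.DataFrame({'name': ['Movie 1', 'Movie 1'],
                   'onPrime': [1, 0],
                   'onHulu': [0, 1],
                   'otherColumn': ['X', 'X']})

df_grouped = df.groupby('name').agg({'onPrime': 'max',
                                     'onHulu': 'max',
                                     'otherColumn': 'first'}).reset_index()
# one row: Movie 1, onPrime=1, onHulu=1, otherColumn='X'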
When using pandas groupby functions and manipulating the output after the groupby, I've noticed that some functions behave differently in terms of what is returned as the index and how this can be manipulated.
Say we have a dataframe with the following information:
Name Type ID
0 Book1 ebook 1
1 Book2 paper 2
2 Book3 paper 3
3 Book1 ebook 1
4 Book2 paper 2
if we do
df.groupby(["Name", "Type"]).sum()
we get a DataFrame:
ID
Name Type
Book1 ebook 2
Book2 paper 4
Book3 paper 3
which contains a MultiIndex with the columns used in the groupby:
MultiIndex([('Book1', 'ebook'),
            ('Book2', 'paper'),
            ('Book3', 'paper')],
           names=['Name', 'Type'])
and one column called ID.
but if I apply a size() function, the result is a Series:
Name Type
Book1 ebook 2
Book2 paper 2
Book3 paper 1
dtype: int64
And finally, if I do a pct_change(), we get only the resulting column as a DataFrame, indexed like the original:
ID
0 NaN
1 NaN
2 NaN
3 0.0
4 0.0
TL;DR: I want to know why some functions return a Series while others return a DataFrame, as this confused me when dealing with different operations within the same DataFrame.
From the documentation for size:

Returns: Series
    Number of rows in each group.

For sum, since you did not select a column to sum, it returns a DataFrame of the remaining (non-key) columns, with the groupby keys as the index. If you select the column first, you get a Series instead:
df.groupby(["Name", "Type"])['ID'].sum()  # returns a Series
Functions like diff and pct_change are not aggregations: they return values with the same index as the original dataframe. count, mean and sum are aggregations, so they return values indexed by the groupby keys.
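For example, selecting the column first and then calling pct_change gives a Series aligned with the original index, matching the DataFrame output shown in the question:
df.groupby(["Name", "Type"])['ID'].pct_change()
# 0    NaN
# 1    NaN
# 2    NaN
# 3    0.0
# 4    0.0
# Name: ID, dtype: float64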
The outputs are different because the aggregations are different, and those are what mostly control what is returned. Think of the array equivalent: the data are the same, but one "aggregation" returns a single scalar value while the other returns an array the same size as the input.
import numpy as np
np.array([1,2,3]).sum()
#6
np.array([1,2,3]).cumsum()
#array([1, 3, 6], dtype=int32)
The same thing goes for aggregations of a DataFrameGroupBy object. All the first part of the groupby does is create a mapping from the DataFrame to the groups. Since this doesn't really do anything there's no reason why the same groupby with a different operation needs to return the same type of output (see above).
gp = df.groupby(["Name", "Type"])
# Haven't done any aggregations yet...
The other important part here is that we have a DataFrameGroupBy object. There are also SeriesGroupBy objects, and that difference can change the return.
gp
#<pandas.core.groupby.generic.DataFrameGroupBy object>
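Selecting a single column from gp gives the SeriesGroupBy counterpart:
gp['ID']
#<pandas.core.groupby.generic.SeriesGroupBy object>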
So what happens when you aggregate?
With a DataFrameGroupBy, when you choose an aggregation (like sum) that collapses to a single value per group, the return will be a DataFrame whose index is the unique grouping keys. The return is a DataFrame because we provided a DataFrameGroupBy object. DataFrames can have multiple columns, and had there been another numeric column it would have been aggregated too, necessitating the DataFrame output.
gp.sum()
# ID
#Name Type
#Book1 ebook 2
#Book2 paper 4
#Book3 paper 3
On the other hand if you use a SeriesGroupBy object (select a single column with []) then you'll get a Series back, again with the index of unique group keys.
df.groupby(["Name", "Type"])['ID'].sum()
|------- SeriesGroupBy ----------|
#Name Type
#Book1 ebook 2
#Book2 paper 4
#Book3 paper 3
#Name: ID, dtype: int64
For operations that return arrays (like cumsum or pct_change), a DataFrameGroupBy will return a DataFrame and a SeriesGroupBy will return a Series. But the index is no longer the unique group keys: that would make little sense, because typically you want to do a calculation within the group and then assign the result back to the original DataFrame. As a result, the return is indexed like the original DataFrame you provided for aggregation. This makes creating these columns very simple, as pandas handles all of the alignment:
df['ID_pct_change'] = gp.pct_change()
# Name Type ID ID_pct_change
#0 Book1 ebook 1 NaN
#1 Book2 paper 2 NaN
#2 Book3 paper 3 NaN
#3 Book1 ebook 1 0.0 # Calculated from row 0 and aligned.
#4 Book2 paper 2 0.0
But what about size? That one is a bit weird. The size of a group is a scalar. It doesn't matter how many columns the group has or whether values in those columns are missing, so sending it a DataFrameGroupBy or SeriesGroupBy object is irrelevant. As a result pandas will always return a Series. Again being a group level aggregation that returns a scalar it makes sense to have the return indexed by the unique group keys.
gp.size()
#Name Type
#Book1 ebook 2
#Book2 paper 2
#Book3 paper 1
#dtype: int64
Finally, for completeness: though aggregations like sum return a single scalar value per group, it can often be useful to bring those values back to every row of that group in the original DataFrame. However, the return of a normal .sum has a different index, so it won't align. You could merge the values back on the unique keys, but pandas provides transform for exactly this. Since the intent is to bring the result back to the original DataFrame, the Series/DataFrame returned by transform is indexed like the original input:
gp.transform('sum')
# ID
#0 2 # Row 0 is Book1 ebook which has a group sum of 2
#1 4
#2 3
#3 2 # Row 3 is also Book1 ebook which has a group sum of 2
#4 4
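So, as a sketch of the typical use (ID_total is just an illustrative column name), the transformed sum can be assigned straight back to the original frame:
df['ID_total'] = gp['ID'].transform('sum')
#    Name   Type  ID  ID_total
#0  Book1  ebook   1         2
#1  Book2  paper   2         4
#2  Book3  paper   3         3
#3  Book1  ebook   1         2
#4  Book2  paper   2         4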
I have a pandas dataframe. The final column in the dataframe is the max value of the RelAb column for each unique group (in this case, a species assignment) in the dataframe as obtained by:
df_melted['Max'] = df_melted.groupby('Species')['RelAb'].transform('max')
As you can see, the max value is repeated in all rows of the group. Each group contains a large number of rows. I have the df sorted by max values, and there are about 100 rows per max value. My goal is to obtain the top 20 groups based on the max value (i.e. a df with 100 × 20 = 2000 rows). I do not want to drop individual rows from groups in the dataframe, but rather entire groups.
I am pasting a subset of the dataframe where the max for a group changes from one "Max" value to the next:
My feeling is that I need to convert the max so that the one value represents the entire group and then sort based on that column, perhaps as such?
For context, the reason I am doing this is because I am planning to make a stacked barchart with the most abundant species in the table for each sample. Right now, there are just way too many species, so it makes the stacked bar chart uninformative.
One way to do it:
aux = (df_melted.groupby('Species')['RelAb']
.max()
.nlargest(20, keep='all')
.to_list())
top20 = df_melted.loc[df_melted['Max'].isin(aux), :].copy()
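A close variant avoids a corner case: if two species happened to share the same maximum value, filtering on the Max column would pull in both, whereas filtering on the species names returned by nlargest does not (a sketch using the same column names):
top_species = (df_melted.groupby('Species')['RelAb']
               .max()
               .nlargest(20, keep='all')
               .index)
top20 = df_melted[df_melted['Species'].isin(top_species)].copy()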
I have a large dataset which has among others a binary variable:
Transactions['has_acc_id_and_cus_id'].value_counts()
1 1295130
0 823869
Name: has_acc_id_and_cus_id, dtype: int64
When I group this dataset (Transactions) using this particular binary variable as one of the grouping variables, the resulting grouped dataset (df100) has only one level of the aforementioned binary variable.
df100 = Transactions.groupby(['acc_reg_year', 'acc_reg_month', 'year', 'month',\
'has_acc_id_and_cus_id'])[['net_revenue']].agg(['sum', 'mean', 'count'])
df100['has_acc_id_and_cus_id'].value_counts()
1 1421
Name: has_acc_id_and_cus_id, dtype: int64
If you really want to just groupby on has_acc_id_and_cus_id then the command you want will be...
df100 = Transactions[['has_acc_id_and_cus_id', 'net_revenue']].groupby(['has_acc_id_and_cus_id']).agg(['sum', 'mean', 'count'])
This subsets just the variable you want to summarise by (has_acc_id_and_cus_id) and the variable you wish to summarise (net_revenue)...
Transactions[['has_acc_id_and_cus_id', 'net_revenue']]
...you then group these by has_acc_id_and_cus_id...
Transactions[['has_acc_id_and_cus_id', 'net_revenue']].groupby('has_acc_id_and_cus_id')
...before you then apply the agg() function to get the desired statistics.
The mistake you made, based on your stated aim of summarising by has_acc_id_and_cus_id alone, was having four other variables you were grouping by (acc_reg_year, acc_reg_month, year and month).
If you do actually want the summary by has_acc_id_and_cus_id within all the others then your original code was correct, but perhaps there are missing values in one or more of acc_reg_year, acc_reg_month, year and month when has_acc_id_and_cus_id == 0, so check your data...
Transactions[Transactions['has_acc_id_and_cus_id'] == 0][['acc_reg_year', 'acc_reg_month', 'year', 'month']].head(100)
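If missing keys do turn out to be the cause, note that groupby silently drops rows whose grouping values are NaN; on pandas 1.1 or later you can keep them by adding dropna=False (a sketch of your original call):
df100 = Transactions.groupby(
    ['acc_reg_year', 'acc_reg_month', 'year', 'month', 'has_acc_id_and_cus_id'],
    dropna=False
)[['net_revenue']].agg(['sum', 'mean', 'count'])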
I have a pandas DataFrame with columns patient_id, patient_sex, patient_dob (and other less relevant columns). Rows can have duplicate patient_ids, as each patient may have more than one entry in the data for multiple medical procedures. I discovered, however, that a great many of the patient_ids are overloaded, i.e. more than one patient has been assigned to the same id (evidenced by many instances of a single patient_id being associated with multiple sexes and multiple days of birth).
To refactor the ids so that each patient has a unique one, my plan was to group the data not only by patient_id, but by patient_sex and patient_dob as well. I figure this must be sufficient to separate the data into individual patients (and if two patients with the same sex and dob just happened to be assigned the same id, then so be it).
Here is the code I currently use:
# I just use first() here as a way to aggregate the groups into a DataFrame.
# Bonus points if you have a better solution!
indv_patients = patients.groupby(['patient_id', 'patient_sex', 'patient_dob']).first()
# Create unique ids
new_patient_id = 'new_patient_id'
for index, row in indv_patients.iterrows():
    # index is a tuple of the three column values, so this should get me a unique
    # patient id for each patient
    indv_patients.loc[index, new_patient_id] = str(hash(index))
# Merge new ids into original patients frame
patients_with_new_ids = patients.merge(indv_patients, left_on=['patient_id', 'patient_sex', 'patient_dob'], right_index=True)
# Remove byproduct columns, and original id column
drop_columns = [col for col in patients_with_new_ids.columns if col not in patients.columns and col != new_patient_id]
drop_columns.append('patient_id')
patients_with_new_ids = patients_with_new_ids.drop(columns=drop_columns)
patients = patients_with_new_ids.rename(columns={new_patient_id : 'patient_id'})
The problem is that with over 7 million patients this is way too slow a solution, the biggest bottleneck being the for-loop. So my question is: is there a better way to fix these overloaded ids? (The actual id doesn't matter, so long as it's unique for each patient.)
I don't know what the values for the columns are but have you tried something like this?
patients['new_patient_id'] = patients.apply(lambda x: str(x['patient_id']) + str(x['patient_sex']) + str(x['patient_dob']), axis=1)
This should create a new column, and you can then use groupby with new_patient_id.
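A vectorised alternative that avoids the row-wise apply (and any assumptions about the column dtypes) is groupby().ngroup(), which labels each group with an integer; that label can serve directly as the new id:
# Every distinct (patient_id, patient_sex, patient_dob) combination gets its own integer
patients['new_patient_id'] = patients.groupby(
    ['patient_id', 'patient_sex', 'patient_dob']
).ngroup()
This stays fast even for millions of rows, since no Python-level loop is involved.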