I'm creating columns with aggregated values from a Pandas DataFrame using the groupby() and reset_index() functions, like this:
df=data.groupby(["subscription_id"])["count_boxes"].sum().reset_index(name="amount_boxes")
df1=data.groupby(["subscription_id"])["product"].count().reset_index(name="count_product")
I want to combine all these aggregated columns ("amount_boxes" and "count_product") into one dataframe, grouped by the column "subscription_id". Is there any way to do that within a single function call rather than merging the dataframes?
Let's look at using .agg with a dictionary mapping each column to an aggregation function.
(data.groupby('subscription_id')
 .agg({'count_boxes': 'sum', 'product': 'count'})
 .reset_index()
 .rename(columns={'count_boxes': 'amount_boxes', 'product': 'count_product'}))
Sample Output:
subscription_id amount_boxes count_product
0 1 16 2
1 2 39 6
2 3 47 7
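On pandas 0.25+, named aggregation gives the same result and folds the rename into the agg call. A sketch using the column names from the question:
(data.groupby('subscription_id')
     .agg(amount_boxes=('count_boxes', 'sum'),
          count_product=('product', 'count'))
     .reset_index())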
I am looking for a shortcut to reduce the manual grouping required:
I have a dataframe with many columns. When grouping the dataframe by 'Level', I want to aggregate two columns using nunique(), but all other columns (about 60 columns representing years from 2021 onward) using mean().
Does anyone have an idea how to define 'the rest' of the columns?
Thanks!
I would do it the following way:
import pandas as pd
df = pd.DataFrame({'X': [1, 1, 1, 2, 2, 2],
                   'A': [1, 2, 3, 4, 5, 6],
                   'B': [1, 2, 3, 4, 5, 6],
                   'C': [7, 8, 9, 10, 11, 12],
                   'D': [13, 14, 15, 16, 17, 18],
                   'E': [19, 20, 21, 22, 23, 24]})
aggdct = dict.fromkeys(df.columns, pd.Series.mean)  # every column defaults to mean
del aggdct['X']                                     # drop the groupby key itself
aggdct['A'] = pd.Series.nunique                     # override the selected column
print(df.groupby('X').agg(aggdct))
output
   A    B     C     D     E
X
1  3  2.0   8.0  14.0  20.0
2  3  5.0  11.0  17.0  23.0
Explanation: I prepare a dict describing how to aggregate, using dict.fromkeys, which gives a dict whose keys are the column names and whose values are the pd.Series.mean function. I then remove the column used in groupby and change the selected column to hold pd.Series.nunique rather than pd.Series.mean.
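Since the question mentions two columns that need nunique(), the same dict can be patched for both. A minimal sketch reusing the toy frame above, treating 'A' and 'B' as the two nunique columns:
aggdct = dict.fromkeys(df.columns, 'mean')  # every column defaults to mean
del aggdct['X']                             # drop the groupby key
for col in ['A', 'B']:                      # the two columns needing nunique
    aggdct[col] = 'nunique'
print(df.groupby('X').agg(aggdct))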
I have a list of, say, 50 dataframes, 'list1'; each dataframe has columns 'Speed' and 'Value', like this:
Speed Value
1 12
2 17
3 19
4 21
5 25
I am trying to get the standard deviation of 'Value' for each speed, across all dataframes. The end goal is to get a list or dataframe of the standard deviation for each speed, like this:
Speed Standard Deviation
1 1.23
2 2.5
3 1.98
4 5.6
5 5.77
I've tried pulling the values into a new dataframe with a for loop so I can use 'statistics.stdev' on them, but I can't seem to get it working. Any help is really appreciated!
Update!
pd.concat([d.set_index('Speed') for d in df_power], axis=1).std(axis=1)
This worked. Although I forgot to mention that the Speed values are not always the same between dataframes; some dataframes miss a few, and this ends up returning NaN in those instances.
You can concat and use std:
list_dfs = [df1, df2, df3, ...]
pd.concat([d.set_index('Speed') for d in list_dfs], axis=1).std(axis=1)
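Note that concat(axis=1) aligns on the Speed index, so speeds missing from some dataframes become NaN in those columns; std skips NaN by default, and a speed present in only one dataframe yields NaN (one value has no sample standard deviation). A small sketch with two hypothetical dataframes:
import pandas as pd

df1 = pd.DataFrame({'Speed': [1, 2, 3], 'Value': [12, 17, 19]})
df2 = pd.DataFrame({'Speed': [2, 3, 4], 'Value': [15, 20, 22]})

# align the frames side by side on the Speed index
wide = pd.concat([d.set_index('Speed') for d in [df1, df2]], axis=1)
print(wide.std(axis=1))  # Speed 1 and 4 appear once each, so their std is NaN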
You'll want to concatenate, group by Speed, and take the standard deviation; a full sketch follows the two steps below.
1) Concatenate your dataframes
list1 = [df_1, df_2, ...]
full_df = pd.concat(list1, axis=0) # stack all dataframes
2) Group by Speed and take the standard deviation
std_per_speed_df = full_df.groupby('Speed')[['Value']].std()
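A minimal end-to-end sketch, with two small hypothetical dataframes standing in for the list of 50, showing that the groupby approach copes with dataframes that miss some speeds (each group simply contains whatever values were observed):
import pandas as pd

df_1 = pd.DataFrame({'Speed': [1, 2, 3], 'Value': [12, 17, 19]})
df_2 = pd.DataFrame({'Speed': [2, 3, 4], 'Value': [15, 20, 22]})

full_df = pd.concat([df_1, df_2], axis=0)  # stack all rows
std_per_speed_df = full_df.groupby('Speed')[['Value']].std()
print(std_per_speed_df)  # a speed seen in only one dataframe still gives NaN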
If the dataframes are all saved in the same folder, you can use pd.concat + groupby as already suggested, or you can use dask:
import dask.dataframe as dd
import pandas as pd
df = dd.read_csv("data/*")
out = df.groupby("Speed")["Value"].std()\
.compute()\
.reset_index(name="Standard Deviation")
How does one get a SQL-style grouped output when grouping the following data:
item frequency
A 5
A 9
B 2
B 4
C 6
df.groupby(by = ["item"]).sum()
results in this:
item frequency
A 14
B 6
C 6
In pandas this is achieved by setting as_index=False, but dask doesn't support this argument in groupby. It currently moves the item column into the index and returns a series holding the frequency values.
Perhaps call .reset_index afterwards?
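A minimal sketch, assuming a small example frame; computing first and then calling pandas' reset_index on the result sidesteps the missing as_index argument:
import dask.dataframe as dd
import pandas as pd

pdf = pd.DataFrame({'item': ['A', 'A', 'B', 'B', 'C'],
                    'frequency': [5, 9, 2, 4, 6]})
ddf = dd.from_pandas(pdf, npartitions=2)

# groupby moves 'item' into the index; reset_index turns it back into a column
out = ddf.groupby('item')['frequency'].sum().compute().reset_index()
print(out)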
After creating a DataFrame with some duplicated cell values in the column named 'keys':
import pandas as pd
df = pd.DataFrame({'keys': [1,2,2,3,3,3,3],'values':[1,2,3,4,5,6,7]})
I go ahead and create two more DataFrames, which are consolidated versions of the original DataFrame df. These newly created DataFrames have no duplicated cell values under the 'keys' column:
df_sum = df.groupby('keys').sum().reset_index()
df_mean = df.groupby('keys').mean().reset_index()
As you can see, the df_sum['values'] cell values were all summed together, while the df_mean['values'] cell values were averaged with the mean() method.
Lastly I rename the 'values' column in both dataframes with:
df_sum.columns = ['keys', 'sums']
df_mean.columns = ['keys', 'means']
Now I would like to copy the df_mean['means'] column into the dataframe df_sum.
How to achieve this?
The DataFrame I would like to create merges both the 'sums' and 'means' columns into a single DataFrame.
There are several ways to do this. Using the DataFrame's merge method is the most direct:
df_both = df_sum.merge(df_mean, how='left', on='keys')
df_both
Out[1]:
keys sums means
0 1 1 1.0
1 2 5 2.5
2 3 22 5.5
I think pandas.merge() is the function you are looking for, e.g. pd.merge(df_sum, df_mean, on="keys"). Alternatively, the same result can be produced with a single agg call, as follows:
df.groupby('keys')['values'].agg(['sum', 'mean']).reset_index()
# keys sum mean
#0 1 1 1.0
#1 2 5 2.5
#2 3 22 5.5
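On pandas 0.25+, named aggregation can also produce the final column names directly, so no rename step is needed. A sketch of the same computation:
df.groupby('keys')['values'].agg(sums='sum', means='mean').reset_index()
#  keys  sums  means
#0    1     1    1.0
#1    2     5    2.5
#2    3    22    5.5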