I have a pandas groupby command which looks like this:
df.groupby(['year', 'month'], as_index=False).agg({'users':sum})
Is there a way I can name the agg output something other than 'users' during the groupby command? For example, what if I wanted the sum of users to be total_users? I could rename the column after the groupby is complete, but I wonder if there is another way.
I like @Alexander's answer, but there is also add_prefix:
df.groupby(['year', 'month']).agg({'users': sum}).add_prefix('total_')
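For example, on toy data (column names assumed from the question):

import pandas as pd

df = pd.DataFrame({'year': [2020, 2020, 2021],
                   'month': [1, 1, 2],
                   'users': [3, 4, 5]})
# add_prefix renames every aggregated column, so 'users' becomes 'total_users'
print(df.groupby(['year', 'month']).agg({'users': 'sum'}).add_prefix('total_'))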
Per the docs:
If a dict is passed, the keys will be used to name the columns. Otherwise the function's name (stored in the function object) will be used.

In [58]: grouped['D'].agg({'result1' : np.sum,
   ....:                   'result2' : np.mean})
In your case:
df.groupby(['year', 'month'], as_index=False).users.agg({'total_users': np.sum})
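Note that renaming through a dict passed to a SeriesGroupBy was deprecated in pandas 0.20 and later removed; on pandas >= 0.25 the supported spelling is named aggregation. A minimal sketch, assuming the column names from the question:

import pandas as pd

# keyword = output column name, tuple = (input column, aggregation function)
df.groupby(['year', 'month'], as_index=False).agg(total_users=('users', 'sum'))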
Related
I am running a Pearson correlation on my data set (from Excel), and this is the order the results come out in:
What I was wondering is whether it is possible to get n_hhld_trip as my first column, since it is my dependent variable.
Below is the code I have so far, but I'm not sure how to make it reflect the changes I want. I tried moving the variables in the pivot_table command, but that didn't do it:
zone_sum_mean_combo = pd.pivot_table(
read_excel,
index=['Zone'],
aggfunc={'Household ID': np.mean, 'dwtype': np.mean, 'n_hhld_trip': np.sum,
'expf': np.mean, 'n_emp_ft': np.sum, 'n_emp_home': np.sum,
'n_emp_pt': np.sum, 'n_lic': np.sum, 'n_pers': np.sum,
'n_student': np.sum, 'n_veh': np.sum}
)
index_reset = zone_sum_mean_combo.reset_index()
print(index_reset)
pearson_correlation = index_reset.corr(method='pearson')
print(pearson_correlation)
Sometimes it can be easier to hardcode the column order after everything is done:
df = df[["my_first_column", "my_second_column"]]
In your case, I think it's easier to just manipulate the column list:
columns = list(df.columns)
columns.remove("n_hhld_trip")
columns.insert(0, "n_hhld_trip")
df = df[columns]
Try set_index followed by reset_index:
df.set_index('n_hhld_trip', append=True).reset_index(level=-1)
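A quick illustration on a toy frame (reset_index inserts the popped level back at the front of the columns):

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'n_hhld_trip': [5, 6]})
out = df.set_index('n_hhld_trip', append=True).reset_index(level=-1)
print(out.columns.tolist())  # ['n_hhld_trip', 'a', 'b']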
I am trying to group by a variable in pandas, but it does not seem to work.
The variable is just a list of several column headers, and for analysis it is much easier to write the variable each time than to list the columns for each groupby.
Trying to turn this:
df_grouped = (df.groupby(['Column1','Column2','Column3','Column4'])
[compvars].sum()).reset_index()
Into this:
groupbyvars=['Column1','Column2','Column3','Column4']
df_grouped = (df.groupby([groupbyvars])
[compvars].sum()).reset_index()
Since groupbyvars is already a list, wrapping it in another pair of brackets produces a nested list, which pandas does not interpret as a list of column keys. We can replace:
df_grouped = (df.groupby([groupbyvars])
[compvars].sum()).reset_index()
with:
df_grouped = (df.groupby(groupbyvars)
[compvars].sum()).reset_index()
frame = frame2.groupby(
    ['name1', 'name2', 'date', 'agID', 'figi', 'exch', 'figi', 'marketSector',
     'name', 'fx_currency', 'id_type', 'id', 'currency']
).agg({
    'call_agreed_amount': 'sum',
    'pledge_current_market_value': 'sum',
    'pledge_quantity': 'sum',
    'pledge_adjusted_collateral_value': 'sum',
    'count': 'count',
})
print(frame.head())
for value in frame['call_currency']:
doStuff()
In the code above, all columns exist before the groupby statement. After the groupby statement is executed, the frame.head() returns all of the same columns. My code fails at my for loop with a KeyError trying to access frame['call_currency'], which 100% exists in frame.
After troubleshooting myself, I realized that pandas' groupby followed by agg returns a DataFrame with a hierarchical (MultiIndex) index: the grouped columns become index levels rather than regular columns. In order to fix this, I added .reset_index() to the end of my groupby statement.
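A minimal sketch of the fix, with the key and aggregation lists shortened for brevity:

frame = (frame2
         .groupby(['name1', 'name2', 'date'])  # shortened key list
         .agg({'call_agreed_amount': 'sum', 'count': 'count'})
         # turn the MultiIndex group keys back into ordinary columns
         .reset_index())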
Why could this be? My data seems pretty simple and straightforward: it's a 1-column dataframe of ints, but .describe() only returns count, unique, top, freq... not max, min, and the other expected outputs.
(Note .describe() functionality is as expected in other projects/datasets)
It seems pandas doesn't recognize your data as integers; count/unique/top/freq is what describe() reports for object-dtype columns. Try casting explicitly:
print(df.astype(int).describe())
Try:
df.agg(['count', 'nunique', 'min', 'max'])
You can add or remove the different aggregation functions to that list.
And when I have quite a few columns I personally like to transpose it:
df.agg(['count', 'nunique', 'min', 'max']).transpose()
To restrict the aggregations to a subset of columns, there are different ways to do it.
By column names containing a word, for example 'ID':
df.filter(like='ID').agg(['count', 'nunique'])
By type of data:
df.select_dtypes(include=['int']).agg(['count', 'nunique'])
df.select_dtypes(exclude=['float64']).agg(['count', 'nunique'])
Try converting your features to numeric values to get all the statistics you need:
df1['age'] = pd.to_numeric(df1['age'], errors='coerce')
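For example (hypothetical values; anything non-numeric becomes NaN):

import pandas as pd

df1 = pd.DataFrame({'age': ['21', '35', 'unknown']})
df1['age'] = pd.to_numeric(df1['age'], errors='coerce')  # 'unknown' -> NaN
print(df1.describe())  # now reports mean, std, min, max and quartiles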
I currently have a pandas Series with dtype Timestamp, and I want to group it by date (and have many rows with different times in each group).
The seemingly obvious way of doing this would be something similar to
grouped = s.groupby(lambda x: x.date())
However, pandas' groupby groups Series by its index. How can I make it group by value instead?
grouped = s.groupby(s)
Or:
grouped = s.groupby(lambda x: s[x])
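For the date use case in the question specifically, you can also pass a derived key Series, which avoids the per-element lambda — a sketch assuming s has datetime64 dtype:

import pandas as pd

s = pd.Series(pd.to_datetime(['2024-01-01 08:00',
                              '2024-01-01 17:30',
                              '2024-01-02 09:15']))
# the key Series is aligned on s's index, so this groups the values by calendar date
print(s.groupby(s.dt.date).size())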
Three methods:
DataFrame: df.groupby(['column']).size()
Series: sel.groupby(sel).size()
Series to DataFrame:
pd.DataFrame(sel, columns=['column']).groupby(['column']).size()
For anyone else who wants to do this inline without throwing a lambda in (which tends to kill performance):
s.to_frame(0).groupby(0)[0]
You should convert it to a DataFrame, then add a column that is the date(). You can do groupby on the DataFrame with the date column.
df = pandas.DataFrame(s, columns=["datetime"])
df["date"] = df["datetime"].apply(lambda x: x.date())
df.groupby("date")
Then "date" becomes your index. You have to do it this way because the final grouped object needs an index so you can do things like select a group.
To add another suggestion, I often use the following, as the logic is simple: the values of s become the index of a throwaway Series, and grouping on index level 0 then groups by those values:
pd.Series(index=s.values).groupby(level=0)