Puzzled over the behavior of Pandas in groupby - python

I have a large dataset which has among others a binary variable:
Transactions['has_acc_id_and_cus_id'].value_counts()
1 1295130
0 823869
Name: has_acc_id_and_cus_id, dtype: int64
When I groupby this dataset --Transactions-- using this particular binary variable as one grouping variable I get a grouped dataset --df100-- that has only one level of the aforementioned binary variable.
df100 = Transactions.groupby(['acc_reg_year', 'acc_reg_month', 'year', 'month',\
'has_acc_id_and_cus_id'])[['net_revenue']].agg(['sum', 'mean', 'count'])
df100['has_acc_id_and_cus_id'].value_counts()
1 1421
Name: has_acc_id_and_cus_id, dtype: int64

If you really want to just groupby on has_acc_id_and_cus_id then the command you want will be...
df100 = Transactions[['has_acc_id_and_cus_id', 'net_revenue']].groupby(['has_acc_id_and_cus_id']).agg(['sum', 'mean', 'count'])
This subsets just the variable you want to summarise by (has_acc_id_and_cus_id) and the variable you wish to summarise (net_revenue)...
Transactions[['has_acc_id_and_cus_id', 'net_revenue']]
...you then group these by has_acc_id_and_cus_id...
Transactions[['has_acc_id_and_cus_id', 'net_revenue']].groupby('has_acc_id_and_cus_id')
...before you then apply the agg() function to get the desired statistics.
The mistake you made, based on your stated aim of summarising by has_acc_id_and_cus_id alone, was having four other variables you were grouping by (acc_reg_year, acc_reg_month, year and month).
If you do actually want the summary by has_acc_id_and_cus_id within all the others then your original code was correct, but perhaps there are missing values in one or more of acc_reg_year, acc_reg_month, year and month when has_acc_id_and_cus_id == 0, so check your data...
Transactions[Transactions['has_acc_id_and_cus_id'] == 0][['acc_reg_year', 'acc_reg_month', 'year', 'month']].head(100)
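One likely cause, sketched on a toy frame (column names reuse the question's, the data is made up): by default, groupby silently drops any row whose grouping keys contain NaN, so if acc_reg_year (or any other key) is missing exactly when has_acc_id_and_cus_id == 0, that whole level disappears from the result. Passing dropna=False (available since pandas 1.1) keeps those groups.

```python
import numpy as np
import pandas as pd

# Toy data: the grouping key 'acc_reg_year' is NaN exactly when the flag is 0
t = pd.DataFrame({
    'acc_reg_year': [2020.0, 2020.0, np.nan, np.nan],
    'has_acc_id_and_cus_id': [1, 1, 0, 0],
    'net_revenue': [10.0, 20.0, 5.0, 5.0],
})

# Default behaviour: rows with NaN keys are dropped, so the 0-level vanishes
dropped = t.groupby(['acc_reg_year', 'has_acc_id_and_cus_id'])['net_revenue'].sum()

# dropna=False keeps the NaN-keyed groups, and the 0-level survives
kept = t.groupby(['acc_reg_year', 'has_acc_id_and_cus_id'], dropna=False)['net_revenue'].sum()

print(sorted(dropped.index.get_level_values('has_acc_id_and_cus_id').unique()))  # [1]
print(sorted(kept.index.get_level_values('has_acc_id_and_cus_id').unique()))     # [0, 1]
```

If the second print shows both levels on your real data, missing grouping keys are the explanation.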

Related

pandas computing new column as a average of other two conditions

So I have this dataset of temperatures. Each line describe the temperature in celsius measured by hour in a day.
So, I need to compute a new variable called avg_temp_ar_mensal which represents the average temperature of a city in a month. City in this dataset is represented as estacao and month as mes.
I'm trying to do this using pandas. The following line of code is the one I'm trying to use to solve this problem:
df2['avg_temp_ar_mensal'] = df2['temp_ar'].groupby(df2['mes', 'estacao']).mean()
The goal of this code is to store in a new column the average of the temperature of the city and month. But it doesn't work. If I try the following line of code:
df2['avg_temp_ar_mensal'] = df2['temp_ar'].groupby(df2['mes']).mean()
It will works, but it is wrong. It will calculate for every city of the dataset and I don't want it because it will cause noise in my data. I need to separate each temperature based on month and city and then calculate the mean.
The dataframe after groupby is smaller than the initial dataframe; that is why your code runs into an error.
There are two ways to solve this problem. The first one is using transform:
df['avg_temp_ar_mensal'] = df.groupby(['mes', 'estacao'])['temp_ar'].transform('mean')
The second is to create a new dfn from groupby, then merge it back to df:
dfn = df.groupby(['mes', 'estacao'])['temp_ar'].mean().reset_index(name='average')
df = pd.merge(df, dfn, on=['mes', 'estacao'], how='left')
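Both approaches can be sketched end-to-end on made-up hourly readings (four rows, two cities, one month):

```python
import pandas as pd

# Hypothetical hourly temperature readings: city ('estacao'), month ('mes')
df = pd.DataFrame({
    'estacao': ['A', 'A', 'B', 'B'],
    'mes':     [1,   1,   1,   1],
    'temp_ar': [10.0, 20.0, 30.0, 40.0],
})

# Approach 1: transform keeps the original length, so direct assignment aligns
df['avg_temp_ar_mensal'] = df.groupby(['mes', 'estacao'])['temp_ar'].transform('mean')

# Approach 2: aggregate to one row per group, then merge the mean back onto every row
dfn = df.groupby(['mes', 'estacao'])['temp_ar'].mean().reset_index(name='average')
df = pd.merge(df, dfn, on=['mes', 'estacao'], how='left')

print(df['avg_temp_ar_mensal'].tolist())  # [15.0, 15.0, 35.0, 35.0]
print(df['average'].tolist())             # [15.0, 15.0, 35.0, 35.0]
```

The key difference: transform returns a series with the same index as the original frame, while the aggregate-then-merge route produces one row per group and relies on the merge keys to broadcast it back.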
You are calling a groupby on a single column when you are doing df2['temp_ar'].groupby(...). This doesn't make much sense since in a single column, there's nothing to group by.
Instead, you have to perform the groupby on all the columns you need. Also, make sure that the final output is a series and not a dataframe
df['new_column'] = df.groupby(['city_column', 'month_column'])['temp_column'].transform('mean')
This should do the trick if I understand your dataset correctly. If not, please provide a reproducible version of your df

Pycharm problem set (Stuck from step 3 onwards)

Using the ff_monthly.csv data set https://github.com/alexpetralia/fama_french,
use the first column as an index (this contains the year and month of the data as a string).
Create a new column ‘Mkt’ as ‘Mkt-RF’ + ‘RF’
Create two new columns in the loaded DataFrame, ‘Month’ and ‘Year’ to
contain the year and month of the dataset extracted from the index column.
Create a new DataFrame with columns ‘Mean’ and ‘Standard
Deviation’ and the full set of years from (b) above.
Write a function which accepts (r_m,s_m) the monthy mean and standard
deviation of a return series and returns a tuple (r_a,s_a), the annualised
mean and standard deviation. Use the formulae: r_a = (1+r_m)^12 -1, and
s_a = s_m * 12^0.5.
Loop through each year in the data, and calculate the annualised mean and
standard deviation of the new ‘Mkt’ column, storing each in the newly
created DataFrame. Note that the values in the input file are % returns, and
need to be divided by 100 to return decimals (i.e the value for August 2022
represents a return of -3.78%).
Print the DataFrame and output it to a csv file.
Workings so far:
import pandas as pd
ff_monthly=pd.read_csv(r"file path",index_col=0)
Mkt=ff_monthly['Mkt-RF']+ff_monthly['RF']
ff_monthly= ff_monthly.assign(Mkt=Mkt)
df=pd.DataFrame(ff_monthly)
There are a few things to pay attention to.
The Date is the index of your DataFrame. This is treated in a special way compared to the normal columns. This is the reason df.Date gives an Attribute error. Date is not an Attribute, but the index. Instead try df.index
df.Date.str.split("_", expand=True) would work if your Date would look like 22_10. However according to your picture it doesn't contain an underscore and also contains the day, so this cannot work
In fact, the format you have does not follow any standard. The best way to deal with that is to parse it into a proper datetime64[ns] type that pandas understands, with df.index = pd.to_datetime(df.index, format='%y%m%d'). See the Python documentation for supported format strings.
If all this works, it should be rather straightforward to create the columns:
df['Year'] = df.index.year
df['Month'] = df.index.month
(a DatetimeIndex exposes .year and .month directly; the .dt accessor is only needed on a Series)
In fact, this part has been asked before
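The annualisation step can be sketched as a small function using exactly the formulae from the question (the example inputs are made up):

```python
def annualise(r_m, s_m):
    """Convert a monthly mean return and standard deviation (as decimals)
    to annualised figures: r_a = (1 + r_m)**12 - 1, s_a = s_m * 12**0.5."""
    r_a = (1 + r_m) ** 12 - 1   # compound the monthly mean over 12 months
    s_a = s_m * 12 ** 0.5       # scale volatility by the square root of 12
    return r_a, s_a

# e.g. a 1% monthly mean return and 4% monthly standard deviation
r_a, s_a = annualise(0.01, 0.04)
print(round(r_a, 4))  # 0.1268
print(round(s_a, 4))  # 0.1386
```

Remember the file stores percentage returns, so divide by 100 before calling this (e.g. -3.78 becomes -0.0378).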

Aggregate the pandas elements according to some group allocation?

I have a dataframe with dimensions [1,126], where each column corresponds to a specific economic variable and these economic variables fall into one of 8 groups like Output, Labor, Housing etc. I have a separate dataframe where this group allocation is described.
Is it possible to aggregate the values of initial dataframe into a new [1,8] array according to the groups? I have no prior knowledge on the number of variables belonging to each group.
here is the code for replication on smaller scale:
data = {'RPI':[1], 'IP':[1], 'Labor1':[2], 'Labor2':[2], 'Housing1':[3], 'Housing2':[3]}
df = pd.DataFrame(data)
groups = {'Description':['RPI','IP','Labor1','Labor2','Housing1','Housing2'],
'Groups':['Real','Real','Labor','Labor','Housing','Housing']}
groups = pd.DataFrame(groups)
The final version should look like smth like this:
aggregate = {'Real':[2],'Labor':[4],'Housing':[6]}
aggregate = pd.DataFrame(aggregate)
You can merge the groups onto the descriptions, then groupby and sum:
(df.T
.rename({0:'value'}, axis=1)
.merge(groups, left_index=True, right_on='Description')
.groupby('Groups')['value'].sum())
returns
Groups
Housing 6
Labor 4
Real 2
Name: value, dtype: int64
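An equivalent alternative, offered as a sketch: build a Description-to-Groups mapping and group the single row's values by mapped column name, which avoids the transpose and merge.

```python
import pandas as pd

data = {'RPI': [1], 'IP': [1], 'Labor1': [2], 'Labor2': [2], 'Housing1': [3], 'Housing2': [3]}
df = pd.DataFrame(data)
groups = pd.DataFrame({
    'Description': ['RPI', 'IP', 'Labor1', 'Labor2', 'Housing1', 'Housing2'],
    'Groups':      ['Real', 'Real', 'Labor', 'Labor', 'Housing', 'Housing'],
})

# Map each column name to its group, then sum the single row within each group
mapping = dict(zip(groups['Description'], groups['Groups']))
row = df.iloc[0]                                   # the [1, 126] frame has one row
aggregate = row.groupby(row.index.map(mapping)).sum()

print(aggregate.to_dict())  # {'Housing': 6, 'Labor': 4, 'Real': 2}
```

This also copes with the "no prior knowledge of group sizes" constraint, since the mapping drives everything.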

pandas.DataFrame.groupby(): keep 'whole' when grouping, in addition to groups

I want to produce an aggregation along a certain criterion, but also need a row with the same aggregation applied to the non-aggregated dataframe.
When using customers.groupby('year').size(), is there a way to keep the total among the groups, in order to output something like the following?
year customers
2011 3
2012 5
total 8
The only thing I could come up with so far is the following:
n_customers_per_year.loc['total'] = len(customers)
(n_customers_per_year is the dataframe aggregated by year. While this method is fairly straightforward for a single index, it seems to get messy when it has to be done on a multi-indexed aggregation.)
I believe the pivot_table method has a margins boolean argument which adds totals. Have a look:
margins : boolean, default False Add all row / columns (e.g. for
subtotal / grand totals)
I agree that this would be a desirable feature, but I don't believe it is currently implemented. Ideally, one would like to display an aggregation (e.g. sum) along one or more axis and or levels.
A workaround is to create a series that is the sum and then concatenate it to your DataFrame when delivering the data.
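That workaround can be sketched as follows (the data is made up to match the example output above):

```python
import pandas as pd

customers = pd.DataFrame({'year': [2011] * 3 + [2012] * 5})

per_year = customers.groupby('year').size()     # 2011 -> 3, 2012 -> 5
total = pd.Series({'total': len(customers)})    # one-element series for the grand total
n_customers = pd.concat([per_year, total])

print(n_customers.to_dict())  # {2011: 3, 2012: 5, 'total': 8}
```

For a multi-indexed aggregation the same idea works, but the total row needs a tuple key matching the number of index levels, which is where it gets messy.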

basic panda questions related to slicing and aggregating tables

I am getting familiar with Pandas and I want to learn the logic with a few simple examples.
Let us say I have the following panda DataFrame object:
import pandas as pd
d = {'year':pd.Series([2014,2014,2014,2014], index=['a','b','c','d']),
'dico':pd.Series(['A','A','A','B'], index=['a','b','c','d']),
'mybool':pd.Series([True,False,True,True], index=['a','b','c','d']),
'values':pd.Series([10.1,1.2,9.5,4.2], index=['a','b','c','d'])}
df = pd.DataFrame(d)
Basic Question.
How do I take a column as a list?
I.e., d['year']
would return
[2013,2014,2014,2014]
Question 0
How do I take rows 'a' and 'b' and columns 'year' and 'values' as a new dataFrame?
If I try:
d[['a','b'],['year','values']]
it doesn't work.
Question 1.
How would I aggregate (sum/average) the values column by the year, and dico columns, for example. I.e., such that different years/dico combinations would not be added, but basically mybool would be removed from the list.
I.e., after aggregation (this case average) I should get:
tipo values year
A 10.1 2013
A (9.5+1.2)/2 2014
B 4.2 2014
If I try the groupby function it seems to output some odd new DataFrame structure with bool in it, and all possible years/dico combinations - my objective is rather to have that simpler new sliced and smaller dataframe I showed above.
Question 2. How do I filter by a condition?
I.e., I want to filter out all bool columns that are False.
It'd return:
tipo values year mybool
A 10.1 2013 True
A 9.5 2014 True
B 4.2 2014 True
I've tried the panda tutorial but I still get some odd behavior so asking directly seems to be a better idea.
Thanks!
Values from a series as an array:
df['year'].values # returns a NumPy array; use df['year'].tolist() for a plain list
loc lets you subset a dataframe by index labels:
df.loc[['a','b'],['year','values']]
groupby lets you aggregate over columns:
df.groupby(['year','dico'],as_index=False).mean() #don't have 2013 in your df
Filtering by a column value:
df[df['mybool']] # equivalent to df[df['mybool']==True]
