I'm digging into the pandas aggregation functions while working with a wine reviews dataset. To aggregate the points given by wine reviewers, I used mean as a standalone function in agg():
reviewer_mean_ratings = reviews.groupby('taster_name').points.agg('mean')
The output looks like this:
I noticed that the output has 2 columns (at least that's what it looks like visually). But
type(reviewer_mean_ratings) = pandas.core.series.Series
Is that just 1 column with space between the name and mean rating? I'm confused.
I also noticed that I cannot sort this output in descending order by the mean ratings. If I instead pass mean as a list to agg(), sorting in descending order with the sort_values() method works.
My hypothesis is that if I want to access the mean ratings column later, the only way to do it is to use agg(['mean']) instead of agg('mean') in the original query. Am I mistaken somewhere?
The output is a pandas Series, which is roughly a one-column DataFrame with an index: the taster names are the index and the mean ratings are the values. To get the underlying values of the Series, just add '.values':
reviewer_mean_ratings = reviews.groupby('taster_name').points.agg('mean').values
This will output the values as a numpy array.
I found that the following statement sorts in descending order while still using 'mean' as a standalone function in agg():
reviews.groupby('taster_name').points.agg('mean').sort_values(ascending=False)
i.e. don't pass the "by" argument to sort_values(). A Series has only one set of values, so Series.sort_values() does not take it.
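To make the Series versus DataFrame difference concrete, here is a minimal sketch. The taster names and points are made up for illustration; only the column names come from the question.

```python
import pandas as pd

# Hypothetical stand-in for the wine reviews data (made-up values).
reviews = pd.DataFrame({
    "taster_name": ["Ann", "Ann", "Bob", "Bob", "Cy"],
    "points": [90, 86, 92, 94, 80],
})

# agg('mean') on a single column returns a Series indexed by taster_name.
as_series = reviews.groupby("taster_name").points.agg("mean")

# agg(['mean']) returns a DataFrame with one column named 'mean'.
as_frame = reviews.groupby("taster_name").points.agg(["mean"])

# A Series holds a single set of values, so no 'by' argument is needed.
sorted_series = as_series.sort_values(ascending=False)

# A DataFrame needs 'by' to say which column to sort on.
sorted_frame = as_frame.sort_values(by="mean", ascending=False)

print(type(as_series).__name__, type(as_frame).__name__)
print(sorted_series)
```

Both forms can be sorted; the only difference is whether sort_values() needs to be told which column to use.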
I have a question about multiple aggregations in pandas.
Originally I have a dataset which shows the oil price, and the detail is as follows:
And the head of the dataset is as follows:
What I want to do here is to get the mean and standard deviation for each quarter of the year 2014. And the ideal output is as follows:
In my script, I have already created the quarter info by doing so .
However, there is one thing that I do not understand.
If I try this command:
brent[brent.index.year == 2014].groupby('quarter').agg({"average_price": np.mean, "std_price": np.std})
I got an error as follows:
And if I use the following script, then it works
brent[brent.index.year == 2014].groupby('quarter').agg(average_price=('Price','mean'),
std_price=('Price','std'))
So the questions are:
What's wrong with the first approach here?
And why do we need to use the second approach here?
Thank you all for the help in advance!
What's wrong with the first approach here?
A dict is passed, so pandas looks for columns named after the keys average_price and std_price; because they do not exist in the DataFrame, it raises an error.
A possible solution is to select the column after groupby and pass a list of tuples, each giving a new column name and an aggregate function:
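As a small illustration of how the dict form is interpreted, here is a sketch on a made-up frame. The 'Price' and 'quarter' column names follow the question, the values are hypothetical, and the KeyError is what I would expect from recent pandas versions (older ones may raise a different exception type).

```python
import pandas as pd

# Hypothetical stand-in for the Brent data (made-up prices).
brent = pd.DataFrame({
    "quarter": [1, 1, 2, 2],
    "Price": [100.0, 110.0, 90.0, 70.0],
})

# Dict keys name EXISTING columns; values are the functions applied to them.
ok = brent.groupby("quarter").agg({"Price": ["mean", "std"]})

# Keys that are not columns of the frame are rejected.
try:
    brent.groupby("quarter").agg({"average_price": "mean", "std_price": "std"})
    raised = False
except KeyError:
    raised = True

print(ok)
print("missing columns rejected:", raised)
```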
brent[brent.index.year == 2014].groupby('quarter')['Price'].agg([('average_price','mean'),('std_price',np.std)])
This works because several new column names can be defined for the single column Price.
Later pandas versions (0.25+) support named aggregation:
brent[brent.index.year == 2014].groupby('quarter').agg(average_price=('Price','mean'),
std_price=('Price',np.std))
The logic here is that each aggregation defines a new column name together with the column to aggregate and the aggregate function. So it is possible to aggregate multiple columns with different functions:
brent[brent.index.year == 2014].groupby('quarter').agg(average_price=('Price','mean'),
std_price=('Price',np.std),
sumQ=('quarter','sum'))
Note that np.std defaults to ddof=0 while pandas' std defaults to ddof=1, so the two give different results when called directly on the same data.
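The ddof difference is easy to check on a tiny made-up series (the numbers below are hypothetical, not from the Brent data):

```python
import numpy as np
import pandas as pd

prices = pd.Series([1.0, 2.0, 3.0, 4.0])

# NumPy divides by N by default (ddof=0, population standard deviation).
population_std = np.std(prices.to_numpy())

# pandas divides by N - 1 by default (ddof=1, sample standard deviation).
sample_std = prices.std()

# Passing ddof=1 to NumPy reproduces the pandas default.
matched = np.std(prices.to_numpy(), ddof=1)

print(population_std, sample_std, matched)
```

Passing the string 'std' in agg() makes it explicit that the pandas default is the one being used.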
I've been poking around a bit and can't seem to find a close solution to this one:
I'm trying to transform a dataframe from this:
To this:
Such that remark_code_names with similar denial_amounts are provided new columns based on their corresponding har_id and reason_code_name.
I've tried a few things, including a groupby function, which gets me halfway there.
denials.groupby(['har_id','reason_code_name','denial_amount']).count().reset_index()
But this obviously leaves out the remark_code_names that I need.
Here's a minimal example:
pd.DataFrame({'har_id':['A','A','A','A','A','A','A','A','A'],'reason_code_name':[16,16,16,16,16,16,16,22,22],
'remark_code_name':['MA04','N130','N341','N362','N517','N657','N95','MA04','N341'],
'denial_amount':[5402,8507,5402,8507,8507,8507,8507,5402,5402]})
Using groupby() is a good way to go. Use it along with transform() and overwrite the 'remark_code_name' column. This solution puts all remark_code_names of a group together in the same column, separated by spaces.
denials['remark_code_name'] = denials.groupby(['har_id','reason_code_name','denial_amount'])['remark_code_name'].transform(lambda x : ' '.join(x))
denials.drop_duplicates(inplace=True)
If you really need to create each code in their own columns, you could apply another function and use .split(). However you will first need to set the number of columns depending on the max number of codes you find in a single row.
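If separate columns are needed, one possible sketch is below. It uses the minimal example from the question; the remark_code_1, remark_code_2, ... column names are my own invention. It relies on str.split(expand=True), which works out the number of columns itself and pads shorter rows with missing values.

```python
import pandas as pd

denials = pd.DataFrame({
    "har_id": ["A"] * 9,
    "reason_code_name": [16, 16, 16, 16, 16, 16, 16, 22, 22],
    "remark_code_name": ["MA04", "N130", "N341", "N362", "N517",
                         "N657", "N95", "MA04", "N341"],
    "denial_amount": [5402, 8507, 5402, 8507, 8507, 8507, 8507, 5402, 5402],
})

# One row per group, with the group's codes joined into a single string.
joined = (
    denials.groupby(["har_id", "reason_code_name", "denial_amount"])["remark_code_name"]
    .agg(" ".join)
    .reset_index()
)

# Split the joined string into one column per code.
codes = joined["remark_code_name"].str.split(" ", expand=True)
codes.columns = [f"remark_code_{i + 1}" for i in range(codes.shape[1])]  # hypothetical names

wide = pd.concat([joined.drop(columns="remark_code_name"), codes], axis=1)
print(wide)
```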
I'm a beginner in coding and I wrote some code in Python pandas that I don't fully understand, so I need some clarification.
Let's say this is the data; DeathYear, Age, Gender and Country are all columns in an Excel file.
How to plot a table with non-numeric values in python?
I saw this question and I used this command
df.groupby('Gender')['Gender'].count().plot.pie(autopct='%.2f',figsize=(5,5))
It works and gives me a pie chart of the percentage of each gender,
but the normal pie chart command that I know for numerical data looks like this:
df["Gender"].plot.pie(autopct="%.2f",figsize=(5,5))
My question is: why did we add the .count()?
Is it to transform non-numerical data to numerical?
And why did we use groupby and type the column twice, ('Gender')['Gender']?
I'll address the second part of your question first, since it makes more sense to explain it that way.
The reason that you use ('Gender')['Gender'] is that it does two different things. The first ('Gender') is the argument to the groupby function. It tells you that you want the DataFrame to be grouped by the 'Gender' column. Note that the groupby function needs to have a column or level to group by or else it will not work.
The second ['Gender'] tells you to only look at the 'Gender' column in the resulting DataFrame. The easiest way to see what the second ['Gender'] does is to compare the output of df.groupby('Gender').count() and df.groupby('Gender')['Gender'].count() and see what happens.
One detail that I omitted in the first part for clarity is that the output of df.groupby('Gender') is not a DataFrame, but a DataFrameGroupBy object. The details of what exactly this object is are not important to your question, but the key is that to get a DataFrame (or, once a single column has been selected, a Series) back you need to call a function that says what to put in each row of the result. The .count() function is one of those options (along with many others such as .mean()). In your case, since you want the total counts to make a pie chart, .count() does exactly that: it counts the number of times 'Female' and 'Male' appear in the 'Gender' column, and that count becomes the entry in the corresponding row. The result can then be used to create a pie chart. So you are correct that .count() transforms the non-numeric 'Female' and 'Male' entries into numeric values corresponding to how often those entries appeared in the initial DataFrame.
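The comparison suggested above can be sketched on a small made-up frame. The rows are hypothetical; only the 'Gender' and 'Age' column names follow the question.

```python
import pandas as pd

# Hypothetical data with the question's column names (made-up rows).
df = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Female"],
    "Age": [70, 65, 80, 59],
})

grouped = df.groupby("Gender")          # a DataFrameGroupBy, not a DataFrame

# Without a column selection, count() reports one count per remaining column.
all_cols = grouped.count()

# Selecting ['Gender'] first gives a single Series of counts per gender.
one_col = grouped["Gender"].count()

# value_counts() is a shorter route to the same per-category counts.
shortcut = df["Gender"].value_counts()

print(all_cols)
print(one_col)
```

Either one_col or shortcut is numeric, so it can be passed on to .plot.pie(...) as in the question.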
I have a function returning a tuple of values, as an example:
def dumb_func(number):
    return number+1,number-1
I'd like to apply it to a pandas DataFrame
df=pd.DataFrame({'numbers':[1,2,3,4,5,6,7]})
test=df['numbers'].apply(dumb_func)
The result is that test is a pandas series containing tuples.
Is there a way to use the variable test, or to replace it, so that the results of the function are assigned to two distinct columns 'number_plus_one' and 'number_minus_one' of the original DataFrame?
df[['number_plus_one', 'number_minus_one']] = pd.DataFrame(zip(*df['numbers'].apply(dumb_func))).transpose()
To understand, try taking it apart piece by piece. Have a look at zip(*df['numbers'].apply(dumb_func)) in isolation (you'll need to convert it to a list). You'll see how it unpacks the tuples one by one and creates two separate lists out of them. Then have a look what happens when you create a dataframe out of it - you'll see why the transpose is necessary. For more on zip, see here : docs.python.org/3.8/library/functions.html#zip
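Taken apart step by step, the one-liner looks roughly like this. The intermediate names unzipped, as_rows and as_cols are mine, introduced only for illustration.

```python
import pandas as pd

def dumb_func(number):
    return number + 1, number - 1

df = pd.DataFrame({"numbers": [1, 2, 3, 4, 5, 6, 7]})
test = df["numbers"].apply(dumb_func)      # Series of (plus, minus) tuples

# zip(*...) regroups the pairs: first all plus-ones, then all minus-ones.
unzipped = list(zip(*test))

# Each regrouped tuple becomes a ROW here, giving a 2 x 7 frame ...
as_rows = pd.DataFrame(unzipped)

# ... so it is transposed to 7 x 2 to line up with the rows of df.
as_cols = as_rows.transpose()

df[["number_plus_one", "number_minus_one"]] = as_cols
print(df)
```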
Method 1: When you don't use dumb_func,
df[['numbers_plus_one','numbers_minus_one']]=pd.DataFrame(df.apply(lambda x: (x['numbers']+1,x['numbers']-1),axis=1).values.tolist())
Method 2: When you have test(i.e. series of tuples you mentioned in question)
df[['numbers_plus_one','numbers_minus_one']]=pd.DataFrame(test.values.tolist())
I hope this is helpful
I used pivot to reshape my data and now have a column multiindex. I want the resulting columns to be the X variables in a simple OLS regression. The Y's are another series with the same row index.
When I try running
model1 = ols(y = gdp0, x = MIDAS_small)
I get
TypeError: can only call with other hierarchical index objects
I can imagine two solutions but can't figure out either one:
Collapse the multiindex. Rather than having columns of the form ('before', 'var1') and ('after', 'var1'), I would just have a bunch of 'beforevar1', 'aftervar1', etc. Then I could use ols to produce a nice and sufficiently legible table.
Is there some way to run a regression with a multiindex? It seems like it was designed in part for this sort of thing, especially panel regressions, but I couldn't find any relevant examples or documentation.
Well, I found an inelegant solution to #1:
I can create a new dataframe, loop over both column indexes, and insert new columns into the new dataframe with the same name, but with names as strings instead of tuples. There must be a more elegant, single command, right?
Have you tried using dmatrices from Patsy to prepare a regression-friendly DataFrame?
An example is located here:
http://statsmodels.sourceforge.net/devel/gettingstarted.html
I'm sure you are aware of the .unstack() function in pandas, which would allow you to remove the hierarchical indexing; combined with dmatrices, it could produce the result that you're looking for.
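For the first option in the question (collapsing the column MultiIndex into names like 'beforevar1'), a list comprehension over the column tuples is one common approach. This is only a sketch: the contents of MIDAS_small below are made up, and only the ('before', 'var1') naming pattern comes from the question.

```python
import pandas as pd

# Hypothetical frame with two-level columns like those produced by pivot.
MIDAS_small = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]],
    columns=pd.MultiIndex.from_product([["before", "after"], ["var1", "var2"]]),
)

# Each column label is a tuple such as ('before', 'var1'); join it into one string.
flat = MIDAS_small.copy()
flat.columns = ["".join(col) for col in flat.columns]

print(list(flat.columns))
```

Joining with "_" instead of "" may make the resulting regression table a little easier to read.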