I got a question regarding the multiple aggregation in pandas.
Originally I have a dataset which shows the oil price, and the detail is as follows:
And the head of the dataset is as follows:
What I want to do here is to get the mean and standard deviation for each quarter of the year 2014. And the ideal output is as follows:
In my script, I have already created the quarter info by doing so .
However, one thing that I do not understand here:
If I tried to use this command to do so
brent[brent.index.year == 2014].groupby('quarter').agg({"average_price": np.mean, "std_price": np.std})
I got an error as follows:
And if I use the following script, then it works
brent[brent.index.year == 2014].groupby('quarter').agg(average_price=('Price','mean'),
std_price=('Price','std'))
So the questions are:
What's wrong with the first approach here?
And why do we need to use the second approach here?
Thank you all for the help in advance!
What's wrong with the first approach here?
There is passed dict, so pandas looking for columns from keys average_price, std_price and because not exist in DataFrame if return error.
Possible solution is specified column after groupby and pass list of tuples for specified new columns names with aggregate functions:
brent[brent.index.year == 2014].groupby('quarter')['Price'].agg([('average_price','mean'),('std_price',np.std)])
It is possible, because for one column Price is possible defined multiple columns names.
In later pandas versions are used named aggregations:
brent[brent.index.year == 2014].groupby('quarter').agg(average_price=('Price','mean'),
std_price=('Price',np.std))
Here is logic - for each aggregation is defined nw column name with aggregate column and aggregate function. So is possible aggregate multiple columns with different functions:
brent[brent.index.year == 2014].groupby('quarter').agg(average_price=('Price','mean'),
std_price=('Price',np.std),
sumQ=('quarter','sum'))
Notice, np.std has default ddof=0 and pandas std has ddof=1, so different outputs.
Related
I am applying an aggregate function on a data frame in pyspark. I am using a dictionary to pass the column name and aggregate function
df.groupBy(column_name).agg({"column_name":"sum"})
I now want to apply an alias to this column that has been generated using the aggregate method. Is there a way to do it?
The reason I am using the dictionary method is that aggregates will be applied dynamically depending on input parameters.
So basically it will be like
def aggregate(df, column_to_group_by, columns_to_aggregate):
df.groupBy(column_to_group_by).agg(columns_to_aggregate)
Where columns_to_aggregate will look like
{
"salary":"sum"
}
I now want to apply alias to the newly created column, because If I try to save the result to disk as praquet I get the error
Column name "sum(salary)" contains invalid character(s). Please use alias to rename it.
Any help on how to apply alias dynamically will be great
Thanks !
from pyspark.sql.functions import sum
df.groupBy("state") \
.agg(sum("salary").alias("sum_salary"))
Please read the article
I can see that this question is from 4 months ago. Here is the link to a possible solution where you rename the columns after aggregation by replacing some characters:
https://stackoverflow.com/a/70101696
The provided solution:
df.groupBy('group')
.agg({'money':'sum',
'moreMoney':'sum',
'evenMoreMoney':'sum'
})
.select(*(col(i).alias(i.replace("(",'_').replace(')','')) for i in df.columns))
It will create columns: sum_money, sume_moreMoney etc. And of course you can choose to rename/replace differently.
I have a dataframe that's the result of importing a csv and then performing a few operations and adding a column that's the difference between two other columns (column 10 - column 9 let's say). I am trying to sort the dataframe by the absolute value of that difference column, without changing its value or adding another column.
I have seen this syntax over and over all over the internet, with indications that it was a success (accepted answers, comments saying "thanks, that worked", etc.). However, I get the error you see below:
df.sort_values(by='Difference', ascending=False, inplace=True, key=abs)
Error:
TypeError: sort_values() got an unexpected keyword argument 'key'
I'm not sure why the syntax that I see working for other people is not working for me. I have a lot more going on with the code and other dataframes, so it's not a pandas import problem I don't think.
I have moved on and just made a new column that is the absolute value of the difference column and sorted by that, and exclude that column from my export to worksheet, but I really would like to know how to get it to work the other way. Any help is appreciated.
I'm using Python 3
df.loc[(df.c - df.b).sort_values(ascending = False).index]
Sorting by difference between "c" and "b" , without creating new column.
I hope this is what you were looking for.
key is optional argument
It accepts series as input , maybe you were working with dataframe.
check this
I'm digging into pandas aggregator function while working with a wine reviews dataset. To aggregate points given by wine reviewers, I noticed that, when I used mean as a standalone function in agg():
reviewer_mean_ratings = reviews.groupby('taster_name').points.agg('mean')
The output looks like this:
Noticed that the output has 2 columns(at least that's what it looks like visually). But
type(reviewer_mean_ratings) = pandas.core.series.Series
Is that just 1 column with space between the name and mean rating? I'm confused.
Also noticed that, I cannot sort this output in descending order by the mean ratings. Instead if I had used mean as a list in agg() then descending order works using sort_values() method later.
My hypothesis is that if I want to access the mean ratings column later, the only way to do it is to use agg(['mean']) instead of agg('mean') in the original query. Am I mistaken somewhere?
The output is a pandas Series, sort of like a 1-column Dataframe with index. To get the actual values of the Series, just add '.values':
reviewer_mean_ratings = reviews.groupby('taster_name').points.agg('mean').values
This will output the values as a numpy array.
Found that the following statement works to get descending order by using 'mean' as a standalone function in agg() method.
reviews.groupby('taster_name').points.agg('mean').sort_values(ascending=False)
i.e. don't use the "by" clause in sort_values() method.
I'm a veteran of Pandas DataFrame objects, but I'm struggling to find a clean, convenient method for altering the values in a Dask DataFrame column. For a specific example, I'm trying to multiply positive values in a numpy.float column by -1, thereby making them negative. Here is my current method (I'm trying to change the last column in the DataFrame):
cols = df.columns
df[[cols[-1]]] = df[[cols[-1]]]*-1
This seems to work only if the column has a string header, otherwise it adds another column using the index number as a string-type column name for a new column. Is there something akin to the Pandas method of, say, df.iloc[-1,:] = df.iloc[-1,:]*-1 that I can use with a Dask dataframe?
Edit: I'm also trying to implement: df = df.applymap(lambda x: x*-1). This, of course, applies the function to the entire dataframe, but is there a way to apply a function over just one column? Thank you.
first question
If something works for string columns and not for numeric-named columns then that is probably a bug. I recommend raising an issue at https://github.com/dask/dask/issues/new
second question
but is there a way to apply a function over just one column?
You can't apply a single Python function over a dask dataframe that is stored in many pieces directly, however methods like .map_partitions or .reduction may help you to achieve the same result with some cleverness.
in the future we recommend asking separate questions separately on stack overflow
I have a DataFrame ('main') that has about 300 columns. I created a smaller DataFrame ('public') and have been working on this.
I now want to delete the columns contained within 'public' from the larger DataFrame ('main').
I've tried the following instructions:
http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.drop.html
Python Pandas - Deleting multiple series from a data frame in one command
without any success, along with various other statements that have been unsuccessful.
The columns that make up 'public' are not consecutive - i.e. they are taken from various points in the larger DataFrame 'main'. All of the columns have the same Index. [Not sure if this is important, but 'public' was created using the 'join' function].
Yes, I'm being lazy - I don't want to have to type out the names of every column! I'm hoping there's a way to use the DataFrame 'public' in a statement that will allow deletion of these columns en masse. If anyone has any suggestions and/or guidance I'd be most grateful.
(Have Python 2.7 and am using Pandas, numpy, math, pylab etc.)
Thanks in advance.
Ignore my question - Murphy's Law prevails and I've just solved it.
I was using the statement from the stackoverflow question mentioned below:
df.drop(df.columns[1:], axis=1)
and this was not working. I have instead used
df = df.drop(df2, axis=1)
and this worked (df = main, df2 = public). Simple really once you don't overthink it.