Pyspark Column name alias when applying Aggregate using a Dictionary - python

I am applying an aggregate function to a DataFrame in PySpark, using a dictionary to pass the column name and the aggregate function:
df.groupBy(column_name).agg({"column_name":"sum"})
I now want to apply an alias to this column that has been generated using the aggregate method. Is there a way to do it?
The reason I am using the dictionary method is that aggregates will be applied dynamically depending on input parameters.
So basically it will be like
def aggregate(df, column_to_group_by, columns_to_aggregate):
    # columns_to_aggregate maps a column name to an aggregate function name
    return df.groupBy(column_to_group_by).agg(columns_to_aggregate)
Where columns_to_aggregate will look like
{
"salary":"sum"
}
I now want to apply an alias to the newly created column, because if I try to save the result to disk as Parquet I get the error
Column name "sum(salary)" contains invalid character(s). Please use alias to rename it.
Any help on how to apply alias dynamically will be great
Thanks !

import pyspark.sql.functions as F

# Importing the module as F avoids shadowing Python's built-in sum
df.groupBy("state").agg(F.sum("salary").alias("sum_salary"))

I can see that this question is from 4 months ago. Here is the link to a possible solution where you rename the columns after aggregation by replacing some characters:
https://stackoverflow.com/a/70101696
The provided solution:
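from pyspark.sql.functions import col

# Aggregate first, then rename using the columns of the aggregated result
agg_df = df.groupBy('group').agg({'money': 'sum',
                                  'moreMoney': 'sum',
                                  'evenMoreMoney': 'sum'})
agg_df.select(*(col(i).alias(i.replace('(', '_').replace(')', '')) for i in agg_df.columns))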
This creates the columns sum_money, sum_moreMoney, and so on. You can of course choose a different renaming scheme.
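For the dynamic case in the question, another option is to build aliased column expressions from the dictionary rather than renaming afterwards. The following is a minimal sketch, assuming every function name in the dictionary (such as "sum" or "avg") exists under that name in pyspark.sql.functions. The helper names alias_for and aggregate_with_aliases and the func_column naming scheme are illustrative choices, not part of the original answers:

```python
def alias_for(column, func):
    """Build an output name without parentheses, such as 'sum_salary'."""
    return f"{func}_{column}"


def aggregate_with_aliases(df, column_to_group_by, columns_to_aggregate):
    """Group df and apply each {column: function_name} pair with an alias.

    Assumption: each function name is available in pyspark.sql.functions.
    """
    # Imported here so alias_for can be used without Spark installed.
    import pyspark.sql.functions as F

    exprs = [
        getattr(F, func)(column).alias(alias_for(column, func))
        for column, func in columns_to_aggregate.items()
    ]
    return df.groupBy(column_to_group_by).agg(*exprs)


# Names that the aggregated columns would receive.
names = [alias_for(c, f) for c, f in {"salary": "sum", "age": "avg"}.items()]
print(names)
```

Under these assumptions, aggregate_with_aliases(df, "state", {"salary": "sum"}) would produce a column named sum_salary instead of sum(salary).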

Related

How to add more than one dataframe column in pandas groupby function?

I have written the following codes in three separate cells in my jupyter notebook and have been able to generate the output I want. However, having this information in one dataframe will make it much easier to read.
How can I combine these separate dataframes into one so that the member_casual column is the index with max_ride_length, avg_ride_length and most_active_day_of_week columns next to it in the same dataframe?
Malo is correct. I will expand a little bit because you can also name the columns when they are aggregated:
df.groupby('member_casual').agg(max_ride_length=('ride_length','max'), avg_ride_length=('ride_length','mean'), most_active_day_of_the_week=('day_of_week',pd.Series.mode))
According to the documentation (https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.aggregate.html),
agg also accepts a list of functions, as in the example:
df.groupby('A').agg(['min', 'max'])
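To make the difference between the two forms concrete, here is a small sketch on made-up data (the real ride data is not shown in the question, so the values below are assumptions). It contrasts the column labels produced by a list of functions with those produced by named aggregation:

```python
import pandas as pd

# Made-up sample data; the real ride data is not shown in the question.
rides = pd.DataFrame({
    "member_casual": ["member", "member", "casual", "casual"],
    "ride_length": [10, 20, 30, 50],
})

# A list of functions yields a MultiIndex on the columns: (column, function).
multi = rides.groupby("member_casual")[["ride_length"]].agg(["max", "mean"])

# Named aggregation yields flat, explicitly named columns.
named = rides.groupby("member_casual").agg(
    max_ride_length=("ride_length", "max"),
    avg_ride_length=("ride_length", "mean"),
)

print(multi.columns.tolist())
print(named.columns.tolist())
```

Named aggregation is usually the more convenient form when the result should be a single readable DataFrame, since no column flattening is needed afterwards.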

Issue in renaming the multiple aggregation outcome columns in pandas python

I got a question regarding the multiple aggregation in pandas.
Originally I have a dataset which shows the oil price, and the detail is as follows:
And the head of the dataset is as follows:
What I want to do here is to get the mean and standard deviation for each quarter of the year 2014. And the ideal output is as follows:
In my script, I have already created the quarter info by doing so.
However, there is one thing that I do not understand.
If I try this command:
brent[brent.index.year == 2014].groupby('quarter').agg({"average_price": np.mean, "std_price": np.std})
I got an error as follows:
And if I use the following script, then it works
brent[brent.index.year == 2014].groupby('quarter').agg(average_price=('Price','mean'),
std_price=('Price','std'))
So the questions are:
What's wrong with the first approach here?
And why do we need to use the second approach here?
Thank you all for the help in advance!
What's wrong with the first approach here?
A dict is passed, so pandas looks for columns named after its keys, average_price and std_price. Because those columns do not exist in the DataFrame, it raises an error.
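A small sketch of this behaviour on made-up prices (the real Brent dataset is not shown, so the numbers are assumptions): a dict whose key is an existing column works, while a dict whose key is a desired output name fails.

```python
import pandas as pd

# Made-up prices; the real Brent dataset is not shown in the question.
brent = pd.DataFrame({
    "quarter": [1, 1, 2, 2],
    "Price": [100.0, 110.0, 90.0, 70.0],
})

grouped = brent.groupby("quarter")

# Dict keys must be existing columns; here the only valid key is "Price".
by_dict = grouped.agg({"Price": ["mean", "std"]})

# A key that is not a column makes agg raise an exception.
try:
    grouped.agg({"average_price": "mean"})
    missing_key_failed = False
except Exception:
    missing_key_failed = True

print(by_dict)
print(missing_key_failed)
```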
One solution is to select the column after groupby and pass a list of tuples that pair each new column name with an aggregate function:
brent[brent.index.year == 2014].groupby('quarter')['Price'].agg([('average_price','mean'),('std_price',np.std)])
This works because, for the single column Price, several output column names can be defined.
Later pandas versions (0.25 and newer) use named aggregation:
brent[brent.index.year == 2014].groupby('quarter').agg(average_price=('Price','mean'),
std_price=('Price',np.std))
The logic is that each aggregation defines a new column name together with the column to aggregate and the aggregate function. This makes it possible to aggregate multiple columns with different functions:
brent[brent.index.year == 2014].groupby('quarter').agg(average_price=('Price','mean'),
std_price=('Price',np.std),
sumQ=('quarter','sum'))
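Note that np.std defaults to ddof=0 while the pandas std defaults to ddof=1, so the two can give different results when called directly. How np.std is treated inside agg can depend on the pandas version, so passing the string 'std' is the unambiguous choice.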
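The ddof difference can be checked directly on a few made-up numbers. This sketch calls both functions outside of groupby, so it only illustrates the defaults themselves:

```python
import numpy as np
import pandas as pd

values = [1.0, 2.0, 3.0, 4.0]

population_std = np.std(values)        # ddof=0: divides by n
sample_std = pd.Series(values).std()   # ddof=1: divides by n - 1
matched_std = np.std(values, ddof=1)   # same divisor as the pandas default

print(population_std, sample_std, matched_std)
```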

pandas agg() with mean(standalone vs list)

I'm digging into pandas aggregator function while working with a wine reviews dataset. To aggregate points given by wine reviewers, I noticed that, when I used mean as a standalone function in agg():
reviewer_mean_ratings = reviews.groupby('taster_name').points.agg('mean')
The output looks like this:
I noticed that the output appears to have 2 columns (at least that's what it looks like visually). But
type(reviewer_mean_ratings) = pandas.core.series.Series
Is that just 1 column with space between the name and mean rating? I'm confused.
I also noticed that I cannot sort this output in descending order by the mean ratings. If I instead pass mean in a list to agg(), then descending order works using the sort_values() method later.
My hypothesis is that if I want to access the mean ratings column later, the only way to do it is to use agg(['mean']) instead of agg('mean') in the original query. Am I mistaken somewhere?
The output is a pandas Series, which is like a one-column DataFrame with an index: the taster names are the index and the mean ratings are the values. To get the values alone, just add '.values':
reviewer_mean_ratings = reviews.groupby('taster_name').points.agg('mean').values
This will output the values as a numpy array.
I found that the following statement sorts in descending order while still using 'mean' as a standalone function in agg():
reviews.groupby('taster_name').points.agg('mean').sort_values(ascending=False)
i.e. don't pass the "by" argument to sort_values(); a Series has only one set of values to sort by.
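A short sketch of the Series and DataFrame cases side by side, using made-up reviews in place of the wine dataset (the names and scores are assumptions):

```python
import pandas as pd

# Made-up sample data standing in for the wine reviews dataset.
reviews = pd.DataFrame({
    "taster_name": ["Ann", "Ann", "Bob", "Cy"],
    "points": [90, 80, 95, 88],
})

# agg('mean') on one column returns a Series indexed by taster_name.
mean_series = reviews.groupby("taster_name").points.agg("mean")
sorted_series = mean_series.sort_values(ascending=False)

# agg(['mean']) returns a DataFrame, so sort_values needs by=.
mean_frame = reviews.groupby("taster_name").points.agg(["mean"])
sorted_frame = mean_frame.sort_values(by="mean", ascending=False)

# reset_index turns the Series into a two-column DataFrame.
as_columns = mean_series.reset_index(name="mean_points")

print(sorted_series)
print(as_columns)
```

Either form keeps the mean ratings accessible later; the choice mainly affects whether the result is a Series or a DataFrame.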

What is the best way to modify (e.g., perform math functions) a column in a Dask DataFrame?

I'm a veteran of Pandas DataFrame objects, but I'm struggling to find a clean, convenient method for altering the values in a Dask DataFrame column. For a specific example, I'm trying to multiply positive values in a numpy.float column by -1, thereby making them negative. Here is my current method (I'm trying to change the last column in the DataFrame):
cols = df.columns
df[[cols[-1]]] = df[[cols[-1]]]*-1
This seems to work only if the column has a string header; otherwise it adds another column, using the index number as a string-type name for the new column. Is there something akin to the Pandas method of, say, df.iloc[:, -1] = df.iloc[:, -1]*-1 that I can use with a Dask dataframe?
Edit: I'm also trying to implement: df = df.applymap(lambda x: x*-1). This, of course, applies the function to the entire dataframe, but is there a way to apply a function over just one column? Thank you.
first question
If something works for string columns and not for numeric-named columns then that is probably a bug. I recommend raising an issue at https://github.com/dask/dask/issues/new
second question
but is there a way to apply a function over just one column?
You can't directly apply a single Python function over a dask dataframe that is stored in many pieces. However, methods like .map_partitions or .reduction may help you to achieve the same result with some cleverness.
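As a rough sketch of the .map_partitions route, the function below negates the last column of one pandas DataFrame by position, so it does not depend on the column label being a string. It assumes each Dask partition is handed to the function as a pandas DataFrame; the helper names negate_last_column and negate_last_column_lazy are hypothetical, and only the pandas part is exercised here:

```python
import pandas as pd


def negate_last_column(partition):
    """Return a copy of one pandas partition with its last column negated."""
    out = partition.copy()
    # Positional access works for string and non-string column labels alike.
    out.iloc[:, -1] = out.iloc[:, -1] * -1
    return out


def negate_last_column_lazy(ddf):
    """Apply negate_last_column to every partition of a Dask DataFrame."""
    return ddf.map_partitions(negate_last_column)


# Checking the per-partition function on a plain pandas DataFrame
# with integer column labels.
sample = pd.DataFrame({0: [1.0, 2.0], 1: [3.0, 4.0]})
result = negate_last_column(sample)
print(result)
```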
In the future, we recommend asking separate questions separately on Stack Overflow.

Column to Transaction ID for association rules on dataframes from Pandas Python.

I imported a CSV into Python with Pandas and I would like to be able to use one of the columns as a transaction ID in order to make association rules.
(link: https://github.com/antonio1695/Python/blob/master/nearBPO/facturas.csv)
I hope someone can help me to:
Use UUID as a transaction ID for me to have a dataframe like the following:
UUID Desc
123ex Meat,Beer
In order for me to get association rules like: {Meat} => {Beer}.
Also, a recommendation on a library to do so in a simple way would be appreciated.
Thank you for your time.
You can aggregate values into a list by doing the following:
df.groupby('UUID')['Desc'].apply(list)
This will give you what you want, if you want the UUID back as a column you can call reset_index on the above:
df.groupby('UUID')['Desc'].apply(list).reset_index()
A Series can also be exported to a csv, the same as a df:
df.groupby('UUID')['Desc'].apply(list).to_csv(your_path)
You may need to name your index prior to exporting or, if you find it easier, call reset_index to restore the index as a column and then call to_csv.
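A runnable sketch of the grouping step on made-up invoice lines (the linked CSV is not reproduced here, so the rows are assumptions). It also shows a comma-joined variant that matches the layout in the question:

```python
import pandas as pd

# Made-up invoice lines; the linked CSV is not reproduced here.
df = pd.DataFrame({
    "UUID": ["123ex", "123ex", "456ab"],
    "Desc": ["Meat", "Beer", "Bread"],
})

# One row per UUID, with all descriptions collected into a list.
transactions = df.groupby("UUID")["Desc"].apply(list).reset_index()

# A comma-joined string matches the layout shown in the question.
joined = df.groupby("UUID")["Desc"].agg(",".join).reset_index()

print(transactions)
print(joined)
```

The list form is generally the easier starting point for association-rule tooling, since each row is already one transaction.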
