Pyspark groupby / agg function without changing the column names? - python

In cases where I have a large number of columns that I want to sum, average, etc., is there a way to keep the column names unchanged without calling .alias on each column? By default the function is added to the column name (e.g. col1 becomes "avg(col1)" after taking the average). Is there an efficient way to have it stay named "col1"?
df = df.groupby(seg).avg('col1')

Related

Calculating mean of rows taking specific columns from a list and adding the mean column to pyspark dataframe

I have a PySpark dataframe with columns "A", "B", "C", and "D". I want to add a column holding the row-wise mean, but the columns to average (at row level) should be taken from a list, l=["A","C"].
The reason for the list is that the column names and their number might vary, so it needs to be flexible. E.g. I might want the row-level mean for l=["A","B","C"] or just l=["A","D"].
Finally, I want this mean column appended to the original PySpark dataframe.
How do I code this in PySpark?
When you say you want the mean, I assume you want the arithmetic mean.
In that case, it's really simple. You can create a function like this:
from pyspark.sql import functions as F
def arithmetic_mean(*cols):
    # Add the columns element-wise, then divide by how many there are
    return sum(F.col(col) for col in cols) / len(cols)
Assuming df is your dataframe, you simply use it like this (for a list l of column names, pass arithmetic_mean(*l) instead):
df.withColumn("mean", arithmetic_mean("A", "C"))

Ungrouping a pandas dataframe after aggregation operation

I have used the "groupby" method on my dataframe to find the total number of people at each location.
To the right of the "sum" column, I need to add a column that lists all of the people's names at each location (ideally in separate rows, but a list would be fine too).
Is there a way to "ungroup" my dataframe again after having found the sum?
dataframe.groupby(by=['location'], as_index=False)['people'].agg('sum')
You can do two different things:
(1) Create an aggregate DataFrame using groupby.agg and calling appropriate methods. The code below lists all names corresponding to a location:
out = dataframe.groupby(by=['location'], as_index=False).agg({'people':'sum', 'name':list})
(2) Use groupby.transform to add a new column to dataframe that has the sum of people by location in each row:
dataframe['sum'] = dataframe.groupby(by=['location'])['people'].transform('sum')
I think you are looking for transform:
dataframe.groupby(by=['location'], as_index=False)['people'].transform('sum')
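To make the difference between the two options concrete, here is a minimal sketch on a small made-up dataframe. The sample rows are an assumption for illustration; the column names (location, name, people) follow the question.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
dataframe = pd.DataFrame({
    "location": ["north", "north", "south"],
    "name": ["Ann", "Bob", "Cid"],
    "people": [2, 3, 4],
})

# Option 1: one row per location, holding the total and the list of names.
out = dataframe.groupby(by=["location"], as_index=False).agg(
    {"people": "sum", "name": list}
)

# Option 2: keep every original row and broadcast the group total onto it.
dataframe["sum"] = dataframe.groupby(by=["location"])["people"].transform("sum")

print(out)
print(dataframe)
```

Option 1 shrinks the frame to one row per group, while option 2 keeps the original shape, which is the closest thing to "ungrouping".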

Pandas Groupby Syntax explanation

I am confused about why a pandas groupby can be written in both of the ways below and yield the same result. The specific code is not really the question; both give the same result. I would like someone to break down the syntax of both.
df.groupby(['gender'])['age'].mean()
df.groupby(['gender']).mean()['age']
In the first instance, it reads as if you are calling the .mean() function on the age column specifically. The second appears like you are calling .mean() on the whole groupby object and selecting the age column after? Are there runtime considerations?
"It reads as if you are calling the .mean() function on the age column specifically. The second appears like you are calling .mean() on the whole groupby object and selecting the age column after?"
This is exactly what's happening. df.groupby() returns a DataFrameGroupBy object, not a dataframe. Calling .mean() on it computes the mean of each column within each group, independently of the other columns, and returns a dataframe indexed by the group keys; ['age'] then selects a single column of that result as a Series.
Reversing the order first selects the single age column (giving a SeriesGroupBy) and then calculates the mean of only that column. If you know you only want the mean for a single column, it will be faster to isolate that first, rather than calculate the mean for every column (especially if you have a very large dataframe).
Think of groupby as a row-separation function. It groups all rows having the same attributes (as specified in the by parameter) into separate data frames.
After the groupby, you need an aggregate function to summarize data in each subframe. You can do that in a number of ways:
# In each subframe, take the `age` column and summarize it
# using the `mean` function
df.groupby(['gender'])['age'].mean()
# In each subframe, apply the `mean` function to all numeric
# columns then extract the `age` column
df.groupby(['gender']).mean()['age']
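The first method is more efficient since you are applying the aggregate function (mean) to a single column. It is also more robust: in pandas 2.0 and later, the second form raises a TypeError if the dataframe contains non-numeric columns other than the grouping key, unless you pass numeric_only=True to mean().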
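A quick way to check what each expression is operating on is to inspect the intermediate objects. This is a small sketch on invented data; the height column and all values are assumptions added so the frame has more than one numeric column.

```python
import pandas as pd

# Hypothetical data; every non-key column is numeric so .mean() works on all of them.
df = pd.DataFrame({
    "gender": ["F", "M", "F", "M"],
    "age": [30, 40, 50, 20],
    "height": [160.0, 180.0, 170.0, 175.0],
})

grouped = df.groupby(["gender"])
print(type(grouped).__name__)         # the whole groupby object
print(type(grouped["age"]).__name__)  # the groupby restricted to one column

# Select the column first, then aggregate only that column.
first = df.groupby(["gender"])["age"].mean()

# Aggregate every column, then pick one column from the resulting dataframe.
second = df.groupby(["gender"]).mean()["age"]

print(first.equals(second))
```

Both results are the same Series indexed by gender; the second form just computes the mean of height as well and then throws it away.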

python - dataframe - groupby - treatment of non-grouped column in case of difference

I have a dataframe containing an ID and I wish to 'group by' based on the ID. I need to keep all other columns (static data, strings) of the dataframe as well, so initially I included all static data columns in the group by. However, there can be differences in the static data between 2 or more rows that have the same ID (due to different sources). In that case I would still like to group on the ID and not create 'duplicates'. For the column having the difference I'm rather indifferent: the grouped row can just take the first value it encounters among the conflicting rows.
Hope this illustration clarifies:
example
Any suggestions?
You can use groupby().agg() and specify, in a dictionary, what you want to do with each column in your dataframe. Based on the example of your intended outcome, that would be:
df.groupby('identifier').agg({'name': 'first', 'amount':'sum'})
It takes the first value of the name column within each group and sums the values in the amount column.
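As a rough illustration, here is the same call on a made-up frame where two rows share an identifier but disagree on the name. The data is an assumption, since the original illustration is not available; the column names come from the answer above.

```python
import pandas as pd

# Hypothetical rows: identifier 1 appears twice with conflicting names.
df = pd.DataFrame({
    "identifier": [1, 1, 2],
    "name": ["Acme Ltd", "ACME Limited", "Globex"],
    "amount": [10, 5, 7],
})

# One row per identifier: keep the first name seen, add up the amounts.
result = df.groupby("identifier").agg({"name": "first", "amount": "sum"})
print(result)
```

Any other static column can be added to the dictionary with 'first' in the same way, so the conflicting values never end up in the grouping key.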

Writing the mean of values to a new column in Pandas dataframe

I need to create a new column in my df that holds the mean of another existing column, but I need it to take into account each individual location over time rather than the mean of all the values in the existing column.
Based on the sample dataset below, what I am looking for is a new column that contains the Mean for each Site, not the mean of all the values independent of Site.
Sample Dataset
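Use groupby and agg to compute the mean of that column per Site, then merge it back on Site. The suffixes argument keeps the original TIME_HOUR and names the new column TIME_HOUR_mean; without it the merge would produce TIME_HOUR_x and TIME_HOUR_y:
df = df.merge(df.groupby('Site', as_index=False).agg({'TIME_HOUR': 'mean'}), on='Site', how='left', suffixes=('', '_mean'))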
Use groupby:
df.groupby('Site')['TIME_HOUR'].mean().reset_index()
This gives one row per Site; to assign it to a column of the original dataframe, merge it back on Site.
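As an alternative sketch, groupby.transform returns a result aligned with the original rows, so it can be assigned to a new column directly without a merge. The sample values and the new column name TIME_HOUR_mean are assumptions for illustration; Site and TIME_HOUR come from the question.

```python
import pandas as pd

# Hypothetical sample data using the question's column names.
df = pd.DataFrame({
    "Site": ["A", "A", "B"],
    "TIME_HOUR": [1.0, 3.0, 5.0],
})

# transform("mean") computes the mean per Site and repeats it on every row of that Site.
df["TIME_HOUR_mean"] = df.groupby("Site")["TIME_HOUR"].transform("mean")
print(df)
```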
