I'm trying to chain a join and a groupby operation together. The inputs and the operations I want to do are shown below. I want to group by all the columns except the one used in agg. Is there a way of doing this without listing out all the column names, like groupby("colA","colB")? I tried groupby(df1.*) but that didn't work. In this case I know that I'd like to group by all the columns in df1. Many thanks.
Input1:
colA | ColB
--------------
A | 100
B | 200
Input2:
colAA | ColBB
--------------
A | Group1
B | Group2
A | Group2
df1.join(df2, df1.colA==df2.colAA,"left").drop("colAA").groupby("colA","colB").agg(collect_set("colBB"))
#Is there a way that I do not need to list ("colA","colB") in groupby? There will be many columns.
Output:
colA | ColB | collect_set
--------------
A | 100 | (Group1,Group2)
B | 200 | (Group2)
Based on your clarifying comments, use df1.columns
df1.join(df2, df1.colA==df2.colAA,"left").drop("colAA").groupby(df1.columns).agg(collect_set("colBB").alias('new')).show()
+----+----+----------------+
|colA|ColB| new|
+----+----+----------------+
| A| 100|[Group2, Group1]|
| B| 200| [Group2]|
+----+----+----------------+
Just simple:
.groupby(df1.columns)
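If you instead want every column of the joined frame except the one being aggregated (rather than exactly df1's columns), a list comprehension over the joined frame's columns works too. A small sketch, reusing the join from above (the name check is case-insensitive because Spark keeps the original casing in .columns):
from pyspark.sql.functions import collect_set

joined = df1.join(df2, df1.colA == df2.colAA, "left").drop("colAA")
group_cols = [c for c in joined.columns if c.lower() != "colbb"]  # everything except the aggregated column
joined.groupby(group_cols).agg(collect_set("colBB").alias("new")).show()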
Let's say I have a dataframe df looking like this:
|ColA |
|---------|
|B=7 |
|(no data)|
|C=5 |
|B=3,C=6 |
How do I extract the data into new colums, so it looks like this:
|ColA | B | C |
|------|---|---|
|True | 7 | |
|False | | |
|True | | 5 |
|True | 3 | 6 |
For filling the columns I know I can use regex .extract, as shown in this solution.
But how do I set the Column name at the same time? So far I use a loop over df.ColA.loc[df["ColA"].isna()].iteritems(), but that does not seem like the best option for a lot of data.
You could use str.extractall to get the data, then reshape the output and join to a derivative of the original dataframe:
# create the B/C columns
df2 = (df['ColA'].str.extractall('([^=]+)=([^=,]+),?')
          .set_index(0, append=True)
          .droplevel('match')[1]
          .unstack(0, fill_value='')
       )
# rework ColA and join previous output
df.notnull().join(df2).fillna('')
# or if several columns:
df.assign(ColA=df['ColA'].notnull()).join(df2).fillna('')
output:
ColA B C
0 True 7
1 False
2 True 5
3 True 3 6
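For reference, a minimal frame to test this against (a sketch that models the "(no data)" row as NaN):
import numpy as np
import pandas as pd

# sample frame from the question; the "(no data)" row is modelled as NaN
df = pd.DataFrame({'ColA': ['B=7', np.nan, 'C=5', 'B=3,C=6']})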
We have a feature request where we want to pull a table from the database as per the request and perform some transformations on it. But these tables may have duplicate columns (columns with the same name). I want to combine these columns into a single column,
for example:
Request for input table named ages:
+---+-----+-----+-----+
|age| ids | ids | ids |
+---+-----+-----+-----+
| 25|  1  |  2  |  3  |
| 26|  4  |  5  |  6  |
+---+-----+-----+-----+
the output table is :
+---+-----------+
|age| ids       |
+---+-----------+
| 25| [1, 2, 3] |
| 26| [4, 5, 6] |
+---+-----------+
Next time we might get a request for an input table named names:
+----+---------+---------+
|name| company | company |
+----+---------+---------+
| abc|    a    |    b    |
| xyc|    c    |    d    |
+----+---------+---------+
The output table should be:
+----+----------+
|name| company  |
+----+----------+
| abc| [a, b]   |
| xyc| [c, d]   |
+----+----------+
So basically, I need to find the columns with the same name and then merge the values in them.
You can convert the Spark dataframe into a pandas dataframe, perform the necessary operations, and convert it back to a Spark dataframe.
I have added necessary comments for clarity.
Using Pandas:
import pandas as pd
from collections import Counter
pd_df = spark_df.toPandas() #converting spark dataframe to pandas dataframe
pd_df.head()
def concatDuplicateColumns(df):
    duplicate_cols = []  # to store duplicate column names
    for col in dict(Counter(df.columns)):
        if dict(Counter(df.columns))[col] > 1:
            duplicate_cols.append(col)
    final_dict = {}
    for cols in duplicate_cols:
        final_dict[cols] = []  # initialize dict
    for cols in duplicate_cols:
        for ind in df.index.tolist():
            final_dict[cols].append(df.loc[ind, cols].tolist())
    df.drop(duplicate_cols, axis=1, inplace=True)
    for cols in duplicate_cols:
        df[cols] = final_dict[cols]
    return df
final_df = concatDuplicateColumns(pd_df)
spark_df = spark.createDataFrame(final_df)
spark_df.show()
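If the round trip through pandas is a concern for larger tables, roughly the same merge can be sketched directly in PySpark by first making the duplicate names unique and then packing them into an array. This is only a sketch of the idea, not tested against your exact schema:
from pyspark.sql import functions as F

cols = spark_df.columns                                             # e.g. ['age', 'ids', 'ids', 'ids']
deduped = spark_df.toDF(*[f"{c}_{i}" for i, c in enumerate(cols)])  # make every column name unique

out = deduped
for name in set(cols):
    uniques = [f"{name}_{i}" for i, c in enumerate(cols) if c == name]
    if len(uniques) > 1:
        out = out.withColumn(name, F.array(*uniques)).drop(*uniques)  # merge duplicates into one array column
    else:
        out = out.withColumnRenamed(uniques[0], name)                 # restore the original name
out.show()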
My question is very similar to the one asked but unanswered here
Replicating GROUP_CONCAT for pandas.DataFrame
I have a pandas DataFrame which I want to group-concat into a new DataFrame
+------+---------+
| team | user |
+------+---------+
| A | elmer |
| A | daffy |
| A | bugs |
| B | dawg |
| A | foghorn |
+------+---------+
Becoming
+------+---------------------------------------+
| team | group_concat(user) |
+------+---------------------------------------+
| A | elmer,daffy,bugs,foghorn |
| B | dawg |
+------+---------------------------------------+
As answered in the original topic, it can be done via any of these:
df.groupby('team').apply(lambda x: ','.join(x.user))
df.groupby('team').apply(lambda x: list(x.user))
df.groupby('team').agg({'user' : lambda x: ', '.join(x)})
But the resulting object is not a pandas DataFrame anymore.
How can I get the GROUP_CONCAT results in the original Pandas DataFrame as a new column?
Cheers
You can apply list and join after grouping, then reset_index to get the DataFrame back.
output_df = df.groupby('team')['user'].apply(lambda x: ",".join(list(x))).reset_index()
output_df.rename(columns={'user': 'group_concat(user)'})
team group_concat(user)
0 A elmer,daffy,bugs,foghorn
1 B dawg
Let's break down the below code:
Firstly, groupby team and use apply on the user column to join its elements using a ','
Then, reset the index and rename the resulting dataframe (axis=1 refers to columns, not rows)
res = (df.groupby('team')['user']
.apply(lambda x: ','.join(str(i) for i in x))).reset_index().rename({'user':'group_concat(user)'},axis=1)
Output:
team group_concat(user)
0 A elmer,daffy,bugs,foghorn
1 B dawg
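Alternatively, pandas named aggregation keeps everything in one chain and returns a DataFrame directly. A small sketch (the helper name group_concat_user is only there so the keyword is a valid identifier, and is renamed afterwards):
import pandas as pd

df = pd.DataFrame({'team': ['A', 'A', 'A', 'B', 'A'],
                   'user': ['elmer', 'daffy', 'bugs', 'dawg', 'foghorn']})

out = (df.groupby('team', as_index=False)
         .agg(group_concat_user=('user', ','.join))       # join each group's users with a comma
         .rename(columns={'group_concat_user': 'group_concat(user)'}))
print(out)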
My Data looks like this:
id | duration | action1 | action2 | ...
---------------------------------------------
1 | 10 | A | D
1 | 10 | B | E
2 | 25 | A | E
1 | 7 | A | G
I want to group it by ID (which works great!):
df.rdd.groupBy(lambda x: x['id']).mapValues(list).collect()
And now I would like to group values within each group by duration to get something like this:
[(id=1, [(duration=10, [(action1=A, action2=D), (action1=B, action2=E)]),
         (duration=7,  [(action1=A, action2=G)])]),
 (id=2, [(duration=25, [(action1=A, action2=E)])])]
And here is where I don't know how to do a nested group by. Any tips?
There is no need to serialize to rdd. Here's a generalized way to group by multiple columns and aggregate the rest of the columns into lists without hard-coding all of them:
from pyspark.sql.functions import collect_list
grouping_cols = ["id", "duration"]
other_cols = [c for c in df.columns if c not in grouping_cols]
df.groupBy(grouping_cols).agg(*[collect_list(c).alias(c) for c in other_cols]).show()
#+---+--------+-------+-------+
#| id|duration|action1|action2|
#+---+--------+-------+-------+
#| 1| 10| [A, B]| [D, E]|
#| 2| 25| [A]| [E]|
#| 1| 7| [A]| [G]|
#+---+--------+-------+-------+
Update
If you need to preserve the order of the actions, the best way is to use a pyspark.sql.Window with an orderBy(). This is because there seems to be some ambiguity as to whether or not a groupBy() following an orderBy() maintains that order.
Suppose your timestamps are stored in a column "ts". You should be able to do the following:
from pyspark.sql import Window
w = Window.partitionBy(grouping_cols).orderBy("ts")
grouped_df = df.select(
*(grouping_cols + [collect_list(c).over(w).alias(c) for c in other_cols])
).distinct()
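If you then want the nested shape sketched in the question (each id holding its per-duration groups), one more aggregation over the grouped result gets close. A rough sketch, assuming the columns are id, duration, action1 and action2:
from pyspark.sql import functions as F

# collect the actions per (id, duration), then nest those rows per id
per_duration = df.groupBy("id", "duration").agg(
    *[F.collect_list(c).alias(c) for c in ["action1", "action2"]]
)
nested = per_duration.groupBy("id").agg(
    F.collect_list(F.struct("duration", "action1", "action2")).alias("by_duration")
)
nested.show(truncate=False)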
I have two PySpark DataFrames (NOT pandas):
df1 =
+----------+--------------+-----------+---------+
|pk |num_id |num_pk |qty_users|
+----------+--------------+-----------+---------+
| 63479840| 12556940| 298620| 13|
| 63480030| 12557110| 298620| 9|
| 63835520| 12627890| 299750| 8|
df2 =
+----------+--------------+-----------+----------+
|pk2 |num_id2 |num_pk2 |qty_users2|
+----------+--------------+-----------+----------+
| 63479800| 11156940| 298620| 10 |
| 63480030| 12557110| 298620| 1 |
| 63835520| 12627890| 299750| 2 |
I want to join both DataFrames in order to get one DataFrame df:
+----------+--------------+-----------+---------+
|pk |num_id |num_pk |total |
+----------+--------------+-----------+---------+
| 63479840| 12556940| 298620| 13|
| 63479800| 11156940| 298620| 10|
| 63480030| 12557110| 298620| 10|
| 63835520| 12627890| 299750| 10|
The only condition for merging is that I want to sum up the values of qty_users for those rows that have the same values of < pk, num_id, num_pk > in df1 and df2. Just as I showed in the above example.
How can I do it?
UPDATE:
This is what I did:
newdf = df1.join(df2,(df1.pk==df2.pk2) & (df1.num_pk==df2.num_pk2) & (df1.num_id==df2.num_id2),'outer')
newdf = newdf.withColumn('total', sum(newdf[col] for col in ["qty_users","qty_users2"]))
But it gives me 9 columns instead of 4 columns. How to solve this issue?
The outer join will return all columns from both tables. Also, we have to fill the null values in qty_users and qty_users2, since sum would otherwise return null.
Finally, we can select using the coalesce function:
from pyspark.sql import functions as F
newdf = df1.join(df2,(df1.pk==df2.pk2) & (df1.num_pk==df2.num_pk2) & (df1.num_id==df2.num_id2),'outer').fillna(0,subset=["qty_users","qty_users2"])
newdf = newdf.withColumn('total', sum(newdf[col] for col in ["qty_users","qty_users2"]))
newdf.select(*[F.coalesce(c1,c2).alias(c1) for c1,c2 in zip(df1.columns,df2.columns)][:-1]+['total']).show()
+--------+--------+------+-----+
| pk| num_id|num_pk|total|
+--------+--------+------+-----+
|63479840|12556940|298620| 13|
|63480030|12557110|298620| 10|
|63835520|12627890|299750| 10|
|63479800|11156940|298620| 10|
+--------+--------+------+-----+
Hope this helps!
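Another way to get the same result, if you prefer to avoid coalesce entirely, is to align df2's column names with df1's and union before aggregating. A rough sketch, assuming both frames have exactly these four columns:
from pyspark.sql import functions as F

df2_aligned = df2.toDF("pk", "num_id", "num_pk", "qty_users")  # rename df2's columns to match df1
total_df = (df1.unionByName(df2_aligned)
               .groupBy("pk", "num_id", "num_pk")
               .agg(F.sum("qty_users").alias("total")))
total_df.show()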
Does this output what you want?
df3 = pd.concat([df1, df2]).groupby(['pk','num_id','num_pk'], as_index=False)['qty_users'].sum()
The merging of your 2 dataframes is achieved via pd.concat([df1, df2]) (note that as_index=False belongs to groupby, not concat)
Finding the sum of the qty_users column when all other columns are the same first requires grouping by those columns
groupby(['pk','num_id','num_pk'], as_index=False)
and then finding the grouped sum of qty_users
['qty_users'].sum()
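Note that df2's columns are actually named pk2, num_id2, num_pk2 and qty_users2, so they would need to be renamed before concatenating. A minimal pandas sketch, assuming both frames were converted with toPandas():
import pandas as pd

# align df2's column names with df1's before concatenating
df2_aligned = df2.rename(columns={'pk2': 'pk', 'num_id2': 'num_id',
                                  'num_pk2': 'num_pk', 'qty_users2': 'qty_users'})
df3 = (pd.concat([df1, df2_aligned], ignore_index=True)
         .groupby(['pk', 'num_id', 'num_pk'], as_index=False)['qty_users'].sum()
         .rename(columns={'qty_users': 'total'}))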