pyspark using agg to concat string after groupBy

pyspark using agg to concat string after groupBy - python

In pandas dataframe, I am able to do
df2 = df.groupBy('name').agg({'id': 'first', 'grocery': ','.join})
from
name id grocery
Mike 01 Apple
Mike 01 Orange
Kate 99 Beef
Kate 99 Wine
to
name id grocery
Mike 01 Apple,Orange
Kate 99 Beef,Wine
since id is the same across multiple rows for the same person, I just took the first one for each person, and concat the grocery.
I can't seem to make this work in pyspark. How can I do the same thing in pyspark? I want the grocery to be string instead of list

Use collect_list to collect elements into a list and then join the list as string with concat_ws:
import pyspark.sql.functions as f
df.groupBy("name")
.agg(
f.first("id").alias("id"),
f.concat_ws(",", f.collect_list("grocery")).alias("grocery")
).show()
#+----+---+------------+
#|name| id| grocery|
#+----+---+------------+
#|Kate| 99| Beef,Wine|
#|Mike| 01|Apple,Orange|
#+----+---+------------+

Related

How to collapse all rows in pandas dataframe across all columns

I am trying to collapse all the rows of a dataframe into one single row across all columns.
My data frame looks like the following:
name
job
value
bob
business
100
NAN
dentist
Nan
jack
Nan
Nan
I am trying to get the following output:
name
job
value
bob jack
business dentist
100
I am trying to group across all columns, I do not care if the value column is converted to dtype object (string).
I'm just trying to collapse all the rows across all columns.
I've tried groupby(index=0) but did not get good results.

You could apply join:
out = df.apply(lambda x: ' '.join(x.dropna().astype(str))).to_frame().T
Output:
name job value
0 bob jack business dentist 100.0

Try this:
new_df = df.agg(lambda x: x.dropna().astype(str).tolist()).str.join(' ').to_frame().T
Output:
>>> new_df
name job value
0 bob jack business dentist 100.0

Split a column in Python pandas

I'm sorry if I can't explain properly the issue I'm facing since I don't really understand it that much. I'm starting to learn Python and to practice I try to do projects that I face in my day to day job, but using Python. Right now I'm stuck with a project and would like some help or guidance, I have a dataframe that looks like this
Index Country Name IDs
0 USA John PERSID|12345
SSO|John123
STARTDATE|20210101
WAVE|WAVE39
--------------------------------------------
1 UK Jane PERSID|25478
SSO|Jane123
STARTDATE|20210101
WAVE|WAVE40
(I apologize since I can't create a table on this post since the separator of the ids is a | ) but you get the idea, every person has 4 IDs and they are all on the same "cell" of the dataframe, each ID separated from its value by pipes, I need to split those ID's from their values, and put them on separate columns so I get something like this
index
Country
Name
PERSID
SSO
STARTDATE
WAVE
0
USA
John
12345
John123
20210101
WAVE39
1
UK
Jane
25478
Jane123
20210101
WAVE40
Now, adding to the complexity of the table itself, I have another issues, for example, the order of the ID's won't be the same for everyone and some of them will be missing some of the ID's.
I honestly have no idea where to begin, the first thing I thought about trying was to split the IDs column by spaces and then split the result of that by pipes, to create a dictionary, convert it to a dataframe and then join it to my original dataframe using the index.
But as I said, my knowledge in python is quite pathetic, so that failed catastrophically, I only got to the first step of that plan with a Client_ids = df.IDs.str.split(), that returns a series with the IDs separated one from each other like ['PERSID|12345', 'SSO|John123', 'STARTDATE|20210101', 'WAVE|Wave39'] but I can't find a way to split it again because I keep getting an error saying the the list object doesn't have attribute 'split'
How should I approach this? what alternatives do I have to do it?
Thank you in advance for any help or recommendation

You have a few options to consider to do this. Here's how I would do it.
I will split the values in IDs by \n and |. Then create a dictionary with key:value for each split of values of |. Then join it back to the dataframe and drop the IDs and temp columns.
import pandas as pd
df = pd.DataFrame([
["USA", "John","""PERSID|12345
SSO|John123
STARTDATE|20210101
WAVE|WAVE39"""],
["UK", "Jane", """PERSID|25478
SSO|Jane123
STARTDATE|20210101
WAVE|WAVE40"""],
["CA", "Jill", """PERSID|12345
STARTDATE|20210201
WAVE|WAVE41"""]], columns=['Country', 'Name', 'IDs'])
df['temp'] = df['IDs'].str.split('\n|\|').apply(lambda x: {k:v for k,v in zip(x[::2],x[1::2])})
df = df.join(pd.DataFrame(df['temp'].values.tolist(), df.index))
df = df.drop(columns=['IDs','temp'],axis=1)
print (df)
With this approach, it does not matter if a row of data is missing. It will sort itself out.
The output of this will be:
Original DataFrame:
Country Name IDs
0 USA John PERSID|12345
SSO|John123
STARTDATE|20210101
WAVE|WAVE39
1 UK Jane PERSID|25478
SSO|Jane123
STARTDATE|20210101
WAVE|WAVE40
2 CA Jill PERSID|12345
STARTDATE|20210201
WAVE|WAVE41
Updated DataFrame:
Country Name PERSID SSO STARTDATE WAVE
0 USA John 12345 John123 20210101 WAVE39
1 UK Jane 25478 Jane123 20210101 WAVE40
2 CA Jill 12345 NaN 20210201 WAVE41
Note that Jill did not have a SSO value. It set the value to NaN by default.

First generate your dataframe
df1 = pd.DataFrame([["USA", "John","""PERSID|12345
SSO|John123
STARTDATE|20210101
WAVE|WAVE39"""],
["UK", "Jane", """
PERSID|25478
SSO|Jane123
STARTDATE|20210101
WAVE|WAVE40"""]], columns=['Country', 'Name', 'IDs'])
Then split the last cell using lambda
df2 = pd.DataFrame(list(df.apply(lambda r: {p:q for p,q in [x.split("|") for x in r.IDs.split()]}, axis=1).values))
Lastly concat the dataframes together.
df = pd.concat([df1, df2], axis=1)

Quick solution
remove_word = ["PERSID", "SSO" ,"STARTDATE" ,"WAVE"]
for i ,col in enumerate(remove_word):
df[col] = df.IDs.str.replace('|'.join(remove_word), '', regex=True).str.split("|").str[i+1]

Use regex named capture groups with pd.String.str.extract
def ng(x):
return f'(?:{x}\|(?P<{x}>[^\n]+))?\n?'
fields = ['PERSID', 'SSO', 'STARTDATE', 'WAVE']
pat = ''.join(map(ng, fields))
df.drop('IDs', axis=1).join(df['IDs'].str.extract(pat))
Country Name PERSID SSO STARTDATE WAVE
0 USA John 12345 John123 20210101 WAVE39
1 UK Jane 25478 Jane123 20210101 WAVE40
2 CA Jill 12345 NaN 20210201 WAVE41
Setup
Credit to #JoeFerndz for sample df.
NOTE: this sample has missing values in some 'IDs'.
df = pd.DataFrame([
["USA", "John","""PERSID|12345
SSO|John123
STARTDATE|20210101
WAVE|WAVE39"""],
["UK", "Jane", """PERSID|25478
SSO|Jane123
STARTDATE|20210101
WAVE|WAVE40"""],
["CA", "Jill", """PERSID|12345
STARTDATE|20210201
WAVE|WAVE41"""]], columns=['Country', 'Name', 'IDs'])

Building another dataframe using minimun values of the columns using Pandas

I have a probleme where I have a pandas DataFrame with name df_x where my index is the name of persons and my columns are the name of products. The values are the distance between these persons to the products
I want to build another table containing the columns of df_x and as values the name of the person that have the minimun distance to this product.
Is there a simple way to do this using pandas or np? Do I need to use for loop?
Example:
(index) Banana Apple
Mike 7 2
Kevin 2 4
James 3 6
so the final table should be
(index) Banana Apple
Name Kevin Mike

IIUC, DataFrame.idxmax
df_x.idxmax().to_frame('Name').T
Output
Banana Apple
Name Mike James

Issue with Merging dataframes

I've read through the pandas documentation on merging but am still quite confused on how to apply them to my case. I have 2 dataframes that I'd like to merge - I'd like to merge on the common column 'Town', and also merge on the 'values' in a column which are the 'column names' in the 2nd df.
The first df summarizes the top 5 most common venues in each town:
The second df summarizes the frequencies of all the venue categories in each town:
The output I want:
Ang Mo Kio | Food Court | Coffee Shop | Dessert Shop | Chinese Restaurant | Japanese Restaurant | Freq of Food Court | Freq of Coffee Shop |...
What I've tried with merge:
newdf = pd.merge(sg_onehot_grouped, sg_venues_sorted, left_on=['Town'], right_on=['1st Most Common Venue'])
#only trying the 1st column because wanted to scale down my code
but I got an empty dataframe with the column names as all the columns from both dataframes.
Appreciate any help. Thanks.

Transforming dataframe by making column using unique row values python pandas

I have a following dataframe
Name Activities
Eric Soccer,Baseball,Swimming
Natasha Soccer
Mike Basketball,Baseball
I need to transform it into following dataframe
Activities Name
Soccer Eric,Natasha,Mike
Swimming Eric
Baseball Eric,Mike
Basketball Mike
how should I do it?

Using pd.get_dummies
First, use get_dummies:
tmp = df.set_index('Name').Activities.str.get_dummies(sep=',')
Now using stack and agg:
tmp.mask(tmp.eq(0)).stack().reset_index('Name').groupby(level=0).agg(', '.join)
Name
Baseball Eric, Mike
Basketball Mike
Soccer Eric, Natasha
Swimming Eric
Using str.split and melt
(df.set_index('Name').Activities.str.split(',', expand=True)
.reset_index().melt(id_vars='Name').groupby('value').Name.agg(', '.join))

You can separate the Activities by performing a split and then converting the resulting list to a Series.
Then melt from wide to long format, and groupby the resulting value column (which is Activities).
In your grouped data frame, join the Name fields associated with each Activity.
Like this:
(df.Activities.str.split(",")
.apply(pd.Series)
.merge(df, right_index=True, left_index=True)
.melt(id_vars="Name", value_vars=[0,1,2])
.groupby("value")
.agg({'Name': lambda x: ','.join(x)})
.reset_index()
.rename(columns={"value":"Activities"})
)
Output:
Activities Name
0 Baseball Eric,Mike
1 Basketball Mike
2 Soccer Eric,Natasha
3 Swimming Eric
Note: The reset_index() and rename() methods at the end of the chain are just cosmetic; the main operations are complete after the groupby aggregation.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

pyspark using agg to concat string after groupBy - python

Related

How to collapse all rows in pandas dataframe across all columns

Split a column in Python pandas

Building another dataframe using minimun values of the columns using Pandas

Issue with Merging dataframes

Transforming dataframe by making column using unique row values python pandas

Categories

Resources