how to concat values of columns with same name in pyspark - python

We have a feature request where we pull a table from the database on demand and perform some transformations on it. These tables may have duplicate columns (columns with the same name), and I want to combine those columns into a single column.
For example:
A request comes in for the input table named ages:
+---+-----+-----+-----+
|age| ids | ids | ids |
+---+-----+-----+-----+
| 25|   1 |   2 |   3 |
| 26|   4 |   5 |   6 |
+---+-----+-----+-----+
The output table is:
+---+-----------+
|age| ids       |
+---+-----------+
| 25| [1, 2, 3] |
| 26| [4, 5, 6] |
+---+-----------+
Next time we might get a request for the input table named names:
+----+---------+---------+
|name| company | company |
+----+---------+---------+
| abc|       a |       b |
| xyc|       c |       d |
+----+---------+---------+
The output table should be:
+----+---------+
|name| company |
+----+---------+
| abc| [a, b]  |
| xyc| [c, d]  |
+----+---------+
So basically I need to find the columns with the same name and then merge their values.

You can convert the Spark dataframe into a pandas dataframe, perform the necessary operations, and convert it back to a Spark dataframe.
I have added comments for clarity.
Using pandas:
import pandas as pd
from collections import Counter

pd_df = spark_df.toPandas()  # convert the Spark dataframe to a pandas dataframe
pd_df.head()

def concatDuplicateColumns(df):
    col_counts = Counter(df.columns)
    duplicate_cols = []  # to store duplicate column names
    for col in col_counts:
        if col_counts[col] > 1:
            duplicate_cols.append(col)
    final_dict = {}
    for cols in duplicate_cols:
        final_dict[cols] = []  # initialize dict
    for cols in duplicate_cols:
        for ind in df.index.tolist():
            # df.loc[ind, cols] returns all same-named columns for this row
            final_dict[cols].append(df.loc[ind, cols].tolist())
    df.drop(duplicate_cols, axis=1, inplace=True)
    for cols in duplicate_cols:
        df[cols] = final_dict[cols]
    return df

final_df = concatDuplicateColumns(pd_df)
spark_df = spark.createDataFrame(final_df)
spark_df.show()
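If you would rather stay in Spark and skip the pandas round trip, here is a rough sketch of the same idea in pure PySpark (my own addition, not part of the answer above; the helper name merge_duplicate_columns is made up). It renames the columns to unique temporary names with toDF so they can be referenced unambiguously, then collects same-named columns into an array with F.array:
from collections import Counter
from pyspark.sql import functions as F

def merge_duplicate_columns(df):
    # Give every column a unique temporary name so that references to
    # duplicate names are no longer ambiguous.
    counts = Counter(df.columns)
    tmp_names = [f"{c}__{i}" for i, c in enumerate(df.columns)]
    renamed = df.toDF(*tmp_names)
    exprs, seen = [], set()
    for i, c in enumerate(df.columns):
        if c in seen:
            continue
        seen.add(c)
        if counts[c] > 1:
            # Combine all columns sharing this name into one array column.
            positions = [j for j, name in enumerate(df.columns) if name == c]
            exprs.append(F.array(*[F.col(tmp_names[j]) for j in positions]).alias(c))
        else:
            exprs.append(F.col(tmp_names[i]).alias(c))
    return renamed.select(*exprs)

merge_duplicate_columns(spark_df).show()
Note that F.array expects the duplicate columns to share a data type; if they differ you would have to cast them first.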

Related

PySpark - how to select all columns to be used in groupby

I'm trying to chain a join and a groupby operation together. The inputs and operations I want to do look like below. I want to group by all the columns except the one used in agg. Is there a way of doing this without listing out all the column names, like groupby("colA", "colB")? I tried groupby(df1.*) but that didn't work. In this case I know that I'd like to group by all the columns in df1. Many thanks.
Input1:
colA | ColB
--------------
A | 100
B | 200
Input2:
colAA | ColBB
--------------
A | Group1
B | Group2
A | Group2
df1.join(df2, df1.colA==df2.colAA, "left").drop("colAA").groupby("colA", "colB").agg(collect_set("colBB"))
# Is there a way that I do not need to list ("colA", "colB") in groupby? There will be many columns.
Output:
colA | ColB | collect_set
--------------
A | 100 | (Group1,Group2)
B | 200 | (Group2)
Based on your clarifying comments, use df1.columns
df1.join(df2, df1.colA==df2.colAA,"left").drop("colAA").groupby(df1.columns).agg(collect_set("colBB").alias('new')).show()
+----+----+----------------+
|colA|ColB| new|
+----+----+----------------+
| A| 100|[Group2, Group1]|
| B| 200| [Group2]|
+----+----+----------------+
Just simple:
.groupby(df1.columns)
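As a small variation (my own sketch, not part of the answer): since groupby accepts a plain list, a list comprehension lets you drop any column you do not want to group by, e.g. excluding ColB here purely for illustration:
from pyspark.sql.functions import collect_set

group_cols = [c for c in df1.columns if c != "ColB"]  # "ColB" excluded only as an example
(df1.join(df2, df1.colA == df2.colAA, "left")
    .drop("colAA")
    .groupby(group_cols)
    .agg(collect_set("colBB").alias("new"))
    .show())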

How can I verify the schema (number and name of the columns) of a Dataframe in pyspark?

I have to read a CSV file and verify the name and the number of columns of the dataframe. The minimum number of columns is 3 and they have to be 'id', 'name' and 'phone'. There is no problem with having more columns than that, but it always needs to have at least those 3 columns with those exact names. Otherwise, the program should fail.
For example:
Correct:
+-----+-----+-----+ +-----+-----+-----+-----+
| id| name|phone| | id| name|phone|unit |
+-----+-----+-----+ +-----+-----+-----+-----+
|3940A|jhon |1345 | |3940A|jhon |1345 | 222 |
|2BB56|mike | 492 | |2BB56|mike | 492 | 333 |
|3(401|jose |2938 | |3(401|jose |2938 | 444 |
+-----+-----+-----+ +-----+-----+-----+-----+
Incorrect:
+-----+-----+-----+ +-----+-----+
| sku| nomb|phone| | sku| name|
+-----+-----+-----+ +-----+-----+
|3940A|jhon |1345 | |3940A|jhon |
|2BB56|mike | 492 | |2BB56|mike |
|3(401|jose |2938 | |3(401|jose |
+-----+-----+-----+ +-----+-----+
Using a simple Python if/else statement should do the job:
mandatory_cols = ["id", "name", "phone"]
if all(c in df.columns for c in mandatory_cols):
    ...  # your logic goes here
else:
    raise ValueError("missing columns!")
Here's an example of how to check whether the required columns exist in your dataframe:
from pyspark.sql import Row

def check_columns_exist(cols):
    if 'id' in cols and 'name' in cols and 'phone' in cols:
        print("All required columns are present")
    else:
        print("Does not have all the required columns")

data = [Row(id="3940A", name="john", phone="1345", unit=222),
        Row(id="2BB56", name="mike", phone="492", unit=333)]
df = spark.createDataFrame(data)
check_columns_exist(df.columns)

data1 = [Row(id="3940A", name="john", unit=222),
         Row(id="2BB56", name="mike", unit=333)]
df1 = spark.createDataFrame(data1)
check_columns_exist(df1.columns)
RESULT:
All required columns are present
Does not have all the required columns
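If the program really has to fail rather than just print a message, one possible sketch (my own, combining the two answers above; validate_schema is a made-up helper name) computes the missing columns and raises with them listed:
def validate_schema(df, required=("id", "name", "phone")):
    # Fail loudly, naming exactly which required columns are absent.
    missing = [c for c in required if c not in df.columns]
    if missing:
        raise ValueError(f"Dataframe is missing required columns: {missing}")

validate_schema(df)    # passes silently: all three columns are present
validate_schema(df1)   # raises ValueError: missing ['phone']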

Python DataFrame - Select dataframe rows based on values in a column of same dataframe

I'm struggling with a dataframe related problem.
import pandas as pd

columns = [desc[0] for desc in cursor.description]
data = cursor.fetchall()
df2 = pd.DataFrame(list(data), columns=columns)
df2 is as follows:
| Col1 | Col2 |
| -------- | -------------- |
| 2145779 | 2 |
| 8059234 | 3 |
| 2145779 | 3 |
| 4265093 | 2 |
| 2145779 | 2 |
| 1728234 | 5 |
I want to make a list of the values in Col1 where the value of Col2 is "3".
You can use boolean indexing:
out = df2.loc[df2.Col2.eq(3), "Col1"].agg(list)
print(out)
Prints:
[8059234, 2145779]
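One caveat (my own note, not from the answer): if Col2 came back from the database as strings rather than integers (the question writes the value as "3"), an explicit cast keeps the comparison safe, and .tolist() gives the same result as .agg(list):
# Cast to string before comparing, in case Col2 holds strings (an assumption).
out = df2.loc[df2["Col2"].astype(str).eq("3"), "Col1"].tolist()
print(out)  # [8059234, 2145779]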

Python,Pandas,DataFrame, add new column doing SQL GROUP_CONCAT equivalent

My question is very similar to the one asked but unanswered here
Replicating GROUP_CONCAT for pandas.DataFrame
I have a pandas DataFrame that I want to group-concat into a data frame
+------+---------+
| team | user |
+------+---------+
| A | elmer |
| A | daffy |
| A | bugs |
| B | dawg |
| A | foghorn |
+------+---------+
Becoming
+------+---------------------------------------+
| team | group_concat(user) |
+------+---------------------------------------+
| A | elmer,daffy,bugs,foghorn |
| B | dawg |
+------+---------------------------------------+
As answered in the original topic, it can be done via any of these:
df.groupby('team').apply(lambda x: ','.join(x.user))
df.groupby('team').apply(lambda x: list(x.user))
df.groupby('team').agg({'user' : lambda x: ', '.join(x)})
But the resulting object is not a pandas DataFrame anymore.
How can I get the GROUP_CONCAT results in the original Pandas DataFrame as a new column?
Cheers
You can join each group's values inside apply after grouping, then reset_index to get a regular dataframe back.
output_df = df.groupby('team')['user'].apply(lambda x: ",".join(list(x))).reset_index()
output_df.rename(columns={'user': 'group_concat(user)'})
team group_concat(user)
0 A elmer,daffy,bugs,foghorn
1 B dawg
Let's break down the code below:
Firstly, group by team and use apply on user to join its elements with a ","
Then reset the index and rename the resulting dataframe (axis=1 refers to columns, not rows)
res = (df.groupby('team')['user']
         .apply(lambda x: ','.join(str(i) for i in x))
         .reset_index()
         .rename({'user': 'group_concat(user)'}, axis=1))
Output:
team group_concat(user)
0 A elmer,daffy,bugs,foghorn
1 B dawg
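If what you actually want is a new column on the original dataframe (every row keeps its team and user and gains the concatenated string for its team), a sketch using groupby().transform would look like this (my own addition, not from the answers above):
# Broadcast the per-team concatenation back onto every original row.
df["group_concat(user)"] = df.groupby("team")["user"].transform(lambda x: ",".join(x))
print(df.head(2))
#   team   user        group_concat(user)
# 0    A  elmer  elmer,daffy,bugs,foghorn
# 1    A  daffy  elmer,daffy,bugs,foghorn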

Create pandas df from difference between two dfs

Set-up
I have two pandas data frames df1 and df2, each containing two columns with observations for id and its respective url,
df1:               df2:
| id | url |       | id | url |
------------       ------------
| 1  | url |       | 2  | url |
| 2  | url |       | 4  | url |
| 3  | url |       | 3  | url |
| 4  | url |       | 5  | url |
                   | 6  | url |
Some observations are in both dfs, which is clear from the id column, e.g. observation 2 and its url are in both dfs.
The positioning within the dfs of those 'double' observations does not necessarily have to be the same, e.g. observation 2 is in first row in df1 and second in df2.
Lastly, the dfs do not necessarily have the same number of observations, e.g. df1 has four observations while df2 has five.
Problem
I want to pull out all the observations that appear only in df2 and insert them into a new df (df3), i.e. I want to obtain:
| id | url |
------------
| 5 | url |
| 6 | url |
How do I go about this?
I've tried this answer but cannot get it to work for my two-column dataframes.
I've also tried this other answer, but this gives me an empty common dataframe.
Possibly something like this: df3 = df2[~df2.id.isin(df1.id.tolist())]
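A small aside (mine, not part of that answer): isin also accepts a Series directly, so df3 = df2[~df2.id.isin(df1.id)] gives the same result without the tolist() call.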
ID numbers make good index names:
df1.index = df1.id
df2.index = df2.id
Then use the very straightforward index.difference:
diff_index = df2.index.difference(df1.index)
df3 = df2.loc[diff_index]
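Another option (my own sketch, not part of either answer) is a left merge with indicator=True, which keeps both the id and url columns without touching the index:
# Keep the rows of df2 whose id never appears in df1.
df3 = (df2.merge(df1[["id"]], on="id", how="left", indicator=True)
          .query("_merge == 'left_only'")
          .drop(columns="_merge"))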
