Combine values from multiple columns into one Pyspark Dataframe [duplicate] - python
This question already has answers here:
Concat multiple columns of a dataframe using pyspark
(1 answer)
Concatenate columns in Apache Spark DataFrame
(18 answers)
How to concatenate multiple columns in PySpark with a separator?
(2 answers)
Closed 2 years ago.
I have a pyspark dataframe that has fields:
"id",
"fields_0_type" ,
"fields_0_price",
"fields_1_type",
"fields_1_price"
+----+-------------+--------------+-------------+--------------+
|id  |fields_0_type|fields_0_price|fields_1_type|fields_1_price|
+----+-------------+--------------+-------------+--------------+
|1234|Return       |45            |New          |50            |
+----+-------------+--------------+-------------+--------------+
How can I combine these values into two columns called "type" and "price", with the values separated by ","? The ideal dataframe looks like this:
+----+----------+------+
|id  |type      |price |
+----+----------+------+
|1234|Return,New|45,50 |
+----+----------+------+
Note that the data I am providing here is a sample. In reality I have more than just "type" and "price" columns that will need to be combined.
Update:
Thanks, it works. But is there any way to get rid of the extra ","? They are caused by blank values in the columns. Is there a way to just skip the columns with blank values?
What it is showing now:
+------------------------------------------------------------------+
|type |
+------------------------------------------------------------------+
|New,New,New,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, |
|New,New,Sale,Sale,New,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,|
+------------------------------------------------------------------+
How I want it:
+---------------------+
|type                 |
+---------------------+
|New,New,New          |
|New,New,Sale,Sale,New|
+---------------------+
Collect all the columns into an array, then use the concat_ws function.
Example:
df.show()
#+----+-------------+-------------+-------------+
#| id|fields_0_type|fields_1_type|fields_2_type|
#+----+-------------+-------------+-------------+
#|1234| a| b| c|
#+----+-------------+-------------+-------------+
from pyspark.sql.functions import array, concat_ws

columns = df.columns
columns.remove('id')
df.withColumn("type", concat_ws(",", array(*columns))).drop(*columns).show()
#+----+-----+
#| id| type|
#+----+-----+
#|1234|a,b,c|
#+----+-----+
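Note that concat_ws expects string (or array of string) input, so depending on the Spark version it may complain if some of the combined columns are numeric. A defensive variant of the same idea (a sketch, not part of the original answer) casts everything to string first:

from pyspark.sql.functions import array, col, concat_ws

# Cast each column to string before collecting them into the array.
df.withColumn("type", concat_ws(",", array(*[col(c).cast("string") for c in columns]))).\
    drop(*columns).\
    show()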
UPDATE:
df.show()
#+----+-------------+--------------+-------------+--------------+
#| id|fields_0_type|fields_0_price|fields_1_type|fields_1_price|
#+----+-------------+--------------+-------------+--------------+
#|1234| a| 45| b| 50|
#+----+-------------+--------------+-------------+--------------+
type_cols = [c for c in df.columns if 'type' in c]
price_cols = [c for c in df.columns if 'price' in c]

df.withColumn("type", concat_ws(",", array(*type_cols))).\
    withColumn("price", concat_ws(",", array(*price_cols))).\
    drop(*type_cols, *price_cols).\
    show()
#+----+----+-----+
#| id|type|price|
#+----+----+-----+
#|1234| a,b|45,50|
#+----+----+-----+
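The update above asks about the extra commas that come from blank values. concat_ws already skips NULL elements on its own; if the blanks are empty strings, one hedged option (Spark 2.4+ for array_remove) is to strip them out before concatenating:

from pyspark.sql.functions import array, array_remove, concat_ws

# Drop empty-string elements from the arrays before joining them with ",".
df.withColumn("type", concat_ws(",", array_remove(array(*type_cols), ""))).\
    withColumn("price", concat_ws(",", array_remove(array(*price_cols), ""))).\
    drop(*type_cols, *price_cols).\
    show()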
Related
Split & Map fields of array<string> in pyspark [duplicate]
This question already has answers here:
How to explode multiple columns of a dataframe in pyspark
(7 answers)
Closed last year.
I have a Pyspark dataframe with 7 columns, of which 6 are arrays and one is an array<array>. Sample data (one row, shown per column):

customer_id:  [18e644bb-4342-4c22-ab9b-a90fda50ad69, 70f0b998-3e4e-422d-b863-1f5f455c4883, 54a99992-5403-4946-b059-f71ec7ef2cca]
equipment_id: [1407c4a9-b075-4837-bada-690da10717cd, fc4632f3-302b-43cb-9245-ede2d1ac590f, 1407c4a9-b075-4837-bada-690da10717cd]
type:         [comm, comm, vspec]
language:     [cs, en-GB, pt-PT]
country:      [[CZ], [PT], [PT]]
lang_cnt_str: [(language = 'cs' AND country IS IN ('CZ')), (language = 'en-GB' AND country IS IN ('PT')), (language = 'pt-PT' AND country IS IN ('PT'))]
model_num:    [1618832612617, 1618832612858, 1618832614027]

I want to split and map every element of all columns. Below is the expected output:

+------------------------------------+------------------------------------+-----+--------+-------+---------------------------------------------+-------------+
|customer_id                         |equipment_id                        |type |language|country|lang_cnt_str                                 |model_num    |
+------------------------------------+------------------------------------+-----+--------+-------+---------------------------------------------+-------------+
|18e644bb-4342-4c22-ab9b-a90fda50ad69|1407c4a9-b075-4837-bada-690da10717cd|comm |cs      |[CZ]   |(language = 'cs' AND country IS IN ('CZ'))   |1618832612617|
|70f0b998-3e4e-422d-b863-1f5f455c4883|fc4632f3-302b-43cb-9245-ede2d1ac590f|comm |en-GB   |[PT]   |(language = 'en-GB' AND country IS IN ('PT'))|1618832612858|
|54a99992-5403-4946-b059-f71ec7ef2cca|1407c4a9-b075-4837-bada-690da10717cd|vspec|pt-PT   |[PT]   |(language = 'pt-PT' AND country IS IN ('PT'))|1618832614027|
+------------------------------------+------------------------------------+-----+--------+-------+---------------------------------------------+-------------+

How can we achieve this in pyspark? Can someone please help me? Thanks in advance!!
We exchanged a couple of comments above, and I think there's nothing special about the array(array(string)) column. So I post this answer to show the solution posted in How to explode multiple columns of a dataframe in pyspark:

import pyspark.sql.functions as f

df = spark.createDataFrame([
    (['1', '2', '3'], [['1'], ['2'], ['3']])
], ['col1', 'col2'])

df = (df
      .withColumn('zipped', f.arrays_zip(f.col('col1'), f.col('col2')))
      .withColumn('unzipped', f.explode(f.col('zipped')))
      .select(f.col('unzipped.col1'),
              f.col('unzipped.col2'))
      )
df.show()

The input is:

+---------+---------------+
|     col1|           col2|
+---------+---------------+
|[1, 2, 3]|[[1], [2], [3]]|
+---------+---------------+

And the output is:

+----+----+
|col1|col2|
+----+----+
|   1| [1]|
|   2| [2]|
|   3| [3]|
+----+----+
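For the actual seven array columns in the question, the same pattern generalizes. A hedged sketch (column names taken from the sample data, and relying on arrays_zip keeping the source column names as struct field names, just as the snippet above does):

import pyspark.sql.functions as f

array_cols = ['customer_id', 'equipment_id', 'type', 'language',
              'country', 'lang_cnt_str', 'model_num']

result = (df
          .withColumn('zipped', f.arrays_zip(*[f.col(c) for c in array_cols]))
          .withColumn('unzipped', f.explode(f.col('zipped')))
          .select(*[f.col('unzipped.{}'.format(c)) for c in array_cols])
          )
result.show(truncate=False)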
How to extract an element from a array in rows in pyspark [duplicate]
This question already has answers here:
Explode array data into rows in spark [duplicate]
(3 answers)
Dividing complex rows of dataframe to simple rows in Pyspark
(3 answers)
Explode in PySpark
(2 answers)
Closed 3 years ago.
I have a data frame of the following type:

col1|col2|col3|col4
xxxx|yyyy|zzzz|[, 111, por-BR, 2222]

I want my output to be of the following type:

+----+----+----+-----+
|col1|col2|col3|col4 |
+----+----+----+-----+
|  xx|  yy|  zz| 1111|
|  xx|  yy|  zz| 2222|
+----+----+----+-----+

col4 is an array and I want it to appear in the same column (or a different one), with one value per row. The following is my actual schema:

data1:pyspark.sql.dataframe.DataFrame
  col1:string
  col2:string
  col3:string
  col4:array
    element:struct
      colDept:string

I managed to do the following:

df = df.withColumn("col5", df["col4"].getItem(1)).withColumn("col4", df["col4"].getItem(0))
df.show()
+----+----+----+----+----+
|col1|col2|col3|col4|col5|
+----+----+----+----+----+
|  xx|  yy|  zz|1111|2222|
+----+----+----+----+----+

but I want it like this; can anyone help please?

#+----+----+----+-----+
#|col1|col2|col3|col4 |
#+----+----+----+-----+
#|  xx|  yy|  zz| 1111|
#|  xx|  yy|  zz| 2222|
#+----+----+----+-----+
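In the spirit of the linked duplicates, a minimal sketch with explode (assuming col4 is the array column from the schema above; if its elements are structs, the wanted field can be picked afterwards):

from pyspark.sql.functions import col, explode

# One output row per element of col4; the other columns are repeated.
df.select('col1', 'col2', 'col3', explode('col4').alias('col4')).show()

# If each exploded element is a struct, select its field, e.g. col('col4.colDept').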
Difference between two DataFrames based on only one column in pyspark [duplicate]
This question already has an answer here:
Pyspark filter dataframe by columns of another dataframe
(1 answer)
Closed 4 years ago.
I am looking for a way to find the difference of two DataFrames based on one column. For example:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sql_context = SQLContext(sc)
df_a = sql_context.createDataFrame([("fa", 3), ("fb", 5), ("fc", 7)], ["first name", "id"])
df_b = sql_context.createDataFrame([("la", 3), ("lb", 10), ("lc", 13)], ["last name", "id"])

DataFrame A:
+----------+---+
|first name| id|
+----------+---+
|        fa|  3|
|        fb|  5|
|        fc|  7|
+----------+---+

DataFrame B:
+---------+---+
|last name| id|
+---------+---+
|       la|  3|
|       lb| 10|
|       lc| 13|
+---------+---+

My goal is to find the difference of DataFrame A and DataFrame B considering column id; the output would be the following DataFrame:

+---------+---+
|last name| id|
+---------+---+
|       lb| 10|
|       lc| 13|
+---------+---+

I don't want to use the following method:

a_ids = set(df_a.rdd.map(lambda r: r.id).collect())
df_c = df_b.filter(~col('id').isin(a_ids))

I'm looking for an efficient method (in terms of memory and speed) where I don't have to collect the ids (the number of ids can be in the billions), maybe something like the RDD subtractByKey but for DataFrames.

PS: I can map df_a to an RDD, but I don't want to map df_b to an RDD.
You can do a left_anti join on column id:

df_b.join(df_a.select('id'), how='left_anti', on=['id']).show()

+---+---------+
| id|last name|
+---+---------+
| 10|       lb|
| 13|       lc|
+---+---------+
How to update a pyspark dataframe with new values from another dataframe?
I have two Spark dataframes:

Dataframe A:
|col_1 | col_2 | ... | col_n |
|val_1 | val_2 | ... | val_n |

and Dataframe B:
|col_1 | col_2 | ... | col_m |
|val_1 | val_2 | ... | val_m |

Dataframe B can contain duplicate, updated and new rows from Dataframe A. I want to write an operation in Spark that creates a new dataframe containing the rows from Dataframe A plus the updated and new rows from Dataframe B.

I started by creating a hash column containing only the columns that are not updatable. This is the unique id. So let's say col1 and col2 can change value (can be updated), but col3,...,coln are unique. I created a hash function as hash(col3,...,coln):

from pyspark.sql.functions import col, hash

A = A.withColumn("hash", hash(*[col(colname) for colname in unique_cols_A]))
B = B.withColumn("hash", hash(*[col(colname) for colname in unique_cols_B]))

Now I want to write some Spark code that basically selects the rows from B whose hash is not in A (so new rows and updated rows) and joins them into a new dataframe together with the rows from A. How can I achieve this in pyspark?

Edit: Dataframe B can have extra columns compared to Dataframe A, so a union is not possible.

Sample example

Dataframe A:
+-----+-----+
|col_1|col_2|
+-----+-----+
|    a|  www|
|    b|  eee|
|    c|  rrr|
+-----+-----+

Dataframe B:
+-----+-----+-----+
|col_1|col_2|col_3|
+-----+-----+-----+
|    a|  wew|    1|
|    d|  yyy|    2|
|    c|  rer|    3|
+-----+-----+-----+

Result (Dataframe C):
+-----+-----+-----+
|col_1|col_2|col_3|
+-----+-----+-----+
|    a|  wew|    1|
|    b|  eee| null|
|    c|  rer|    3|
|    d|  yyy|    2|
+-----+-----+-----+
This is closely related to update a dataframe column with new values, except that you also want to add the rows from DataFrame B. One approach is to first do what is outlined in the linked question, then union the result with DataFrame B and drop duplicates.

For example:

import pyspark.sql.functions as f

dfA.alias('a').join(dfB.alias('b'), on=['col_1'], how='left')\
    .select(
        'col_1',
        f.when(
            ~f.isnull(f.col('b.col_2')),
            f.col('b.col_2')
        ).otherwise(f.col('a.col_2')).alias('col_2'),
        'b.col_3'
    )\
    .union(dfB)\
    .dropDuplicates()\
    .sort('col_1')\
    .show()
#+-----+-----+-----+
#|col_1|col_2|col_3|
#+-----+-----+-----+
#|    a|  wew|    1|
#|    b|  eee| null|
#|    c|  rer|    3|
#|    d|  yyy|    2|
#+-----+-----+-----+

Or, more generically, using a list comprehension if you have a lot of columns to replace and you don't want to hard-code them all:

cols_to_update = ['col_2']

dfA.alias('a').join(dfB.alias('b'), on=['col_1'], how='left')\
    .select(
        *(
            ['col_1'] +
            [
                f.when(
                    ~f.isnull(f.col('b.{}'.format(c))),
                    f.col('b.{}'.format(c))
                ).otherwise(f.col('a.{}'.format(c))).alias(c)
                for c in cols_to_update
            ] +
            ['b.col_3']
        )
    )\
    .union(dfB)\
    .dropDuplicates()\
    .sort('col_1')\
    .show()
I would opt for a different solution, which I believe is less verbose, more generic and does not involve column listing. I would first identify the subset of dfA that will be updated (replaceDf) by performing an inner join based on keyCols (a list). Then I would subtract this replaceDf from dfA and union it with dfB:

replaceDf = dfA.alias('a').join(dfB.alias('b'), on=keyCols, how='inner').select('a.*')
resultDf = dfA.subtract(replaceDf).union(dfB)
resultDf.show()

Even though there will be different columns in dfA and dfB, you can still overcome this by obtaining the list of columns from both DataFrames and taking their union. Then I would prepare the select query (instead of select('a.*')) so that it lists the columns from dfA that exist in dfB, plus "null as colname" for those that do not exist in dfB, as sketched below.
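A hedged sketch of that column-alignment step (dfA and dfB as in the question, with dfB carrying the superset of columns; missing columns are filled with typed NULLs so that subtract/union see a single schema):

from pyspark.sql import functions as F

b_types = dict(dfB.dtypes)  # column name -> type string, taken from dfB
dfA_aligned = dfA.select(*[
    F.col(c) if c in dfA.columns else F.lit(None).cast(b_types[c]).alias(c)
    for c in dfB.columns
])
# dfA_aligned now has dfB's schema, so the subtract/union above works unchanged
# with dfA_aligned in place of dfA.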
If you want to keep only unique values, and require strictly correct results, then a union followed by dropDuplicates should do the trick:

columns_which_dont_change = [...]
old_df.union(new_df).dropDuplicates(subset=columns_which_dont_change)
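Since the question's edit says the new frame has extra columns, a plain union will not line up. A hedged variant for Spark 3.1+ uses unionByName with allowMissingColumns, which fills the missing columns with nulls; note that dropDuplicates keeps an arbitrary row per key, so this still does not guarantee that the newer row wins:

# columns_which_dont_change is hypothetical here, e.g. ['col_1'] for the sample data.
result = old_df.unionByName(new_df, allowMissingColumns=True) \
               .dropDuplicates(subset=columns_which_dont_change)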
Order of rows in DataFrame after aggregation
Suppose I've got a data frame df (created from a hard-coded array for tests):

+----+----+---+
|name|  c1|qty|
+----+----+---+
|   a|abc1|  1|
|   a|abc2|  0|
|   b|abc3|  3|
|   b|abc4|  2|
+----+----+---+

I am grouping and aggregating it to get df1:

import pyspark.sql.functions as sf
df1 = df.groupBy('name').agg(sf.min('qty'))
df1.show()

+----+--------+
|name|min(qty)|
+----+--------+
|   b|       2|
|   a|       0|
+----+--------+

What is the expected order of the rows in df1?

Suppose now that I am writing a unit test and need to compare df1 with an expected data frame. Should I compare them ignoring the order of rows? What is the best way to do it?
The ordering of the rows in the dataframe is not fixed.

There is an easy way to use the expected DataFrame in test cases: do a dataframe diff. For Scala:

assert(df1.except(expectedDf).count == 0)

and

assert(expectedDf.except(df1).count == 0)

For Python you need to replace except with subtract. From the documentation:

subtract(other)
Return a new DataFrame containing rows in this frame but not in another frame. This is equivalent to EXCEPT in SQL.
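A minimal PySpark version of that check (assuming df1 and an expected_df with the same schema; expected_df is a placeholder name):

# Both directions must be empty for the frames to contain the same rows,
# ignoring order and duplicates (subtract is EXCEPT DISTINCT).
assert df1.subtract(expected_df).count() == 0
assert expected_df.subtract(df1).count() == 0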