Is there a simple function (in either pandas or NumPy) to create a new column of true/false values based on matching criteria across different dataframes?
I'm trying to compare two dataframes that both have an email column and see which emails in the first dataframe also appear in the second. The goal is to print a table that looks like this (where hola#lorem.com is present in both dataframes):
| id | email | match |
|:---|:----------------|:------|
| 1 | hola#lorem.com | true |
| 2 | adios#lorem.com | false |
| 3 | bye#lorem.com | false |
Thanks in advance for your help
You can use pd.DataFrame.assign:
df1 = df1.assign(match=df1["email"].isin(df2["email"]))
You can, for example, use the isin function:
df1['match'] = df1['email'].isin(df2['email'])
df2['match'] = df2['email'].isin(df1['email'])
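For reference, here is a minimal runnable sketch of the isin approach, using hypothetical frames built from the sample emails in the question (df2's second address is made up):

import pandas as pd

# Hypothetical data based on the sample emails above
df1 = pd.DataFrame({"id": [1, 2, 3],
                    "email": ["hola#lorem.com", "adios#lorem.com", "bye#lorem.com"]})
df2 = pd.DataFrame({"email": ["hola#lorem.com", "otra#lorem.com"]})

# isin checks each email in df1 against the emails present in df2
df1["match"] = df1["email"].isin(df2["email"])
print(df1)
#    id            email  match
# 0   1   hola#lorem.com   True
# 1   2  adios#lorem.com  False
# 2   3    bye#lorem.com  False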
This question already has answers here:
Concat multiple columns of a dataframe using pyspark (1 answer)
Concatenate columns in Apache Spark DataFrame (18 answers)
How to concatenate multiple columns in PySpark with a separator? (2 answers)
Closed 2 years ago.
I have a pyspark dataframe that has the fields "id", "fields_0_type", "fields_0_price", "fields_1_type", and "fields_1_price":
+----+-------------+--------------+-------------+--------------+
|id  |fields_0_type|fields_0_price|fields_1_type|fields_1_price|
+----+-------------+--------------+-------------+--------------+
|1234|Return       |45            |New          |50            |
+----+-------------+--------------+-------------+--------------+
How can I combine these values into two columns called "type" and "price", with the values in each separated by ","? The ideal dataframe looks like this:
+----+----------+-----+
|id  |type      |price|
+----+----------+-----+
|1234|Return,New|45,50|
+----+----------+-----+
Note that the data I am providing here is a sample. In reality I have more than just "type" and "price" columns that will need to be combined.
Update:
Thanks, it works. But is there any way to get rid of the extra ","? They are caused by blank values in the columns. Is there a way to skip the columns with blank values?
What it shows now:
+------------------------------------------------------------------+
|type |
+------------------------------------------------------------------+
|New,New,New,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, |
|New,New,Sale,Sale,New,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,|
+------------------------------------------------------------------+
How I want it:
+---------------------+
|type                 |
+---------------------+
|New,New,New          |
|New,New,Sale,Sale,New|
+---------------------+
Collect all the columns into an array, then use the concat_ws function.
Example:
df.show()
#+----+-------------+-------------+-------------+
#| id|fields_0_type|fields_1_type|fields_2_type|
#+----+-------------+-------------+-------------+
#|1234| a| b| c|
#+----+-------------+-------------+-------------+
from pyspark.sql.functions import array, concat_ws

columns = df.columns
columns.remove('id')

df.withColumn("type", concat_ws(",", array(*columns))).drop(*columns).show()
#+----+-----+
#| id| type|
#+----+-----+
#|1234|a,b,c|
#+----+-----+
UPDATE:
df.show()
#+----+-------------+--------------+-------------+--------------+
#| id|fields_0_type|fields_0_price|fields_1_type|fields_1_price|
#+----+-------------+--------------+-------------+--------------+
#|1234| a| 45| b| 50|
#+----+-------------+--------------+-------------+--------------+
type_cols=[f for f in df.columns if 'type' in f]
price_cols=[f for f in df.columns if 'price' in f]
df.withColumn("type",concat_ws(",",array(*type_cols))).withColumn("price",concat_ws(",",array(*price_cols))).\
drop(*type_cols,*price_cols).\
show()
#+----+----+-----+
#| id|type|price|
#+----+----+-----+
#|1234| a,b|45,50|
#+----+----+-----+
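As for the extra "," in the update: concat_ws already skips nulls, so one hedged option, assuming the blanks are empty strings rather than nulls, is to drop the "" entries with array_remove (Spark 2.4+) before joining:

from pyspark.sql.functions import array, array_remove, concat_ws

# array_remove drops the "" entries; concat_ws then joins only the real values
df.withColumn("type", concat_ws(",", array_remove(array(*type_cols), ""))).\
   withColumn("price", concat_ws(",", array_remove(array(*price_cols), ""))).\
   drop(*type_cols, *price_cols).\
   show()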
I have a Spark DataFrame, like the one shown below.
I need an algorithm that, whenever it finds 'M' in a row, selects the next two columns after the 'M' and creates a new row. If there are two 'M's in a single row, then I need to create two rows: one with the two columns from the first 'M', and a second with the two columns from the second 'M'.
Input Dataframe:
+------+---+------------+---+--------+--------+-----+---+------+---+---+----+----+----+------+----+
|rownum|_c0|_c1         |_c2|_c3     |_c4     |_c5  |_c6|_c7   |_c8|_c9|_c10|_c11|_c12|_c13  |_c14|
+------+---+------------+---+--------+--------+-----+---+------+---+---+----+----+----+------+----+
|1     |CS1|Jack Binions|19 |20191122|45796416|50021|M  |Drinks|30 |M  |Food|20  |    |      |    |
|2     |CS1|Jack Binions|19 |20191122|45794779|50022|M  |Food  |40 |M  |Bar |50  |M   |Drinks|100 |
+------+---+------------+---+--------+--------+-----+---+------+---+---+----+----+----+------+----+
New Output Dataframe:
+------+---+------------+---+--------+--------+-----+---+------+---+
|rownum|_c0|_c1         |_c2|_c3     |_c4     |_c5  |_c6|_c7   |_c8|
+------+---+------------+---+--------+--------+-----+---+------+---+
|1     |CS1|Jack Binions|19 |20191122|45796416|50021|M  |Drinks|30 |
|1     |CS1|Jack Binions|19 |20191122|45796416|50021|M  |Food  |20 |
|2     |CS1|Jack Binions|19 |20191122|45794779|50022|M  |Food  |40 |
|2     |CS1|Jack Binions|19 |20191122|45794779|50022|M  |Bar   |50 |
|2     |CS1|Jack Binions|19 |20191122|45794779|50022|M  |Drinks|100|
+------+---+------------+---+--------+--------+-----+---+------+---+
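A hedged sketch of one possible approach (not from the original post), assuming the columns after the first seven fixed ones always come in (marker, label, value) triples as in the sample: pack each triple into a struct, explode one row per triple, and keep only the triples whose marker is 'M'.

from pyspark.sql import functions as F

# Fixed columns from the sample; everything after them repeats in threes:
# an 'M' marker, a label (e.g. Drinks) and a value (e.g. 30).
fixed_cols = ["rownum", "_c0", "_c1", "_c2", "_c3", "_c4", "_c5"]
group_cols = [c for c in df.columns if c not in fixed_cols]
triples = [group_cols[i:i + 3] for i in range(0, len(group_cols), 3)]

# Pack each (marker, label, value) triple into a struct
groups = F.array(*[
    F.struct(F.col(m).alias("marker"), F.col(lbl).alias("label"), F.col(val).alias("value"))
    for m, lbl, val in triples
])

# One output row per triple, keeping only the triples that actually start with 'M'
result = (df.withColumn("grp", F.explode(groups))
            .where(F.col("grp.marker") == "M")
            .select(*fixed_cols, "grp.marker", "grp.label", "grp.value"))
result.show()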
This question already has answers here:
get first N elements from dataframe ArrayType column in pyspark (2 answers)
Closed 4 years ago.
I wish to remove the last element of the array in each row of this DataFrame. There is a link demonstrating the same thing, but with UDFs, which I wish to avoid. Is there a simple way to do this, something like list[:2]?
data = [(['cat','dog','sheep'],),(['bus','truck','car'],),(['ice','pizza','pasta'],)]
df = sqlContext.createDataFrame(data,['data'])
df.show()
+-------------------+
| data|
+-------------------+
| [cat, dog, sheep]|
| [bus, truck, car]|
|[ice, pizza, pasta]|
+-------------------+
Expected DataFrame:
+--------------+
| data|
+--------------+
| [cat, dog]|
| [bus, truck]|
| [ice, pizza]|
+--------------+
UDF is the best thing you can find for PySpark :)
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

# Get the first two elements, keeping the column as an array of strings
split_row = udf(lambda row: row[:2], ArrayType(StringType()))

# Apply the UDF to each row
new_df = df.withColumn("data", split_row(df["data"]))
new_df.show()
# Output
+------------+
| data|
+------------+
| [cat, dog]|
|[bus, truck]|
|[ice, pizza]|
+------------+
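Since the question explicitly asked to avoid UDFs, a hedged alternative (not from the original answer) is Spark's built-in slice and size functions, available since Spark 2.4: slicing the array to its length minus one drops the last element without any Python UDF.

from pyspark.sql import functions as F

# Keep everything except the last element of the array
new_df = df.withColumn("data", F.expr("slice(data, 1, size(data) - 1)"))
new_df.show()
# Output
+------------+
|        data|
+------------+
|  [cat, dog]|
|[bus, truck]|
|[ice, pizza]|
+------------+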
Suppose I've got a data frame df (created from a hard-coded array for tests)
+----+----+---+
|name| c1|qty|
+----+----+---+
| a|abc1| 1|
| a|abc2| 0|
| b|abc3| 3|
| b|abc4| 2|
+----+----+---+
I am grouping and aggregating it to get df1
import pyspark.sql.functions as sf
df1 = df.groupBy('name').agg(sf.min('qty'))
df1.show()
+----+--------+
|name|min(qty)|
+----+--------+
| b| 2|
| a| 0|
+----+--------+
What is the expected order of the rows in df1?
Suppose now that I am writing a unit test. I need to compare df1 with an expected data frame. Should I compare them ignoring the order of rows? What is the best way to do it?
The ordering of the rows in the dataframe is not fixed. There is an easy way to use the expected DataFrame in test cases:
do a dataframe diff. For Scala:
assert(df1.except(expectedDf).count == 0)
And
assert(expectedDf.except(df1).count == 0)
For Python you need to replace except with subtract.
From documentation:
subtract(other)
Return a new DataFrame containing rows in this frame but not in another frame.
This is equivalent to EXCEPT in SQL.
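For reference, a minimal sketch of the Python version of the same check (the helper name and expected_df are made up; note that subtract, like EXCEPT, is duplicate-insensitive, so rows that differ only in how many times they occur will not be caught):

def assert_same_rows(actual_df, expected_df):
    """Compare two DataFrames while ignoring row order."""
    assert actual_df.subtract(expected_df).count() == 0
    assert expected_df.subtract(actual_df).count() == 0

# usage in a test
assert_same_rows(df1, expected_df)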