Convert PySpark dataframe column from list to string - python

I have this PySpark dataframe
+----+--------------------+
|uuid|            test_123|
+----+--------------------+
|   1|[test, test2, test3]|
|   2|[test4, test, test6]|
|   3|[test6, test9, t55o]|
+----+--------------------+
and I want to convert the column test_123 to be like this:
+----+------------------+
|uuid|          test_123|
+----+------------------+
|   1|"test,test2,test3"|
|   2|"test4,test,test6"|
|   3|"test6,test9,t55o"|
+----+------------------+
So I want to go from an array to a string.
How can I do this with PySpark?

While you can use a UserDefinedFunction, it is very inefficient. Instead, it is better to use the concat_ws function:
from pyspark.sql.functions import concat_ws
df.withColumn("test_123", concat_ws(",", "test_123")).show()
+----+----------------+
|uuid| test_123|
+----+----------------+
| 1|test,test2,test3|
| 2|test4,test,test6|
| 3|test6,test9,t55o|
+----+----------------+

You can create a udf that joins the array/list and then apply it to the test_123 column:
from pyspark.sql.functions import udf, col
join_udf = udf(lambda x: ",".join(x))
df.withColumn("test_123", join_udf(col("test_123"))).show()
+----+----------------+
|uuid| test_123|
+----+----------------+
| 1|test,test2,test3|
| 2|test4,test,test6|
| 3|test6,test9,t55o|
+----+----------------+
The initial DataFrame is created from:
from pyspark.sql.types import StructType, StructField, IntegerType, ArrayType, StringType

schema = StructType([
    StructField("uuid", IntegerType(), True),
    StructField("test_123", ArrayType(StringType(), True), True)
])
rdd = sc.parallelize([[1, ["test", "test2", "test3"]],
                      [2, ["test4", "test", "test6"]],
                      [3, ["test6", "test9", "t55o"]]])
df = spark.createDataFrame(rdd, schema)
df.show()
+----+--------------------+
|uuid| test_123|
+----+--------------------+
| 1|[test, test2, test3]|
| 2|[test4, test, test6]|
| 3|[test6, test9, t55o]|
+----+--------------------+

As of Spark 2.4.0, you can use array_join (see the Spark docs):
from pyspark.sql.functions import array_join
df.withColumn("test_123", array_join("test_123", ",")).show()

Related

how to concat values of columns with same name in pyspark

We have a feature request where we want to pull a table from the database on request and perform some transformations on it. These tables may have duplicate columns (columns with the same name), and I want to combine those columns into a single column
for example:
Request for input table named ages:
+---+-----+-----+-----+
|age| ids | ids | ids |
+---+-----+-----+-----+
| 25|   1 |   2 |   3 |
| 26|   4 |   5 |   6 |
+---+-----+-----+-----+
The output table is:
+---+-----------+
|age| ids       |
+---+-----------+
| 25| [1, 2, 3] |
| 26| [4, 5, 6] |
+---+-----------+
Next time we might get a request for an input table named names:
+----+---------+---------+
|name| company | company |
+----+---------+---------+
| abc|    a    |    b    |
| xyc|    c    |    d    |
+----+---------+---------+
The output table should be:
+----+---------+
|name| company |
+----+---------+
| abc| [a, b]  |
| xyc| [c, d]  |
+----+---------+
So basically I need to find the columns with the same name and then merge their values.
You can convert the Spark DataFrame into a pandas DataFrame, perform the necessary operations, and convert it back to a Spark DataFrame.
I have added comments for clarity.
Using Pandas:
import pandas as pd
from collections import Counter

pd_df = spark_df.toPandas()  # convert the Spark DataFrame to a pandas DataFrame
pd_df.head()

def concatDuplicateColumns(df):
    duplicate_cols = []  # to store duplicate column names
    col_counts = dict(Counter(df.columns))
    for col in col_counts:
        if col_counts[col] > 1:
            duplicate_cols.append(col)
    final_dict = {}
    for cols in duplicate_cols:
        final_dict[cols] = []  # initialize dict
    for cols in duplicate_cols:
        for ind in df.index.tolist():
            final_dict[cols].append(df.loc[ind, cols].tolist())
    df.drop(duplicate_cols, axis=1, inplace=True)
    for cols in duplicate_cols:
        df[cols] = final_dict[cols]
    return df

final_df = concatDuplicateColumns(pd_df)
spark_df = spark.createDataFrame(final_df)
spark_df.show()
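If you would rather avoid the pandas round trip, here is a minimal pure-Spark sketch of the same idea, assuming the input DataFrame is called spark_df; the positional renaming and suffixed names like ids_1 are illustrative choices, not part of the original answer:
from collections import Counter
from pyspark.sql import functions as F

# Rename columns positionally so duplicates become distinct: ids, ids_1, ids_2, ...
seen = Counter()
renamed = []
for c in spark_df.columns:
    renamed.append(c if seen[c] == 0 else "{}_{}".format(c, seen[c]))
    seen[c] += 1
tmp = spark_df.toDF(*renamed)

# For every name that occurred more than once, pack its copies into one array column
totals = Counter(spark_df.columns)
for name, count in totals.items():
    if count > 1:
        copies = [name] + ["{}_{}".format(name, i) for i in range(1, count)]
        tmp = tmp.withColumn(name, F.array(*[F.col(c) for c in copies])).drop(*copies[1:])

tmp.show()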

Combine two DataFrames in PySpark into matrix

I have 2 DataFrames in PySpark script.
DF1 has this data:
+-----+--------------+
| id | keyword |
+-----+--------------+
| 1 | banana |
| 2 | apple |
| 3 | orange |
+-----+--------------+
DF2 has this data:
+----+---------------+
| id | tokens |
+----+---------------+
| 13 | ['abc', 'def']|
| 14 | ['ghi', 'jkl']|
| 15 | ['mno', 'pqr']|
+----+---------------+
I'm looking to build a third DataFrame as the result of combining both of the DataFrames above and performing some complex calculations (the calculations themselves are not important) between the keyword and the tokens, defined by a Python function:
def complex_calculation(keyword, tokens):
    # some computation that produces a numeric result from the keyword and the tokens
    # e.g. result = 0.7768756
    return result
The final result should look something like this:
+-------------+---------+--------+--------+
| keyword | 13 | 14 | 15 |
+-------------+---------+--------+--------+
| banana | 0.5345 | 0.4325 | 0.6543 |
| apple | 0.2435 | 0.7865 | 0.9123 |
| orange | 0.3765 | 0.6942 | 0.2765 |
+-------------+---------+--------+--------+
Your complex calculation function is actually quite important in this context, because what you're looking to do is the following:
Create a Cartesian product of your two tables:
table1 = spark._sc.parallelize([[1, "banana"],
                                [2, "apple"],
                                [3, "orange"]]).toDF(["id", "keyword"])
table2 = spark._sc.parallelize([[13, ['abc', 'def']],
                                [14, ['ghi', 'jkl']],
                                [15, ['mno', 'pqr']]]).toDF(["id", "token"])
Pivot with an aggregation function. Now this is where your function comes into play. As you can see, I am using f.count() as my aggregation function.
import pyspark.sql.functions as f

(
    table1.select("keyword")
    .crossJoin(table2)
    .groupBy('keyword')
    .pivot('id')
    .agg(f.count("token"))
).show()
+-------+---+---+---+
|keyword| 13| 14| 15|
+-------+---+---+---+
| orange| 1| 1| 1|
| apple| 1| 1| 1|
| banana| 1| 1| 1|
+-------+---+---+---+
If you want to use some custom, clever calculation, you really have two options. If you're comfortable in Scala, you can write a UDAF (user-defined aggregate function) and register the resulting jar with your Spark cluster. Alternatively, you can have a look at pandas UDFs with something such as:
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf("struct<agg_key: string, parameter1: parameter1_type>", PandasUDFType.GROUPED_MAP)
def my_agg_function(df):
    df = pd.DataFrame(
        df.groupby(agg_key).apply(lambda x: (...)))
    df.reset_index(inplace=True, drop=False)
    return df
And then you use your pandas UDF like this:
spark_df.groupBy("keyword").pivot("id").apply(my_agg_function(...))
However, despite best attempts at being vectorized, pandas UDFs are still not great and can have significant performance impacts. Hope this helps. More on pandas UDFs here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf
Ideally, you should try to do your complex aggregations using Spark functions as much as you can, because Tungsten can then optimise them under the hood and give you the best performance possible.
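For illustration only, if the per-pair metric could be expressed with built-in functions, the whole pivot stays native; in this sketch the size of the token array stands in for complex_calculation, which is not defined in the question:
import pyspark.sql.functions as f

(
    table1.select("keyword")
    .crossJoin(table2)
    .groupBy("keyword")
    .pivot("id")
    .agg(f.first(f.size("token")))   # f.size("token") is just a stand-in metric
).show()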

Iterate over an array column in PySpark with map

In PySpark I have a dataframe composed by two columns:
+-----------+----------------------+
| str1 | array_of_str |
+-----------+----------------------+
| John | [mango, apple, ... |
| Tom | [mango, orange, ... |
| Matteo | [apple, banana, ... |
I want to add a column concat_result that contains the concatenation of each element inside array_of_str with the string inside str1 column.
+-----------+----------------------+----------------------------------+
| str1 | array_of_str | concat_result |
+-----------+----------------------+----------------------------------+
| John | [mango, apple, ... | [mangoJohn, appleJohn, ... |
| Tom | [mango, orange, ... | [mangoTom, orangeTom, ... |
| Matteo | [apple, banana, ... | [appleMatteo, bananaMatteo, ... |
I'm trying to use map to iterate over the array:
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, ArrayType

# START EXTRACT OF CODE
ret = (df
       .select(['str1', 'array_of_str'])
       .withColumn('concat_result', F.udf(
           map(lambda x: x + F.col('str1'), F.col('array_of_str')), ArrayType(StringType))
       )
      )
return ret
# END EXTRACT OF CODE
but I get this error:
TypeError: argument 2 to map() must support iteration
You only need small tweaks to make this work:
from pyspark.sql.types import StringType, ArrayType
from pyspark.sql.functions import udf, col

concat_udf = udf(lambda con_str, arr: [x + con_str for x in arr],
                 ArrayType(StringType()))

ret = df \
    .select(['str1', 'array_of_str']) \
    .withColumn('concat_result', concat_udf(col("str1"), col("array_of_str")))
ret.show()
You don't need to use map; a standard list comprehension inside the UDF is sufficient.
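As a side note, on Spark 2.4+ you could even skip the UDF and use the built-in transform higher-order function via expr; this is a sketch, not part of the original answer:
from pyspark.sql import functions as F

# The lambda inside transform() can reference the str1 column directly
ret = df.withColumn(
    "concat_result",
    F.expr("transform(array_of_str, x -> concat(x, str1))")
)
ret.show()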

Flip a Dataframe [duplicate]

This question already has answers here:
How to melt Spark DataFrame?
(6 answers)
Unpivot in Spark SQL / PySpark
(2 answers)
Dataframe transpose with pyspark in Apache Spark
(2 answers)
Closed 4 years ago.
I am working on Databricks using Python 2.
I have a PySpark dataframe like:
|Germany|USA|UAE|Turkey|Canada|...
|      5|  3|  3|    42|    12|...
Which, as you can see, consists of hundreds of columns and only one single row.
I want to flip it in a way such that I get:
Name | Views
--------------
Germany| 5
USA | 3
UAE | 3
Turkey | 42
Canada | 12
How would I approach this?
Edit: I have hundreds of columns, so I can't write them all down. I don't know most of them; they just exist there. I can't hardcode the column names in this process.
Edit 2: Example code:
dicttest = {'Germany': 5, 'USA': 20, 'Turkey': 15}
rdd=sc.parallelize([dicttest]).toDF()
df = rdd.toPandas().transpose()
This answer might be a bit 'overkill', but it does not use pandas or collect anything to the driver. It will also work when you have multiple rows. We can just pass an empty list to the melt function from "How to melt Spark DataFrame?".
A working example would be as follows:
import findspark
findspark.init()
import pyspark as ps
from pyspark.sql import SQLContext, Column
import pandas as pd
from pyspark.sql.functions import array, col, explode, lit, struct
from pyspark.sql import DataFrame
from typing import Iterable

try:
    sc
except NameError:
    sc = ps.SparkContext()

sqlContext = SQLContext(sc)

# From https://stackoverflow.com/questions/41670103/how-to-melt-spark-dataframe
def melt(
        df: DataFrame,
        id_vars: Iterable[str], value_vars: Iterable[str],
        var_name: str = "variable", value_name: str = "value") -> DataFrame:
    """Convert :class:`DataFrame` from wide to long format."""
    # Create array<struct<variable: str, value: ...>>
    _vars_and_vals = array(*(
        struct(lit(c).alias(var_name), col(c).alias(value_name))
        for c in value_vars))
    # Add to the DataFrame and explode
    _tmp = df.withColumn("_vars_and_vals", explode(_vars_and_vals))
    cols = id_vars + [
        col("_vars_and_vals")[x].alias(x) for x in [var_name, value_name]]
    return _tmp.select(*cols)

# Sample data
df1 = sqlContext.createDataFrame(
    [(0, 1, 2, 3, 4)],
    ("col1", "col2", "col3", "col4", "col5"))
df1.show()

df2 = melt(df1, id_vars=[], value_vars=df1.columns)
df2.show()
Output:
+----+----+----+----+----+
|col1|col2|col3|col4|col5|
+----+----+----+----+----+
| 0| 1| 2| 3| 4|
+----+----+----+----+----+
+--------+-----+
|variable|value|
+--------+-----+
| col1| 0|
| col2| 1|
| col3| 2|
| col4| 3|
| col5| 4|
+--------+-----+
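If you want the column names from the question (Name and Views), you can simply rename the melted columns afterwards; this is a small addition, not part of the original answer:
df3 = df2.withColumnRenamed("variable", "Name").withColumnRenamed("value", "Views")
df3.show()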
Hope this helps.
You can convert the PySpark DataFrame to a pandas DataFrame and use the transpose function:
%pyspark
import numpy as np
from pyspark.sql import SQLContext
from pyspark.sql.functions import lit
dt1 = [[1,2,4,5,6,7]]
dt = sc.parallelize(dt1).toDF()
dt.show()
dt.toPandas().transpose()
Output: the transposed pandas DataFrame, with one row per original column.
Another solution:
dt2 = [{"1":1,"2":2,"4":4,"5":5,"6":29,"7":8}]
df = sc.parallelize(dt2).toDF()
df.show()
a = [{"name":i,"value":df.select(i).collect()[0][0]} for i in df.columns ]
df1 = sc.parallelize(a).toDF()
print(df1)

Adding a value into a DenseVector in PySpark

I have a DataFrame that I have processed to be like:
+---------+-------+
| inputs | temp |
+---------+-------+
| [1,0,0] | 12 |
+---------+-------+
| [0,1,0] | 10 |
+---------+-------+
...
inputs is a column of DenseVectors. temp is a column of values. I want to append these values to each DenseVector to create one column, but I am not sure how to start. Any tips for getting this desired output:
+---------------+
| inputsMerged |
+---------------+
| [1,0,0,12] |
+---------------+
| [0,1,0,10] |
+---------------+
...
EDIT: I am trying to use the VectorAssembler method but my resulting array is not as intended.
You might do something like this:
df.show()
+-------------+----+
| inputs|temp|
+-------------+----+
|[1.0,0.0,0.0]| 12|
|[0.0,1.0,0.0]| 10|
+-------------+----+
df.printSchema()
root
|-- inputs: vector (nullable = true)
|-- temp: long (nullable = true)
Import:
import pyspark.sql.functions as F
from pyspark.ml.linalg import Vectors, VectorUDT
Create the udf to merge the Vector and element:
concat = F.udf(lambda v, e: Vectors.dense(list(v) + [e]), VectorUDT())
Apply udf to inputs and temp columns:
merged_df = df.select(concat(df.inputs, df.temp).alias('inputsMerged'))
merged_df.show()
+------------------+
| inputsMerged|
+------------------+
|[1.0,0.0,0.0,12.0]|
|[0.0,1.0,0.0,10.0]|
+------------------+
merged_df.printSchema()
root
|-- inputsMerged: vector (nullable = true)
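Regarding the VectorAssembler attempt mentioned in the question's edit, a rough sketch is below; note that VectorAssembler may store the result as a SparseVector when that representation is more compact, which can make the output look different even though the values are equivalent:
from pyspark.ml.feature import VectorAssembler

# Combines the existing vector column and the numeric column into one vector column
assembler = VectorAssembler(inputCols=["inputs", "temp"], outputCol="inputsMerged")
assembler.transform(df).select("inputsMerged").show(truncate=False)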
