find values closest to a list of values in pyspark

find values closest to a list of values in pyspark - python

Let's suppose to have this Pyspark dataframe:
x = np.random.randint(1, 100, 1000)
y = np.random.randint(1, 100, 1000)
z = np.random.randint(1, 100, 1000)
df = pd.DataFrame({'x': x, 'y': y, 'z': z})
spark_df = spark.createDataFrame(df)
Let's suppose to have this list of values:
lst = [10, 20, 30]
I would like to retrieve all the 3 (=len(lst)) rows of spark_df such that the difference between each value of lst and spark_df.x is the lowest. I would like to retrieve this three values as a spark dataframe. E.g:
+---+---+---+
| x| y| z|
+---+---+---+
| 11| 32| 84|
| 22| 12| 38|
| 29| 14| 12|
+---+---+---+
In this case:
11 is the closest spark_df.x value to 10
22 is the closest spark_df.x value to 20
29 is the closest spark_df.x value to 30
How can achieve this result in Pyspark 3+?
Note: this is only a toy example and the list of values could be in the order of thousands.

Step 1: add columns with the difference of the elements of lst and the x values to the dataframe:
from pyspark.sql import functions as F
diffs = [F.abs(F.col("x") - F.lit(c)).alias(f"diff_{c}") for c in lst]
df_with_diffs = spark_df.select("*", *diffs)
+---+---+---+-------+-------+-------+
| x| y| z|diff_10|diff_20|diff_30|
+---+---+---+-------+-------+-------+
| 15| 34| 20| 5| 5| 15|
| 12| 45| 24| 2| 8| 18|
| 86| 49| 13| 76| 66| 56|
+---+---+---+-------+-------+-------+
only showing top 3 rows
Step 2:
Collect the minimal values for each of the diff columns and select the respective rows:
mins=df_with_diffs.select(*[F.min(f"diff_{c}") for c in lst]).first()
filter=" or ".join([f"(diff_{c} = {mins[i]})" for i,c in enumerate(lst)])
df_with_diffs.filter(filter).select(spark_df.columns).show()
+---+---+---+
| x| y| z|
+---+---+---+
| 12| 45| 24|
| 22| 28| 58|
| 27| 96| 36|
+---+---+---+
Step 2 (original answer): use min_by for each of the newly created columns to find the row with the minimal difference. For each value of lst this returns one row. All these rows are then unioned.
agg_cols = [[F.expr(f"min_by({c}, diff_{val})").alias(c) for c in spark_df.columns]
for val in lst]
import functools
result = functools.reduce(lambda a,b: a.union(df_with_diffs.agg(*b)), agg_cols[1:],
df_with_diffs.agg(*agg_cols[0]))
result.show()

After some experiments on my own, I propose this solution. At first, I wanted avoid pandas_udf function, but, none the less, it seems elegant, pythonic and effective.
Step 1:
Create a pandas_udf in order to obtain a list of minimum values:
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import IntegerType, ArrayType
import pyspark.sql.functions as func
bdc_values = sc.broadcast([10, 20, 30])
#pandas_udf(ArrayType(IntegerType()))
def get_x_min(x: pd.Series) -> ArrayType(IntegerType()):
values = bdc_values.value
return [x.iloc[(x - v).abs().argmin()] for v in values]
Step 2:
Apply the pandas_udf to dataframe and collect the values, in this way:
mins = spark_df.agg(get_x_min('x')).first()[0]
Step 3:
Filter the spark_df with isin function and then eventually deduplicate:
result = spark_df.filter(func.col('x').isin(mins)).dropDuplicates(['x'])

Related

How to code dynamic number of WHEN evaluations in PySpark for a new column

I have a list of historical values for a device setting and a dataframe with timestamps.
I need to create a new column in my dataframe based on the comparison of the timestamps column in the dataframe and the timestamp of the setting value in my list.
settings_history = [[1, '2021-01-01'], [2, '2021-01-12']]
dataframe = df.withColumn(
'setting_col', when(col('device_timestamp') <= settings_history[0][1], settings_history[0][0])
.when(col('device_timestamp') <= settings_history[1][1], settings_history[1][0])
)
The number of entries in the settings_history array is dynamic and I need to find a way to implement something like above, but I get a syntax error. Also, I have tried to use a for loop in my withColumn function, but that didn't work either.
My raw dataframe has values like:
device_timestamp
2020-05-21
2020-12-19
2021-01-03
2021-01-11
My goal is to have something like:
device_timestamp setting_col
2020-05-21 1
2020-12-19 1
2021-01-03 2
2021-01-11 2
I'm using Databricks on Azure for my work.

You can use reduce to chain the when conditions together:
from functools import reduce
settings_history = [[1, '2021-01-01'], [2, '2021-01-12']]
new_col = reduce(
lambda c, history: c.when(col('device_timestamp') <= history[1], history[0]),
settings_history[1:],
when(col('device_timestamp') <= settings_history[0][1], settings_history[0][0])
)
dataframe = df.withColumn('setting_col', new_col)

Something like the below created when_expression function will be useful in this case. where a when condition is created based on whatever information you provide in list settings_array.
import pandas as pd
from pyspark.sql import functions as F
def when_expression(settings_array):
when_condition = None
for a, b in settings_array:
if when_condition is None:
when_condition = F.when(F.col('device_timestamp') <= a, F.lit(b))
else:
when_condition = when_condition.when(F.col('device_timestamp') <= a, F.lit(b))
return when_condition
settings_array = [
[2, 3], # if <= 2 make it 3
[5, 7], # if <= 5 make it 7
[10, 100], # if <= 10 make it 100
]
df = pd.DataFrame({'device_timestamp': range(10)})
df = spark.createDataFrame(df)
df.show()
when_condition = when_expression(settings_array)
print(when_condition)
df = df.withColumn('setting_col', when_condition)
df.show()
Output:
+----------------+
|device_timestamp|
+----------------+
| 0|
| 1|
| 2|
| 3|
| 4|
| 5|
| 6|
| 7|
| 8|
| 9|
+----------------+
Column<b'CASE WHEN (device_timestamp <= 2) THEN 3 WHEN (device_timestamp <= 5) THEN 7 WHEN (device_timestamp <= 10) THEN 100 END'>
+----------------+-----------+
|device_timestamp|setting_col|
+----------------+-----------+
| 0| 3|
| 1| 3|
| 2| 3|
| 3| 7|
| 4| 7|
| 5| 7|
| 6| 100|
| 7| 100|
| 8| 100|
| 9| 100|
+----------------+-----------+

Append list of lists as column to PySpark's dataframe (Concatenating two dataframes without common column)

I have some dataframe in Pyspark:
from pyspark.sql import SQLContext, SparkSession
spark = SparkSession.builder.getOrCreate()
sqlcontext = SQLContext(spark)
df = sqlcontext.createDataFrame([['a'],['b'],['c'],['d'],['e']], ['id'])
df.show()
+---+
| id|
+---+
| a|
| b|
| c|
| d|
| e|
+---+
And I have a list of lists:
l = [[1,1], [2,2], [3,3], [4,4], [5,5]]
Is it possible to append this list as a column to df? Namely, the first element of l should appear next to the first row of df, the second element of l next to the second row of df, etc. It should look like this:
+----+---+--+
| id| l|
+----+---+--+
| a| [1,1]|
| b| [2,2]|
| c| [3,3]|
| d| [4,4]|
| e| [5,5]|
+----+---+--+

UDF's are generally slow but a more efficient way without using any UDF's would be:
import pyspark.sql.functions as F
ldf = spark.createDataFrame(l, schema = "array<int>")
df1 = df.withColumn("m_id", F.monotonically_increasing_id())
df2 = ldf.withColumn("m_id", F.monotonically_increasing_id())
df3 = df2.join(df1, "m_id", "outer").drop("m_id")
df3.select("id", "value").show()
+---+------+
| id| value|
+---+------+
| a|[1, 1]|
| b|[2, 2]|
| d|[4, 4]|
| c|[3, 3]|
| e|[5, 5]|
+---+------+

Assuming that you are going to have same amount of rows in your df and items in your list (df.count==len(l)).
You can add a row_id (to specify the order) to your df, and based on that, access to the item on your list (l).
from pyspark.sql.functions import row_number, lit
from pyspark.sql.window import *
df = df.withColumn("row_num", row_number().over(Window().orderBy(lit('A'))))
df.show()
Above code will look like:
+---+-------+
| id|row_num|
+---+-------+
| 1| 1|
| 2| 2|
| 3| 3|
| 4| 4|
| 5| 5|
+---+-------+
Then, you can just iterate your df and access the specified index in your list:
def map_df(row):
return (row.id, l[row.row_num-1])
new_df = df.rdd.map(map_df).toDF(["id", "l"])
new_df.show()
Output:
+---+------+
| id| l|
+---+------+
| 1|[1, 1]|
| 2|[2, 2]|
| 3|[3, 3]|
| 4|[4, 4]|
| 5|[5, 5]|
+---+------+

Thanks to Cesar's answer, I figured out how to do it without making the dataframe an RDD and coming back. It would be something like this:
from pyspark.sql import SQLContext, SparkSession
from pyspark.sql.functions import row_number, lit, udf
from pyspark.sql.window import Window
from pyspark.sql.types import ArrayType, FloatType, IntegerType
spark = SparkSession.builder.getOrCreate()
sqlcontext = SQLContext(spark)
df = sqlcontext.createDataFrame([['a'],['b'],['c'],['d'],['e']], ['id'])
df = df.withColumn("row_num", row_number().over(Window().orderBy(lit('A'))))
new_col = [[1.,1.], [2.,2.], [3.,3.], [4.,4.], [5.,5.]]
map_list_to_column = udf(lambda row_num: new_col[row_num -1], ArrayType(FloatType()))
df.withColumn('new_col', map_list_to_column(df.row_num)).drop('row_num').show()

Pyspark - Calculate number of null values in each dataframe column

I have a dataframe with many columns. My aim is to produce a dataframe thats lists each column name, along with the number of null values in that column.
Example:
+-------------+-------------+
| Column_Name | NULL_Values |
+-------------+-------------+
| Column_1 | 15 |
| Column_2 | 56 |
| Column_3 | 18 |
| ... | ... |
+-------------+-------------+
I have managed to get the number of null values for ONE column like so:
df.agg(F.count(F.when(F.isnull(c), c)).alias('NULL_Count'))
where c is a column in the dataframe. However, it does not show the name of the column. The output is:
+------------+
| NULL_Count |
+------------+
| 15 |
+------------+
Any ideas?

You can use a list comprehension to loop over all of your columns in the agg, and use alias to rename the output column:
import pyspark.sql.functions as F
df_agg = df.agg(*[F.count(F.when(F.isnull(c), c)).alias(c) for c in df.columns])
However, this will return the results in one row as shown below:
df_agg.show()
#+--------+--------+--------+
#|Column_1|Column_2|Column_3|
#+--------+--------+--------+
#| 15| 56| 18|
#+--------+--------+--------+
If you wanted the results in one column instead, you could union each column from df_agg using functools.reduce as follows:
from functools import reduce
df_agg_col = reduce(
lambda a, b: a.union(b),
(
df_agg.select(F.lit(c).alias("Column_Name"), F.col(c).alias("NULL_Count"))
for c in df_agg.columns
)
)
df_agg_col.show()
#+-----------+----------+
#|Column_Name|NULL_Count|
#+-----------+----------+
#| Column_1| 15|
#| Column_2| 56|
#| Column_3| 18|
#+-----------+----------+
Or you can skip the intermediate step of creating df_agg and do:
df_agg_col = reduce(
lambda a, b: a.union(b),
(
df.agg(
F.count(F.when(F.isnull(c), c)).alias('NULL_Count')
).select(F.lit(c).alias("Column_Name"), "NULL_Count")
for c in df.columns
)
)

Scala alternative could be
case class Test(id:Int, weight:Option[Int], age:Int, gender: Option[String])
val df1 = Seq(Test(1, Some(100), 23, Some("Male")), Test(2, None, 25, None), Test(3, None, 33, Some("Female"))).toDF()
df1.show()
+---+------+---+------+
| id|weight|age|gender|
+---+------+---+------+
| 1| 100| 23| Male|
| 2| null| 25| null|
| 3| null| 33|Female|
+---+------+---+------+
val s = df1.columns.map(c => sum(col(c).isNull.cast("integer")).alias(c))
val df2 = df1.agg(s.head, s.tail:_*)
val t = df2.columns.map(c => df2.select(lit(c).alias("col_name"), col(c).alias("null_count")))
val df_agg_col = t.reduce((df1, df2) => df1.union(df2))
df_agg_col.show()

How to retrieve all columns using pyspark collect_list functions

I have a pyspark 2.0.1. I'm trying to groupby my data frame & retrieve the value for all the fields from my data frame. I found that
z=data1.groupby('country').agg(F.collect_list('names'))
will give me values for country & names attribute & for names attribute it will give column header as collect_list(names). But for my job I have dataframe with around 15 columns & I will run a loop & will change the groupby field each time inside loop & need the output for all of the remaining fields.Can you please suggest me how to do it using collect_list() or any other pyspark functions?
I tried this code too
from pyspark.sql import functions as F
fieldnames=data1.schema.names
names1= list()
for item in names:
if item != 'names':
names1.append(item)
z=data1.groupby('names').agg(F.collect_list(names1))
z.show()
but got error message
Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.collect_list. Trace: py4j.Py4JException: Method collect_list([class java.util.ArrayList]) does not exist

Use struct to combine the columns before calling groupBy
suppose you have a dataframe
df = spark.createDataFrame(sc.parallelize([(0,1,2),(0,4,5),(1,7,8),(1,8,7)])).toDF("a","b","c")
df = df.select("a", f.struct(["b","c"]).alias("newcol"))
df.show()
+---+------+
| a|newcol|
+---+------+
| 0| [1,2]|
| 0| [4,5]|
| 1| [7,8]|
| 1| [8,7]|
+---+------+
df = df.groupBy("a").agg(f.collect_list("newcol").alias("collected_col"))
df.show()
+---+--------------+
| a| collected_col|
+---+--------------+
| 0|[[1,2], [4,5]]|
| 1|[[7,8], [8,7]]|
+---+--------------+
Aggregation operation can be done only on single columns.
After aggregation, You can collect the result and iterate over it to separate the combined columns generate the index dict. or you can write a
udf to separate the combined columns.
from pyspark.sql.types import *
def foo(x):
x1 = [y[0] for y in x]
x2 = [y[1] for y in x]
return(x1,x2)
st = StructType([StructField("b", ArrayType(LongType())), StructField("c", ArrayType(LongType()))])
udf_foo = udf(foo, st)
df = df.withColumn("ncol",
udf_foo("collected_col")).select("a",
col("ncol").getItem("b").alias("b"),
col("ncol").getItem("c").alias("c"))
df.show()
+---+------+------+
| a| b| c|
+---+------+------+
| 0|[1, 4]|[2, 5]|
| 1|[7, 8]|[8, 7]|
+---+------+------+

Actually we can do it in pyspark 2.2 .
First we need create a constant column ("Temp"), groupBy with that column ("Temp") and apply agg by pass iterable *exprs in which expression of collect_list exits.
Below is the code:
import pyspark.sql.functions as ftions
import functools as ftools
def groupColumnData(df, columns):
df = df.withColumn("Temp", ftions.lit(1))
exprs = [ftions.collect_list(colName) for colName in columns]
df = df.groupby('Temp').agg(*exprs)
df = df.drop("Temp")
df = df.toDF(*columns)
return df
Input Data:
df.show()
+---+---+---+
| a| b| c|
+---+---+---+
| 0| 1| 2|
| 0| 4| 5|
| 1| 7| 8|
| 1| 8| 7|
+---+---+---+
Output Data:
df.show()
+------------+------------+------------+
| a| b| c|
+------------+------------+------------+
|[0, 0, 1, 1]|[1, 4, 7, 8]|[2, 5, 8, 7]|
+------------+------------+------------+

in spark 2.4.4 and python 3.7 (I guess its also relevant for previous spark and python version) --
My suggestion is a based on pauli's answer,
instead of creating the struct and then using the agg function, create the struct inside collect_list:
df = spark.createDataFrame([(0,1,2),(0,4,5),(1,7,8),(1,8,7)]).toDF("a","b","c")
df.groupBy("a").agg(collect_list(struct(["b","c"])).alias("res")).show()
result :
+---+-----------------+
| a|res |
+---+-----------------+
| 0|[[1, 2], [4, 5]] |
| 1|[[7, 8], [8, 7]] |
+---+-----------------+

I just use Concat_ws function it's perfectly fine.
> from pyspark.sql.functions import * df =
> spark.createDataFrame([(0,1,2),(0,4,5),(1,7,8),(1,8,7)]).toDF("a","b","c")
> df.groupBy('a').agg(collect_list(concat_ws(',','b','c'))).alias('r').show()

Transpose column to row with Spark

I'm trying to transpose some columns of my table to row.
I'm using Python and Spark 1.5.0. Here is my initial table:
+-----+-----+-----+-------+
| A |col_1|col_2|col_...|
+-----+-------------------+
| 1 | 0.0| 0.6| ... |
| 2 | 0.6| 0.7| ... |
| 3 | 0.5| 0.9| ... |
| ...| ...| ...| ... |
I would like to have somthing like this:
+-----+--------+-----------+
| A | col_id | col_value |
+-----+--------+-----------+
| 1 | col_1| 0.0|
| 1 | col_2| 0.6|
| ...| ...| ...|
| 2 | col_1| 0.6|
| 2 | col_2| 0.7|
| ...| ...| ...|
| 3 | col_1| 0.5|
| 3 | col_2| 0.9|
| ...| ...| ...|
Does someone know haw I can do it? Thank you for your help.

Spark >= 3.4
You can use built-in melt method. With Python:
df.melt(
ids=["A"], values=["col_1", "col_2"],
variableColumnName="key", valueColumnName="val"
)
with Scala
df.melt(Array($"A"), Array($"col_1", $"col_2"), "key", "val")
Spark < 3.4
It is relatively simple to do with basic Spark SQL functions.
Python
from pyspark.sql.functions import array, col, explode, struct, lit
df = sc.parallelize([(1, 0.0, 0.6), (1, 0.6, 0.7)]).toDF(["A", "col_1", "col_2"])
def to_long(df, by):
# Filter dtypes and split into column names and type description
cols, dtypes = zip(*((c, t) for (c, t) in df.dtypes if c not in by))
# Spark SQL supports only homogeneous columns
assert len(set(dtypes)) == 1, "All columns have to be of the same type"
# Create and explode an array of (column_name, column_value) structs
kvs = explode(array([
struct(lit(c).alias("key"), col(c).alias("val")) for c in cols
])).alias("kvs")
return df.select(by + [kvs]).select(by + ["kvs.key", "kvs.val"])
to_long(df, ["A"])
Scala:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{array, col, explode, lit, struct}
val df = Seq((1, 0.0, 0.6), (1, 0.6, 0.7)).toDF("A", "col_1", "col_2")
def toLong(df: DataFrame, by: Seq[String]): DataFrame = {
val (cols, types) = df.dtypes.filter{ case (c, _) => !by.contains(c)}.unzip
require(types.distinct.size == 1, s"${types.distinct.toString}.length != 1")
val kvs = explode(array(
cols.map(c => struct(lit(c).alias("key"), col(c).alias("val"))): _*
))
val byExprs = by.map(col(_))
df
.select(byExprs :+ kvs.alias("_kvs"): _*)
.select(byExprs ++ Seq($"_kvs.key", $"_kvs.val"): _*)
}
toLong(df, Seq("A"))

One way to solve with pyspark sql using functions create_map and explode.
from pyspark.sql import functions as func
#Use `create_map` to create the map of columns with constant
df = df.withColumn('mapCol', \
func.create_map(func.lit('col_1'),df.col_1,
func.lit('col_2'),df.col_2,
func.lit('col_3'),df.col_3
)
)
#Use explode function to explode the map
res = df.select('*',func.explode(df.mapCol).alias('col_id','col_value'))
res.show()

The Spark local linear algebra libraries are presently very weak: and they do not include basic operations as the above.
There is a JIRA for fixing this for Spark 2.1 - but that will not help you today.
Something to consider: performing a transpose will likely require completely shuffling the data.
For now you will need to write RDD code directly. I have written transpose in scala - but not in python. Here is the scala version:
def transpose(mat: DMatrix) = {
val nCols = mat(0).length
val matT = mat
.flatten
.zipWithIndex
.groupBy {
_._2 % nCols
}
.toSeq.sortBy {
_._1
}
.map(_._2)
.map(_.map(_._1))
.toArray
matT
}
So you can convert that to python for your use. I do not have bandwidth to write/test that at this particular moment: let me know if you were unable to do that conversion.
At the least - the following are readily converted to python.
zipWithIndex --> enumerate() (python equivalent - credit to #zero323)
map --> [someOperation(x) for x in ..]
groupBy --> itertools.groupBy()
Here is the implementation for flatten which does not have a python equivalent:
def flatten(L):
for item in L:
try:
for i in flatten(item):
yield i
except TypeError:
yield item
So you should be able to put those together for a solution.

You could use the stack function:
for example:
df.selectExpr("stack(2, 'col_1', col_1, 'col_2', col_2) as (key, value)")
where:
2 is the number of columns to stack (col_1 and col_2)
'col_1' is a string for the key
col_1 is the column from which to take the values
if you have several columns, you could build the whole stack string iterating the column names and pass that to selectExpr

Use flatmap. Something like below should work
from pyspark.sql import Row
def rowExpander(row):
rowDict = row.asDict()
valA = rowDict.pop('A')
for k in rowDict:
yield Row(**{'A': valA , 'colID': k, 'colValue': row[k]})
newDf = sqlContext.createDataFrame(df.rdd.flatMap(rowExpander))

I took the Scala answer that #javadba wrote and created a Python version for transposing all columns in a DataFrame. This might be a bit different from what OP was asking...
from itertools import chain
from pyspark.sql import DataFrame
def _sort_transpose_tuple(tup):
x, y = tup
return x, tuple(zip(*sorted(y, key=lambda v_k: v_k[1], reverse=False)))[0]
def transpose(X):
"""Transpose a PySpark DataFrame.
Parameters
----------
X : PySpark ``DataFrame``
The ``DataFrame`` that should be tranposed.
"""
# validate
if not isinstance(X, DataFrame):
raise TypeError('X should be a DataFrame, not a %s'
% type(X))
cols = X.columns
n_features = len(cols)
# Sorry for this unreadability...
return X.rdd.flatMap( # make into an RDD
lambda xs: chain(xs)).zipWithIndex().groupBy( # zip index
lambda val_idx: val_idx[1] % n_features).sortBy( # group by index % n_features as key
lambda grp_res: grp_res[0]).map( # sort by index % n_features key
lambda grp_res: _sort_transpose_tuple(grp_res)).map( # maintain order
lambda key_col: key_col[1]).toDF() # return to DF
For example:
>>> X = sc.parallelize([(1,2,3), (4,5,6), (7,8,9)]).toDF()
>>> X.show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
| 1| 2| 3|
| 4| 5| 6|
| 7| 8| 9|
+---+---+---+
>>> transpose(X).show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
| 1| 4| 7|
| 2| 5| 8|
| 3| 6| 9|
+---+---+---+

A very handy way to implement:
from pyspark.sql import Row
def rowExpander(row):
rowDict = row.asDict()
valA = rowDict.pop('A')
for k in rowDict:
yield Row(**{'A': valA , 'colID' : k, 'colValue' : row[k]})
newDf = sqlContext.createDataFrame(df.rdd.flatMap(rowExpander)

To transpose Dataframe in pySpark, I use pivot over the temporary created column, which I drop at the end of the operation.
Say, we have a table like this. What we wanna do is to find all users over each listed_days_bin value.
+------------------+-------------+
| listed_days_bin | users_count |
+------------------+-------------+
|1 | 5|
|0 | 2|
|0 | 1|
|1 | 3|
|1 | 4|
|2 | 5|
|2 | 7|
|2 | 2|
|1 | 1|
+------------------+-------------+
Create new temp column - 'pvt_value', aggregate over it and pivot results
import pyspark.sql.functions as F
agg_df = df.withColumn('pvt_value', lit(1))\
.groupby('pvt_value')\
.pivot('listed_days_bin')\
.agg(F.sum('users_count')).drop('pvt_value')
New Dataframe should look like:
+----+---+---+
| 0 | 1 | 2 | # Columns
+----+---+---+
| 3| 13| 14| # Users over the bin
+----+---+---+

I found PySpark to be too complicated to transpose so I just convert my dataframe to Pandas and use the transpose() method and convert the dataframe back to PySpark if required.
dfOutput = spark.createDataFrame(dfPySpark.toPandas().transpose())
dfOutput.display()

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

find values closest to a list of values in pyspark - python

Related

How to code dynamic number of WHEN evaluations in PySpark for a new column

Append list of lists as column to PySpark's dataframe (Concatenating two dataframes without common column)

Pyspark - Calculate number of null values in each dataframe column

How to retrieve all columns using pyspark collect_list functions

Transpose column to row with Spark

Categories

Resources