I currently have a dataframe with an id and a column which is an array of structs:
root
|-- id: string (nullable = true)
|-- lists: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _1: string (nullable = true)
| | |-- _2: string (nullable = true)
Here is an example table with data:
id | list1 | list2
------------------------------------------
1 | [[a, av], [b, bv]]| [[e, ev], [f,fv]]
2 | [[c, cv]] | [[g,gv]]
How do I transform the above dataframe to the one below? I need to "explode" the arrays and add columns based on the first value in each struct.
id | a    | b    | c    | d    | e    | f    | g
-------------------------------------------------
1  | av   | bv   | null | null | ev   | fv   | null
2  | null | null | cv   | null | null | null | gv
PySpark code to create the dataframe:
d1 = spark.createDataFrame([("1", [("a", "av"), ("b", "bv")], [("e", "ev"), ("f", "fv")]),
                            ("2", [("c", "cv")], [("g", "gv")])], ["id", "list1", "list2"])
Note: I'm on Spark 2.2.0, so some SQL functions, such as concat_map, don't work.
You can do this using higher-order functions, without exploding the arrays, like this (note that the SQL filter and transform functions used below require Spark 2.4+):
import pyspark.sql.functions as f

d1.select('id',
          f.when(f.size(f.expr("filter(list1, x -> x._1 = 'a')")) > 0, f.concat_ws(',', f.expr("transform(filter(list1, x -> x._1 = 'a'), value -> value._2)"))).alias('a'),
          f.when(f.size(f.expr("filter(list1, x -> x._1 = 'b')")) > 0, f.concat_ws(',', f.expr("transform(filter(list1, x -> x._1 = 'b'), value -> value._2)"))).alias('b'),
          f.when(f.size(f.expr("filter(list1, x -> x._1 = 'c')")) > 0, f.concat_ws(',', f.expr("transform(filter(list1, x -> x._1 = 'c'), value -> value._2)"))).alias('c'),
          f.when(f.size(f.expr("filter(list1, x -> x._1 = 'd')")) > 0, f.concat_ws(',', f.expr("transform(filter(list1, x -> x._1 = 'd'), value -> value._2)"))).alias('d'),
          f.when(f.size(f.expr("filter(list2, x -> x._1 = 'e')")) > 0, f.concat_ws(',', f.expr("transform(filter(list2, x -> x._1 = 'e'), value -> value._2)"))).alias('e'),
          f.when(f.size(f.expr("filter(list2, x -> x._1 = 'f')")) > 0, f.concat_ws(',', f.expr("transform(filter(list2, x -> x._1 = 'f'), value -> value._2)"))).alias('f'),
          f.when(f.size(f.expr("filter(list2, x -> x._1 = 'g')")) > 0, f.concat_ws(',', f.expr("transform(filter(list2, x -> x._1 = 'g'), value -> value._2)"))).alias('g'),
          f.when(f.size(f.expr("filter(list2, x -> x._1 = 'h')")) > 0, f.concat_ws(',', f.expr("transform(filter(list2, x -> x._1 = 'h'), value -> value._2)"))).alias('h')
          ).show()
+---+----+----+----+----+----+----+----+----+
| id| a| b| c| d| e| f| g| h|
+---+----+----+----+----+----+----+----+----+
| 1| av| bv|null|null| ev| fv|null|null|
| 2|null|null| cv|null|null|null| gv|null|
+---+----+----+----+----+----+----+----+----+
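If the key list grows, the repeated expressions above can also be generated in a loop. A minimal sketch, assuming the keys expected in each list column are known up front (same keys as above):
import pyspark.sql.functions as f

keys_per_list = {'list1': ['a', 'b', 'c', 'd'], 'list2': ['e', 'f', 'g', 'h']}
cols = [
    f.when(
        f.size(f.expr("filter({lst}, x -> x._1 = '{k}')".format(lst=lst, k=k))) > 0,
        f.concat_ws(',', f.expr("transform(filter({lst}, x -> x._1 = '{k}'), v -> v._2)".format(lst=lst, k=k)))
    ).alias(k)
    for lst, ks in keys_per_list.items() for k in ks
]
d1.select('id', *cols).show()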
Hope it helps
UPD - For Spark 2.2.0
You can define similar functions in 2.2.0 using UDFs. They will be much less efficient in terms of performance and you'll need a special function for each output value type (i.e. you won't be able to have one element_at function that can output a value of any type from any map type), but they will work. The code below works for Spark 2.2.0:
import pyspark.sql.functions as f
from pyspark.sql.functions import udf
from pyspark.sql.types import MapType, ArrayType, StringType

@udf(MapType(StringType(), StringType()))
def map_from_entries(l):
    return {x: y for x, y in l}

@udf(MapType(StringType(), StringType()))
def map_concat(m1, m2):
    m1.update(m2)
    return m1

@udf(ArrayType(StringType()))
def map_keys(m):
    return list(m.keys())

def element_getter(k):
    @udf(StringType())
    def element_at(m):
        return m.get(k)
    return element_at

d2 = d1.select('id',
               map_concat(map_from_entries('list1'),
                          map_from_entries('list2')).alias('merged_map'))
map_keys = d2.select(f.explode(map_keys('merged_map')).alias('mk')) \
             .agg(f.collect_set('mk').alias('keys')) \
             .collect()[0].keys
# or, if the list of columns is fixed up front, replace the three lines above with:
# map_keys = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
selects = [element_getter(k)('merged_map').alias(k) for k in sorted(map_keys)]
d = d2.select('id', *selects)
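As a quick sanity check, if the fixed key list is used, d.show() should match the fixed-column output shown further down:
d.show()
# +---+----+----+----+----+----+----+----+
# | id|   a|   b|   c|   d|   e|   f|   g|
# +---+----+----+----+----+----+----+----+
# |  1|  av|  bv|null|null|  ev|  fv|null|
# |  2|null|null|  cv|null|null|null|  gv|
# +---+----+----+----+----+----+----+----+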
ORIGINAL ANSWER (working for Spark 2.4.0+)
It's not clear where the d column came from in your example (d never appears in the initial dataframe). If columns should be created based on the first elements in the arrays, then this should work (assuming the total number of unique first values in the lists is small enough):
import pyspark.sql.functions as f

d2 = d1.select('id',
               f.map_concat(f.map_from_entries('list1'),
                            f.map_from_entries('list2')).alias('merged_map'))
map_keys = d2.select(f.explode(f.map_keys('merged_map')).alias('mk')) \
             .agg(f.collect_set('mk').alias('keys')) \
             .collect()[0].keys
selects = [f.element_at('merged_map', k).alias(k) for k in sorted(map_keys)]
d = d2.select('id', *selects)
Output (no column for d because it is never mentioned in the initial DataFrame):
+---+----+----+----+----+----+----+
| id| a| b| c| e| f| g|
+---+----+----+----+----+----+----+
| 1| av| bv|null| ev| fv|null|
| 2|null|null| cv|null|null| gv|
+---+----+----+----+----+----+----+
If you actually had in mind that the list of columns is fixed from the beginning (and not taken from the arrays), then you can just replace the definition of the variable map_keys with the fixed list of columns, e.g. map_keys = ['a', 'b', 'c', 'd', 'e', 'f', 'g']. In that case you get the output you mention in the question:
+---+----+----+----+----+----+----+----+
| id| a| b| c| d| e| f| g|
+---+----+----+----+----+----+----+----+
| 1| av| bv|null|null| ev| fv|null|
| 2|null|null| cv|null|null|null| gv|
+---+----+----+----+----+----+----+----+
By the way - what you want to do is not what is called explode in Spark. explode in Spark is for the situation where you create multiple rows from one. E.g. if you wanted to go from a dataframe like this:
+---+---------+
| id| arr|
+---+---------+
| 1| [a, b]|
| 2|[c, d, e]|
+---+---------+
to this:
+---+-------+
| id|element|
+---+-------+
| 1| a|
| 1| b|
| 2| c|
| 2| d|
| 2| e|
+---+-------+
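For reference, a minimal sketch of that explode (column names assumed for illustration):
from pyspark.sql import functions as f

df = spark.createDataFrame([("1", ["a", "b"]), ("2", ["c", "d", "e"])], ["id", "arr"])
df.select("id", f.explode("arr").alias("element")).show()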
You can solve this in an alternative way:
from pyspark.sql.functions import concat, explode, first
d1 = spark.createDataFrame([("1", [("a", "av"), ("b", "bv")], [("e", "ev"), ("f", "fv")]),
                            ("2", [("c", "cv")], [("g", "gv")])], ["id", "list1", "list2"])
d2 = d1.withColumn('concat', concat('list1', 'list2'))
d3 = d2.withColumn('explode', explode('concat'))
d4 = d3.groupby('id').pivot('explode._1').agg(first('explode._2'))
d4.show()
+---+----+----+----+----+----+----+
|id |a |b |c |e |f |g |
+---+----+----+----+----+----+----+
|1 |av |bv |null|ev |fv |null|
|2 |null|null|cv |null|null|gv |
+---+----+----+----+----+----+----+
I need to find the exponential moving average in a Spark dataframe.
Table:
ab = spark.createDataFrame(
[(1,"1/1/2020", 41.0,0.5, 0.5 ,1, '10.22'),
(1,"10/3/2020",24.0,0.3, 0.7 ,2, '' ),
(1,"21/5/2020",32.0,0.4, 0.6 ,3, '' ),
(2,"3/1/2020", 51.0,0.22, 0.78,1, '34.78'),
(2,"10/5/2020",14.56,0.333,0.66,2, '' ),
(2,"30/9/2020",17.0,0.66, 0.34,3, '' )],["CID","date","A","B","C","Row","SMA"] )
ab.show()
+---+---------+-----+-----+----+---+-----+
|CID| date| A| B| C| Row| SMA|
+---+---------+-----+-----+----+---+-----+
| 1| 1/1/2020| 41.0| 0.5| 0.5| 1|10.22|
| 1|10/3/2020| 24.0| 0.3| 0.7| 2| |
| 1|21/5/2020| 32.0| 0.4| 0.6| 3| |
| 2| 3/1/2020| 51.0| 0.22|0.78| 1|34.78|
| 2|10/5/2020|14.56|0.333|0.66| 2| |
| 2|30/9/2020| 17.0| 0.66|0.34| 3| |
+---+---------+-----+-----+----+---+-----+
Expected Output :
+---+---------+-----+-----+----+---+-----+----------+
|CID| date| A| B| C|Row| SMA| EMA|
+---+---------+-----+-----+----+---+-----+----------+
| 1| 1/1/2020| 41.0| 0.5| 0.5| 1|10.22| 10.22|
| 1|10/3/2020| 24.0| 0.3| 0.7| 2| | 14.354|
| 1|21/5/2020| 32.0| 0.4| 0.6| 3| | 21.4124|
| 2| 3/1/2020| 51.0| 0.22|0.78| 1|34.78| 34.78|
| 2|10/5/2020|14.56|0.333|0.66| 2| | 28.04674|
| 2|30/9/2020| 17.0| 0.66|0.34| 3| |20.7558916|
+---+---------+-----+-----+----+---+-----+----------+
Logic:
For every customer:
    if Row == 1 then SMA as EMA
    else (C * LAG(EMA) + A * B) as EMA
The problem here is that a freshly calculated value of a previous row is used as input for the current row. That means that it is not possible to parallelize the calculations for a single customer.
For Spark 3.0+, it is possible to get the required result with a Pandas UDF using grouped map (applyInPandas):
from pyspark.sql import functions as F
from pyspark.sql import types as T

ab = spark.createDataFrame(
    [(1, "1/1/2020",  41.0,  0.5,   0.5,  1, '10.22'),
     (1, "10/3/2020", 24.0,  0.3,   0.7,  2, ''),
     (1, "21/5/2020", 32.0,  0.4,   0.6,  3, ''),
     (2, "3/1/2020",  51.0,  0.22,  0.78, 1, '34.78'),
     (2, "10/5/2020", 14.56, 0.333, 0.66, 2, ''),
     (2, "30/9/2020", 17.0,  0.66,  0.34, 3, '')],
    ["CID", "date", "A", "B", "C", "Row", "SMA"]) \
    .withColumn("SMA", F.col('SMA').cast(T.DoubleType())) \
    .withColumn("date", F.to_date(F.col("date"), "d/M/yyyy"))
import pandas as pd

def calc(df: pd.DataFrame):
    # df is a pandas DataFrame holding all rows of one CID
    df = df.sort_values('date').reset_index(drop=True)
    df.loc[0, 'EMA'] = df.loc[0, 'SMA']
    for i in range(1, len(df)):
        df.loc[i, 'EMA'] = df.loc[i, 'C'] * df.loc[i - 1, 'EMA'] + \
                           df.loc[i, 'A'] * df.loc[i, 'B']
    return df

ab.groupBy("CID").applyInPandas(calc,
    schema="CID long, date date, A double, B double, C double, Row long, SMA double, EMA double") \
    .show()
Output:
+---+----------+-----+-----+----+---+-----+------------------+
|CID| date| A| B| C|Row| SMA| EMA|
+---+----------+-----+-----+----+---+-----+------------------+
| 1|2020-01-01| 41.0| 0.5| 0.5| 1|10.22| 10.22|
| 1|2020-03-10| 24.0| 0.3| 0.7| 2| null| 14.354|
| 1|2020-05-21| 32.0| 0.4| 0.6| 3| null|21.412399999999998|
| 2|2020-01-03| 51.0| 0.22|0.78| 1|34.78| 34.78|
| 2|2020-05-10|14.56|0.333|0.66| 2| null| 27.80328|
| 2|2020-09-30| 17.0| 0.66|0.34| 3| null| 20.6731152|
+---+----------+-----+-----+----+---+-----+------------------+
The idea is to use a Pandas dataframe for each group. This Pandas dataframe contains all rows of the current group and is sorted by date. During the iteration over the Pandas dataframe we can access the EMA value of the previous row (which is not possible for a Spark dataframe).
There are some caveats:
all rows of one group have to fit into the memory of a single executor; partial aggregation is not possible here
iterating over a Pandas dataframe is discouraged
I have some strings in a numeric column, like 1, 2, 3, 4, 'lol', 6 ...
I just want to delete these rows. How can I delete them?
.cast did not return NaN. I wrote a function, but it takes far too much time, and it didn't work anyway...
import pyspark.sql.functions as F
from pyspark.sql.types import *

def is_digit(val):
    try:
        is_num = str(val).replace(".", "", 1).isdigit() if val else False
        return is_num
    except:
        return False

is_digit_udf = F.udf(is_digit, BooleanType())
It's so stupid :(((
Use the .rlike() function with a regex to keep only the rows that contain numbers (i.e. filter out the non-numeric rows).
Example:
df.show()
+----+
| val|
+----+
| 1|
| 2|
| 3|
| lol|
| 6|
+----+
from pyspark.sql.functions import col

df.filter(col("val").rlike("[^a-zA-Z]")).show()
# or using a [^\d] regex
df.filter(~col("val").rlike(r"[^\d]")).show()
#+---+
#|val|
#+---+
#| 1|
#| 2|
#| 3|
#| 6|
#+---+
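If decimal values can also occur, a slightly wider pattern should work too (a sketch, assuming the same val column):
from pyspark.sql.functions import col

# keep rows that look like integers or decimals, e.g. "3" or "3.14"
df.filter(col("val").rlike(r"^\d+(\.\d+)?$")).show()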
As far as I understand, you want to delete the rows which contain a non-numeric value, right? Then simply cast the column and filter out the nulls.
df.printSchema()
df.show(10, False)
root
|-- num: string (nullable = true)
+---+
|num|
+---+
|1 |
|2 |
|3 |
|lol|
|5 |
+---+
import pyspark.sql.functions as f
df.withColumn("num", f.col("num").cast("int")) \
.filter(f.col("num").isNotNull()) \
.show(10, False)
+---+
|num|
+---+
|1 |
|2 |
|3 |
|5 |
+---+
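If the column may also contain decimals, the same approach with a double cast should behave the same way (sketch):
import pyspark.sql.functions as f

df.withColumn("num", f.col("num").cast("double")) \
  .filter(f.col("num").isNotNull()) \
  .show(10, False)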
I have a dataframe with many columns. My aim is to produce a dataframe that lists each column name, along with the number of null values in that column.
Example:
+-------------+-------------+
| Column_Name | NULL_Values |
+-------------+-------------+
| Column_1 | 15 |
| Column_2 | 56 |
| Column_3 | 18 |
| ... | ... |
+-------------+-------------+
I have managed to get the number of null values for ONE column like so:
df.agg(F.count(F.when(F.isnull(c), c)).alias('NULL_Count'))
where c is a column in the dataframe. However, it does not show the name of the column. The output is:
+------------+
| NULL_Count |
+------------+
| 15 |
+------------+
Any ideas?
You can use a list comprehension to loop over all of your columns in the agg, and use alias to rename the output column:
import pyspark.sql.functions as F
df_agg = df.agg(*[F.count(F.when(F.isnull(c), c)).alias(c) for c in df.columns])
However, this will return the results in one row as shown below:
df_agg.show()
#+--------+--------+--------+
#|Column_1|Column_2|Column_3|
#+--------+--------+--------+
#| 15| 56| 18|
#+--------+--------+--------+
If you wanted the results in one column instead, you could union each column from df_agg using functools.reduce as follows:
from functools import reduce
df_agg_col = reduce(
lambda a, b: a.union(b),
(
df_agg.select(F.lit(c).alias("Column_Name"), F.col(c).alias("NULL_Count"))
for c in df_agg.columns
)
)
df_agg_col.show()
#+-----------+----------+
#|Column_Name|NULL_Count|
#+-----------+----------+
#| Column_1| 15|
#| Column_2| 56|
#| Column_3| 18|
#+-----------+----------+
Or you can skip the intermediate step of creating df_agg and do:
df_agg_col = reduce(
lambda a, b: a.union(b),
(
df.agg(
F.count(F.when(F.isnull(c), c)).alias('NULL_Count')
).select(F.lit(c).alias("Column_Name"), "NULL_Count")
for c in df.columns
)
)
A Scala alternative could be:
import org.apache.spark.sql.functions.{col, lit, sum}
// outside spark-shell you may also need: import spark.implicits._ (for toDF)

case class Test(id: Int, weight: Option[Int], age: Int, gender: Option[String])
val df1 = Seq(Test(1, Some(100), 23, Some("Male")), Test(2, None, 25, None), Test(3, None, 33, Some("Female"))).toDF()
df1.show()
+---+------+---+------+
| id|weight|age|gender|
+---+------+---+------+
| 1| 100| 23| Male|
| 2| null| 25| null|
| 3| null| 33|Female|
+---+------+---+------+
val s = df1.columns.map(c => sum(col(c).isNull.cast("integer")).alias(c))
val df2 = df1.agg(s.head, s.tail:_*)
val t = df2.columns.map(c => df2.select(lit(c).alias("col_name"), col(c).alias("null_count")))
val df_agg_col = t.reduce((df1, df2) => df1.union(df2))
df_agg_col.show()
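For the sample data above, this should produce something like:
+--------+----------+
|col_name|null_count|
+--------+----------+
|      id|         0|
|  weight|         2|
|     age|         0|
|  gender|         1|
+--------+----------+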
I am trying to find all of the distinct values in each column in a dataframe and show them in one table.
Example data:
|-----------|-----------|-----------|
| COL_1 | COL_2 | COL_3 |
|-----------|-----------|-----------|
| A | C | D |
| A | C | D |
| A | C | E |
| B | C | E |
| B | C | F |
| B | C | F |
|-----------|-----------|-----------|
Example output:
|-----------|-----------|-----------|
| COL_1 | COL_2 | COL_3 |
|-----------|-----------|-----------|
| A | C | D |
| B | | E |
| | | F |
|-----------|-----------|-----------|
Is this even possible? I have been able to do it in separate tables, but it would be much better all in one table.
Any ideas?
The simplest thing here would be to use pyspark.sql.functions.collect_set on all of the columns:
import pyspark.sql.functions as f
df.select(*[f.collect_set(c).alias(c) for c in df.columns]).show()
#+------+-----+---------+
#| COL_1|COL_2| COL_3|
#+------+-----+---------+
#|[B, A]| [C]|[F, E, D]|
#+------+-----+---------+
Obviously, this returns the data as one row.
If instead you want the output as you wrote in your question (one row per unique value for each column), it's doable but requires quite a bit of pyspark gymnastics (and any solution likely will be much less efficient).
Nevertheless, I present you some options:
Option 1: Explode and Join
You can use pyspark.sql.functions.posexplode to explode the elements in the set of values for each column along with the index in the array. Do this for each column separately and then outer join the resulting list of DataFrames together using functools.reduce:
from functools import reduce
unique_row = df.select(*[f.collect_set(c).alias(c) for c in df.columns])
final_df = reduce(
lambda a, b: a.join(b, how="outer", on="pos"),
(unique_row.select(f.posexplode(c).alias("pos", c)) for c in unique_row.columns)
).drop("pos")
final_df.show()
#+-----+-----+-----+
#|COL_1|COL_2|COL_3|
#+-----+-----+-----+
#| A| null| E|
#| null| null| D|
#| B| C| F|
#+-----+-----+-----+
Option 2: Select by position
First compute the size of the maximum array and store this in a new column max_length. Then select elements from each array if a value exists at that index.
Once again we use pyspark.sql.functions.posexplode but this time it's just to create a column to represent the index in each array to extract.
Finally we use a trick that lets a column value (max_length) act as a parameter: repeat a delimiter max_length-1 times, split on it, and posexplode the result to generate one row per position.
final_df = df.select(*[f.collect_set(c).alias(c) for c in df.columns])\
    .withColumn("max_length", f.greatest(*[f.size(c) for c in df.columns]))\
    .select("*", f.expr("posexplode(split(repeat(',', max_length-1), ','))"))\
    .select(
        *[
            f.expr(
                "case when size({c}) > pos then {c}[pos] else null end AS {c}".format(c=c))
            for c in df.columns
        ]
    )
final_df.show()
final_df.show()
#+-----+-----+-----+
#|COL_1|COL_2|COL_3|
#+-----+-----+-----+
#| B| C| F|
#| A| null| E|
#| null| null| D|
#+-----+-----+-----+
I'm trying to transpose some columns of my table to rows.
I'm using Python and Spark 1.5.0. Here is my initial table:
+-----+-----+-----+-------+
| A |col_1|col_2|col_...|
+-----+-----+-----+-------+
| 1 | 0.0| 0.6| ... |
| 2 | 0.6| 0.7| ... |
| 3 | 0.5| 0.9| ... |
| ...| ...| ...| ... |
I would like to have something like this:
+-----+--------+-----------+
| A | col_id | col_value |
+-----+--------+-----------+
| 1 | col_1| 0.0|
| 1 | col_2| 0.6|
| ...| ...| ...|
| 2 | col_1| 0.6|
| 2 | col_2| 0.7|
| ...| ...| ...|
| 3 | col_1| 0.5|
| 3 | col_2| 0.9|
| ...| ...| ...|
Does someone know how I can do it? Thank you for your help.
Spark >= 3.4
You can use the built-in melt method. With Python:
df.melt(
    ids=["A"], values=["col_1", "col_2"],
    variableColumnName="key", valueColumnName="val"
)
With Scala:
df.melt(Array($"A"), Array($"col_1", $"col_2"), "key", "val")
Spark < 3.4
It is relatively simple to do with basic Spark SQL functions.
Python
from pyspark.sql.functions import array, col, explode, struct, lit

df = sc.parallelize([(1, 0.0, 0.6), (1, 0.6, 0.7)]).toDF(["A", "col_1", "col_2"])

def to_long(df, by):
    # Filter dtypes and split into column names and type description
    cols, dtypes = zip(*((c, t) for (c, t) in df.dtypes if c not in by))

    # Spark SQL supports only homogeneous columns
    assert len(set(dtypes)) == 1, "All columns have to be of the same type"

    # Create and explode an array of (column_name, column_value) structs
    kvs = explode(array([
        struct(lit(c).alias("key"), col(c).alias("val")) for c in cols
    ])).alias("kvs")

    return df.select(by + [kvs]).select(by + ["kvs.key", "kvs.val"])

to_long(df, ["A"])
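For the sample df above, the result should contain one row per (A, column) pair (row order may differ):
to_long(df, ["A"]).show()
# expected rows, one per (A, key) pair:
# (1, col_1, 0.0), (1, col_2, 0.6), (1, col_1, 0.6), (1, col_2, 0.7)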
Scala:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{array, col, explode, lit, struct}
val df = Seq((1, 0.0, 0.6), (1, 0.6, 0.7)).toDF("A", "col_1", "col_2")
def toLong(df: DataFrame, by: Seq[String]): DataFrame = {
  val (cols, types) = df.dtypes.filter{ case (c, _) => !by.contains(c)}.unzip
  require(types.distinct.size == 1, s"${types.distinct.toString}.length != 1")

  val kvs = explode(array(
    cols.map(c => struct(lit(c).alias("key"), col(c).alias("val"))): _*
  ))

  val byExprs = by.map(col(_))

  df
    .select(byExprs :+ kvs.alias("_kvs"): _*)
    .select(byExprs ++ Seq($"_kvs.key", $"_kvs.val"): _*)
}
toLong(df, Seq("A"))
One way to solve this with PySpark SQL is to use the functions create_map and explode.
from pyspark.sql import functions as func

# Use `create_map` to build a map of (constant column name, column value) pairs
df = df.withColumn('mapCol',
                   func.create_map(func.lit('col_1'), df.col_1,
                                   func.lit('col_2'), df.col_2,
                                   func.lit('col_3'), df.col_3
                                   )
                   )
# Use the explode function to explode the map
res = df.select('*', func.explode(df.mapCol).alias('col_id', 'col_value'))
res.show()
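If there are many columns, the map can also be built from the column list instead of spelling out each pair. A minimal sketch, assuming 'A' is the id column to keep:
from itertools import chain
from pyspark.sql import functions as func

# exclude the id column (and the mapCol helper, if it was already added)
value_cols = [c for c in df.columns if c not in ('A', 'mapCol')]
map_col = func.create_map(*chain.from_iterable(
    (func.lit(c), func.col(c)) for c in value_cols
))
res = df.select('A', func.explode(map_col).alias('col_id', 'col_value'))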
The Spark local linear algebra libraries are presently very weak, and they do not include basic operations such as the above.
There is a JIRA for fixing this for Spark 2.1 - but that will not help you today.
Something to consider: performing a transpose will likely require completely shuffling the data.
For now you will need to write RDD code directly. I have written transpose in Scala - but not in Python. Here is the Scala version:
// DMatrix is assumed here to be an Array[Array[_]]-like matrix type
def transpose(mat: DMatrix) = {
  val nCols = mat(0).length
  val matT = mat
    .flatten
    .zipWithIndex
    .groupBy { _._2 % nCols }
    .toSeq.sortBy { _._1 }
    .map(_._2)
    .map(_.map(_._1))
    .toArray
  matT
}
So you can convert that to Python for your use. I do not have the bandwidth to write/test that at this particular moment: let me know if you were unable to do that conversion.
At the least - the following are readily converted to Python.
zipWithIndex --> enumerate() (Python equivalent - credit to @zero323)
map --> [someOperation(x) for x in ..]
groupBy --> itertools.groupby()
Here is the implementation for flatten, which does not have a Python equivalent:
def flatten(L):
    for item in L:
        try:
            for i in flatten(item):
                yield i
        except TypeError:
            yield item
So you should be able to put those together for a solution.
You could use the stack function:
for example:
df.selectExpr("stack(2, 'col_1', col_1, 'col_2', col_2) as (key, value)")
where:
2 is the number of columns to stack (col_1 and col_2)
'col_1' is a string for the key
col_1 is the column from which to take the values
If you have several columns, you could build the whole stack string by iterating over the column names and pass that to selectExpr, as sketched below.
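A minimal sketch of that, assuming an id column named A and all other columns being values to stack:
value_cols = [c for c in df.columns if c != "A"]
stack_expr = "stack({n}, {args}) as (key, value)".format(
    n=len(value_cols),
    args=", ".join("'{c}', {c}".format(c=c) for c in value_cols)
)
df.selectExpr("A", stack_expr)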
Use flatMap. Something like the below should work:
from pyspark.sql import Row

def rowExpander(row):
    rowDict = row.asDict()
    valA = rowDict.pop('A')
    for k in rowDict:
        yield Row(**{'A': valA, 'colID': k, 'colValue': row[k]})

newDf = sqlContext.createDataFrame(df.rdd.flatMap(rowExpander))
I took the Scala answer that @javadba wrote and created a Python version for transposing all columns in a DataFrame. This might be a bit different from what the OP was asking...
from itertools import chain
from pyspark.sql import DataFrame

def _sort_transpose_tuple(tup):
    x, y = tup
    return x, tuple(zip(*sorted(y, key=lambda v_k: v_k[1], reverse=False)))[0]

def transpose(X):
    """Transpose a PySpark DataFrame.

    Parameters
    ----------
    X : PySpark ``DataFrame``
        The ``DataFrame`` that should be transposed.
    """
    # validate
    if not isinstance(X, DataFrame):
        raise TypeError('X should be a DataFrame, not a %s' % type(X))

    cols = X.columns
    n_features = len(cols)

    # Sorry for this unreadability...
    return X.rdd.flatMap(                                     # make into an RDD
        lambda xs: chain(xs)).zipWithIndex().groupBy(         # zip index
        lambda val_idx: val_idx[1] % n_features).sortBy(      # group by index % n_features as key
        lambda grp_res: grp_res[0]).map(                      # sort by index % n_features key
        lambda grp_res: _sort_transpose_tuple(grp_res)).map(  # maintain order
        lambda key_col: key_col[1]).toDF()                    # return to DF
For example:
>>> X = sc.parallelize([(1,2,3), (4,5,6), (7,8,9)]).toDF()
>>> X.show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
| 1| 2| 3|
| 4| 5| 6|
| 7| 8| 9|
+---+---+---+
>>> transpose(X).show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
| 1| 4| 7|
| 2| 5| 8|
| 3| 6| 9|
+---+---+---+
To transpose a DataFrame in PySpark, I use pivot over a temporarily created column, which I drop at the end of the operation.
Say we have a table like this. What we want to do is to sum the users_count over each listed_days_bin value.
+------------------+-------------+
| listed_days_bin | users_count |
+------------------+-------------+
|1 | 5|
|0 | 2|
|0 | 1|
|1 | 3|
|1 | 4|
|2 | 5|
|2 | 7|
|2 | 2|
|1 | 1|
+------------------+-------------+
Create a new temp column 'pvt_value', group by it, pivot on listed_days_bin, aggregate, and drop the temp column:
import pyspark.sql.functions as F

agg_df = df.withColumn('pvt_value', F.lit(1)) \
    .groupby('pvt_value') \
    .pivot('listed_days_bin') \
    .agg(F.sum('users_count')).drop('pvt_value')
The new DataFrame should look like:
+----+---+---+
| 0 | 1 | 2 | # Columns
+----+---+---+
| 3| 13| 14| # Users over the bin
+----+---+---+
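The temporary column is not strictly required; an empty groupBy should give the same shape (a sketch):
import pyspark.sql.functions as F

df.groupBy().pivot('listed_days_bin').agg(F.sum('users_count')).show()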
I found PySpark too complicated for transposing, so I just convert my dataframe to Pandas, use the transpose() method, and convert the dataframe back to PySpark if required. Note that toPandas() collects all the data to the driver, so this only works for data that fits in driver memory.
dfOutput = spark.createDataFrame(dfPySpark.toPandas().transpose())
dfOutput.display()