Spark union of multiple RDDs - python

In my Pig code I do this:
all_combined = UNION relation1, relation2,
    relation3, relation4, relation5, relation6;
I want to do the same with Spark. However, unfortunately, I see that I have to keep doing it pairwise:
first = rdd1.union(rdd2)
second = first.union(rdd3)
third = second.union(rdd4)
# .... and so on
Is there a union operator that will let me operate on multiple RDDs at a time?
e.g. union(rdd1, rdd2, rdd3, rdd4, rdd5, rdd6)
It is a matter of convenience.

If these are RDDs you can use the SparkContext.union method:
rdd1 = sc.parallelize([1, 2, 3])
rdd2 = sc.parallelize([4, 5, 6])
rdd3 = sc.parallelize([7, 8, 9])
rdd = sc.union([rdd1, rdd2, rdd3])
rdd.collect()
## [1, 2, 3, 4, 5, 6, 7, 8, 9]
There is no DataFrame equivalent but it is just a matter of a simple one-liner:
from functools import reduce # For Python 3.x
from pyspark.sql import DataFrame
def unionAll(*dfs):
    return reduce(DataFrame.unionAll, dfs)
df1 = sqlContext.createDataFrame([(1, "foo1"), (2, "bar1")], ("k", "v"))
df2 = sqlContext.createDataFrame([(3, "foo2"), (4, "bar2")], ("k", "v"))
df3 = sqlContext.createDataFrame([(5, "foo3"), (6, "bar3")], ("k", "v"))
unionAll(df1, df2, df3).show()
## +---+----+
## | k| v|
## +---+----+
## | 1|foo1|
## | 2|bar1|
## | 3|foo2|
## | 4|bar2|
## | 5|foo3|
## | 6|bar3|
## +---+----+
If the number of DataFrames is large, using SparkContext.union on the underlying RDDs and recreating the DataFrame may be a better choice to avoid issues related to the cost of preparing an execution plan:
def unionAll(*dfs):
    first, *_ = dfs  # Python 3.x, for 2.x you'll have to unpack manually
    return first.sql_ctx.createDataFrame(
        first.sql_ctx._sc.union([df.rdd for df in dfs]),
        first.schema
    )
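As a quick sanity check, this RDD-based variant is called exactly like the reduce-based one. A minimal usage sketch, reusing the df1, df2, df3 frames from the earlier example:
# Sketch: same call signature as the reduce-based unionAll above
combined = unionAll(df1, df2, df3)
combined.show()  # should print the same six rows as the earlier example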

You can also use addition for a UNION between RDDs:
rdd = sc.parallelize([1, 1, 2, 3])
(rdd + rdd).collect()
## [1, 1, 2, 3, 1, 1, 2, 3]
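Because + on RDDs is just union under the hood, you can also fold a whole list of RDDs with plain Python. A small sketch, assuming rdds is a non-empty list of RDDs:
# Sketch: the built-in sum folds the list via RDD.__add__, i.e. repeated unions
rdds = [sc.parallelize([1, 2]), sc.parallelize([3]), sc.parallelize([4, 5])]
combined = sum(rdds[1:], rdds[0])  # equivalent to chaining .union() calls
combined.collect()
## [1, 2, 3, 4, 5]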

Unfortunately it's the only way to UNION tables in Spark. However, instead of
first = rdd1.union(rdd2)
second = first.union(rdd3)
third = second.union(rdd4)
...
you can perform it in a slightly cleaner way like this:
result = rdd1.union(rdd2).union(rdd3).union(rdd4)
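If the RDDs are already collected in a list, the same chain can be written with functools.reduce. This is just a sketch of the chaining idea above, assuming rdds holds the RDDs to combine:
from functools import reduce

# Sketch: fold the list with pairwise unions
rdds = [rdd1, rdd2, rdd3, rdd4]
result = reduce(lambda a, b: a.union(b), rdds)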

Z3 optimize by index not a value

With great respect to the answer of @alias there (Find minimum sum), I would like to solve a similar puzzle. There are 4 agents and 4 types of work. Each agent does a given job at some price (see the initial matrix in the code). I need to find the optimal allocation of agents to particular jobs. The following code is almost a copy-paste of the mentioned answer:
import itertools
from z3 import *

initial = (  # Row - agent, Column - work
    (7, 7, 3, 6),
    (4, 9, 5, 4),
    (5, 5, 4, 5),
    (6, 4, 7, 2)
)

opt = Optimize()
agent = [Int(f"a_{i}") for i, _ in enumerate(initial)]
opt.add(And(*(a != b for a, b in itertools.combinations(agent, 2))))
for w, row in zip(agent, initial):
    opt.add(Or(*[w == val for val in row]))

minTotal = Int("minTotal")
opt.add(minTotal == sum(agent))
opt.minimize(minTotal)
print(opt.check())
print(opt.model())
The mathematically correct answer, [a_2 = 4, a_1 = 5, a_3 = 2, a_0 = 3, minTotal = 14], does not work for me, because I need to get the indices instead.
Now, my question: how do I rework the code to optimize by indexes instead of values? I've tried to leverage Array but have no idea how to minimize multiple sums.
You can simply keep track of the indexes and walk each row to pick the corresponding element. Note that the itertools.combinations can be replaced by Distinct. We also add extra checks to make sure the indices are between 1 and 4 to ensure there's no out-of-bounds access:
from z3 import *

initial = (  # Row - agent, Column - work
    (7, 7, 3, 6),
    (4, 9, 5, 4),
    (5, 5, 4, 5),
    (6, 4, 7, 2)
)

opt = Optimize()

def choose(i, vs):
    if vs:
        return If(i == 1, vs[0], choose(i - 1, vs[1:]))
    else:
        return 0

agent = [Int(f"a_{i}") for i, _ in enumerate(initial)]
opt.add(Distinct(*agent))
for a, row in zip(agent, initial):
    opt.add(a >= 1)
    opt.add(a <= 4)
    opt.add(Or(*[choose(a, row) == val for val in row]))

minTotal = Int("minTotal")
opt.add(minTotal == sum(choose(a, row) for a, row in zip(agent, initial)))
opt.minimize(minTotal)
print(opt.check())
print(opt.model())
This prints:
sat
[a_1 = 1, a_0 = 3, a_2 = 2, a_3 = 4, minTotal = 14]
which I believe is what you're looking for.
Note that z3 also supports arrays, which you can use for this problem. However, in SMTLib, arrays are not "bounded" like in programming languages. They're indexed by all elements of their domain type. So, you won't get much of a benefit from doing that, and the above formulation seems to be the most straightforward.
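For completeness, here is a rough sketch of what that Array-based formulation could look like; the names idx, rows and total are made up for illustration, and as noted above it buys you little over the If-chain:
from z3 import *

initial = ((7, 7, 3, 6),
           (4, 9, 5, 4),
           (5, 5, 4, 5),
           (6, 4, 7, 2))

opt = Optimize()

# One symbolic column index per agent, constrained to the valid range and all distinct
idx = [Int(f"i_{a}") for a in range(4)]
opt.add(Distinct(*idx))
for i in idx:
    opt.add(i >= 0, i < 4)

# Model each row of the cost matrix as an SMT array and pin down its known entries
rows = [Array(f"row_{a}", IntSort(), IntSort()) for a in range(4)]
for a, row in enumerate(initial):
    for c, val in enumerate(row):
        opt.add(rows[a][c] == val)

# Minimize the sum of the selected entries
total = Int("total")
opt.add(total == Sum([rows[a][idx[a]] for a in range(4)]))
opt.minimize(total)

print(opt.check())
print(opt.model())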

How to coalesce multiple pyspark arrays?

I have an arbitrary number of arrays of equal length in a PySpark DataFrame. I need to coalesce these, element by element, into a single list. The problem with coalesce is that it doesn't work by element, but rather selects the entire first non-null array. Any suggestions for how to accomplish this would be appreciated. Please see the test case below for an example of expected input and output:
def test_coalesce_elements():
    """
    Test array coalescing on a per-element basis
    """
    from pyspark.sql import SparkSession
    import pyspark.sql.types as t
    import pyspark.sql.functions as f

    spark = SparkSession.builder.getOrCreate()
    data = [
        {
            "a": [None, 1, None, None],
            "b": [2, 3, None, None],
            "c": [5, 6, 7, None],
        }
    ]
    schema = t.StructType([
        t.StructField('a', t.ArrayType(t.IntegerType())),
        t.StructField('b', t.ArrayType(t.IntegerType())),
        t.StructField('c', t.ArrayType(t.IntegerType())),
    ])
    df = spark.createDataFrame(data, schema)

    # Inspect schema
    df.printSchema()
    # root
    #  |-- a: array (nullable = true)
    #  |    |-- element: integer (containsNull = true)
    #  |-- b: array (nullable = true)
    #  |    |-- element: integer (containsNull = true)
    #  |-- c: array (nullable = true)
    #  |    |-- element: integer (containsNull = true)

    # Inspect df values
    df.show(truncate=False)
    # +---------------------+------------------+---------------+
    # |a                    |b                 |c              |
    # +---------------------+------------------+---------------+
    # |[null, 1, null, null]|[2, 3, null, null]|[5, 6, 7, null]|
    # +---------------------+------------------+---------------+

    # This obviously does not work, but hopefully provides the general idea
    # Remember: this will need to work with an arbitrary and dynamic set of columns
    input_cols = ['a', 'b', 'c']
    df = df.withColumn('d', f.coalesce(*[f.col(i) for i in input_cols]))

    # This is the expected output I would like to see for the given inputs
    assert df.collect()[0]['d'] == [2, 1, 7, None]
Thanks in advance for any ideas!
Well, as Derek and the OP have said, Derek's answer works, but it would be better to avoid using UDFs, so here is a way to accomplish it natively:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Give it any static value, as we just want a row number for all the rows present in the DataFrame
w = Window().orderBy(F.lit('A'))

# Will be used later to join df with the second df containing the calculated "d" column
df = df.withColumn("row_num", F.row_number().over(w))
print("DF:")
df.show(truncate=False)

# Input columns
input_cols = ['a', 'b', 'c']

# Zip all the arrays using arrays_zip
# Explode the zipped array
# Create new columns from the exploded zipped array to get single values
# Coalesce to get the first non-null value
# Group by row_num, as we want to bring all the values back into one array
# Convert "d" to an array before collect_list (which ignores nulls), then flatten the nested array into one flat array
df_2 = df.withColumn("new", F.arrays_zip(*input_cols)) \
    .withColumn("new", F.explode("new")) \
    .select("row_num", *[F.col(f"new.{i}").alias(f"new_{i}") for i in input_cols]) \
    .withColumn("d", F.coalesce(*[F.col(f"new_{i}") for i in input_cols])) \
    .groupBy("row_num") \
    .agg(F.flatten(F.collect_list(F.array("d"))).alias("d"))

print("Second DF:")
df_2.show(truncate=False)

# Join based on row_num
final_df = df.join(df_2, df["row_num"] == df_2["row_num"], "inner") \
    .drop("row_num")

# voilà
print("Final DF:")
final_df.show(truncate=False)

assert final_df.collect()[0]["d"] == [2, 1, 7, None]
DF:
+---------------------+------------------+---------------+-------+
|a |b |c |row_num|
+---------------------+------------------+---------------+-------+
|[null, 1, null, null]|[2, 3, null, null]|[5, 6, 7, null]|1 |
+---------------------+------------------+---------------+-------+
Second DF:
+-------+---------------+
|row_num|d |
+-------+---------------+
|1 |[2, 1, 7, null]|
+-------+---------------+
Final DF:
+---------------------+------------------+---------------+---------------+
|a |b |c |d |
+---------------------+------------------+---------------+---------------+
|[null, 1, null, null]|[2, 3, null, null]|[5, 6, 7, null]|[2, 1, 7, null]|
+---------------------+------------------+---------------+---------------+
Although it would be ideal, I am not sure there is an elegant way to do this using only PySpark functions.
What I did is write a UDF that takes in a variable number of columns (using *args, which you can read about here) and returns an array of integers.
@f.udf(returnType=t.ArrayType(t.IntegerType()))
def get_array_non_null_first_element(*args):
    data_array = [item for item in args]
    array_lengths = [len(array) for array in data_array]
    ## check that all of the arrays have the same length
    assert(len(set(array_lengths)) == 1)
    ## if they do, then you can set the array length
    array_length = array_lengths[0]
    first_value_array = []
    for i in range(array_length):
        element_array = [array[i] for array in data_array]
        value = None
        for x in element_array:
            if x is not None:
                value = x
                break
            else:
                continue
        first_value_array.append(value)
    return first_value_array
Then create a new column d by applying this udf to whichever columns you like:
df.withColumn("d", get_array_non_null_first_element(F.col('a'), F.col('b'), F.col('c'))).show()
+--------------------+------------------+---------------+---------------+
| a| b| c| d|
+--------------------+------------------+---------------+---------------+
|[null, 1, null, n...|[2, 3, null, null]|[5, 6, 7, null]|[2, 1, 7, null]|
+--------------------+------------------+---------------+---------------+
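Since the question asks for an arbitrary, dynamic set of columns, the same UDF can also be applied by unpacking a column list; a small usage sketch reusing input_cols from the question:
input_cols = ['a', 'b', 'c']
df.withColumn("d", get_array_non_null_first_element(*[F.col(c) for c in input_cols])).show()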
Thanks to Derek and Tushar for their responses! I was able to use them as a basis to solve the issue without a UDF, join, or explode.
Generally speaking, joins are unfavorable because they are computationally and memory intensive, UDFs can be computationally expensive, and explode can be memory intensive. Fortunately, we can avoid all of these using transform:
from typing import List, Union

import pyspark.sql.functions as f
from pyspark.sql import Column, DataFrame


def add_coalesced_list_by_elements_col(
    df: DataFrame,
    cols: List[Union[Column, str]],
    col_name: str,
) -> DataFrame:
    """
    Adds a new column representing a list that is collected by element from the
    input set. Please note that this does not check that all provided columns
    are of equal length.

    Args:
        df: Input DataFrame to add column to
        cols: List of columns to collect by element. All columns should be of equal length.
        col_name: The name of the new column

    Returns:
        DataFrame with result added as a new column.
    """
    return (
        df
        .withColumn(
            col_name,
            f.transform(
                # Doesn't matter which col, we don't use this val
                cols[0],
                # We use i + 1 because SQL array index starts at 1, while transform index starts at 0
                lambda _, i: f.coalesce(*[f.element_at(c, i + 1) for c in cols]),
            )
        )
    )
def test_collect_list_elements():
    from typing import List
    import pyspark.sql.functions as f
    import pyspark.sql.types as t
    from pyspark.sql import SparkSession, DataFrame, Column, Window

    # Arrange
    spark = SparkSession.builder.getOrCreate()
    data = [
        {
            "id": 1,
            "a": [None, 1, None, None],
            "b": [2, 3, None, None],
            "c": [5, 6, 7, None],
        }
    ]
    schema = t.StructType(
        [
            t.StructField("id", t.IntegerType()),
            t.StructField("a", t.ArrayType(t.IntegerType())),
            t.StructField("b", t.ArrayType(t.IntegerType())),
            t.StructField("c", t.ArrayType(t.IntegerType())),
        ]
    )
    df = spark.createDataFrame(data, schema)

    # Act
    df = add_coalesced_list_by_elements_col(df=df, cols=["a", "b", "c"], col_name="d")

    # Assert new col is correct output
    assert df.collect()[0]["d"] == [2, 1, 7, None]
    # Assert all the other cols are not affected
    assert df.collect()[0]["a"] == [None, 1, None, None]
    assert df.collect()[0]["b"] == [2, 3, None, None]
    assert df.collect()[0]["c"] == [5, 6, 7, None]

How do I get the index-wise average of an array column with PySpark

I have a df with a column fftAbs (absolute values acquired after an FFT). The type of df['fftAbs'] is an ArrayType(DoubleType()). I want to get the index-wise average of all the values.
So if the column holds
// Random data
||fftAbs ||
------------
|[0, 1, 2] |
|[2, 3, 12]|
|[1, 8, 4] |
I want to acquire a list like [1, 4, 6] (because (0+2+1)/3 = 1, (1+3+8)/3 = 4, (2+12+4)/3 = 6).
I've tried using:
import pyspark.sql.functions as F
avgDf = df.select(F.avg('fftAbs'))
but I'll get AnalysisException: cannot resolve 'avg(fftAbs)' due to data type mismatch: function average requires numeric or interval types, not array<double>;
EDIT:
I also tried
def _index_avg(twoDL):
    return np.mean(twoDL)
spark_index_avg = F.udf(_index_avg, T.ArrayType(T.DoubleType(), False))
avgDf = df.agg(spark_index_avg(F.collect_list('fftAbs')))
but then I got net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype). This happens when an unsupported/unregistered class is being unpickled that requires construction arguments. Fix it by registering a custom IObjectConstructor for this class.
Just for reference, my complete code is here (except for the first query):
import numpy as np
import pyspark as ps
import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark.sql.functions import col
def _rfft(x):
    transformed = np.fft.rfft(x)
    return map(lambda c: (c.real, c.imag), transformed.tolist())

spark_complexType = T.ArrayType(T.StructType([
    T.StructField("real", T.DoubleType(), False),
    T.StructField("imag", T.DoubleType(), False),
]), False)
spark_rfft = F.udf(_rfft, spark_complexType)

def _fft_bins(size, periodMicroSeconds):
    return np.fft.rfftfreq(size, d=(periodMicroSeconds / 10**6)).tolist()

spark_rfft_bins = F.udf(_fft_bins, T.ArrayType(T.DoubleType(), False))

def _abs_complex(complex_tuple_list):
    return list([abs(complex(real, imag)) for real, imag in complex_tuple_list])

spark_abs_complex = F.udf(_abs_complex, T.ArrayType(T.DoubleType()))

# df incoming from builder
fftDf = df.withColumn('fft', spark_rfft(col('data'))) \
    .withColumn('fftFreq', spark_rfft_bins('dataDim', 'samplePeriod')) \
    .withColumn('fftAbs', spark_abs_complex('fft'))
avgDf = fftDf.select(F.avg('fftAbs'))
What about using posexplode?
import pyspark.sql.functions as f

df = spark.createDataFrame([[[0, 1, 2]], [[2, 3, 12]], [[1, 8, 4]]]).toDF('array')
df.select(f.posexplode('array')) \
    .groupBy('pos') \
    .agg(f.avg('col').alias('avg')) \
    .show(truncate=False)
+---+---+
|pos|avg|
+---+---+
|1 |4.0|
|2 |6.0|
|0 |1.0|
+---+---+
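If you need the result as a single Python list such as [1.0, 4.0, 6.0] rather than a pos/avg table, one way (a small sketch) is to order by pos and collect on the driver:
rows = (df.select(f.posexplode('array'))
          .groupBy('pos')
          .agg(f.avg('col').alias('avg'))
          .orderBy('pos')
          .collect())
avg_list = [r['avg'] for r in rows]
## [1.0, 4.0, 6.0]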
If you have another field, you can use a Window, but there can be a performance issue.
from pyspark.sql.window import Window

df = spark.createDataFrame([['a', [0, 1, 2]], ['b', [2, 3, 12]], ['b', [1, 8, 4]]], ['col1', 'array'])
w = Window.partitionBy(f.lit(1))
df.withColumn('avg', f.array(*[f.avg(f.col('array')[i]).over(w) for i in range(0, 3)])) \
    .show(truncate=False)
+----+----------+---------------+
|col1|array |avg |
+----+----------+---------------+
|a |[0, 1, 2] |[1.0, 4.0, 6.0]|
|b |[2, 3, 12]|[1.0, 4.0, 6.0]|
|b |[1, 8, 4] |[1.0, 4.0, 6.0]|
+----+----------+---------------+
I found the solution:
def _index_avg(twoDL):
    return np.mean(twoDL, axis=0).tolist()
spark_index_avg = F.udf(_index_avg, T.ArrayType(T.DoubleType(), False))
avgDf = df.agg(spark_index_avg(F.collect_list('fftAbs')))
I'm sure someone will come along with a better answer, but this seems to work for now.
EDIT: Yeah this didn't work for larger sets of data
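If the UDF + collect_list approach falls over on bigger data, another option worth trying (a sketch, assuming Spark 3.1+ for array_to_vector) is to convert the array column to an ML vector and let Summarizer compute the element-wise mean in a distributed aggregation:
from pyspark.ml.functions import array_to_vector
from pyspark.ml.stat import Summarizer
import pyspark.sql.functions as F

# Sketch: element-wise mean via the ML Summarizer instead of numpy on the driver
mean_vec = df.select(array_to_vector(F.col('fftAbs')).alias('v')) \
    .select(Summarizer.mean(F.col('v')).alias('mean')) \
    .first()['mean']
avg_list = mean_vec.toArray().tolist()  # index-wise averages as a plain Python list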

PySpark testing: construct test data consisting array of structs

I'd like to generate some test data for my unit tests in PySpark. One of the fields in the input Row is an array of structs: basket: array<struct<price:bigint,product_id:string>>. What's the best way to achieve this?
Here is one way using python and two helper functions responsible for generating the random data:
from pyspark.sql.types import *
from random import randrange, uniform

array_size = 2

def create_row(array_size):
    return ([{"price": uniform(1.0, 100.0), "product_id": randrange(10) + 1} for _ in range(array_size)],)

def generate_data(data_size):
    return [create_row(array_size) for _ in range(data_size)]

# create 5 rows
rows = generate_data(5)

# string schema
schema = "basket: array<struct<price:double,product_id:string>>"

# static typing schema
# schema = StructType([
#     StructField('basket',
#         ArrayType(
#             StructType([
#                 StructField('price', DoubleType()),
#                 StructField('product_id', StringType()),
#             ])
#         )
#     )
# ])

df = spark.createDataFrame(rows, schema)
df.show(10, False)
# +--------------------------------------------------+
# |basket |
# +--------------------------------------------------+
# |[[61.40674765573896, 9], [5.994467505720648, 7]] |
# |[[1.1388272509974906, 10], [47.32070824053193, 3]]|
# |[[42.423106687845795, 2], [70.99107361888588, 4]] |
# |[[50.019594333009806, 8], [63.51239439900147, 4]] |
# |[[68.15711374321089, 9], [70.06617125228864, 10]] |
# +--------------------------------------------------+
create_row: generates a new row (represented as a tuple here) with array_size items. price will have a value in the range 1.0 - 100.0 and product_id in the range 1 - 10; please feel free to modify the boundaries accordingly. Also, we handle each item of the array (product_id-price pairs) with a Python dictionary.
generate_data: calls create_row data_size times and returns the randomly generated rows in a list.
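Since this is meant for unit tests, you may also want the generated rows to be reproducible; a small tweak (my own suggestion, not part of the original answer) is to seed the random module first:
import random

random.seed(42)          # makes uniform/randrange deterministic across test runs
rows = generate_data(5)  # same 5 rows every time the test executes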

PySpark's reduceByKey not working as expected

I'm writing a large PySpark program and I've recently run into trouble when using reduceByKey on an RDD. I've been able to recreate the problem with a simple test program. The code is:
from pyspark import SparkConf, SparkContext

APP_NAME = 'Test App'

def main(sc):
    test = [(0, [i]) for i in xrange(100)]
    test = sc.parallelize(test)
    test = test.reduceByKey(method)
    print test.collect()

def method(x, y):
    x.append(y[0])
    return x

if __name__ == '__main__':
    # Configure Spark
    conf = SparkConf().setAppName(APP_NAME)
    conf = conf.setMaster('local[*]')
    sc = SparkContext(conf=conf)
    main(sc)
I would expect the output to be (0, [0,1,2,3,4,...,98,99]) based on the Spark documentation. Instead, I get the following output:
[(0, [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 24, 36, 48, 60, 72, 84])]
Could someone please help me understand why this output is being generated?
As a side note, when I use
def method(x, y):
    x = x + y
    return x
I get the expected output.
First of all it looks like you actually want groupByKey not reduceByKey:
rdd = sc.parallelize([(0, i) for i in xrange(100)])
grouped = rdd.groupByKey()
k, vs = grouped.first()
assert len(list(vs)) == 100
Could someone please help me understand why this output is being generated?
reduceByKey assumes that f is associative, and your method clearly is not. Depending on the order of operations, the output is different. Let's say you start with the following data for a certain key:
[1], [2], [3], [4]
Now let's add some parentheses:
((([1], [2]), [3]), [4])
(([1, 2], [3]), [4])
([1, 2, 3], [4])
[1, 2, 3, 4]
and with another set of parentheses:
(([1], ([2], [3])), [4])
(([1], [2, 3]), [4])
([1, 2], [4])
[1, 2, 4]
When you rewrite it as follows:
method = lambda x, y: x + y
or simply
from operator import add
method = add
you get an associative function and it works as expected.
Generally speaking for reduce* operations you want functions which are both associative and commutative.
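If the goal really is to end up with one list of values per key, a couple of associative alternatives to the original method (sketched here in Python 3 syntax, unlike the Python 2 code in the question) are:
rdd = sc.parallelize([(0, i) for i in range(100)])

# Wrap each value in a list so that plain list concatenation (which is associative) does the work
lists = rdd.mapValues(lambda v: [v]).reduceByKey(lambda x, y: x + y)

# Or use aggregateByKey with separate "add one value" and "merge two lists" functions
lists2 = rdd.aggregateByKey([], lambda acc, v: acc + [v], lambda a, b: a + b)

assert sorted(lists.first()[1]) == list(range(100))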
