I have an arbitrary number of arrays of equal length in a PySpark DataFrame. I need to coalesce these, element by element, into a single list. The problem with coalesce is that it doesn't work by element, but rather selects the entire first non-null array. Any suggestions for how to accomplish this would be appreciated. Please see the test case below for an example of expected input and output:
def test_coalesce_elements():
    """
    Test array coalescing on a per-element basis
    """
    from pyspark.sql import SparkSession
    import pyspark.sql.types as t
    import pyspark.sql.functions as f

    spark = SparkSession.builder.getOrCreate()

    data = [
        {
            "a": [None, 1, None, None],
            "b": [2, 3, None, None],
            "c": [5, 6, 7, None],
        }
    ]

    schema = t.StructType([
        t.StructField('a', t.ArrayType(t.IntegerType())),
        t.StructField('b', t.ArrayType(t.IntegerType())),
        t.StructField('c', t.ArrayType(t.IntegerType())),
    ])

    df = spark.createDataFrame(data, schema)

    # Inspect schema
    df.printSchema()
    # root
    #  |-- a: array (nullable = true)
    #  |    |-- element: integer (containsNull = true)
    #  |-- b: array (nullable = true)
    #  |    |-- element: integer (containsNull = true)
    #  |-- c: array (nullable = true)
    #  |    |-- element: integer (containsNull = true)

    # Inspect df values
    df.show(truncate=False)
    # +---------------------+------------------+---------------+
    # |a                    |b                 |c              |
    # +---------------------+------------------+---------------+
    # |[null, 1, null, null]|[2, 3, null, null]|[5, 6, 7, null]|
    # +---------------------+------------------+---------------+

    # This obviously does not work, but hopefully provides the general idea.
    # Remember: this will need to work with an arbitrary and dynamic set of columns.
    input_cols = ['a', 'b', 'c']
    df = df.withColumn('d', f.coalesce(*[f.col(i) for i in input_cols]))

    # This is the expected output I would like to see for the given inputs
    assert df.collect()[0]['d'] == [2, 1, 7, None]
Thanks in advance for any ideas!
Well, as Derek and the OP have said, Derek's answer works, but it is better to avoid UDFs where possible, so here is a way to accomplish it natively:
from pyspark.sql.window import Window
import pyspark.sql.functions as F

# Order by a static literal value, as we just want a row number for every row in the DataFrame
w = Window().orderBy(F.lit('A'))

# Will be used later to join df with the second df containing the calculated "d" column
df = df.withColumn("row_num", F.row_number().over(w))

print("DF:")
df.show(truncate=False)

# Input columns
input_cols = ['a', 'b', 'c']

# Zip all the arrays using arrays_zip
# Explode the zipped array
# Create new columns from the exploded zipped array to get single values
# Coalesce to get the first non-null value
# Group by row_num, as we want to bring all the values back into one array
# First convert to array before using collect_list, as collect_list ignores "null" values,
# then flatten the nested array to get one single flat array
df_2 = df.withColumn("new", F.arrays_zip(*input_cols)) \
    .withColumn("new", F.explode("new")) \
    .select("row_num", *[F.col(f"new.{i}").alias(f"new_{i}") for i in input_cols]) \
    .withColumn("d", F.coalesce(*[F.col(f"new_{i}") for i in input_cols])) \
    .groupBy("row_num") \
    .agg(F.flatten(F.collect_list(F.array("d"))).alias("d"))

print("Second DF:")
df_2.show(truncate=False)

# Join based on the row_num
final_df = df.join(df_2, df["row_num"] == df_2["row_num"], "inner") \
    .drop("row_num")

# voilà
print("Final DF:")
final_df.show(truncate=False)

assert final_df.collect()[0]["d"] == [2, 1, 7, None]
DF:
+---------------------+------------------+---------------+-------+
|a |b |c |row_num|
+---------------------+------------------+---------------+-------+
|[null, 1, null, null]|[2, 3, null, null]|[5, 6, 7, null]|1 |
+---------------------+------------------+---------------+-------+
Second DF:
+-------+---------------+
|row_num|d |
+-------+---------------+
|1 |[2, 1, 7, null]|
+-------+---------------+
Final DF:
+---------------------+------------------+---------------+---------------+
|a |b |c |d |
+---------------------+------------------+---------------+---------------+
|[null, 1, null, null]|[2, 3, null, null]|[5, 6, 7, null]|[2, 1, 7, null]|
+---------------------+------------------+---------------+---------------+
Although it would be ideal, I am not sure there is an elegant way to do this using only PySpark functions.
What I did is write a UDF that takes in a variable number of columns (using *args) and returns an array of integers.
@f.udf(returnType=t.ArrayType(t.IntegerType()))
def get_array_non_null_first_element(*args):
    data_array = [item for item in args]
    array_lengths = [len(array) for array in data_array]
    ## check that all of the arrays have the same length
    assert len(set(array_lengths)) == 1
    ## if they do, then you can set the array length
    array_length = array_lengths[0]
    first_value_array = []
    for i in range(array_length):
        element_array = [array[i] for array in data_array]
        value = None
        for x in element_array:
            if x is not None:
                value = x
                break
            else:
                continue
        first_value_array.append(value)
    return first_value_array
Then create a new column d by applying this udf to whichever columns you like:
df.withColumn("d", get_array_non_null_first_element(F.col('a'), F.col('b'), F.col('c'))).show()
+--------------------+------------------+---------------+---------------+
| a| b| c| d|
+--------------------+------------------+---------------+---------------+
|[null, 1, null, n...|[2, 3, null, null]|[5, 6, 7, null]|[2, 1, 7, null]|
+--------------------+------------------+---------------+---------------+
Thanks to Derek and Tushar for their responses! I was able to use them as a basis to solve the issue without a UDF, join, or explode.
Generally speaking, joins are unfavorable because they are computationally and memory expensive, UDFs can be computationally intensive, and explode can be memory intensive. Fortunately, we can avoid all of these by using transform:
from typing import List, Union

import pyspark.sql.functions as f
from pyspark.sql import Column, DataFrame


def add_coalesced_list_by_elements_col(
    df: DataFrame,
    cols: List[Union[Column, str]],
    col_name: str,
) -> DataFrame:
    """
    Adds a new column representing a list that is collected by element from the
    input set. Please note that this does not check that all provided columns
    are of equal length.

    Args:
        df: Input DataFrame to add the column to
        cols: List of columns to collect by element. All columns should be of equal length.
        col_name: The name of the new column

    Returns:
        DataFrame with the result added as a new column.
    """
    return (
        df
        .withColumn(
            col_name,
            f.transform(
                # Doesn't matter which col, we don't use this val
                cols[0],
                # We use i + 1 because SQL array indices start at 1, while transform indices start at 0
                lambda _, i: f.coalesce(*[f.element_at(c, i + 1) for c in cols]),
            )
        )
    )
def test_collect_list_elements():
    from typing import List
    import pyspark.sql.functions as f
    import pyspark.sql.types as t
    from pyspark.sql import SparkSession, DataFrame, Column, Window

    # Arrange
    spark = SparkSession.builder.getOrCreate()
    data = [
        {
            "id": 1,
            "a": [None, 1, None, None],
            "b": [2, 3, None, None],
            "c": [5, 6, 7, None],
        }
    ]
    schema = t.StructType(
        [
            t.StructField("id", t.IntegerType()),
            t.StructField("a", t.ArrayType(t.IntegerType())),
            t.StructField("b", t.ArrayType(t.IntegerType())),
            t.StructField("c", t.ArrayType(t.IntegerType())),
        ]
    )
    df = spark.createDataFrame(data, schema)

    # Act
    df = add_coalesced_list_by_elements_col(df=df, cols=["a", "b", "c"], col_name="d")

    # Assert new col is correct output
    assert df.collect()[0]["d"] == [2, 1, 7, None]
    # Assert all the other cols are not affected
    assert df.collect()[0]["a"] == [None, 1, None, None]
    assert df.collect()[0]["b"] == [2, 3, None, None]
    assert df.collect()[0]["c"] == [5, 6, 7, None]
Related
I have a df with a column fftAbs (absolute values acquired after an FFT). The type of df['fftAbs'] is an ArrayType(DoubleType()). I want to get the index-wise average of all the values.
So if the column holds
// Random data
||fftAbs ||
------------
|[0, 1, 2] |
|[2, 3, 12]|
|[1, 8, 4] |
I want to acquire a list like [1, 4, 6] (because (0+2+1)/3 = 1, (1+3+8)/3 = 4, (2+12+4)/3 = 6).
I've tried using:
import pyspark.sql.functions as F
avgDf = df.select(F.avg('fftAbs'))
but I'll get AnalysisException: cannot resolve 'avg(fftAbs)' due to data type mismatch: function average requires numeric or interval types, not array<double>;
EDIT:
I also tried
def _index_avg(twoDL):
    return np.mean(twoDL)

spark_index_avg = F.udf(_index_avg, T.ArrayType(T.DoubleType(), False))
avgDf = df.agg(spark_index_avg(F.collect_list('fftAbs')))
but then I got net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype). This happens when an unsupported/unregistered class is being unpickled that requires construction arguments. Fix it by registering a custom IObjectConstructor for this class.
Just for reference, my complete code is here (except for the first query):
import numpy as np
import pyspark as ps
import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark.sql.functions import col

def _rfft(x):
    transformed = np.fft.rfft(x)
    return map(lambda c: (c.real, c.imag), transformed.tolist())

spark_complexType = T.ArrayType(T.StructType([
    T.StructField("real", T.DoubleType(), False),
    T.StructField("imag", T.DoubleType(), False),
]), False)
spark_rfft = F.udf(_rfft, spark_complexType)

def _fft_bins(size, periodMicroSeconds):
    return np.fft.rfftfreq(size, d=(periodMicroSeconds / 10**6)).tolist()

spark_rfft_bins = F.udf(_fft_bins, T.ArrayType(T.DoubleType(), False))

def _abs_complex(complex_tuple_list):
    return list([abs(complex(real, imag)) for real, imag in complex_tuple_list])

spark_abs_complex = F.udf(_abs_complex, T.ArrayType(T.DoubleType()))

# df incoming from builder
fftDf = df.withColumn('fft', spark_rfft(col('data'))) \
    .withColumn('fftFreq', spark_rfft_bins('dataDim', 'samplePeriod')) \
    .withColumn('fftAbs', spark_abs_complex('fft'))

avgDf = fftDf.select(F.avg('fftAbs'))
What about using posexplode?
import pyspark.sql.functions as f

df = spark.createDataFrame([[[0, 1, 2]], [[2, 3, 12]], [[1, 8, 4]]]).toDF('array')

df.select(f.posexplode('array')) \
    .groupBy('pos') \
    .agg(f.avg('col').alias('avg')) \
    .show(truncate=False)
+---+---+
|pos|avg|
+---+---+
|1 |4.0|
|2 |6.0|
|0 |1.0|
+---+---+
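If you need the result back as a single ordered list rather than one row per index, a minimal follow-up sketch (assuming the same df and the f alias above) is to sort by pos before collecting:
rows = df.select(f.posexplode('array')) \
    .groupBy('pos') \
    .agg(f.avg('col').alias('avg')) \
    .orderBy('pos') \
    .collect()
avg_list = [row['avg'] for row in rows]
# [1.0, 4.0, 6.0]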
If you have other fields, you can use a Window instead, but there can be a performance issue.
from pyspark.sql import Window

df = spark.createDataFrame([['a', [0, 1, 2]], ['b', [2, 3, 12]], ['b', [1, 8, 4]]], ['col1', 'array'])
w = Window.partitionBy(f.lit(1))

df.withColumn('avg', f.array(*[f.avg(f.col('array')[i]).over(w) for i in range(0, 3)])) \
    .show(truncate=False)
+----+----------+---------------+
|col1|array |avg |
+----+----------+---------------+
|a |[0, 1, 2] |[1.0, 4.0, 6.0]|
|b |[2, 3, 12]|[1.0, 4.0, 6.0]|
|b |[1, 8, 4] |[1.0, 4.0, 6.0]|
+----+----------+---------------+
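The range(0, 3) above hard-codes the array length. If it is not known up front, one hedged tweak is to read it from a sample row first, reusing f and w from above and assuming every array has the same length:
# Read the array length from one row; assumes all arrays share the same length.
n = len(df.first()['array'])
df.withColumn('avg', f.array(*[f.avg(f.col('array')[i]).over(w) for i in range(n)])) \
    .show(truncate=False)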
I found the solution:
def _index_avg(twoDL):
    return np.mean(twoDL, axis=0).tolist()

spark_index_avg = F.udf(_index_avg, T.ArrayType(T.DoubleType(), False))
avgDf = df.agg(spark_index_avg(F.collect_list('fftAbs')))
I'm sure someone will come along with a better answer, but this seems to work for now.
EDIT: Yeah this didn't work for larger sets of data
I have a dataframe df containing 40 million rows. There is a column named group_id to specify the group identifier of a row. There are 2000 groups in total.
I would like to randomly label the elements in each group and add this information to a column batch of df. For example, if group 1 contains rows 1, 2, 3, 4, and 5, then I choose a permutation of (1, 2, 3, 4, 5), for example (5, 3, 4, 2, 1), and assign the column batch of these rows the values [5, 3, 4, 2, 1].
I defined a function func and parallelized it with dummy.Pool, but the speed is very slow. Could you suggest a faster way to do this?
import pandas as pd
import numpy as np
import random
import os
from multiprocessing import dummy
import itertools
core = os.cpu_count()
P = dummy.Pool(processes = core)
N = int(4e7)
M = int(2e3) + 1
col_1 = np.random.randint(1, M, N)
col_2 = np.random.uniform(low = 1, high = 5, size = N)
df = pd.DataFrame({'group_id': col_1, 'value': col_2})
df.sort_values(by = 'group_id', inplace = True)
df.reset_index(inplace = True, drop = True)
id_ = np.unique(df.group_id)
def func(i):
    idx = df.group_id == i
    m = sum(idx)                  # count the number of rows in the group
    r = list(range(1, m + 1, 1))  # create an enumeration
    random.shuffle(r)             # create a permutation of the enumeration
    return r
order_list = P.map(func, id_)
# merge the list containing permutations
order = list(itertools.chain.from_iterable(order_list))
df['batch'] = order
Perhaps this could solve your problem. Take a random permutation of the group size.
import numpy as np
import pandas as pd
l = np.repeat([x for x in range(2000)],20000)
df = pd.DataFrame(l, columns=['group'])
df['batch'] = df.groupby('group')['group'].transform(lambda x: np.random.permutation(np.arange(x.size)))
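np.arange(x.size) labels each group 0..n-1; if the batch labels should start at 1, as in the question's example, a hedged tweak is:
# Same idea, but permute 1..n instead of 0..n-1.
df['batch'] = df.groupby('group')['group'].transform(
    lambda x: np.random.permutation(np.arange(1, x.size + 1))
)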
I want to fill missing values like this:
data = pd.read_csv("E:\\SPEED.csv")
Data: DataFrame
Case 1
If fclass is "motorway", "motorway_link", "trunk" or "trunk_link",
I want to replace nan with 110.
Case 2
If fclass is "primary", "primary_link", "secondary" or "secondary_link",
I want to replace nan with 70.
Case 3
If fclass is any other value, I want to replace nan with 40.
I would be grateful for any help.
Two ways in pandas:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "A": [1, 2, np.nan, 4],
        "B": [1, 4, 9, np.nan],
        "C": [1, 2, 3, 5],
        "D": list("abcd"),
    }
)
fillna lets you fill NA's (or NaNs) with a fixed value (among other things):
df['B'].fillna(12)
[1, 4, 9, 12]
interpolate fills NaNs by interpolation -- linear by default (other methods use scipy):
df.interpolate()['A']
[1, 2, 3, 4]
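fillna also accepts a Series aligned on the index, which is closer to the conditional fill asked about in the question; a minimal sketch on the toy df above (the 110/70/40 values come from the question, while 'a' and 'b' are just the toy categories in column D):
# Build a per-row default from the categorical column, then fill NaNs with it.
defaults = df['D'].map({'a': 110, 'b': 70}).fillna(40)
df['A'] = df['A'].fillna(defaults)
df['B'] = df['B'].fillna(defaults)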
Thank you all for your answers. However, as there are 6812 rows and 16 columns (containing nan values) in the data, it seems that different solutions are required.
You can try this:
import pandas as pd
import math

def valuesMapper(data, valuesDict, columns_to_update):
    for i in columns_to_update:
        # Look up the default speed for the row's fclass (column name assumed from the
        # question); any class not in valuesDict falls back to 40.
        data[i] = data.apply(
            lambda row: valuesDict.get(row['fclass'], 40) if math.isnan(row[i]) else row[i],
            axis=1)
    return data

data = pd.read_csv("E:\\SPEED.csv")
valuesDict = {"motorway": 110, "motorway_link": 110, "trunk": 110, "primary": 70, "primary_link": 70, "secondary": 70, "secondary_link": 70}
# columns_to_update is the list of columns to be updated; you can get it through code,
# I didn't add that as I don't have your data.
columns_to_update = ['AGU_PZR_07_10']
print(valuesMapper(data, valuesDict, columns_to_update))
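For the "get it through code" part, a hedged sketch for building columns_to_update programmatically, assuming the speed columns are the float-typed ones and the class column is named fclass:
# Every float column (the ones that can hold NaN) except the class column itself.
columns_to_update = [c for c in data.columns
                     if c != 'fclass' and data[c].dtype.kind == 'f']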
With the below example:
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'flclass': ['a', 'b', 'c', 'a'],
    'AGU': [float('nan'), float('nan'), float('nan'), 9]
})
You can update it using numpy conditionals, iterating over your columns starting from the 2nd ([1:]) here (the 5th, [4:], in your data):
for column in data.columns[1:]:
data[column] = np.where((data['flclass'] == 'b') & (data[column].isna()), 110, data[column])
Or with pandas apply:
data['AGU'] = data.apply(
    lambda row: 110 if np.isnan(row['AGU']) and row['flclass'] in ("b", "a") else row['AGU'],
    axis=1,
)
where you can replace ("b", "a") with e.g. ("motorway", "motorway_link", "trunk", "trunk_link").
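Putting the three cases from the question together, a hedged sketch built on the same toy column names ('flclass' plus the speed columns from [1:] onward); swap in your real fclass values and column slice:
highway = ("motorway", "motorway_link", "trunk", "trunk_link")
primary = ("primary", "primary_link", "secondary", "secondary_link")

for column in data.columns[1:]:
    # Case 1: highway-like classes get 110
    data[column] = np.where(data['flclass'].isin(highway) & data[column].isna(), 110, data[column])
    # Case 2: primary/secondary classes get 70
    data[column] = np.where(data['flclass'].isin(primary) & data[column].isna(), 70, data[column])
    # Case 3: any remaining NaN (other classes) gets 40
    data[column] = data[column].fillna(40)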
I'd like to generate some test data for my unit tests in PySpark. One of the fields in the input Row is an array of structs: basket: array<struct<price:bigint,product_id:string>>. What's the best way to achieve it?
Here is one way using python and two helper functions responsible for generating the random data:
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from random import randrange, uniform

spark = SparkSession.builder.getOrCreate()

array_size = 2

def create_row(array_size):
    return ([{"price": uniform(1.0, 100.0), "product_id": randrange(10) + 1} for _ in range(array_size)],)

def generate_data(data_size):
    return [create_row(array_size) for _ in range(data_size)]

# create 5 rows
rows = generate_data(5)

# string schema
schema = "basket: array<struct<price:double,product_id:string>>"

# statically typed schema
# schema = StructType([
#     StructField('basket',
#         ArrayType(
#             StructType([
#                 StructField('price', DoubleType()),
#                 StructField('product_id', StringType()),
#             ])
#         )
#     )])

df = spark.createDataFrame(rows, schema)
df.show(10, False)
# +--------------------------------------------------+
# |basket |
# +--------------------------------------------------+
# |[[61.40674765573896, 9], [5.994467505720648, 7]] |
# |[[1.1388272509974906, 10], [47.32070824053193, 3]]|
# |[[42.423106687845795, 2], [70.99107361888588, 4]] |
# |[[50.019594333009806, 8], [63.51239439900147, 4]] |
# |[[68.15711374321089, 9], [70.06617125228864, 10]] |
# +--------------------------------------------------+
create_row: generates a new row (represented as a tuple here) with array_size items. price will have a value in the range 1.0 - 100.0 and product_id in the range 1 - 10; feel free to modify the boundaries accordingly. Each item of the array (a product_id-price pair) is handled with a python dictionary.
generate_data: calls create_row data_size times and returns the randomly generated rows in a list.
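For unit tests it often helps if the generated data is reproducible across runs; since create_row uses the random module, a hedged tweak is to seed it before generating:
import random

random.seed(42)  # makes uniform() and randrange() deterministic across test runs
rows = generate_data(5)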
In my Pig code I do this:
all_combined = UNION relation1, relation2,
               relation3, relation4, relation5, relation6;
I want to do the same with Spark. However, unfortunately, I see that I have to keep doing it pairwise:
first = rdd1.union(rdd2)
second = first.union(rdd3)
third = second.union(rdd4)
# .... and so on
Is there a union operator that will let me operate on multiple RDDs at a time?
e.g. union(rdd1, rdd2, rdd3, rdd4, rdd5, rdd6)
It is a matter of convenience.
If these are RDDs you can use SparkContext.union method:
rdd1 = sc.parallelize([1, 2, 3])
rdd2 = sc.parallelize([4, 5, 6])
rdd3 = sc.parallelize([7, 8, 9])
rdd = sc.union([rdd1, rdd2, rdd3])
rdd.collect()
## [1, 2, 3, 4, 5, 6, 7, 8, 9]
There is no DataFrame equivalent but it is just a matter of a simple one-liner:
from functools import reduce # For Python 3.x
from pyspark.sql import DataFrame
def unionAll(*dfs):
    return reduce(DataFrame.unionAll, dfs)
df1 = sqlContext.createDataFrame([(1, "foo1"), (2, "bar1")], ("k", "v"))
df2 = sqlContext.createDataFrame([(3, "foo2"), (4, "bar2")], ("k", "v"))
df3 = sqlContext.createDataFrame([(5, "foo3"), (6, "bar3")], ("k", "v"))
unionAll(df1, df2, df3).show()
## +---+----+
## | k| v|
## +---+----+
## | 1|foo1|
## | 2|bar1|
## | 3|foo2|
## | 4|bar2|
## | 5|foo3|
## | 6|bar3|
## +---+----+
If the number of DataFrames is large, using SparkContext.union on the underlying RDDs and recreating the DataFrame may be a better choice, to avoid issues related to the cost of preparing an execution plan:
def unionAll(*dfs):
    first, *_ = dfs  # Python 3.x, for 2.x you'll have to unpack manually
    return first.sql_ctx.createDataFrame(
        first.sql_ctx._sc.union([df.rdd for df in dfs]),
        first.schema
    )
You can also use addition for a UNION between RDDs:
rdd = sc.parallelize([1, 1, 2, 3])
(rdd + rdd).collect()
## [1, 1, 2, 3, 1, 1, 2, 3]
Unfortunately it's the only way to UNION tables in Spark. However, instead of
first = rdd1.union(rdd2)
second = first.union(rdd3)
third = second.union(rdd4)
...
you can perform it in a slightly cleaner way, like this:
result = rdd1.union(rdd2).union(rdd3).union(rdd4)
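If the list of RDDs is long or built dynamically, the same chaining can be expressed with functools.reduce, mirroring the DataFrame one-liner above; a small sketch:
from functools import reduce

rdds = [rdd1, rdd2, rdd3, rdd4]
result = reduce(lambda a, b: a.union(b), rdds)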