Similar to this question (Scala), but I need the combinations in PySpark: pair combinations of an array column.
Example input:
df = spark.createDataFrame(
    [([0, 1],),
     ([2, 3, 4],),
     ([5, 6, 7, 8],)],
    ['array_col'])
Expected output:
+------------+------------------------------------------------+
|array_col |out |
+------------+------------------------------------------------+
|[0, 1] |[[0, 1]] |
|[2, 3, 4] |[[2, 3], [2, 4], [3, 4]] |
|[5, 6, 7, 8]|[[5, 6], [5, 7], [5, 8], [6, 7], [6, 8], [7, 8]]|
+------------+------------------------------------------------+
Native Spark approach. I've translated this answer to PySpark: zip every element with the whole array (array_repeat + arrays_zip), flatten the result into all ordered pairs, then keep only the pairs whose first element is smaller than the second.
Python 3.8+ (using the walrus operator := for "array_col", which is repeated several times in this script):
from pyspark.sql import functions as F
df = df.withColumn(
    "out",
    F.filter(
        F.transform(
            F.flatten(F.transform(
                c := "array_col",
                lambda x: F.arrays_zip(F.array_repeat(x, F.size(c)), c)
            )),
            lambda x: F.array(x["0"], x[c])
        ),
        lambda x: x[0] < x[1]
    )
)
df.show(truncate=0)
# +------------+------------------------------------------------+
# |array_col |out |
# +------------+------------------------------------------------+
# |[0, 1] |[[0, 1]] |
# |[2, 3, 4] |[[2, 3], [2, 4], [3, 4]] |
# |[5, 6, 7, 8]|[[5, 6], [5, 7], [5, 8], [6, 7], [6, 8], [7, 8]]|
# +------------+------------------------------------------------+
Alternative without walrus operator:
from pyspark.sql import functions as F
df = df.withColumn(
    "out",
    F.filter(
        F.transform(
            F.flatten(F.transform(
                "array_col",
                lambda x: F.arrays_zip(F.array_repeat(x, F.size("array_col")), "array_col")
            )),
            lambda x: F.array(x["0"], x["array_col"])
        ),
        lambda x: x[0] < x[1]
    )
)
Alternative for Spark 2.4+ (F.transform and F.filter were only added to the Python API in PySpark 3.1, so here the same logic is expressed as a SQL expression):
from pyspark.sql import functions as F
df = df.withColumn(
    "out",
    F.expr("""
        filter(
            transform(
                flatten(transform(
                    array_col,
                    x -> arrays_zip(array_repeat(x, size(array_col)), array_col)
                )),
                x -> array(x["0"], x["array_col"])
            ),
            x -> x[0] < x[1]
        )
    """)
)
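As an aside, the x[0] < x[1] filter assumes the array values are sorted and distinct. If that doesn't hold, an index-based variant (a sketch, also Spark 2.4+) generates each pair exactly once regardless of the values:
from pyspark.sql import functions as F

# pair each element only with the elements after it (slice is 1-based),
# so no filter step is needed and duplicate values are handled correctly
df = df.withColumn(
    "out",
    F.expr("""
        flatten(transform(
            array_col,
            (x, i) -> transform(
                slice(array_col, i + 2, size(array_col)),
                y -> array(x, y)
            )
        ))
    """)
)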
pandas_udf is an efficient and concise approach in PySpark.
from pyspark.sql import functions as F
import pandas as pd
from itertools import combinations
@F.pandas_udf('array<array<int>>')
def pudf(c: pd.Series) -> pd.Series:
    return c.apply(lambda x: list(combinations(x, 2)))

df = df.withColumn('out', pudf('array_col'))
df.show(truncate=0)
# +------------+------------------------------------------------+
# |array_col |out |
# +------------+------------------------------------------------+
# |[0, 1] |[[0, 1]] |
# |[2, 3, 4] |[[2, 3], [2, 4], [3, 4]] |
# |[5, 6, 7, 8]|[[5, 6], [5, 7], [5, 8], [6, 7], [6, 8], [7, 8]]|
# +------------+------------------------------------------------+
Note: in some systems, instead of 'array<array<int>>' you may need to provide types from pyspark.sql.types, e.g. ArrayType(ArrayType(IntegerType())).
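In that case the decorated UDF would look like this (a sketch; depending on how the input column was built, LongType may be the appropriate element type instead of IntegerType):
from pyspark.sql.types import ArrayType, IntegerType

# same UDF as above, with the return type given as explicit type objects
@F.pandas_udf(ArrayType(ArrayType(IntegerType())))
def pudf(c: pd.Series) -> pd.Series:
    return c.apply(lambda x: list(combinations(x, 2)))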
Related
I want to extract a range of columns. I know how to do that in numpy, but I don't want to use the numpy slicing operator.
import numpy as np
a = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
arr = np.array(a)
k = 0
print(arr[k:, k+1]) # --> [2 7]
print([[a[r][n+1] for n in range(0,k+1)] for r in range(k,len(a))][0]) # --> [2]
What's wrong with the second statement?
You're overcomplicating it. Get the rows with a[k:], then get a cell with row[k+1].
>>> [row[k+1] for row in a[k:]]
[2, 7]
a = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
k = 0
print(list(list(zip(*a[k:]))[k+1])) # [2, 7]
Is this what you're looking for?
cols = [1,2,3] # extract middle 3 columns
cols123 = [[l[col] for col in cols] for l in a]
# [[2, 3, 4], [7, 8, 9]]
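If you actually need a range of columns rather than a single one, plain list slicing inside the comprehension avoids numpy entirely. A minimal sketch:
a = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
k = 0

# single column k+1 from rows k onward -> flat list
print([row[k + 1] for row in a[k:]])        # [2, 7]

# columns k+1 .. k+3 from rows k onward -> list of lists
print([row[k + 1:k + 4] for row in a[k:]])  # [[2, 3, 4], [7, 8, 9]]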
I have a pandas DataFrame in which several columns' values are np.arrays. I would like to merge these np.arrays into one array, element-wise per row.
e.g.:
col1 col2 col3
[2.1, 3] [4, 4] [2, 3]
[4, 5] [6, 7] [9, 9]
[7, 8] [8, 9] [5, 4]
... ... ...
expected result:
col_f
[2.1, 3, 4, 4, 2, 3]
[4, 5, 6, 7, 9, 9]
[7, 8, 8, 9, 5, 4]
........
I use a for loop to do it, but I'm wondering if there is a more elegant way.
Below is my for-loop code:
f_vector = []
for i in range(len(df.index)):
    vector = np.hstack((df['A0_vector'][i], df['A1_vector'][i], df['A2_vector'][i],
                        df['A3_vector'][i], df['A4_vector'][i], df['A5_vector'][i]))
    f_vector.append(vector)
X = np.array(f_vector)
You can use numpy.concatenate with apply along axis=1:
import numpy as np
df['col_f'] = df[['col1', 'col2', 'col3']].apply(np.concatenate, axis=1)
If those were lists instead of np.arrays, the + operator would have worked:
df['col_f'] = df['col1'] + df['col2'] + df['col3']
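If, as in the original loop, you want the result as a single 2-D numpy array X rather than a DataFrame column, you can stack the concatenated rows. A sketch, assuming every row concatenates to the same total length:
import numpy as np

# stack the per-row 1-D arrays into one 2-D feature matrix
X = np.stack(df['col_f'].tolist())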
I am trying to sort an ArrayType column in a PySpark DataFrame.
The following PySpark code is not giving any output:
from operator import itemgetter
from pyspark.sql.functions import *
from pyspark.sql.types import *
def sort_data_array(row):
    # sort the ArrayType col's inner lists in ascending order of their index-1 elements
    return sorted(row, key=itemgetter(1))

df1 = spark.createDataFrame(
    [[1, [[3, 2, 3], [1, 5, 4], [5, 6, 6]]],
     [2, [[12, 3, 6], [9, 5, 1], [5, 3, 1]]]],
    StructType([StructField('_1', IntegerType()),
                StructField('_2', ArrayType(ArrayType(IntegerType())))]))
sorting_udf = udf(sort_data_array, ArrayType(ArrayType(IntegerType())))
df1 = df1.withColumn('sorted_2', sorting_udf(df1['_2']))
df1.take(2)
When I make this change, the code runs and gives the desired output:
def sort_data_array(row):
    sorted_row = sorted(row, key=itemgetter(1))
    return sorted_row
Why is this happening?
Change the index in the function:
def sort_data_array(row):
    return sorted(row, key=itemgetter(0))  # index 0
Now the output is:
[Row(_1=1, _2=[[3, 2, 3], [1, 5, 4], [5, 6, 6]], sorted_2=[[1, 5, 4], [3, 2, 3], [5, 6, 6]]),
Row(_1=2, _2=[[12, 3, 6], [9, 5, 1], [5, 3, 1]], sorted_2=[[5, 3, 1], [9, 5, 1], [12, 3, 6]])]
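As a side note, Spark can compare arrays lexicographically, so the built-in sort_array should give the same result here without a UDF. A sketch (assuming your Spark version supports ordering on array<array<int>>; lexicographic order matches sorting by the first inner element, then the second, and so on):
from pyspark.sql import functions as F

# sort the array of arrays lexicographically, i.e. by first element first
df1 = df1.withColumn('sorted_2', F.sort_array(df1['_2']))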
Is there a way to do this without using a regular for loop to iterate through the main list?
>>> map(lambda x: x*2, [[1,2,3],[4,5,6]])
[[1, 2, 3, 1, 2, 3], [4, 5, 6, 4, 5, 6]]
# want [[2,4,6],[8,10,12]]
You have nested lists, and x represents just one of the inner lists. To process those, you need to map the multiplication function onto the individual elements of x, like this:
>>> map(lambda x: map(lambda y: y * 2, x), [[1, 2, 3], [4, 5, 6]])
[[2, 4, 6], [8, 10, 12]]
But I would prefer a list comprehension over this:
>>> [[y * 2 for y in x] for x in [[1, 2, 3], [4, 5, 6]]]
[[2, 4, 6], [8, 10, 12]]
An alternative solution is to use numpy's vectorized operations:
import numpy as np
ll = [[1,2,3],[4,5,6]]
(2*np.array(ll)).tolist()
#Out[6]: [[2, 4, 6], [8, 10, 12]]
This is a bit overkill and not too practical for this particular example, but another stylistic option is to use functools.partial, which makes it very clear what is happening, combined with map and a list comprehension.
from functools import partial
from operator import mul
l = [[1, 2, 3], [4, 5, 6]]
double = partial(mul, 2)
dub_l = [map(double, sub) for sub in l]
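Note that the map-based snippets in this thread show Python 2 behavior, where map returns a list; on Python 3, map is lazy, so the calls need to be wrapped in list(). A minimal sketch of the partial-based version on Python 3:
from functools import partial
from operator import mul

l = [[1, 2, 3], [4, 5, 6]]
double = partial(mul, 2)  # double(x) == 2 * x
dub_l = [list(map(double, sub)) for sub in l]
print(dub_l)  # [[2, 4, 6], [8, 10, 12]]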
I am looking for an idiomatic way to combine an n-dimensional vector (given as a list) with a list of offsets that shall be applied to every dimension.
I.e., given the following values and offsets:
v = [5, 6]
o = [-1, 2, 3]
I want to obtain the following list:
n = [[4, 5], [7, 5], [8, 5], [4, 8], [7, 8], [8, 8], [4, 9], [7, 9], [8, 9]]
originating from:
n = [[5-1, 6-1], [5+2, 6-1], [5+3, 6-1], [5-1, 6+2], [5+2, 6+2], [5+3, 6+2], [5-1, 6+3], [5+2, 6+3], [5+3, 6+3]]
Performance is not an issue here and the order of the resulting list also does not matter. Any suggestions on how this can be produced without ugly nested for loops? I guess itertools provides the tools for a solution, but I haven't figured it out yet.
from itertools import product
[map(sum, zip(*[v, y])) for y in product(o, repeat=2)]
or, as falsetru and Dobi suggested in comments:
[map(sum, zip(v, y)) for y in product(o, repeat=len(v))]
itertools.product() gives you the desired combinations of o. Use that with a list comprehension to create n:
from itertools import product
n = [[v[0] + x, v[1] + y] for x, y in product(o, repeat=2)]
Demo:
>>> [[v[0] + x, v[1] + y] for x, y in product(o, repeat=2)]
[[4, 5], [4, 8], [4, 9], [7, 5], [7, 8], [7, 9], [8, 5], [8, 8], [8, 9]]
Use itertools.product:
>>> import itertools
>>>
>>> v = [5, 6]
>>> o = [-1, 2, 3]
>>>
>>> x, y = v
>>> [[x+dx, y+dy] for dx, dy in itertools.product(o, repeat=2)]
[[4, 5], [4, 8], [4, 9], [7, 5], [7, 8], [7, 9], [8, 5], [8, 8], [8, 9]]
originating from:
[[5-1, 6-1], [5-1, 6+2], [5-1, 6+3], [5+2, 6-1], [5+2, 6+2], [5+2, 6+3], [5+3, 6-1], [5+3, 6+2], [5+3, 6+3]]
You can use the itertools module to generate the required Cartesian products.
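A fully general version for an n-dimensional v, building on the product-based answers above; a minimal sketch:
from itertools import product

v = [5, 6]
o = [-1, 2, 3]
# one offset tuple per dimension, added element-wise to v
n = [[vi + di for vi, di in zip(v, deltas)]
     for deltas in product(o, repeat=len(v))]
# [[4, 5], [4, 8], [4, 9], [7, 5], [7, 8], [7, 9], [8, 5], [8, 8], [8, 9]]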