Iterate over an array column in PySpark with map - python

In PySpark I have a dataframe composed by two columns:
+-----------+----------------------+
| str1 | array_of_str |
+-----------+----------------------+
| John | [mango, apple, ... |
| Tom | [mango, orange, ... |
| Matteo | [apple, banana, ... |
I want to add a column concat_result that contains the concatenation of each element inside array_of_str with the string inside str1 column.
+-----------+----------------------+----------------------------------+
| str1 | array_of_str | concat_result |
+-----------+----------------------+----------------------------------+
| John | [mango, apple, ... | [mangoJohn, appleJohn, ... |
| Tom | [mango, orange, ... | [mangoTom, orangeTom, ... |
| Matteo | [apple, banana, ... | [appleMatteo, bananaMatteo, ... |
I'm trying to use map to iterate over the array:
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, ArrayType
# START EXTRACT OF CODE
ret = (df
.select(['str1', 'array_of_str'])
.withColumn('concat_result', F.udf(
map(lambda x: x + F.col('str1'), F.col('array_of_str')), ArrayType(StringType))
)
)
return ret
# END EXTRACT OF CODE
but I obtain as error:
TypeError: argument 2 to map() must support iteration

You only need small tweaks to make this work:
from pyspark.sql.types import StringType, ArrayType
from pyspark.sql.functions import udf, col
concat_udf = udf(lambda con_str, arr: [x + con_str for x in arr],
ArrayType(StringType()))
ret = df \
.select(['str1', 'array_of_str']) \
.withColumn('concat_result', concat_udf(col("str1"), col("array_of_str")))
ret.show()
You don't need to use map, standard list comprehension is sufficient.

Related

Check if PySaprk column values exists in another dataframe column values

I'm trying to figure out the condition to check if the values of one PySpark dataframe exist in another PySpark dataframe, and if so extract the value and compare again. I was thinking of doing a multiple withColumn() with a when() function.
For example my two dataframes can be something like:
df1
| id | value |
| ----- | ---- |
| hello | 1111 |
| world | 2222 |
df2
| id | value |
| ------ | ---- |
| hello | 1111 |
| world | 3333 |
| people | 2222 |
And the result I wish to obtain is to check first if the value of df1.id exists in df2.id and if true return me the df2.value, for example I was trying something like:
df1 = df1.withColumn("df2_value", when(df1.id == df2.id, df2.value))
So I get something like:
df1
| id | value | df2_value |
| ----- | ---- | --------- |
| hello | 1111 | 1111 |
| world | 2222 | 3333 |
So that now I can do another check between these two value columns in the df1 dataframe, and return a boolean column (1or 0) in a new dataframe.
The result I wish to get would be something like:
df3
| id | value | df2_value | match |
| ----- | ---- | --------- | ----- |
| hello | 1111 | 1111 | 1 |
| world | 2222 | 3333 | 0 |
Left join df1 with df2 on id after prefixing all df2 columns except id with df2_*:
from pyspark.sql import functions as F
df1 = spark.createDataFrame([("hello", 1111), ("world", 2222)], ["id", "value"])
df2 = spark.createDataFrame([("hello", 1111), ("world", 3333), ("people", 2222)], ["id", "value"])
df = df1.join(
df2.select("id", *[F.col(c).alias(f"df2_{c}") for c in df2.columns if c != 'id']),
["id"],
"left"
)
Then using functools.reduce you can construct a boolean expression to check if columns match in the 2 dataframes like this:
from functools import reduce
check_expr = reduce(
lambda acc, x: acc & (F.col(x) == F.col(f"df2_{x}")),
[c for c in df1.columns if c != 'id'],
F.lit(True)
)
df.withColumn("match", check_expr.cast("int")).show()
#+-----+-----+---------+-----+
#| id|value|df2_value|match|
#+-----+-----+---------+-----+
#|hello| 1111| 1111| 1|
#|world| 2222| 3333| 0|
#+-----+-----+---------+-----+

Create a dataframe by iterating over column of list in another dataframe

In pyspark, I have a DataFrame with a column that contains a list of ordered nodes to go through:
osmDF.schema
Out[1]:
StructType(List(StructField(id,LongType,true),
StructField(nodes,ArrayType(LongType,true),true),
StructField(tags,MapType(StringType,StringType,true),true)))
osmDF.head(3)
Out[2]:
| id | nodes | tags |
|-----------|-----------------------------------------------------|---------------------|
| 62960871 | [783186590,783198852] | "{""foo"":""bar""}" |
| 211528816 | [2215187080,2215187140,2215187205,2215187256] | "{""foo"":""boo""}" |
| 62960872 | [783198772,783183397,783167527,783169067,783198772] | "{""foo"":""buh""}" |
I need to create a dataframe with a row for each consecutive combination of 2 nodes the list of nodes, then save it as parquet.
The expected result will have a length of n-1, with n len(nodes) for each rows. It would look like this (with other columns that I'll add):
| id | from | to | tags |
|-----------------------|------------|------------|---------------------|
| 783186590_783198852 | 783186590 | 783198852 | "{""foo"":""bar""}" |
| 2215187080_2215187140 | 2215187080 | 2215187140 | "{""foo"":""boo""}" |
| 2215187140_2215187205 | 2215187140 | 2215187205 | "{""foo"":""boo""}" |
| 2215187205_2215187256 | 2215187205 | 2215187256 | "{""foo"":""boo""}" |
| 783198772_783183397 | 783198772 | 783183397 | "{""foo"":""buh""}" |
| 783183397_783167527 | 783183397 | 783167527 | "{""foo"":""buh""}" |
| 783167527_783169067 | 783167527 | 783169067 | "{""foo"":""buh""}" |
| 783169067_783198772 | 783169067 | 783198772 | "{""foo"":""buh""}" |
I tried to initiate with the following
from pyspark.sql.functions import udf
def split_ways_into_arcs(row):
arcs = []
for node in range(len(row['nodes']) - 1):
arc = dict()
arc['id'] = str(row['nodes'][node]) + "_" + str(row['nodes'][node + 1])
arc['from'] = row['nodes'][node]
arc['to'] = row['nodes'][node + 1]
arc['tags'] = row['tags']
arcs.append(arc)
return arcs
# Declare function as udf
split = udf(lambda row: split_ways_into_arcs(row.asDict()))
The issue I'm having is I don't know how many nodes there are in each row of the original DataFrame.
I know how to apply a udf to add a column to an existing DataFrame, but not to create a new one from lists of dicts.
Iterate over the nodes array using transform and explode the array afterwards:
from pyspark.sql import functions as F
df = ...
df.withColumn("nodes", F.expr("transform(nodes, (n,i) -> named_struct('from', nodes[i], 'to', nodes[i+1]))")) \
.withColumn("nodes", F.explode("nodes")) \
.filter("not nodes.to is null") \
.selectExpr("concat_ws('_', nodes.to, nodes.from) as id", "nodes.*", "tags") \
.show(truncate=False)
Output:
+---------------------+----------+----------+-----------------+
|id |from |to |tags |
+---------------------+----------+----------+-----------------+
|783198852_783186590 |783186590 |783198852 |{""foo"":""bar""}|
|2215187140_2215187080|2215187080|2215187140|{""foo"":""boo""}|
|2215187205_2215187140|2215187140|2215187205|{""foo"":""boo""}|
|2215187256_2215187205|2215187205|2215187256|{""foo"":""boo""}|
|783183397_783198772 |783198772 |783183397 |{""foo"":""buh""}|
|783167527_783183397 |783183397 |783167527 |{""foo"":""buh""}|
|783169067_783167527 |783167527 |783169067 |{""foo"":""buh""}|
|783198772_783169067 |783169067 |783198772 |{""foo"":""buh""}|
+---------------------+----------+----------+-----------------+

Combining Two Pandas Dataframe with Same Columns into one String Columns

I have two Pandas dataframes ie:
+-------+-------------------+--+
| Name | Class | |
+-------+-------------------+--+
| Alice | Physics | |
| Bob | "" (Empty string) | |
+-------+-------------------+--+
Table 2:
+-------+-----------+
| Name | Class |
+-------+-----------+
| Alice | Chemistry |
| Bob | Math |
+-------+-----------+
Is there a way to combine it easily on the column Class so the resulting table is like:
+-------+--------------------+
| Name | Class |
+-------+--------------------+
| Alice | Physics, Chemistry |
| Bob | Math |
+-------+--------------------+
I also want to make sure there are no extra commas when adding columns. Thanks!
df = pd.DataFrame({'Name':['Alice','Bob'],
'Class':['Physics',np.nan]})
df2 = pd.DataFrame({'Name':['Alice','Bob'],
'Class':['Chemistry','Math']})
df3 = df.append(df2).dropna(subset=['Class']).groupby('Name')['Class'].apply(list).reset_index()
# to remove list
df3['Class'] = df3['Class'].apply(lambda x: ', '.join(x))
Try with concat and groupby:
>>> pd.concat([df1, df2]).groupby("Name").agg(lambda x: ", ".join(i for i in x.tolist() if len(i.strip())>0)).reset_index()
Name Class
Alice Physics, Chemistry
Bob Math

Complex functions from python to pyspark - EDIT: concatenation problem (I think)

I'm trying to transform a pandas function over two dataframes to a pyspark function.
In particular, I have a dataframe of Keys and functions as strings, namely:
> mv
| Keys | Formula | label |
---------------------------------------
| key1 | 'val1 + val2 - val3' | name1 |
| key2 | 'val3 + val4' | name2 |
| key3 | 'val1 - val4' | name3 |
and a dataframe df:
> df
| Keys | Datetime | names | values |
------------------------------------
| key1 | tmstmp1 | val1 | 0.3 |
| key1 | tmstmp1 | val2 | 0.4 |
| key1 | tmstmp1 | val3 | 0.2 |
| key1 | tmstmp1 | val4 | 1.2 |
| key1 | tmstmp2 | val1 | 0.5 |
| key2 | tmstmp1 | val1 | 1.1 |
| key2 | tmstmp2 | val3 | 1.0 |
| key2 | tmstmp1 | val3 | 1.3 |
and so on.
I've created two functions that read the code, evaluate the string ok the measure expression, and return a list of pandas.DataFrame that in the end I concat.
def evaluate_vm(x, regex):
m = re.findall(regex, x)
to_replace = ['#[' + i + ']' for i in m]
replaces = [i.split(', ') for i in m]
replacement = ["df.loc[(df.Keys == %s) & (df.names == %s), ['Datetime', 'values']].dropna().set_index('Datetime')"%tuple(i) for i in replaces]
for i in range(len(to_replace)):
x = x.replace(to_replace[i], replacement[i])
return eval(x)
def _mv_(X):
formula = evaluate_vm(X.Formula)
formula['Keys'] = X.Keys
formula.reset_index(inplace = True)
formula.rename_axis({'Formula': 'Values'}, axis = 1, inplace = True)
return formula[['Keys', 'Datetime', 'names', 'Values']]
After that my code is
res = pd.concat([_mv_(mv.loc[i]) for i in mv.index])
and res is what I need to obtain.
NOTE: I've slightly modified the functions and inputs in order to make it understandable: anyway, I don't think the problem lies here.
Here's the thing. I'm trying to transform this stuff in pyspark.
This is the code I wrote this far.
from pyspark.sql.functions import pandas_udf, PandasUDFType, struct
from pyspark.sql.types import FloatType, StringType, IntegerType, TimestampType, StructType, StructField
EvaluateVM = pandas_udf(lambda x: _mv_(x),\
functionType = PandasUDFType.GROUPED_MAP, \
returnType = StructType([StructField("Keys", StringType(), False),
StructField("Datetime", TimestampType(), False),\
StructField("names", StringType(), False),\
StructField("Values", FloatType(), False)])
)
res = EvaluateVM(struct([mv[i] for i in mv.columns]))
That is "almost" working: when I print res here's the result.
> res
Column<<lambda>(named_struct(Keys, Keys, Formula, Formula))>
And I can't see inside res: I think it created something like a python iterable, but I would like to have the same result I get in pandas.
What should I do? Did I get it all wrong?
EDIT: I think the problem might be in the fact that in pandas I create a list of dataframes that I concatenate after evaluating them, in pyspark I use a sort of apply(_mv_, axis = 1): this kind of syntax gave me error even in pandas (cannot concatenate dataframe of dimension 192 into one of size 5, something like that), and my workaround was the pandas.concat([…]). I don't know if this works in pyspark too, or if there's some way to avoid this.
EDIT 2: Sorry, I didn't write the expected output:
| Keys | Datetime | label | values |
---------------------------------------------
| key1 | tmstmp1 | name1 | 0.3 + 0.4 - 0.2 |
| key1 | tmstmp1 | name2 | 0.2 + 1.2 |
and so on. The values column should contain the numeric result, here I wrote the operands in order to let you understand.

Convert PySpark dataframe column from list to string

I have this PySpark dataframe
+-----------+--------------------+
|uuid | test_123 |
+-----------+--------------------+
| 1 |[test, test2, test3]|
| 2 |[test4, test, test6]|
| 3 |[test6, test9, t55o]|
and I want to convert the column test_123 to be like this:
+-----------+--------------------+
|uuid | test_123 |
+-----------+--------------------+
| 1 |"test,test2,test3" |
| 2 |"test4,test,test6" |
| 3 |"test6,test9,t55o" |
so from list to be string.
how can I do it with PySpark?
While you can use a UserDefinedFunction it is very inefficient. Instead it is better to use concat_ws function:
from pyspark.sql.functions import concat_ws
df.withColumn("test_123", concat_ws(",", "test_123")).show()
+----+----------------+
|uuid| test_123|
+----+----------------+
| 1|test,test2,test3|
| 2|test4,test,test6|
| 3|test6,test9,t55o|
+----+----------------+
You can create a udf that joins array/list and then apply it to the test column:
from pyspark.sql.functions import udf, col
join_udf = udf(lambda x: ",".join(x))
df.withColumn("test_123", join_udf(col("test_123"))).show()
+----+----------------+
|uuid| test_123|
+----+----------------+
| 1|test,test2,test3|
| 2|test4,test,test6|
| 3|test6,test9,t55o|
+----+----------------+
The initial data frame is created from:
from pyspark.sql.types import StructType, StructField
schema = StructType([StructField("uuid",IntegerType(),True),StructField("test_123",ArrayType(StringType(),True),True)])
rdd = sc.parallelize([[1, ["test","test2","test3"]], [2, ["test4","test","test6"]],[3,["test6","test9","t55o"]]])
df = spark.createDataFrame(rdd, schema)
df.show()
+----+--------------------+
|uuid| test_123|
+----+--------------------+
| 1|[test, test2, test3]|
| 2|[test4, test, test6]|
| 3|[test6, test9, t55o]|
+----+--------------------+
As of version 2.4.0, you can use array_join.Spark docs
from pyspark.sql.functions import array_join
df.withColumn("test_123", array_join("test_123", ",")).show()

Categories

Resources