Concat multiple columns with loop Pyspark


I have n array-of-string columns. I would like to concatenate these n columns into one, using a loop.
I have this function to concatenate columns:
def concat(type):
    def concat_(*args):
        return list(chain(*args))
    return udf(concat_, ArrayType(type))
concat_string_arrays = concat(StringType())
In the following example, I have 4 columns that I concatenate like this:
df_aux = df.select('ID_col', concat_string_arrays(col("patron_txt_1"), col("patron_txt_2"), col('patron_txt_3'), col('patron_txt_0')).alias('patron_txt'))
But if I have 200 columns, how can I apply this function dynamically with a loop?

You can use the * operator to pass a list of columns to your concat UDF:
from itertools import chain
from pyspark.sql.functions import col, udf
from pyspark.sql.types import *
df = sqlContext.createDataFrame([("1", "2", "3", "4"),
                                 ("5", "6", "7", "8")],
                                ('ID_col', 'patron_txt_0', 'patron_txt_1', 'patron_txt_2'))
def concat(type):
    def concat_(*args):
        return list(chain(*args))
    return udf(concat_, ArrayType(type))
concat_string_arrays = concat(StringType())
#Select the columns you want to concatenate
cols = [c for c in df.columns if c.startswith("patron_txt")]
#Use the * operator to pass multiple columns to concat_string_arrays
df.select('ID_col',concat_string_arrays(*cols).alias('patron_txt')).show()
This results in the following output:
+------+----------+
|ID_col|patron_txt|
+------+----------+
| 1| [2, 3, 4]|
| 5| [6, 7, 8]|
+------+----------+
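To see why the * operator is enough here, the sketch below runs the same concat_ body on plain Python lists, with no Spark involved. The second function is an assumption about what you might want when some of the 200 columns are null for a given row (chain raises a TypeError on None); the name concat_skip_nulls is hypothetical and not part of the answer above.

```python
from itertools import chain

def concat_(*args):
    # Same body as the UDF above: flatten all positional arguments into one list.
    return list(chain(*args))

def concat_skip_nulls(*args):
    # Hypothetical variant: ignore None arguments instead of failing on them.
    return list(chain.from_iterable(a for a in args if a is not None))

# One "row" holding four array columns, unpacked with * just like *cols above.
row_values = [["a", "b"], ["c"], [], ["d", "e"]]
flat = concat_(*row_values)

# A row where one of the columns is null.
row_with_null = [["a"], None, ["b"]]
flat_no_nulls = concat_skip_nulls(*row_with_null)
```

Because the function accepts any number of positional arguments, the number of columns only affects the length of the list being unpacked, not the function itself.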

Related

Modify Different Pyspark Column on Exception in UDF

I have a data frame and a function that I want to run on every cell in my data frame:
def foo(x):
    # does stuff to x
    return x
foo_udf = udf(lambda x: foo(x), StringType())
df = df.withColumn("col1", foo_udf(col("col1"))) \
    .withColumn("col2", foo_udf(col("col2"))) \
    .withColumn("col3", foo_udf(col("col3")))
It simply modifies the data passed in and returns a new value to replace it.
However, an error may occur for some rows. For these cases I have another column, col4, which stores a boolean indicating whether the udf failed for that row.
My issue is that when an error occurs, I have no way of accessing col4 for that row.

You can do this at the partition level with mapPartitions. I will use Fugue, which provides an easier interface for bringing this to Spark.
First some setup:
from typing import List, Dict, Any, Iterable
import pandas as pd
def foo(x):
    if x == "E":
        raise ValueError()
    return x + "_ran"
def logic(df: List[Dict[str,Any]]) -> List[Dict[str,Any]]:
    for row in df:
        try:
            x = foo(row["col1"])
            y = foo(row["col2"])
            z = foo(row["col3"])
            # if it reaches here, we can update all
            row["col1"] = x
            row["col2"] = y
            row["col3"] = z
            row["col4"] = False
        except Exception:
            # any failure leaves col1..col3 untouched and flags the row
            row["col4"] = True
    return df
foo() is your original function and logic() is a wrapper to only update the columns if every foo() call is successful. Annotating the function will guide Fugue to apply conversions. From here we can use Fugue's transform() to test on Pandas.
df = pd.DataFrame({"col1": ["A", "B", "C"], "col2": ["A", "B", "C"], "col3": ["D", "E", "F"]})
from fugue import transform
transform(df, logic, schema="*, col4:boolean")
Spark requires an output schema. Here it is given as a minimal expression (all existing columns plus a boolean col4) that Fugue expands for you. The result is:
col1 col2 col3 col4
A_ran A_ran D_ran False
B B E True
C_ran C_ran F_ran False
Now we can bring it to Spark. We just need to supply a SparkSession.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(df)
transform(sdf, logic, schema="*, col4:boolean", engine=spark).show()

You can only return/change a single column from a UDF. However, this column can be a StructType containing the payload and an error flag. You can then "unpack" the struct column into two (or more) normal columns.
from pyspark.sql import functions as F
from pyspark.sql import types as T
#some testdata
data = [['A', 4],
['B', 2],
['C', 5]]
df=spark.createDataFrame(data, ["id", "col1"])
#the udf
def foo(x):
    if x == 5:
        error=True
    else:
        error=False
    return [x, error]
foo_udf = F.udf(lambda x: foo(x), returnType = T.StructType([
    T.StructField("x", T.StringType(), False),
    T.StructField("error", T.BooleanType(), False)
]))
#calling the udf and unpacking the return values
df.withColumn("col1", foo_udf("col1")) \
.withColumn("error", F.col("col1.error")) \
.withColumn("col1", F.col("col1.x")) \
.show()
Output:
+---+----+-----+
| id|col1|error|
+---+----+-----+
| A| 4|false|
| B| 2|false|
| C| 5| true|
+---+----+-----+
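In the example above, foo only sets the flag for one hard-coded value. If the goal is to flag rows where foo actually raises, one option is to wrap it so that exceptions become the error field. The following is a minimal plain-Python sketch; with_error_flag is a hypothetical helper name, and the choice to return the original input on failure is an assumption. The wrapped function returns a two-element tuple in the same (x, error) order as the struct above, so it could presumably be passed to F.udf with that same StructType.

```python
from typing import Any, Callable, Tuple

def with_error_flag(fn: Callable[[Any], Any]) -> Callable[[Any], Tuple[Any, bool]]:
    # Hypothetical helper: (result, False) on success, (original input, True) on failure.
    def wrapped(x):
        try:
            return (fn(x), False)
        except Exception:
            return (x, True)
    return wrapped

def foo(x):
    # Stand-in for the real function; fails on one specific input.
    if x == "E":
        raise ValueError("cannot process E")
    return x + "_ran"

safe_foo = with_error_flag(foo)
ok = safe_foo("A")
failed = safe_foo("E")
```

Keeping the try/except inside the wrapper means the original foo stays unchanged and can still be tested on its own.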

Pyspark: Count how many rows have the same value in two columns and drop the duplicates [duplicate]

I have three arrays of strings containing the following information:
groupBy array: containing names of the columns I want to group my data by.
aggregate array: containing names of columns I want to aggregate.
operations array: containing the aggregate operations I want to perform
I am trying to use Spark data frames to achieve this. Spark data frames provide an agg() method to which you can pass a Map[String, String] (column name to aggregate operation) as input. However, I want to perform different aggregation operations on the same column of the data. Any suggestions on how to achieve this?

Scala:
You can for example map over a list of functions with a defined mapping from name to function:
import org.apache.spark.sql.functions.{col, min, max, avg}
import org.apache.spark.sql.Column
val df = Seq((1L, 3.0), (1L, 3.0), (2L, -5.0)).toDF("k", "v")
val mapping: Map[String, Column => Column] = Map(
  "min" -> min, "max" -> max, "mean" -> avg)
val groupBy = Seq("k")
val aggregate = Seq("v")
val operations = Seq("min", "max", "mean")
val exprs = aggregate.flatMap(c => operations.map(f => mapping(f)(col(c))))
df.groupBy(groupBy.map(col): _*).agg(exprs.head, exprs.tail: _*).show
// +---+------+------+------+
// | k|min(v)|max(v)|avg(v)|
// +---+------+------+------+
// | 1| 3.0| 3.0| 3.0|
// | 2| -5.0| -5.0| -5.0|
// +---+------+------+------+
or
df.groupBy(groupBy.head, groupBy.tail: _*).agg(exprs.head, exprs.tail: _*).show
Unfortunately, the parser used internally by SQLContext is not exposed publicly, but you can always build plain SQL queries:
df.registerTempTable("df")
val groupExprs = groupBy.mkString(",")
val aggExprs = aggregate.flatMap(c => operations.map(
f => s"$f($c) AS ${c}_${f}")
).mkString(",")
sqlContext.sql(s"SELECT $groupExprs, $aggExprs FROM df GROUP BY $groupExprs")
Python:
from pyspark.sql.functions import mean, sum, max, col
df = sc.parallelize([(1, 3.0), (1, 3.0), (2, -5.0)]).toDF(["k", "v"])
groupBy = ["k"]
aggregate = ["v"]
funs = [mean, sum, max]
exprs = [f(col(c)) for f in funs for c in aggregate]
# or equivalent df.groupby(groupBy).agg(*exprs)
df.groupby(*groupBy).agg(*exprs)
See also:
Spark SQL: apply aggregate functions to a list of column
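The plain-SQL approach shown in the Scala part can be sketched in Python as well. This is only string building under the same assumptions as the Scala snippet (a table registered as df, and trusted column and function names, since nothing is escaped); the resulting string would then be handed to sqlContext.sql.

```python
group_by = ["k"]
aggregate = ["v"]
operations = ["min", "max", "mean"]

# "k" (or "k1, k2, ..." for several grouping columns)
group_exprs = ", ".join(group_by)

# One "f(c) AS c_f" expression per (column, operation) pair.
agg_exprs = ", ".join(
    f"{f}({c}) AS {c}_{f}" for c in aggregate for f in operations
)

query = f"SELECT {group_exprs}, {agg_exprs} FROM df GROUP BY {group_exprs}"
print(query)
```

Aliasing each expression as column_operation keeps the output column names predictable, which the Column-based version above does not do by default.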

For those who wonder how @zero323's answer can be written without a list comprehension in Python:
from pyspark.sql.functions import min, max, col
# init your spark dataframe
expr = [min(col("valueName")),max(col("valueName"))]
df.groupBy("keyName").agg(*expr)

Do something like
from pyspark.sql import functions as F
df.groupBy('groupByColName') \
.agg(F.sum('col1').alias('col1_sum'),
F.max('col2').alias('col2_max'),
F.avg('col2').alias('col2_avg')) \
.show()

Here is another straightforward way to apply different aggregate functions to the same column using Scala (this has been tested in Azure Databricks).
val groupByColName = "Store"
val colName = "Weekly_Sales"
df.groupBy(groupByColName)
.agg(min(colName),
max(colName),
round(avg(colName), 2))
.show()

For example, to compute the fraction of zeros in each column of a PySpark dataframe, you can build one expression per column and pass them all to agg:
from pyspark.sql.functions import col, count, sum as sum_
def count_zero_percentage(c):
    # 1 where the value is zero, 0 otherwise; summed and divided by the row count
    pred = col(c) == 0
    return (sum_(pred.cast("integer")) / count("*")).alias(c)
df.agg(*[count_zero_percentage(c) for c in df.columns]).show()

case class soExample(firstName: String, lastName: String, Amount: Int)
val df = Seq(soExample("me", "zack", 100)).toDF
import org.apache.spark.sql.functions._
val grouped = df.groupBy("firstName", "lastName").agg(
  sum("Amount"),
  mean("Amount"),
  stddev("Amount"),
  count(lit(1)).alias("numOfRecords")
).toDF()
display(grouped)
// Courtesy of Zach
This is Zach's simplified answer to a post marked as a duplicate:
Spark Scala Data Frame to have multiple aggregation of single Group By

map values in a dataframe from a dictionary using pyspark

I want to know how to map values in a specific column in a dataframe.
I have a dataframe which looks like:
df = sc.parallelize([('india','japan'),('usa','uruguay')]).toDF(['col1','col2'])
+-----+-------+
| col1| col2|
+-----+-------+
|india| japan|
| usa|uruguay|
+-----+-------+
I have a dictionary from where I want to map the values.
dicts = sc.parallelize([('india','ind'), ('usa','us'),('japan','jpn'),('uruguay','urg')])
The output I want is:
+-----+-------+--------+--------+
| col1| col2|col1_map|col2_map|
+-----+-------+--------+--------+
|india| japan| ind| jpn|
| usa|uruguay| us| urg|
+-----+-------+--------+--------+
I have tried using the lookup function but it doesn't work; it throws the SPARK-5063 error. This is the approach that failed:
def map_val(x):
    return dicts.lookup(x)[0]
myfun = udf(lambda x: map_val(x), StringType())
df = df.withColumn('col1_map', myfun('col1')) # doesn't work
df = df.withColumn('col2_map', myfun('col2')) # doesn't work

I think the easier way is just to use a simple dictionary and df.withColumn.
from itertools import chain
from pyspark.sql.functions import create_map, lit
simple_dict = {'india':'ind', 'usa':'us', 'japan':'jpn', 'uruguay':'urg'}
mapping_expr = create_map([lit(x) for x in chain(*simple_dict.items())])
df = df.withColumn('col1_map', mapping_expr[df['col1']])\
.withColumn('col2_map', mapping_expr[df['col2']])
df.show(truncate=False)
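The chain(*simple_dict.items()) step is the part that tends to look cryptic. create_map takes its arguments as alternating key, value, key, value columns, and chain flattens the dictionary's (key, value) pairs into exactly that order. A small plain-Python sketch of what gets generated (without the lit wrapping):

```python
from itertools import chain

simple_dict = {'india': 'ind', 'usa': 'us', 'japan': 'jpn', 'uruguay': 'urg'}

# Flatten [(k1, v1), (k2, v2), ...] into [k1, v1, k2, v2, ...].
flat = list(chain(*simple_dict.items()))
print(flat)

# Pairing even and odd positions back up recovers the original dictionary.
pairs = dict(zip(flat[0::2], flat[1::2]))
```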

udf way
I would suggest converting the list of tuples to a dict and broadcasting it so it can be used in a udf:
dicts = sc.broadcast(dict([('india','ind'), ('usa','us'),('japan','jpn'),('uruguay','urg')]))
from pyspark.sql import functions as f
from pyspark.sql import types as t
def newCols(x):
    return dicts.value[x]
callnewColsUdf = f.udf(newCols, t.StringType())
df.withColumn('col1_map', callnewColsUdf(f.col('col1')))\
.withColumn('col2_map', callnewColsUdf(f.col('col2')))\
.show(truncate=False)
which should give you
+-----+-------+--------+--------+
|col1 |col2 |col1_map|col2_map|
+-----+-------+--------+--------+
|india|japan |ind |jpn |
|usa |uruguay|us |urg |
+-----+-------+--------+--------+
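Note that indexing with dicts.value[x] raises a KeyError for any value that is not in the dictionary (including nulls), which would fail the job. A minimal sketch of a more forgiving lookup, using a plain dict as a stand-in for dicts.value; new_cols_safe is a hypothetical name, and returning None (which Spark shows as null) for unknown keys is an assumption about the desired behavior:

```python
# Plain dict standing in for the broadcast variable's .value
lookup = dict([('india', 'ind'), ('usa', 'us'), ('japan', 'jpn'), ('uruguay', 'urg')])

def new_cols_safe(x):
    # dict.get returns None instead of raising KeyError for unknown keys.
    return lookup.get(x)

known = new_cols_safe('india')
unknown = new_cols_safe('brazil')
null_input = new_cols_safe(None)
```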
join way (slower than udf way)
All you have to do is convert the dicts RDD to a dataframe as well and use two joins with aliases, as follows:
df = sc.parallelize([('india','japan'),('usa','uruguay')]).toDF(['col1','col2'])
dicts = sc.parallelize([('india','ind'), ('usa','us'),('japan','jpn'),('uruguay','urg')]).toDF(['key', 'value'])
from pyspark.sql import functions as f
df.join(dicts, df['col1'] == dicts['key'], 'inner')\
.select(f.col('col1'), f.col('col2'), f.col('value').alias('col1_map'))\
.join(dicts, df['col2'] == dicts['key'], 'inner') \
.select(f.col('col1'), f.col('col2'), f.col('col1_map'), f.col('value').alias('col2_map'))\
.show(truncate=False)
which should give you the same result

Similar to Ali AzG, but pulling it all out into a handy little method if anyone finds it useful
from itertools import chain
from pyspark.sql import DataFrame
from pyspark.sql import functions as F
from typing import Dict
def map_column_values(df:DataFrame, map_dict:Dict, column:str, new_column:str="")->DataFrame:
    """Handy method for mapping column values from one value to another
    Args:
        df (DataFrame): Dataframe to operate on
        map_dict (Dict): Dictionary containing the values to map from and to
        column (str): The column containing the values to be mapped
        new_column (str, optional): The name of the column to store the mapped values in.
            If not specified the values will be stored in the original column
    Returns:
        DataFrame
    """
    spark_map = F.create_map([F.lit(x) for x in chain(*map_dict.items())])
    return df.withColumn(new_column or column, spark_map[df[column]])
This can be used as follows
from pyspark.sql import Row, SparkSession
spark = SparkSession.builder.master("local[3]").getOrCreate()
df = spark.createDataFrame([Row(A=0), Row(A=1)])
df = map_column_values(df, map_dict={0:"foo", 1:"bar"}, column="A", new_column="B")
df.show()
#>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
#+---+---+
#| A| B|
#+---+---+
#| 0|foo|
#| 1|bar|
#+---+---+

Python spark from DenseVector to columns [duplicate]

This question already has answers here:
How to access element of a VectorUDT column in a Spark DataFrame?
Context: I have a DataFrame with 2 columns, word and vector, where the column type of "vector" is VectorUDT.
An Example:
word | vector
assert | [435,323,324,212...]
And I want to get this:
word | v1 | v2 | v3 | v4 | v5 | v6 ......
assert | 435 | 5435| 698| 356|....
Question:
How can I split a column of vectors into several columns, one per dimension, using PySpark?
Thanks in advance

Spark >= 3.0.0
Since Spark 3.0.0 this can be done without a UDF.
from pyspark.sql.functions import col
from pyspark.ml.functions import vector_to_array
(df
    .withColumn("xs", vector_to_array("vector"))
    .select(["word"] + [col("xs")[i] for i in range(3)]))
## +-------+-----+-----+-----+
## | word|xs[0]|xs[1]|xs[2]|
## +-------+-----+-----+-----+
## | assert| 1.0| 2.0| 3.0|
## |require| 0.0| 2.0| 0.0|
## +-------+-----+-----+-----+
Spark < 3.0.0
One possible approach is to convert to and from RDD:
from pyspark.ml.linalg import Vectors
df = sc.parallelize([
    ("assert", Vectors.dense([1, 2, 3])),
    ("require", Vectors.sparse(3, {1: 2}))
]).toDF(["word", "vector"])
def extract(row):
    return (row.word, ) + tuple(row.vector.toArray().tolist())
df.rdd.map(extract).toDF(["word"]) # Vector values will be named _2, _3, ...
## +-------+---+---+---+
## | word| _2| _3| _4|
## +-------+---+---+---+
## | assert|1.0|2.0|3.0|
## |require|0.0|2.0|0.0|
## +-------+---+---+---+
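The second row may look surprising: Vectors.sparse(3, {1: 2}) stores only the non-zero entry, and toArray() fills in the rest with zeros. The plain-Python sketch below mimics that expansion and the tuple building done by extract; sparse_to_dense and extract_row are hypothetical stand-ins written for illustration, not Spark functions.

```python
def sparse_to_dense(size, entries):
    # Start from all zeros and fill in the stored (index -> value) entries.
    dense = [0.0] * size
    for index, value in entries.items():
        dense[index] = float(value)
    return dense

def extract_row(word, dense_values):
    # Same shape as extract above: the word followed by one element per dimension.
    return (word,) + tuple(dense_values)

row = extract_row("require", sparse_to_dense(3, {1: 2}))
print(row)
```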
An alternative solution would be to create a UDF:
from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, DoubleType
def to_array(col):
    def to_array_(v):
        return v.toArray().tolist()
    # Important: asNondeterministic requires Spark 2.3 or later
    # It can be safely removed i.e.
    # return udf(to_array_, ArrayType(DoubleType()))(col)
    # but at the cost of decreased performance
    return udf(to_array_, ArrayType(DoubleType())).asNondeterministic()(col)
(df
    .withColumn("xs", to_array(col("vector")))
    .select(["word"] + [col("xs")[i] for i in range(3)]))
## +-------+-----+-----+-----+
## | word|xs[0]|xs[1]|xs[2]|
## +-------+-----+-----+-----+
## | assert| 1.0| 2.0| 3.0|
## |require| 0.0| 2.0| 0.0|
## +-------+-----+-----+-----+
For Scala equivalent see Spark Scala: How to convert Dataframe[vector] to DataFrame[f1:Double, ..., fn: Double)].

To split the rawPrediction or probability columns generated after training a PySpark ML model into Pandas columns, you can split like this:
your_pandas_df['probability'].apply(lambda x: pd.Series(x.toArray()))

It is much faster to use the i_th udf from how-to-access-element-of-a-vectorudt-column-in-a-spark-dataframe
The extract function given in the solution by zero323 above uses tolist, which creates a Python list object, populates it with Python float objects, and finds the desired element by traversing the list; the element then needs to be converted back to a Java double, and all of this is repeated for each row. Using the rdd is much slower than the to_array udf, which also calls tolist, but both are much slower than a udf that lets SparkSQL handle most of the work.
Timing code comparing rdd extract and to_array udf proposed here to i_th udf from 3955864:
from pyspark.context import SparkContext
from pyspark.sql import Row, SQLContext, SparkSession
from pyspark.sql.functions import lit, udf, col
from pyspark.sql.types import ArrayType, DoubleType
import pyspark.sql.dataframe
from pyspark.sql.functions import pandas_udf, PandasUDFType
sc = SparkContext('local[4]', 'FlatTestTime')
spark = SparkSession(sc)
spark.conf.set("spark.sql.execution.arrow.enabled", True)
from pyspark.ml.linalg import Vectors
# copy the two rows in the test dataframe a bunch of times,
# make this small enough for testing, or go for "big data" and be prepared to wait
REPS = 20000
df = sc.parallelize([
    ("assert", Vectors.dense([1, 2, 3]), 1, Vectors.dense([4.1, 5.1])),
    ("require", Vectors.sparse(3, {1: 2}), 2, Vectors.dense([6.2, 7.2])),
] * REPS).toDF(["word", "vector", "more", "vorpal"])
def extract(row):
    return (row.word, ) + tuple(row.vector.toArray().tolist(),) + (row.more,) + tuple(row.vorpal.toArray().tolist(),)
def test_extract():
    return df.rdd.map(extract).toDF(['word', 'vector__0', 'vector__1', 'vector__2', 'more', 'vorpal__0', 'vorpal__1'])
def to_array(col):
    def to_array_(v):
        return v.toArray().tolist()
    return udf(to_array_, ArrayType(DoubleType()))(col)
def test_to_array():
    df_to_array = df.withColumn("xs", to_array(col("vector"))) \
        .select(["word"] + [col("xs")[i] for i in range(3)] + ["more", "vorpal"]) \
        .withColumn("xx", to_array(col("vorpal"))) \
        .select(["word"] + ["xs[{}]".format(i) for i in range(3)] + ["more"] + [col("xx")[i] for i in range(2)])
    return df_to_array
# pack up to_array into a tidy function
def flatten(df, vector, vlen):
    fieldNames = df.schema.fieldNames()
    if vector in fieldNames:
        names = []
        for fieldname in fieldNames:
            if fieldname == vector:
                names.extend([col(vector)[i] for i in range(vlen)])
            else:
                names.append(col(fieldname))
        return df.withColumn(vector, to_array(col(vector)))\
            .select(names)
    else:
        return df
def test_flatten():
    dflat = flatten(df, "vector", 3)
    dflat2 = flatten(dflat, "vorpal", 2)
    return dflat2
def ith_(v, i):
    try:
        return float(v[i])
    except ValueError:
        return None
ith = udf(ith_, DoubleType())
select = ["word"]
select.extend([ith("vector", lit(i)) for i in range(3)])
select.append("more")
select.extend([ith("vorpal", lit(i)) for i in range(2)])
# %% timeit ...
def test_ith():
    return df.select(select)
if __name__ == '__main__':
    import timeit
    # make sure these work as intended
    test_ith().show(4)
    test_flatten().show(4)
    test_to_array().show(4)
    test_extract().show(4)
    print("i_th\t\t",
          timeit.timeit("test_ith()",
                        setup="from __main__ import test_ith",
                        number=7)
          )
    print("flatten\t\t",
          timeit.timeit("test_flatten()",
                        setup="from __main__ import test_flatten",
                        number=7)
          )
    print("to_array\t",
          timeit.timeit("test_to_array()",
                        setup="from __main__ import test_to_array",
                        number=7)
          )
    print("extract\t\t",
          timeit.timeit("test_extract()",
                        setup="from __main__ import test_extract",
                        number=7)
          )
Results:
i_th 0.05964796099999958
flatten 0.4842299350000001
to_array 0.42978780299999997
extract 2.9254476840000017

from pyspark.sql.types import DoubleType
def splitVector(df, new_features=['f1','f2']):
    schema = df.schema
    cols = df.columns
    for name in new_features: # new_features should have the same length as the vector column
        schema = schema.add(name, DoubleType(), True)
    return spark.createDataFrame(df.rdd.map(lambda row: [row[i] for i in cols] + row.features.tolist()), schema)
This function turns the feature vector column (named features here) into separate columns.

Spark dataframe update column where other column is like with PySpark

This creates my example dataframe:
from pyspark.sql.functions import lit
df = sc.parallelize([('abc',),('def',)]).toDF()
df = df.selectExpr("_1 as one",)
df = df.withColumn("two", lit('z'))
df.show()
looking like this:
+---+---+
|one|two|
+---+---+
|abc| z|
|def| z|
+---+---+
Now what I want to do is run a series of SQL LIKE checks against column one and, for each check that matches, append the letter to column two.
In "pseudo code" it looks like this:
for letter in ['a','b','c','d']:
    df = df['two'].where(col('one').like("%{}%".format(letter))) += letter
The final result should be a df looking like this:
+---+----+
|one| two|
+---+----+
|abc|zabc|
|def| zd|
+---+----+

If you are using a list of strings to subset your string column, a broadcast variable is a good fit. Let's start with a more realistic example where your strings contain spaces:
df = sc.parallelize([('a b c',),('d e f',)]).toDF()
df = df.selectExpr("_1 as one",)
df = df.withColumn("two", lit('z'))
Then we create a broadcast variable from a list of letters and define a udf that uses it to subset a list of strings. The udf then concatenates the matches with the value in another column, returning one string:
letters = ['a','b','c','d']
letters_bd = sc.broadcast(letters)
from pyspark.sql.functions import udf
def subs(col1, col2):
    # keep only the tokens that appear in the broadcast list
    l_subset = [x for x in col1 if x in letters_bd.value]
    return col2 + ' ' + ' '.join(l_subset)
subs_udf = udf(subs)
To apply the above, the string we are subsetting needs to be converted to a list, so we use the function split() first and then apply our udf:
from pyspark.sql.functions import col, split
df.withColumn("three", split(col('one'), r'\W+')) \
.withColumn("three", subs_udf("three", "two")) \
.show()
+-----+---+-------+
| one|two| three|
+-----+---+-------+
|a b c| z|z a b c|
|d e f| z| z d|
+-----+---+-------+
Or, without a udf, use regexp_replace and concat if your letters fit comfortably into a regular expression:
from pyspark.sql.functions import regexp_replace, col, concat, lit
df.withColumn("three", concat(col('two'), lit(' '),
regexp_replace(col('one'), '[^abcd]', ' ')))
