I have a PySpark dataframe (say df) which has two columns (Name and Score). Following is an example of the dataframe:
+------+-----+
| Name|Score|
+------+-----+
| name1|11.23|
| name2|14.57|
| name3| 2.21|
| name4| 8.76|
| name5|18.71|
+------+-----+
I have a numpy array (say bin_array) which has values close to the numerical values that are there in the column titled Score of the PySpark dataframe.
Following is the aforementioned numpy array:
bin_array = np.array([0, 5, 10, 15, 20])
I want to compare the value in each row of the Score column with the values in bin_array and store the closest value (taken from bin_array) in a separate column of the PySpark dataframe.
Below is how I would like my new dataframe (say df_new) to look.
+------+-----+------------+
| Name|Score| Closest_bin|
+------+-----+------------+
| name1|11.23| 10.0 |
| name2|14.57| 15.0 |
| name3| 2.21| 0.0 |
| name4| 8.76| 10.0 |
| name5|18.71| 20.0 |
+------+-----+------------+
I have the function below, which gives me the closest value from bin_array. The function works fine when I test it with individual numbers.
def find_nearest(array, value):
    array = np.asarray(array)
    idx = (np.abs(array - value)).argmin()
    return float(array[idx])
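For example, a quick check on the driver with the bin_array defined above:

find_nearest(bin_array, 11.23)   # 10.0
find_nearest(bin_array, 2.21)    # 0.0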
In my actual work, I will have millions of rows in the dataframe. What is the most efficient way to create df_new?
Following are the steps that I tried in order to create a user-defined function (udf) and the new dataframe (df_new):
closest_bin_udf = F.udf( lambda x: find_nearest(array, x) )
df_new = df.withColumn( 'Closest_bin' , closest_bin_udf(df.Score) )
But, I got errors when I tried df_new.show(). A portion of the error is shown below.
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-11-685c9b7e25d9> in <module>()
----> 1 df_new.show()
/usr/lib/spark/python/pyspark/sql/dataframe.py in show(self, n, truncate, vertical)
376 """
377 if isinstance(truncate, bool) and truncate:
--> 378 print(self._jdf.showString(n, 20, vertical))
379 else:
380 print(self._jdf.showString(n, int(truncate), vertical))
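For reference, the udf above most likely fails because the lambda closes over a name array that is not defined in this scope (presumably bin_array was intended) and no return type is declared. A minimal corrected sketch of the same udf approach would be:

from pyspark.sql.types import DoubleType

# declare the return type and pass the numpy array that actually exists
closest_bin_udf = F.udf(lambda x: find_nearest(bin_array, x), DoubleType())
df_new = df.withColumn('Closest_bin', closest_bin_udf(df.Score))
df_new.show()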
You can use the steps below to create the aforementioned dataframe:
from pyspark.sql import *
import pyspark.sql.functions as F
import numpy as np
Stats = Row("Name", "Score")
stat1 = Stats('name1', 11.23)
stat2 = Stats('name2', 14.57)
stat3 = Stats('name3', 2.21)
stat4 = Stats('name4', 8.76)
stat5 = Stats('name5', 18.71)
stat_lst = [stat1 , stat2, stat3, stat4, stat5]
df = spark.createDataFrame(stat_lst)
df.show()
You can use a Bucketizer from pyspark.ml:
from pyspark.sql import *
import pyspark.sql.functions as F
import numpy as np
Stats = Row("Name", "Score")
stat_lst = [Stats('name1', 11.23) , Stats('name2', 14.57), Stats('name3', 2.21), Stats('name4', 8.76), Stats('name5', 18.71)]
df = spark.createDataFrame(stat_lst)
from pyspark.ml.feature import Bucketizer
"""
Bucketizer creates bins like 0-5:0, 5-10:1, 10-15:2, 15-20:3
As I see, your expected output wants the closest numbered bin, so you might
have to change your buckets or the variable `t` below accordingly.
"""
bucket_list = [0, 5, 10, 15, 20]
bucketizer = Bucketizer(splits=bucket_list, inputCol="Score", outputCol="buckets")
df_buck = bucketizer.setHandleInvalid("keep").transform(df)
df_buck.show()
I am still working on getting the closest bin, I'll update my answer.
If you want your array values for each bucket, you can use a udf to create a new column with the bucket names:
from pyspark.sql.functions import udf
from pyspark.sql.types import *
t = dict(zip(range(len(bucket_list)), bucket_list))
udf_foo = udf(lambda x: t[x], IntegerType())
df_buck = df_buck.withColumn("score_bucket", udf_foo("buckets"))
Output
>>> df_buck.show()
+-----+-----+-------+------------+
| Name|Score|buckets|score_bucket|
+-----+-----+-------+------------+
|name1|11.23| 2.0| 10|
|name2|14.57| 2.0| 10|
|name3| 2.21| 0.0| 0|
|name4| 8.76| 1.0| 5|
|name5|18.71| 3.0| 15|
+-----+-----+-------+------------+
EDIT: Correcting the score buckets:
# Not dynamic, but please try to figure out this business logic according to your use-case
df_buck = df_buck.withColumn(
    "correct_buckets",
    F.when(df_buck.Score - df_buck.score_bucket > 5/2, F.col("score_bucket") + 5)
     .otherwise(F.col("score_bucket"))
).drop("buckets", "score_bucket")
Now output is as expected:
+-----+-----+---------------+
| Name|Score|correct_buckets|
+-----+-----+---------------+
|name1|11.23| 10|
|name2|14.57| 15|
|name3| 2.21| 0|
|name4| 8.76| 10|
|name5|18.71| 20|
+-----+-----+---------------+
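A UDF-free variant that avoids hard-coding the bucket width is sketched below, assuming the bins are known on the driver; it builds a chain of F.when expressions over the midpoints between adjacent bins:

bins = [0.0, 5.0, 10.0, 15.0, 20.0]

# start from the largest bin and walk down: any Score below the midpoint
# between two adjacent bins is assigned to the lower bin
closest_expr = F.lit(bins[-1])
for lo, hi in zip(reversed(bins[:-1]), reversed(bins[1:])):
    closest_expr = F.when(F.col("Score") < (lo + hi) / 2, F.lit(lo)).otherwise(closest_expr)

df_new = df.withColumn("Closest_bin", closest_expr)
df_new.show()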
You can also use a pandas_udf, although I'd suggest you test the speed and memory consumption as you scale up:
from pyspark.sql.functions import pandas_udf, PandasUDFType
import numpy as np
import pandas as pd
df = spark.createDataFrame(zip(["name_"+str(i) for i in range(1,6)], [11.23, 14.57, 2.21, 8.76, 18.71]), ["Name", "Score"])
bin_array = np.array([0, 5, 10, 15, 20])
@pandas_udf('double', PandasUDFType.SCALAR)
def find_nearest(value):
    # for each score, compute the distance to every bin and pick the closest one
    res = bin_array[np.newaxis, :] - value.values[:, np.newaxis]
    ret_vals = [bin_array[np.argmin(np.abs(i))] for i in res]
    return pd.Series(ret_vals)
df.withColumn('v2', find_nearest(df.Score)).show()
Output
+------+-----+----+
| Name|Score| v2|
+------+-----+----+
|name_1|11.23|10.0|
|name_2|14.57|15.0|
|name_3| 2.21| 0.0|
|name_4| 8.76|10.0|
|name_5|18.71|20.0|
+------+-----+----+
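On Spark 3.x you can declare the same pandas UDF with Python type hints instead of PandasUDFType; a minimal sketch, assuming the same bin_array and imports as above:

from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def find_nearest_v3(value: pd.Series) -> pd.Series:
    # same logic as above: pick the closest bin for each score
    res = bin_array[np.newaxis, :] - value.values[:, np.newaxis]
    return pd.Series([float(bin_array[np.argmin(np.abs(i))]) for i in res])

df.withColumn('v2', find_nearest_v3(df.Score)).show()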
Related
I have a dictionary:
dict = {10: 1, 50: 2, 200: 3, 500: 4}
And a Dask DataFrame:
+---+---+
| a| b|
+---+---+
| 1| 24|
| 1| 49|
| 2|125|
| 3|400|
+---+---+
I want to groupBy a and get the minimum b value. After that, I want to check which dict key is closest to b and create a new column with the dict value.
As an example, when b=24, the closest key is 10, so I want to assign the value 1.
This is the result I am expecting:
+---+---+-------+
| a| b|closest|
+---+---+-------+
| 1| 24| 1|
| 1| 49| 2|
| 2|125| 3|
| 3|400| 4|
+---+---+-------+
I have found something similar with PySpark. I have not been able to make it run, but it apparently ran for other people. I am sharing it anyway for reference.
df = spark.createDataFrame(
[
(1, 24),
(1, 49),
(2, 125),
(3, 400)
],
["a", "b"]
)
dict = {10:1, 50:2, 200: 3, 500: 4}
def func(value, dict):
    closest_key = (
        value if value in dict else builtins.min(
            dict.keys(), key=lambda k: builtins.abs(k - value)
        )
    )
    score = dict.get(closest_key)
    return score
df = (
    df.groupby('a')
    .agg(
        min('b')
    )
).withColumn('closest', func('b', dict))
From what I understand, in the Spark version the calculation was done per row, and I have not been able to replicate that.
Instead of thinking of it as a row-wise operation, you can think of it as a partition-wise operation. If my interpretation is off, you can still use this sample I wrote for the most part with a few tweaks.
I will show a solution with Fugue that lets you just define your logic in Pandas, and then bring it to Dask. This will return a Dask DataFrame.
First some setup, note that df is a Pandas DataFrame. This is meant to represent a smaller sample you can test on:
import pandas as pd
import dask.dataframe as dd
import numpy as np
_dict = {10: 1, 50: 2, 200: 3, 500: 4}
df = pd.DataFrame({"a": [1,1,2,3], "b":[24,49,125,400]})
ddf = dd.from_pandas(df, npartitions=2)
Then we define the logic. It is written to handle one partition, so all values in column a will already be the same.
def logic(df: pd.DataFrame) -> pd.DataFrame:
    # handles the logic for 1 group; all values in column a are the same
    min_b = df['b'].min()
    keys = np.array(list(_dict.keys()))
    # closest taken from https://stackoverflow.com/a/10465997/11163214
    closest = keys[np.abs(keys - min_b).argmin()]
    closest_val = _dict[closest]
    df = df.assign(closest=closest_val)
    return df
We can test this on Pandas:
logic(df.loc[df['a'] == 1])
and we'll get:
a b closest
0 1 24 1
1 1 49 1
So then we can just bring it to Dask with Fugue. We just need to call the transform function:
from fugue import transform
ddf = transform(ddf,
                logic,
                schema="*,closest:int",
                partition={"by": "a"},
                engine="dask")
ddf.compute()
This can take in either Pandas or Dask DataFrames and will output the Dask DataFrame because we specified the "dask" engine. There is also a "spark" engine if you want a Spark DataFrame.
Schema is a requirement for distributed computing so we specify the output schema here. We also partition by column a.
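For example, a minimal sketch of the same call executed on Spark, assuming an active SparkSession named spark and a Spark DataFrame sdf with the same columns:

sdf_out = transform(sdf,
                    logic,
                    schema="*,closest:int",
                    partition={"by": "a"},
                    engine=spark)  # passing the SparkSession selects Fugue's Spark engine
sdf_out.show()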
Here is another approach for you, friend: this will return a numpy array, but hey, it will be faster than Spark, and you can easily reindex it.

import numpy as np

a = pydf.toPandas().to_numpy()   # Spark DataFrames have no toNumpy(); go through pandas
a = a[:, 1]                      # grabs your b column
# pick the value for the first matching threshold; adjust the thresholds
# (e.g. use midpoints between the dict keys) if you want true "closest key" behaviour
np.select([a <= 10, a <= 50, a <= 200, a <= 500], [1, 2, 3, 4], default=a)
I have created a dataframe as shown below:
import ast
from pyspark.sql.functions import udf
values = [(u"['2','4','713']", 10), (u"['12','245']", 20), (u"['101','12']", 30)]
df = sqlContext.createDataFrame(values,['list','A'])
df.show()
+---------------+---+
|           list|  A|
+---------------+---+
|['2','4','713']| 10|
|   ['12','245']| 20|
|   ['101','12']| 30|
+---------------+---+
**How can I convert the above dataframe so that each element in the list is a float and is within a proper list?**
I tried the following:
def df_amp_conversion(df_modelamp):
    string_list_to_list = udf(lambda row: ast.literal_eval(str(row)))
    df_modelamp = df_modelamp.withColumn('float_list', string_list_to_list(col("list")))

df2 = df_amp_conversion(df)
But the data remains the same without a change.
I don't want to convert the dataframe to pandas or use collect, as it is memory intensive.
If possible, please give me an optimal solution. I am using PySpark.
That's because you forgot to specify the return type of the udf:
udf(lambda row: ast.literal_eval(str(row)), "array<integer>")
Though something like this would be more efficient:
from pyspark.sql.functions import rtrim, ltrim, split
df = spark.createDataFrame(["""u'[23,4,77,890,4]"""], "string").toDF("list")
df.select(split(
regexp_replace("list", "^u'\\[|\\]$", ""), ","
).cast("array<integer>").alias("list")).show()
# +-------------------+
# | list|
# +-------------------+
# |[23, 4, 77, 890, 4]|
# +-------------------+
I can produce the desired result in Python 3 with a small change to the definition of df_amp_conversion: you didn't return the value of df_modelamp! This code works properly for me:
import ast
from pyspark.sql.functions import udf, col
values = [(u"['2','4','713']",10),(u"['12','245']",20),(u"['101','12']",30)]
df = sqlContext.createDataFrame(values,['list','A'])
def df_amp_conversion(df_modelamp):
    string_list_to_list = udf(lambda row: ast.literal_eval(str(row)))
    df_modelamp = df_modelamp.withColumn('float_list', string_list_to_list(col("list")))
    return df_modelamp
df2 = df_amp_conversion(df)
df2.show()
# +---------------+---+-----------+
# | list| A| float_list|
# +---------------+---+-----------+
# |['2','4','713']| 10|[2, 4, 713]|
# | ['12','245']| 20| [12, 245]|
# | ['101','12']| 30| [101, 12]|
# +---------------+---+-----------+
This question already has answers here:
How to access element of a VectorUDT column in a Spark DataFrame?
Context: I have a DataFrame with 2 columns, word and vector, where the column type of "vector" is VectorUDT.
An Example:
word | vector
assert | [435,323,324,212...]
And I want to get this:
word | v1 | v2 | v3 | v4 | v5 | v6 ......
assert | 435 | 5435| 698| 356|....
Question:
How can I split a column with vectors in several columns for each dimension using PySpark ?
Thanks in advance
Spark >= 3.0.0
Since Spark 3.0.0 this can be done without using UDF.
from pyspark.ml.functions import vector_to_array
from pyspark.sql.functions import col

(df
    .withColumn("xs", vector_to_array("vector"))
    .select(["word"] + [col("xs")[i] for i in range(3)]))
## +-------+-----+-----+-----+
## | word|xs[0]|xs[1]|xs[2]|
## +-------+-----+-----+-----+
## | assert| 1.0| 2.0| 3.0|
## |require| 0.0| 2.0| 0.0|
## +-------+-----+-----+-----+
Spark < 3.0.0
One possible approach is to convert to and from RDD:
from pyspark.ml.linalg import Vectors
df = sc.parallelize([
("assert", Vectors.dense([1, 2, 3])),
("require", Vectors.sparse(3, {1: 2}))
]).toDF(["word", "vector"])
def extract(row):
    return (row.word, ) + tuple(row.vector.toArray().tolist())
df.rdd.map(extract).toDF(["word"]) # Vector values will be named _2, _3, ...
## +-------+---+---+---+
## | word| _2| _3| _4|
## +-------+---+---+---+
## | assert|1.0|2.0|3.0|
## |require|0.0|2.0|0.0|
## +-------+---+---+---+
An alternative solution would be to create an UDF:
from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, DoubleType
def to_array(col):
    def to_array_(v):
        return v.toArray().tolist()
    # Important: asNondeterministic requires Spark 2.3 or later
    # It can be safely removed i.e.
    # return udf(to_array_, ArrayType(DoubleType()))(col)
    # but at the cost of decreased performance
    return udf(to_array_, ArrayType(DoubleType())).asNondeterministic()(col)

(df
    .withColumn("xs", to_array(col("vector")))
    .select(["word"] + [col("xs")[i] for i in range(3)]))
## +-------+-----+-----+-----+
## | word|xs[0]|xs[1]|xs[2]|
## +-------+-----+-----+-----+
## | assert| 1.0| 2.0| 3.0|
## |require| 0.0| 2.0| 0.0|
## +-------+-----+-----+-----+
For Scala equivalent see Spark Scala: How to convert Dataframe[vector] to DataFrame[f1:Double, ..., fn: Double)].
To split the rawPrediction or probability columns generated after training a PySpark ML model into separate pandas columns, you can apply over the column like this:
your_pandas_df['probability'].apply(lambda x: pd.Series(x.toArray()))
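For instance, a small hedged sketch, assuming your_pandas_df came from toPandas() and all the vectors have the same length, that expands the vector into named columns and attaches them back:

import pandas as pd

# expand each probability vector into its own set of columns
probs = your_pandas_df['probability'].apply(lambda v: pd.Series(v.toArray()))
probs.columns = ['p{}'.format(i) for i in range(probs.shape[1])]

# concatenate the expanded columns back onto the original frame
your_pandas_df = pd.concat([your_pandas_df.drop(columns=['probability']), probs], axis=1)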
It is much faster to use the i_th udf from how-to-access-element-of-a-vectorudt-column-in-a-spark-dataframe
The extract function given in the solution by zero323 above uses tolist(), which creates a Python list object, populates it with Python float objects, and finds the desired element by traversing the list; the result then needs to be converted back to a Java double, and all of this is repeated for every row. Using the RDD is much slower than the to_array udf, which also calls tolist(), but both are much slower than a udf that lets Spark SQL handle most of the work.
Timing code comparing the rdd extract and the to_array udf proposed here to the i_th udf from 3955864:
from pyspark.context import SparkContext
from pyspark.sql import Row, SQLContext, SparkSession
from pyspark.sql.functions import lit, udf, col
from pyspark.sql.types import ArrayType, DoubleType
import pyspark.sql.dataframe
from pyspark.sql.functions import pandas_udf, PandasUDFType
sc = SparkContext('local[4]', 'FlatTestTime')
spark = SparkSession(sc)
spark.conf.set("spark.sql.execution.arrow.enabled", True)
from pyspark.ml.linalg import Vectors
# copy the two rows in the test dataframe a bunch of times,
# make this small enough for testing, or go for "big data" and be prepared to wait
REPS = 20000
df = sc.parallelize([
("assert", Vectors.dense([1, 2, 3]), 1, Vectors.dense([4.1, 5.1])),
("require", Vectors.sparse(3, {1: 2}), 2, Vectors.dense([6.2, 7.2])),
] * REPS).toDF(["word", "vector", "more", "vorpal"])
def extract(row):
    return (row.word, ) + tuple(row.vector.toArray().tolist(),) + (row.more,) + tuple(row.vorpal.toArray().tolist(),)

def test_extract():
    return df.rdd.map(extract).toDF(['word', 'vector__0', 'vector__1', 'vector__2', 'more', 'vorpal__0', 'vorpal__1'])

def to_array(col):
    def to_array_(v):
        return v.toArray().tolist()
    return udf(to_array_, ArrayType(DoubleType()))(col)

def test_to_array():
    df_to_array = df.withColumn("xs", to_array(col("vector"))) \
        .select(["word"] + [col("xs")[i] for i in range(3)] + ["more", "vorpal"]) \
        .withColumn("xx", to_array(col("vorpal"))) \
        .select(["word"] + ["xs[{}]".format(i) for i in range(3)] + ["more"] + [col("xx")[i] for i in range(2)])
    return df_to_array
# pack up to_array into a tidy function
def flatten(df, vector, vlen):
    fieldNames = df.schema.fieldNames()
    if vector in fieldNames:
        names = []
        for fieldname in fieldNames:
            if fieldname == vector:
                names.extend([col(vector)[i] for i in range(vlen)])
            else:
                names.append(col(fieldname))
        return df.withColumn(vector, to_array(col(vector)))\
            .select(names)
    else:
        return df

def test_flatten():
    dflat = flatten(df, "vector", 3)
    dflat2 = flatten(dflat, "vorpal", 2)
    return dflat2
def ith_(v, i):
    try:
        return float(v[i])
    except ValueError:
        return None
ith = udf(ith_, DoubleType())
select = ["word"]
select.extend([ith("vector", lit(i)) for i in range(3)])
select.append("more")
select.extend([ith("vorpal", lit(i)) for i in range(2)])
# %% timeit ...
def test_ith():
    return df.select(select)
if __name__ == '__main__':
    import timeit

    # make sure these work as intended
    test_ith().show(4)
    test_flatten().show(4)
    test_to_array().show(4)
    test_extract().show(4)

    print("i_th\t\t",
          timeit.timeit("test_ith()",
                        setup="from __main__ import test_ith",
                        number=7)
          )
    print("flatten\t\t",
          timeit.timeit("test_flatten()",
                        setup="from __main__ import test_flatten",
                        number=7)
          )
    print("to_array\t",
          timeit.timeit("test_to_array()",
                        setup="from __main__ import test_to_array",
                        number=7)
          )
    print("extract\t\t",
          timeit.timeit("test_extract()",
                        setup="from __main__ import test_extract",
                        number=7)
          )
Results:
i_th 0.05964796099999958
flatten 0.4842299350000001
to_array 0.42978780299999997
extract 2.9254476840000017
from pyspark.sql.types import DoubleType

def splitVector(df, new_features=['f1', 'f2']):
    # new_features should have the same length as the vector column
    schema = df.schema
    cols = df.columns
    for col in new_features:
        schema = schema.add(col, DoubleType(), True)
    return spark.createDataFrame(
        df.rdd.map(lambda row: [row[i] for i in cols] + row.features.toArray().tolist()),
        schema)
The function turns the "features" vector column into separate columns.
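A hypothetical usage sketch, assuming a SparkSession named spark and a vector column literally named features (the name is hard-coded inside the function):

from pyspark.ml.linalg import Vectors

demo = spark.createDataFrame(
    [("a", Vectors.dense([1.0, 2.0])), ("b", Vectors.dense([3.0, 4.0]))],
    ["id", "features"])
splitVector(demo, new_features=["f1", "f2"]).show()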
I want to compare the values of one column with another column that holds a range of reference values.
I have tried using the following code:
from pyspark.sql.functions import udf, size
from pyspark.sql.types import *
df1 = sc.parallelize([([1], [1, 2, 3]), ([2], [4, 5, 6,7])]).toDF(["value", "Reference_value"])
intersect = lambda type: (udf(
    lambda x, y: (
        list(set(x) & set(y)) if x is not None and y is not None else None),
    ArrayType(type)))
integer_intersect = intersect(IntegerType())
# df1.select(
# integer_intersect("value", "Reference_value"),
# size(integer_intersect("value", "Reference_value"))).show()
df1=df1.where(size(integer_intersect("value", "Reference_value")) > 0)
df1.show()
The above code works if we create the dataframe as above, because the value and Reference_value columns are arrays (ArrayType of LongType).
But if I read the dataframe from a CSV, I am not able to convert the columns to array type. Here df1 is read from a CSV.
df1 is as follows:
+--------+-----+---------------+
|category|value|Reference_value|
+--------+-----+---------------+
|   count|    1|              1|
| n_timer|  n20|        n40,n20|
|  frames|   54|             56|
|   timer|   n8|       n3,n6,n7|
|     pdf|FALSE|           TRUE|
|     zip|FALSE|          FALSE|
+--------+-----+---------------+
I want to compare "value" column with "Reference_value" column and to derive two new dataframes where one data frame is to filter rows if value column is not in the set of Reference value.
Output df2=
+--------+-----+---------------+
|category|value|Reference_value|
+--------+-----+---------------+
|   count|    1|              1|
| n_timer|  n20|        n40,n20|
|     zip|FALSE|          FALSE|
+--------+-----+---------------+
Output df3=
+--------+-----+---------------+
|category|value|Reference_value|
+--------+-----+---------------+
|  frames|   54|             56|
|   timer|   n8|       n3,n6,n7|
|     pdf|FALSE|           TRUE|
+--------+-----+---------------+
Is there any easier way, like array_contains? I tried array_contains as well, but it is not working:
from pyspark.sql.functions import array_contains
df.where(array_contains("Reference_value", df1["vale"]))
# One can copy-paste the code below to reproduce the inputs and outputs directly
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import Row
from pyspark.sql.functions import udf, size
from pyspark.sql.types import *
from pyspark.sql.functions import split
sc = SparkContext.getOrCreate()
sqlContext = SQLContext.getOrCreate(sc)
df1 = sc.parallelize([("count","1","1"), ("n_timer","n20","n40,n20"), ("frames","54","56"),("timer","n8","n3,n6,n7"),("pdf","FALSE","TRUE"),("zip","FALSE","FALSE")]).toDF(["category", "value","Reference_value"])
print(df1.show())
df1=df1.withColumn("Reference_value", split("Reference_value", ",\s*").cast("array<string>"))
df1=df1.withColumn("value", split("value", ",\s*").cast("array<string>"))
intersect = lambda type: (udf(
    lambda x, y: (
        list(set(x) & set(y)) if x is not None and y is not None else None),
    ArrayType(type)))
string_intersect = intersect(StringType())
df2=df1.where(size(string_intersect("value", "Reference_value")) > 0)
df3=df1.where(size(string_intersect("value", "Reference_value")) <= 0)
print(df2.show())
print(df3.show())
input df1=
+--------+-----+---------------+
|category|value|Reference_value|
+--------+-----+---------------+
| count| 1| 1|
| n_timer| n20| n40,n20|
| frames| 54| 56|
| timer| n8| n3,n6,n7|
| pdf|FALSE| TRUE|
| zip|FALSE| FALSE|
+--------+-----+---------------+
df2=
+--------+-------+---------------+
|category| value|Reference_value|
+--------+-------+---------------+
| count| [1]| [1]|
| n_timer| [n20]| [n40, n20]|
| zip|[FALSE]| [FALSE]|
+--------+-------+---------------+
df3=
+--------+-------+---------------+
|category| value|Reference_value|
+--------+-------+---------------+
| frames| [54]| [56]|
| timer| [n8]| [n3, n6, n7]|
| pdf|[FALSE]| [TRUE]|
+--------+-------+---------------+
Based on previous questions: 1, 2. Suppose I have the following dataframe:
df = spark.createDataFrame(
[(1, "a", 23.0), (3, "B", -23.0)],
("x1", "x2", "x3"))
And I want to add a new column x4, but the values I want to add are in a Python list, e.g. x4_ls = [35.0, 32.0]. Is there a best way to add a new column to the Spark dataframe? (Note that I use Spark 2.1.)
Output should be something like:
## +---+---+-----+----+
## | x1| x2| x3| x4|
## +---+---+-----+----+
## | 1| a| 23.0|35.0|
## | 3| B|-23.0|32.0|
## +---+---+-----+----+
I can also transform my list to a dataframe with df_x4 = spark.createDataFrame([Row(**{'x4': x}) for x in x4_ls]) (but I don't know how to concatenate the dataframes together).
Thanks to Gaurav Dhama for a great answer! I made small changes to his solution. Here is my solution, which joins the two dataframes together on an added row_num column.
from pyspark.sql import Row
def flatten_row(r):
    r_ = r.features.asDict()
    r_.update({'row_num': r.row_num})
    return Row(**r_)

def add_row_num(df):
    df_row_num = df.rdd.zipWithIndex().toDF(['features', 'row_num'])
    df_out = df_row_num.rdd.map(lambda x: flatten_row(x)).toDF()
    return df_out
df = add_row_num(df)
df_x4 = add_row_num(df_x4)
df_concat = df.join(df_x4, on='row_num').drop('row_num')
We can concatenate on the basis of row numbers as follows. Suppose we have two dataframes, df and df_x4:
def addrownum(df):
    dff = df.rdd.zipWithIndex().toDF(['features', 'rownum'])
    # note: use .rdd here as well; DataFrame.map was removed in Spark 2.x
    odf = dff.rdd.map(lambda x: tuple(x.features) + tuple([x.rownum])).toDF(df.columns + ['rownum'])
    return odf
df1 = addrownum(df)
df2 = addrownum(df_x4)
outputdf = df1.join(df2,df1.rownum==df2.rownum).drop(df1.rownum).drop(df2.rownum)
## outputdf
## +---+---+-----+----+
## | x1| x2| x3| x4|
## +---+---+-----+----+
## | 1| a| 23.0|35.0|
## | 3| B|-23.0|32.0|
## +---+---+-----+----+
outputdf is your required output dataframe.
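An alternative sketch that stays in the DataFrame API, assuming Spark 2.x, uses row_number over a window on each side before joining. Note that monotonically_increasing_id only follows each dataframe's current ordering and the window below collapses everything into a single partition, so zipWithIndex is generally safer for large data:

from pyspark.sql import Window
from pyspark.sql.functions import row_number, monotonically_increasing_id

w = Window.orderBy(monotonically_increasing_id())
df1 = df.withColumn("rownum", row_number().over(w))
df2 = df_x4.withColumn("rownum", row_number().over(w))
outputdf = df1.join(df2, "rownum").drop("rownum")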