Using Word2VecModel.transform() does not work in a map function - python

I have built a Word2Vec model using Spark and saved it as a model. Now I want to use it in another piece of code as an offline model. I have loaded the model and used it to get the vector of a word (e.g. "Hello"), and it works well. But I need to call it for many words in an RDD using map.
When I call model.transform() in a map function, it throws this error:
"It appears that you are attempting to reference SparkContext from a broadcast "
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
The code:
from pyspark import SparkContext
from pyspark.mllib.feature import Word2Vec
from pyspark.mllib.feature import Word2VecModel
sc = SparkContext('local[4]', appName='Word2Vec')
model = Word2VecModel.load(sc, "word2vecModel")
x = model.transform("Hello")
print(x[0])  # it works fine and returns [0.234, 0.800,....]
y = sc.parallelize([['Hello'], ['test']])
y.map(lambda w: model.transform(w[0])).collect()  # it throws the error
I would really appreciate your help.

This is expected behavior. Like other MLlib models, the Python object is just a wrapper around the Scala model, and the actual processing is delegated to its JVM counterpart. Since the Py4J gateway is not accessible on workers (see How to use Java/Scala function from an action or a transformation?), you cannot call Java/Scala methods from an action or transformation.
Typically MLlib models provide a helper method which can work directly on RDDs, but that is not the case here. Word2VecModel provides a getVectors method which returns a map from words to vectors, but unfortunately it is a JavaMap, so it won't work inside a transformation. You could try something like this:
from pyspark.mllib.linalg import DenseVector

vectors_ = model.getVectors()  # py4j.java_collections.JavaMap
vectors = {k: DenseVector([x for x in vectors_.get(k)])
           for k in vectors_.keys()}
to get a Python dictionary, but it will be extremely slow. Another option is to dump this object to disk in a form that can be consumed by Python, but that requires some tinkering with Py4J and is better avoided. Instead, let's read the model as a DataFrame:
lookup = sqlContext.read.parquet("path_to_word2vec_model/data").alias("lookup")
and we'll get the following structure:
lookup.printSchema()
## root
## |-- word: string (nullable = true)
## |-- vector: array (nullable = true)
## | |-- element: float (containsNull = true)
which can be used to map words to vectors, for example through a join:
from pyspark.sql.functions import col
words = sc.parallelize([('hello', ), ('test', )]).toDF(["word"]).alias("words")
words.join(lookup, col("words.word") == col("lookup.word")).show()
## +-----+-----+--------------------+
## | word| word| vector|
## +-----+-----+--------------------+
## |hello|hello|[-0.030862354, -0...|
## | test| test|[-0.13154022, 0.2...|
## +-----+-----+--------------------+
If the data fits into driver/worker memory, you can try to collect it and map with a broadcast variable:
lookup_bd = sc.broadcast(lookup.rdd.collectAsMap())
rdd = sc.parallelize([['Hello'],['test']])
rdd.map(lambda ws: [lookup_bd.value.get(w) for w in ws])
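For completeness, a small self-contained sketch of the broadcast-lookup pattern, with toy, made-up vectors standing in for model.getVectors(); note that dict.get returns None for out-of-vocabulary words (such as the capitalized 'Hello'), so a fallback value is worth considering:
from pyspark import SparkContext
from pyspark.mllib.linalg import DenseVector

sc = SparkContext('local[4]', appName='Word2VecLookup')

# hypothetical word -> vector mapping standing in for the real model
toy_vectors = {"hello": DenseVector([0.1, 0.2]), "test": DenseVector([0.3, 0.4])}
lookup_bd = sc.broadcast(toy_vectors)

default = DenseVector([0.0, 0.0])  # fallback for words missing from the vocabulary
rdd = sc.parallelize([['Hello'], ['test']])
print(rdd.map(lambda ws: [lookup_bd.value.get(w.lower(), default) for w in ws]).collect())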

Related

Printing the upload progress when using the `xarray.Dataset.to_zarr` function

I'm trying to upload an xarray dataset to GCP using the function ds.to_zarr(store=store), and it works perfectly. However, I would like to show the progress for big datasets. Is there any option to chunk my dataset in a way that lets me use tqdm or something like that to log the upload progress?
Here is the code that I currently have:
import os
import xarray as xr
import numpy as np
import gcsfs
from dask.diagnostics import ProgressBar

if __name__ == '__main__':
    # for testing
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "service-account.json"

    # create xarray
    data_arr = np.random.rand(5000, 100, 100)
    data_xarr = xr.DataArray(data_arr,
                             dims=["x", "y", "z"])

    # define store
    gcp_blob_uri = "gs://gprlib/test.zarr"
    gcs = gcsfs.GCSFileSystem()
    store = gcs.get_mapper(gcp_blob_uri)

    # delayed to_zarr computation -> seems that it does not work
    write_job = data_xarr\
        .to_dataset(name="data")\
        .to_zarr(store, mode="w", compute=False)

    print(write_job)
xarray.Dataset.to_zarr has an optional argument compute which is True by default:
compute (bool, optional) – If True write array data immediately, otherwise return a dask.delayed.Delayed object that can be computed to write array data later. Metadata is always updated eagerly.
Using this, you can track the progress with dask's own distributed progress bar (dask.distributed.progress):
from dask import distributed

write_job = ds.to_zarr(store, compute=False)
write_job = write_job.persist()

# this will return an interactive (non-blocking) widget if in a notebook
# environment. To force the widget to block, provide notebook=False.
distributed.progress(write_job, notebook=False)
[############## ] | 35% Completed | 4.5s
Note that for this to work, the dataset must consist of chunked dask arrays. If the data is in memory, you could use a single chunk per array with ds.chunk().to_zarr.
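For a purely local run without a distributed client, here is a minimal sketch using the dask.diagnostics.ProgressBar already imported in the question; it writes to a local path rather than GCS, and the chunk size is an arbitrary choice for illustration:
import numpy as np
import xarray as xr
from dask.diagnostics import ProgressBar

ds = xr.DataArray(np.random.rand(5000, 100, 100),
                  dims=["x", "y", "z"]).to_dataset(name="data")

# chunk so the write is split into many dask tasks the progress bar can count
write_job = ds.chunk({"x": 500}).to_zarr("test.zarr", mode="w", compute=False)

with ProgressBar():
    write_job.compute()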

pyspark ML LabeledPoint not working with LinearRegression

I'm studying Spark 3.0.1 with pyspark, and have set up some data for a simple OLS regression using
data = results.select('OrderMonthYear', 'SaleAmount').rdd.map(lambda row: LabeledPoint(row[1], [row[0]])).toDF()
OrderMonthYear is my feature column (int), and SaleAmount is the response (float). The LabeledPoint class was imported from pyspark.mllib.regression. I then try to fit the regression model with
from pyspark.ml.regression import LinearRegression
lr = LinearRegression()
modelA = lr.fit(data, {lr.regParam:0.0})
to get this exception
IllegalArgumentException: requirement failed: Column features must be of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.
This is clearly not very helpful, as the required and the passed features seem to be the same structs. I've searched online and only found answers to this problem for Java, or for cases where someone built the struct themselves. The exception was thrown from a util function that was just re-raising a Java exception (#Hide where the exception came from that shows a non-Pythonic JVM exception message.), so I can't debug further.
The RDD-based MLlib functions are deprecated, and the mllib LabeledPoint carries the old mllib vector type, which the DataFrame-based ML API does not accept (both vector UDTs happen to print as the same struct, which is why the error message looks self-contradictory). I suggest using ML's VectorAssembler instead:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# OrderMonthYear is the feature and SaleAmount the label, as in the question
data = spark.createDataFrame([[0, 1], [1, 2], [2, 3]]).toDF('OrderMonthYear', 'SaleAmount')
va = VectorAssembler(inputCols=['OrderMonthYear'], outputCol='features')
data2 = va.transform(data)
lr = LinearRegression(featuresCol='features', labelCol='SaleAmount')
model = lr.fit(data2)
For anyone else following the same LI Learning course: based on some modifications to the accepted answer above, to align more with what I was seeing in the course, here's what the Cmd 4 cell should look like:
# convenience for specifying schema
from pyspark.ml.feature import VectorAssembler

data = (VectorAssembler(inputCols=['OrderMonthYear'], outputCol='features')
        .transform(results.select("OrderMonthYear", "SaleAmount"))
        .drop('OrderMonthYear')
        .withColumnRenamed('SaleAmount', 'label'))
display(data)
Alternatively, you can use the following which also works:
from pyspark.ml.linalg import Vectors
data = results.rdd.map(lambda r: (Vectors.dense(r[0]), r[1])).toDF(["features","label"])
display(data)
Then you should be good to go. Note that you'll want to make the same changes to Cmd 4 in notebooks 4.4 and 4.5 as well. Hope this helps!
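If it helps, here is a minimal self-contained sketch (the numbers are made up for illustration) showing that once the DataFrame has the default features/label columns, a plain LinearRegression() needs no extra column arguments:
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.master("local").appName("ols_demo").getOrCreate()

# toy rows standing in for (OrderMonthYear, SaleAmount)
data = spark.createDataFrame(
    [(Vectors.dense(201501.0), 100.0),
     (Vectors.dense(201502.0), 110.0),
     (Vectors.dense(201503.0), 125.0)],
    ["features", "label"])

model = LinearRegression(regParam=0.0).fit(data)
print(model.coefficients, model.intercept)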

How to read files in parallel in DataBricks?

Could someone tell me how to read files in parallel? I'm trying something like this:
def processFile(path):
    df = spark.read.json(path)
    return df.count()

paths = ["...", "..."]
distPaths = sc.parallelize(paths)
counts = distPaths.map(processFile).collect()
print(counts)
It fails with the following error:
PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
Is there any other way to optimize this?
In your particular case, you can just pass the whole paths array to DataFrameReader:
df = spark.read.json(paths)
...and Spark will parallelize the reading of the individual files for you.
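If you still need the per-file counts that processFile returned, one possible follow-up (the paths and column name here are illustrative) is to tag each row with its source file via input_file_name() and group by it:
from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name

spark = SparkSession.builder.getOrCreate()

paths = ["/data/a.json", "/data/b.json"]  # illustrative paths
df = spark.read.json(paths)

# count rows per source file instead of calling count() path by path
counts = (df.withColumn("source_file", input_file_name())
            .groupBy("source_file")
            .count())
counts.show(truncate=False)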

How to change SparkContext property spark.sql.pivotMaxValues in jupyter PySpark session

I made the following code change to increase spark.sql.pivotMaxValues. Sadly, it had no effect on the resulting error after restarting Jupyter and running the code again.
from pyspark import SparkConf, SparkContext
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix
import numpy as np

try:
    #conf = SparkConf().setMaster('local').setAppName('autoencoder_recommender_wide_user_record_maker') # original
    #conf = SparkConf().setMaster('local').setAppName('autoencoder_recommender_wide_user_record_maker').set("spark.sql.pivotMaxValues", "99999")
    conf = SparkConf().setMaster('local').setAppName('autoencoder_recommender_wide_user_record_maker').set("spark.sql.pivotMaxValues", 99999)
    sc = SparkContext(conf=conf)
except:
    print("Variables sc and conf are now defined. Everything is OK and ready to run.")

<... (other code) ...>

df = sess.read.csv(in_filename, header=False, mode="DROPMALFORMED", schema=csv_schema)
ct = df.crosstab('username', 'itemname')
Spark error message that was thrown on my crosstab line of code:
IllegalArgumentException: "requirement failed: The number of distinct values for itemname, can't exceed 1e4. Currently 16467"
I expect I'm not actually setting the config variable that I was trying to set, so what is a way to get that value actually set, programmatically if possible? Thanks.
References:
Finally, you may be interested to know that there is a maximum number of values for the pivot column if none are specified. This is mainly to catch mistakes and avoid OOM situations. The config key is spark.sql.pivotMaxValues and its default is 10,000.
Source: https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-apache-spark.html
I would prefer to change the config variable upwards, since I have written the crosstab code already which works great on smaller datasets. If it turns out there truly is no way to change this config variable then my backup plans are, in order:
relational right outer join to implement my own Spark crosstab with higher capacity than was provided by databricks
scipy dense vectors with handmade unique combinations calculation code using dictionaries
kernel.json
This configuration file is distributed together with the Jupyter PySpark kernel:
~/.ipython/kernels/pyspark/kernel.json
It contains the Spark configuration, including the variable PYSPARK_SUBMIT_ARGS, the list of arguments that will be passed to the spark-submit script.
You can try adding --conf spark.sql.pivotMaxValues=99999 to this variable in the mentioned file.
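If you prefer to keep everything in the notebook, a hedged sketch of the same idea done from Python: PYSPARK_SUBMIT_ARGS can be set via os.environ, but only before the SparkContext is created (the --conf flag is a standard spark-submit option; the app name and master are illustrative).
import os

# must run before any SparkContext exists; the trailing "pyspark-shell" is required
os.environ["PYSPARK_SUBMIT_ARGS"] = "--conf spark.sql.pivotMaxValues=99999 pyspark-shell"

from pyspark import SparkConf, SparkContext
sc = SparkContext(conf=SparkConf().setMaster("local").setAppName("pivot_conf_demo"))

# may print the value if the submit args were picked up; "not set" otherwise
print(sc.getConf().get("spark.sql.pivotMaxValues", "not set"))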
PS: There are also cases where people try to override this variable programmatically. You can give that a try too; one possibility is sketched below.
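A sketch of that programmatic route (not part of the original answer): spark.sql.pivotMaxValues is a runtime SQL conf, so it can usually be set on an existing SparkSession with spark.conf.set; the tiny DataFrame below is only there to exercise crosstab.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("pivot_demo").getOrCreate()

# raise the pivot/crosstab limit for this session
spark.conf.set("spark.sql.pivotMaxValues", "99999")

df = spark.createDataFrame(
    [("u1", "a"), ("u1", "b"), ("u2", "a")],
    ["username", "itemname"])
df.crosstab("username", "itemname").show()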

Combining Spark Streaming + MLlib

I've tried to use a Random Forest model in order to predict a stream of examples, but it appears that I cannot use that model to classify the examples.
Here is the code used in pyspark:
sc = SparkContext(appName="App")
model = RandomForest.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={}, impurity='gini', numTrees=150)
ssc = StreamingContext(sc, 1)
lines = ssc.socketTextStream(hostname, int(port))
parsedLines = lines.map(parse)
parsedLines.pprint()
predictions = parsedLines.map(lambda event: model.predict(event.features))
and the error returned while running it on the cluster:
Error : "It appears that you are attempting to reference SparkContext from a broadcast "
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
Is there a way to use a model generated from static data to predict streaming examples?
Thanks guys, I really appreciate it!
Yes, you can use a model generated from static data. The problem you are experiencing is not related to streaming at all. You simply cannot use a JVM-based model inside an action or transformation (see How to use Java/Scala function from an action or a transformation? for an explanation why). Instead, you should apply the predict method to a complete RDD, for example using transform on the DStream:
from pyspark.mllib.tree import RandomForest
from pyspark.mllib.util import MLUtils
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from operator import attrgetter

sc = SparkContext("local[2]", "foo")
ssc = StreamingContext(sc, 1)

data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
trainingData, testData = data.randomSplit([0.7, 0.3])

model = RandomForest.trainClassifier(
    trainingData, numClasses=2, categoricalFeaturesInfo={}, numTrees=3
)

(ssc
    .queueStream([testData])
    # Extract features
    .map(attrgetter("features"))
    # Predict
    .transform(lambda _, rdd: model.predict(rdd))
    .pprint())

ssc.start()
ssc.awaitTerminationOrTimeout(10)
