Getting external R libraries to work from rpy2 in pyspark - python

When I try to import an R library like zoo from rpy2 on PySpark worker nodes, it errors out.
Here's an example:
rdd = sc.parallelize([1, 2, 3, 4])

def mapper(x):
    from rpy2.robjects.packages import importr
    try:
        importr('zoo')
        return 'Success!'
    except:
        raise ValueError('Import Failed')

rdd.map(mapper).collect()  # errors out
But when I run it with pyspark --master local, it works.
Any idea why this is happening?
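A common cause is that the R package is installed on the driver but not on the executors. Here is a minimal diagnostic sketch (assuming R and rpy2 are present on every worker; the hostname reporting is purely illustrative) to check which hosts can actually see zoo:

import socket

def check_zoo(_):
    # Import on the worker itself, then report whether the 'zoo'
    # R package is visible from there.
    from rpy2.robjects.packages import isinstalled
    return (socket.gethostname(), isinstalled('zoo'))

# One element per partition is enough to probe each executor.
print(sc.parallelize(range(8), 8).map(check_zoo).distinct().collect())

Any host that reports False needs zoo installed in an R library path the workers can read.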

Related

How to productionise Python script for AWS Glue?

I'm following this tutorial video: https://www.youtube.com/watch?v=EzQArFt_On4
The example code provided in this video:
import sys

from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())
glueJob = Job(glueContext)
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
glueJob.init(args['JOB_NAME'], args)
sparkSession = glueContext.spark_session

# ETL process code
def etl_process():
    ...
    return xxx

glueJob.commit()
I'm wondering whether the part before the etl_process function can be used in production directly, or whether I need to wrap it in a separate function so that I can add a unit test for it.
Something like this:
def define_spark_session():
    sc = SparkContext.getOrCreate()
    glue_context = GlueContext(sc)
    glue_job = Job(glue_context)
    args = getResolvedOptions(sys.argv, ['JOB_NAME'])
    glue_job.init(args['JOB_NAME'], args)
    spark_session = glue_context.spark_session
    return spark_session
But it seems it doesn't need a parameter...
Or should I just write a unit test for the etl_process function?
Or maybe I could put the etl_process function in a separate Python file and import it into this script?
I'm new to this and a bit confused; could someone help please? Thanks.
For now it is very difficult to test AWS Glue itself locally, although there are some options, such as downloading the Docker image AWS provides and running the job from there (you'll probably need some tweaks, but it should be all right).
I think the easiest way is to convert the DynamicFrame you get from the Glue libraries into a Spark DataFrame (.toDF()) and then do things in pure Spark (PySpark), so you'll be able to test the result.
dataFrame = dynamic_frame.toDF()

def transformation(dataframe):
    return dataframe.withColumn(...)

def test_transformation():
    result = transformation(input_test_dataframe)
    assert ...
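A slightly fuller sketch of that idea (the column names and the greeting transformation are hypothetical, and the test builds its own local SparkSession rather than anything Glue-specific):

from pyspark.sql import SparkSession, functions as F

def etl_process(dataframe):
    # Pure-Spark transformation: no Glue objects, so it runs anywhere.
    return dataframe.withColumn("greeting", F.concat(F.lit("hello "), F.col("name")))

def test_etl_process():
    # Local SparkSession just for the test; no Glue dependencies needed.
    spark = SparkSession.builder.master("local[1]").appName("etl-test").getOrCreate()
    input_df = spark.createDataFrame([("Alice",), ("Bob",)], ["name"])
    result = etl_process(input_df)
    assert result.filter(F.col("greeting") == "hello Alice").count() == 1

In the Glue script itself you would then only convert the DynamicFrame with .toDF(), call etl_process, and convert back, so the untestable Glue-specific boilerplate stays as thin as possible.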

Importing rugarch R library into python

I need to import the R library rugarch into Python for volatility forecasting.
This is just an example that could be done entirely in Python, since it is univariate; however, I later have to apply a multivariate method for which I don't have a Python solution.
So I have done the following:
from rpy2.robjects.packages import importr
import rpy2.robjects as robjects
from rpy2.robjects import numpy2ri
The error happens when:
rugarch = importr('rugarch')
RRuntimeError: Error in loadNamespace(name) : there is no package called 'rugarch'
I also tried pointing it at the right folder:
import rpy2.rinterface
utils = importr("utils")
base = importr('base')
print(base._libPaths())
got: C:/Users/simeone/Anaconda3/envs/Luigi/Lib/R/library
rugarch = importr('rugarch', lib_loc="C:/Users/simeone/Anaconda3/envs/Luigi/Lib/R/library")
still the same error: RRuntimeError: Error in loadNamespace(name) : there is no package called 'rugarch'.
In addition I tried forcing the installation of rugarch as follows:
utils.install_packages('rugarch')
but I get this error: RRuntimeError: Error in contrib.url(repos, "source") :
trying to use CRAN without setting a mirror.
Can anybody help? I am stuck.
I decided to post an answer to this that works and can be of help to other people.
The last command was working, but the CRAN mirror was missing.
So the final code is:
from rpy2.robjects.packages import importr
import rpy2.robjects as robjects
from rpy2.robjects import numpy2ri
utils = importr("utils")
utils.chooseCRANmirror(ind=1) # this was missing
utils.install_packages('rugarch')
rugarch = importr('rugarch')
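As a follow-up sketch (not part of the original answer), once the import succeeds a default GARCH(1,1) model could be fitted from Python; the returns series below is synthetic, and the calls assume rugarch's standard ugarchspec/ugarchfit interface:

import numpy as np
from rpy2.robjects import numpy2ri

numpy2ri.activate()  # let rpy2 convert numpy arrays to R vectors

returns = np.random.normal(0, 0.01, 500)  # synthetic daily returns

spec = rugarch.ugarchspec()                       # default sGARCH(1,1) spec
fit = rugarch.ugarchfit(spec=spec, data=returns)  # fit the model in R
print(fit)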

Importing any function from an R package into python

While using the rpy2 library in Python to work with R, I get the following error message when trying to import a function from the bnlearn package:
# Using R inside python
import rpy2
import rpy2.robjects as robjects
import rpy2.robjects.packages as rpackages
from rpy2.robjects.vectors import StrVector
from rpy2.robjects.packages import importr
utils = rpackages.importr('utils')
utils.chooseCRANmirror(ind=1)
# Install packages
packnames = ('visNetwork', 'bnlearn')
utils.install_packages(StrVector(packnames))
# Load packages
visNetwork = importr('visNetwork')
bnlearn = importr('bnlearn')
tabu = bnlearn.tabu
fit = bnlearn.bn.fit
With the error:
AttributeError: module 'bnlearn' has no attribute 'bn'
Checking the bnlearn documentation, one finds that bn is a class structure. So one should inspect all the attributes of the object in question, that is, run:
bnlearn.__dict__['_rpy2r']
After that you should get a similar output like the next one, where you find how you would import each attribute of bnlearn:
...
...
'bn_boot': 'bn.boot',
'bn_cv': 'bn.cv',
'bn_cv_algorithm': 'bn.cv.algorithm',
'bn_cv_structure': 'bn.cv.structure',
'bn_fit': 'bn.fit',
'bn_fit_backend': 'bn.fit.backend',
'bn_fit_backend_continuous': 'bn.fit.backend.continuous',
'bn_fit_backend_discrete': 'bn.fit.backend.discrete',
'bn_fit_backend_mixedcg': 'bn.fit.backend.mixedcg',
'bn_fit_barchart': 'bn.fit.barchart',
'bn_fit_dotplot': 'bn.fit.dotplot',
...
...
Then, running the following will solve the issue:
bn_fit = bnlearn.bn_fit
Now you could, for example, fit a Bayesian network:
structure = tabu(datos, score = "loglik-g")
bn_mod = bn_fit(structure, data = datos, method = "mle")
In general, this approach solves the issue of importing any function from an R package into Python through the rpy2 package.
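As a general pattern (a short sketch, not from the original answer): an R function whose name contains dots can be reached either through importr's dot-to-underscore translation or by looking it up under its original dotted name once the package is loaded.

import rpy2.robjects as robjects
from rpy2.robjects.packages import importr

bnlearn = importr('bnlearn')

# importr exposes R's 'bn.fit' as the attribute 'bn_fit'
bn_fit = bnlearn.bn_fit

# equivalently, fetch the function under its original dotted R name
bn_fit_r = robjects.r['bn.fit']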

mleap AttributeError: 'Pipeline' object has no attribute 'serializeToBundle'

I am having problems executing the example code from the mleap repository. I wish to run the code in a script instead of a Jupyter notebook (which is how the example is run). My script is as follows:
##################################################################################
# start a local spark session
# https://spark.apache.org/docs/0.9.0/python-programming-guide.html
##################################################################################
from pyspark import SparkContext, SparkConf
conf = SparkConf()
#set app name
conf.set("spark.app.name", "train classifier")
#Run Spark locally with as many worker threads as logical cores on your machine (cores X threads).
conf.set("spark.master", "local[*]")
#number of cores to use for the driver process (only in cluster mode)
conf.set("spark.driver.cores", "1")
#Limit of total size of serialized results of all partitions for each Spark action (e.g. collect)
conf.set("spark.driver.maxResultSize", "1g")
#Amount of memory to use for the driver process
conf.set("spark.driver.memory", "1g")
#Amount of memory to use per executor process (e.g. 2g, 8g).
conf.set("spark.executor.memory", "2g")
#pass configuration to the spark context object along with code dependencies
sc = SparkContext(conf=conf)
from pyspark.sql.session import SparkSession
spark = SparkSession(sc)
##################################################################################
import mleap.pyspark
# # Imports MLeap serialization functionality for PySpark
from mleap.pyspark.spark_support import SimpleSparkSerializer
# Import standard PySpark Transformers and packages
from pyspark.ml.feature import VectorAssembler, StandardScaler, OneHotEncoder, StringIndexer
from pyspark.ml import Pipeline, PipelineModel
from pyspark.sql import Row
# Create a test data frame
l = [('Alice', 1), ('Bob', 2)]
rdd = sc.parallelize(l)
Person = Row('name', 'age')
person = rdd.map(lambda r: Person(*r))
df2 = spark.createDataFrame(person)
df2.collect()
# Build a very simple pipeline using two transformers
string_indexer = StringIndexer(inputCol='name', outputCol='name_string_index')
feature_assembler = VectorAssembler(
inputCols=[string_indexer.getOutputCol()], outputCol="features")
feature_pipeline = [string_indexer, feature_assembler]
featurePipeline = Pipeline(stages=feature_pipeline)
featurePipeline.fit(df2)
featurePipeline.serializeToBundle("jar:file:/tmp/pyspark.example.zip")
On executing spark-submit script.py I get the following error:
17/09/18 13:26:43 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Traceback (most recent call last):
File "/Users/opringle/Documents/Repos/finn/Magellan/src/no_spark_predict.py", line 58, in <module>
featurePipeline.serializeToBundle("jar:file:/tmp/pyspark.example.zip")
AttributeError: 'Pipeline' object has no attribute 'serializeToBundle'
Any help would be much appreciated! I have installed mleap from PyPI.
See here.
It seems MLeap isn't ready for Spark 2.3 yet. If you happen to be running Spark 2.3, try downgrading to 2.2 and retry. Hopefully that helps!
I have solved the issue by attaching the following jar file when running:
spark-submit --packages ml.combust.mleap:mleap-spark_2.11:0.8.1 script.py
It seems you didn't follow the steps correctly. Here, http://mleap-docs.combust.ml/getting-started/py-spark.html, it says:
Note: the import of mleap.pyspark needs to happen before any other PySpark libraries are imported.
Hence, try importing your SparkContext after mleap.
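Putting those answers together, here is a hedged sketch of the corrected tail of the script: mleap.pyspark is imported before any other PySpark module, the pipeline is fitted first, and serializeToBundle is called on the fitted model (the two-argument serializeToBundle(path, dataset) form follows the MLeap PySpark documentation; adjust it if your MLeap version differs):

# mleap.pyspark must be imported before any other PySpark module
import mleap.pyspark
from mleap.pyspark.spark_support import SimpleSparkSerializer

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler

string_indexer = StringIndexer(inputCol='name', outputCol='name_string_index')
feature_assembler = VectorAssembler(
    inputCols=[string_indexer.getOutputCol()], outputCol='features')

featurePipeline = Pipeline(stages=[string_indexer, feature_assembler])

# fit() returns a PipelineModel; serializeToBundle lives on the fitted
# model, not on the unfitted Pipeline
fittedPipeline = featurePipeline.fit(df2)
fittedPipeline.serializeToBundle("jar:file:/tmp/pyspark.example.zip",
                                 fittedPipeline.transform(df2))

This still assumes the MLeap Spark JARs are on the classpath, for example via the --packages flag shown above.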

Access to Spark from Flask app

I wrote a simple Flask app to pass some data to Spark. The script works in IPython Notebook, but not when I try to run it on its own server. I don't think that the Spark context is running within the script. How do I get Spark working in the following example?
from flask import Flask, request
from pyspark import SparkConf, SparkContext

app = Flask(__name__)

conf = SparkConf()
conf.setMaster("local")
conf.setAppName("SparkContext1")
conf.set("spark.executor.memory", "1g")
sc = SparkContext(conf=conf)

@app.route('/accessFunction', methods=['POST'])
def toyFunction():
    posted_data = sc.parallelize([request.get_data()])
    return str(posted_data.collect()[0])

if __name__ == '__main__':
    app.run(port=8080)
In IPython Notebook I don't define the SparkContext because it is automatically configured. I don't remember how I did this; I followed some blogs.
On the Linux server I have set the .py to always be running and installed the latest Spark by following up to step 5 of this guide.
Edit:
Following the advice from davidism, I have now resorted to simple programs of increasing complexity to localise the error.
First I created a .py file with just the script from the answer below (after appropriately adjusting the links):
import sys
try:
    sys.path.append("your/spark/home/python")
    from pyspark import context
    print("Successfully imported Spark Modules")
except ImportError as e:
    print("Can not import Spark Modules", e)
This returns "Successfully imported Spark Modules". However, the next .py file I made returns an exception:
from pyspark import SparkContext
sc = SparkContext('local')
rdd = sc.parallelize([0])
print(rdd.count())
This returns the exception:
"Java gateway process exited before sending the driver its port number"
Searching around for similar problems I found this page, but when I run that code nothing happens: no print on the console and no error messages. Similarly, this did not help either; I get the same Java gateway exception as above. I have also installed Anaconda, as I heard this may help unite Python and Java, again with no success...
Any suggestions about what to try next? I am at a loss.
Okay, so I'm going to answer my own question in the hope that someone out there won't suffer the same days of frustration! It turns out it was a combination of missing code and bad setup.
Editing the code:
I did indeed need to initialise a SparkContext by adding the following to the preamble of my code:
from pyspark import SparkContext
sc = SparkContext('local')
So the full code will be:
from pyspark import SparkContext
sc = SparkContext('local')
from flask import Flask, request
app = Flask(__name__)
@app.route('/whateverYouWant', methods=['POST'])  # can set first param to '/'
def toyFunction():
    posted_data = sc.parallelize([request.get_data()])
    return str(posted_data.collect()[0])

if __name__ == '__main__':
    app.run(port=8080)  # note: set to 8080!
Editing the setup:
It is essential that the file (yourfilename.py) is in the correct directory, namely it must be saved to the folder /home/ubuntu/spark-1.5.0-bin-hadoop2.6.
Then issue the following command within the directory:
./bin/spark-submit yourfilename.py
which initiates the service at 10.0.0.XX:8080/accessFunction/.
Note that the port must be set to 8080 or 8081: Spark only allows the web UI on these ports by default, for the master and worker respectively.
You can test out the service with a RESTful client, or by opening up a new terminal and sending POST requests with cURL commands:
curl --data "DATA YOU WANT TO POST" http://10.0.0.XX:8080/accessFunction/
I was able to fix this problem by adding the location of PySpark and py4j to the path in my flaskapp.wsgi file. Here's the full content:
import sys
sys.path.insert(0, '/var/www/html/flaskapp')
sys.path.insert(1, '/usr/local/spark-2.0.2-bin-hadoop2.7/python')
sys.path.insert(2, '/usr/local/spark-2.0.2-bin-hadoop2.7/python/lib/py4j-0.10.3-src.zip')
from flaskapp import app as application
Modify your .py file as shown in the linked guide 'Using IPython Notebook with Spark', second point. Instead of sys.path.insert, use sys.path.append. Try inserting this snippet:
import sys
try:
    sys.path.append("your/spark/home/python")
    from pyspark import context
    print("Successfully imported Spark Modules")
except ImportError as e:
    print("Can not import Spark Modules", e)
