I'm trying to do a basic task with Spark: transforming Python dicts. The sample code is shown below.
from pyspark import SparkContext, SparkConf
import sys
sys.path.insert(0, 'path_to_myModule')
from spark_test import make_generic #the function that transforms the Dict
data = [{'a':1, 'b':2}, {'a':4, 'b':5}]
conf = (SparkConf()
        .setMaster("local")
        .setAppName("My app")
        .set("spark.executor.memory", "1g"))
sc = SparkContext(conf=conf)

def testing(data):
    data['a'] = data['a'] + 1
    data['b'] = data['b'] + 2
    return data
rdd1 = sc.parallelize(data)
rdd2 = rdd1.map(lambda x: testing(x))
print(rdd2.collect())
rdd3 = rdd1.map(lambda x: make_generic(x)) #does similar task as testing()
print(rdd3.collect())
The path to the module is inserted into sys.path, yet I still get the error below.
Traceback (most recent call last):
File "/home/roshan/sample_spark.py", line 5, in <module>
from spark_test import make_generic
ImportError: No module named 'spark_test'
Also, the make_generic() function needs a few packages that are installed in a virtualenv.
All in all, I need help with two things:
1. Getting Spark to import the module successfully.
2. Being able to use the virtualenv when running the spark-submit job.
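For reference, a minimal sketch of how both points are often handled (the module path, venv location, and file names below are placeholders based on the snippet above, not verified values): ship the dependency to the driver and executors with addPyFile (or spark-submit --py-files), and point PySpark at the virtualenv's interpreter via PYSPARK_PYTHON or spark.pyspark.python.
# Sketch only: 'path_to_myModule' and the venv path are placeholders.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("local")
        .setAppName("My app")
        .set("spark.executor.memory", "1g")
        # use the virtualenv's interpreter for the Python workers (Spark 2.1+)
        .set("spark.pyspark.python", "/path/to/venv/bin/python"))
sc = SparkContext(conf=conf)

# make spark_test.py importable on the driver and every executor
sc.addPyFile("path_to_myModule/spark_test.py")
from spark_test import make_generic  # import only after addPyFile

# Roughly equivalent spark-submit invocation (shown as a comment):
#   PYSPARK_PYTHON=/path/to/venv/bin/python spark-submit \
#       --py-files path_to_myModule/spark_test.py sample_spark.py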
Related
I am trying to use pysparkling.ml.H2OMOJOModel to predict on a Spark DataFrame, using a MOJO model trained with h2o==3.32.0.2, in AWS Glue Jobs; however, I got the error: TypeError: 'JavaPackage' object is not callable.
I opened a ticket with AWS support and they confirmed that the Glue environment is OK and that the problem is probably with sparkling-water (pysparkling). It seems that some dependency library is missing, but I have no idea which one.
The simple code below works perfectly if I run it on my local computer (I only need to change the MOJO path for GBM_grid__1_AutoML_20220323_233606_model_53.zip).
Could anyone ever run sparkling-water in Glue jobs successfully?
Job Details:
- Glue version 2.0
- --additional-python-modules, h2o-pysparkling-2.4==3.36.0.2-1
- Worker type: G.1X
- Number of workers: 2
- Using script "createFromMojo.py"
createFromMojo.py:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
import pandas as pd
from pysparkling.ml import H2OMOJOSettings
from pysparkling.ml import H2OMOJOModel
# from pysparkling.ml import *
## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
#Job setup
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)
caminho_modelo_mojo='s3://prod-lakehouse-stream/modeling/approaches/GBM_grid__1_AutoML_20220323_233606_model_53.zip'
print(caminho_modelo_mojo)
print(dir())
settings = H2OMOJOSettings(convertUnknownCategoricalLevelsToNa = True, convertInvalidNumbersToNa = True)
model = H2OMOJOModel.createFromMojo(caminho_modelo_mojo, settings)
data = {'days_since_last_application': [3, 2, 1, 0], 'job_area': ['a', 'b', 'c', 'd']}
base_escorada = model.transform(spark.createDataFrame(pd.DataFrame.from_dict(data)))
print(base_escorada.printSchema())
print(base_escorada.show())
job.commit()
I was able to run it successfully by following these steps:
- Downloaded the Sparkling Water distribution zip: http://h2o-release.s3.amazonaws.com/sparkling-water/spark-3.1/3.36.1.1-1-3.1/index.html
- Dependent JARs path: s3://bucket_name/sparkling-water-assembly-scoring_2.12-3.36.1.1-1-3.1-all.jar
- --additional-python-modules, h2o-pysparkling-3.1==3.36.1.1-1-3.1
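In Glue terms, the two items above map onto the job's special parameters. A sketch of what I understand the job arguments to be (the bucket path is the placeholder from the answer, not a verified value): the scoring assembly JAR goes into --extra-jars (labelled "Dependent JARs path" in the console) and the matching pysparkling build into --additional-python-modules.
# Hypothetical DefaultArguments for the Glue job; the S3 path is a placeholder.
default_arguments = {
    # Sparkling Water scoring assembly uploaded to S3 ("Dependent JARs path")
    "--extra-jars": "s3://bucket_name/sparkling-water-assembly-scoring_2.12-3.36.1.1-1-3.1-all.jar",
    # pysparkling distribution matching the Spark version of the Glue job
    "--additional-python-modules": "h2o-pysparkling-3.1==3.36.1.1-1-3.1",
}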
I have a Python script that depends on another file, which is also essential for other scripts, so I zipped it and shipped it to run as a spark-submit job. Unfortunately, it does not seem to work. Here is my code snippet and the error I keep getting:
from pyspark import SparkConf, SparkContext
from pyspark.sql.session import SparkSession
def main(spark):
    employee = spark.read.json("/storage/hadoop/hadoop-3.0.0/bin/employees.json")
    # employee = spark.read.json("/storage/hadoop/hadoop-3.0.0/bin/employee.json")
    employee.printSchema()
    employee.show()
    people = spark.read.json("/storage/hadoop/hadoop-3.0.0/bin/people.json")
    people.printSchema()
    people.show()
    employee.createOrReplaceTempView("employee")
    people.createOrReplaceTempView("people")
    newDataFrame = employee.join(people, (employee.name == people.name), how="inner")
    newDataFrame.distinct().show()
    return "Hello I'm Done Processing the Operation"
This is the external dependency, which is called by other modules as well. Here is the other script that tries to execute it:
from pyspark import SparkConf, SparkContext
from pyspark.sql.session import SparkSession
def sampleTest(output):
    print output

if __name__ == "__main__":
    # Application Name for the Spark RDD using Python
    # APP_NAME = "Spark Application"
    spark = SparkSession \
        .builder \
        .appName("Spark Application") \
        .config("spark.master", "spark://192.168.2.3:7077") \
        .getOrCreate()
    # .config() \

    import SparkFileMerge
    abc = SparkFileMerge.main(spark)
    sampleTest(abc)
Now, when I execute the command
./spark-submit --py-files /home/varun/SparkPythonJob.zip /home/varun/main.py
it is giving me the following error.
Traceback (most recent call last):
File "/home/varun/main.py", line 18, in <module>
from SparkFileMerge import SparkFileMerge
ImportError: No module named SparkFileMerge
Any help will be highly appreciated.
What does SparkPythonJob.zip contain?
First, you should check that the first code snippet is actually in a file called SparkFileMerge.py.
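As a quick way to check (a sketch, using the path from the command above): for import SparkFileMerge to resolve, SparkFileMerge.py has to sit at the top level of the zip, not inside a sub-directory.
# Inspect the archive with the standard library; expect 'SparkFileMerge.py'
# at the root of the listing rather than 'somefolder/SparkFileMerge.py'.
import zipfile

with zipfile.ZipFile("/home/varun/SparkPythonJob.zip") as zf:
    print(zf.namelist())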
I am trying to make some code a bit more modular in Python and am running into one issue which I'm sure is straightforward, but I can't seem to see what the problem is.
Suppose I have a script, say MyScript.py:
import pandas as pd
import myFunction as mF
data_frame = mF.data_imp()
print(data_frame)
where myFunction.py contains the following:
def data_imp():
    return pd.read_table('myFile.txt', header=None, names=['column'])
Running MyScript.py in the command line yields the following error:
Traceback (most recent call last):
File "MyScript.py", line 5, in <module>
data_frame = mF.data_imp()
File "/Users/tomack/Documents/python/StackQpd/myFunction.py", line 2, in data_imp
return pd.read_table('myFile.txt', header = None, names = ['column'])
NameError: name 'pd' is not defined
You need to import pandas inside your function or at the top of the myFunction script:
def data_imp():
    import pandas as pd
    return pd.read_table('myFile.txt', header=None, names=['column'])
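Alternatively, a sketch of the same fix at module level: put the import at the top of myFunction.py so every function in that file can use pandas.
# myFunction.py
import pandas as pd

def data_imp():
    return pd.read_table('myFile.txt', header=None, names=['column'])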
The answers here are right: your myFunction module indeed lacks the pandas import.
Stated more broadly, this question also covers the following case: with a circular import, the only two remedies are:
- import pandas as pd, but not from pandas import something
- Use the import right inside the function that needs it, in place
I am having problems executing the example code from the mleap repository. I wish to run the code in a script instead of a jupyter notebook (which is how the example is run). My script is as follows:
##################################################################################
# start a local spark session
# https://spark.apache.org/docs/0.9.0/python-programming-guide.html
##################################################################################
from pyspark import SparkContext, SparkConf
conf = SparkConf()
#set app name
conf.set("spark.app.name", "train classifier")
#Run Spark locally with as many worker threads as logical cores on your machine (cores X threads).
conf.set("spark.master", "local[*]")
#number of cores to use for the driver process (only in cluster mode)
conf.set("spark.driver.cores", "1")
#Limit of total size of serialized results of all partitions for each Spark action (e.g. collect)
conf.set("spark.driver.maxResultSize", "1g")
#Amount of memory to use for the driver process
conf.set("spark.driver.memory", "1g")
#Amount of memory to use per executor process (e.g. 2g, 8g).
conf.set("spark.executor.memory", "2g")
#pass configuration to the spark context object along with code dependencies
sc = SparkContext(conf=conf)
from pyspark.sql.session import SparkSession
spark = SparkSession(sc)
##################################################################################
import mleap.pyspark
# # Imports MLeap serialization functionality for PySpark
from mleap.pyspark.spark_support import SimpleSparkSerializer
# Import standard PySpark Transformers and packages
from pyspark.ml.feature import VectorAssembler, StandardScaler, OneHotEncoder, StringIndexer
from pyspark.ml import Pipeline, PipelineModel
from pyspark.sql import Row
# Create a test data frame
l = [('Alice', 1), ('Bob', 2)]
rdd = sc.parallelize(l)
Person = Row('name', 'age')
person = rdd.map(lambda r: Person(*r))
df2 = spark.createDataFrame(person)
df2.collect()
# Build a very simple pipeline using two transformers
string_indexer = StringIndexer(inputCol='name', outputCol='name_string_index')
feature_assembler = VectorAssembler(
    inputCols=[string_indexer.getOutputCol()], outputCol="features")
feature_pipeline = [string_indexer, feature_assembler]
featurePipeline = Pipeline(stages=feature_pipeline)
featurePipeline.fit(df2)
featurePipeline.serializeToBundle("jar:file:/tmp/pyspark.example.zip")
On executing spark-submit script.py I get the following error:
17/09/18 13:26:43 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Traceback (most recent call last):
File "/Users/opringle/Documents/Repos/finn/Magellan/src/no_spark_predict.py", line 58, in <module>
featurePipeline.serializeToBundle("jar:file:/tmp/pyspark.example.zip")
AttributeError: 'Pipeline' object has no attribute 'serializeToBundle'
Any help would be much appreciated! I have installed mleap from PyPI.
See Here
It seems MLeap isn't ready for Spark 2.3 yet. If you happen to be running Spark 2.3, try downgrading to 2.2 and retry. Hopefully, that helps!
I have solved the issue by attaching the following jar file when running:
spark-submit --packages ml.combust.mleap:mleap-spark_2.11:0.8.1 script.py
It seems you didn't follow the steps correctly. Here, http://mleap-docs.combust.ml/getting-started/py-spark.html, it says:
Note: the import of mleap.pyspark needs to happen before any other PySpark libraries are imported.
Hence, try importing your SparkContext after mleap.
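Roughly, the reordered start of the script would look like the sketch below (based on the linked docs and the --packages coordinate from the answer above; the exact MLeap version would need to match your Spark/Scala build):
# Import mleap.pyspark before any other PySpark import so it can attach
# serializeToBundle to the Spark ML classes.
import mleap.pyspark
from mleap.pyspark.spark_support import SimpleSparkSerializer

from pyspark import SparkConf, SparkContext
from pyspark.sql.session import SparkSession

conf = SparkConf().setAppName("train classifier").setMaster("local[*]")
sc = SparkContext(conf=conf)
spark = SparkSession(sc)

# Submit with the matching MLeap Spark package, e.g.:
#   spark-submit --packages ml.combust.mleap:mleap-spark_2.11:0.8.1 script.py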
I am trying to run an example from the ThinkStats2 code. I copied everything from GitHub; the file structure is exactly the same as given on GitHub.
chap01soln.py
from __future__ import print_function
import numpy as np
import sys
import nsfg
import thinkstats2
def ReadFemResp(dct_file='2002FemResp.dct',
                dat_file='2002FemResp.dat.gz',
                nrows=None):
    """Reads the NSFG respondent data.

    dct_file: string file name
    dat_file: string file name

    returns: DataFrame
    """
    dct = thinkstats2.ReadStataDct(dct_file)
    df = dct.ReadFixedWidth(dat_file, compression='gzip', nrows=nrows)
    CleanFemResp(df)
    return df

def CleanFemResp(df):
    """Recodes variables from the respondent frame.

    df: DataFrame
    """
    pass

def ValidatePregnum(resp):
    """Validate pregnum in the respondent file.

    resp: respondent DataFrame
    """
    # read the pregnancy frame
    preg = nsfg.ReadFemPreg()

    # make the map from caseid to list of pregnancy indices
    preg_map = nsfg.MakePregMap(preg)

    # iterate through the respondent pregnum series
    for index, pregnum in resp.pregnum.iteritems():
        caseid = resp.caseid[index]
        indices = preg_map[caseid]

        # check that pregnum from the respondent file equals
        # the number of records in the pregnancy file
        if len(indices) != pregnum:
            print(caseid, len(indices), pregnum)
            return False

    return True

def main(script):
    """Tests the functions in this module.

    script: string script name
    """
    resp = ReadFemResp()
    assert(len(resp) == 7643)
    assert(resp.pregnum.value_counts()[1] == 1267)
    assert(ValidatePregnum(resp))
    print('%s: All tests passed.' % script)

if __name__ == '__main__':
    main(*sys.argv)
I am getting an ImportError, as shown below:
Traceback (most recent call last):
File "C:\wamp\www\ThinkStats_py3\chap01soln.py", line 13, in <module>
import nsfg
File "C:\wamp\www\ThinkStats_py3\nsfg.py", line 14, in <module>
import thinkstats2
File "C:\wamp\www\ThinkStats_py3\thinkstats2.py", line 34, in <module>
import thinkplot
File "C:\wamp\www\ThinkStats_py3\thinkplot.py", line 11, in <module>
import matplotlib
File "C:\Python34\lib\site-packages\matplotlib\__init__.py", line 105, in <module>
import six
ImportError: No module named 'six'
All the files are listed under the src folder; there are no packages under src. I tried adding packages for nsfg and thinkstats, but another error appeared. I tried upgrading from Python 2.7 to Python 3.4, and I still get the same error. I know I need to install the six package for matplotlib, but why am I getting an import error on nsfg?
You are getting the import error on nsfg because it imports matplotlib indirectly: nsfg imports thinkstats2, which imports thinkplot, which imports matplotlib, which in turn depends on the six module. That library is not installed on your computer, so the import fails.
Most probably you do not have the six module; you can try installing it with pip install six.
Or get it from here, unzip it, and install it with python setup.py install.
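A quick sanity check after installing (a sketch; run it with the same Python 3.4 interpreter that runs chap01soln.py):
# If this prints a version string, matplotlib's dependency on six is satisfied
# and the nsfg -> thinkstats2 -> thinkplot -> matplotlib import chain can proceed.
import six
print(six.__version__)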