I have just installed pyspark 2.4.5 on my Ubuntu 18.04 laptop, and when I run the following code,
#this is a part of the code.
import pubmed_parser as pp
from pyspark.sql import SparkSession
from pyspark.sql import Row
medline_files_rdd = spark.sparkContext.parallelize(glob('/mnt/hgfs/ShareDir/data/*.gz'), numSlices=1000)
parse_results_rdd = medline_files_rdd.\
    flatMap(lambda x: [Row(file_name=os.path.basename(x), **publication_dict)
                       for publication_dict in pp.parse_medline_xml(x)])
medline_df = parse_results_rdd.toDF()
# save to parquet
medline_df.write.parquet('raw_medline.parquet', mode='overwrite')
medline_df = spark.read.parquet('raw_medline.parquet')
I get the following error:
medline_files_rdd = spark.sparkContext.parallelize(glob('/mnt/hgfs/ShareDir/data/*.gz'), numSlices=1000)
NameError: name 'spark' is not defined
I have seen similar questions on StackOverflow, but none of them solved my problem. Can anyone help me? Thanks a lot.
By the way, I am new to Spark. If I just want to use Spark from Python, is it enough to install pyspark with
pip install pyspark? Is there anything else I should do? Should I install Hadoop or anything else?
Just create a Spark session at the start:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('abc').getOrCreate()
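On the installation part of the question: as far as I know, a pip-installed pyspark bundles Spark itself and can run in local mode without a separate Hadoop installation, but it does need a Java runtime. A minimal sketch for checking both prerequisites before creating the session (check_pyspark_prereqs is a made-up helper name, not a PySpark function):

```python
import importlib.util
import shutil


def check_pyspark_prereqs():
    """Report whether the pyspark package and a java executable can be found.

    Hypothetical helper for illustration; it only looks things up and
    does not start Spark.
    """
    return {
        # True if `import pyspark` would find an installed package
        "pyspark": importlib.util.find_spec("pyspark") is not None,
        # True if a `java` executable is on PATH (Spark also honors JAVA_HOME,
        # which this simple check does not inspect)
        "java": shutil.which("java") is not None,
    }


print(check_pyspark_prereqs())
```

If either value is False, that is probably the first thing to fix before debugging the session itself.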
This is my first attempt to use xgboost in pyspark, and I am still learning Java and PySpark.
I saw an awesome article on Towards Data Science titled PySpark ML and XGBoost full integration tested on the Kaggle Titanic dataset, where the author goes through a use case of xgboost in pyspark.
I tried to follow the steps but was hit with an ImportError.
I have downloaded two jar files from Maven and put them in the same directory as my notebook.
xgboost4j version 0.72
xgboost4j-spark version 0.72
I have also downloaded the xgboost wrapper file sparkxgb.zip to the path ~/Softwares/sparkxgb.zip.
The first cell of my Jupyter notebook:
import xgboost
print(xgboost.__version__) # 1.2.0
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars xgboost4j-spark-0.72.jar,xgboost4j-0.72.jar pyspark-shell'
HOME = os.path.expanduser('~')
import findspark
findspark.init(HOME + "/Softwares/spark-3.0.0-bin-hadoop2.7")
import pyspark
from pyspark.sql.session import SparkSession
from pyspark.sql.types import *
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml import Pipeline
from pyspark.sql.functions import col
spark = SparkSession\
    .builder\
    .appName("PySpark XGBOOST Titanic")\
    .getOrCreate()
spark.sparkContext.addPyFile(HOME + "/Softwares/sparkxgb.zip")
print(pyspark.__version__) # 3.0.0
# this does not give any error
# Computer: MacOS
This cell gives an error:
from sparkxgb import XGBoostEstimator
ImportError Traceback (most recent call last)
<ipython-input-7-cf2ff39c26f4> in <module>
----> 1 from sparkxgb import XGBoostEstimator
/private/var/folders/tb/7xdk9scs79j9hxzcl3l_s6k00000gn/T/spark-1cf282a4-f3f2-42b3-a064-6bbd8751489e/userFiles-abca5e59-5af3-4b3d-a3bc-edc2973e9995/sparkxgb.zip/sparkxgb/__init__.py in <module>
19 from sparkxgb.pipeline import XGBoostPipeline, XGBoostPipelineModel
---> 20 from sparkxgb.xgboost import XGBoostEstimator, XGBoostClassificationModel, XGBoostRegressionModel
22 __all__ = ["XGBoostEstimator", "XGBoostClassificationModel", "XGBoostRegressionModel",
/private/var/folders/tb/7xdk9scs79j9hxzcl3l_s6k00000gn/T/spark-1cf282a4-f3f2-42b3-a064-6bbd8751489e/userFiles-abca5e59-5af3-4b3d-a3bc-edc2973e9995/sparkxgb.zip/sparkxgb/xgboost.py in <module>
19 from pyspark.ml.param import Param
20 from pyspark.ml.param.shared import HasFeaturesCol, HasLabelCol, HasPredictionCol, HasWeightCol, HasCheckpointInterval
---> 21 from pyspark.ml.util import JavaMLWritable, JavaPredictionModel
22 from pyspark.ml.wrapper import JavaEstimator, JavaModel
23 from sparkxgb.util import XGBoostReadable
ImportError: cannot import name 'JavaPredictionModel' from 'pyspark.ml.util' (/Users/poudel/Softwares/spark-3.0.0-bin-hadoop2.7/python/pyspark/ml/util.py)
How can I fix the error and run xgboost in pyspark?
Maybe I have not placed the downloaded jar files in the correct path (they are in my working directory, next to the Jupyter notebook file). Do I need to place these files somewhere else? I assume Jupyter automatically picks up the current path . and sees these jar files, but I may be wrong.
If any good samaritan has already run xgboost in pyspark, their help would be much appreciated.
This question was asked almost 6 months ago, but no solution has been provided.
I faced the same issue for the past few days and finally found a solution, so I would like to share it.
You may have solved it by now, but sharing it may help you or anyone who hits this in the future.
You can get rid of this error in two ways.
JavaPredictionModel has been removed from pyspark.ml.util in the latest version of PySpark, so the first option is to downgrade PySpark to, say, version 2.4.0, which resolves the error.
However, you are then limited to the old PySpark API; for example, OneHotEncoder cannot be applied to multiple features at once, so you have to encode them one by one.
!pip install pyspark==2.4.0
The second and better solution is to modify the sparkxgb code to import JavaPredictionModel from pyspark.ml.wrapper, so you do not need to downgrade PySpark.
from pyspark.ml.wrapper import JavaPredictionModel
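If the wrapper has to work on both old and new PySpark versions, the second fix can be generalized into a fallback import. A minimal sketch: import_first is a hypothetical helper, not part of sparkxgb or PySpark, and it is demonstrated with standard-library modules so it runs without Spark.

```python
import importlib


def import_first(name, module_paths):
    """Return attribute `name` from the first module in `module_paths` that provides it."""
    for path in module_paths:
        try:
            module = importlib.import_module(path)
        except ImportError:
            continue  # the module itself is missing, try the next location
        if hasattr(module, name):
            return getattr(module, name)
    raise ImportError(f"cannot import name {name!r} from any of {module_paths}")


# Demonstration with the standard library: the first path does not exist,
# so the lookup falls through to `collections`.
OrderedDict = import_first("OrderedDict", ["no_such_module_xyz", "collections"])
print(OrderedDict)
```

In sparkxgb/xgboost.py the call would presumably be import_first("JavaPredictionModel", ["pyspark.ml.util", "pyspark.ml.wrapper"]), assuming those are the only two locations the class has lived in.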
There is a problem with your versions. I know the solution to a similar problem with catboost_spark, where I also had a version problem (catboost_spark_version).
You need to go to https://catboost.ai/en/docs/installation/spark-installation-pyspark
Get the appropriate catboost_spark_version (see available versions at Maven central).
Choose the appropriate spark_compat_version (2.3, 2.4 or 3.0) and scala_compat_version (2.11 or 2.12).
Just add the catboost-spark Maven artifact with the appropriate spark_compat_version, scala_compat_version and catboost_spark_version to the spark.jars.packages Spark config parameter and import the catboost_spark package.
So go to https://search.maven.org/search?q=catboost-spark and
choose a version (for example catboost-spark_3.3_2.12).
Then copy the "Artifact ID". In this case it is "catboost-spark_3.3_2.12:1.1.1".
Then paste it into your config parameter.
You will get something like this:
from pyspark.sql import SparkSession
sparkSession = (SparkSession.builder
    .config("spark.jars.packages", "ai.catboost:catboost-spark_3.3_2.12:1.1.1")
    .getOrCreate()
)
import catboost_spark
and it will work :)
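The three version pieces combine into one Maven coordinate of the form group:artifact:version. A small sketch of how the string is assembled; the function name is made up for illustration, and the ai.catboost group ID is taken from the config line in this answer.

```python
def catboost_spark_coordinate(spark_compat_version, scala_compat_version, catboost_spark_version):
    """Build the group:artifact:version string passed to spark.jars.packages."""
    artifact = f"catboost-spark_{spark_compat_version}_{scala_compat_version}"
    return f"ai.catboost:{artifact}:{catboost_spark_version}"


print(catboost_spark_coordinate("3.3", "2.12", "1.1.1"))
# ai.catboost:catboost-spark_3.3_2.12:1.1.1
```

The Spark and Scala versions in the artifact name should match the Spark installation you are actually running, which is the mismatch this answer is about.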
I am trying to run some code on a spark kubernetes cluster
"spark.kubernetes.container.image", "kublr/spark-py:2.4.0-hadoop-2.6"
The code I am trying to run is the following
import numpy as np
from pyspark.sql import functions as F
from pyspark.sql.types import FloatType

def getMax(row, subtract):
    '''
    getMax takes two parameters -
    row: array with parameters
    subtract: normal value of the parameter
    It outputs the value most distant from the normal
    '''
    try:
        row = np.array(row)
        out = row[np.argmax(row - subtract)]
    except ValueError:
        return None
    return out.item()

udf_getMax = F.udf(getMax, FloatType())
The dataframe I am passing is as below
However, I am getting the following error:
ModuleNotFoundError: No module named 'numpy'
When I searched StackOverflow, I found a similar numpy import error for Spark on YARN:
ImportError: No module named numpy on spark workers
The strange part is that running
import numpy as np
outside the function does not raise any error.
Why is this happening? How can I fix it, or how should I proceed? Any help is appreciated.
Below is the code I have written to compare two dataframes and impose intersection function on them.
import os
from pyspark import SparkContext
sc = SparkContext("local", "Simple App")
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
df = sqlContext.read.format("jdbc").option("url","jdbc:sqlserver://xxx:xxx").option("databaseName","xxx").option("driver","com.microsoft.sqlserver.jdbc.SQLServerDriver").option("dbtable","xxx").option("user","xxxx").option("password","xxxx").load()
df1= sqlContext.sql("select * from test where amitesh<= 300")
df2= sqlContext.sql("select * from test where amitesh <= 400")
df3= df1.intersection(df2)
I get the following error:
AttributeError: 'DataFrame' object has no attribute 'intersection'
If my understanding is correct, intersection() is a built-in function derived from Python's set type. So,
1) if I want to use it in pyspark, do I need to import a special module, or should it work as a built-in in pyspark as well?
2) To use intersection(), do we first need to convert the df to an rdd?
Please correct me wherever I am wrong. Can somebody give me a working example?
My goal is to get the common records from SQL Server and move them to Hive. For now I am first trying to get the intersection working; I can take care of the Hive requirement once intersection() works.
I got it working: instead of intersection() I used intersect(), which is the DataFrame method (intersection() is defined on RDDs).
I am learning Spark now. When I tried to load a json file, as follows:
I got the following error:
AttributeError: 'SQLContext' object has no attribute 'jsonFile'
I am running this on Windows 7 PC, with spark-2.1.0-bin-hadoop2.7, and Python 2.7.13 (Dec 17, 2016).
Thank you for any suggestions that you may have.
You probably forgot to import the implicits. This is what my solution looks like in Scala:
def loadJson(filename: String, sqlContext: SQLContext): Dataset[Row] = {
  import sqlContext._
  import sqlContext.implicits._
  val df = sqlContext.read.json(filename)
  df
}
First, recent versions of Spark (like the one you are using) use .read.json(..) instead of the deprecated .jsonFile(..).
Second, make sure your SQLContext is set up correctly, as mentioned here: pyspark : NameError: name 'spark' is not defined. In my case, it is set up like this:
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)
myObjects = sqlContext.read.json('file:///home/cloudera/Downloads/json_files/firehose-1-2018-08-24-17-27-47-7066324b')
Note that Spark has version-specific quick-start tutorials that can help with getting some of the basic operations right, as mentioned here: name spark is not defined
My point is to always check that the documentation you follow matches the version you are running, whatever library or language you use; breaking changes between versions are a common source of confusion. If the technology is poorly documented for your version, consider upgrading to a more recent version or opening a support ticket with the maintainers so they can better support their users.
You can find a guide on all of the version-specific changes of Spark here: https://spark.apache.org/docs/latest/sql-programming-guide.html#upgrading-from-spark-sql-16-to-20
You can also find version-specific documentation on Spark and PySpark here (e.g. for version 1.6.1): https://spark.apache.org/docs/1.6.1/sql-programming-guide.html
As mentioned before, .jsonFile(...) has been deprecated [1]; use this instead:
people = sqlContext.read.json(r"C:\wdchentxt\CustomerData.json").rdd
[1]: https://docs.databricks.com/spark/latest/data-sources/read-json.html
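A side note on Windows paths written as ordinary Python string literals: a backslash followed by certain letters is read as an escape sequence, so the path can be silently mangled. Prefixing the literal with r (a raw string) avoids that. A small sketch with a made-up path chosen to trigger the problem:

```python
# Hypothetical path for illustration: "\t" and "\n" are escape sequences.
plain = "C:\temp\new.json"
raw = r"C:\temp\new.json"

print(repr(plain))  # shows a tab and a newline where the backslashes were
print(repr(raw))    # backslashes preserved
print(len(plain), len(raw))
```

Forward slashes are another option that Python accepts for Windows file paths.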
I have a Python project that uses PySpark, and I am trying to define a UDF inside the Spark source itself (not in my Python project), specifically in spark\python\pyspark\ml\tuning.py, but I get pickling problems: the UDF cannot be loaded.
The code:
from pyspark.sql.functions import udf, log
test_udf = udf(lambda x : -x[1], returnType=FloatType())
d = data.withColumn("new_col", test_udf(data["x"]))
When I call d.show() I get an exception about an unknown attribute test_udf.
In my Python project I have defined many UDFs and they work fine.
Add the following to your code; it isn't recognizing the data type.
from pyspark.sql.types import *
Let me know if this helps. Thanks.
Found it. There were two problems:
1) For some reason it did not accept returnType=FloatType() as a keyword argument; I had to pass FloatType() positionally, even though the signature names the parameter returnType.
2) The data in column x was a vector, and I had to cast the element to float.
The working code:
from pyspark.sql.functions import udf, log
from pyspark.sql.types import FloatType
test_udf = udf(lambda x : -float(x[1]), FloatType())
d = data.withColumn("new_col", test_udf(data["x"]))
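A likely reason for the cast in point 2, assuming column x holds pyspark.ml vectors: indexing such a vector yields a NumPy scalar rather than a built-in float, and a FloatType UDF is expected to return a plain Python float. The sketch below uses a NumPy array as a stand-in for the vector so it runs without Spark; neg_second is just the lambda written as a named function.

```python
import numpy as np

# Stand-in for one value of column "x" (assumption: a dense vector backed by NumPy).
x = np.array([0.5, 2.0, 3.5])


def neg_second(v):
    """Same logic as the lambda in the working code: negate element 1 as a Python float."""
    return -float(v[1])


print(type(x[1]))           # NumPy scalar type
print(type(neg_second(x)))  # built-in float
```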