Missing id from vertex DataFrame in PySpark when creating a GraphFrame - python

I have written the following code in Python; when I run it, I get an error about the id column missing from the vertex DataFrame.
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession\
    .builder\
    .appName("GraphX")\
    .getOrCreate()

# Load the edge and vertex DataFrames from HDFS
e = spark.read.parquet("hdfs://localhost:9000/gf/edge")
v = spark.read.parquet("hdfs://localhost:9000/gf/vertex")

s = GraphFrame(v, e)
s.edges.show()
s.vertices.show()
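
GraphFrames requires the vertex DataFrame to contain a column named id and the edge DataFrame to contain src and dst columns; if the parquet files use different column names, the constructor fails with the missing-id error. A minimal hedged sketch of renaming the columns before building the graph (vertex_id, from and to are hypothetical names standing in for whatever the files actually contain):

from graphframes import GraphFrame

# Hypothetical original column names; replace them with the ones actually
# present in the parquet files.
v = v.withColumnRenamed("vertex_id", "id")
e = e.withColumnRenamed("from", "src").withColumnRenamed("to", "dst")

g = GraphFrame(v, e)
g.vertices.show()
g.edges.show()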

Related

Azure Blob storage error: can't parse a date in Spark

I am trying to read a file stored in Azure Data Lake Gen2 into a Spark DataFrame using Python.
The code is:
from pyspark import SparkConf
from pyspark.sql import SparkSession

# create spark session
key = "some_key"
appName = "DataExtract"
master = "local[*]"
sparkConf = SparkConf() \
    .setAppName(appName) \
    .setMaster(master) \
    .set("fs.azure.account.key.myaccount.dfs.core.windows.net", key)
spark = SparkSession.builder.config(conf=sparkConf).getOrCreate()

data_csv = "abfs://test-file-system@myaccount.dfs.core.windows.net/data.csv"
data_out = "abfs://test-file-system@myaccount.dfs.core.windows.net/data_out.csv"

# read csv
df = spark.read.csv(data_csv)
# write csv
df.write.csv(data_out)
The file is read and written correctly, but I am getting the following error:
ERROR AzureBlobFileSystemStore: Failed to parse the date Thu, 09 Sep 2021 10:12:34 GMT
The date seems to be the file's creation date.
How can I parse the date to avoid getting the error?
I tried reproducing the same issue and found that the error is caused by these lines:
data_csv="abfs://test-file-system#myaccount.dfs.core.windows.net/data.csv" data_out =
"abfs://test-file-system#myaccount.dfs.core.windows.net/data_out.csv"
# read csv df = self.spark_session.read.csv(data_csv) ```
Here is the code that worked for me after replacing abfs with abfss (the TLS-secured scheme) in the lines above:
from pyspark import SparkConf
from pyspark.sql import SparkSession

# create spark session
key = "<Your Storage Account Key>"
appName = "<Synapse App Name>"
master = "local[*]"
sparkConf = SparkConf() \
    .setAppName(appName) \
    .setMaster(master) \
    .set("fs.azure.account.key.<Storage Account Name>.dfs.core.windows.net", key)
spark = SparkSession.builder.config(conf=sparkConf).getOrCreate()

data_csv = "abfss://<ContainerName>@<Storage Account Name>.dfs.core.windows.net/<Directory>"

# read csv
df1 = spark.read.option('header', 'true')\
    .option('delimiter', ',')\
    .csv(data_csv + '/sample1.csv')
df1.show()

# write csv
df1.write.csv(data_csv + '/<Give the name of blob you want to write to>.csv')
Alternatively, you can try the code below, which also worked for me:
from pyspark.sql import SparkSession
from pyspark.sql.types import *

account_name = "<StorageAccount Name>"
container_name = "<Storage Account Container Name>"
relative_path = "<Directory path>"
adls_path = 'abfss://%s@%s.dfs.core.windows.net/%s' % (container_name, account_name, relative_path)

dataframe1 = spark.read.option('header', 'true')\
    .option('delimiter', ',')\
    .csv(adls_path + '/sample1.csv')
dataframe1.show()

dataframe1.write.csv(adls_path + '/<Give the name of blob you want to write to>.csv')
Reference: Synapse Spark – Reading CSV files from Azure Data Lake Storage Gen 2 with Synapse Spark using Python - SQL Stijn (sql-stijn.com)

Error when executing using ./bin/spark-submit in PySpark

I am getting the error below when executing the code from the command line on CentOS.
"(<class 'py4j.protocol.Py4JJavaError'>, Py4JJavaError(u'An error occurred while calling o313.save.\n', JavaObject id=o315), <traceback object at 0x7fca49970320>)"
I get this issue only when I submit it through ./bin/spark-submit test.py.
If I just use spark-submit test.py, everything works fine, but then I am not able to run the code on YARN.
I have Anaconda installed on my machine, and I think the second method is picking up the Anaconda spark-submit.
Can anyone please suggest what to do? Do I have to set any environment variables or update the libraries?
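
As a hedged diagnostic sketch (not part of the original question), one way to check whether the Anaconda installation is shadowing the cluster's spark-submit is to print which binary and which Spark-related environment variables the session resolves to (Python 2 style, matching the script below):

import os
from distutils.spawn import find_executable

# Show which spark-submit binary is first on the PATH and how the
# Spark/Python environment variables are currently set.
print "spark-submit on PATH: ", find_executable("spark-submit")
print "SPARK_HOME: ", os.environ.get("SPARK_HOME")
print "PYSPARK_PYTHON: ", os.environ.get("PYSPARK_PYTHON")
print "PYSPARK_DRIVER_PYTHON: ", os.environ.get("PYSPARK_DRIVER_PYTHON")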
Edit:
As requested in the comments, here are more details about the script and versions.
This is the code:
from pyspark import SparkConf, SparkContext, SQLContext
from pyspark.sql import SparkSession,DataFrame
from pyspark.ml.feature import StringIndexer, VectorAssembler, OneHotEncoderEstimator, OneHotEncoder
import sys
from operator import add
import os
import pandas as pd
import numpy as np
from pyspark.sql.types import *
import pyspark.sql.functions as func
from IPython.display import display, HTML
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.classification import LogisticRegression,LogisticRegressionModel
path = 'hdfs://10.0.15.42/nih-poc-public-dataset/'
pipeline_path = '/opt/nih/pipeline_model'
try:
    conf = SparkConf().setMaster("local[8]").setAppName("sparkstream").set("spark.driver.allowMultipleContexts", "true")
    sc = SparkContext(conf=conf)
    sqlContext = SQLContext(sc)
    spark = SparkSession.builder\
        .config(conf=conf)\
        .getOrCreate()

    print "SparkContext Version:", sc.version
    print "SparkContext Version:", sc.version
    print "Python version: ", sc.pythonVer
    print "Master URL to connect to: ", sc.master
    print "Path where Spark is installed on worker nodes: ", str(sc.sparkHome)
    print "Retrieve name of the Spark User running SparkContext: ", str(sc.sparkUser())
    print "Application name: ", sc.appName
    print "Application ID: ", sc.applicationId
    print "Default level of parallelism: ", sc.defaultParallelism
    print "Default number of partitions for RDDs: ", sc.defaultMinPartitions

    ############################################# DATA EXTRACTION ##########################################
    file_name = path + 'maintenance_data.csv'
    df_nih = spark.read.csv(file_name, sep=';', header="true", inferSchema="true")
    df_nih.show(1)
    #print(df_nih.columns)
    print(df_nih.count())

    ######################################## PIPELINE FORMATION ############################################
    categoricalcolumns = ['team', 'provider']
    numericalcolumns = ['lifetime', 'pressureInd', 'moistureInd', 'temperatureInd']
    stages = []
    for categoricalcol in categoricalcolumns:
        stringindexer = StringIndexer(inputCol=categoricalcol, outputCol=categoricalcol + '_index')
        encoder = OneHotEncoderEstimator(inputCols=[stringindexer.getOutputCol()], outputCols=[categoricalcol + "_classVec"])
        stages += [stringindexer, encoder]
        #stages += [stringindexer]
    assemblerinputs = [c + "_classVec" for c in categoricalcolumns] + numericalcolumns
    vectorassembler_stage = VectorAssembler(inputCols=assemblerinputs, outputCol='features')
    stages += [vectorassembler_stage]
    #output = assembler.transform(df_nih)
    #print(output.show(1))
    indexer = StringIndexer(inputCol='broken', outputCol='label')
    stages += [indexer]
    #pipeline = Pipeline(stages = [stages,vectorassembler_stage,indexer])
    pipeline = Pipeline(stages=stages)
    model_pipeline = pipeline.fit(df_nih)
    model_pipeline.write().overwrite().save(pipeline_path)
except:
    print(sys.exc_info())
finally:
    print('stopping spark session')
    spark.stop()
    sc.stop()
This is the output:
/usr/hdp/3.0.0.0-1634/spark2> ./bin/spark-submit --master yarn /opt/nih/sample_spark.py
SparkContext Version: 2.3.1.3.0.0.0-1634
SparkContext Version: 2.3.1.3.0.0.0-1634
Python version:  2.7
Master URL to connect to:  yarn
Path where Spark is installed on worker nodes:  None
Retrieve name of the Spark User running SparkContext:  spark
Application name:  sparkstream
Application ID:  application_1540925663097_0023
Default level of parallelism:  2
Default number of partitions for RDDs:  2
+--------+------+----------------+----------------+----------------+-----+---------+
|lifetime|broken|     pressureInd|     moistureInd|  temperatureInd| team| provider|
+--------+------+----------------+----------------+----------------+-----+---------+
|      56|     0|92.1788540640753|104.230204454489|96.5171587259733|TeamA|Provider4|
+--------+------+----------------+----------------+----------------+-----+---------+
only showing top 1 row

999
(<class 'py4j.protocol.Py4JJavaError'>, Py4JJavaError(u'An error occurred while calling o309.save.\n', JavaObject id=o311), <traceback object at 0x7f9061c86290>)
stopping spark session
If I comment out the saving part (model_pipeline.write) everything works fine. Please provide your suggestions.
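
No answer is recorded here. Purely as a hedged guess, since the failure happens only on save while running on YARN, the pipeline path may be resolving against a filesystem the job cannot write to. A minimal sketch (the HDFS location is an assumption, reusing the namenode from path above) of saving to an explicit URI and reloading it:

from pyspark.ml import PipelineModel

# Hypothetical HDFS location; an explicit URI avoids any ambiguity about which
# filesystem '/opt/nih/pipeline_model' resolves to when running on YARN.
hdfs_pipeline_path = 'hdfs://10.0.15.42/nih-poc-public-dataset/pipeline_model'
model_pipeline.write().overwrite().save(hdfs_pipeline_path)
restored_pipeline = PipelineModel.load(hdfs_pipeline_path)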

can't pickle lock objects

I get this error when using this code:
from numpy import array
from pyspark.sql import SparkSession
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import DecisionTree

def createLabeledPoints(fields):
    q1 = fields[1]
    q2 = fields[12]
    q3 = fields[23]
    result = fields[40]
    return LabeledPoint(result, array([q1, q2, q3]))

spark = SparkSession \
    .builder \
    .appName("myApp") \
    .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/newumc.classification_data") \
    .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/newumc.classification_data") \
    .getOrCreate()

df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
dt = df.rdd.map(createLabeledPoints)

model111 = DecisionTree.trainClassifier(dt, numClasses=467,
                                        categoricalFeaturesInfo={0: 2, 1: 2, 2: 2},
                                        impurity='gini', maxDepth=30, maxBins=32)
But when I want to save my model "model111" to use it with Flask:
import cPickle as pickle
pickle.dump(model111, open("rfc1.pkl","wb"))
it gives this error:
TypeError: can't pickle lock objects
I am new to Python. Is there any way to unlock my model so I can use pickle, or can someone please suggest any solution?
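
No answer is included here. As a hedged suggestion, pyspark.mllib models wrap JVM objects (which is where the un-picklable lock objects come from), and they provide their own save/load persistence that can be used instead of pickle. A minimal sketch, with rfc1_model as an assumed path:

from pyspark.mllib.tree import DecisionTreeModel

# Save the trained model with MLlib's own persistence instead of pickle.
model111.save(spark.sparkContext, "rfc1_model")

# Later (e.g. in the Flask app), rebuild a SparkSession and reload the model.
loaded_model = DecisionTreeModel.load(spark.sparkContext, "rfc1_model")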

create LabeledPoint from MongoDB using Python

I want to create a LabeledPoint from MongoDB using Python.
I already tried to do that with a CSV file instead of MongoDB.
Here is the code of the function that returns the LabeledPoint:
def createLabeledPoints(fields):
    q1 = int(fields[0])
    q2 = int(fields[1])
    result = int(fields[38])
    return LabeledPoint(result, array([q1, q2]))
This code works for me with a CSV file,
and I get my collection from MongoDB as a pandas DataFrame using the code below:
from pandas import DataFrame
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db1 = client.newumc
collection1 = db1.data_classification
rawData1 = DataFrame(list(collection1.find({})))
And I get each field using the code below:
field_for_test = collection1.find({}, {'field_from_mongodb': 1, '_id': 0})
I solved the problem by using:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("myApp") \
    .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/newumc.classification_data") \
    .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/newumc.classification_data") \
    .getOrCreate()

# Load the MongoDB collection through the Spark connector
df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
field1 = df[1]
field2 = df[2]
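
As a hedged follow-up sketch (not part of the original answer), the MongoDB-backed DataFrame can then be mapped to LabeledPoints in the same way as the CSV version, assuming the relevant columns sit at positions 0, 1 and 38 and hold numeric values:

from numpy import array
from pyspark.mllib.regression import LabeledPoint

# Positions 0, 1 and 38 mirror the CSV-based createLabeledPoints above and are
# assumptions about the MongoDB collection's column order.
labeled = df.rdd.map(
    lambda fields: LabeledPoint(int(fields[38]), array([int(fields[0]), int(fields[1])])))
print(labeled.take(1))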

Using Hive SQLContext in spark executors

I am getting a complex JSON message through Kafka. I need to extract the required fields from the JSON and store them in Hive tables. I know I cannot use the Spark driver's sqlContext in the executors. I want to know how to use the sqlContext in the code run by the executors. Here is the code:
kvs = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming", topic)
msgs = kvs.map(lambda msg: msg[1])
msgs.foreachRDD(lambda rdd: rdd.foreach(lambda m: timeline_events(m)))

def timeline_events(m):
    msg = json.loads(m)
    for msgJson in msg:
        event_id = msgJson['events'][0]['event_id']
        event_type = msgJson['events'][0]['type']
        incidence_source = msgJson['incident']['source']
        csr_description = msgJson['incident']['data']['csr_description']
        sc_display_priority = msgJson['incident']['data']['display_priority']
        launch_tool_rec_label = msgJson['incident']['data']['LaunchTool'][0]['Label']
        launch_tool_rec_uri = msgJson['incident']['data']['LaunchTool'][0]['URI']
        launch_itg_rec_label = msgJson['incident']['data']['LaunchItg'][0]['Label']
        launch_itg_rec_uri = msgJson['incident']['data']['LaunchItg'][0]['URI']
        sqlContext.sql("Insert into nexus.timeline_events values({},{},{},{},{},{},{},{},{},{},{})".format(
            event_id, event_type, csr_description, incidence_source, sc_display_priority,
            launch_tool_rec_label, launch_tool_rec_uri, launch_tool_rec_id,
            launch_itg_rec_label, launch_itg_rec_uri, launch_itg_rec_id))
