I have a very large XML file (about 100 MB) and I am using pyspark for the following reasons:
to reduce the processing run time and to convert the data into a dataframe.
Any idea how to read the XML file? (The code below needs modification.) One more thing: I took
this code from Google and don't precisely understand it. How can I define get_values?
spark = SparkSession.builder.master("local[2]").appName('finale').getOrCreate()
xml = os.path.join(self.violation_path, xml)
file_rdd = spark.read.text(xml, wholetext=False)
pyspark.sql.udf.UDFRegistration.register(name="get_values", f = get_values,
returnType=StringType())
myRDD = spark.parallelize(file_rdd.take(4), 4)
parsed_records = spark.runJob(myRDD, lambda part: [get_values(x) for x in part])
print (parsed_records)
Another method:
root = ET.parse(xml).getroot()
Is it the right approach to use pyspark as a local cluster? Will it be faster?
Or can it only be run in a cloud container and not on a local machine?
I have a url from where I download the data (which is in JSON format) using Databricks:
url="https://tortuga-prod-eu.s3-eu-west-1.amazonaws.com/%2FNinetyDays/amzf277698d77514b44"
import urllib.request, gzip, json
testfile = urllib.request.URLopener()
testfile.retrieve(url, "file.gz")
with gzip.GzipFile("file.gz", 'r') as fin:
    json_bytes = fin.read()
json_str = json_bytes.decode('utf-8')
data = json.loads(json_str)
Now I want to save this data in an Azure container as a blob .json file.
I have tried saving the data in a dataframe and writing the df to the mounted location, but the data is huge (several GB) and I get a spark.rpc.message.maxSize (268435456 bytes) error.
I have tried saving the data in a broadcast variable (it saves successfully), but I am not sure how to write the data from the broadcast variable to the mounted location.
Here is how I save the data in a broadcast variable:
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
broadcastStates = spark.sparkContext.broadcast(data)
print(broadcastStates.value)
My question is:
Is there any way I can write data from a broadcast variable to the Azure mounted location?
If not, then please guide me on the right/best way to get this job done.
It is not possible to write the broadcast variable directly into mounted Azure blob storage. However, there is a way you can write the value of a broadcast variable to a file.
pyspark.broadcast.Broadcast provides two methods, dump() and load_from_path(), with which you can write and read the value of a broadcast variable. Since you have created a broadcast variable using:
broadcastStates = spark.sparkContext.broadcast(data)
Use the following syntax to write the value of the broadcast variable to a file:
<broadcast_variable>.dump(<broadcast_variable>.value, f)
Note: here f must be an open, writable file object (dump() pickles the value into it), not a filename string.
To read this data back from the file, you can use load_from_path() with the path of that file, as shown below:
<broadcast_variable>.load_from_path(path)
Note: load_from_path() opens the file itself; if you already have an open file object, use load() instead.
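For example, a minimal sketch of both steps, assuming the container is mounted at a hypothetical path /dbfs/mnt/mycontainer/ (adjust to your mount point). Note that dump() produces a pickle file, so if you really need a plain .json blob you could also serialize the value with the json module directly:
import json
pickle_path = "/dbfs/mnt/mycontainer/data_broadcast.pkl"  # hypothetical mounted path
with open(pickle_path, "wb") as f:                        # binary mode, since dump() pickles
    broadcastStates.dump(broadcastStates.value, f)
restored = broadcastStates.load_from_path(pickle_path)    # read the value back later
with open("/dbfs/mnt/mycontainer/data.json", "w") as f:   # optional: plain .json output
    json.dump(broadcastStates.value, f)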
There might also be a way to avoid the “spark.rpc.message.maxSize (268435456 bytes)” error. The default value of spark.rpc.message.maxSize is 128 MB. Refer to the following document to learn more about this setting.
https://spark.apache.org/docs/latest/configuration.html#networking
While creating a cluster in Databricks, you can configure and increase this value to avoid the error. The steps to configure the cluster are:
⦁ While creating the cluster, choose Advanced Options (present at the bottom).
⦁ Under the Spark tab, add the configuration key spark.rpc.message.maxSize with a larger value.
⦁ Click Create Cluster.
This might help in writing the dataframe directly to mounted blob storage without using broadcast variables.
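Outside Databricks (for a self-managed Spark application), the same setting can be supplied when the session is built. A rough sketch with an assumed value of 512 MB, which you should tune to your payload:
from pyspark.sql import SparkSession
spark = (SparkSession.builder
         .appName("maxSizeExample")                   # hypothetical app name
         .config("spark.rpc.message.maxSize", "512")  # value is in MB
         .getOrCreate())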
You can also try increasing the number of partitions so the dataframe is saved as multiple smaller files, which avoids the maxSize error. Refer to the following document about configuring Spark and partitioning.
https://kb.databricks.com/execution/spark-serialized-task-is-too-large.html
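For instance, a hedged sketch of that approach, assuming df is the dataframe you already built from the downloaded JSON and /mnt/mycontainer/output is a hypothetical mounted path (the partition count 200 is an arbitrary starting point):
(df.repartition(200)              # more partitions -> smaller tasks and output files
   .write
   .mode("overwrite")
   .json("/mnt/mycontainer/output"))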
I have a Python script that currently runs on my desktop. It takes a csv file with roughly 25 million lines (maybe 15 or so columns) and performs line-by-line operations.
For each line of input, multiple output lines are produced. The results are then written line by line into a csv file; the output ends up at around 100 million lines.
Code looks something like this:
import csv
with open(outputfile, "a") as outputcsv:
    with open(inputfile, "r") as inputcsv:
        reader = csv.reader(inputcsv)
        headerlist = next(reader)
        for row in reader:
            variable1 = row[headerlist.index("VAR1")]
            variableN = row[headerlist.index("VARN")]
            # Some complex calculations are done at this point (pseudocode):
            while calculations_not_complete:
                do_stuff()
                outputcsv.write(stuff)
We're now trying to convert the script to run via Hadoop, using pyspark.
I have no idea how to even start. I'm trying to work out how to iterate through an RDD object, but I don't think it can be done.
Is a line-by-line calculation like this suitable for distributed processing?
If you want to run the script directly, you can do so via spark-submit:
spark-submit --master local[*] other_parameters path_to_your_script.py   # or --master yarn
But I would suggest using the Spark APIs, as they are easy to use and will lower the coding overhead.
First you have to create a Spark session variable so that you can access all the Spark functions:
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("SparkSessionZipsExample") \
    .config("parameters", "value") \
    .getOrCreate()
Next, if you want to load a csv file:
file = spark.read.csv("path to file")
You can specify optional parameters like header, inferSchema, etc.:
file = spark.read.option("header", "true").csv("path to your file")
'file' will now be a pyspark dataframe.
You can now write the end output like this:
file.write.csv("output_path")
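Since each input line in your script produces multiple output lines, one natural fit in PySpark is a flatMap over the dataframe's rows. This is only a sketch under the assumption that your per-line logic can be wrapped in a function; the names expand_row, VAR1 and result are placeholders:
def expand_row(row):
    # replace this body with your complex calculations;
    # it must return an iterable of output records
    var1 = row["VAR1"]
    return [(var1, i) for i in range(3)]  # example: 3 output rows per input row
output = file.rdd.flatMap(expand_row).toDF(["VAR1", "result"])
output.write.csv("output_path")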
Please refer to the Spark documentation for transformations and other information.
I'm working on a file parser for Spark that can basically read in n lines at a time and place all of those lines as a single row in a dataframe.
I know I need to use InputFormat to try and specify that, but I cannot find a good guide to this in Python.
Is there a method for specifying a custom InputFormat in Python or do I need to create it as a scala file and then specify the jar in spark-submit?
You can directly use the InputFormats with Pyspark.
Quoting from the documentation,
PySpark can also read any Hadoop InputFormat or write any Hadoop
OutputFormat, for both ‘new’ and ‘old’ Hadoop MapReduce APIs.
Pass the Hadoop InputFormat class to whichever of these pyspark.SparkContext methods suits your case:
hadoopFile()
hadoopRDD()
newAPIHadoopFile()
newAPIHadoopRDD()
To read n lines at a time, org.apache.hadoop.mapreduce.lib.input.NLineInputFormat can be used as the Hadoop InputFormat class with the newAPI* methods.
I cannot find a good guide to this in Python
In the Spark docs, under "Saving and Loading Other Hadoop Input/Output Formats", there is an Elasticsearch example + links to an HBase example.
can basically read in n lines at a time... I know I need to use InputFormat to try and specify that
There is NLineInputFormat specifically for that.
This is a rough translation of some Scala code I have from NLineInputFormat not working in Spark
from pyspark import SparkContext

def nline(n, path):
    sc = SparkContext.getOrCreate()
    conf = {
        "mapreduce.input.lineinputformat.linespermap": str(n)
    }
    hadoopIO = "org.apache.hadoop.io"
    return sc.newAPIHadoopFile(path,
                               "org.apache.hadoop.mapreduce.lib.input.NLineInputFormat",
                               hadoopIO + ".LongWritable",
                               hadoopIO + ".Text",
                               conf=conf).map(lambda x: x[1])  # strip out the file-offset key
n = 3
rdd = nline(n, "/file/input")
and place all of those lines as a single row in a dataframe
With NLineInputFormat, each string in the RDD is actually newline-delimited. You can use rdd.map(lambda record: "\t".join(record.split('\n'))), for example, to join them into one line.
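If the end goal is a dataframe with one row per group of n lines, here is a small sketch building on the nline() helper above (the column name lines is made up):
from pyspark.sql import Row, SparkSession
spark = SparkSession.builder.getOrCreate()
rows = nline(3, "/file/input").map(lambda record: Row(lines=record.split('\n')))
df = spark.createDataFrame(rows)   # one row per n-line group; lines is an array of strings
df.show(truncate=False)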
I have Python code that I want to implement in Spark; however, I am unable to get the RDD logic right to make it work in Spark version 1.1. This code works perfectly in plain Python, but I would like to implement it in Spark.
import lxml.etree
import csv
sc = SparkContext
data = sc.textFile("pain001.xml")
rdd = sc.parallelize(data)
# compile xpath selectors for element text
selectors = ('GrpHdr/MsgId', 'GrpHdr/CreDtTm')  # etc...
xpath = [lxml.etree.XPath('{}/text()'.format(s)) for s in selectors]
# open result csv file
with open('pain.csv', 'w') as paincsv:
    writer = csv.writer(paincsv)
    # read file with 1 'CstmrCdtTrfInitn' record per line
    with open(rdd) as painxml:
        # process each record
        for index, line in enumerate(painxml):
            if not line.strip():  # allow empty lines
                continue
            try:
                # each line is an xml doc
                pain001 = lxml.etree.fromstring(line)
                # move to the customer elem
                elem = pain001.find('CstmrCdtTrfInitn')
                # select each value and write to csv
                writer.writerow([xp(elem)[0].strip() for xp in xpath])
            except Exception, e:
                # give a hint where things go bad
                sys.stderr.write("Error line {}, {}".format(index, str(e)))
                raise
I am getting an error that the RDD is not iterable.
I want to implement this code as a function and run it as a standalone program in Spark.
I would like the input file to be processed from HDFS as well as in local mode in Spark with the Python module.
I appreciate any responses to this problem.
The error you are getting is very informative: when you do with open(rdd) as painxml: you then try to iterate over the RDD as if it were a normal Python list or tuple, and an RDD is not iterable. Furthermore, if you read the textFile documentation, you will notice that it returns an RDD.
I think the problem is that you are trying to achieve this in a classic way, when you must approach it inside the MapReduce paradigm. If you are really new to Apache Spark, you can audit the course Scalable Machine Learning with Apache Spark; I would also recommend updating your Spark version to 1.5 or 1.6 (which will come out soon).
Just as a small example (but not using xmls):
Import the required files
import re
import csv
Read the input file
content = sc.textFile("../test")
content.collect()
# Out[8]: [u'1st record-1', u'2nd record-2', u'3rd record-3', u'4th record-4']
Map the RDD to manipulate each row
# Map it and convert it to tuples
rdd = content.map(lambda s: tuple(re.split("-+",s)))
rdd.collect()
# Out[9]: [(u'1st record', u'1'),
# (u'2nd record', u'2'),
# (u'3rd record', u'3'),
# (u'4th record', u'4')]
Write your data
with open("../test.csv", "w") as fw:
writer = csv.writer(fw)
for r1 in rdd.toLocalIterator():
writer.writerow(r1)
Take a look...
$ cat test.csv
1st record,1
2nd record,2
3rd record,3
4th record,4
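Applying the same pattern to the original XML problem, here is a rough sketch; it assumes, as in your code, that each non-empty line of the input file is one complete XML document and that lxml is installed on the workers:
import lxml.etree
selectors = ('GrpHdr/MsgId', 'GrpHdr/CreDtTm')  # etc...
def parse_line(line):
    # each line is an xml doc; extract one value per selector
    pain001 = lxml.etree.fromstring(line)
    elem = pain001.find('CstmrCdtTrfInitn')
    return [elem.xpath(s + '/text()')[0].strip() for s in selectors]
records = (sc.textFile("pain001.xml")
             .filter(lambda line: line.strip())
             .map(parse_line))
# then write records out with toLocalIterator() and csv.writer as shown above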
Note: If you want to read XML with Apache Spark, there are libraries on GitHub such as spark-xml; you may also find this question interesting: xml processing in spark.
I have two files on HDFS and I just want to join these two files on a column, say employee id.
I am trying to simply print the files to make sure we are reading them correctly from HDFS.
lines = sc.textFile("hdfs://ip:8020/emp.txt")
print lines.count()
I have tried the foreach and println functions as well, and I am not able to display the file data.
I am working in Python and am totally new to both Python and Spark.
This is really easy, just do a collect.
You must be sure that all the data fits in memory on your master.
my_rdd = sc.parallelize(xrange(10000000))
print my_rdd.collect()
If that is not the case, you should just take a sample by using the take method.
# I use an exaggerated number to remind you it is very large and won't fit in memory on your master, so collect wouldn't work
my_rdd = sc.parallelize(xrange(100000000000000000))
print my_rdd.take(100)
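To actually join the two files on employee id, here is a minimal sketch, assuming both files are comma-separated with the employee id in the first column and a hypothetical second file dept.txt (adjust the delimiter, column index and paths to your data):
emp = sc.textFile("hdfs://ip:8020/emp.txt") \
        .map(lambda line: line.split(",")) \
        .map(lambda cols: (cols[0], cols))   # key each record by employee id
dept = sc.textFile("hdfs://ip:8020/dept.txt") \
         .map(lambda line: line.split(",")) \
         .map(lambda cols: (cols[0], cols))
joined = emp.join(dept)                      # (emp_id, (emp_record, dept_record))
print joined.take(5)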