Use Apache Spark to implement the Python function

I have Python code that I would like to implement in Spark, but I cannot get the RDD logic right for Spark version 1.1. The code works perfectly in plain Python, and I would like to implement it in Spark:
import lxml.etree
import csv

sc = SparkContext
data = sc.textFile("pain001.xml")
rdd = sc.parallelize(data)

# compile xpath selectors for element text
selectors = ('GrpHdr/MsgId', 'GrpHdr/CreDtTm')  # etc...
xpath = [lxml.etree.XPath('{}/text()'.format(s)) for s in selectors]

# open result csv file
with open('pain.csv', 'w') as paincsv:
    writer = csv.writer(paincsv)
    # read file with 1 'CstmrCdtTrfInitn' record per line
    with open(rdd) as painxml:
        # process each record
        for index, line in enumerate(painxml):
            if not line.strip():  # allow empty lines
                continue
            try:
                # each line is an xml doc
                pain001 = lxml.etree.fromstring(line)
                # move to the customer elem
                elem = pain001.find('CstmrCdtTrfInitn')
                # select each value and write to csv
                writer.writerow([xp(elem)[0].strip() for xp in xpath])
            except Exception, e:
                # give a hint where things go bad
                sys.stderr.write("Error line {}, {}".format(index, str(e)))
                raise
I am getting an error saying the RDD is not iterable.
I want to implement this code as a function and run it as a standalone Spark program.
I would also like the input file to be processed from HDFS as well as in local mode in Spark with the Python module.
I appreciate any responses to this problem.

The error you are getting is quite informative: when you do with open(rdd) as painxml: you then try to iterate over the RDD as if it were a normal Python list or tuple, and an RDD is not iterable. Furthermore, if you read the textFile documentation, you will notice that it already returns an RDD.
I think the problem is that you are trying to do this the classic way, whereas you need to approach it within the MapReduce paradigm. If you are really new to Apache Spark, you can audit the course Scalable Machine Learning with Apache Spark, and I would also recommend updating your Spark version to 1.5 or 1.6 (which will come out soon).
Just as a small example (but not using xmls):
Import the required modules
import re
import csv
Read the input file
content = sc.textFile("../test")
content.collect()
# Out[8]: [u'1st record-1', u'2nd record-2', u'3rd record-3', u'4th record-4']
Map the RDD to manipulate each row
# Map it and convert it to tuples
rdd = content.map(lambda s: tuple(re.split("-+",s)))
rdd.collect()
# Out[9]: [(u'1st record', u'1'),
# (u'2nd record', u'2'),
# (u'3rd record', u'3'),
# (u'4th record', u'4')]
Write your data
with open("../test.csv", "w") as fw:
writer = csv.writer(fw)
for r1 in rdd.toLocalIterator():
writer.writerow(r1)
Take a look...
$ cat test.csv
1st record,1
2nd record,2
3rd record,3
4th record,4
Note: If you want to read XML with Apache Spark, there are some libraries on GitHub like spark-xml; you may also find this question interesting: xml processing in spark.
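Applied back to the question, here is a rough sketch of the same pattern with the per-line lxml parsing moved into a map. It assumes sc is an existing SparkContext and that pain001.xml really holds one complete CstmrCdtTrfInitn record per line, as in the original code; it has not been tested against the 1.1 API.
import csv
import lxml.etree

selectors = ('GrpHdr/MsgId', 'GrpHdr/CreDtTm')  # etc...

def parse_line(line):
    # compile the xpath selectors inside the task (compiled XPath objects generally don't pickle well)
    xpaths = [lxml.etree.XPath('{}/text()'.format(s)) for s in selectors]
    pain001 = lxml.etree.fromstring(line)
    elem = pain001.find('CstmrCdtTrfInitn')
    return [xp(elem)[0].strip() for xp in xpaths]

records = (sc.textFile("pain001.xml")            # already an RDD, no parallelize needed
             .filter(lambda line: line.strip())  # skip empty lines
             .map(parse_line))

# pull rows back to the driver and write the csv there
with open('pain.csv', 'w') as paincsv:
    writer = csv.writer(paincsv)
    for row in records.toLocalIterator():
        writer.writerow(row)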

Related

Read xml file using RDD using local cluster pyspark

I have a very large XML (about 100 MB) and I am using pyspark for the following reasons: to reduce the run time and to convert the data into a data frame.
Any idea how to read the XML file? (Modification is required in the code below.) One more thing: I took this from
Google and, to be precise, I did not understand how to define get_values.
spark = SparkSession.builder.master("local[2]").appName('finale').getOrCreate()
xml = os.path.join(self.violation_path, xml)
file_rdd = spark.read.text(xml, wholetext=False)
pyspark.sql.udf.UDFRegistration.register(name="get_values", f = get_values,
returnType=StringType())
myRDD = spark.parallelize(file_rdd.take(4), 4)
parsed_records = spark.runJob(myRDD, lambda part: [get_values(x) for x in part])
print (parsed_records)
Another method:
root = ET.parse(xml).getroot()
Is it the right approach to use pyspark as a local cluster? Will it be faster?
Can it only be run in a cloud container, or can it also use a local machine?

Python XML to CSV Conversion & Append Data to a Main .csv file

Links:
Sample XML
Tools in use:
Power Automate Desktop (PAD)
Command Prompt (CMD) Session operated by PAD
Python 3.10 (importing: Numpy, Pandas & using Element Tree)
SQL to read data and insert into primary database
Method:
XML to CSV conversion
I'm using Power Automate Desktop (PAD) to automate all of this because it is what I know.
The conversion method uses Python inside of CMD, importing numpy and pandas and using ElementTree.
Goal(s):
I would like to avoid the namespace being written to headers so SQL can interact with the csv data
I would like to append a master csv file with the new data coming in from every call
This last goal item is more of a wish-list item and I will likely take it over to a PAD forum; I am just including it in the rare event someone has a solution:
I'd like to avoid using CMD altogether, but given the limitations of PAD, I have to use a CMD session to run regular Python.
I am using Power Automate Desktop (PAD), which has a Python module. The only issue is that it uses IronPython 2.7, so I cannot access the libraries and don't know how to write the Python code below in IronPython. If I could get this sorted out, the entire process would be much more efficient.
IronPython module operators (again, this would be a nice-to-have; the other goals are the priority).
Issues:
xmlns is being added to some of the headers when writing converted data to csv
The python code is generating a new .csv file each loop, where I would like for it to simply append the data into a common .csv file, to build the dataset.
Details (probably unnecessary):
First, I am not a Python expert and I am pretty novice with SQL as well.
Currently, the Python code (see below) converts a web-service call, with an XML-formatted body, to CSV format, then exports that data to an actual .csv file. This works great, but it overwrites the file each time, which means I need to program scripts to read this data before the file is deleted/replaced with the new file. I then use SQL to INSERT INTO a main dataset (append data). It also prints the XMLNS in some of the headers, which is something I need to avoid.
If you are wondering why I am performing the conversion and not simply parsing the XML data: my client requires CSV format for their datasets, otherwise I'd just parse out the XML data as needed. The other reason is that this data is being updated incrementally at set intervals, building an audit trail, so a bulk conversion is not possible.
What I would like to do is have the Python code perform the conversion and then append a main dataset inside of a CSV file.
The other issue here is that the XMLNS from the XML is being pulled into some (not all) of the headers of the CSV table, which has made using SQL to read and insert into the main table an issue. I cannot figure out a way around this (again, novice).
Also, if anyone knows how to write this in IronPython 2.7, that would be great too, because I'd be able to avoid the CMD session.
So, if I could use Python to append a primary table with the converted data while escaping the namespace, that would solve all of my (current) issues and would have the added benefit of making the movement of this data far more efficient.
Also, due to the limited toolset I have, I am scripting this with Power Automate using a CMD session.
Python Code (within CMD environment):
cd\WORK\XML
python
import numpy as np
import pandas as pd
from xml.etree import ElementTree
maintree = ElementTree.parse('FILE_XML.xml')
parentroot = maintree.getroot()
all_tags = list(set([elem.tag for elem in parentroot.iter()]))
rows = []
for child in parentroot:
    temp_dict = {}
    for i in all_tags:
        tag_values = {}
        for inners in child.iter(i):
            temp_tag_value = {}
            temp_dict.update(inners.attrib)
            temp_tag_value[i.rsplit("}", 1)[1]] = inners.text
            tag_values.update(temp_tag_value)
        temp_dict.update(tag_values)
    rows.append(temp_dict)
dataframe = pd.DataFrame.from_dict(rows, orient='columns')
dataframe = dataframe.replace({np.nan: None})
dataframe.to_csv('FILE_TABLE_CMD.csv', index=False)
Given no analysis is needed, avoid pandas and consider building the CSV with csv.DictWriter, where you can open the file in append mode. The code below parses all descendants of <ClinicalData> and migrates each set into a CSV row.
from csv import DictWriter
from xml.etree import ElementTree

maintree = ElementTree.parse('FILE_XML.xml')
parentroot = maintree.getroot()
nmsp = {"doc": "http://www.cdisc.org/ns/odm/v1.3"}

# RETRIEVE ALL ELEMENT TAGS
all_tags = list(set([elem.tag for elem in parentroot.iter()]))
# RETRIEVE ALL ATTRIB KEYS
all_keys = [list(elem.attrib.keys()) for elem in maintree.iter()]
# UNNEST AND DE-DUPE
all_keys = set(sum([key for key in all_keys], []))
# COMBINE ELEM AND ATTRIB NAMES
all_tags = all_tags + list(all_keys)
all_tags = [(tag.split('}')[1] if '}' in tag else tag) for tag in all_tags]

# APPEND TO EXISTING DATA WITH 'a'
with open('FILE_TABLE_CMD.csv', 'a') as f:
    writer = DictWriter(f, fieldnames=all_tags)
    writer.writeheader()

    # ITERATE THROUGH ALL ClinicalData ELEMENTS
    for cd in parentroot.findall('doc:ClinicalData', namespaces=nmsp):
        temp_dict = {}
        # ITERATE THROUGH ALL DESCENDANTS
        for elem in cd.iter():
            # UPDATE DICT FOR ELEMENT TAG/TEXT
            temp_dict[elem.tag.split("}", 1)[1]] = elem.text
            # MERGE ELEM DICT WITH ATTRIB DICT
            temp_dict = {**temp_dict, **elem.attrib}

        # REMOVE NAMESPACES IN KEYS
        temp_dict = {
            (k.split('}')[1] if '}' in k else k): v
            for k, v in temp_dict.items()
        }

        # WRITE ROW TO CSV
        writer.writerow(temp_dict)
Actually, you can use the new iterparse feature of pandas.read_xml in the latest v1.5. Though intended for very large XML, this feature allows parsing any underlying element or attribute without the relationship restrictions required by XPath.
You will still need to find all element and attribute names. The CSV output will differ from the above, since the method removes any all-empty columns and retains the order of elements/attributes as presented in the XML. Also, pandas.DataFrame.to_csv does support append mode, but conditional logic may be needed for writing headers.
import os
from xml.etree import ElementTree
import pandas as pd  # VERSION 1.5+

maintree = ElementTree.parse('FILE_XML.xml')
parentroot = maintree.getroot()

# RETRIEVE ALL ELEMENT TAGS
all_tags = list(set([elem.tag for elem in parentroot.iter()]))
# RETRIEVE ALL ATTRIB KEYS
all_keys = [list(elem.attrib.keys()) for elem in maintree.iter()]
# UNNEST AND DE-DUPE
all_keys = set(sum([key for key in all_keys], []))
# COMBINE ELEM AND ATTRIB NAMES
all_tags = all_tags + list(all_keys)
all_tags = [(tag.split('}')[1] if '}' in tag else tag) for tag in all_tags]

clinical_data_df = pd.read_xml(
    "FILE_XML.xml", iterparse={"ClinicalData": all_tags}, parser="etree"
)

if os.path.exists("FILE_TABLE_CMD.csv"):
    # APPEND TO EXISTING CSV WITHOUT HEADERS
    clinical_data_df.to_csv("FILE_TABLE_CMD.csv", index=False, mode="a", header=False)
else:
    # CREATE CSV WITH HEADERS
    clinical_data_df.to_csv("FILE_TABLE_CMD.csv", index=False)
I recommend splitting the large XML into branches and parsing these parts separately. This can be done in an object, which can also hold the data until it is written to the CSV or to a database like sqlite3, MySQL, etc. The object can also be called from a different thread.
I have not defined the CSV writing, because I don't know which data you would like to capture, but I think you will finish that easily.
Here is my recommended concept:
import xml.etree.ElementTree as ET
import pandas as pd
#import re

class ClinicData:
    def __init__(self, branch):
        self.clinical_data = []
        self.tag_list = []
        for elem in branch.iter():
            self.tag_list.append(elem.tag)
            #print(elem.tag)

    def parse_cd(self, branch):
        for elem in branch.iter():
            if elem.tag in self.tag_list:
                print(f"{elem.tag}--->{elem.attrib}")
            if elem.tag == "{http://www.cdisc.org/ns/odm/v1.3}AuditRecord":
                AuditRec_val = pd.read_xml(ET.tostring(elem))
                print(AuditRec_val)
        branch.clear()

def main():
    """Parse each clinic data into the class """
    xml_file = 'Doug_s.xml'
    #tree = ET.parse(xml_file)
    #root = tree.getroot()
    #ns = re.match(r'{.*}', root.tag).group(0)
    #print("Namespace:", ns)
    parser = ET.XMLPullParser(['end'])
    with open(xml_file, 'r', encoding='utf-8') as et_xml:
        for line in et_xml:
            parser.feed(line)
            for event, elem in parser.read_events():
                if elem.tag == "{http://www.cdisc.org/ns/odm/v1.3}ClinicalData" and event == 'end':
                    #print(event, elem.tag)
                    elem_part = ClinicData(elem)
                    elem_part.parse_cd(elem)

if __name__ == "__main__":
    main()
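The CSV write that was left out could follow the append pattern from the earlier answers. Here is a hedged sketch of a helper that parse_cd could call instead of print; the fieldnames list and the attribute-over-text precedence are my assumptions, not part of the original concept.
from csv import DictWriter

def write_cd(branch, csv_path, fieldnames):
    # collect one row per ClinicalData branch, stripping namespaces from keys
    row = {}
    for elem in branch.iter():
        tag = elem.tag.split("}", 1)[-1]
        if tag in fieldnames:
            row[tag] = elem.text
        for k, v in elem.attrib.items():
            key = k.split("}", 1)[-1]
            if key in fieldnames:
                row[key] = v
    # append to the csv; write the header yourself the first time around
    with open(csv_path, "a", newline="") as f:
        DictWriter(f, fieldnames=fieldnames).writerow(row)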

Converting Python script to be able to run in Spark/Hadoop

I have a Python script that currently runs on my desktop. It takes a csv file with roughly 25 million lines (maybe 15 or so columns) and performs line-by-line operations.
For each line of input, multiple output lines are produced. The results are then written line by line into a csv file; the output ends up at around 100 million lines.
Code looks something like this:
with open(outputfile,"a") as outputcsv:
with open(inputfile,"r") as input csv:
headerlist=next(csv.reader(csvfile)
for row in csv.reader(csvfile):
variable1 = row[headerlist.index("VAR1")]
variableN = row[headerlist.index("VARN")]
while calculations not complete:
do stuff #Some complex calculations are done at this point
outputcsv.write(stuff)
We're now trying to convert the script to run via Hadoop, using pyspark.
I have no idea how to even start. I'm trying to work out how to iterate through an RDD object but don't think it can be done.
Is a line by line calculation like this suitable for distributed processing?
If you directly want to run the script, you could do so via spark-submit:
spark-submit --master local[*]/yarn other_parameters path_to_your_script.py
But I would suggest going with the Spark APIs, as they are easy to use and will lower the coding overhead.
First you have to create a Spark session variable so that you can access all Spark functions:
spark = SparkSession \
    .builder \
    .appName("SparkSessionZipsExample") \
    .config("parameters", "value") \
    .getOrCreate()
Next, if you want to load a csv file:
file = spark.read.csv("path to file")
You can specify optional parameters like header, inferSchema, etc.:
file = spark.read.option("header", "true").csv("path to your file")
'file' will now be a pyspark dataframe.
You can now write the end output like this:
file.write.csv("output_path")
Please refer to the Spark documentation for transformations and other information.

Using Hadoop InputFormat in Pyspark

I'm working on a file parser for Spark that can basically read in n lines at a time and place all of those lines as a single row in a dataframe.
I know I need to use InputFormat to try and specify that, but I cannot find a good guide to this in Python.
Is there a method for specifying a custom InputFormat in Python or do I need to create it as a scala file and then specify the jar in spark-submit?
You can directly use the InputFormats with Pyspark.
Quoting from the documentation,
PySpark can also read any Hadoop InputFormat or write any Hadoop
OutputFormat, for both ‘new’ and ‘old’ Hadoop MapReduce APIs.
Pass the HadoopInputFormat class to any of these methods of pyspark.SparkContext as suited,
hadoopFile()
hadoopRDD()
newAPIHadoopFile()
newAPIHadoopRDD()
To read n lines, org.apache.hadoop.mapreduce.lib.input.NLineInputFormat can be used as the HadoopInputFormat class with the newAPI methods.
I cannot find a good guide to this in Python
In the Spark docs, under "Saving and Loading Other Hadoop Input/Output Formats", there is an Elasticsearch example + links to an HBase example.
can basically read in n lines at a time... I know I need to use InputFormat to try and specify that
There is NLineInputFormat specifically for that.
This is a rough translation of some Scala code I have from NLineInputFormat not working in Spark
def nline(n, path):
    sc = SparkContext.getOrCreate()
    conf = {
        "mapreduce.input.lineinputformat.linespermap": str(n)
    }
    hadoopIO = "org.apache.hadoop.io"
    return sc.newAPIHadoopFile(path,
                               "org.apache.hadoop.mapreduce.lib.input.NLineInputFormat",
                               hadoopIO + ".LongWritable",
                               hadoopIO + ".Text",
                               conf=conf).map(lambda x: x[1])  # To strip out the file offset
n = 3
rdd = nline(n, "/file/input")
and place all of those lines as a single row in a dataframe
With NLineInputFormat, each string in the RDD is actually newline-delimited. You can rdd.map(lambda record: "\t".join(record.split('\n'))), for example, to make one line out of them.
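And to land each n-line record as a single DataFrame row, a minimal sketch (assuming an active SparkSession; the column name is arbitrary):
# each RDD element is one n-line block; keep its lines as an array column
df = rdd.map(lambda record: (record.split('\n'),)).toDF(["lines"])
df.show(truncate=False)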

How to read hadoop map file using python?

I have a map file that is block-compressed using DefaultCodec. The map file is created by a Java application like this:
MapFile.Writer writer =
    new MapFile.Writer(conf, path,
        MapFile.Writer.keyClass(IntWritable.class),
        MapFile.Writer.valueClass(BytesWritable.class),
        MapFile.Writer.compression(SequenceFile.CompressionType.BLOCK, new DefaultCodec()));
This file is stored in HDFS and I need to read some key/value pairs from it in another application using Python. I can't find any library that can do that. Do you have any suggestions or an example?
Thanks
I would suggest using Spark, which has a function called textFile() that can read files from HDFS and turn them into RDDs for further processing with other Spark libraries.
Here's the documentation: Pyspark
Create a reader as follows:
path = '/hdfs/path/to/file'
key = LongWritable()
value = LongWritable()
reader = MapFile.Reader(path)
while reader.next(key, value):
    print key, value
Check out these hadoop.io.MapFile Python examples
And available methods in MapFile.py
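Since the data part of a MapFile directory is itself a SequenceFile, another option is to read that part directly with PySpark's sequenceFile. This is only a sketch; the path and the assumption that the default converters can handle the IntWritable/BytesWritable pair from the writer above are mine.
from pyspark import SparkContext

sc = SparkContext(appName="read-mapfile")
# a MapFile directory holds 'data' (a SequenceFile) plus an 'index' file
pairs = sc.sequenceFile(
    "/hdfs/path/to/file/data",
    keyClass="org.apache.hadoop.io.IntWritable",
    valueClass="org.apache.hadoop.io.BytesWritable",
)
for k, v in pairs.take(5):
    print(k, v)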
