We are trying to read a large number of XML files and run XQuery on them in PySpark (for example, the books XML). We are using the spark-xml-utils library.
We want to feed the directory containing the XMLs to PySpark and run the XQuery on all of them to get our results.
Reference answer: Calling Scala code in PySpark for XSLT transformations
The definition of the XQuery processor, where xquery is the XQuery string:
proc = sc._jvm.com.elsevier.spark_xml_utils.xquery.XQueryProcessor.getInstance(xquery)
We are reading the files in a directory using:
sc.wholeTextFiles("xmls/test_files")
This gives us an RDD containing all the files as a list of tuples:
[ (Filename1,FileContentAsAString), (Filename2,File2ContentAsAString) ]
The XQuery evaluates and gives us results if we run it on a single string (FileContentAsAString):
whole_files = sc.wholeTextFiles("xmls/test_files").collect()
proc.evaluate(whole_files[1][1])
# Prints the proper XQuery result for that file
Problem:
If we try to run proc.evaluate() on the RDD using a lambda function, it fails.
test_file = sc.wholeTextFiles("xmls/test_files")
test_file.map(lambda x: proc.evaluate(x[1])).collect()
# Should give us a list of XQuery results
Error:
PicklingError: Could not serialize object: TypeError: can't pickle _thread.RLock objects
These functions do work on the RDD, unlike the evaluate call above:
Print the content the XQuery is applied to:
test_file.map(lambda x: x[1]).collect()
# Outputs the content; with x[0] it gives us the list of filenames
Return the number of characters in the contents:
test_file.map(lambda x: len(x[1])).collect()
# Output: [15274, 13689, 13696]
Books example for reference:
books_xquery = """for $x in /bookstore/book
where $x/price>30
return $x/title/data()"""
proc_books = sc._jvm.com.elsevier.spark_xml_utils.xquery.XQueryProcessor.getInstance(books_xquery)
books_xml = sc.wholeTextFiles("xmls/books.xml")
books_xml.map(lambda x: proc_books.evaluate(x[1])).collect()
# Error
# I can share the stacktrace if you guys want
Unfortunately it is not possible to call a Java/Scala library directly within a map call from Python code. This answer gives a good explanation of why there is no easy way to do this. In short, the reason is that the Py4J gateway (which is necessary to "translate" the Python calls into the JVM world) only lives on the driver node, while the map calls you are trying to execute run on the executor nodes.
One way around that problem would be to wrap the XQuery function in a Scala UDF (explained here), but it still would be necessary to write a few lines of Scala code.
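For illustration, a minimal sketch of the Python side of that approach, assuming a hypothetical compiled Scala/Java UDF class com.example.XQueryUDF has been shipped to the cluster (e.g. via --jars); the class and function names here are made up:
# Sketch only: com.example.XQueryUDF is a hypothetical Scala UDF class
from pyspark.sql.types import StringType

spark.udf.registerJavaFunction("xquery_eval", "com.example.XQueryUDF", StringType())

df = spark.createDataFrame(sc.wholeTextFiles("xmls/test_files"),
                           ["filename", "content"])
# The UDF runs inside the executor JVMs, so no Py4J round trip is needed
df.selectExpr("filename", "xquery_eval(content) AS result").show()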
EDIT: If you are able to switch from XQuery to XPath, a probably easier option is to change the library. ElementTree is an XML library written in Python that also supports XPath.
The code
xmls = spark.sparkContext.wholeTextFiles("xmls/test_files")
import xml.etree.ElementTree as ET
xpathquery = "...your query..."
xmls.flatMap(lambda x: ET.fromstring(x[1]).findall(xpathquery)) \
.map(lambda x: x.text) \
.foreach(print)
would print all results of running the xpathquery against all documents loaded from the directory xmls/test_files.
First a flatMap is used, as the findall call returns a list of all matching elements within each document; flatMap flattens these lists (the result might contain more than one element per file). The second map call then maps each element to its text to get readable output.
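As a concrete sketch, the books example from the question could be adapted like this (note that ElementTree's limited XPath subset has no numeric comparison, so moving the price > 30 filter into plain Python is an assumption on my part):
import xml.etree.ElementTree as ET

def titles_over_30(xml_string):
    # ElementTree cannot express price > 30 in XPath, so select the
    # book elements and filter in Python instead
    root = ET.fromstring(xml_string)
    return [book.findtext("title")
            for book in root.findall(".//book")
            if float(book.findtext("price", default="0")) > 30]

books_xml = spark.sparkContext.wholeTextFiles("xmls/books.xml")
print(books_xml.flatMap(lambda x: titles_over_30(x[1])).collect())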
Related
Currently I'm consuming XML data from a Kafka source and processing it with Spark Structured Streaming.
To get the needed information out of the XML I am using XPath. As I want to make the pipeline more dynamic, I tried to implement a dictionary which holds the column name to be extracted and the expression itself. In a future version the dictionary could get filled from some configuration files without touching the Python job.
Unfortunately it seems not to work as desired (I am a Python noob, maybe that's why...).
The XML could be described as follows:
<root>
<header eventId ="1234" .../>
<.../>
</root>
My Python code looks like this:
df = spark.readStream.format("kafka")...load()
df = df.selectExpr("CAST(timestamp AS STRING)", "CAST(value AS STRING)")
xml_data = df \
    .selectExpr("xpath(value, './root/header/@eventId') event_id", ...) \
    .selectExpr("explode(arrays_zip(event_id, ...)) value") \
    .select('value.*')
My next step was defining the dict:
mapping_dict = {
'event_id' : './root/header/@eventId',
...
}
I tried to rebuild the expression like this:
event_id = "\"xpath(value,'" + mapping_dict.get('event_id') + "') event_id\""
xml_data = df.selectExpr(event_id, ...) \
    .selectExpr("explode(arrays_zip(event_id, ...)) value") \
    .select('value.*')
Now I tried to use the dict value in the selectExpr, but it fails with an error:
org.apache.spark.sql.AnalysisException: cannot resolve '`event_id`' given input columns: [xpath(value, './root/header/@eventId') event_id ...
So this is my first problem. The second one would be that I want to iterate over this dict and try to extract each entry from the XML. I don't know if I can do that with Structured Streaming that easily, or if I would have to use a UDF. And if so, what could a UDF for this purpose look like?
Cheers
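(A minimal sketch of the dict-driven idea, offered as an assumption rather than a tested answer: the escaped quotes in the attempt above end up inside the string that selectExpr parses, so building the entries without them and expanding the whole dict avoids the unresolved-column error.)
# Sketch only: one xpath expression per dict entry, no surrounding
# literal quotes, and the explode step derived from the same dict
exprs = ["xpath(value, '{}') AS {}".format(xpath_expr, col)
         for col, xpath_expr in mapping_dict.items()]

xml_data = (df.selectExpr(*exprs)
              .selectExpr("explode(arrays_zip({})) AS value".format(
                  ", ".join(mapping_dict.keys())))
              .select("value.*"))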
I am currently trying to build an XML file from a CSV file. My code reads the CSV file into data and begins creating the XML from the data stored within the CSV.
CSV Example:
Element,XMLFile
SubElement,XMLName,XMLFile
SubElement,XMLDate,XMLName
SubElement,XMLInformation,XMLDate
SubElement,XMLTime,XMLName
Expected Output:
<XMLFile>
<XMLName>
<XMLDate>
<XMLInformation />
</XMLDate>
<XMLTime />
</XMLName>
</XMLFile>
Currently my code attempts to look at the CSV to see what the parent is for the new subelement:
# Defines main element
# xmlElement = xml.Element(XMLFile)
xmlElement = xml.Element(csvData[rowNumber][columnNumber])
# Should Define desired parent (FAIL) and SubElement name (PASS)
# xmlSubElement = xml.SubElement(XMLFile, XMLName)
xmlSubElement = xml.SubElement(csvData[rowNumber][columnNumber + 2], csvData[rowNumber][columnNumber + 1])
When the code attempts to use the CSV source string as the parent parameter, Python 3.5 generates the following error:
TypeError: must be xml.etree.ElementTree.Element, not str
The known cause of the error is that the parent parameter is being passed as a string, when it is expected to be an Element or SubElement.
Is it possible to recall the stored value from the CSV and have it reference the Element or SubElement, instead of a string? The goal is to allow the code to read the CSV file and assign any SubElement to the parent listed in the CSV.
I cannot tell for sure, but it looks like you are doing:
ElementTree.SubElement(str, str)
when you should be doing:
ElementTree.SubElement(Element, str)
It also seems like you already know this. The real question, then, is how are you going to reference the parent object when you only know its tag string? You could search for Elements in the ElementTree with that particular tag string, but this is generally not a good idea as XML allows multiple instances of similar elements.
I would suggest you either:
Find a strategy to store references to parent elements
See if there is a way to uniquely identify the parent element using XPath
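As a minimal sketch of the first strategy (the file name structure.csv is made up, and it assumes tag names are unique, per the caveat above): keep a dictionary from tag name to the Element already created for it, so each CSV row can look up its parent by name.
import csv
import xml.etree.ElementTree as ET

elements = {}  # maps tag name -> the Element created for it

with open("structure.csv", newline="") as f:  # hypothetical file name
    for row in csv.reader(f):
        if row[0] == "Element":
            root = ET.Element(row[1])
            elements[row[1]] = root
        elif row[0] == "SubElement":
            # row[1] is the new tag, row[2] names its parent; the dict
            # resolves the parent string to a real Element
            elements[row[1]] = ET.SubElement(elements[row[2]], row[1])

print(ET.tostring(root).decode())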
I am trying to read in a log file using Apache Pig. After reading in the file I want to use my own user-defined functions in Python. What I'm trying to do is something like the following code, but it results in ERROR 1066: Unable to open iterator for alias B, for which I have been unable to find a solution via Google.
register 'userdef.py' using jython as parser;
A = LOAD 'test_data' using PigStorage() as (row);
B = FOREACH A GENERATE parser.split(A.row);
DUMP B;
However, if I replace A.row with an empty string '' the function call completes and no error occurs (but the data is not passed nor processed either).
What is the proper way to pass the row of data to the UDF in string format?
You do not need to specify A.row; row alone or $0 should work.
$0 is the first column, $1 the second one.
Be careful: PigStorage will automatically split your data if it finds any delimiter, so row may be only the first element of each row.
Antony.
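For reference, a minimal sketch of what userdef.py might look like (the schema and the splitting logic are assumptions; Pig's Jython engine makes the outputSchema decorator available to registered scripts, so no import is needed there):
# Hypothetical userdef.py
@outputSchema("parts:bag{t:tuple(part:chararray)}")
def split(row):
    # return a bag (a list of tuples) of whitespace-separated tokens
    return [(p,) for p in row.split()]
The corresponding Pig line would then be B = FOREACH A GENERATE parser.split(row);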
I have a config file that I'm reading using the following code:
import configparser as cp
config = cp.ConfigParser()
config.read('MTXXX.ini')
MT=identify_MT(msgtext)
schema_file = config.get(MT,'kbfile')
fold_text = config.get(MT,'fold')
The relevant section of the config file looks like this:
[536]
kbfile=MT536.kb
fold=:16S:TRANSDET\n
Later I try to find text contained in a dictionary that matches the 'fold' parameter, using the following function:
def test(find_text):
    return {k for k, v in dictionary.items() if find_text in v}
I get different results if I call that function in one of two ways:
test(fold_text)
Fails to find the data I want, but:
test(':16S:TRANSDET\n')
returns the results I know are there.
And, if I print the content of the dictionary, I can see that it is, as expected, shown as
:16S:TRANSDET\n
So, it matches when I enter the search text directly, but doesn't find a match when I load the same text in from a config file.
I'm guessing that there's some magic being applied here when reading/handling the \n character pattern in from the config file, but don't know how to get it to work the way I want it to.
I want to be able to parameterise using escape characters but it seems I'm blocked from doing this due to some internal mechanism.
Is there some switch I can apply to the config reader, or some extra parsing I can do to get the behavior I want? Or perhaps there's an alternate solution. I do find the configparser module convenient to use, but perhaps this is a limitation that requires an alternative, or even self-built module to lift data out of a parameter file.
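(One route that should work, as a sketch of the "extra parsing" option rather than a configparser switch I know of: configparser stores the two literal characters backslash and n rather than a newline, so decode the escape sequences after reading the value.)
import codecs

# config.get returns the literal two characters '\' and 'n';
# unicode_escape turns them into a real newline (this assumes the
# value is latin-1-safe, which these message tags appear to be)
fold_text = codecs.decode(config.get(MT, 'fold'), 'unicode_escape')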
I have to filter a list of URLs for those containing a substring, using a jsonpath expression in Python. I've tried the following but am not able to get the desired results.
I referred to http://goessner.net/articles/JsonPath/ and http://mikelev.in/2012/08/implementing-jsonpath-in-python-with-examples/
Here are the details of what I have tried:
My json response:
{
"127.0.0.1": {
"URLs": [
"http://www.test.ca/",
"http://b.scorecardresearch.com/p?ns__t=1387392184071&ns__c=ISO-8859-1&c1=3&c3=_es_7948950&c4=56568219&c5=105139691&c6=&c10=1&c11=1016510&c13=728x90&c16=dfa&c2=14397547&ax_iframe=2&ns_ce_mod=vce_st&ns__p=1387391507295&ax_cid=14397547&ax_bl=0&ax_blt=1228&ns_ad_event=show&ns_ad_id=DCF277937840&ns_ad_sz=728x90",
"http://cdn.media.ca/a/mediative/sites/test_en.js",
"http://pt200233.unica.com/ntpage.gif?js=1&ts=1387392184554.791&lc=http%3A%2F%2Fwww.test.ca%2F%3Fni_title%3D%2Fhome%2Fhomepage&rf=http%3A%2F%2Fwww.test.ca%2F&rs=1680x1050&cd=32&ln=en&tz=GMT%20-05%3A00&jv=1&ck=UnicaID%3DwQVZatfvXZ5-YZ0yaPj&m.pn=homepage&m.mlc=%2Fhome&m.cv_c13=ctest-new&m.cv_c14=en&m.utv=ut.ctest.2.2.131022.74&m.host=www.test.ca&m.page=%2Fhome%2Fhomepage&m.mlc0=home&ets=1387392184559.194&site=test",
]
}
}
The above JSON response is parsed as:
parsed_input = json.loads(urllib.urlopen('<URL for the above JSON response>').read())
To get the list of all URLs from the JSON response, I tried the following, which works great:
'\n'.join(jsonpath.jsonpath(parsed_input, '$..URLs[*]'))
Output:
http://www.test.ca/
http://b.scorecardresearch.com/p?ns__t=1387392184071&ns__c=ISO-8859-1&c1=3&c3=_es_7948950&c4=56568219&c5=105139691&c6=&c10=1&c11=1016510&c13=728x90&c16=dfa&c2=14397547&ax_iframe=2&ns_ce_mod=vce_st&ns__p=1387391507295&ax_cid=14397547&ax_bl=0&ax_blt=1228&ns_ad_event=show&ns_ad_id=DCF277937840&ns_ad_sz=728x90
http://cdn.media.ca/a/mediative/sites/test_en.js
http://pt200233.unica.com/ntpage.gif?js=1&ts=1387392184554.791&lc=http%3A%2F%2Fwww.test.ca%2F%3Fni_title%3D%2Fhome%2Fhomepage&rf=http%3A%2F%2Fwww.test.ca%2F&rs=1680x1050&cd=32&ln=en&tz=GMT%20-05%3A00&jv=1&ck=UnicaID%3DwQVZatfvXZ5-YZ0yaPj&m.pn=homepage&m.mlc=%2Fhome&m.cv_c13=ctest-new&m.cv_c14=en&m.host=www.test.ca&m.page=%2Fhome%2Fhomepage&m.mlc0=home&ets=1387392184559.194&site=test
Next I have to retrieve only those URLs that contain the word "unica".
I've tried everything below, but receive a TypeError. What am I missing?
'\n'.join(jsonpath.jsonpath(parsed_input, '$..URLs[?(/unica/)]'))
'\n'.join(jsonpath.jsonpath(parsed_input, '$..URLs[?(@(unica))]'))
'\n'.join(jsonpath.jsonpath(parsed_input, '$..URLs[?(@.(*.unica.*))]'))
'\n'.join(jsonpath.jsonpath(parsed_input, '$.*.URLs[?(unica)]'))
'\n'.join(jsonpath.jsonpath(parsed_input, '$.*.URLs[?:unica]'))
thanks,
Sam
The ? operator introduces a script element which runs in Python, so it needs to use Python syntax.
In this case you could use:
print '\n'.join(jsonpath.jsonpath(parsed_input, "$..URLs[?('unica' in @)]"))
A useful option for these cases is to use the debug option via:
jsonpath.jsonpath(parsed_input, '$..URLs[?(/unica/)]',debug=True)
This prints out various output including:
evalx /unica/
eval /unica/
invalid syntax (<string>, line 1)
The line "eval /unica/" shows you what is being run in Python so you can see what is failing.
Following Peter's explanation, you can actually use a regular expression in a jsonpath filter expression if need be, using the dunder import built-in.
jsonpath.jsonpath(parsed_input, "$..URLs[?(__import__('re').match('.*unic', @))]")
Upon further look, jsonpath is a collection of hacks; specifically, there's this line:
# Get caller globals so eval can pick up user functions!!!
caller_globals = sys._getframe(1).f_globals
Hence, if re is imported in the module from which you call jsonpath.jsonpath, this would work as well:
jsonpath.jsonpath(parsed_input, "$..URLs[?(re.match('.*unic', @))]")
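Putting that together, a short usage sketch (the import can live anywhere in the calling module, since jsonpath evals the filter with the caller's globals):
import re        # module-level import, picked up via sys._getframe(1).f_globals
import jsonpath

matching = jsonpath.jsonpath(parsed_input, "$..URLs[?(re.match('.*unic', @))]")
# matching is now the list of URLs containing 'unic'
# (jsonpath returns False instead of a list if nothing matched)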