Currently I'm consuming XML data from a Kafka source and processing it with Spark Structured Streaming.
To get the needed information out of the XML I am using XPath. As I want to make the pipeline more dynamic, I tried to implement a dictionary which holds the column name to be extracted and the expression itself. In a future version the dictionary could get filled by some configuration files without touching the Python job.
Unfortunately it does not seem to work as desired (I am a Python noob, maybe that's why...).
The XML could be described as follows:
<root>
<header eventId="1234" .../>
<.../>
</root>
My Python code looks like this:
df = spark.readStream.format("kafka")...load()
df = df.selectExpr("CAST(timestamp AS STRING)", "CAST(value AS STRING)")
xml_data = df \
    .selectExpr("xpath(value, './root/header/#eventId') event_id", ...) \
    .selectExpr("explode(arrays_zip(event_id, ...)) value") \
    .select('value.*')
My next step was defining the dict:
mapping_dict = {
    'event_id': './root/header/#eventId',
    ...
}
I tried to rebuild the expression like this:
event_id = "\"xpath(value,'" + mapping_dict.get('event_id') + "') event_id\""
xml_data = df.selectExpr(event_id, ...) \
    .selectExpr("explode(arrays_zip(event_id, ...)) value") \
    .select('value.*')
Now I tried to use the dict value in the selectExpr, but it fails with an error:
org.apache.spark.sql.AnalysisException: cannot resolve '`event_id`' given input columns: [xpath(value, './root/header/#eventId') event_id ...
So this is my first problem. The second one would be that I want to iterate over this dict and try to extract each entry from the XML. I don't know if I can do that with Structured Streaming that easily, or if I would have to use a UDF. And if so, what could a UDF for this purpose look like?
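For reference, here is roughly what I imagine the dict-driven version should look like (just a sketch of my intent, untested):
# Sketch (untested): build one xpath expression per dict entry,
# then zip and explode the resulting arrays as before.
exprs = [
    "xpath(value, '{}') AS {}".format(xpath_expr, col_name)
    for col_name, xpath_expr in mapping_dict.items()
]
zip_expr = "explode(arrays_zip({})) AS value".format(", ".join(mapping_dict.keys()))
xml_data = df.selectExpr(*exprs).selectExpr(zip_expr).select("value.*")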
Cheers
I'm going crazy trying to get data through an API call using requests and pandas. It looks like it's nested data, but I can't get the data I need.
https://xorosoft.docs.apiary.io/#reference/sales-orders/get-sales-orders
Above is the API documentation. I'm just trying to keep it simple and get the ItemNumber and QtyRemainingToShip, but I can't even figure out how to access the nested data. I'm trying to use a DataFrame to get it, but am just lost. Any help would be appreciated. I keep getting stuck at the 'Data' level.
type(json['Data'])
df = pd.DataFrame(['Data'])
df.explode('SoEstimateHeader')
Cell In [64], line 1
df.explode([0:])
^
SyntaxError: invalid syntax
I used the link to grab a sample response from the API documentation page you provided. From the code you provided it looks like you are already able to get the data, and I'm assuming that you have it as a dictionary type already.
From what I can tell, I don't think you should be using pandas, unless it's some downstream requirement in the task you are doing. But to get the ItemNumber & QtyRemainingToShip you can use the code below.
# get the interesting part of the data out of the api response
data_list = json['Data']
# the data_list is only one element long, so grab the first element, which is of type dictionary
data = data_list[0]
# the dictionary has two keys at the top level
so_estimate_header = data['SoEstimateHeader']
# similar to the data list, the value associated with "SoEstimateItemLineArr" is of type list and has 1 element in it, so we grab the first & only element
so_estimate_item_line_arr = data['SoEstimateItemLineArr'][0]
# now we can grab the pieces of information we're interested in out of the dictionary
qtyremainingtoship = so_estimate_item_line_arr["QtyRemainingToShip"]
itemnumber = so_estimate_item_line_arr["ItemNumber"]
print("QtyRemainingToShip: ", qtyremainingtoship)
print("ItemNumber: ", itemnumber)
Output
QtyRemainingToShip: 1
ItemNumber: BC
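If pandas really is a downstream requirement, pd.json_normalize can flatten the nested line items instead; a minimal sketch, assuming response_data holds the parsed API response from above:
import pandas as pd

# Sketch: flatten the nested "SoEstimateItemLineArr" records into a DataFrame,
# assuming response_data is the parsed API response dictionary.
df = pd.json_normalize(response_data['Data'], record_path='SoEstimateItemLineArr')
print(df[['ItemNumber', 'QtyRemainingToShip']])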
Side Note
As a side note, I wouldn't name any variables json, because that's also the name of a popular Python library for parsing JSON, so it will be confusing to future readers and will clash with the name if you end up having to import the json library.
I am making a program that consists of scraping data from a job page, and I get this data:
{"job":{"ciphertext":"~01142b81f148312a7c","rid":225177647,"uid":"1416152499115024384","type":2,"access":4,"title":"Need app developers to handle our app upgrades","status":1,"category":{"name":"Mobile Development","urlSlug":"mobile-development"
,"contractorTier":2,"description":"We have an app currently built, we are looking for someone to \n\n1) Manage the app for bugs etc \n2) Provide feature upgrades \n3) Overall Management and optimization \n\nPlease get in touch and i will share more details. ","questions":null,"qualifications":{"type":0,"location":null,"minOdeskHours":0,"groupRecno":0,"shouldHavePortfolio":false,"tests":null,"minHoursWeek":40,"group":null,"prefEnglishSkill":0,"minJobSuccessScore":0,"risingTalent":true,"locationCheckRequired":false,"countries":null,"regions":null,"states":null,"timezones":null,"localMarket":false,"onSiteType":null,"locations":null,"localDescription":null,"localFlexibilityDescription":null,"earnings":null,"languages":null
],"clientActivity":{"lastBuyerActivity":null,"totalApplicants":0,"totalHired":0,"totalInvitedToInterview":0,"unansweredInvites":0,"invitationsSent":0
,"buyer":{"isPaymentMethodVerified":false,"location":{"offsetFromUtcMillis":14400000,"countryTimezone":"United Arab Emirates (UTC+04:00)","city":"Dubai","country":"United Arab Emirates"
,"stats":{"totalAssignments":31,"activeAssignmentsCount":3,"feedbackCount":27,"score":4.9258937139,"totalJobsWithHires":30,"hoursCount":7.16666667,"totalCharges":{"currencyCode":"USD","amount":19695.83
,"jobs":{"postedCount":59,"openCount":2
,"avgHourlyJobsRate":{"amount":19.999534874418824
But the problem is that the only data I need is:
- Title
- Description
- Customer activity (lastBuyerActivity, totalApplicants, totalHired, totalInvitedToInterview, unansweredInvites, invitationsSent)
- Buyer (isPaymentMethodVerified, location (Country))
- stats (all items)
- jobs (all items)
- avgHourlyJobsRate
This sort of data is JSON; Python understands it through the dictionary data type.
Suppose you have your data stored in a string. You can use di = eval(myData) to convert the string to a dictionary (note that exec does not return a value, so eval is the built-in that can do this). Then you can access the structured data like di["job"], which returns the job section of the data.
di = eval(myData)
print(di["job"])
However this is just a hack and it is not recommended, because it's messy and unpythonic, and eval will also fail on JSON literals such as true, false and null.
The appropriate way is to use the json library to convert the data to a dictionary. Take a look at the code snippet below to get an idea of the appropriate way:
import json
myData = "Put your data here"
res = json.loads(myData)
print(res["job"])
Convert the data to a dictionary using json.loads; then you can easily use the dictionary keys that you want to look up or filter the data.
This seems to be a dictionary, so you can extract something from it by doing dictionary["job"]["uid"], for example. If it is a JSON file, convert the data to a Python dictionary first.
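A sketch pulling out exactly the fields listed in the question, assuming res is the parsed dictionary from above and that stats, jobs and avgHourlyJobsRate sit under buyer, as the scraped fragment suggests:
# Sketch: collect only the fields the question asks for.
job = res["job"]
extracted = {
    "title": job["title"],
    "description": job["description"],
    "clientActivity": job["clientActivity"],
    "isPaymentMethodVerified": job["buyer"]["isPaymentMethodVerified"],
    "country": job["buyer"]["location"]["country"],
    "stats": job["buyer"]["stats"],
    "jobs": job["buyer"]["jobs"],
    "avgHourlyJobsRate": job["buyer"]["avgHourlyJobsRate"],
}
print(extracted)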
We are trying to read a large number of XMLs and run XQuery on them in PySpark, for example the books XML. We are using the spark-xml-utils library.
We want to feed the directory containing the XMLs to PySpark
and run XQuery on all of them to get our results.
reference answer: Calling scala code in pyspark for XSLT transformations
The definition of the XQuery processor, where xquery is the XQuery string:
proc = sc._jvm.com.elsevier.spark_xml_utils.xquery.XQueryProcessor.getInstance(xquery)
We are reading the files in a directory using:
sc.wholeTextFiles("xmls/test_files")
This gives us an RDD containing all the files as a list of tuples:
[ (Filename1,FileContentAsAString), (Filename2,File2ContentAsAString) ]
The XQuery evaluates and gives us results if we run it on the string (FileContentAsAString):
whole_files = sc.wholeTextFiles("xmls/test_files").collect()
proc.evaluate(whole_files[1][1])
# Prints proper xquery result for that file
Problem:
If we try to run proc.evaluate() on the RDD using a lambda function, it fails.
test_file = sc.wholeTextFiles("xmls/test_files")
test_file.map(lambda x: proc.evaluate(x[1])).collect()
# Should give us a list of xquery results
Error:
PicklingError: Could not serialize object: TypeError: can't pickle _thread.RLock objects
These functions somehow work, but not the evaluate above:
Print the content the XQuery is applied on:
test_file.map(lambda x: x[1]).collect()
# Outputs the content. if x[0], gives us the list of filenames
Return the number of characters in the contents:
test_file.map(lambda x: len(x[1])).collect()
# Output: [15274, 13689, 13696]
Books example for reference:
books_xquery = """for $x in /bookstore/book
where $x/price>30
return $x/title/data()"""
proc_books = sc._jvm.com.elsevier.spark_xml_utils.xquery.XQueryProcessor.getInstance(books_xquery)
books_xml = sc.wholeTextFiles("xmls/books.xml")
books_xml.map(lambda x: proc_books.evaluate(x[1])).collect()
# Error
# I can share the stacktrace if you guys want
Unfortunately it is not possible to call a Java/Scala library directly within a map call from Python code. This answer gives a good explanation of why there is no easy way to do this. In short, the reason is that the Py4J gateway (which is necessary to "translate" the Python calls into the JVM world) only lives on the driver node, while the map calls you are trying to execute run on the executor nodes.
One way around that problem would be to wrap the XQuery function in a Scala UDF (explained here), but it would still be necessary to write a few lines of Scala code.
EDIT: If you are able to switch from XQuery to XPath, a probably easier option is to change the library: ElementTree is an XML library written in Python that also supports XPath.
The code
xmls = spark.sparkContext.wholeTextFiles("xmls/test_files")
import xml.etree.ElementTree as ET
xpathquery = "...your query..."
xmls.flatMap(lambda x: ET.fromstring(x[1]).findall(xpathquery)) \
.map(lambda x: x.text) \
.foreach(print)
would print all results of running the xpathquery against all documents loaded from the directory xmls/test_files.
First a flatMap is used, as the findall call returns a list of all matching elements within each document; by using flatMap this list is flattened (the result might contain more than one element per file). In the second map call the elements are mapped to their text in order to get a readable output.
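For the books example from the question, note that ElementTree's XPath subset does not support numeric predicates like price>30, so that part of the query would have to move into Python; a rough sketch under that assumption:
import xml.etree.ElementTree as ET

def titles_over_30(xml_string):
    # Parse one document and emulate the XQuery "where $x/price>30" filter in Python.
    root = ET.fromstring(xml_string)
    return [
        book.find("title").text
        for book in root.findall(".//book")
        if float(book.find("price").text) > 30
    ]

books_xml = spark.sparkContext.wholeTextFiles("xmls/books.xml")
books_xml.flatMap(lambda x: titles_over_30(x[1])).foreach(print)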
I'm following this tutorial to structure Firebase data. Near the end, it says the following:
With this kind of structure, you should keep in mind to update the data at 2 locations, under the user and the group too. Also, I would like to notify you that everywhere on the Internet the object keys are written like "user1", "group1", "group2" etc., whereas in practical scenarios it is better to use Firebase-generated keys, which look like '-JglJnGDXcqLq6m844pZ'. We should use these as they will facilitate ordering and sorting.
So based on that, I'm assuming that the final result should be the following:
I'm using this Python wrapper to post the data.
How can I achieve this?
When you write data to a Firebase array (for example in JavaScript) using lines like this:
var newPostKey = firebase.database().ref().child('users').push().key;
var updates = {item1: value1, item2: value2};
return firebase.database().ref().update(updates);
As described here, you will get a generated key for "pushed" data. In the example above, newPostKey will contain this generated key.
UPDATE
To answer the updated question with the Python wrapper:
Look for the section "Saving Data" in the page you linked to.
The code would look something like this:
data = {"Title": "The Animal Book"}
book = db.child("AllBooks").push(data)
data = {"Title": "Animals"}
category = db.child("Categories").push(data)
data = {category['name']: True}
db.child("AllBooks").child(book['name']).child("categories").push(data)
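To verify the generated keys, you could read the node back; a small sketch, assuming the same Pyrebase db handle:
# Sketch: list every book together with its Firebase-generated key.
all_books = db.child("AllBooks").get()
for item in all_books.each():
    print(item.key(), item.val())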
My current code is
import lxml.objectify
import lxml.etree

xml_obj = lxml.objectify.Element('root_name')
xml_obj['root_name'] = str('text')
lxml.etree.tostring(xml_obj)
but this creates the following xml:
<root_name><root_name>text</root_name></root_name>
In the application I am using this for I could easily use text substitution to solve this problem, but it would be nice to know how to do it using the library.
I'm not that familiar with objectify, but I don't think that's the way it's intended to be used. The way it represents objects is that a node at any given level is, say, a class name, and the subnodes are field names (with types) and values. And the normal way to use it would be something more like this:
xml_obj = lxml.objectify.Element('root_name')
xml_obj.root_name = 'text'
lxml.etree.dump(xml_obj)
<root_name xmlns:py="http://codespeak.net/lxml/objectify/pytype" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" py:pytype="TREE">
<root_name py:pytype="str">text</root_name>
</root_name>
What you want would be way easier to do with etree:
xml_obj = lxml.etree.Element('root_name')
xml_obj.text = 'text'
lxml.etree.dump(xml_obj)
<root_name>text</root_name>
If you really need it to be in objectify: it looks like, while you shouldn't mix the two directly, you can use tostring to generate the XML and then objectify.fromstring to bring it back. But probably, if this is what you want, you should just use etree to generate it.
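A minimal sketch of that round-trip, reusing the etree-built element from above:
from lxml import etree, objectify

# Build the flat element with etree, then re-parse it into an objectify tree.
el = etree.Element('root_name')
el.text = 'text'
xml_obj = objectify.fromstring(etree.tostring(el))
print(etree.tostring(xml_obj))  # b'<root_name>text</root_name>'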
I don't think you can write data into the root element. You may need to create a child element, like this:
xml_obj = lxml.objectify.Element('root_name')
xml_obj.child_name = str('text')
lxml.etree.tostring(xml_obj)