Trying to Parse SOAP Response in Python - python

I'm struggling to find a way to parse the data that I'm getting back from a SOAP response. I'm only familiar with Python (v3.4), but relatively new to it. I'm using suds-jurko to pull the data from a 3rd party SOAP server. The response comes back in the form of "ArrayOfXmlNode". I've tried using ElementTree in different ways to parse the data, but I either get no information or I get "TypeError: invalid file: (ArrayOfXmlNode)" errors. Googling how to handle the ArrayOfXMLNode type response has gotten me nowhere.
The first part of the SOAP response is:
(ArrayOfXmlNode){
   XmlNode[] =
      (XmlNode){
         Hl =
            (Hl){
               ID = "22437790"
               Name = "Cameron"
               SpeciesID = "1"
               Sex = "Male"
               PrimaryBreed = "German Shepherd"
               SecondaryBreed = "Mix"
               SN = ""
               Age = "35"
               OnHold = "No"
               Location = "Foster Home"
               BehaviorResult = ""
               Photo = "http://sms.petpoint.com/sms/photos/615/123.jpg"
            }
      },
I've tried iterating through the data with code similar to:
from suds.client import Client

url = 'http://qag.petpoint.com/webservices/AdoptableSearch.asmx?WSDL'
client = Client(url)
result = client.service.adoptableSearchExtended('nunya', 0, 'A', 'All', 'N')
tree = result[0]
for node in tree:
    pet_info = []
    pet_info.extend(node)
    print(pet_info)
The code above gives me the entire response in "result[0]". Below that I try to create a list from the data, but I only get the very last node (a node being one set of information from ID to Photo). Attempts to modify this approach give me either everything, nothing, or only the last node.
So then I tried to make use of ElementTree with simple code to test it out, but only get the "invalid file" errors.
import xml.etree.ElementTree as ET
from suds.client import Client
url = 'http://qag.petpoint.com/webservices/AdoptableSearch.asmx?WSDL'
client = Client(url)
result = client.service.adoptableSearchExtended('nunya', 0, 'A', 'All', 'N')
pet_info = ET.parse(result)
print(pet_info)
The result:
Traceback (most recent call last):
  File "D:\Python\Eclipse Workspace\KivyTest\src\root\nested\Parse.py", line 11, in <module>
    pet_info = ET.parse(result)
  File "D:\Programs\Python34\lib\xml\etree\ElementTree.py", line 1186, in parse
    tree.parse(source, parser)
  File "D:\Programs\Python34\lib\xml\etree\ElementTree.py", line 587, in parse
    source = open(source, "rb")
TypeError: invalid file: (ArrayOfXmlNode){
   XmlNode[] =
      (XmlNode){
         Hl =
            (Hl){
               ID = "20840097"
               Name = "Daisy"
               SpeciesID = "1"
               Sex = "Female"
               PrimaryBreed = "Terrier, Pit Bull"
               SecondaryBreed = ""
               SN = ""
               Age = "42"
               OnHold = "No"
               Location = "Dog Adoption"
               BehaviorResult = ""
               Photo = "http://sms.petpoint.com/sms/photos/615/40f428de-c015-4334-9101-89c707383817.jpg"
            }
      },
Can someone get me pointed in the right direction?

I had a similar problem parsing data from a web service using Python 3.4 and suds-jurko. I was able to solve it using the code in this post: https://stackoverflow.com/a/34844428/5874347. I used the fastest_object_to_dict function to convert the web service response into a dictionary, and from there you can parse the data as needed.
Add the fastest_object_to_dict function to the top of your file
Make your web service call
Create a new variable to save the dictionary response to
result = client.service.adoptableSearchExtended('nunya', 0, 'A', 'All', 'N')
ParsedResponse = fastest_object_to_dict(result)
Your data will now be in the form of a dictionary; you can parse the dictionary on the Python side as needed, or send it back to your AJAX call as JSON and parse it with JavaScript.
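For illustration, here is a minimal sketch of walking the converted dictionary; the key names ('XmlNode', 'Hl') and fields are assumptions based on the ArrayOfXmlNode response printed above and may differ in your payload:
ParsedResponse = fastest_object_to_dict(result)
# Key names below ('XmlNode', 'Hl') are assumed from the printed response above.
for node in ParsedResponse.get('XmlNode', []):
    hl = node.get('Hl', {})
    print(hl.get('Name'), hl.get('PrimaryBreed'), hl.get('Photo'))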
To send it back as json
import json
import sys
sys.stdout.write("content-type: text/json\r\n\r\n")
sys.stdout.write(json.dumps(ParsedResponse))

Please try this:
result[0][0]
which will give you the first element of the array (ArrayOfXmlNode).
Similarly, try this:
result[0][1][2]
which will give you the third element of element result[0][1].
Hopefully, this offers an alternative solution.
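Building on that, here is a sketch of collecting every node via attribute access rather than bare indexes; it assumes suds exposes the array as result.XmlNode and the field names shown in the printed response (ID, Name, Location), which may differ for other service methods:
all_pets = []
for node in result.XmlNode:   # assumed attribute name for the XmlNode[] array
    hl = node.Hl              # assumed nested object, per the printed response
    all_pets.append({'ID': hl.ID, 'Name': hl.Name, 'Location': hl.Location})
print(all_pets)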

If you are using Python, you can turn this kind of result into JSON from the XML.
Your SOAP result needs to be raw XML output, which you can get by passing retxml=True to the suds client.
I needed the result as JSON as well, and I ended up solving it this way:
import json
import xmltodict

# Parse the raw XML result into a dict
data_dict = xmltodict.parse(soap_response)
# Dump the dict into a JSON string
json_data = json.dumps(data_dict)
# Load the JSON string back into Python objects
parsed = json.loads(json_data)
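For completeness, a sketch of getting the raw XML out of suds so xmltodict has something to parse; retxml=True makes the client return the raw SOAP reply instead of parsed objects (the service call here simply mirrors the one from the question):
import json
import xmltodict
from suds.client import Client

# retxml=True asks suds to return the raw SOAP XML instead of Python objects.
client = Client('http://qag.petpoint.com/webservices/AdoptableSearch.asmx?WSDL', retxml=True)
soap_response = client.service.adoptableSearchExtended('nunya', 0, 'A', 'All', 'N')

data_dict = xmltodict.parse(soap_response)
print(json.dumps(data_dict, indent=2))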

Related

Output non JSON data from regex web scraping to a JSON file

I'm using requests and regex to scrape data from an entire website and then save it to a JSON file, hosted on github so I and anyone else can access the data from other devices.
The first thing I tried was just to open every single page on the website and grab all the data I want, but I found that to be unnecessary, so I decided to make two scripts: the first one finds the URL of every page on the site, and the second one is the one that gets called and scrapes the given URL. What I'm having trouble with right now is getting my data formatted correctly for the JSON file. Currently this is a sample of what the output looks like:
{
    "Console":"/neo-geo-aes",
    "Call ID":"62815",
    "URL":"https://www.pricecharting.com/game/jp-sega-mega-drive/bare-knuckle"
}{
    "Console":"/neo-geo-cd",
    "Call ID":"62817",
    "URL":"https://www.pricecharting.com/game/jp-sega-mega-drive/bare-knuckle-2"
}{
    "Console":"/neo-geo-pocket-color",
    "Call ID":"62578",
    "URL":"https://www.pricecharting.com/game/jp-sega-mega-drive/batman"
}{
    "Console":"/playstation",
    "Call ID":"62580",
    "URL":"https://www.pricecharting.com/game/jp-sega-mega-drive/batman-forever"
}
I've looked into this a lot and can't find a solution, here's the code in question:
import re
import requests
import json

##The base URL
URL = "https://www.pricecharting.com/"
r = requests.get(URL)
htmltext = r.text

##Find all system URLs
dataUrl = re.findall('(?<=<li><a href="\/console).*(?=">)', htmltext)
print(dataUrl)

##For each Item(number of consoles) find games
for i in range(len(dataUrl)):
    ##make console URL
    newUrl = ("https://www.pricecharting.com/console" + dataUrl[i])
    req = requests.get(newUrl)
    newHtml = req.text
    ##Get item URLs
    urlOne = re.findall('(?<=<a href="\/game).*(?=">)', newHtml)
    itemId = re.findall('(?<=tr id="product-).*(?=" data)', newHtml)
    ##For every item in list(items per console)
    out_list = []
    for i in range(len(urlOne)):
        ##Make item URL
        itemUrl = ("https://www.pricecharting.com/game" + urlOne[i])
        callId = (itemId[i])
        ##Format for JSON
        json_file_content = {}
        json_file_content['Console'] = dataUrl[i]
        json_file_content['Call ID'] = callId
        json_file_content['URL'] = itemUrl
        out_list.append(json_file_content)

    data_json_filename = 'docs/result.json'
    with open(data_json_filename, 'a') as data_json_file:
        json.dump(out_list, data_json_file, indent=4)
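For reference, a minimal sketch of one way to end up with a single valid JSON document, assuming the goal is one combined list: accumulate every record in one list across the loops and dump it exactly once, opening the file in 'w' mode instead of appending on every pass.
import json

# Sketch: build one list of records across all loops, then write it once.
all_items = []  # hypothetical accumulator; in the real script the append lives in the inner loop
all_items.append({"Console": "/example", "Call ID": "0", "URL": "https://example.com"})

# A single dump into a file opened with 'w' produces one valid JSON array.
with open('docs/result.json', 'w') as data_json_file:
    json.dump(all_items, data_json_file, indent=4)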

Python API gets invalid json

I just started using Python a few days ago and am trying to work out the JSON format.
I use requests to get JSON data from an API. However, the JSON I end up writing out is in the wrong format (a JSON validator finds errors).
import requests

webpage = 'https://parser-api.com/parser/arbitr_api/run.php'
API = 'cant post it' #
output_results = []
cases = ['А65-22925/2017']
for i in cases:
    params = {'key': API, 'CaseNumber': i}
    results = requests.get(webpage, params=params)
    output_results.append(results.text)

print(output_results)
with open('file_name_case.json', 'w', encoding='utf8') as wr:
    wr.write(str(output_results))
This is a snippet of the response that I get, which is wrong:
['{"Cases":[{"CaseId":"998ecaef-3da8-45ab-9f56-90bfc3375e11","CaseNumber":"\\u041065-22925\\/2017","CaseType":"\\u0410","Thirds":[],"Plaintiffs":[{"Name":"\\u041e\\u041e\\u041e \\"\\u0412\\u0420-\\u041f\\u043b\\u0430\\u0441\\u0442\\", \\u0433.\\u041a\\u0430\\u0437\\u0430\\u043d\\u044c","Address":"421001, \\u0420\\u043e\\u0441\\u0441\\u0438\\u044f, \\u0433.\\u041a\\u0430\\u0437\\u0430\\u043d\\u044c, \\u0420\\u0422, \\u0443\\u043b.\\u0421.\\u0425\\u0430\\u043a\\u0438\\u043c\\u0430, \\u0434.60, \\u043e\\u0444\\u0438\\u0441 164","Id":"dc26df83-6361-4de0-bc93-8c20ae0a4417"}],"Respondents":[{"Name":"\\u0424\\u0435\\u0434\\u0435\\u0440\\u0430\\u043b\\u044c\\u043d\\u0430\\u044f \\u0422\\u0430\\u043c\\u043e\\u0436\\u0435\\u043d\\u043d\\u0430\\u044f \\u0441\\u043b\\u0443\\u0436\\u0431\\u0430 \\u041f\\u0440\\u0438\\u0432\\u043e\\u043b\\u0436\\u0441\\u043a\\u043e\\u0435 \\u0442\\u0430\\u043c\\u043e\\u0436\\u0435\\u043d\\u043d\\u043e\\u0435 \\u0443\\u043f\\u0440\\u0430\\u0432\\u043b\\u0435\\u043d\\u0438\\u0435 \\u0422\\u0430\\u0442\\u0430\\u0440\\u0441\\u0442\\u0430\\u043d\\u0441\\u043a\\u0430\\u044f \\u0442\\u0430\\u043c\\u043e\\u0436\\u043d\\u044f, \\u0433.\\u041a\\u0430\\u0437\\u0430\\u043d\\u044c","Address":"420094, \\u0420\\u043e\\u0441\\u0441\\u0438\\u044f, \\u0433.\\u041a\\u0430\\u0437\\u0430\\u043d\\u044c, \\u0420\\u0422, \\u0443\\u043b.\\u041a\\u043e\\u0440\\u043e\\u043b\\u0435\\u043d\\u043a\\u043e, \\u0434.56","Id":"4b21e3e9-9d0c-42ce-bbec-4e1615e34698"}]...
The right format is supposed to look like this:
{"Cases":[{"CaseId":"998ecaef-3da8-45ab-9f56-90bfc3375e11","CaseNumber":"\u041065-22925\/2017","CaseType":"\u0410","Thirds":[],"Plaintiffs":[{"Name":"\u041e\u041e\u041e \"\u0412\u0420-\u041f\u043b\u0430\u0441\u0442\", \u0433.\u041a\u0430\u0437\u0430\u043d\u044c","Address":"421001, \u0420\u043e\u0441\u0441\u0438\u044f, \u0433.\u041a\u0430\u0437\u0430\u043d\u044c, \u0420\u0422, \u0443\u043b.\u0421.\u0425\u0430\u043a\u0438\u043c\u0430, \u0434.60, \u043e\u0444\u0438\u0441 164","Id":"dc26df83-6361-4de0-bc93-8c20ae0a4417"}],"Respondents":[{"Name":"\u0424\u0435\u0434\u0435\u0440\u0430\u043b\u044c\u043d\u0430\u044f \u0422\u0430\u043c\u043e\u0436\u0435\u043d\u043d\u0430\u044f \u0441\u043b\u0443\u0436\u0431\u0430 \u041f\u0440\u0438\u0432\u043e\u043b\u0436\u0441\u043a\u043e\u0435 \u0442\u0430\u043c\u043e\u0436\u0435\u043d\u043d\u043e\u0435 \u0443\u043f\u0440\u0430\u0432\u043b\u0435\u043d\u0438\u0435 \u0422\u0430\u0442\u0430\u0440\u0441\u0442\u0430\u043d\u0441\u043a\u0430\u044f \u0442\u0430\u043c\u043e\u0436\u043d\u044f, \u0433.\u041a\u0430\u0437\u0430\u043d\u044c","Address":"420094, \u0420\u043e\u0441\u0441\u0438\u044f, \u0433.\u041a\u0430\u0437\u0430\u043d\u044c, \u0420\u0422, \u0443\u043b.\u041a\u043e\u0440\u043e\u043b\u0435\u043d\u043a\u043e, \u0434.56","Id":"4b21e3e9-9d0c-42ce-bbec-4e1615e34698"}]
Please help
You can do something like this:
import json

with open('file_name_case.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False, indent=4)
Although Python syntax is very similar in many ways to JSON syntax, it is not valid JSON to just str(<some python object>). You need to use the json module to write JSON.
Instead of taking results.text directly, use results.json() to decode the JSON response from the server.
Now you have a Python list of dicts as opposed to a Python list of strings containing JSON.
Then, as in kup's answer, you can convert back to JSON:
output_results = []
for idx in cases:
    params = {'key': API, 'CaseNumber': idx}
    results = requests.get(webpage, params=params)
    output_results.append(results.json())

with open('file_name_case.json', 'w') as fobj:
    json.dump(output_results, fobj)

Getting the xml soap response instead of raw data

I am trying to get the SOAP response, read a few tags, and then put the keys and values into a dictionary.
It would be better if I could use the generated response directly and perform the required operations on it.
But since I was not able to do that, I tried storing the response in an XML file and then using that for the operations.
My problem is that the generated response is in raw form. How do I resolve this?
Example: <medical:totEeCnt val="2" />
<medical:totMbrCnt val="2" />
<medical:totDepCnt val="0" />
def soapTest():
    request = """<soapenv:Envelope.......
    auth = HTTPBasicAuth('', '')
    headers = {'content-type': 'application/soap+xml', 'SOAPAction': "", 'Host': 'bfx-b2b....com'}
    url = "https://bfx-b2b....com/B2BWEB/services/IProductPort"
    response = requests.post(url, data=request, headers=headers, auth=auth, verify=True)

    # Open local file
    fd = os.open('planRates.xml', os.O_RDWR|os.O_CREAT)
    # Convert response object into string
    response_str = str(response.content)
    # Write response to the file
    os.write(fd, response_str)
    # Close the file
    os.close(fd)

    tree = ET.parse('planRates.xml')
    root = tree.getroot()
    dict = {}
    print root

    for plan in root.findall('.//{http://services.b2b.../types/rates/dental}dentPln'):  # type: object
        plan_id = plan.get('cd')
        print plan
        print plan_id
        for rtGroup in plan.findall('.//{http://services.b2b....com/types/rates/dental}censRtGrp'):
            # print rtGroup
            for amt in rtGroup.findall('.//{http://services.b2b....com/types/rates/dental}totAnnPrem'):
                # print amt
                print amt.get('val')
                amount = amt.get('val')
                dict[plan_id] = amount
    print dict
Update:
I did a few things, and what I cannot understand is that with this version the subsequent operations work:
tree = ET.parse('data/planRates.xml')
root = tree.getroot()
dict = {}
print tree
print root
for plan in root.findall(..
output -
<xml.etree.ElementTree.ElementTree object at 0x100d7b910>
<Element '{http://schemas.xmlsoap.org/soap/envelope/}Envelope' at 0x101500450>
But after using this, it is not working:
tree = ET.fromstring(response.text)
print tree
for plan in tree.findall(..
output-:
<Element '{http://schemas.xmlsoap.org/soap/envelope/}Envelope' at 0x10d624910>
Basically I am using the same object in both cases.
Suppose you get a response that you want as a proper XML object:
rt = resp.text.encode('utf-8')
# printed rt is something like:
# '<soap:Envelope xmlns:soap="http://...">
#    <AirShoppingRS xmlns="http://www.iata.org/IATA/EDIST" Version="16.1">
#       <Document>...</Document>
#    </AirShoppingRS>
# </soap:Envelope>'
# stripping the soap envelope
startTag = '<AirShoppingRS '
endTag = '</AirShoppingRS>'
trimmed = rt[rt.find(startTag): rt.find(endTag) + len(endTag)]
# parsing
from lxml import etree as et
root = et.fromstring(trimmed)
With this root element you can use the find method, xpath, or whatever you prefer.
Obviously you need to change the start and end tags to extract the correct element from your response, but you get the idea, right?
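As an alternative sketch, you can skip the string slicing and query the full envelope with a namespace map; the namespace and element names here are taken from the example response above and are assumptions for any other payload:
from lxml import etree as et

root = et.fromstring(rt)  # parse the whole SOAP envelope
ns = {'iata': 'http://www.iata.org/IATA/EDIST'}  # namespace from the sample response
for doc in root.xpath('//iata:AirShoppingRS/iata:Document', namespaces=ns):
    print(doc.text)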

Parsing HTML using LXML Python

I'm trying to parse Oxford Dictionary in order to obtain the etymology of a given word.
import lxml.html
from urllib.request import urlopen

class SkipException(Exception):
    def __init__(self, value):
        self.value = value

try:
    doc = lxml.html.parse(urlopen('https://en.oxforddictionaries.com/definition/%s' % "good"))
except SkipException:
    doc = ''

if doc:
    table = []
    trs = doc.xpath("//div[1]/div[2]/div/div/div/div[1]/section[5]/div/p")
I cannot seem to work out how to obtain the string of text I need. I know some lines of code are missing from what I have copied, but I don't fully understand how HTML or lxml works. I would much appreciate it if someone could show me the correct way to solve this.
You don't want to do web scraping, especially when practically every dictionary has an API. In the case of Oxford, create an account at https://developer.oxforddictionaries.com/, get the API credentials from your account, and do something like this:
import requests
import json

api_base = 'https://od-api.oxforddictionaries.com:443/api/v1/entries/{}/{}'
language = 'en'
word = 'parachute'
headers = {
    'app_id': '',
    'app_key': ''
}
url = api_base.format(language, word)
reply = requests.get(url, headers=headers)
if reply.ok:
    reply_dict = json.loads(reply.text)
    results = reply_dict.get('results')
    if results:
        headword = results[0]
        entries = headword.get('lexicalEntries')[0].get('entries')
        if entries:
            entry = entries[0]
            senses = entry.get('senses')
            if senses:
                sense = senses[0]
                print(sense.get('short_definitions'))
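Since the original question was about etymology: continuing from the entry object above, the v1 payload may also include an 'etymologies' list that can be read the same way (the key name is an assumption here, so check for its absence):
# 'etymologies' is an assumed key; not every entry carries it.
etymologies = entry.get('etymologies')
if etymologies:
    print(etymologies[0])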
Here's a sample to get you started scraping Oxford dictionary pages:
import lxml.html as lh
from urllib.request import urlopen

url = 'https://en.oxforddictionaries.com/definition/parachute'
html = urlopen(url)
root = lh.parse(html)
body = root.find("body")
# XPath attribute tests use '@': select the spans whose class is 'ind'
elements = body.xpath("//span[@class='ind']")
for element in elements:
    print(element.text)
To find the correct search string you need to format the html so you can see the structure. I used the html formatter at https://www.freeformatter.com/html-formatter.html. Looking at the formatted HTML, I could see the definitions were in the span elements with the 'ind' class attribute.

How to use Hacker News API in Python?

Hacker News has released an API, how do I use it in Python?
I want to get all the top posts. I tried using urllib, but I don't think I am doing it right.
here's my code:
import urllib2
response = urllib2.urlopen('https://hacker-news.firebaseio.com/v0/topstories.json?print=pretty')
html = response.read()
print response.read()
It just prints empty
''
I missed a line; I have updated my code.
As @jonrsharpe explained, read() is a one-time operation. So if you print html, you will get the list of all IDs. If you go through that list, you then have to make a request for each ID to get its story.
First you have to convert the received data to a Python list and go through it:
import json

base_url = 'https://hacker-news.firebaseio.com/v0/item/{}.json?print=pretty'
top_story_ids = json.loads(html)  # html is the string read from the topstories response above
for story in top_story_ids:
    response = urllib2.urlopen(base_url.format(story))
    print response.read()
Instead of all this, you could use haxor, a Python wrapper for the Hacker News API. The following code will fetch the IDs of all top stories:
from hackernews import HackerNews
hn = HackerNews()
top_story_ids = hn.top_stories()
# >>> top_story_ids
# [8432709, 8432616, 8433237, ...]
Then you can loop through them and print each one, for example:
for story in top_story_ids:
    print hn.get_item(story)
Disclaimer: I wrote haxor.
You should
print html
instead of
print response.read()
Why? Because the read is a one-time operation; after you've done it, you can't repeat it:
>>> import urllib2
>>> response = urllib2.urlopen('https://hacker-news.firebaseio.com/v0/topstories.json?print=pretty')
>>> response.read()
'[ 8445087, 8444739, 8444603, 8443981, 8444976, 8443902, 8444252, 8444634, 8444931, 8444272, 8444025, 8441939, 8444510, 8444640, 8443830, 8445076, 8443470, 8444785, 8443028, 8444077, 8444832, 8443841, 8443467, 8443309, 8443187, 8443896, 8444971, 8443360, 8444601, 8443287, 8441095, 8441681, 8441055, 8442712, 8444909, 8443621, 8442596, 8443836, 8442266, 8443298, 8445122, 8443096, 8441699, 8442119, 8442965, 8440486, 8442093, 8443393, 8442067, 8444989, 8440985, 8444622, 8438728, 8442555, 8444880, 8442004, 8443185, 8444370, 8436210, 8437671, 8439641, 8443727, 8441702, 8436309, 8441041, 8437367, 8422087, 8441711, 8438063, 8444212, 8439408, 8442049, 8440989, 8439367, 8438515, 8437403, 8435278, 8442486, 8442730, 8428522, 8438904, 8443450, 8432703, 8430412, 8422928, 8443635, 8439267, 8440191, 8439560, 8437230, 8442556, 8439977, 8444140, 8441682, 8443776, 8441209, 8428632, 8441388, 8422599, 8439547 ]\n'
>>> response.read()
''
In your case, though, you've assigned the string from read to the name html, so you can still access it.
Once you have the story IDs, you can access each one via '.../v0/item/{item number}.json?print=pretty':
>>> response = urllib2.urlopen('https://hacker-news.firebaseio.com/v0/item/8445087.json?print=pretty')
>>> print response.read()
{
  "by" : "lalmachado",
  "id" : 8445087,
  "kids" : [ 8445205, 8445195, 8445173, 8445103 ],
  "score" : 21,
  "text" : "",
  "time" : 1413116430,
  "title" : "Show HN: Powerful ASCII art editor designed for the Mac",
  "type" : "story",
  "url" : "http://monodraw.helftone.com/"
}
You should read through the API documentation before continuing. It's also worth getting to grips with the json module.
