Python urllib2 decode chunked encoding - python

I have the following code to open and read URLs:
html_data = urllib2.urlopen(req).read()
and I believe this is the most standard way to read data over HTTP.
However, when the response has chunked transfer-encoding, the response starts with the following characters:
1eb0\r\n2625\r\n
<?xml version="1.0" encoding="UTF-8"?>
...
This happens because of the chunked encoding mentioned above, and it corrupts my XML data.
So I wonder: how can I get rid of all the metadata related to the chunked encoding?

I ended up with custom XML stripping, like this:
xml_start = html_data.find('<?xml')
if xml_start > 0:
    # leading chunk metadata found: log it, then cut it off
    log_user_action(req.get_host(), 'chunked data', html_data, {})
    html_data = html_data[xml_start:]
xml_end = html_data.rfind('</mytag>')
if xml_end != -1:
    # drop anything that follows the closing tag
    html_data = html_data[:xml_end + len('</mytag>')]
I can't find any simple solution.

1eb0\r\n and 2625\r\n look like chunk-size lines from chunked transfer encoding: each chunk of the body is preceded by its length in hexadecimal, followed by \r\n.

You can remove everything before <?xml:
html_data = html_data[html_data.find('<?xml'):]
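Slicing at '<?xml' only removes the first size line; any size lines that sit between later chunks stay inside the XML. If the raw body really is still chunk-encoded, decoding it chunk by chunk is more reliable. Below is a minimal Python 3 sketch of that idea. The function name decode_chunked and the sample body are made up for illustration, and trailers after the last chunk are ignored.

```python
def decode_chunked(raw: bytes) -> bytes:
    """Reassemble the payload of a chunk-encoded HTTP body (sketch, trailers ignored)."""
    payload = bytearray()
    pos = 0
    while True:
        # Each chunk starts with its size in hex, terminated by CRLF.
        line_end = raw.index(b"\r\n", pos)
        size_field = raw[pos:line_end].split(b";", 1)[0]  # drop chunk extensions
        size = int(size_field.strip().decode("ascii"), 16)
        if size == 0:
            break  # a zero-size chunk marks the end of the body
        start = line_end + 2
        payload += raw[start:start + size]
        pos = start + size + 2  # skip the CRLF that follows the chunk data
    return bytes(payload)


# Hypothetical body made of two chunks plus the terminating zero-size chunk.
body = b"5\r\n<?xml\r\n6\r\n ver?>\r\n0\r\n\r\n"
print(decode_chunked(body))
```

The HTTP layer normally decodes chunked bodies before read() returns, so seeing size lines in html_data may mean the server or a proxy applied the encoding twice; that is a guess, not something the question confirms.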

Related

Extract XML data from SOAP Response in Python

I am using a Python script to receive an XML response from a SOAP web service, and I'd like to extract specific values from the XML response. I'm trying to use the 'untangle' library, but keep getting the following error:
AttributeError: 'None' has no attribute 'Envelope'
Below is a sample of my code. I'm trying to extract the RequestType value from the XML below:
<soap:Envelope
xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
<soap:Header/>
<soap:Body>
<Response>\n
<RequestType>test</RequestType>
</Response>
</soap:Body>
</soap:Envelope>
Sample use of untangle
parsed_xml = untangle.parse(xml)
print(parsed_xml.Envelope.Response.RequestType.cdata)
I've also tried parsed_xml.Envelope.Body.Response.RequestType.cdata
This will solve your problem, assuming you want to extract 'test'. By the way, I think your response should not have 'soap:Header/':
import xmltodict
stack_d = xmltodict.parse(response.content)
stack_d['soap:Envelope']['soap:Body']['Response']['RequestType']
I think you will find the xml.etree library to be more usable in this context.
import requests
from xml.etree import ElementTree
Then we need to define the namespaces for the SOAP Response
namespaces = {
'soap': 'http://schemas.xmlsoap.org/soap/envelope/',
'a': 'http://www.etis.fskab.se/v1.0/ETISws',
}
dom = ElementTree.fromstring(response.content)
Then simply find all the matching elements:
names = dom.findall('./soap:Body', namespaces)
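For the envelope in the question, the same approach can go all the way down to RequestType. This is a small self-contained sketch: the soap_response string is copied from the question with the stray \n removed, and it assumes Response and RequestType carry no namespace, as shown there.

```python
from xml.etree import ElementTree

# Envelope from the question, held in a string for illustration.
soap_response = """<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Header/>
  <soap:Body>
    <Response>
      <RequestType>test</RequestType>
    </Response>
  </soap:Body>
</soap:Envelope>"""

namespaces = {'soap': 'http://schemas.xmlsoap.org/soap/envelope/'}

root = ElementTree.fromstring(soap_response)
# Only Body is namespaced; Response and RequestType are plain tag names.
request_type = root.find('./soap:Body/Response/RequestType', namespaces)
print(request_type.text)
```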

Ahref html issue, html tags not showing

I don't seem to be having much luck solving this issue. I am pulling data from an XML URL that looks like:
<?xml version="1.0" encoding="utf-8"?>
<task>
<tasks>
<taskId>46</taskId>
<taskUserId>4</taskUserId>
<taskName>test</taskName>
<taskMode>2</taskMode>
<taskSite>thetest.org</taskSite>
<taskUser>NULL</taskUser>
<taskPass>NULL</taskPass>
<taskEmail>NULL</taskEmail>
<taskUrl>https://www.thetest.com/</taskUrl>
<taskTitle>test</taskTitle>
<taskBody>This is a test using html tags.</taskBody>
<taskCredentials>...</taskProxy>
</tasks>
</task>
This part is where I'm having issues:
<taskBody>This is a test using html tags.</taskBody>
I pull the data using BeautifulSoup like:
# beautifulsoup setup
soup = BeautifulSoup(projects.text, 'xml')
xml_task_id = soup.find('taskId')
xml_task_user_id = soup.find('taskUserId')
xml_task_name = soup.find('taskName')
xml_mode = soup.find('taskMode')
xml_site_name = soup.find('taskSite')
xml_username = soup.find('taskUser')
xml_password = soup.find('taskPass')
xml_email = soup.find('taskEmail')
xml_url = soup.find('taskUrl')
xml_content_title = soup.find('taskTitle')
xml_content_body = soup.find('taskBody')
xml_credentials = soup.find('taskCredentials')
xml_proxy = soup.find('taskProxy')
print(xml_content_body.get_text())
When I print out this part, it prints like: This is a test using html tags.
Instead of showing the ahref tag in full like: This is a test
I literally need the full string printed as is, but it keeps executing the html code instead of printing the string.
Any help would be appreciated.
Python doesn't "execute" HTML code. I'm guessing you're viewing it in a web browser, and that's interpreting the <a> tags just like it's designed to.
Use the html.escape function to turn the tag characters into escape sequences (&gt;, &lt; and the like), which stops the browser from interpreting them.
For Python - Escape HTML Tags: https://wiki.python.org/moin/EscapingHtml
For PHP - Escape HTML Tags:
https://www.w3schools.com/php/showphp.asp?filename=demo_func_string_strip_tags2
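A short sketch of the html.escape suggestion in Python 3. The task_body string is a hypothetical stand-in, since the original markup inside taskBody is not visible in the question.

```python
import html

# Hypothetical taskBody content; the real markup is not shown in the question.
task_body = 'This is a <a href="https://www.thetest.com/">test</a> using html tags.'

# Escaping turns <, >, & and quotes into entities, so a browser shows the
# markup as literal text instead of rendering a link.
escaped = html.escape(task_body)
print(escaped)
```

If the output is only ever printed to a terminal, no escaping is needed; it matters when the string ends up inside an HTML page.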

How to iterate through XML values to make XML API call with iterative values?

I am working on a project to extract data via this API (http://www.yourmembership.com/company/api-reference/). The API requires that the calls be made via XML. I have successfully been able to make calls and retrieve data back via HTTP response object using the 'requests' library in python.
My question is I want to develop a function that will iterate through IDs from one API call and plug them into another function which will make iterative API calls to get more data back about each ID.
For example the script I have now gives me a dump of all event IDs:
import requests
xml ="""
<?xml version="1.0" encoding="UTF-8"?>
<YourMembership>
<Version>2.25</Version>
<ApiKey></ApiKey>
<CallID>008</CallID>
<></>
<SaPasscode></SaPasscode>
<Call Method = "Sa.Events.All.GetIDs">
<StartDate>2017/01/1</StartDate>
<EndDate>2017/12/31</EndDate>
</Call>
</YourMembership>
"""
headers = {'Content-Type': 'application/x-www-form-urlencoded'}
r = requests.post('https://api.yourmembership.com', data=xml,headers=headers)
print(r.text)
The response looks like this (no child objects) for 'print(r.text)':
<EventID>12345</EventID>
<EventID>67890</EventID>
<EventID>24680</EventID>
<EventID>13579</EventID>
<EventID>08642</EventID>
How can I take these event IDs, iterate through the response object, and plug them into the function below (the function is not complete, just a rough draft) to get info for each event ID by making iterative XML API calls, somehow plugging in values for the EventID field?
import requests
def xml_event_info():
    xml ="""
<?xml version="1.0" encoding="UTF-8"?>
<YourMembership>
<Version>2.25</Version>
<ApiKey>xxx-xxx</ApiKey>
<CallID>001</CallID>
<></>
<SaPasscode>xxxx</SaPasscode>
<Call Method = "Sa.Events.Event.Get">
<EventID>12345</EventID>
</Call>
</YourMembership>
"""
    headers = {'Content-Type': 'application/x-www-form-urlencoded'}
    r = requests.post('https://api.yourmembership.com', data=xml,headers=headers)
    print(r.text)
Thank you in advance, please let me know if my question does not make sense.
EDIT:
<YourMembership_Response>
<ErrCode>0</ErrCode>
<ExtendedErrorInfo></ExtendedErrorInfo>
<Sa.Events.All.GetIDs>
<EventID>98765</EventID>
</Sa.Events.All.GetIDs>
</YourMembership_Response>
Consider interfacing Python's built-in etree to parse the YM_Response and extract the EventID text as seen in example:
import xml.etree.ElementTree as et
txt="""
<YourMembership_Response>
<ErrCode>0</ErrCode>
<ExtendedErrorInfo></ExtendedErrorInfo>
<Sa.Events.All.GetIDs>
<EventID>98765</EventID>
</Sa.Events.All.GetIDs>
</YourMembership_Response>
"""
dom = et.fromstring(txt)
for i in dom.iterfind('.//EventID'):
    print(i.text)
# 98765
Putting it together, call your function once per event ID, passing the ID as a parameter that is string-formatted into the XML. Note the curly braces inside the XML string and the .format call in the requests line:
import requests
import xml.etree.ElementTree as et
def xml_event_info(eventID):
    xml ='''
<?xml version="1.0" encoding="UTF-8"?>
<YourMembership>
<Version>2.25</Version>
<ApiKey>xxx-xxx</ApiKey>
<CallID>001</CallID>
<></>
<SaPasscode>xxxx</SaPasscode>
<Call Method = "Sa.Events.Event.Get">
<EventID>{}</EventID>
</Call>
</YourMembership>
'''
    headers = {'Content-Type': 'application/x-www-form-urlencoded'}
    r = requests.post('https://api.yourmembership.com',
                      data=xml.format(eventID), headers=headers)
    print(r.text)
xml ='''
<?xml version="1.0" encoding="UTF-8"?>
<YourMembership>
<Version>2.25</Version>
<ApiKey>xxxxx</ApiKey>
<CallID>008</CallID>
<></>
<SaPasscode>xxxx</SaPasscode>
<Call Method = "Sa.Events.All.GetIDs">
<StartDate>2017/01/1</StartDate>
<EndDate>2017/12/31</EndDate>
</Call>
</YourMembership>
'''
headers = {'Content-Type': 'application/x-www-form-urlencoded'}
r = requests.post('https://api.yourmembership.com', data=xml, headers=headers)
# BUILD XML TREE OBJECT
dom = et.fromstring(r.text)
# PARSE EVENT ID TEXT AND PASS INTO FUNCTION
for i in dom.iterfind('.//EventID'):
    xml_event_info(i.text)
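It can help to keep the parsing step separate from the network calls so it can be tried on a saved response. A possible sketch follows; the helper name extract_event_ids is made up here, and treating any ErrCode other than 0 as a failure is an assumption based on the sample response, not on the API documentation.

```python
import xml.etree.ElementTree as et


def extract_event_ids(response_text):
    """Return the text of every <EventID> element in a response string."""
    dom = et.fromstring(response_text)
    # Assumption: ErrCode 0 means success, as in the sample response.
    err_code = dom.findtext('ErrCode')
    if err_code != '0':
        raise ValueError('API returned ErrCode {}'.format(err_code))
    return [node.text for node in dom.iterfind('.//EventID')]


# Sample response modelled on the one in the question (second ID added).
sample = """<YourMembership_Response>
<ErrCode>0</ErrCode>
<ExtendedErrorInfo></ExtendedErrorInfo>
<Sa.Events.All.GetIDs>
<EventID>98765</EventID>
<EventID>12345</EventID>
</Sa.Events.All.GetIDs>
</YourMembership_Response>"""

event_ids = extract_event_ids(sample)
print(event_ids)
```

The returned list can then be looped over, passing each ID to xml_event_info.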

How to replace/remove XML tag with BeautifulSoup?

I have XML in a local file that is a template for a final message that gets POSTed to a REST service. The script pre processes the template data before it gets posted.
So the template looks something like this:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<root>
<singleElement>
<subElementX>XYZ</subElementX>
</singleElement>
<repeatingElement id="11" name="Joe"/>
<repeatingElement id="12" name="Mary"/>
</root>
The message XML should look the same except that the repeatingElement tags need to be replaced with something else (XML generated by the script based on the attributes in the existing tag).
Here is my script so far:
xmlData = None
with open('conf//test1.xml', 'r') as xmlFile:
    xmlData = xmlFile.read()
xmlSoup = BeautifulSoup(xmlData, 'html.parser')
repElemList = xmlSoup.find_all('repeatingelement')
for repElem in repElemList:
    print("Processing repElem...")
    repElemID = repElem.get('id')
    repElemName = repElem.get('name')
    # now I do something with repElemID and repElemName
    # and no longer need it. I would like to replace it with <somenewtag/>
    # and dump what is in the soup object back into a string.
    # is it possible with BeautifulSoup?
Can I replace the repeating elements with something else and then dump the soup object into a new string that I can post to my REST API?
NOTE: I am using html.parser because I can't get the xml parser to work. It works alright, with the understanding that HTML parsing is more permissive than XML parsing.
You can use .replace_with() and .new_tag() methods:
for repElem in repElemList:
    print("Processing repElem...")
    repElemID = repElem.get('id')
    repElemName = repElem.get('name')
    repElem.replace_with(xmlSoup.new_tag("somenewtag"))
Then, you can dump the "soup" using str(xmlSoup) or xmlSoup.prettify().
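One caveat: html.parser lowercases tag names (which is why the question searches for 'repeatingelement'), so the dumped string would contain subelementx rather than subElementX. If the REST service is case-sensitive, a standard-library alternative is xml.etree.ElementTree. The sketch below is not BeautifulSoup; it is a swapped-in technique under the assumption that the repeating elements are direct children of root, as in the template.

```python
import xml.etree.ElementTree as ET

# Template from the question, inlined here instead of read from a file.
template = """<root>
  <singleElement>
    <subElementX>XYZ</subElementX>
  </singleElement>
  <repeatingElement id="11" name="Joe"/>
  <repeatingElement id="12" name="Mary"/>
</root>"""

root = ET.fromstring(template)
for index, child in enumerate(list(root)):
    if child.tag == 'repeatingElement':
        rep_id, rep_name = child.get('id'), child.get('name')
        new_elem = ET.Element('somenewtag')
        new_elem.tail = child.tail  # keep the original whitespace layout
        root[index] = new_elem      # replace the child in place

message = ET.tostring(root, encoding='unicode')
print(message)
```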

Unable to Access Child Node in Parsing XML with Python Language

I am very new to the Python scripting language and have recently been working on a parser for a web-based XML file.
I am able to retrieve all but one of the elements using minidom with no issues; however, there is one node I am having trouble with. The last node that I require from the XML file is 'url' within the 'image' tag, as in the following XML example:
<events>
<event id="abcde01">
<title> Name of event </title>
<url> The URL of the Event <- the url tag I do not need </url>
<image>
<url> THE URL I DO NEED </url>
</image>
</event>
Below I have copied brief sections of my code which I feel may be relevant. I would really appreciate any help retrieving this last image url node. I will also include what I have tried and the error I received when I ran this code in GAE. I am using Python 2.7, and I should probably also point out that I am saving the values in an array (for later input to a database).
class XMLParser(webapp2.RequestHandler):
    def get(self):
        base_url = 'http://api.eventful.com/rest/events/search?location=Dublin&date=Today'
        #downloads data from xml file:
        response = urllib.urlopen(base_url)
        #converts data to string
        data = response.read()
        unicode_data = data.decode('utf-8')
        data = unicode_data.encode('ascii','ignore')
        #closes file
        response.close()
        #parses xml downloaded
        dom = mdom.parseString(data)
        node = dom.documentElement #needed for declaration of variable
        #print out all event names (titles) found in the eventful xml
        event_main = dom.getElementsByTagName('event')
        #URLs list parsing - MY ATTEMPT -
        urls_list = []
        for im in event_main:
            image_url = im.getElementsByTagName("image")[0].childNodes[0]
            urls_list.append(image_url)
The error I receive is the following. Any help is much appreciated, Karen
image_url = im.getElementsByTagName("image")[0].childNodes[0]
IndexError: list index out of range
First of all, do not reencode the content. There is no need to do so, XML parsers are perfectly capable of handling encoded content.
Next, I'd use the ElementTree API for a task like this:
import urllib  # Python 2; base_url is the one defined in the question
from xml.etree import ElementTree as ET
response = urllib.urlopen(base_url)
tree = ET.parse(response)
urls_list = []
for event in tree.findall('.//event[image]'):
    # find the text content of the first <image><url> tag combination:
    image_url = event.find('.//image/url')
    if image_url is not None:
        urls_list.append(image_url.text)
This only considers event elements that have a direct image child element.
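To try the approach without calling the API, here is a self-contained Python 3 sketch that parses an inline sample. The sample document and its URLs are invented for illustration; the second event has no image element, to show that the [image] predicate skips it instead of raising an error.

```python
from xml.etree import ElementTree as ET

# Invented sample shaped like the XML in the question.
sample = """<events>
  <event id="abcde01">
    <title>Name of event</title>
    <url>http://example.com/event</url>
    <image>
      <url>http://example.com/image.png</url>
    </image>
  </event>
  <event id="abcde02">
    <title>Event without an image</title>
    <url>http://example.com/other</url>
  </event>
</events>"""

root = ET.fromstring(sample)
urls_list = []
for event in root.findall('.//event[image]'):
    # './image/url' only matches the url nested in image, not the event's own url
    image_url = event.find('./image/url')
    if image_url is not None and image_url.text:
        urls_list.append(image_url.text.strip())
print(urls_list)
```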
