Round-trip HTML using Python xml.etree.ElementTree or lxml.ElementTree - python

I have code that creates and saves XML fragments. Now I would like to handle HTML as well. ElementTree.write() has a method="html" parameter that suppresses end tags for "area", "base", "basefont", "br", "col", "frame", "hr", "img", "input", "isindex", "link", "meta", and "param" elements. Unfortunately, reading a file that contains those tags throws an error. I'm reluctant to rewrite a lot of my code to use an HTML-specific package. At this point, I'm seriously considering sub-classing the XMLParser to insert an end tag for those tags. Is there an accepted way to do this?
And it just occurred to me that perhaps html.parser.HTMLParser will solve my problems. I will investigate and report back what I find.
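For anyone hitting the same wall, the rough direction I have in mind looks like this: drive an xml.etree.ElementTree.TreeBuilder from html.parser.HTMLParser events and close the void elements myself. This is an untested sketch (the class and function names are mine), not a finished solution:

from html.parser import HTMLParser
from xml.etree import ElementTree

# The void elements that method="html" writes without end tags.
VOID_TAGS = {"area", "base", "basefont", "br", "col", "frame", "hr",
             "img", "input", "isindex", "link", "meta", "param"}


class ETreeHTMLParser(HTMLParser):
    """Feed html.parser events into an ElementTree.TreeBuilder,
    auto-closing void elements so the tree stays balanced."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.builder = ElementTree.TreeBuilder()

    def handle_starttag(self, tag, attrs):
        self.builder.start(tag, dict(attrs))
        if tag in VOID_TAGS:
            # <br>, <img ...> etc. never get an end tag in the source
            self.builder.end(tag)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.builder.end(tag)

    def handle_data(self, data):
        self.builder.data(data)


def parse_html(text):
    """Parse an HTML fragment with a single root and return the root Element."""
    parser = ETreeHTMLParser()
    parser.feed(text)
    parser.close()
    return parser.builder.close()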

Related

Python parsing json data

I have a json object saved inside test_data and I need to know if the string inside test_data['sign_in_info']['package_type'] contains the string "vacation_package" in it. I assumed that in could help but I'm not sure how to use it properly or if it's correct to use it. This is an example of the json object:
"checkout_details": {
"file_name" : "pnc04",
"test_directory" : "test_pnc04_package_today3_signedout_noinsurance_cc",
"scope": "wdw",
"number_of_adults": "2",
"number_of_children": "0",
"sign_in_info": {
"should_login": false,
**"package_type": "vacation_package"**
},
The package type has "vacation_package" in it here, but it's not always this way.
For now I'm only saving the data this way:
package_type = test_data['sign_in_info']['package_type']
Now, is it ok to do something like:
p= "vacation_package"
if(p in package_type):
....
Or do I have to use 're' to cut the string and find it that way?
Your answer depends on what exactly you expect to get from test_data['sign_in_info']['package_type']. Will 'vacation_package' always be by itself? Then in is fine. Could it be part of a larger string? Then you need to use re.search. It might be safer just to use re.search (and it's a good opportunity to practice regular expressions).
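A quick illustration of both checks (with a made-up test_data so the snippet runs on its own):

import re

test_data = {"sign_in_info": {"package_type": "vacation_package_plus_tickets"}}
package_type = test_data['sign_in_info']['package_type']

# substring check: True for "vacation_package" and for longer values too
if "vacation_package" in package_type:
    print("found with `in`")

# the same check done with a regular expression
if re.search(r"vacation_package", package_type):
    print("found with re.search")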
No need to use re, assuming you are using the json package. Yes, it's okay to do that, but are you trying to see if there is a "package type" listed, or if the package type contains vacation_package, possibly among other things? If not, this might be closer to what you want, as it checks for exact matches:
import json
data = json.load(open('file.json'))
if data['sign_in_info'].get('package_type') == "vacation_package":
    pass  # do something

JSON object parsing and how to escape unicode characters

I'm fairly new to javascript and such so I don't know if this will be worded correctly, but I'm trying to parse a JSON object that I read from a database. I send the html page the variable from a python script using Django where the variable looks like this:
{
    "data": {
        "nodes": [
            {
                "id": "n0",
                "label": "Redditor(user_name='awesomeasianguy')"
            },
            ...
        ]
    }
}
Currently, the response looks like:
"{u'data': {u'nodes': [{u'id': u'n0', u'label': u"Redditor(user_name='awesomeasianguy')"}, ...
I tried to take out character sequences like u&#39; with a replaceAll-type statement, as seen below. This, however, is not an easy solution, and it seems like there has got to be a better way to escape those characters.
var networ_json = JSON.parse("{{ networ_json }}".replace(/u&#39;/g, '"').replace(/&#39;/g, '"').replace(/u&quot;/g, '"').replace(/&quot;/g, '"'));
If there are any suggestions on a method I'm not using or even a tool to use for this, it would be greatly appreciated.
Use the template filter "|safe" to disable escaping, like,
var networ_json = JSON.parse("{{ networ_json|safe }}");
Read up on it here: https://docs.djangoproject.com/en/dev/ref/templates/builtins/#safe
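Note that |safe only stops Django from HTML-escaping the value; the u'...' prefixes in your output come from passing the repr of a Python dict to the template. A common pattern is to serialize with json.dumps in the view before rendering. A rough sketch (the view and template names here are made up):

# views.py -- the point is json.dumps rather than the dict's repr()
import json
from django.shortcuts import render

def graph(request):
    network = {"data": {"nodes": [{"id": "n0", "label": "Redditor(user_name='awesomeasianguy')"}]}}
    return render(request, "graph.html", {"networ_json": json.dumps(network)})

With a real JSON string in the context you can usually drop the surrounding quotes and JSON.parse entirely and just write var networ_json = {{ networ_json|safe }};.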

Pythonic way to import multiple dictionaries from text file

So I have a text file,
question_one = {question:"what is 2+2", answer: "4", fake1: "5"}
question_two = {question:"what is the meaning of life?", answer:"pizza", fake:"42"}
How can I then import these dictionaries so that I could use them like this,
print(question_one["question"])
print(question_two["question"])
So the outcome would be
what is 2+2
what is the meaning of life?
I would like this so that I can add questions to the text file from within the program and then save them should I add more. If this is possible another way, please let me know!
The simplest way would be to store your questions in a JSON file, as @Thom Wiggers is suggesting.
Here's an example:
[
    {
        "question": "what is 2+2",
        "answer": "4",
        "fake1": "5"
    },
    {
        "question": "what is the meaning of life?",
        "answer": "pizza",
        "fake1": "42"
    }
]
import json

with open('questions.json') as f:
    questions = json.load(f)

for question in questions:
    print(question['question'])
You can read more about the JSON module in the official documentation.
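Since you also want to add questions from inside the program and save them again, the same module handles writing. A small sketch, reusing the questions list loaded above (the new question is just example data):

import json

with open('questions.json') as f:
    questions = json.load(f)

# add a new question from inside the program...
questions.append({
    "question": "what is the capital of France?",
    "answer": "Paris",
    "fake1": "Lyon",
})

# ...and write the whole list back out
with open('questions.json', 'w') as f:
    json.dump(questions, f, indent=4)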
If you only want to serialize data, you want to use pickle or json. exec will execute arbitrary Python code, and can be a serious security problem.
pickle is faster and is specifically tailored to Python, while json can be read & written by just about any programming language, and is still fairly human-readable & human-editable.
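For comparison, the pickle equivalent of the load/save round trip looks like this (just a sketch; note the files are binary and only readable from Python):

import pickle

questions = [{"question": "what is 2+2", "answer": "4", "fake1": "5"}]

# pickle writes/reads binary files, so open with 'wb'/'rb'
with open('questions.pkl', 'wb') as f:
    pickle.dump(questions, f)

with open('questions.pkl', 'rb') as f:
    questions = pickle.load(f)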
Now, to answer the question as you asked it (you probably don't want to do this):
You can use exec()
This function supports dynamic execution of Python code. object must
be either a string or a code object. If it is a string, the string is
parsed as a suite of Python statements which is then executed (unless
a syntax error occurs).
i.e.
exec(open('data.txt', 'r').read())
Another way to do this would be to (ab)use import, assuming your file is named data.py:
import data
data.question_one['question']
This is obviously not what import was intended for... I've 'used' import like this in the past, and regretted it (there are a number of caveats, I'll leave it as an exercise to the reader to think about what they might be).
Warning: Both are eval-like statements and should be used with care: any Python code in data.txt will be executed, which is potentially dangerous. Be very sure you trust the source of whatever you pass to exec(), and don't use it if you only want to serialize data (instead of running Python code as such).
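If you really are handed a file of Python literals and only need the data out of it, ast.literal_eval is a safer middle ground than exec: it evaluates literals only and refuses arbitrary code. Note it requires properly quoted keys, unlike the file shown in the question. A small sketch:

import ast

line = 'question_one = {"question": "what is 2+2", "answer": "4", "fake1": "5"}'
# split off the variable name and evaluate only the literal on the right
name, _, literal = line.partition('=')
question_one = ast.literal_eval(literal.strip())
print(question_one["question"])  # what is 2+2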

Create a engine to replace template entries with json objects in python

I want to create an engine in Python that can replace tags in a template file with objects from JSON. I've looked at Python-based engines that use regex, but they are overly complicated and I am a little confused as to how to start and go about it. Any code to start would help.
Sample json file
{
    "webbpage": {
        "title": "Stackoverflow"
    },
    "Songs": {
        "name": "Mr Crowley"
    },
    "CoverArtists": [
        { "name": "Ozzy", "nicknames": ["Ozzman", "Ozzster"] },
        { "name": "Slipknot", "nicknames": ["Slip"] }
    ]
}
Sample template file
<html>
  <head>
    <title><%% webbpage.title %%></title>
  </head>
  <body>
    <h1><%% Songs.name %%></h1>
    <%% EACH CoverArtists artist %%>
      <%% artist.name %%>
      <%% EACH CoverArtists.nicknames nickname %%>
        <p><%% nickname %%></p>
      <%% END %%>
    <%% END %%>
  </body>
</html>
Basically, variables are identified between <%% and %%>, and loops are identified between <%% EACH .. %%> and <%% END %%>; the engine outputs plain HTML.
Check out Alex Mitchell's blog on templating (Python-based), it's extremely helpful, along with the related Git repository here.
Basically,
Take input files on command line
Parse the template file - build an Abstract Syntax Tree (AST) so it's easy to walk. Create a node base type, and then implement a concrete class for each node type
Build stack to keep track of loops, pop when loops close, look for errors of missing loop ends
Parse json file, build dictionary of objects, map AST entries and replace in html
Write output to file.
I have a version of this based on Alex Mitchell's engine here, where I've fixed some issues with the original one (and put in some code of my own) and will be trying to get rid of the regex-based matching and put in contextual matching, because regular expressions don't work too well on large, complex HTML data.
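To make the node idea from the steps above concrete, a stripped-down AST could be as small as this (a sketch with names I made up, not Alex Mitchell's actual classes):

class Node:
    """Base type: every template construct renders itself to a string."""
    def __init__(self, children=None):
        self.children = children or []

    def render(self, context):
        return "".join(child.render(context) for child in self.children)


class TextNode(Node):
    """Literal HTML between tags."""
    def __init__(self, text):
        super().__init__()
        self.text = text

    def render(self, context):
        return self.text


class VariableNode(Node):
    """<%% webbpage.title %%> -- resolve a dotted name against the context."""
    def __init__(self, dotted_name):
        super().__init__()
        self.parts = dotted_name.split(".")

    def render(self, context):
        value = context
        for part in self.parts:
            value = value[part]
        return str(value)


class EachNode(Node):
    """<%% EACH CoverArtists artist %%> ... <%% END %%> (simple names only)."""
    def __init__(self, list_name, loop_var, children=None):
        super().__init__(children)
        self.list_name = list_name
        self.loop_var = loop_var

    def render(self, context):
        out = []
        for item in context[self.list_name]:
            scope = dict(context, **{self.loop_var: item})
            out.append("".join(c.render(scope) for c in self.children))
        return "".join(out)


tree = Node([TextNode("<h1>"), VariableNode("Songs.name"), TextNode("</h1>")])
print(tree.render({"Songs": {"name": "Mr Crowley"}}))  # <h1>Mr Crowley</h1>

The parser's job is then just to turn the template text into a tree of these nodes, using the stack mentioned above to nest EACH blocks correctly.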
So basically you need to identify tokens within <%% and %%>. This step can be done with a regular expression, e.g. :
>>> import re
>>> t='<h1><%% Songs.name %%></h1>'
>>> re.search(r'<%%(.+?)%%>', t).groups()
(' Songs.name ',)
Parsing the code that this gives is quite a complicated task, and you might find some help in shlex or tokenize. Even if you need to use regular expressions for the entire project, I suggest you take a look at these, and at examples from other languages, to understand how a syntax parser works.
Once you've parsed the language, the next step is to replace the tokens with values from json. The best way would be to first load the data into python dicts, and then write a renderer to insert the data into the html.
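A minimal, loop-free version of that last step might look like this; it is only a sketch of the shape, handling plain <%% name.path %%> variables and nothing else (no EACH blocks):

import re

TOKEN = re.compile(r"<%%\s*(.+?)\s*%%>")

def lookup(dotted_name, data):
    """Resolve a dotted name such as 'webbpage.title' against nested dicts."""
    value = data
    for part in dotted_name.split("."):
        value = value[part]
    return str(value)

def render(template_text, data):
    """Replace every <%% name %%> token with its value from `data`."""
    return TOKEN.sub(lambda match: lookup(match.group(1), data), template_text)

data = {"webbpage": {"title": "Stackoverflow"}, "Songs": {"name": "Mr Crowley"}}
print(render("<title><%% webbpage.title %%></title>", data))
# <title>Stackoverflow</title>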
Even if you can't use them, you should still have a look at some templating engines such as jinja or chameleon and study their source code to get an idea of how such a project is put together.

Python XML parsing - equivalent of "grep -v" in bash

This is one of my first forays into Python. I'd normally stick with bash; however, minidom seems to perfectly suit my needs for XML parsing, so I'm giving it a shot.
First question which I can't seem to figure out is, what's the equivalent for 'grep -v' when parsing a file?
Each object I'm pulling begins with a specific tag. If, within said tag, I want to exclude a row of data based off of a certain string embedded within the tag, how do I accomplish this?
Pseudo code that I've got now (no exclusion):
mainTag = xml.getElementsByTagName("network_object")
name = network_object.getElementsByTagName("Name")[0].firstChild.data
I'd like to see the data output all "name" fields, with the exception of strings that contain "cluster". Since I'll be doing multiple searches on network_objects, I believe I need to do it at that level, but don't know how.
Etree is giving me a ton of problems, can you give me some logic to do this with minidom?
This obviously doesn't work:
name = network_object.getElementsByTagName("Name")[0].firstChild.data
if name is not 'cluster' in name
continue
First of all, step away from the minidom module. Minidom is great if you already know the DOM from other languages and really do not want to learn any other API. There are easier alternatives available, right there in the standard library. I'd use the ElementTree API instead.
You generally just loop over matches, and skip over the ones that you want to exclude as you do so:
from xml.etree import ElementTree
tree = ElementTree.parse(somefile)
for name in tree.findall('.//network_object//Name'):
    if name.text is not None and 'cluster' in name.text:
        continue  # skip this one
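If you do end up staying with minidom, the same skip-and-continue logic looks roughly like this (untested sketch; the file name is a placeholder and the element names are taken from your pseudo code):

from xml.dom import minidom

dom = minidom.parse("objects.xml")  # placeholder file name

for obj in dom.getElementsByTagName("network_object"):
    name_nodes = obj.getElementsByTagName("Name")
    if not name_nodes or name_nodes[0].firstChild is None:
        continue
    name = name_nodes[0].firstChild.data
    if "cluster" in name:
        continue  # the grep -v part: drop anything containing "cluster"
    print(name)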
