looking to convert html to ascii text (ansi possible) in python

I'm having trouble finding a library to convert simple HTML (with <b>, <i>, <p>, <li> ...) to a simple text representation. Obviously this can't follow the HTML spec very far, but I don't need anything fancy. For instance, lynx does the job well (except that bold and italic are ignored, where they could probably be translated into ANSI attributes):
$ echo "<b>hello</b> <p>this is a <i>list</i> <ul><li>foo</li><li>bar</li></ul></p>" |
lynx -stdin -dump
hello
this is a list
* foo
* bar
The ideal solution would be a Python library. Otherwise I will stick with lynx, so any command better than the one I've proposed here would also be accepted.

There is html2text, which is not quite what you asked for but may suit other readers' constraints.
It converts HTML to text in Markdown format, so it does not use ANSI attributes. However, since Markdown is designed to be readable as plain text, it will probably satisfy some needs.
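For the ANSI part of the question, a small standard-library sketch is possible with html.parser. The names AnsiTextRenderer and html_to_ansi are made up for illustration, and the escape codes (1 for bold, 3 for italic, which not every terminal renders) are an assumption; only the handful of tags from the question are handled.

```python
from html.parser import HTMLParser

# ANSI SGR escape sequences (assumed terminal support; italic is not universal)
BOLD, ITALIC, RESET = "\033[1m", "\033[3m", "\033[0m"


class AnsiTextRenderer(HTMLParser):
    """Collect text from simple HTML, mapping <b>/<i> to ANSI attributes."""

    def __init__(self):
        super().__init__()
        self.parts = []   # output fragments, joined at the end
        self.styles = []  # stack of currently active ANSI codes

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            self._push(BOLD)
        elif tag == "i":
            self._push(ITALIC)
        elif tag == "li":
            self.parts.append("\n* ")
        elif tag in ("p", "ul"):
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in ("b", "i") and self.styles:
            self.styles.pop()
            # Reset everything, then re-apply the styles that are still open
            self.parts.append(RESET + "".join(self.styles))
        elif tag in ("p", "ul"):
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)

    def _push(self, code):
        self.styles.append(code)
        self.parts.append(code)


def html_to_ansi(html):
    renderer = AnsiTextRenderer()
    renderer.feed(html)
    renderer.close()
    # Trim each line and drop the empty ones left by block-level tags
    lines = (line.strip() for line in "".join(renderer.parts).splitlines())
    return "\n".join(line for line in lines if line)


print(html_to_ansi(
    "<b>hello</b> <p>this is a <i>list</i> <ul><li>foo</li><li>bar</li></ul></p>"
))
```

On a terminal that supports these codes, the output has the same layout as the lynx dump above, with "hello" in bold and "list" in italic.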

Related

Sublime Text syntax: Python 3.6 f-strings

I am trying to modify the default Python.sublime-syntax file to handle Python's f-string literals properly. My goal is to have expressions in interpolated strings recognised as such:
f"hello {person.name if person else 'there'}"
         -----------source.python----------
------string.quoted.double.block.python------
Within f-strings, ranges of text between a single { and its closing } (but ending before conversions and format specifiers such as !r}, :<5}, etc.; see PEP 498) should be recognised as expressions. As far as I know, that might look a little like this:
...
  string:
    - match: "(?<=[^\{]\{)([^\{].*)(?=(!(s|r|a))?(:.*)?\})" # I'll need a better regex
      push: expressions
However, upon inspecting the built-in Python.sublime-syntax file, the string contexts in particular are too unwieldy to even approach (~480 lines?) and I have no idea where to begin. Thanks heaps for any info.
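As a reference for what a syntax definition should scope as an expression, Python's own parser can report the pieces of each replacement field. This is a minimal sketch, assuming Python 3.9+ (for ast.unparse); the helper name fstring_fields is hypothetical, and it is only a way to check expectations, not part of any Sublime Text tooling.

```python
import ast


def fstring_fields(source):
    """Return (expression, conversion, format_spec) for each replacement field."""
    node = ast.parse(source, mode="eval").body
    if not isinstance(node, ast.JoinedStr):
        raise ValueError("expected a single f-string literal")
    fields = []
    for part in node.values:
        if not isinstance(part, ast.FormattedValue):
            continue  # literal text between replacement fields
        # conversion is stored as the character code of s, r or a, or -1 if absent
        conversion = chr(part.conversion) if part.conversion != -1 else None
        spec = None
        if part.format_spec is not None:
            # Keep only the literal pieces of the spec; nested fields are skipped
            spec = "".join(
                piece.value
                for piece in part.format_spec.values
                if isinstance(piece, ast.Constant)
            )
        fields.append((ast.unparse(part.value), conversion, spec))
    return fields


source = 'f"hello {person.name if person else \'there\'!r:<5}"'
for expression, conversion, spec in fstring_fields(source):
    print("expression:", expression)
    print("conversion:", conversion)
    print("format spec:", spec)
```

The expression part is what should end up scoped as source.python; the conversion and format spec are the parts the match has to stop before.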

Build 3127 included an update to syntax highlighting (the changelog lists "Significant improvements to Python syntax highlighting").
However, a couple of users have stated that in build 3176 Python expressions inside f-strings are still not highlighted correctly. According to @Jollywatt, the scopes are source.python f"string.quoted.double.block {constant.other.placeholder}" rather than f"string.quoted.double.block {source.python}".
It looks like Sublime uses this tool, PackageDev, "to ease the creation of snippets, syntax definitions, etc. for Sublime Text."

Markdown: Is there a way to specify raw text in markdown?

I am using Python to generate Markdown. Is there a way to specify a "raw" string in Markdown terms?
For example:
<-- magic markdown formatting to indicate not to format the following text
# This is a comment
--- end of text ---
<-- end of magic markdown formatting
It should appear as is, without Markdown touching it at all.

If you want to add comments, you could use the syntax for code.
GitHub Flavored Markdown normally uses ``` (three backticks) for this.
Python-Markdown works like Stack Overflow, where you put four spaces in front of each line.
If you do not want it formatted as code, you can simply escape the Markdown syntax, like this:
\# comment
This will display # comment rather than the word "comment" as a heading.
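Since the Markdown is being generated from Python, both approaches can be wrapped in small helpers. This is a sketch only: the function names are made up, and the set of escaped characters is the punctuation commonly treated as special in Markdown, which individual renderers may extend or narrow.

```python
import re
import textwrap

# Punctuation that is commonly backslash-escapable in Markdown (an assumption;
# check the renderer in use for its exact list)
_MD_SPECIAL = re.compile(r"([\\`*_{}\[\]()#+\-.!])")


def escape_markdown(text):
    """Backslash-escape special characters so the text renders literally."""
    return _MD_SPECIAL.sub(r"\\\1", text)


def as_code_block(text):
    """Indent every non-blank line by four spaces, the indented-code-block form."""
    return textwrap.indent(text, "    ")


raw = "# This is a comment\n--- end of text ---"
print(escape_markdown(raw))
print(as_code_block(raw))
```

The escaped form keeps the text in the normal paragraph flow, while the indented form renders it in a monospaced code block.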

Django/Textile/Pygments: " ' > being escaped

I have a blog written in django that I am attempting to add syntax highlighting to. The posts are written and stored in the database as textile markup. Here is how they are supposed to be rendered via template engine:
{{ body|textile|pygmentize|safe }}
It renders all the HTML correctly and the code gets highlighted, but some characters within the code blocks are being escaped. Specifically double quotes, single quotes, and greater than signs.
Here is the Pygments filter I am using: http://djangosnippets.org/snippets/416/
I'm not sure which filter is actually putting the escaped characters in there or how to make it stop that. Any suggestions?

Shameless plug: I answered this on another page:
https://stackoverflow.com/a/10138569/1224926
The problem is that BeautifulSoup (rightly) assumes the code is unsafe and escapes it. If you parse it into a tree and pass that in instead, it works. So your line:
code.replaceWith(highlight(code.string, lexer, HtmlFormatter()))
should become:
code.replaceWith(BeautifulSoup(highlight(code.string, lexer, HtmlFormatter())))
and you get what you would expect.

Python inline of XML or ASCII string/template?

I am generating complex XML files through Python scripts that require a bunch of conditional statements (example http://repository.azgs.az.gov/uri_gin/azgs/dlio/536/iso19139.xml). I am working with multiple XML or ASCII metadata standards that often have poor schema or are quite vague.
In PHP, I just wrote out the XML and inserted PHP snippets where needed. Is there an easy way to do that in Python? I am trying to avoid having to escape all that XML. The inline method is also very helpful for tweaking the template without much rewrite.
I have looked a bit into Python templating solutions, but they appeared either too static or overkill. Moving the whole XML into an XML object is a lot of work and costs flexibility when changing the XML or ASCII template.
Thanks for the newbie support!

wgrunberg,
I use Python's built-in string.Template class like so:
from string import Template
# $section_id and $first are placeholders filled in by substitute()
the_template = Template("<div id='$section_id'>First name: $first</div>")
print(the_template.substitute(section_id="anID", first="Sarah"))
The output of the above is:
<div id='anID'>First name: Sarah</div>
In the above example I showed an XML template, but any "template" that you can describe as a string would work.
To do conditionals you can do something like:
print(the_template.substitute(section_id="theID", first="Sarah" if 0==0 else "John"))
If your conditionals are complex, instead of expressing them inline as above, consider breaking them out into closures/functions.
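One way to break the conditionals out into functions might look like the sketch below. The fragment templates, the helper names optional and render_contact, and the shortened element names are all hypothetical; the dictionary keys are borrowed from the asker's metadata rows. Values are passed through xml.sax.saxutils.escape so that characters such as < and & do not break the output.

```python
from string import Template
from xml.sax.saxutils import escape

# Hypothetical fragments; element names are shortened for illustration
CONTACT = Template("<contact>\n${individual}${organisation}</contact>\n")
INDIVIDUAL = Template("  <individualName>${value}</individualName>\n")
ORGANISATION = Template("  <organisationName>${value}</organisationName>\n")


def optional(fragment, value):
    """Render the fragment only when value is non-empty, escaping XML characters."""
    value = (value or "").strip()
    return fragment.substitute(value=escape(value)) if value else ""


def render_contact(row):
    """Build the contact block from a dict of metadata fields."""
    return CONTACT.substitute(
        individual=optional(INDIVIDUAL, row.get("MetadataContactName")),
        organisation=optional(ORGANISATION, row.get("MetadataContactOrganisation")),
    )


print(render_contact({
    "MetadataContactName": "Sarah <S>",
    "MetadataContactOrganisation": "",
}))
```

The templates stay readable as XML, and each condition lives in one small function instead of being spread through string concatenation.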

Try any modern non-XML-based Python templating engine, e.g. Mako or Jinja2. They are fairly easy to integrate into your project, and then you will be able to write things such as:
<sometag>
%if a > b:
<anothertag />
%endif
</sometag>
You can also use inline python, including assignments.

@Gintautas' suggestion of using a good template engine would also be my first choice (especially Mako, whose "templating language" is basically Python plus some markup ;-).
However, many people prefer to build XML files from the DOM instead, e.g. using ElementTree (to stick with something in the standard library) or BeautifulSoup (a third-party package, but zero trouble to install and very popular; by all means stick with its 3.0.x release unless you're using Python 3!), specifically the BeautifulStoneSoup class (the BeautifulSoup class is for processing HTML, the stone one for XML).
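A minimal sketch of the ElementTree route, with conditional elements added by ordinary if statements. The function name build_contact and the simplified element names are assumptions for illustration; the dictionary keys follow the asker's metadata rows.

```python
import xml.etree.ElementTree as ET


def build_contact(row):
    """Build a contact element, adding child elements only when data is present."""
    contact = ET.Element("contact")
    party = ET.SubElement(contact, "ResponsibleParty")

    name = (row.get("MetadataContactName") or "").strip()
    if name:
        ET.SubElement(party, "individualName").text = name

    organisation = (row.get("MetadataContactOrganisation") or "").strip()
    if organisation:
        ET.SubElement(party, "organisationName").text = organisation

    return contact


element = build_contact({"MetadataContactName": "Sarah & Co"})
# encoding="unicode" returns a str; special characters in text are escaped
print(ET.tostring(element, encoding="unicode"))
```

Here escaping is handled by the library, at the cost of the XML no longer being visible as a template in the source.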

Going for a template is the sensible thing to do. Using the built-in string template rather than external libraries would be my preferred method, because I need to be able to pass the Python ETL scripts on to other people easily. However, I could get away without templates by putting the XML string inside single quotes and using multi-line strings. The following is an example of this crude method.
for row in metadata_dictionary:
    iso_xml = ''
    iso_xml += ' *** a bunch of XML *** '
    iso_xml += '\
<gmd:contact> \n\
<gmd:CI_ResponsibleParty> \n'
    if row['MetadataContactName']:
        iso_xml += '\
<gmd:individualName> \n\
<gco:CharacterString>'+row['MetadataContactName'].lower().strip()+'</gco:CharacterString> \n\
</gmd:individualName> \n'
    if row['MetadataContactOrganisation']:
        iso_xml += '\
<gmd:organisationName> \n\
<gco:CharacterString>'+row['MetadataContactOrganisation'].lower().strip()+'</gco:CharacterString> \n\
</gmd:organisationName> \n'
    iso_xml += ' *** more XML *** '

Searching for specific HTML string using Python

Which modules would be best for writing a Python program that searches through hundreds of HTML documents and deletes a given string of HTML?
For instance, if I have an HTML document that contains Test, I want to delete it from every HTML page that has it.
Any help is much appreciated. I don't need someone to write the program for me, just a helpful pointer in the right direction.

If the string you are searching for will be in the HTML literally, then simple string replacement will be fine:
with open(html_file) as f:
    old_html = f.read()
new_html = old_html.replace(my_string, "")
# Only rewrite the file when the string was actually found
if new_html != old_html:
    with open(html_file, "w") as f:
        f.write(new_html)
As an example of the string not being in the HTML literally, suppose you are looking for "Test" as you said. Do you want it to match these snippets of HTML?:
<a href='test.html'>Test</a>
<A HREF='test.html'>Test</A>
Test
Test
and so on: the "same" HTML can be expressed in many different ways. If you know the precise characters used in the HTML, then simple string replacement is fine. If you need to match at an HTML semantic level, then you'll need to use more advanced tools like BeautifulSoup, but then you'll also have potentially very different HTML output than you started with, even in the sections not affected by the deletion, because the entire file will have been parsed and reconstituted.
To run the code over many files, you'll find os.walk (os.path.walk in Python 2) useful for finding files in a tree, or glob.glob for matching filenames against shell-like wildcard patterns.
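Putting the literal replacement and the directory walk together might look like the following sketch. The function name, the .html/.htm extension filter, and the UTF-8 encoding are assumptions; adjust them to the files at hand, and try it on a copy of the data first, since files are rewritten in place.

```python
import os


def strip_string_from_html(root, needle):
    """Remove every literal occurrence of needle from HTML files under root.

    Returns the list of paths that were rewritten.
    """
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith((".html", ".htm")):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8") as f:
                old_html = f.read()
            new_html = old_html.replace(needle, "")
            # Leave untouched files alone so their timestamps do not change
            if new_html != old_html:
                with open(path, "w", encoding="utf-8") as f:
                    f.write(new_html)
                changed.append(path)
    return changed
```

A call such as strip_string_from_html("site", "<a href='test.html'>Test</a>") would then report which pages contained the exact string.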

BeautifulSoup or lxml.

htmllib (Python 2 only; it was removed in Python 3, where html.parser is the standard-library HTML parser)
This module defines a class which can serve as a base for parsing text
files formatted in the HyperText Mark-up Language (HTML). The class is
not directly concerned with I/O — it must be provided with input in
string form via a method, and makes calls to methods of a “formatter”
object in order to produce output. The HTMLParser class is designed to
be used as a base class for other classes in order to add
functionality, and allows most of its methods to be extended or
overridden. In turn, this class is derived from and extends the
SGMLParser class defined in module sgmllib. The HTMLParser
implementation supports the HTML 2.0 language as described in RFC
1866.
