Until recently, I posted Python code (whitespace matters) to blogspot.com using something like this:
<div style="overflow-x: scroll ">
<table bgcolor="#ffffb0" border="0" width="100%" padding="4">
<tbody><tr><td><pre style=" hidden;font-family:monaco;">
my code here
</pre></table></div>
About a week ago, the posts started acquiring additional newlines so all of this is double-spaced. Using a simple <pre> tag is no good (besides losing the color) b/c it also results in double newlines, while a <code> tag messes with the whitespace. I guess I could just add *4---but that's frowned upon or something by the HTML style gods.
The standard answer to this (like right here on SO) is to get syntax coloring or highlighting through the use of css (which I don't know much about), for example as discussed in a previous SO question here. The problem I have with that is that all such solutions require loading of a resource from a server on the web. But if (say 5 years from now) that resource is gone, the html version of the code will not render at all. If I knew Javascript I guess I could probably fix that.
The coloring problem itself is trivial, it could be solved through use of <style> tags with various definitions. But parsing is hard; at least I've not made much progress trying to parse Python myself. Multi-line strings are a particular pain. I could just ignore the hard cases and code the simple ones.
TextMate has a command Create HTML from Document. The result is fairly wordy but could just be pasted into a post. But say if you had 3 code segments, then it's like 1000 lines or something. And of course it's a document, so you have to actually cut before you paste.
Is there a simple Python parser? A better solution?
UPDATE: I wrote my own parser for syntax highlighting. Still a little buggy perhaps, but it is quite simple and a self-contained solution. I posted it here. Pygments is also a good choice as well.
Why don't you use pygments?
What worked for me was to use prettify. At the top of the HTML, add the line
<script src="https://cdn.jsdelivr.net/gh/google/code-prettify#master/loader/run_prettify.js"></script>
to auto-run prettify. Then use
<code class="prettify">
... enter your code here ...
</code>
in your HTML.
I actually used this code on blogspot.com; here is an example.
Related
Just received a task from a basic programming course in uni.
I am a complete newbie regarding computers. I am a freshman and have no prior programming experience.
The task requires making a python source code that would print a html table as output.
No use of modules is allowed.
We covered some basic python things like if, for loop, while, print, etc...
but didn't learn anything about creating html in python.
I've been searching on the internet for hours and hours, but all solutions seem so advanced and they all involve use of third-party modules, which in my case is not allowed.
Professor knows that we are all complete newbies, so there's got to be a way to do this without much professional knowledge.
Can anyone please tell me the basics of making a html table in python?
Like do I just type in things like
<tr>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
in python? Basically have no idea where to start.
** The code should be written in a way that when it is executed in bash shell ($ python file_name.py), it prints out a html table.
P.S. I'm using vs code as an editor for python.
As I can imagine the only way to make it simple without going bananas is to write
html file. So basically its .write rows into html file using python. If this is what you want - here is an example how to do that plain and simple:
from random import randint
with open('table.html', 'w') as html:
html.write('<table style="border:2px solid black;">\n')
for i in range(5):
html.write('<tr>\n')
for i in range(3):
html.write(f'<td>{randint(10,150)}</td>\n')
html.write('</tr>\n')
html.write('</table>')
"\n" - used to make you Html document look readable.
Random.randint module is used to generate data - its basic simple module just for data placeholders.
For the table to look nice you can add border to td -
<td style="border:2px solid black;" >(random data)</td>
It will look like solid html/excel table.
You won't learn anything by copy-pasting working examples that you don't understand and you can't expect to understand anything, when you search for solutions to complex problems without knowing the basics.
Given the level of experience you have according to your question, you should instead search for a Python tutorial to get a grip of the language. Read about the synthax, Python's object model and the type hierarchy. With that knowledge you will be able to understand the documentation, and with that, you should be able to solve your problem without searching for pre-made solutions.
As we haven't covered any HTML-related topic in class yet, doing the task doesn't require much complex solution. It looks like the task can be done just by using for loop, lists, if, print... etc, all of which I've already learned and am quite confident with. Many thanks to everyone who cared to give help and answers.
I'm parsing a website with the requests module and I'm trying to get specific URLs inside tags (but a table of data as the tags are used more than once) without using BeautifulSoup. Here's part of the code I'm trying to parse:
<td class="notranslate" style="height:25px;">
<a class="post-list-subject" href="/Forum/ShowPost.aspx?PostID=80631954">
<div class="thread-link-outer-wrapper">
<div class="thread-link-container notranslate">
Forum Rule: Don't Spam in Any Way
</div>
I'm trying to get the text inside the tag:
/Forum/ShowPost.aspx?PostID=80631954
The thing is, because I'm parsing a forum site, there are multiple uses of those divider tags. I'd like to retrieve a table of post URLs using string.split using code similar to this:
htmltext.split('<a class="post-list-subject" href="')[1].split('"><div class="thread-link-outer-wrapper">')[0]
There is nothing in the HTML code to indicate a post number on the page, just links.
In my opinion there are better ways to do this. Even if you don't want to use BeautifulSoup, I would lean towards regular expressions. However, the task can definitely be accomplished using the code you want. Here's one way, using a list comprehension:
results = [chunk.split('">')[0] for chunk in htmltext.split('<a class="post-list-subject" href="')[1:]]
I tried to model it as closely off of your base code as possible, but I did simplify one of the split arguments to avoid whitespace issues.
In case regular expressions are fair game, here's how you could do it:
import re
target = '<a class="post-list-subject" href="(.*)">'
results = re.findall(target, htmltext)
Consider using Beautiful Soup. It will make your life a lot easier. Pay attention to the choice of parser so that you can get the balance of speed and leniency that is appropriate for your task.
It seems really dicey to try to pre-optimize without establishing your bottleneck is going to be html parsing. If you're worried about performance, why not use lxml? Module imports are hardly ever the bottleneck, and it sounds like you're shooting yourself in the foot here.
That said, this will technically do what you want, but it seriously is not more performant than using an HTML parser like lxml in the long run. Explicitly avoiding an HTML parser will also probably drastically increase your development time as you figure out obscure string manipulation snippets rather than just using the nice tree structure that you get for free with HTML.
strcleaner = lambda x : x.replace('\n', '').replace(' ', '').replace('\t', '')
S = strcleaner(htmltext)
S.split(strcleaner('<a class="post-list-subject" href="'))[1].split(strcleaner('"><div class="thread-link-outer-wrapper">'))[0]
The problem with the code you posted is that whitespace and newlines are characters too.
see: http://diveintopython.net/native_data_types/lists.html#d0e5623
I have a website with code examples on it, generated through docutils, and the CSS is always not quite right.
I would like to know if there is
best practise CSS for displaying code (ie can it handle wrap arounds, long lines, any chance of getting colourisation)
best practise for the little numerical callouts (see diveintopython above)
and finally, I am wondering if there is (open) CSS that is designed to work with docutils HTML output and actually look "nice". I would be happy to contribute some CSS that makes tables look "microsoft professional grey" and so forth.
You can't do syntax highlighting with CSS alone. You need the various parts of the code to be marked up; you can do that on the server if you are using dynamic pages, or you can use JavaScript on the client. Here is a comparison of a few JavaScript syntax highlighters.
The circled numbers are images in the site you linked, but I would use Unicode instead: ❶➋➌➍➎➏➐➑➒➓
I keep getting mismatched tag errors all over the place. I'm not sure why exactly, it's the text on craigslist homepage which looks fine to me, but I haven't skimmed it thoroughly enough. Is there perhaps something more forgiving I could use or is this my best bet for html parsing with the standard library?
The mismatched tag errors are likely caused by mismatched tags. Browsers are famous for accepting sloppy html, and have made it easy for web page coders to write badly formed html, so there's a lot of it. THere's no reason to believe that creagslist should be immune to bad web page designers.
You need to use a grammar that allows for these mismatches. If the parser you are using won't let you redefine the grammar appropriately, you are stuck. (There may be a better Python library for this, but I don't know it).
One alternative is to run the web page through a tool like Tidy that cleans up such mismatches, and then run your parser on that.
The best library for parsing unpredictable HTML is BeautifulSoup. Here's a quote from the project page:
You didn't write that awful page.
You're just trying to get some data
out of it. Right now, you don't really
care what HTML is supposed to look
like.
Neither does this parser.
However it isn't well-supported for Python 3, there's more information about this at the end of the link.
Parsing HTML is not an easy problem, using libraries are definitely the solution here. The two common libraries for parsing HTML that isn't well formed are BeautifulSup and lxml.
lxml supports Python 3, and it's HTML parser handles unpredictable HTML well. It's awesome and fast as well as it uses c-libraries in the bottom. I highly recommend it.
BeautifulSoup 3.1 supports Python 3, but is also deemed a failed experiment" and you are told not to use it, so in practice BeautifulSoup doesn't support Python 3 yet, leaving lxml as the only alternative.
I need to let users enter Markdown content to my web app, which has a Python back end. I don’t want to needlessly restrict their entries (e.g. by not allowing any HTML, which goes against the spirit and spec of Markdown), but obviously I need to prevent cross-site scripting (XSS) attacks.
I can’t be the first one with this problem, but didn’t see any SO questions with all the keywords “python,” “Markdown,” and “XSS”, so here goes.
What’s a best-practice way to process Markdown and prevent XSS attacks using Python libraries? (Bonus points for supporting PHP Markdown Extra syntax.)
I was unable to determine “best practice,” but generally you have three choices when accepting Markdown input:
Allow HTML within Markdown content (this is how Markdown originally/officially works, but if treated naïvely, this can invite XSS attacks).
Just treat any HTML as plain text, essentially letting your Markdown processor escape the user’s input. Thus <small>…</small> in input will not create small text but rather the literal text “<small>…</small>”.
Throw out all HTML tags within Markdown. This is pretty user-hostile and may choke on text like <3 depending on implementation. This is the approach taken here on Stack Overflow.
My question regards case #1, specifically.
Given that, what worked well for me is sending user input through
Markdown for Python, which optionally supports Extra syntax and then through
html5lib’s sanitizer.
I threw a bunch of XSS attack attempts at this combination, and all failed (hurray!); but using benign tags like <strong> worked flawlessly.
This way, you are in effect going with option #1 (as desired) except for potentially dangerous or malformed HTML snippets, which are treated as in option #2.
(Thanks to Y.H Wong for pointing me in the direction of that Markdown library!)
Markdown in Python is probably what you are looking for. It seems to cover a lot of your requested extensions too.
To prevent XSS attacks, the preferred way to do it is exactly the same as other languages - you escape the user output when rendered back. I just took a peek at the documentation and the source code. Markdown seems to be able to do it right out of the box with some trivial config tweaks.