pyPandoc md to html conversion loses code-block style - python

I'm trying to convert a string with Markdown formatting into HTML:
import pypandoc
text = """
# To be approved
This is a markdown editor, Type here your article body and use the tools or markdown code to style it.
If you need help or you want to know more about markdown, click on the **light bulb** icon in the bottom left of this form.
You can preview your `article ` by clicking on the icons in the bottom right of this form.
**Click here to begin writing**
\```js
var UID = loadUID();
if (UID != false){
var create_article_btn = window.parent.document.getElementById('create_article_btn');
create_article_btn.style.display = 'block';
}
\```
"""
text = pypandoc.convert_text(text,'html',format='md')
text = text.replace('"',"'")
text = text.replace('\n','')
It all works fine except for code blocks and inline code, which are displayed oddly.
The HTML generated by pypandoc is:
<h1 id="to-be-approved">
To be approved
</h1>
<p>
<strong>
Please
</strong>
, begin
<em>
your
</em>
article with a title like this:
</p>
<p>
This is a markdown editor, Type here your article body and use the tools or markdown code to style it. If you need help or you want to know more about markdown, click on the
<strong>
light bulb
</strong>
icon in the bottom left of this form. You can preview your
<code>
article
</code>
by clicking on the icons in the bottom right of this form.
</p>
<p>
<strong>
Click here to begin writing
</strong>
</p>
<div class="sourceCode" id="cb1">
<pre class="sourceCode js"><code class="sourceCode javascript"><span id="cb1-1">
<span class="kw">var</span> UID <span class="op">=</span> loadUID()
<span class="op">;</span></span><span id="cb1-2"><span
class="cf">if</span> (UID <span class="op">!=</span> <span class="kw">false</span>)
{</span><span id="cb1-3"> <span class="kw">var</span> create_article_btn
<span class="op">=</span> <span class="bu">window</span><span class="op">.
</span><span class="at">parent</span><span class="op">.</span><span class="at">document</span>
<span class="op">.</span><span class="fu">getElementById</span>(<span
class="st">'create_article_btn'</span>)<span class="op">;</span></span>
<span id="cb1-4"> create_article_btn<span class="op">.
</span><span class="at">style</span><span class="op">.</span><span class="at">display
</span> <span class="op">=</span> <span class="st">'block'</span><span class="op">;
</span></span><span id="cb1-5">}</span></code></pre>
</div>
Is there something I'm missing in the pypandoc conversion? How do I style the code block with syntax highlighting and proper indentation?
Judging by the presence of classes such as sourceCode, it seems that there should be a stylesheet associated with them.

I got this sorted in a very simple way: I downloaded a CSS file written for Pandoc output from GitHub: https://gist.github.com/forivall/7d5a304a8c3c809f0ba96884a7cf9d7e
Then, since I'm using the srcdoc property of an iframe to populate the HTML, I add the stylesheet link to the srcdoc before the parsed HTML:
var article_frame = document.getElementById('article_frame');
// add all the styles here (also pandoc.css)
var temp_frame = '<link rel="stylesheet" type="text/css" href="../static/styles/main.css"><link rel="stylesheet" type="text/css" href="../static/styles/read_article.css"><link href="https://fonts.googleapis.com/css?family=Noto+Serif:400,400i,700,700i&display=swap" rel="stylesheet"><link rel="stylesheet" type="text/css" href="../static/styles/pandoc.css">';
temp_frame += //article parsed with pyPandoc...
article_frame.srcdoc = temp_frame;
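A possible contributor to the lost indentation: the question strips every newline from the HTML, and inside a <pre> element those newlines are the line breaks of the code block. The following is a minimal sketch, assuming the HTML ends up inside a JavaScript string (the question does not show that step). It escapes the markup with json.dumps instead of deleting characters; to_js_literal is a hypothetical helper name.

```python
import json


def to_js_literal(html: str) -> str:
    """Return html as a double-quoted JavaScript string literal."""
    # json.dumps escapes quotes and newlines instead of dropping them.
    # Rewriting "</" as "<\/" keeps "</script>" from closing an inline
    # <script> element early; "\/" is still a valid escape for "/".
    return json.dumps(html).replace("</", "<\\/")


snippet = '<pre class="sourceCode js"><code>if (UID != false){\n    show();\n}</code></pre>'
literal = to_js_literal(snippet)
print(literal)
```

The literal can then be written into the page as the right-hand side of an assignment, and the browser sees the original newlines again when it evaluates the string.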
Also note that with the CSS I linked, the code highlighting wasn't working at first. Pandoc wraps every line of the code block in its own <span id="cb1-1"> element, so the token spans are not direct children of <code> and the child combinator > never matches. Removing the > in lines 709-737 makes it work. Before:
code > span.kw { color: #a71d5d; font-weight: normal; } /* Keyword */
code > span.dt { color: inherit; } /* DataType */
code > span.dv { color: #0086b3; } /* DecVal */
...
After:
code span.kw { color: #a71d5d; font-weight: normal; } /* Keyword */
code span.dt { color: inherit; } /* DataType */
code span.dv { color: #0086b3; } /* DecVal */
...
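Removing the > by hand works, but the same change can be applied mechanically. This is a minimal sketch, assuming the stylesheet text has already been read into a string; relax_token_selectors is a hypothetical helper, not part of pypandoc.

```python
import re

# Matches "code > span." with optional whitespace around the combinator.
CHILD_TOKEN_SELECTOR = re.compile(r"\bcode\s*>\s*span\.")


def relax_token_selectors(css: str) -> str:
    """Turn 'code > span.xx' child selectors into 'code span.xx' descendant selectors."""
    return CHILD_TOKEN_SELECTOR.sub("code span.", css)


before = (
    "code > span.kw { color: #a71d5d; font-weight: normal; } /* Keyword */\n"
    "code > span.dt { color: inherit; } /* DataType */\n"
    "code > span.dv { color: #0086b3; } /* DecVal */\n"
)
after = relax_token_selectors(before)
print(after)
```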

Related

How to get text from hr tag using BeautifulSoup?

This is an example of the HTML (I've tried to make it a lot neater than what it actually looks like):
<P>
random text
<br>
<br>
<i>Anonymous</i>
<span style="font-size: 10px; margin-left: 10px; color: #994;">Nov 30 12:46pm</span>
<span style="font-size: 10px; margin-left: 20px;">
<a style="color: #888; text-decoration: none;" title="Flag as offensive post"
href="/flag?a=248830&r=1">FLAG
</a>
</span>
<hr> **THIS IS THE TEXT I NEED**
<br>
<br>
<i>Anonymous</i>
<span style="font-size: 10px; margin-left: 10px; color: #994;">Nov 30 3:40pm</span>
<span style="font-size: 10px; margin-left: 20px;">
<a style="color: #888; text-decoration: none;" title="Flag as offensive post"
href="/flag?a=248830&r=2">FLAG
</a>
</span>
<hr>**THIS IS THE TEXT I NEED**
<br>
<br>
<script type="text/javascript">
<script type="text/javascript" src="//cdn.chitika.net/getads.js" async></script>
**THIS IS THE TEXT I NEED**
<br>
<br>
<i>Anonymous</i>
I'm trying to get the text from the hr tag. However, doing
for i in soup.find_all('hr'):
    print(i.text)
does not work; I get blank output instead.
I've also tried
soup.find('i').previousSibling
but that also outputs a blank. I'm not sure if that's because of the <br> <br> before it.
How can I get the **THIS IS THE TEXT I NEED**?
The text you need isn't in an <hr>; it's in a <p>. An <hr> is a void element and cannot contain any text of its own. So you can get it like this:
from bs4 import BeautifulSoup
soup = BeautifulSoup(doc, "html.parser")
ps = soup.findAll("p")
print(ps[0].getText())
Now considering that this prints:
random text
Anonymous
Nov 30 12:46pm
FLAG
**THIS IS THE TEXT I NEED**
Anonymous
Nov 30 3:40pm
FLAG
**THIS IS THE TEXT I NEED**
**THIS IS THE TEXT I NEED**
Anonymous
You'll need to parse out the text you need with something like:
import re
rawText = ps[0].getText()
# Non-greedy match, so two marked spans on the same line are not merged into one
matches = re.findall(r'\*\*.*?\*\*', rawText)
for m in matches:
    print(m)
Which prints out:
**THIS IS THE TEXT I NEED**
**THIS IS THE TEXT I NEED**
**THIS IS THE TEXT I NEED**
But you'll need to fish out your text some other way, because I doubt it is really surrounded by asterisks. Edit: As a side note, you can use soup.find instead of soup.findAll, but I don't think that really matters.
You could try just accessing the next element:
for hr in soup.find_all('hr'):
    print(hr.next_element.get_text(strip=True))
For your HTML this displays:
**THIS IS THE TEXT I NEED**
**THIS IS THE TEXT I NEED**
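The "take whatever follows each <hr>" idea can also be sketched without BeautifulSoup. The example below swaps in the standard library's html.parser, so it is an alternative technique rather than the answerers' code. TextAfterHr and the shortened sample markup are hypothetical. Like the answer above, it only finds text that directly follows an <hr>, so the third snippet in the question (which follows a <script>) is not captured.

```python
from html.parser import HTMLParser


class TextAfterHr(HTMLParser):
    """Collect the first non-blank text node that directly follows each <hr>."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._armed = False  # True from an <hr> until text or another tag is seen

    def handle_starttag(self, tag, attrs):
        # Arm on <hr>; any other opening tag means the text is not directly after it.
        self._armed = (tag == "hr")

    def handle_endtag(self, tag):
        # A self-closing <hr/> also reports an end tag; that must not disarm us.
        if tag != "hr":
            self._armed = False

    def handle_data(self, data):
        text = data.strip()
        if self._armed and text:
            self.results.append(text)
            self._armed = False


sample = """<p>random text<br><i>Anonymous</i>
<hr> **THIS IS THE TEXT I NEED**
<br><i>Anonymous</i>
<hr>second reply
<br></p>"""

parser = TextAfterHr()
parser.feed(sample)
parser.close()
print(parser.results)
```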

Detecting white text on HTML file

I have an HTML file like this:
<HTML>
<HEAD>
<style>
.secret {
background-color: black;
color: black;
}
</style>
</HEAD>
<BODY>
<p>This text is VISIBLE</p>
<p id="hidden-1" style="color: white;">This text is hidden (white text background)</p>
<p id="hidden-2" class="secret">This text is hidden (black text/background)</p>
</BODY>
</HTML>
I want to write a small Python application that takes an HTML file as input and detects the HTML elements that use this trick. In the case above, the output should be "hidden-1" and "hidden-2".
In addition to the example above, there are many more ways to hide text in HTML. I'm looking for the solution with the highest success rate.
Is this possible?
Thanks
A general solution could be to use bs4 to extract all the ids and their text from the HTML. Then use imgkit to convert the .html to a .png, read the visible text from it with an OCR tool such as pytesseract, and diff the two to find the "hidden" text.
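The final "diff" step of that approach might look like the sketch below. It uses only the standard library and assumes the rendering and OCR have already happened, so visible_text is a hard-coded stand-in for the OCR output. IdTextCollector and find_hidden_ids are hypothetical names, and the sample adds a visible-1 id that is not in the question's HTML. Real OCR output is rarely exact, so a fuzzy comparison would probably be needed in practice.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class IdTextCollector(HTMLParser):
    """Map each element id to the text found directly inside that element."""

    def __init__(self):
        super().__init__()
        self.texts = {}    # id -> text
        self._stack = []   # id (or None) of every currently open element

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self._stack.append(dict(attrs).get("id"))

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        current = self._stack[-1] if self._stack else None
        if current is not None and data.strip():
            joined = self.texts.get(current, "") + " " + data.strip()
            self.texts[current] = joined.strip()


def normalize(text):
    """Lower-case and collapse whitespace so small layout differences are ignored."""
    return " ".join(text.lower().split())


def find_hidden_ids(html, visible_text):
    """Return ids whose text does not appear in the text that was visibly rendered."""
    collector = IdTextCollector()
    collector.feed(html)
    collector.close()
    seen = normalize(visible_text)
    return [eid for eid, text in collector.texts.items() if normalize(text) not in seen]


sample_html = """<html><head><style>.secret { background-color: black; color: black; }</style></head>
<body>
<p id="visible-1">This text is VISIBLE</p>
<p id="hidden-1" style="color: white;">This text is hidden (white text/background)</p>
<p id="hidden-2" class="secret">This text is hidden (black text/background)</p>
</body></html>"""

visible_text = "This text is VISIBLE"  # stand-in for what the OCR step would return
hidden = find_hidden_ids(sample_html, visible_text)
print(hidden)
```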

Position absolute for <img/> when using xhtml2pdf (Django)?

Can I generate a PDF with CSS position: absolute; applied to an <img src="..."/> tag?
I need to place a handwritten signature and a company stamp (PNG files) at the bottom of an order voucher in a non-standard position, so that they slightly overlap the goods table. Absolute positioning would save me time here, but it doesn't work.
EDIT:
I got an answer from the xhtml2pdf GitHub repo:
Well absolute position is not supported right now, but if you are looking for how to set images in specific part of page in all pages, see frames.
So my question still stands. A real usage example with xhtml2pdf frames for images would be great.
And here is a real usage example from Luis Zárate (xhtml2pdf collaborator):
<html>
<head>
<style>
@page {
size: a4 portrait;
@frame content_frame { /* Content Frame */
left: 50pt; width: 512pt; top: 90pt; height: 632pt;
}
@frame footer_frame { /* Another static Frame */
-pdf-frame-content: footer_content;
left: 450pt; width: 300pt; top: 672pt; height: 200pt;
}
}
</style>
</head>
<body>
<!-- Content for Static Frame 'footer_frame' -->
<div id="footer_content">
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/3/35/Tux.svg/123px-Tux.svg.png?download">
</div>
{% lorem 10 p %}
<pdf:pdf-next-page />
{% lorem 10 p %}
</body>
</html>
Code generates this PDF file: https://github.com/xhtml2pdf/xhtml2pdf/files/1754033/report-7.pdf
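Because a static frame is positioned with left, top, width and height in points, placing a signature or stamp comes down to emitting one @frame rule per image. Below is a minimal sketch of generating that CSS from Python. static_frame_css and page_css are hypothetical helpers, and the names and coordinates for the signature frame are illustrative values, not taken from a real voucher.

```python
def static_frame_css(frame_name, content_id, left, top, width, height):
    """Return an @frame rule for a static frame; all measurements are in points."""
    return (
        f"@frame {frame_name} {{\n"
        f"    -pdf-frame-content: {content_id};\n"
        f"    left: {left}pt; width: {width}pt; top: {top}pt; height: {height}pt;\n"
        f"}}"
    )


def page_css(frame_rules, size="a4 portrait"):
    """Wrap the given @frame rules in a single @page block."""
    body = "\n".join(frame_rules)
    return f"@page {{\n    size: {size};\n{body}\n}}"


css = page_css([
    # Main flowing content, same box as in the example above.
    "@frame content_frame { left: 50pt; width: 512pt; top: 90pt; height: 632pt; }",
    # Static frame filled from <div id="signature_content"> (illustrative position).
    static_frame_css("signature_frame", "signature_content", left=380, top=640, width=150, height=80),
])
print(css)
```

The element whose id matches the -pdf-frame-content value then holds the <img>, as footer_content does in the example above.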

How can I use UIkit to fix the navbar to the top?

I'm using the UIkit CSS framework. How can I fix the navbar to the top of the page?
You need to include the 'components/sticky.js' file from the UIkit package and add the data-uk-sticky directive to your fixed bar div.
For example:
<div data-uk-sticky>...</div>
I suggest using it with the navbar component, for example:
<nav id='top-bar' class="uk-navbar" data-uk-sticky>
...
<ul class="uk-navbar-nav">...</ul>
<div class="uk-navbar-content">...</div>
<div class="uk-navbar-content uk-navbar-center">...</div>
</nav>
Add this to your css file:
.uk-fixed-navigation {
position: fixed;
right: 0;
left: 0;
top: 0;
z-index: 1030;
}
Then add the uk-fixed-navigation class to your navbar.
EDIT
You can also use their sticky component.

Printing dynamic django view template

I'm working on a django app. I have a page that displays a log of items, and each item has a "Print label" link. At the moment, clicking the link displays the label for that particular item in a popup screen, but does not send the label to a printer. The view function behind the "Print label" link is shown below:
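from django.contrib.auth.decorators import login_required
from django.shortcuts import render_to_response
from django.template import RequestContext
@login_required
def print_label(request, id):
    s = Item.objects.get(pk=id)
    return render_to_response('templates/label.html', {'s': s}, context_instance=RequestContext(request))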
The HTML for the label is shown below:
{% load humanize %}
<head>
<style type="text/css">
div{
min-width: 350px;
max-width: 350px;
text-align: center;
}
body{
font-family: Arial;
width: 370px;
height: 560px;
text-align: center;
}
</style>
</head>
<body>
<div id="labelHeader">
<img src="{{ STATIC_URL }}img/label-header.png" width="350px">
</div>
<hr/>
<p></p>
<div id="destinationAddress">
<span style="font-size: xx-large; font-weight: bold;">{{ s.item_number }}</span>
</p>
DESTINATION:
<br/>
<strong>{{s.full_name}}</strong><br/>
<strong>{{ s.address }}</strong><br/>
<strong>{{s.city}}, {{s.state}}</strong><br/>
<strong>Tel: {{s.telephone}}</strong>
</div>
<p></p>
<hr/>
<div id="labelfooter">
<img src="{{ STATIC_URL }}img/label-footer.png" width="350px">
</div>
</body>
My question is, how can I also send the label displayed to a printer in the same function? I researched and found some libraries (like xhtml2pdf, webkit2png, pdfcrowd, etc), but they'll create a pdf or image file of the label and I'll have to send it to a printer. Is it possible to send straight to a printer without creating a pdf copy of the label? If so, please show me how to achieve this.
Your answers and suggestions are highly welcome. Thank you.
Presumably, as this is a Django app, it's the client's printer that you need to use. The server cannot reach that printer, so the only way to do this is to tell the user's browser to print. You will need to use JavaScript for this: window.print().
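One way to trigger that without editing the template by hand is sketched below, under the assumption that the label has first been rendered to an HTML string (for example with Django's render_to_string). add_auto_print is a hypothetical helper, not a Django API.

```python
PRINT_SCRIPT = (
    "<script>window.addEventListener('load', function () { window.print(); });</script>"
)


def add_auto_print(html: str) -> str:
    """Insert a script that opens the browser's print dialog once the page has loaded."""
    # Look for the closing body tag case-insensitively; append if there is none.
    index = html.lower().rfind("</body>")
    if index == -1:
        return html + PRINT_SCRIPT
    return html[:index] + PRINT_SCRIPT + html[index:]


label_html = "<head></head><body><div id='labelHeader'>...</div></body>"
printable = add_auto_print(label_html)
print(printable)
```

The browser still shows its own print dialog; a web page cannot send a job to the printer silently.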
