Detecting white text on HTML file

I have an HTML file like this:
<HTML>
<HEAD>
<style>
.secret {
background-color: black;
color: black;
}
</style>
</HEAD>
<BODY>
<p>This text is VISIBLE</p>
<p id="hidden-1" style="color: white;">This text is hidden (white text background)</p>
<p id="hidden-2" class="secret">This text is hidden (black text/background)</p>
</BODY>
</HTML>
I want to write a small Python application that takes an HTML file as input and detects the HTML elements that use this trick. In the case above, the output should be "hidden-1" and "hidden-2".
In addition to the example above, there are many more ways to hide text in HTML. I'm looking for the solution with the highest success rate.
Is this possible?
Thanks

A general solution could be to use bs4 to extract all the ids and their text from the HTML, then use imgkit to render the .html to a .png, read the visible text from that image with an OCR tool such as pytesseract, and diff the two to find the "hidden" text.
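The diff step of that idea can be sketched as follows. This is only a rough illustration under some assumptions: it uses the standard library's html.parser instead of bs4 so that it runs without extra installs, it only looks at elements that carry an id, and the names IdTextCollector and find_hidden_ids are made up for the example. The rendering and OCR step is not shown; visible_text stands for whatever string the OCR returns for the rendered page.

```python
import re
from html.parser import HTMLParser

# Tags that never get a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class IdTextCollector(HTMLParser):
    """Collect the text found directly inside every element that has an id."""

    def __init__(self):
        super().__init__()
        self._stack = []      # (tag, id) for each currently open element
        self.text_by_id = {}  # id -> text found directly inside that element

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self._stack.append((tag, dict(attrs).get("id")))

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1][0] == tag:
            self._stack.pop()

    def handle_data(self, data):
        if not data.strip() or not self._stack:
            return
        tag, elem_id = self._stack[-1]
        if elem_id and tag not in ("style", "script"):
            self.text_by_id[elem_id] = self.text_by_id.get(elem_id, "") + data


def normalize(text):
    """Lower-case and collapse whitespace so OCR line breaks do not matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def find_hidden_ids(html, visible_text):
    """Return ids whose text is in the markup but not in the OCR output."""
    collector = IdTextCollector()
    collector.feed(html)
    collector.close()
    seen = normalize(visible_text)
    return [elem_id for elem_id, text in collector.text_by_id.items()
            if normalize(text) not in seen]
```

OCR output is rarely exact, so in practice the substring check would probably need to be replaced by a fuzzy comparison to avoid false positives.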

Related

How to download local html file in streamlit application?

I wrote a program that generates an HTML report. First, my program performs the calculations and saves them in a DataFrame. The next step is to add some text before the DF, format the DF, and save the finished HTML report. The code looks like this:
html = f'''
<html>
<head>
<title>Some title</title>
</head>
<style type ="text/css">
Here I have some table formatting like:
table {{
font-family: Arial, Helvetica, sans-serif;
width: 100%;
text-align: center;
border-collapse: collapse;
}}
</style>
<body>
Here I write some text before DF like:
<h1 style="text-align: center;">Some Text</h1>
Next, I use my DF:
{DF.to_html()}
</body>
</html>
'''
The next step is to save the finished HTML file:
with open('report.html', 'w') as f:
    f.write(html)
The last step is to download this HTML file with st.download_button:
st.download_button(label="download report",
                   data='report.html',
                   file_name="test_report.html")
When I open the downloaded test_report.html file, it contains only "test_report.html". When I open the file named "report.html", it looks as it should; that is my fully formatted report. I probably did something wrong when downloading "test_report.html", but I don't know what.
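A likely cause is that the data argument of st.download_button is the content to download, not a path, so a string such as 'report.html' is sent as the file body itself. A minimal sketch of one way around this, assuming the report was already written to report.html (read_report and add_download_button are names invented for this example):

```python
from pathlib import Path


def read_report(path):
    """Return the raw bytes of the saved report so they can be offered for download."""
    return Path(path).read_bytes()


def add_download_button(path="report.html"):
    # Imported here so read_report stays usable without Streamlit installed.
    import streamlit as st

    st.download_button(label="download report",
                       data=read_report(path),  # file contents, not the file name
                       file_name="test_report.html",
                       mime="text/html")
```

Calling add_download_button() from the Streamlit script should then offer the real report. Since the html string is already in memory, passing data=html directly would also work and skips the round trip through the file.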

Generate a specific designed PDF from basic HTML input

I want to generate a PDF with a specific background from three simple input-fields.
A title, a message and a signature as shown in the picture below.
Example of desired result
I have some experience with creating web-sites with Python Flask, but I struggle with how to tackle this challenge.

Create an h1 containing the title, an h2 containing the message, and a bottom text positioned with some CSS:
.bottom{
position:fixed;
bottom:0;
}
Set background-image on the body to use your own custom image.
It would look something like this:
<html>
<head>
<style>
.bottom{
position:fixed;
bottom:0;
}
</style>
</head>
<body style="background-image:url('mypicture.png'); text-align: center">
<h1>my title</h1>
<h2>my subtitle</h2>
<h2 class="bottom">bottom text</h2>
</body>
</html>
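To fill that page from the three input fields in Python, a sketch along these lines could work. The function name build_page and the default background file name are assumptions for the example. Turning the resulting HTML into a PDF still needs a separate converter (an HTML-to-PDF tool of your choice), which is not shown here.

```python
from html import escape
from string import Template

# Same markup as the answer above, with placeholders for the three inputs.
PAGE = Template("""<html>
<head>
<style>
.bottom {
  position: fixed;
  bottom: 0;
}
</style>
</head>
<body style="background-image:url('$background'); text-align: center">
<h1>$title</h1>
<h2>$message</h2>
<h2 class="bottom">$signature</h2>
</body>
</html>
""")


def build_page(title, message, signature, background="mypicture.png"):
    """Return the HTML page with user input escaped before insertion."""
    return PAGE.substitute(title=escape(title),
                           message=escape(message),
                           signature=escape(signature),
                           background=escape(background))
```

Escaping matters here because the title, message, and signature come from form fields; without it, a stray < in the input would break the markup.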

pyPandoc md to html conversion lose code-block style

I'm trying to convert a string with markdown formatting into HTML:
import pypandoc
text = """
# To be approved
This is a markdown editor, Type here your article body and use the tools or markdown code to style it.
If you need help or you want to know more about markdown, click on the **light bulb** icon in the bottom left of this form.
You can preview your `article ` by clicking on the icons in the bottom right of this form.
**Click here to begin writing**
```js
var UID = loadUID();
if (UID != false){
var create_article_btn = window.parent.document.getElementById('create_article_btn');
create_article_btn.style.display = 'block';
}
```
"""
text = pypandoc.convert_text(text,'html',format='md')
text = text.replace('"',"'")
text = text.replace('\n','')
It all works fine except for code blocks and inline code, which are displayed oddly.
The HTML generated by pypandoc is:
<h1 id="to-be-approved">
To be approved
</h1>
<p>
<strong>
Please
</strong>
, begin
<em>
your
</em>
article with a title like this:
</p>
<p>
This is a markdown editor, Type here your article body and use the tools or markdown code to style it. If you need help or you want to know more about markdown, click on the
<strong>
light bulb
</strong>
icon in the bottom left of this form. You can preview your
<code>
article
</code>
by clicking on the icons in the bottom right of this form.
</p>
<p>
<strong>
Click here to begin writing
</strong>
</p>
<div class="sourceCode" id="cb1">
<pre class="sourceCode js"><code class="sourceCode javascript"><span id="cb1-1">
<span class="kw">var</span> UID <span class="op">=</span> loadUID()
<span class="op">;</span></span><span id="cb1-2"><span
class="cf">if</span> (UID <span class="op">!=</span> <span class="kw">false</span>)
{</span><span id="cb1-3"> <span class="kw">var</span> create_article_btn
<span class="op">=</span> <span class="bu">window</span><span class="op">.
</span><span class="at">parent</span><span class="op">.</span><span class="at">document</span>
<span class="op">.</span><span class="fu">getElementById</span>(<span
class="st">'create_article_btn'</span>)<span class="op">;</span></span>
<span id="cb1-4"> create_article_btn<span class="op">.
</span><span class="at">style</span><span class="op">.</span><span class="at">display
</span> <span class="op">=</span> <span class="st">'block'</span><span class="op">;
</span></span><span id="cb1-5">}</span></code></pre>
</div>
Is there something I'm missing in the pypandoc conversion? How do I style the code block with syntax highlighting and proper indentation?
Judging by the presence of classes such as sourceCode, it seems that there should be a stylesheet associated with them.

I got this sorted in a very simple way: I downloaded a Pandoc-specific CSS file from GitHub: https://gist.github.com/forivall/7d5a304a8c3c809f0ba96884a7cf9d7e
Then, since I'm using the srcdoc property of an iframe to populate the HTML, I add the stylesheet links to the srcdoc before the parsed HTML:
var article_frame = document.getElementById('article_frame');
// add all the styles here (also pandoc.css)
var temp_frame = '<link rel="stylesheet" type="text/css" href="../static/styles/main.css"><link rel="stylesheet" type="text/css" href="../static/styles/read_article.css"><link href="https://fonts.googleapis.com/css?family=Noto+Serif:400,400i,700,700i&display=swap" rel="stylesheet"><link rel="stylesheet" type="text/css" href="../static/styles/pandoc.css">';
temp_frame += //article parsed with pyPandoc...
article_frame.srcdoc = temp_frame;
Also note that with the CSS file I linked, the code highlighting wasn't working. Removing the > in lines 709-737 fixed it. Before:
code > span.kw { color: #a71d5d; font-weight: normal; } /* Keyword */
code > span.dt { color: inherit; } /* DataType */
code > span.dv { color: #0086b3; } /* DecVal */
...
After:
code span.kw { color: #a71d5d; font-weight: normal; } /* Keyword */
code span.dt { color: inherit; } /* DataType */
code span.dv { color: #0086b3; } /* DecVal */
...
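The reason dropping the > helps can be seen in the generated HTML above: each token span (kw, op, ...) sits inside a per-line span such as cb1-1, so it is a grandchild of code rather than a direct child, and a child selector never matches it. If you would rather patch the downloaded stylesheet from Python than by hand, a small sketch (the function name is made up for this example):

```python
import re


def relax_child_selectors(css):
    """Rewrite 'code > span.x' rules as 'code span.x' so nested token spans match."""
    return re.sub(r"\bcode\s*>\s*span\.", "code span.", css)
```

Only selectors of the form code > span.something are touched; other rules that use > are left as they are.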

Extract css from a HTML page

I need to extract the CSS code from several HTML files, but I can't figure out how to solve the following two problems:
An HTML file might have more than one block containing CSS code.
In HTML, CSS is placed inside <style> tags, but so is other code. I only need the code coming from <style type="text/css">.
I looked into BeautifulSoup but haven't yet been able to figure out whether this is possible with that library or whether I need to write something myself.
Hopefully someone here can help me out.

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_code,'html.parser')
soup.find('style',{"type" : "text/css"})
I tried this code on the HTML below:
<html>
<head>
<style type="text/css">
body {background-color: powderblue;}
h1 {color: blue;}
p {color: red;}
</style>
</head>
<body>
<h1>This is a heading</h1>
<p>This is a paragraph1.</p>
<h4>This is a paragraph2.</h4>
<style>
h4 {color: red;}
</style>
And this was the output I got:
<style type="text/css">
body {background-color: powderblue;}
h1 {color: blue;}
p {color: red;}
</style>
You can see that I got only the style tag that has type="text/css".
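Note that find returns only the first match. To collect every matching block, the BeautifulSoup equivalent would be soup.find_all('style', {"type": "text/css"}). The sketch below does the same thing with the standard library's html.parser instead of BeautifulSoup, purely so it runs without extra installs; StyleExtractor and extract_css are names chosen for this example.

```python
from html.parser import HTMLParser


class StyleExtractor(HTMLParser):
    """Collect the contents of every <style> tag with the required type attribute."""

    def __init__(self, required_type="text/css"):
        super().__init__()
        self.required_type = required_type
        self.blocks = []        # one string per matching <style> block
        self._capturing = False

    def handle_starttag(self, tag, attrs):
        if tag == "style" and dict(attrs).get("type") == self.required_type:
            self._capturing = True
            self.blocks.append("")

    def handle_endtag(self, tag):
        if tag == "style":
            self._capturing = False

    def handle_data(self, data):
        if self._capturing:
            self.blocks[-1] += data


def extract_css(html_code):
    """Return the CSS text of all matching <style> blocks, in document order."""
    parser = StyleExtractor()
    parser.feed(html_code)
    parser.close()
    return [block.strip() for block in parser.blocks]
```

A <style> tag without the type attribute is skipped, which matches the requirement in the question.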

CSS Background Image doesn't load

I'm trying to load an image to my page and set it as my background, below is the css and html that I am working with. The CSS page seems to load, as my test font and background colors show up but for whatever reason my background image doesn't.
I am using Django as my web framework.
Appreciate the help!
HTML:
<!DOCTYPE html>
{% load staticfiles %}
<html lang="en">
<head>
<title>Test</title>
<meta charset="utf-8" />
<link rel="stylesheet" href="{% static 'personal/css/frontpagebackground.css' %}" type = "text/css"/>
</head>
<body>
<p>This is a test</p>
</body>
</html>
CSS:
body
{
background-image:url(personal/static/personal/img/home.jpg) no-repeat;
background-repeat:no-repeat;
background-size:100%;
background-color: green;
min-height: 100%;
}
p {
font-style: italic;
color: red;
}

background-image accepts only an image value, so appending no-repeat to it makes the whole declaration invalid and the browser ignores it. Move no-repeat into its own background-repeat: no-repeat; declaration:
background-image: url('image-url-here.png');
background-repeat: no-repeat;

I've had problems like this with Django, primarily because my {% static %} location was confusing me. I think this is a path issue rather than a CSS issue.
It helped me to use the dev console to check for a "Failed to load resource" error. It shows the path the browser is requesting, which is likely different from what you intended.
I.e. is the image located at
personal/static/personal/img/home.jpg
relative to your CSS file? You might just need to back up a directory, i.e.
../personal/static/personal/img/home.jpg
Either way, the console gives you the requested path, and you can use that to determine where the browser is looking for the image.
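Relative URLs inside a stylesheet are resolved against the URL of the stylesheet itself, not against the page or the project directory. The sketch below reproduces that resolution with urllib.parse.urljoin. The dev-server address and the /static/ prefix are assumptions, as is the idea that the image ends up served under /static/personal/img/.

```python
from urllib.parse import urljoin

# Assumed URL at which the browser fetched the stylesheet (hypothetical dev server).
CSS_URL = "http://localhost:8000/static/personal/css/frontpagebackground.css"


def resolve_from_css(relative_url, css_url=CSS_URL):
    """Return the absolute URL a browser would request for a url(...) in that CSS file."""
    return urljoin(css_url, relative_url)


# The path from the question is appended to the CSS directory:
print(resolve_from_css("personal/static/personal/img/home.jpg"))
# Stepping up one directory from css/ reaches the img/ folder instead:
print(resolve_from_css("../img/home.jpg"))
```

Comparing the first printed URL with the one in the console error should show where the extra path segments come from.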

If your CSS files are in a path such as /static/css/style.css
and the images are under a path such as /static/images/test.jpg
then a simple workaround is to place the test.jpg image directly in the folder with style.css, i.e. /static/css/test.jpg, and reference it in the CSS by file name only, without the static path, i.e. background-image:url('test.jpg');
