wxpython - create a wx.html.HtmlWindow() that has highlighted text - python

Using wxpython 4.1.0, Windows 10 x64, Python 3.7.7 x64...
What I want to achieve is pretty basic, but cannot figure it out from reading wxpython documentation and searching the internet.
I used Python's native difflib module to create an HTML difference report. If you open this HTML difference report using any popular browser (Chrome, Firefox, Edge, etc.) you get nice highlighting that distinguishes the differences that were found.
When I open this same file using wxPython's wx.html.HtmlWindow() widget, the nice visuals are not shown. The highlighting that clearly shows the differences is not displayed; instead, plain text is displayed, making it nearly impossible to find the differences.
wxPython is very complete in terms of functionality, and I assume there is a way to achieve this using the wx.html.HtmlWindow() widget or a similar widget in the wx.html API. I'm thinking the only way of achieving this may be to use the wx.html2 API.
Minimal code to reproduce my problem (not wrapped in classes, for simplicity):
DIFFLIB Code:
import difflib
import pathlib as path

file1 = path.Path('text_file1.txt')
file2 = path.Path('text_file2.txt')

with file1.open() as file_obj1:
    contents1 = file_obj1.readlines()
with file2.open() as file_obj2:
    contents2 = file_obj2.readlines()

html = difflib.HtmlDiff().make_file(contents1, contents2, file1.name, file2.name)

with open('report.html', 'w') as file_obj:
    file_obj.write(html)
GUI Code:
import wx
import wx.html

def html_window():
    frame = wx.Frame(parent=None)
    html = wx.html.HtmlWindow(frame)
    html.LoadFile('report.html')
    frame.Show()

app = wx.App(False)
html_window()
app.MainLoop()

Your problem lies with wx.html not supporting CSS.
wx.html2 does, but even that may struggle with the inline CSS, which is what you have in the report.html file.
Still, you could always try wx.html2.
Other than that, why not use wx.LaunchDefaultBrowser("path_to_file/report.html")?
One other option that springs to mind is to use:
difflib.HtmlDiff().make_table(contents1, contents2, file1.name, file2.name)
and then add your own CSS.
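If you do go the wx.html2 route, here is a minimal sketch (my own assumption-laden example, not code from the question): it loads the report.html written by the difflib snippet above into a wx.html2.WebView, which is backed by a real browser engine and so renders the report's CSS:
import pathlib
import wx
import wx.html2

def webview_window():
    frame = wx.Frame(parent=None, title='Diff report')
    browser = wx.html2.WebView.New(frame)
    # WebView expects a URL, so turn the local path into a file:// URI.
    report_uri = pathlib.Path('report.html').resolve().as_uri()
    browser.LoadURL(report_uri)
    frame.Show()

app = wx.App(False)
webview_window()
app.MainLoop()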

Related

markdown2 module - how do I convert a markdown file from within a Python script?

I am using the markdown2 module (found here: https://github.com/trentm/python-markdown2). The module is great, but the documentation is lacking.
I attempted something simple, like converting a file using a Python script; I wanted to then use this to convert a batch of files. But even the first part is hitting a dead end. My code:
from markdown2 import Markdown
markdowner = Markdown()
markdowner.convert("file.md", "file.htm")
Could someone show me how?
NVM, I got the answer (and not from the repository mentioned above). Here's a simple Pythonic two-liner:
from markdown2 import Markdown
with open("file.htm", 'w') as output_file: output_file.write(Markdown().convert(open("file.md").read()))

How to embed HTML into IPython output?

Is it possible to embed rendered HTML output into IPython output?
One way is to use
from IPython.core.display import HTML
HTML('link')
or (the IPython multiline cell magic)
%%html
link
which returns a formatted link, but:
This link doesn't open a browser with the webpage itself from the console. IPython notebooks do render it, though.
I'm unaware of how to render an HTML() object within, say, a list or a printed pandas table. You can do df.to_html(), but without making links inside cells.
This output isn't interactive in the PyCharm Python console (because it's not Qt).
How can I overcome these shortcomings and make IPython output a bit more interactive?
This seems to work for me:
from IPython.core.display import display, HTML
display(HTML('<h1>Hello, world!</h1>'))
The trick is to wrap it in display as well.
Source: http://python.6.x6.nabble.com/Printing-HTML-within-IPython-Notebook-IPython-specific-prettyprint-tp5016624p5016631.html
Edit: use
from IPython.display import display, HTML
in order to avoid:
DeprecationWarning: Importing display from IPython.core.display is deprecated since IPython 7.14, please import from IPython.display
Some time ago Jupyter Notebooks started stripping JavaScript from HTML content [#3118]. Here are two solutions:
Serving Local HTML
If you want to embed an HTML page with JavaScript on your page now, the easiest thing to do is to save your HTML file to the directory with your notebook and then load the HTML as follows:
from IPython.display import IFrame
IFrame(src='./nice.html', width=700, height=600)
Serving Remote HTML
If you prefer a hosted solution, you can upload your HTML page to an Amazon Web Services S3 bucket, change the settings on that bucket so as to make it host a static website, then use an IFrame component in your notebook:
from IPython.display import IFrame
IFrame(src='https://s3.amazonaws.com/duhaime/blog/visualizations/isolation-forests.html', width=700, height=600)
This will render your HTML content and JavaScript in an iframe, just like you can on any other web page:
<iframe src='https://s3.amazonaws.com/duhaime/blog/visualizations/isolation-forests.html' width='700' height='600'></iframe>
Related: when writing a class, a def _repr_html_(self): ... method can be used to create a custom HTML representation of its instances:
class Foo:
    def _repr_html_(self):
        return "Hello <b>World</b>!"

o = Foo()
o
will render as:
Hello World!
For more info refer to IPython's docs.
An advanced example:
from html import escape  # Python 3 only :-)

class Todo:
    def __init__(self):
        self.items = []

    def add(self, text, completed):
        self.items.append({'text': text, 'completed': completed})

    def _repr_html_(self):
        return "<ol>{}</ol>".format("".join("<li>{} {}</li>".format(
            "☑" if item['completed'] else "☐",
            escape(item['text'])
        ) for item in self.items))

my_todo = Todo()
my_todo.add("Buy milk", False)
my_todo.add("Do homework", False)
my_todo.add("Play video games", True)
my_todo
Will render:
☐ Buy milk
☐ Do homework
☑ Play video games
Expanding on #Harmon above: it looks like you can combine display and print statements if you need to. Or maybe it's easier to format your entire HTML as one string and then use a single display call. Either way, it's a nice feature.
display(HTML('<h1>Hello, world!</h1>'))
print("Here's a link:")
display(HTML("<a href='http://www.google.com' target='_blank'>www.google.com</a>"))
print("some more printed text ...")
display(HTML('<p>Paragraph text here ...</p>'))
Outputs something like this:
Hello, world!
Here's a link:
www.google.com
some more printed text ...
Paragraph text here ...
First, the code:
from random import choices

def random_name(length=6):
    return "".join(choices("abcdefghijklmnopqrstuvwxyz", k=length))

# ---
from IPython.display import IFrame, display, HTML
import tempfile
from os import unlink

def display_html_to_iframe(html, width=600, height=600):
    name = f"temp_{random_name()}.html"
    with open(name, "w") as f:
        print(html, file=f)
    display(IFrame(name, width, height), metadata=dict(isolated=True))
    # unlink(name)

def display_html_inline(html):
    display(HTML(html, metadata=dict(isolated=True)))

h = "<html><b>Hello</b></html>"
display_html_to_iframe(h)
display_html_inline(h)
Some quick notes:
You can generally just use inline HTML for simple items. If you are rendering a framework, like a large JavaScript visualization framework, you may need to use an IFrame. It's hard enough for Jupyter to run in a browser without random HTML embedded.
The strange parameter metadata=dict(isolated=True) does not isolate the result in an IFrame, as older documentation suggests. It appears to prevent the clear-fix from resetting everything. The flag is no longer documented; I just found that using it allowed certain display: grid styles to render correctly.
This IFrame solution writes to a temporary file. You could use a data URI as described here, but it makes debugging your output difficult. The Jupyter IFrame function does not take a data or srcdoc attribute.
Files created with the tempfile module are not shareable with another process, hence the random_name().
If you use the HTML class with an IFrame in it, you get a warning. This may be only once per session.
You can use HTML('Hello, <b>world</b>') at the top level of a cell and its return value will render. Within a function, use display(HTML(...)) as is done above. This also allows you to mix display and print calls freely.
Oddly, IFrames are indented slightly more than inline HTML.
To do this in a loop, you can do:
display(HTML("".join([f"<a href='{url}'>{url}</a></br>" for url in urls])))
This essentially builds the HTML text in a loop and then uses the display(HTML()) construct to display the whole string as HTML.

Script to pull html and completely de-relativise it. (single file offline)

I am trying to learn Python and also create a web utility. One task I am trying to accomplish is creating a single HTML file which can be run locally but link to everything it needs to look like the original web page. (If you are going to ask why I want this, it's because it may act as part of a utility I am creating or, if not, just for education.) So I have two questions, a theoretical one and a practical one:
1) Is this possible for visual (as opposed to functional) purposes? Can an HTML page work offline while linking to everything it needs online? Or is there something fundamental about having the HTML file itself served from the web server that makes this impossible? How far can I go with it?
2) I have started a Python script which de-relativises (made that one up) linked elements on an HTML page, but I am a noob, so most likely I missed some elements or attributes which would also link to outside resources. After trying a few pages, I have noticed that the one in the code below does not work properly; there appears to be a .js file which is not linking correctly (the first of many problems to come). Assuming the answer to my first question was at least a partial yes, can anyone help me fix the code for this website?
Thank you.
Update: I missed the script tag on this, but even after I added it, it still does not work correctly.
import lxml
import sys
from lxml import etree
from StringIO import StringIO
from lxml.html import fromstring, tostring
import urllib2
from urlparse import urljoin

site = "www.script-tutorials.com/advance-php-login-system-tutorial/"
output_filename = "output.html"

def download(site):
    response = urllib2.urlopen("http://" + site)
    html_input = response.read()
    return html_input

def derealitivise(site, html_input):
    active_html = lxml.html.fromstring(html_input)
    for element in tags_to_derealitivise:
        for tag in active_html.xpath(str(element + "[@" + "src" + "]")):
            tag.attrib["src"] = urljoin("http://" + site, tag.attrib.get("src"))
        for tag in active_html.xpath(str(element + "[@" + "href" + "]")):
            tag.attrib["href"] = urljoin("http://" + site, tag.attrib.get("href"))
    return lxml.html.tostring(active_html)

active_html = ""
tags_to_derealitivise = ("//img", "//a", "//link", "//embed", "//audio", "//video", "//script")

print "downloading..."
active_html = download(site)
active_html = derealitivise(site, active_html)
print "writing file..."
output_file = open(output_filename, "w")
output_file.write(active_html)
output_file.close()
Furthermore, I could make the code more thorough by checking all of the elements...
It would look kind of like this, but I do not know the exact way to iterate through all of the elements. This is a separate problem, and I will most likely figure it out by the time anyone responds...:
def derealitivise(site, html_input):
    active_html = lxml.html.fromstring(html_input)
    for element in active_html.xpath:  # not sure yet how to iterate over every element here
        for tag in active_html.xpath(str(element + "[@" + "src" + "]")):
            tag.attrib["src"] = urljoin("http://" + site, tag.attrib.get("src"))
        for tag in active_html.xpath(str(element + "[@" + "href" + "]")):
            tag.attrib["href"] = urljoin("http://" + site, tag.attrib.get("href"))
    return lxml.html.tostring(active_html)
Update:
Thanks to Burhan Khalid's solution, which seemed too simple to be viable at first glance, I got it working. The code is so simple that most of you will most likely not require it, but I will post it anyway in case it helps:
import lxml
import sys
from lxml import etree
from StringIO import StringIO
from lxml.html import fromstring, tostring
import urllib2
from urlparse import urljoin

site = "www.script-tutorials.com/advance-php-login-system-tutorial/"
output_filename = "output.html"

def download(site):
    response = urllib2.urlopen("http://" + site)
    html_input = response.read()
    return html_input

def derealitivise(site, html_input):
    active_html = html_input.replace('<head>', '<head> <base href="http://' + site + '">')
    return active_html

active_html = ""
print "downloading..."
active_html = download(site)
active_html = derealitivise(site, active_html)
print "writing file..."
output_file = open(output_filename, "w")
output_file.write(active_html)
output_file.close()
Despite all of this, and its great simplicity, the .js object running on the website I have listed in the script still will not load correctly. Does anyone know if this is possible to fix?
while I am trying to make only the HTML file offline, while using the linked resources over the web.
This is a two step process:
Copy the HTML file and save it to your local directory.
Add a BASE tag in the HEAD section, and point the href attribute of it to the absolute URL.
Since you want to learn how to do it yourself, I will leave it at that.
#Burhan has an easy answer using a <base href="..."> tag in the <head>, and it works, as you have found out. I ran the script you posted, and the page downloaded fine. As you noticed, some of the JavaScript now fails. This can be for multiple reasons.
If you are opening the HTML file as a local file:/// URL, the page may not work. Many browsers heavily sandbox local HTML files, not allowing them to perform network requests or examine local files (see the local-server sketch after this answer).
The page may perform XMLHttpRequests or other network operations to the remote site, which will be denied for cross-domain scripting reasons. Looking in the JS console, I see the following errors for the script you posted:
XMLHttpRequest cannot load http://www.script-tutorials.com/menus.php?give=menu. Origin http://localhost:8000 is not allowed by Access-Control-Allow-Origin.
Unfortunately, if you do not have control of www.script-tutorials.com, there is no easy way around this.
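For the first issue, a minimal sketch of a workaround: serve the saved page over localhost instead of opening it as a file:/// URL (this assumes Python 3; the port is arbitrary). Run it from the directory containing output.html, then browse to http://localhost:8000/output.html:
# Serve the current directory over HTTP so the browser does not apply
# its local-file sandbox to output.html.
import http.server
import socketserver

with socketserver.TCPServer(("", 8000), http.server.SimpleHTTPRequestHandler) as httpd:
    httpd.serve_forever()
Note that this only addresses the sandboxing issue; the cross-domain XMLHttpRequest failure described above remains.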

How to write a python script for downloading?

I want to download some files from this site: http://www.emuparadise.me/soundtracks/highquality/index.php
But I only want to get certain ones.
Is there a way to write a python script to do this? I have intermediate knowledge of python
I'm just looking for a bit of guidance, please point me towards a wiki or library to accomplish this
thanks,
Shrub
Here's a link to my code
I looked at the page. The links seem to redirect to another page where the file is hosted, and clicking there downloads the file.
I would use mechanize to follow the required links to the right page, and then use BeautifulSoup or lxml to parse the resultant page to get the filename.
Then it's a simple matter of opening the file using urlopen and writing its contents out into a local file like so:
f = open(localFilePath, 'w')
f.write(urlopen(remoteFilePath).read())
f.close()
Hope that helps
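A minimal sketch of the parsing step described above, using BeautifulSoup instead of mechanize (assumes Python 3 and pip install beautifulsoup4; the page URL and the .mp3 extension are my assumptions about what you want):
from urllib.parse import urljoin
from urllib.request import urlopen
from bs4 import BeautifulSoup

page_url = "http://www.emuparadise.me/soundtracks/highquality/index.php"
soup = BeautifulSoup(urlopen(page_url).read(), "html.parser")

# Collect absolute URLs for every link whose target ends in .mp3.
links = [urljoin(page_url, a["href"])
         for a in soup.find_all("a", href=True)
         if a["href"].lower().endswith(".mp3")]

for link in links:
    local_name = link.rsplit("/", 1)[-1]
    with open(local_name, "wb") as f:
        f.write(urlopen(link).read())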
Make a URL request for the page. Once you have the source, filter it and get the URLs.
The files you want to download are URLs that contain a specific extension, so you can do a regular-expression search for all URLs that match your criteria.
After filtering, do a URL request for each matched URL's data and write it to disk.
Sample code:
#!/usr/bin/python
import re
import sys
import urllib
#Your sample url
sampleUrl = "http://stackoverflow.com"
urlAddInfo = urllib.urlopen(sampleUrl)
data = urlAddInfo.read()
#Sample extensions we'll be looking for: pngs and pdfs
TARGET_EXTENSIONS = "(png|pdf)"
targetCompile = re.compile(TARGET_EXTENSIONS, re.UNICODE|re.MULTILINE)
#Let's get all the urls: match criteria{no spaces or " in a url}
urls = re.findall('(https?://[^\s"]+)', data, re.UNICODE|re.MULTILINE)
#We want these folks
extensionMatches = filter(lambda url: url and targetCompile.search(url), urls)
#The rest of the unmatched urls for which the scrapping can also be repeated.
nonExtMatches = filter(lambda url: url and not targetCompile.search(url), urls)
def fileDl(targetUrl):
#Function to handle downloading of files.
#Arg: url => a String
#Output: Boolean to signify if file has been written to memory
#Validation of the url assumed, for the sake of keeping the illustration short
urlAddInfo = urllib.urlopen(targetUrl)
data = urlAddInfo.read()
fileNameSearch = re.search("([^\/\s]+)$", targetUrl) #Text right before the last slash '/'
if not fileNameSearch:
sys.stderr.write("Could not extract a filename from url '%s'\n"%(targetUrl))
return False
fileName = fileNameSearch.groups(1)[0]
with open(fileName, "wb") as f:
f.write(data)
sys.stderr.write("Wrote %s to memory\n"%(fileName))
return True
#Let's now download the matched files
dlResults = map(lambda fUrl: fileDl(fUrl), extensionMatches)
successfulDls = filter(lambda s: s, dlResults)
sys.stderr.write("Downloaded %d files from %s\n"%(len(successfulDls), sampleUrl))
#You can organize the above code into a function to repeat the process for each of the
#other urls and in that way you can make a crawler.
The above code is written mainly for Python 2.x. However, I wrote a crawler that works on any version starting from 2.x.
Why yes! Five years later and not only is this possible, but you've now got a lot of ways to do it.
I'm going to avoid code examples here, because I mainly want to help break your problem into segments and give you some options for exploration:
Segment 1: GET!
If you must stick to the stdlib, for either python2 or python3, urllib[n]* is what you're going to want to use to pull-down something from the internet.
So again, if you don't want dependencies on other packages:
urllib or urllib2 or maybe another urllib[n] I'm forgetting about.
If you don't have to restrict your imports to the Standard Library:
you're in luck!!!!! You've got:
requests with docs here. requests is the gold standard for gettin' stuff off the web with python. I suggest you use it.
uplink with docs here. It's relatively new & for more programmatic client interfaces.
aiohttp via asyncio with docs here. asyncio got included in python >= 3.5 only, and it's also extra confusing. That said, if you're willing to put in the time it can be ridiculously efficient for exactly this use-case.
...I'd also be remiss not to mention one of my favorite tools for crawling:
fake_useragent repo here. Docs like seriously not necessary.
Segment 2: Parse!
So again, if you must stick to the stdlib and not install anything with pip, you get to use the extra-extra fun and secure (<==extreme-sarcasm) xml builtin module. Specifically, you get to use the:
xml.etree.ElementTree() with docs here.
It's worth noting that the ElementTree object is what the pip-downloadable lxml package is based on and makes easier to use. If you want to reinvent the wheel and write a bunch of your own complicated logic, using the default xml module is your option.
If you don't have to restrict your imports to the Standard Library:
lxml with docs here. As I said before, lxml is a wrapper around xml.etree that makes it human-usable & implements all those parsing tools you'd need to make yourself. However, as you can see by visiting the docs, it's not easy to use by itself. This brings us to...
BeautifulSoup aka bs4 with docs here. BeautifulSoup makes everything easier. It's my recommendation for this.
Segment 3: GET GET GET!
This section is nearly exactly the same as "Segment 1," except you have a bunch of links not one.
The only thing that changes between this section and "Segment 1" is my recommendation for what to use: aiohttp here will download way faster when dealing with several URLs because it allows you to download them in parallel.** (A minimal sketch follows after the footnotes below.)
* - (where n was decided on from python-version to python-version in a somewhat frustratingly arbitrary manner. Look up which urllib[n] has .urlopen() as a top-level function. You can read more about this naming-convention clusterf**k here, here, and here.)
** - (This isn't totally true. It's more sort-of functionally-true at human timescales.)
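For the parallel-download recommendation in "Segment 3", here is a minimal aiohttp sketch (assumes Python >= 3.7 and pip install aiohttp; the URLs are placeholders):
import asyncio
import aiohttp

async def fetch(session, url):
    # Download a single URL and return its raw bytes.
    async with session.get(url) as resp:
        return await resp.read()

async def main(urls):
    async with aiohttp.ClientSession() as session:
        # gather() runs all the downloads concurrently.
        return await asyncio.gather(*(fetch(session, u) for u in urls))

pages = asyncio.run(main(["https://example.com/a", "https://example.com/b"]))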
I would use a combination of wget for downloading - http://www.thegeekstuff.com/2009/09/the-ultimate-wget-download-guide-with-15-awesome-examples/#more-1885 and BeautifulSoup http://www.crummy.com/software/BeautifulSoup/bs4/doc/ for parsing the downloaded file

Extracting titles from PDF files?

I want to write a script to rename downloaded papers with their titles automatically; I'm wondering if there are any libraries or tricks I can make use of? The PDFs are all generated by TeX and should have some 'formal' structure.
You could try to use pyPdf and this example.
for example:
from pyPdf import PdfFileWriter, PdfFileReader

def get_pdf_title(pdf_file_path):
    # PDFs are binary files, so open in 'rb' mode.
    with open(pdf_file_path, 'rb') as f:
        pdf_reader = PdfFileReader(f)
        return pdf_reader.getDocumentInfo().title

title = get_pdf_title('/home/user/Desktop/my.pdf')
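Since the goal is to rename the downloaded papers, a minimal sketch building on get_pdf_title above (the directory path is a placeholder, and it assumes the PDFs actually carry a title in their metadata):
import os

papers_dir = '/home/user/Desktop/papers'
for file_name in os.listdir(papers_dir):
    if not file_name.lower().endswith('.pdf'):
        continue
    old_path = os.path.join(papers_dir, file_name)
    title = get_pdf_title(old_path)
    if title:  # skip files whose metadata has no title set
        # Note: a real script should sanitize characters that are invalid in filenames.
        os.rename(old_path, os.path.join(papers_dir, title.strip() + '.pdf'))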
Assuming all these papers are from arXiv, you could instead extract the arXiv id (I'd guess that searching for "arXiv:" in the PDF's text would consistently reveal the id as the first hit).
Once you have the arXiv reference number (and have done a pip install arxiv), you can get the title using
paper_ref = '1501.00730'
arxiv.query(id_list=[paper_ref])[0].title
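A minimal sketch of that idea, reusing pyPdf from the earlier answer to pull the first page's text and a regular expression to find the id (the regex only covers new-style arXiv ids and is my assumption; the arxiv.query call follows the answer above):
import re
import arxiv
from pyPdf import PdfFileReader

def get_arxiv_title(pdf_file_path):
    # Extract the text of the first page and look for the printed arXiv id.
    with open(pdf_file_path, 'rb') as f:
        first_page_text = PdfFileReader(f).getPage(0).extractText()
    match = re.search(r'arXiv:(\d{4}\.\d{4,5})', first_page_text)
    if match is None:
        return None
    # Look the id up on arXiv and return the paper's title.
    return arxiv.query(id_list=[match.group(1)])[0].title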
I would probably start with perl (seeing as it's always the first thing I reach for). There are several modules for handling PDFs. If you have a consistent structure, you could use regex to snag the titles.
You can try using iText with Jython
