Unable to properly format html code to text

I'm trying to write a script that copies information from one URL and pastes it into another.
I was able to get the HTML using Playwright and Beautiful Soup, but I can't get it formatted properly.
For example, given:
<strong>Autoignition Temperature: </strong>
Using html2text, I converted the HTML above to:
AutoignitionTemp = ** Autoignition Temperature: **
I then paste it into the target website using this code:
pastePage.frame_locator(
    "text=Rich Text Editor, txtContent3Editor toolbarsClipboard/Undo Cut Copy Paste Paste >> iframe"
).locator("body").fill(html2text.html2text(AutoignitionTemp))
However, the text editor does not recognise the ** as bold markup and pastes the text literally, asterisks included.
Is there a way to format the text properly, or to paste the HTML directly, for example through the Chrome console?
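A note on why the asterisks appear: html2text produces Markdown, and fill() enters its argument as plain text, so an editor that does not interpret Markdown shows the ** literally. One possible approach, sketched below, is to skip the conversion and set the editor body's HTML through Playwright's Locator.evaluate. This assumes the iframe body is the editable element and that the editor picks up direct DOM changes, which not every rich text editor does. set_editor_html and EDITOR_IFRAME_SELECTOR are hypothetical names.

```python
def set_editor_html(body_locator, html_fragment):
    """Replace the editor body's content with raw HTML (hypothetical helper)."""
    # Locator.evaluate runs the JavaScript function in the page, passing the
    # matched element first and the Python argument second.
    body_locator.evaluate("(el, html) => { el.innerHTML = html; }", html_fragment)


# Usage, assuming pastePage is an open Playwright Page and
# EDITOR_IFRAME_SELECTOR holds the same selector string as in the question:
# body = pastePage.frame_locator(EDITOR_IFRAME_SELECTOR).locator("body")
# set_editor_html(body, "<strong>Autoignition Temperature: </strong>")
```

Passing the HTML as an argument rather than formatting it into the JavaScript string avoids quoting problems when the fragment contains quotes or backslashes.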

Related

Cannot convert HTML to PDF using wkhtmltopdf or pdfkit

I'm trying to convert a large HTML file to PDF. So far I have tried two approaches, both with the same undesired result.
First, I ran this in the terminal:
wkhtmltopdf my_html_file.html my_pdf_file.pdf
I also tried the conversion inside a Python script:
import pdfkit
with open('my_html_file.html') as f:
    pdfkit.from_file(f, 'my_pdf_file.pdf')
In both cases the generated output file (my_pdf_file.pdf) has only one page and does not contain all the content from the HTML file.

Generate and download tsv from a website (with python)

I have this website and want to write a script that produces the same output as clicking 'Export' -> 'Generate tsv' -> waiting for it to generate -> 'Download'.
The end goal is to run this for a list of approx. 1700 proteins that I have in a .txt file (that is, take a protein, in this case 'Q9BXF6', put it in the URL https://www.ebi.ac.uk/interpro/protein/UniProt/Q9BXF6/entry/InterPro/#table) and download all the results as .tsv files.
I tried inspecting the 'Export' button, but the source code wasn't illuminating (or I didn't know where to look). I also tried this:
r = requests.get('https://www.ebi.ac.uk/interpro/protein/UniProt/Q9BXF6/entry/InterPro/#table')
soup = BeautifulSoup(r.content, 'html.parser')
to locate what I need, but it outputs a bunch of characters that I can't really understand.
I also tried downloading the whole page as it is with the urllib library:
myurl = 'https://www.ebi.ac.uk/interpro/protein/UniProt/Q9BXF6/entry/InterPro/#table'
with urllib.request.urlopen(myurl) as f:
    html = f.read().decode('utf-8')
or
urllib.urlretrieve (myurl, 'interpro.txt') # although this didn't work
It seems as if all the content is stored somewhere else and only referred to, and everything I've tried outputs something useless, but I don't know anything about HTML and am really new to Python (I only use R).

For your first question, you can use the URL of the following element to retrieve the protein data that you need for the second problem.
href="blob:https://www.ebi.ac.uk/806960aa-720c-4958-9392-f242adee627b"
The URL is the value of the href attribute, which you can then use to make the request that downloads the file. You can find it by right-clicking the TSV download button and choosing Inspect Element; the href attribute will be visible there.
Then download the file, for example:
import urllib.request
url = 'https://www.ebi.ac.uk/806960aa-720c-4958-9392-f242adee627b'
urllib.request.urlretrieve(url, '/Users/abc/Downloads/file.tsv')  # any directory to save to
with open("/Users/abc/Downloads/file.tsv") as file_in:
    for line in file_in:
        # make your calls for the second problem here
        pass
You can also use a web automation tool such as Selenium to solve this problem more gracefully. If that is of interest, do look into it; it's not hard.
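For the batch part of the question, the loop over the roughly 1700 proteins could be sketched as follows. It only reads the list and builds the page URL from the question for each entry; the download step is left out because it depends on how the export link is obtained. The file name proteins.txt, the one-accession-per-line layout, and the helper names are assumptions.

```python
from pathlib import Path

# URL pattern taken from the question; only the accession changes.
PAGE_URL = "https://www.ebi.ac.uk/interpro/protein/UniProt/{accession}/entry/InterPro/#table"


def read_accessions(path):
    """Return the non-empty lines of a text file, one accession per line (assumed layout)."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def page_url(accession):
    """Build the InterPro page URL for one UniProt accession."""
    return PAGE_URL.format(accession=accession)


# Usage (proteins.txt is a hypothetical file name):
# for accession in read_accessions("proteins.txt"):
#     print(accession, page_url(accession))
```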

Optimal way to convert PDF file to HTML

I am trying to convert thousands of PDF files to HTML. I was able to convert this PDF file to this HTML file using the following code:
import os

def convertPDFToHtml():
    # run pdfminer's command-line converter and write the result to output.html
    command = 'pdf2txt.py -o output.html -t html test.pdf'
    os.system(command)
I want to be able to parse the HTML file so that I can extract different texts from it. The problem now is that the output HTML file is missing a lot of text from the original file.
Is there a better way to convert the PDF file and parse the HTML text?

This is possibly a similar problem to the one discussed here, unless you specifically want to generate HTML files. Even so, you could first extract the text from the PDFs as plain unformatted text, parse it, and then generate the HTML files.
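The last step of that suggestion, generating HTML from text that has already been extracted, could look roughly like this. It assumes the plain text comes from some other extraction tool and that paragraphs are separated by blank lines. text_to_html is a hypothetical helper, not part of any PDF library.

```python
import html
import re


def text_to_html(text):
    """Wrap blank-line-separated paragraphs of plain text in a minimal HTML page."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Escape <, > and & so extracted text cannot be mistaken for markup.
    body = "\n".join(f"<p>{html.escape(p)}</p>" for p in paragraphs)
    return f"<html><body>\n{body}\n</body></html>"
```

Any parsing of the extracted text would happen on the plain string before this step, so the HTML only needs to carry the final result.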

parse directly on pure html source with selenium in python

I'm trying to test the Selenium program I wrote by giving it an HTML source as a string, for reasons such as speed. I don't want it to fetch a URL or open a file; I just want to pass it a string that contains the whole DIV part of that site and parse it.
This is part of a module that I wrote:
source = driver.page_source
return {'containers': source}
and in another module,
def get_rail_origin(self):
    return self.data['containers'].find_element_by_id('o_outDepName')...
When I try to parse it, I get:
AttributeError: 'str' object has no attribute 'find_element_by_id'
So how can I parse pure HTML source without opening a file or URL?

Selenium works with a live HTML DOM. If you want to get the source and then parse it, you can try, for instance, lxml.html:
from lxml import html

def get_rail_origin(self):
    source = html.fromstring(self.data['containers'])
    return source.get_element_by_id('o_outDepName')
P.S. I assumed that self.data['containers'] is HTML source code.

Python3: storing a link recognized as HTML format in clipboard

How can I store this link (my link) in the clipboard so that I can paste it in HTML mode (not as source code) in an HTML editor?
Pasting it in an editor should show only the text my link, as a clickable link.
Using Tkinter or pywin32 (or something else), how do I tell the clipboard that it contains HTML content (and not just raw text)?

Based on the link suggested by @chrki, you can do this:
1. Install HtmlClipboard: copy the script and save it as HtmlClipboard.py in C:\Python##\Lib\site-packages\
2. Save the script below as link_as_html.py (I used some of the code from your question).
3. Create a shortcut for the script from step 2 (right-click the file link_as_html.py and select Create shortcut).
4. Right-click the shortcut, select Properties, and add a keyboard shortcut in Shortcut key.
That's it. When you have a link in your clipboard, you can just press your keyboard shortcut and then paste the link directly in the HTML mode of your editor.
link_as_html.py (Python 3.4). I assume you have your URL http://www.web.com in the clipboard:
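from tkinter import Tk
import HtmlClipboard  # the script saved in step 1

root = Tk()
root.withdraw()
# read the URL currently on the clipboard; it is assumed to include the scheme (http://)
url = root.clipboard_get()
# put the link back on the clipboard in the HTML clipboard format
HtmlClipboard.PutHtml('<a href="' + url + '" target="_blank">my link</a>')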
