How to parse html code saved as text? - python

I have multiple .txt files containing HTML code (the HTML of web pages was copied and saved as .txt).
I want to parse these files as HTML. Are there any libraries that work like the requests + bs4 combination but can treat the contents of a text file as if it were the result of a normal web request?
Thank you for your help.

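As many of the comments stated, it is possible to feed the contents of a .txt file to BeautifulSoup():
from bs4 import BeautifulSoup

path = 'path/to/file.txt'
# read the saved HTML exactly as it was stored in the text file
with open(path) as f:
    text = f.read()
# parse it the same way you would parse a downloaded page
soup = BeautifulSoup(text, 'lxml')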
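
Since the question mentions multiple files, the same idea can be wrapped in a loop. This is only a sketch: the function name `parse_txt_files`, the `*.txt` pattern, and the UTF-8 assumption are illustrative and not part of the original answer, and it uses the built-in `html.parser` backend so that lxml is not required.

```python
from pathlib import Path

from bs4 import BeautifulSoup


def parse_txt_files(folder, pattern="*.txt"):
    """Parse every matching file in `folder` as HTML.

    Returns a dict mapping each file name to its BeautifulSoup tree.
    """
    soups = {}
    for path in sorted(Path(folder).glob(pattern)):
        # Assumption: files are UTF-8; undecodable bytes are replaced
        # instead of raising, so one bad file does not stop the loop.
        text = path.read_text(encoding="utf-8", errors="replace")
        soups[path.name] = BeautifulSoup(text, "html.parser")
    return soups
```

Each value in the returned dict can then be searched with the usual calls such as `find_all()` or `select()`, just as if the HTML had come from requests.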

You may be looking for Beautiful Soup, which can parse and read text from HTML quite easily:
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

Related

Python Save XML Webpage to .mht

I have a single diagnostic webpage on a device with charts that is in XML format made up of an xsl and gif files. Is there a way with Python to download the entire page and save it as a single .mht file rather than separate files?
This is essentially a combination of those two problems:
How to save "complete webpage" not just basic html using Python
https://stackoverflow.com/a/44815028/679240
AFAIK, you could download the page with urllib, parse the HTML with Beautiful Soup, find the images and other dependencies in the parsed HTML, download those, rewrite the image urls in the parsed html to point to the local copies (Beautiful Soup can do this), save the modified HTML back to the disk, and use MHTifier to generate the MHT.
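
The URL-rewriting step in the middle of that pipeline might look roughly like this. It is a sketch under assumptions: `localize_images`, the `assets` directory, and the example URL in the usage note are made up for illustration, only `<img>` tags are handled, and the actual downloading and the MHT packaging are left out.

```python
import os
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup


def localize_images(html, base_url, asset_dir="assets"):
    """Point every <img src> at a local path and report what to download.

    Returns (modified_html, downloads), where downloads maps each absolute
    image URL to the local path it should be saved under.
    """
    soup = BeautifulSoup(html, "html.parser")
    downloads = {}
    for img in soup.find_all("img", src=True):
        # Resolve relative references against the page URL.
        absolute = urljoin(base_url, img["src"])
        filename = os.path.basename(urlparse(absolute).path) or "image"
        local_path = f"{asset_dir}/{filename}"
        downloads[absolute] = local_path
        # Rewrite the attribute in the parse tree.
        img["src"] = local_path
    return str(soup), downloads
```

For a page at `http://device.local/diag/index.html` containing `<img src="charts/cpu.gif">`, the returned mapping would pair `http://device.local/diag/charts/cpu.gif` with `assets/cpu.gif`. Two images that share a file name would collide in this sketch, so a real script would need to make the local names unique.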
Perhaps Scrapy could help you, too.
Hi, I was able to convert both a web page and a local HTML file to .mht using win32com.
You can have a look at this answer:
https://stackoverflow.com/a/59321911/5290876.
You could share a sample XML file with its XSL and images for testing.

python html parser which doesn't modify actual markup?

I want to parse HTML code in Python and have already tried Beautiful Soup and pyquery. The problem is that those parsers modify the original code, e.g. by inserting tags. Is there any parser out there that does not change the code?
I tried HTMLParser but no success! :(
It doesn't modify the code and just tells me where tags are placed. But it fails in parsing web pages like mail.live.com
Any idea how to parse a web page just like a browser?
You can use BeautifulSoup to extract just the text without modifying the tags. It's in their documentation.
Same question here:
How to extract text from beautiful soup
No, at the moment there is no such HTML parser, and every parser has its own limitations.
Have you tried the webkit engine with Python bindings?
See this: https://github.com/niwibe/phantompy
You can traverse the real DOM of the parsed web page and do what you need to do.
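
The behaviour the question describes for HTMLParser (reporting where tags are without altering the markup) can be sketched with the standard library's `html.parser`, which is where that class lives in Python 3. The `TagLocator` class and the sample string are hypothetical, and this does not build a DOM or run scripts the way a browser would.

```python
from html.parser import HTMLParser


class TagLocator(HTMLParser):
    """Record where each start tag appears; the source is never rewritten."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.locations = []

    def handle_starttag(self, tag, attrs):
        # getpos() gives the 1-based line and 0-based column of the tag.
        line, column = self.getpos()
        # get_starttag_text() returns the tag exactly as written in the input.
        self.locations.append((tag, line, column, self.get_starttag_text()))


source = "<div>\n  <P class=x>Hello<br>world\n</div>"
locator = TagLocator()
locator.feed(source)
locator.close()

for tag, line, column, raw in locator.locations:
    print(tag, line, column, raw)
```

Because the parser only reports events, the original string stays untouched, and the recorded positions can be used to slice or annotate it.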

scrapy for table content in pdf file

I am working on scraping tables from a PDF file using Python.
Can someone suggest a good module that fetches only the required table?
I have tried pypdf, pdf2html, OCR, and slate, but nothing works.
Thanks
First, convert PDF to HTML. See Converting PDF to HTML with Python.
And then, using an HTML parsing library, parse the HTML generated from the PDF. See BeautifulSoup HTML table parsing
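
Assuming the converted HTML really does contain `<table>` markup (many PDF-to-HTML converters emit positioned text instead, in which case this will find nothing), the second step could be sketched as follows. The function name `table_rows` and the `table_index` parameter are illustrative.

```python
from bs4 import BeautifulSoup


def table_rows(html, table_index=0):
    """Return one <table> as a list of rows, each a list of cell strings."""
    soup = BeautifulSoup(html, "html.parser")
    tables = soup.find_all("table")
    if table_index >= len(tables):
        return []  # no such table in this document
    rows = []
    for tr in tables[table_index].find_all("tr"):
        # Header and data cells are treated alike; whitespace is trimmed.
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return rows
```

Picking the table by index is the simplest way to fetch only the required one; matching on a header cell would be more robust if the table order varies between files.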

html to text conversion using python language

I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into notepad.
I'd like something more robust than using regular expressions that may fail on poorly formed HTML. I've seen many people recommend Beautiful Soup, but I've had a few problems using it. For one, it picked up unwanted text, such as JavaScript source. Also, it did not interpret HTML entities. For example, I would expect &#39; in HTML source to be converted to an apostrophe in text, just as if I'd pasted the browser content into notepad.
Update: html2text looks promising. It handles HTML entities correctly and ignores JavaScript. However, it does not exactly produce plain text; it produces markdown that would then have to be turned into plain text. It comes with no examples or documentation, but the code looks clean.
You would need to use the urllib2 library (urllib.request in Python 3) to get the HTML from the website and then parse the HTML to grab the text that you want.
Use BeautifulSoup to parse the HTML:
from urllib.request import urlopen  # urllib2.urlopen in Python 2
from bs4 import BeautifulSoup  # Beautiful Soup 4 package
resp = urlopen("http://stackoverflow.com")
rawhtml = resp.read()
# parse the HTML, then pull out the text
soup = BeautifulSoup(rawhtml, "html.parser")
text = soup.get_text()
I don't think "copy-paste from browser" is a well-defined operation. For instance, what would happen if the entire page were covered with a transparent floating div? What if it had tables? What about dynamic content?
BeautifulSoup is a powerful parser; you just need to know how to use it (it is easy, for instance, to remove the script tags from the page). Fortunately, it has a lot of documentation.
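
As a rough illustration of removing script tags before extracting text, here is a sketch using Beautiful Soup 4. The helper name `visible_text` and the decision to drop `<style>` as well are assumptions, and the result only approximates what a browser would show.

```python
from bs4 import BeautifulSoup


def visible_text(html):
    """Return page text with <script> and <style> contents removed."""
    soup = BeautifulSoup(html, "html.parser")
    # Calling the soup like a function is shorthand for find_all().
    for tag in soup(["script", "style"]):
        tag.decompose()  # remove the tag and everything inside it
    # Put each remaining text fragment on its own line, trimmed.
    return soup.get_text(separator="\n", strip=True)
```

Entities such as `&#39;` are already converted to characters by the parser, so the returned text contains a plain apostrophe.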
You can use xml.sax.saxutils.unescape to unescape HTML entities.
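
One caveat worth showing: the standard library function is `xml.sax.saxutils.unescape`, and by default it only handles `&amp;`, `&lt;`, and `&gt;`. In Python 3, `html.unescape` also covers numeric and named HTML entities. The sample string below is made up for the comparison.

```python
from html import unescape as html_unescape
from xml.sax.saxutils import unescape as sax_unescape

raw = "It&#39;s &lt;b&gt;bold&lt;/b&gt; &amp; simple&nbsp;text"

# Only the three XML basics are converted; &#39; and &nbsp; stay as they are.
sax_result = sax_unescape(raw)

# Numeric and named HTML entities are converted too.
html_result = html_unescape(raw)

print(repr(sax_result))
print(repr(html_result))
```

For the apostrophe case raised in the question, `html.unescape` is therefore the closer fit.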

How to get only text of a webpage with Python, just as Select-all & Copy in browser?

I want to get "Main content" instead of <tag>Main content</tag>, where the latter is HTML code that could be retrieved using urllib.urlopen(url).
It should work just as if you opened the URL in a browser, selected all the text, and then copied and pasted it.
Is there a way to do this with Python?
Thanks.
Have a look at Beautiful Soup.
Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping. Three features make it powerful:
Beautiful Soup won't choke if you give it bad markup. It yields a parse tree that makes approximately as much sense as your original document. This is usually good enough to collect the data you need and run away.
Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. You don't have to create a custom parser for each application.
Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings, unless the document doesn't specify an encoding and Beautiful Soup can't autodetect one. Then you just have to specify the original encoding.
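
The encoding point can be sketched as follows, assuming Beautiful Soup 4, where the original encoding is supplied through the `from_encoding` argument. The helper `text_from_bytes` and the Latin-1 sample are hypothetical.

```python
from bs4 import BeautifulSoup


def text_from_bytes(raw_html, encoding=None):
    """Decode raw HTML bytes and return only the text.

    `encoding` is passed through as Beautiful Soup's `from_encoding`;
    leave it as None to let Beautiful Soup try to detect the encoding.
    """
    soup = BeautifulSoup(raw_html, "html.parser", from_encoding=encoding)
    return soup.get_text()


# A Latin-1 encoded fragment with no charset declaration to help detection.
raw = "<tag>Main content: caf\u00e9</tag>".encode("latin-1")
print(text_from_bytes(raw, encoding="latin-1"))
```

The same call without `encoding` relies on autodetection, which usually works for real pages that declare a charset but can guess wrong on short fragments like this one.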
