I want to parse HTML in Python and have already tried Beautiful Soup and pyquery. The problem is that those parsers modify the original markup, e.g. by inserting tags. Is there a parser that does not change the code?
I tried HTMLParser, but without success. :(
It doesn't modify the code and just tells me where the tags are, but it fails on web pages like mail.live.com.
Any idea how to parse a web page the way a browser does?
You can use BeautifulSoup to extract just the text without modifying the tags. It's in their documentation.
Same question here:
How to extract text from beautiful soup
No, at the moment there is no such HTML parser, and every parser has its own limitations.
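For what it's worth, the "just tell me where the tags are" behaviour from the question can be sketched with the standard library's html.parser (the Python 3 name of HTMLParser). This is only a minimal sketch: the class name TagLocator and the sample markup are made up for illustration, and it will not cope with script-generated pages any better than the original attempt did.

```python
from html.parser import HTMLParser


class TagLocator(HTMLParser):
    """Record where each start tag occurs; the input string is never altered."""

    def __init__(self):
        super().__init__()
        self.positions = []  # (tag, line, column) tuples

    def handle_starttag(self, tag, attrs):
        # getpos() returns a 1-based line number and a 0-based column offset
        line, col = self.getpos()
        self.positions.append((tag, line, col))


# Hypothetical input, including an unclosed <div> that is left as it is
source = "<p>Hello <b>world</b></p>\n<div>unclosed"

locator = TagLocator()
locator.feed(source)
locator.close()
print(locator.positions)
```

The parser only reports events, so the original string stays untouched and you can slice it yourself using the recorded positions.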
Have you tried the webkit engine with Python bindings?
See this: https://github.com/niwibe/phantompy
You can traverse the real DOM of the parsed web page and do what you need to do.
I have been searching and have had a brief introduction to some of the web crawling libraries in Python, such as Scrapy and Beautiful Soup. Using these libraries, I want to crawl all of the text under a specific heading in a document. Any help would be highly appreciated. I have seen tutorials showing how to get the links under a specific class name (found via the view-source option) using Beautiful Soup, but how can I get the plain text, not the links, under a specific heading class? Sorry for my bad English.
import requests
from bs4 import BeautifulSoup

r = requests.get('https://patents.google.com/patent/US6886010B2/en')
print(r.content)
# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(r.content, 'html.parser')
for link in soup.find_all("div", class_="claims"):
    print(link)
Here I have extracted the claims text, but the output also includes the other div elements nested inside the claims (div within div). I want to extract only the text of the claims.
By links, I assume you mean the entire contents of the div elements. If you'd like to just print the text contained within them, use the .text attribute or .get_text() method. The entire text of the claims is wrapped inside a unique section element. So you might want to try this:
print(soup.find('section', attrs={'id': 'claims'}).text)
The get_text method gives you a bit more flexibility such as joining bits of text together with a separator and stripping the text of extra newlines.
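As a rough illustration of that flexibility, here is a small sketch using get_text with a separator and strip=True. The inline markup is a made-up stand-in for the claims section, not the real patent page, so the class names in it are assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the page's claims section
html = """
<section id="claims">
  <div class="claim"><div class="claim-text">1. A method.</div></div>
  <div class="claim"><div class="claim-text">2. A system.</div></div>
</section>
"""

soup = BeautifulSoup(html, "html.parser")
claims = soup.find("section", attrs={"id": "claims"})

# separator is placed between text fragments; strip=True trims each
# fragment and drops the ones that are only whitespace
text = claims.get_text(separator="\n", strip=True)
print(text)
```

With the nested div elements reduced to their text, each claim ends up on its own line.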
Also, take a look at the BeautifulSoup Documentation and spend some time reading it.
I wish to fetch the source of a webpage and parse individual tags myself. How can I do this in Python?
# Python 2; in Python 3 the same function is urllib.request.urlopen
import urllib2
urllib2.urlopen('http://stackoverflow.com').read()
That's the simple answer, but you should really look at BeautifulSoup
http://www.crummy.com/software/BeautifulSoup/
Some options are:
urllib
urllib2
httplib
httplib2
HTMLParser
Beautiful Soup
All except httplib2 and Beautiful Soup are in the Python Standard Library. The pages for each of the packages above contain simple examples that will let you see what suits your needs best.
I would suggest you use BeautifulSoup
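# For HTML parsing (Python 2, BeautifulSoup 3)
from BeautifulSoup import BeautifulSoup
import urllib2

doc = urllib2.urlopen('http://google.com').read()
soup = BeautifulSoup(doc)  # read() already returns one string, so no join is needed
soup.contents[0].name  # name of the first top-level node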
After this you can pretty much parse anything out of this document. See documentation which has detailed examples of how to do it.
All the answers here are correct, and BeautifulSoup is great. However, when the source HTML is created dynamically by JavaScript, which is usually the case these days, you need an engine that builds the final HTML first and only then fetch it, or else most of the content will be missing.
As far as I know, the easiest way is simply to use a browser's engine for this. In my experience, Python+Selenium+Firefox is the path of least resistance.
How can I extract the links for JavaScript, CSS and img tags from an HTML page? Do I need to use regular expressions, or is there already a lightweight library for HTML parsing?
HTML5Lib in combination with lxml is what I like to use to extract data from HTML documents. It recovers from errors in a similar way to modern browsers, which makes broken HTML easier to work with.
If you actually want to run js code in web pages (say the link is calculated via a function), you should consider looking at the webkit and jswebkit packages which will let you run javascript in a headless webkit window that can get you dynamically generated content for your python parser to examine.
It's really not hard at all to run js in python via webkit, though expect memory usage on par with running a webkit browser.
BeautifulSoup will do the trick.
import urllib
from BeautifulSoup import BeautifulSoup
sock = urllib.urlopen("http://stackoverflow.com")
soup = BeautifulSoup(sock.read())
sock.close()
img = soup.findAll("img")
script = soup.findAll("script", {"type" : "text/javascript"})
css = soup.findAll("link", {"rel" : "stylesheet"})
HTML is not a language that can be parsed with regular expressions, so don't even try. It will break.
What I typically use is Beautiful Soup, a parser library built specifically for gathering information from potentially invalid markup, exactly like the stuff you will find out there.
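If pulling in a third-party package is too heavy, the same idea can be sketched with the standard library's html.parser instead of Beautiful Soup (a deliberate swap, and a less forgiving parser). The class name ResourceCollector and the sample page below are invented for the example.

```python
from html.parser import HTMLParser


class ResourceCollector(HTMLParser):
    """Collect the URLs of external scripts, stylesheets and images."""

    def __init__(self):
        super().__init__()
        self.scripts, self.stylesheets, self.images = [], [], []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script" and attrs.get("src"):
            self.scripts.append(attrs["src"])
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])
        elif tag == "link" and attrs.get("href"):
            # rel may hold several space-separated tokens
            rel = (attrs.get("rel") or "").lower().split()
            if "stylesheet" in rel:
                self.stylesheets.append(attrs["href"])


# Hypothetical page used only to exercise the collector
page = """
<html><head>
<link rel="stylesheet" href="/static/site.css">
<link rel="icon" href="/favicon.ico">
<script src="/static/app.js"></script>
<script>var inline = 1;</script>
</head><body><img src="/logo.png" alt="logo"/></body></html>
"""

collector = ResourceCollector()
collector.feed(page)
collector.close()
print(collector.scripts, collector.stylesheets, collector.images)
```

Inline scripts and non-stylesheet link tags are skipped because they carry no src attribute or no matching rel value.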
I'm trying to parse some web pages for future use. To do so, I've used different modules such as urllib, lxml, BeautifulSoup and HTMLParser.
I had no problems parsing web pages until I ran into hidden tags.
When I opened the page with a chrome browser and used the developer tools to see elements of page, I was able to see the <embed> part of the code:
<embed type="..." src="..." ID="..." >
and I can simply copy and paste it manually.
I need to parse the ID from this hidden tag. Why can't I parse this part of the site using Python? Is there any way to parse these hidden parts?
I know it's not possible to see some parts of the code, such as PHP and ASP, in the HTML source, but I don't think that is the case here.
This "hidden" code is probably generated by JavaScript at runtime.
You might have better luck finding out how the JavaScript works and where it gets its data (the URLs) than attempting to have something run the script and then parse the resulting DOM tree...
I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into notepad.
I'd like something more robust than regular expressions, which may fail on poorly formed HTML. I've seen many people recommend Beautiful Soup, but I've had a few problems using it. For one, it picked up unwanted text, such as JavaScript source. Also, it did not interpret HTML entities. For example, I would expect &#39; in HTML source to be converted to an apostrophe in text, just as if I'd pasted the browser content into notepad.
Update: html2text looks promising. It handles HTML entities correctly and ignores JavaScript. However, it does not exactly produce plain text; it produces markdown that would then have to be turned into plain text. It comes with no examples or documentation, but the code looks clean.
You would need to use the urllib2 Python library to get the HTML from the website and then parse through the HTML to grab the text that you want.
Use BeautifulSoup to parse through the HTML:
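import urllib2
from BeautifulSoup import BeautifulSoup  # the class, not the module, is callable

resp = urllib2.urlopen("http://stackoverflow.com")
rawhtml = resp.read()
# parse the HTML, then join all text nodes to get the page text
soup = BeautifulSoup(rawhtml)
text = ''.join(soup.findAll(text=True))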
I don't think "copy-paste from browser" is a well-defined operation. For instance, what would happen if the entire page were covered with a transparent floating div? What if it had tables? What about dynamic content?
BeautifulSoup is a powerful parser; you just need to know how to use it (it is easy, for instance, to remove the script tags from the page). Fortunately, it has a lot of documentation.
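You can use xml.sax.saxutils.unescape to unescape HTML entities. By default it only handles &amp;, &lt; and &gt;; other entities have to be passed in through its optional entities dictionary.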
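For a rough idea of what "text only, no JavaScript, entities decoded" can look like, here is a sketch built on the standard library's html.parser rather than on Beautiful Soup or html2text. The names TextExtractor and html_to_text are made up, and the result is nowhere near a faithful copy-paste from a browser: it ignores layout, tables and anything generated by scripts.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of script and style."""

    SKIP = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True decodes entities such as &#39; and &amp;
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside script/style elements
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return "\n".join(parser.parts)


# Hypothetical input with an entity and an inline script
sample = "<p>It&#39;s here</p><script>var x = 1;</script><p>A &amp; B</p>"
print(html_to_text(sample))
```

Each block of text lands on its own line, which is a simplification; a real page would need rules for which tags introduce line breaks.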