No handlers could be found for logger "bs4.dammit" - python

I write a small link miner by using BeautifulSoup library.
But I saw that there was some link that doesn't handled. So I test one of theme:
result = requests.get('https://domain.ir/PATH_TO_FILE/optics-program-msc.pdf')
soup = BeautifulSoup(result.content,'html.parser')
f2.write('{"counter":'+str(i)+', "id": "'+a['href']+'", "group":'+str(counter)+", \"children\":"+str(len(soup.find_all('a',href=True)))+"},\n")
I understood that html.parser cannot handle all links and I give this error
No handlers could be found for logger "bs4.dammit"
So the link doesn't written in file. But There is some links that I don't no which parser should be use. like .pdf,.zip,....
So what should I do?

First of all: you should use result.text, because it is already unicode string (instead of bytes in content)
Second thing to check: is it really parsed "soup" of HTML with links? By putting one simple if soup.body:
The third one: bs4.dummit warning says about problem with detecting encoding, so try put some more info about it: BeautifulSoup(result.content, 'html.parser', from_encoding="windows-1259")
Another one: instead of html.parser, try to use lxml

Related

Using BeautifulSoup to Extract CData

I'm trying to use BeautifulSoup from bs4/Python 3 to extract CData. However, whenever I search for it using the following, it returns an empty result. Can anyone point out what I'm doing wrong?
from bs4 import BeautifulSoup,CData
txt = '''<foobar>We have
<![CDATA[some data here]]>
and more.
</foobar>'''
soup = BeautifulSoup(txt)
for cd in soup.findAll(text=True):
if isinstance(cd, CData):
print('CData contents: %r' % cd)
The problem appears to be that the default parser doesn't parse CDATA properly. If you specify the correct parser, the CDATA shows up:
soup = BeautifulSoup(txt,'html.parser')
For more information on parsers, see the docs
I got onto this by using the diagnose function, which the docs recommend:
If you have questions about Beautiful Soup, or run into problems, send mail to the discussion group. If your problem involves parsing an HTML document, be sure to mention what the diagnose() function says about that document.
Using the diagnose() function gives you output of how the different parsers see your html, which enables you to choose the right parser for your use case.

How to copy all the text from url (like [Ctrl+A][Ctrl+C] with webbrowser) in python?

I know there is the easy way to copy all the source of url, but it's not my task. I need exactly save just all the text (just like webbrowser user copy it) to the *.txt file.
Is it unavoidable to parse source code html for it, or there is a better way?
I think it is impossible if you don't parse at all. I guess you could use HtmlParser http://docs.python.org/2/library/htmlparser.html and just keep the data tags, but you will most likely get many other elements than you want.
To get exactly the same as [Ctrl-C] would be very difficult to avoid parsing because of things like the style="display: hidden;" which would hide the text, which again will result in full parsing of html, javascript and css of both the document and resource files.
Parsing is required. Don't know if there's a library method. A simple regex:
text = sub(r"<[^>]+>", " ", html)
this requires many improvements, but it's a starting point.
With python, the BeautifulSoup module is great for parsing HTML, and well worth a look. To get the text from a webpage, it's just a case of:
#!/usr/env python
#
import urllib2
from bs4 import BeautifulSoup
url = 'http://python.org'
html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html)
# you can refine this even further if needed... ie. soup.body.div.get_text()
text = soup.body.get_text()
print text

Beautifulsoup, Python and HTML automatic page truncating?

I'm using Python and BeautifulSoup to parse HTML pages. Unfortunately, for some pages (> 400K) BeatifulSoup is truncating the HTML content.
I use the following code to get the set of "div"s:
findSet = SoupStrainer('div')
set = BeautifulSoup(htmlSource, parseOnlyThese=findSet)
for it in set:
print it
At a certain point, the output looks like:
correct string, correct string, incomplete/truncated string ("So, I")
although, the htmlSource contains the string "So, I am bored", and many others. Also, I would like to mention that when I prettify() the tree I see the HTML source truncated.
Do you have an idea how can I fix this issue?
Thanks!
Try using lxml.html. It is a faster, better html parser, and deals better with broken html than latest BeautifulSoup. It is working fine for your example page, parsing the entire page.
import lxml.html
doc = lxml.html.parse('http://voinici.ceata.org/~sana/test.html')
print len(doc.findall('//div'))
Code above returns 131 divs.
I found a solution to this problem using BeautifulSoup at beautifulsoup-where-are-you-putting-my-html, because I think it is easier than lxml.
The only thing you need to do is to install:
pip install html5lib
and add it as a parameter to BeautifulSoup:
soup = BeautifulSoup(html, 'html5lib')

How to download any(!) webpage with correct charset in python?

Problem
When screen-scraping a webpage using python one has to know the character encoding of the page. If you get the character encoding wrong than your output will be messed up.
People usually use some rudimentary technique to detect the encoding. They either use the charset from the header or the charset defined in the meta tag or they use an encoding detector (which does not care about meta tags or headers).
By using only one these techniques, sometimes you will not get the same result as you would in a browser.
Browsers do it this way:
Meta tags always takes precedence (or xml definition)
Encoding defined in the header is used when there is no charset defined in a meta tag
If the encoding is not defined at all, than it is time for encoding detection.
(Well... at least that is the way I believe most browsers do it. Documentation is really scarce.)
What I'm looking for is a library that can decide the character set of a page the way a browser would. I'm sure I'm not the first who needs a proper solution to this problem.
Solution (I have not tried it yet...)
According to Beautiful Soup's documentation.
Beautiful Soup tries the following encodings, in order of priority, to turn your document into Unicode:
An encoding you pass in as the
fromEncoding argument to the soup
constructor.
An encoding discovered in the document itself: for instance, in an XML declaration or (for HTML documents) an http-equiv META tag. If Beautiful Soup finds this kind of encoding within the document, it parses the document again from the beginning and gives the new encoding a try. The only exception is if you explicitly specified an encoding, and that encoding actually worked: then it will ignore any encoding it finds in the document.
An encoding sniffed by looking at the first few bytes of the file. If an encoding is detected
at this stage, it will be one of the
UTF-* encodings, EBCDIC, or ASCII.
An
encoding sniffed by the chardet
library, if you have it installed.
UTF-8
Windows-1252
When you download a file with urllib or urllib2, you can find out whether a charset header was transmitted:
fp = urllib2.urlopen(request)
charset = fp.headers.getparam('charset')
You can use BeautifulSoup to locate a meta element in the HTML:
soup = BeatifulSoup.BeautifulSoup(data)
meta = soup.findAll('meta', {'http-equiv':lambda v:v.lower()=='content-type'})
If neither is available, browsers typically fall back to user configuration, combined with auto-detection. As rajax proposes, you could use the chardet module. If you have user configuration available telling you that the page should be Chinese (say), you may be able to do better.
Use the Universal Encoding Detector:
>>> import chardet
>>> chardet.detect(urlread("http://google.cn/"))
{'encoding': 'GB2312', 'confidence': 0.99}
The other option would be to just use wget:
import os
h = os.popen('wget -q -O foo1.txt http://foo.html')
h.close()
s = open('foo1.txt').read()
It seems like you need a hybrid of the answers presented:
Fetch the page using urllib
Find <meta> tags using beautiful soup or other method
If no meta tags exist, check the headers returned by urllib
If that still doesn't give you an answer, use the universal encoding detector.
I honestly don't believe you're going to find anything better than that.
In fact if you read further into the FAQ you linked to in the comments on the other answer, that's what the author of detector library advocates.
If you believe the FAQ, this is what the browsers do (as requested in your original question) as the detector is a port of the firefox sniffing code.
I would use html5lib for this.
Scrapy downloads a page and detects a correct encoding for it, unlike requests.get(url).text or urlopen. To do so it tries to follow browser-like rules - this is the best one can do, because website owners have incentive to make their websites work in a browser. Scrapy needs to take HTTP headers, <meta> tags, BOM marks and differences in encoding names in account.
Content-based guessing (chardet, UnicodeDammit) on its own is not a correct solution, as it may fail; it should be only used as a last resort when headers or <meta> or BOM marks are not available or provide no information.
You don't have to use Scrapy to get its encoding detection functions; they are released (among with some other stuff) in a separate library called w3lib: https://github.com/scrapy/w3lib.
To get page encoding and unicode body use w3lib.encoding.html_to_unicode function, with a content-based guessing fallback:
import chardet
from w3lib.encoding import html_to_unicode
def _guess_encoding(data):
return chardet.detect(data).get('encoding')
detected_encoding, html_content_unicode = html_to_unicode(
content_type_header,
html_content_bytes,
default_encoding='utf8',
auto_detect_fun=_guess_encoding,
)
instead of trying to get a page then figuring out the charset the browser would use, why not just use a browser to fetch the page and check what charset it uses..
from win32com.client import DispatchWithEvents
import threading
stopEvent=threading.Event()
class EventHandler(object):
def OnDownloadBegin(self):
pass
def waitUntilReady(ie):
"""
copypasted from
http://mail.python.org/pipermail/python-win32/2004-June/002040.html
"""
if ie.ReadyState!=4:
while 1:
print "waiting"
pythoncom.PumpWaitingMessages()
stopEvent.wait(.2)
if stopEvent.isSet() or ie.ReadyState==4:
stopEvent.clear()
break;
ie = DispatchWithEvents("InternetExplorer.Application", EventHandler)
ie.Visible = 0
ie.Navigate('http://kskky.info')
waitUntilReady(ie)
d = ie.Document
print d.CharSet
BeautifulSoup dose this with UnicodeDammit : Unicode, Dammit

Decomposing HTML to link text and target

Given an HTML link like
texttxt
how can I isolate the url and the text?
Updates
I'm using Beautiful Soup, and am unable to figure out how to do that.
I did
soup = BeautifulSoup.BeautifulSoup(urllib.urlopen(url))
links = soup.findAll('a')
for link in links:
print "link content:", link.content," and attr:",link.attrs
i get
*link content: None and attr: [(u'href', u'_redirectGeneric.asp?genericURL=/root /support.asp')]* ...
...
Why am i missing the content?
edit: elaborated on 'stuck' as advised :)
Use Beautiful Soup. Doing it yourself is harder than it looks, you'll be better off using a tried and tested module.
EDIT:
I think you want:
soup = BeautifulSoup.BeautifulSoup(urllib.urlopen(url).read())
By the way, it's a bad idea to try opening the URL there, as if it goes wrong it could get ugly.
EDIT 2:
This should show you all the links in a page:
import urlparse, urllib
from BeautifulSoup import BeautifulSoup
url = "http://www.example.com/index.html"
source = urllib.urlopen(url).read()
soup = BeautifulSoup(source)
for item in soup.fetchall('a'):
try:
link = urlparse.urlparse(item['href'].lower())
except:
# Not a valid link
pass
else:
print link
Here's a code example, showing getting the attributes and contents of the links:
soup = BeautifulSoup.BeautifulSoup(urllib.urlopen(url))
for link in soup.findAll('a'):
print link.attrs, link.contents
Looks like you have two issues there:
link.contents, not link.content
attrs is a dictionary, not a string. It holds key value pairs for each attribute in an HTML element. link.attrs['href'] will get you what you appear to be looking for, but you'd want to wrap that in a check in case you come across an a tag without an href attribute.
Though I suppose the others might be correct in pointing you to using Beautiful Soup, they might not, and using an external library might be massively over-the-top for your purposes. Here's a regex which will do what you ask.
/<a\s+[^>]*?href="([^"]*)".*?>(.*?)<\/a>/
Here's what it matches:
'text'
// Parts: "url", "text"
'text<span>something</span>'
// Parts: "url", "text<span>something</span>"
If you wanted to get just the text (eg: "textsomething" in the second example above), I'd just run another regex over it to strip anything between pointed brackets.

Categories

Resources