Using Beautiful Soup to Scrape content encoded in unicode? [duplicate] - python
How can I retrieve the links of a webpage and copy the URL addresses of those links using Python?
Here's a short snippet using the SoupStrainer class in BeautifulSoup:
import httplib2
from bs4 import BeautifulSoup, SoupStrainer
http = httplib2.Http()
status, response = http.request('http://www.nytimes.com')
for link in BeautifulSoup(response, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        print(link['href'])
The BeautifulSoup documentation is actually quite good, and covers a number of typical scenarios:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Edit: note that I used the SoupStrainer class because it's a bit more efficient (memory- and speed-wise) if you know what you're parsing in advance.
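If you know even more up front, SoupStrainer can take the same filters as find_all; here is a small sketch (the href=True filter is my own illustration, not part of the original answer):
from bs4 import BeautifulSoup, SoupStrainer
import httplib2

http = httplib2.Http()
status, response = http.request('http://www.nytimes.com')

# Only build tree nodes for <a> tags that actually carry an href attribute,
# so no has_attr() check is needed afterwards.
only_links_with_href = SoupStrainer('a', href=True)
soup = BeautifulSoup(response, 'html.parser', parse_only=only_links_with_href)
for link in soup.find_all('a'):
    print(link['href'])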
For completeness' sake, the BeautifulSoup 4 version, making use of the encoding supplied by the server as well:
from bs4 import BeautifulSoup
import urllib.request
parser = 'html.parser' # or 'lxml' (preferred) or 'html5lib', if installed
resp = urllib.request.urlopen("http://www.gpsbasecamp.com/national-parks")
soup = BeautifulSoup(resp, parser, from_encoding=resp.info().get_param('charset'))
for link in soup.find_all('a', href=True):
    print(link['href'])
or the Python 2 version:
from bs4 import BeautifulSoup
import urllib2
parser = 'html.parser' # or 'lxml' (preferred) or 'html5lib', if installed
resp = urllib2.urlopen("http://www.gpsbasecamp.com/national-parks")
soup = BeautifulSoup(resp, parser, from_encoding=resp.info().getparam('charset'))
for link in soup.find_all('a', href=True):
    print link['href']
and a version using the requests library, which as written will work in both Python 2 and 3:
from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector
import requests
parser = 'html.parser' # or 'lxml' (preferred) or 'html5lib', if installed
resp = requests.get("http://www.gpsbasecamp.com/national-parks")
http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
encoding = html_encoding or http_encoding
soup = BeautifulSoup(resp.content, parser, from_encoding=encoding)
for link in soup.find_all('a', href=True):
    print(link['href'])
The soup.find_all('a', href=True) call finds all <a> elements that have an href attribute; elements without the attribute are skipped.
BeautifulSoup 3 stopped development in March 2012; new projects really should use BeautifulSoup 4, always.
Note that you should leave decoding the HTML from bytes to BeautifulSoup. You can inform BeautifulSoup of the character set found in the HTTP response headers to assist in decoding, but that value can be wrong and conflict with the <meta> tag information found in the HTML itself, which is why the above uses BeautifulSoup's internal class method EncodingDetector.find_declared_encoding() to make sure that such embedded encoding hints win over a misconfigured server.
With requests, the response.encoding attribute defaults to Latin-1 if the response has a text/* mimetype, even if no character set was returned. This is consistent with the HTTP RFCs but painful when used with HTML parsing, so you should ignore that attribute when no charset is set in the Content-Type header.
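A small illustration of that advice (a sketch, assuming the same page as above):
import requests

resp = requests.get("http://www.gpsbasecamp.com/national-parks")

# Only trust resp.encoding if the server actually declared a charset;
# otherwise requests falls back to ISO-8859-1 (Latin-1) for text/* responses.
declared = 'charset' in resp.headers.get('content-type', '').lower()
http_encoding = resp.encoding if declared else None
print(http_encoding)  # None when the server sent no charset parameter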
Others have recommended BeautifulSoup, but it's much better to use lxml. Despite its name, it is also for parsing and scraping HTML. It's much, much faster than BeautifulSoup, and it even handles "broken" HTML better than BeautifulSoup (their claim to fame). It has a compatibility API for BeautifulSoup too if you don't want to learn the lxml API.
Ian Bicking agrees.
There's no reason to use BeautifulSoup anymore, unless you're on Google App Engine or something where anything not purely Python isn't allowed.
lxml.html also supports CSS3 selectors so this sort of thing is trivial.
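For instance, a minimal CSS-selector sketch with lxml (this assumes the cssselect package is installed, which lxml's .cssselect() method relies on):
import requests
import lxml.html

# Parse the page and select every <a> element that carries an href attribute.
dom = lxml.html.fromstring(requests.get('http://www.nytimes.com').content)
for a in dom.cssselect('a[href]'):
    print(a.get('href'))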
An example with lxml and xpath would look like this:
import urllib
import lxml.html
connection = urllib.urlopen('http://www.nytimes.com')
dom = lxml.html.fromstring(connection.read())
for link in dom.xpath('//a/@href'):  # select the href attribute of every <a> tag (link)
    print link
import urllib2
import BeautifulSoup
request = urllib2.Request("http://www.gpsbasecamp.com/national-parks")
response = urllib2.urlopen(request)
soup = BeautifulSoup.BeautifulSoup(response)
for a in soup.findAll('a'):
    if 'national-park' in a.get('href', ''):  # .get() avoids a KeyError for anchors without an href
        print 'found a url with national-park in the link'
The following code retrieves all the links available on a webpage using urllib2 and BeautifulSoup 4:
import urllib2
from bs4 import BeautifulSoup
url = urllib2.urlopen("http://www.espncricinfo.com/").read()
soup = BeautifulSoup(url)
for line in soup.find_all('a'):
    print(line.get('href'))
Links can be within a variety of attributes so you could pass a list of those attributes to select.
For example, with src and href attributes (here I am using the starts-with ^ operator to specify that either of these attribute values starts with http):
from bs4 import BeautifulSoup as bs
import requests
r = requests.get('https://stackoverflow.com/')
soup = bs(r.content, 'lxml')
links = [item['href'] if item.get('href') is not None else item['src'] for item in soup.select('[href^="http"], [src^="http"]') ]
print(links)
Attribute = value selectors
[attr^=value]
Represents elements with an attribute name of attr whose value is prefixed (preceded) by value.
There are also the commonly used $ (ends with) and * (contains) operators. For a full syntax list see the link above.
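A quick sketch of those other operators against the same page (the selector strings here are my own illustrations, not from the original answer):
from bs4 import BeautifulSoup as bs
import requests

r = requests.get('https://stackoverflow.com/')
soup = bs(r.content, 'lxml')

# Anchors whose href ends with ".pdf" ($= is the "ends with" operator).
pdf_links = [a['href'] for a in soup.select('a[href$=".pdf"]')]

# Anchors whose href contains "stackoverflow" anywhere (*= is the "contains" operator).
so_links = [a['href'] for a in soup.select('a[href*="stackoverflow"]')]

print(pdf_links)
print(so_links)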
Under the hood, BeautifulSoup can use lxml as its parser. Requests, lxml and list comprehensions make a killer combo.
import requests
import lxml.html
dom = lxml.html.fromstring(requests.get('http://www.nytimes.com').content)
[x for x in dom.xpath('//a/@href') if '//' in x and 'nytimes.com' not in x]
In the list comprehension, the "if '//' in x and 'url.com' not in x" part is a simple way to scrub the URL list of the site's 'internal' navigation URLs, etc.
Just for getting the links, without BeautifulSoup or regex:
import urllib2
url="http://www.somewhere.com"
page=urllib2.urlopen(url)
data=page.read().split("</a>")
tag="<a href=\""
endtag="\">"
for item in data:
    if "<a href" in item:
        try:
            ind = item.index(tag)
            item = item[ind+len(tag):]
            end = item.index(endtag)
        except: pass
        else:
            print item[:end]
For more complex operations, of course, BeautifulSoup is still preferred.
This script does what you're looking for, but also resolves the relative links to absolute links.
import urllib
import lxml.html
import urlparse
def get_dom(url):
    connection = urllib.urlopen(url)
    return lxml.html.fromstring(connection.read())

def get_links(url):
    return resolve_links((link for link in get_dom(url).xpath('//a/@href')))

def guess_root(links):
    for link in links:
        if link.startswith('http'):
            parsed_link = urlparse.urlparse(link)
            scheme = parsed_link.scheme + '://'
            netloc = parsed_link.netloc
            return scheme + netloc

def resolve_links(links):
    links = list(links)  # materialise the generator so guess_root() doesn't consume part of it
    root = guess_root(links)
    for link in links:
        if not link.startswith('http'):
            link = urlparse.urljoin(root, link)
        yield link

for link in get_links('http://www.google.com'):
    print link
To find all the links, in this example we will use the urllib2 module together with the re module.
One of the most powerful functions in the re module is re.findall(). While re.search() is used to find the first match for a pattern, re.findall() finds all the matches and returns them as a list of strings, with each string representing one match.
import urllib2
import re

# connect to a URL (replace with whatever page you want to scrape)
url = "http://www.example.com"
website = urllib2.urlopen(url)

# read html code
html = website.read()

# use re.findall to get all the links
links = re.findall('"((http|ftp)s?://.*?)"', html)

print links
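A quick illustration of the re.search() vs re.findall() difference mentioned above (the sample string is made up for the example):
import re

sample = '<a href="http://a.example/1"></a> <a href="http://a.example/2"></a>'

# re.search() stops at the first match and returns a match object (or None).
first = re.search(r'href="(.*?)"', sample)
print(first.group(1))  # http://a.example/1

# re.findall() returns every captured group as a list of strings.
print(re.findall(r'href="(.*?)"', sample))  # ['http://a.example/1', 'http://a.example/2']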
Why not use regular expressions?
import urllib2
import re
url = "http://www.somewhere.com"
page = urllib2.urlopen(url)
page = page.read()
links = re.findall(r"<a.*?\s*href=\"(.*?)\".*?>(.*?)</a>", page)
for link in links:
    print('href: %s, HTML text: %s' % (link[0], link[1]))
Here's an example using @ars' accepted answer and the BeautifulSoup4, requests, and wget modules to handle the downloads.
import requests
import wget
import os
from bs4 import BeautifulSoup, SoupStrainer
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/eeg-mld/eeg_full/'
file_type = '.tar.gz'
response = requests.get(url)
for link in BeautifulSoup(response.content, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        if file_type in link['href']:
            full_path = url + link['href']
            wget.download(full_path)
I found the answer by @Blairg23 working, after the following correction (covering the scenario where it failed to work correctly):
for link in BeautifulSoup(response.content, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        if file_type in link['href']:
            full_path = urlparse.urljoin(url, link['href'])  # the urlparse module needs to be imported
            wget.download(full_path)
For Python 3:
urllib.parse.urljoin has to be used in order to obtain the full URL instead.
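A rough Python 3 sketch of the same idea (the URL and file type are carried over from the answer above; treat this as an illustration rather than a drop-in script):
from urllib.parse import urljoin

import requests
import wget
from bs4 import BeautifulSoup, SoupStrainer

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/eeg-mld/eeg_full/'
file_type = '.tar.gz'

response = requests.get(url)
for link in BeautifulSoup(response.content, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href') and file_type in link['href']:
        # urljoin resolves relative hrefs against the page URL.
        wget.download(urljoin(url, link['href']))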
BeautifulSoup's own parser can be slow. It might be more feasible to use lxml, which is capable of parsing directly from a URL (with some limitations mentioned below).
import lxml.html
doc = lxml.html.parse(url)
links = doc.xpath('//a[@href]')
for link in links:
    print link.attrib['href']
The code above will return the links as-is, and in most cases they will be relative links or absolute links from the site root. Since my use case was to only extract a certain type of link, below is a version that converts the links to full URLs and which optionally accepts a glob pattern like *.mp3. It won't handle single and double dots in the relative paths, though, but so far I haven't had the need for it. If you need to parse URL fragments containing ../ or ./ then urlparse.urljoin might come in handy.
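For example, urlparse.urljoin resolves dot segments like this (a small illustration; the paths are made up):
from urlparse import urljoin  # urllib.parse.urljoin in Python 3

base = 'http://example.com/music/albums/index.html'
print urljoin(base, './track01.mp3')   # http://example.com/music/albums/track01.mp3
print urljoin(base, '../covers/a.jpg')  # http://example.com/music/covers/a.jpg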
NOTE: Direct lxml URL parsing doesn't handle loading from https and doesn't follow redirects, which is why the version below uses urllib2 + lxml.
#!/usr/bin/env python
import sys
import urllib2
import urlparse
import lxml.html
import fnmatch
try:
    import urltools as urltools
except ImportError:
    sys.stderr.write('To normalize URLs run: `pip install urltools --user`\n')
    urltools = None

def get_host(url):
    p = urlparse.urlparse(url)
    return "{}://{}".format(p.scheme, p.netloc)

if __name__ == '__main__':
    url = sys.argv[1]
    host = get_host(url)
    glob_patt = len(sys.argv) > 2 and sys.argv[2] or '*'

    doc = lxml.html.parse(urllib2.urlopen(url))
    links = doc.xpath('//a[@href]')
    for link in links:
        href = link.attrib['href']
        if fnmatch.fnmatch(href, glob_patt):
            if not href.startswith(('http://', 'https://', 'ftp://')):
                if href.startswith('/'):
                    href = host + href
                else:
                    parent_url = url.rsplit('/', 1)[0]
                    href = urlparse.urljoin(parent_url, href)
            if urltools:
                href = urltools.normalize(href)
            print href
The usage is as follows:
getlinks.py http://stackoverflow.com/a/37758066/191246
getlinks.py http://stackoverflow.com/a/37758066/191246 "*users*"
getlinks.py http://fakedomain.mu/somepage.html "*.mp3"
There can be many duplicate links, together with both external and internal links. To differentiate between the two and get just the unique links, use sets:
# Python 3.
import urllib.parse
import urllib.request

from bs4 import BeautifulSoup

url = "http://www.espncricinfo.com/"
resp = urllib.request.urlopen(url)

# Get server encoding per recommendation of Martijn Pieters.
soup = BeautifulSoup(resp, 'html.parser', from_encoding=resp.info().get_param('charset'))

external_links = set()
internal_links = set()
for line in soup.find_all('a'):
    link = line.get('href')
    if not link:
        continue
    if link.startswith('http'):
        external_links.add(link)
    else:
        internal_links.add(link)

# Depending on usage, full internal links may be preferred.
full_internal_links = {
    urllib.parse.urljoin(url, internal_link)
    for internal_link in internal_links
}

# Print all unique external and full internal links.
for link in external_links.union(full_internal_links):
    print(link)
import urllib2
from bs4 import BeautifulSoup

a = urllib2.urlopen('http://dir.yahoo.com')
code = a.read()
soup = BeautifulSoup(code)
links = soup.findAll("a")

# To get the href part alone
print links[0].attrs['href']
Related
Scraping site returns different href for a link
In Python, I'm using the requests module and BS4 to search the web with duckduckgo.com. I went to http://duckduckgo.com/html/?q='hello' manually and got the first result's title as <a class="result__a" href="http://example.com"> using the Developer Tools. Now I used the following code to get the href with Python:
html = requests.get('http://duckduckgo.com/html/?q=hello').content
soup = BeautifulSoup(html, 'html.parser')
result = soup.find('a', class_='result__a')['href']
However, the href looks like gibberish and is completely different from the one I saw manually. Any idea why this is happening?
There are multiple DOM elements with the class name 'result__a', so don't expect the first link you see to be the first one you get. The 'gibberish' you mentioned is an encoded URL. You'll need to decode and parse it to get the parameters (params) of the URL. For example:
/l/?kh=-1&uddg=https%3A%2F%2Fwww.example.com
The above href contains two params, namely kh and uddg. uddg is the actual link you need, I suppose. The code below will get all the URLs of that particular class, unquoted.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse, parse_qs, unquote

html = requests.get('http://duckduckgo.com/html/?q=hello').content
soup = BeautifulSoup(html, 'html.parser')
for anchor in soup.find_all('a', attrs={'class': 'result__a'}):
    link = anchor.get('href')
    url_obj = urlparse(link)
    parsed_url = parse_qs(url_obj.query).get('uddg', '')
    if parsed_url:
        print(unquote(parsed_url[0]))
How to extract certain parts of an HTML paragraph
I am new to web scraping and regular expressions and am facing a problem here. One of my scripts gives me output in HTML, but I need to extract a certain part of the paragraph and not the complete paragraph. I need help with this. Below is my code.
import mechanize
from bs4 import BeautifulSoup
import urllib2

br = mechanize.Browser()
response = br.open("http://www.consultadni.info/index.php")
br.select_form(name="form1")
br['APE_PAT'] = 'PATRICIO'
br['APE_MAT'] = 'GAMARRA'
br['NOMBRES'] = 'MARCELINA'
req = br.submit().read()
soup = BeautifulSoup(req, "lxml")
for link in soup.findAll("a"):
    sub = link.get("href")
    soup1 = BeautifulSoup(sub, "lxml")
    print soup1.find_all('p')
Output on screen:
[<p>/</p>]
[<p>datospersonales.php?nc=PATRICIO GAMARRA MARCELINA&dni1=40772568&dni2=12405868&id1=12a40a58a68&id2=30/06/1980&dni3=40631880</p>]
[<p>datospersonales.php?nc=PATRICIO GAMARRA MARCELINA&dni1=40772568&dni2=12405868&id1=12a40a58a68&id2=30/06/1980&dni3=40631880</p>]
[<p>http://www.infocorpperuconsultatusdeudas.blogspot.com/2015/05/infocorp-consulta-gratis-tu-reporte-de.html?ref=dnionline</p>]
What I need: 30/06/1980 & 40631880
For Python 2.7, try this way:
from urlparse import parse_qs

result = set()
for link in soup.find_all("a"):
    sub = parse_qs(link.get("href"))
    if "id2" in sub:
        result.add((sub["id2"][0], sub["dni3"][0]))
print result
Clean way to parse URLs (Python 3):
from urllib import parse

URL = "datospersonales.php?nc=PATRICIO GAMARRA MARCELINA&dni1=40772568&dni2=12405868&id1=12a40a58a68&id2=30/06/1980&dni3=40631880"
query_parts = parse.parse_qs(parse.urlparse(URL).query)
print(query_parts["id2"][0], query_parts["dni3"][0])
How to sift through specific items from a webpage using conditional statement
I've made a scraper in Python. It is running smoothly. Now I would like to discard or accept specific links from that page, as in links only containing "mobiles", but even after making some conditional statement I can't do so. Hope I'm gonna get any help to rectify my mistakes.
import requests
from bs4 import BeautifulSoup

def SpecificItem():
    url = 'https://www.flipkart.com/'
    Process = requests.get(url)
    soup = BeautifulSoup(Process.text, "lxml")
    for link in soup.findAll('div', class_='')[0].findAll('a'):
        if "mobiles" not in link:
            print(link.get('href'))

SpecificItem()
On the other hand, if I do the same thing using the lxml library with XPath, it works.
import requests
from lxml import html

def SpecificItem():
    url = 'https://www.flipkart.com/'
    Process = requests.get(url)
    tree = html.fromstring(Process.text)
    links = tree.xpath('//div[@class=""]//a/@href')
    for link in links:
        if "mobiles" not in link:
            print(link)

SpecificItem()
So, at this point I think that with the BeautifulSoup library the code should be somewhat different to get the purpose served.
The root of your problem is that your if condition works a bit differently between BeautifulSoup and lxml. Basically, if "mobiles" not in link: with BeautifulSoup is not checking whether "mobiles" is in the href field. I didn't look too hard, but I'd guess it's comparing it against the link.text field instead. Explicitly using the href field does the trick:
import requests
from bs4 import BeautifulSoup

def SpecificItem():
    url = 'https://www.flipkart.com/'
    Process = requests.get(url)
    soup = BeautifulSoup(Process.text, "lxml")
    for link in soup.findAll('div', class_='')[0].findAll('a'):
        href = link.get('href')
        if "mobiles" not in href:
            print(href)

SpecificItem()
That prints out a bunch of links and none of them include "mobiles".
How to find and extract link from webpage?
I have a website, e.g. http://site.com. I would like to fetch the main page and extract only links that match the regular expression, e.g. .*somepage.*. The links in the HTML code can be written in various formats (absolute, root-relative, or relative). I need the output format:
http://site.com/my-somepage
http://site.com/my-somepage.html
http://site.com/my-somepage.htm
The output URL must always contain the domain name. What is the fast Python solution for this?
You could use lxml.html:
from lxml import html

url = "http://site.com"
doc = html.parse(url).getroot()  # download & parse webpage
doc.make_links_absolute(url)
for element, attribute, link, _ in doc.iterlinks():
    if attribute == 'href' and element.tag == 'a' and 'somepage' in link:  # or e.g., re.search('somepage', link)
        print(link)
Or the same using beautifulsoup4:
import re
try:
    from urllib2 import urlopen
    from urlparse import urljoin
except ImportError:  # Python 3
    from urllib.parse import urljoin
    from urllib.request import urlopen

from bs4 import BeautifulSoup, SoupStrainer  # pip install beautifulsoup4

url = "http://site.com"
only_links = SoupStrainer('a', href=re.compile('somepage'))
soup = BeautifulSoup(urlopen(url), parse_only=only_links)
urls = [urljoin(url, a['href']) for a in soup(only_links)]
print("\n".join(urls))
Use an HTML parsing module, like BeautifulSoup. Some code (only some):
from bs4 import BeautifulSoup
import re

html = '''url url url'''
soup = BeautifulSoup(html)
links = soup.find_all('a', {'href': re.compile('.*somepage.*')})
for link in links:
    print link['href']
Output:
http://site.com/my-somepage
/my-somepage.html
my-somepage.htm
You should be able to get the format you want from this much data...
Scrapy is the simplest way to do what you want. There is actually a link-extracting mechanism built in. Let me know if you need help with writing the spider to crawl links. Please also see: How do I use the Python Scrapy module to list all the URLs from my website? and the Scrapy tutorial.
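As a rough sketch of that built-in mechanism (assuming Scrapy is installed; the spider name and allow pattern here are illustrative):
import scrapy
from scrapy.linkextractors import LinkExtractor

class SomepageSpider(scrapy.Spider):
    name = "somepage_links"
    start_urls = ["http://site.com"]

    def parse(self, response):
        # LinkExtractor resolves relative URLs and filters them by the allow pattern.
        for link in LinkExtractor(allow=r"somepage").extract_links(response):
            yield {"url": link.url}
It could be run with something like scrapy runspider yourfile.py -o links.json, or adapted into a full Scrapy project.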
can we use XPath with BeautifulSoup?
I am using BeautifulSoup to scrape a URL, and I had the following code to find the td tag whose class is 'empformbody':
import urllib
import urllib2
from BeautifulSoup import BeautifulSoup

url = "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
req = urllib2.Request(url)
response = urllib2.urlopen(req)
the_page = response.read()
soup = BeautifulSoup(the_page)
soup.findAll('td', attrs={'class': 'empformbody'})
Now in the above code we can use findAll to get tags and information related to them, but I want to use XPath. Is it possible to use XPath with BeautifulSoup? If possible, please provide me example code.
Nope, BeautifulSoup, by itself, does not support XPath expressions.
An alternative library, lxml, does support XPath 1.0. It has a BeautifulSoup compatible mode where it'll try and parse broken HTML the way Soup does. However, the default lxml HTML parser does just as good a job of parsing broken HTML, and I believe it is faster.
Once you've parsed your document into an lxml tree, you can use the .xpath() method to search for elements.
try:
    # Python 2
    from urllib2 import urlopen
except ImportError:
    from urllib.request import urlopen

from lxml import etree

url = "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
response = urlopen(url)
htmlparser = etree.HTMLParser()
tree = etree.parse(response, htmlparser)
tree.xpath(xpathselector)
There is also a dedicated lxml.html() module with additional functionality.
Note that in the above example I passed the response object directly to lxml, as having the parser read directly from the stream is more efficient than reading the response into a large string first. To do the same with the requests library, you want to set stream=True and pass in the response.raw object after enabling transparent transport decompression:
import lxml.html
import requests

url = "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
response = requests.get(url, stream=True)
response.raw.decode_content = True
tree = lxml.html.parse(response.raw)
Of possible interest to you is the CSS Selector support; the CSSSelector class translates CSS statements into XPath expressions, making your search for td.empformbody that much easier:
from lxml.cssselect import CSSSelector

td_empformbody = CSSSelector('td.empformbody')
for elem in td_empformbody(tree):
    # Do something with these table cells.
Coming full circle: BeautifulSoup itself does have very complete CSS selector support:
for cell in soup.select('table#foobar td.empformbody'):
    # Do something with these table cells.
I can confirm that there is no XPath support within Beautiful Soup.
As others have said, BeautifulSoup doesn't have XPath support. There are probably a number of ways to get something from an XPath, including using Selenium. However, here's a solution that works in either Python 2 or 3:
from lxml import html
import requests

page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
tree = html.fromstring(page.content)

# This will create a list of buyers:
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
# This will create a list of prices:
prices = tree.xpath('//span[@class="item-price"]/text()')

print('Buyers: ', buyers)
print('Prices: ', prices)
I used this as a reference.
BeautifulSoup has a function named findNext that searches forward in the document from the current element, so:
father.findNext('div', {'class': 'class_value'}).findNext('div', {'id': 'id_value'}).findAll('a')
The code above can imitate the following XPath:
div[class=class_value]/div[id=id_value]
from lxml import etree
from bs4 import BeautifulSoup

soup = BeautifulSoup(open('path of your localfile.html'), 'html.parser')
dom = etree.HTML(str(soup))
print dom.xpath('//*[@id="BGINP01_S1"]/section/div/font/text()')
The above uses the combination of a Soup object with lxml, and one can extract the value using XPath.
When you use lxml it's all simple:
tree = lxml.html.fromstring(html)
i_need_element = tree.xpath('//a[@class="shared-components"]/@href')
But when you use BeautifulSoup (BS4) it's all simple too: first remove "//" and "@", second add a star before "=". Try this magic:
soup = BeautifulSoup(html, "lxml")
i_need_element = soup.select('a[class*="shared-components"]')
As you see, this does not support sub-tags, so I removed the "/@href" part.
I've searched through their docs and it seems there is no XPath option. Also, as you can see here on a similar question on SO, the OP is asking for a translation from XPath to BeautifulSoup, so my conclusion would be - no, there is no XPath parsing available.
Maybe you can try the following without XPath:
from simplified_scrapy.simplified_doc import SimplifiedDoc

html = '''
<html>
<body>
<div>
<h1>Example Domain</h1>
<p>This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.</p>
<p>More information...</p>
</div>
</body>
</html>
'''

doc = SimplifiedDoc(html)
# What XPath can do, so can it.
# The result is the same as doc.getElementByTag('body').getElementByTag('div').getElementByTag('h1').text
print (doc.body.div.h1.text)
print (doc.div.h1.text)
print (doc.h1.text)  # Shorter paths will be faster
print (doc.div.getChildren())
print (doc.div.getChildren('p'))
This is a pretty old thread, but there is a work-around solution now, which may not have been in BeautifulSoup at the time.
Here is an example of what I did. I use the "requests" module to read an RSS feed and get its text content in a variable called "rss_text". With that, I run it through BeautifulSoup, search for the XPath /rss/channel/title, and retrieve its contents. It's not exactly XPath in all its glory (wildcards, multiple paths, etc.), but if you just have a basic path you want to locate, this works.
from bs4 import BeautifulSoup
rss_obj = BeautifulSoup(rss_text, 'xml')
cls.title = rss_obj.rss.channel.title.get_text()
Use soup.find(class_='myclass').
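A tiny sketch of that approach (the HTML snippet and class name are made up for illustration):
from bs4 import BeautifulSoup

html = '<div class="myclass"><a href="/page1">one</a></div><div><a href="/page2">two</a></div>'
soup = BeautifulSoup(html, 'html.parser')

# find(class_=...) returns the first element with that CSS class,
# which covers many of the simple cases people reach for XPath for.
first = soup.find(class_='myclass')
print(first.a['href'])  # /page1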