BeautifulSoup XML parsing not working - python

I'm trying to parse an XML page with BeautifulSoup and for some reason it's not able to find the XML parser. I don't think it's a path issue as I've used lxml to parse pages in the past, just not XML. Here's the code:
from bs4 import *
import urllib2
import lxml
from lxml import *
BASE_URL = "http://auctionresults.fcc.gov/Auction_66/Results/xml/round/66_115_database_round.xml"
proxy = urllib2.ProxyHandler({'http': 'http://myProxy.com'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
page = urllib2.urlopen(BASE_URL)
soup = BeautifulSoup(page,"xml")
print soup
I'm probably missing something simple, but all the questions about XML parsing with BeautifulSoup that I found on here were about bs3, and I'm using bs4, which uses a different method for parsing XML. Thanks.

If you have lxml installed, just pass it as BeautifulSoup's parser instead, like below.
Code:
from bs4 import BeautifulSoup as bsoup
import requests as rq
url = "http://auctionresults.fcc.gov/Auction_66/Results/xml/round/66_115_database_round.xml"
r = rq.get(url)
soup = bsoup(r.content, "lxml")
print soup
Result:
<html><body><dataroot xmlns:od="urn:schemas-microsoft-com:officedata" xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance" xsi:nonamespaceschemalocation="66_database.xsd"><all_bids>
<auction_id>66</auction_id>
<auction_description>Advanced Wireless Services</auction_description>
... really long list follows...
[Finished in 34.9s]
Let us know if this helps.
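For what it's worth, the original FeatureNotFound-style error usually means the "xml" feature has no backing parser: in bs4, "xml" is provided by lxml, so installing lxml (pip install lxml) should make the original call work as well. A minimal sketch, assuming lxml is installed:
from bs4 import BeautifulSoup
import requests

url = "http://auctionresults.fcc.gov/Auction_66/Results/xml/round/66_115_database_round.xml"
r = requests.get(url)

# "xml" is backed by lxml's XML parser; unlike the "lxml" HTML parser
# above, it keeps the document as XML instead of wrapping it in <html><body>
soup = BeautifulSoup(r.content, "xml")
print(soup.find("auction_description"))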

Related

Is html parser included in BeautifulSoup?

I wrote this code:
import requests
from bs4 import BeautifulSoup
res = requests.get('https://news.ycombinator.com/news')
soup = BeautifulSoup(res.text, 'html_parser')
print(soup)
And the terminal returns
bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: html_parser. Do you need to install a parser library?
Is there something I am doing wrong, or something I forgot to install? I'm quite new to Beautiful Soup and data scraping in general, so I'm open to advice and criticism.
You wrote the parser name wrong: instead of html_parser it is html.parser.
Your code should look like this:
import requests
from bs4 import BeautifulSoup
res = requests.get('https://news.ycombinator.com/news')
soup = BeautifulSoup(res.text, 'html.parser')
print(soup)
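To answer the title question directly: html.parser is Python's built-in parser and ships with the standard library, so it needs no install; "lxml" and "html5lib" are third-party parsers that must be installed separately (pip install lxml html5lib). A small sketch of the three choices:
from bs4 import BeautifulSoup

html = "<p>Hello</p>"

# html.parser ships with the standard library; no extra install needed
soup = BeautifulSoup(html, 'html.parser')

# these two require third-party packages and raise
# bs4.FeatureNotFound if the package is missing:
# soup = BeautifulSoup(html, 'lxml')
# soup = BeautifulSoup(html, 'html5lib')

print(soup.p.text)  # Hello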

Converting BeautifulSoup soup to lxml element

I would like to use BeautifulSoup or lxml to parse some web pages. Since the raw data is not clean XML, it cannot be parsed directly by lxml.etree.fromstring. However, BeautifulSoup(page_source, 'lxml') works and I can get the soup data of the page. I need some lxml features such as querying by XPath. Are there any functions or variables I can call to convert the soup object of the whole raw web page to an etree object? (I guess BeautifulSoup must convert the raw page to an etree object via the lxml parser before generating a soup object, but I cannot find where it stores that object.)
p.s. I have tried the answer from Is it possible to use bs4 soup object with lxml? to parse the web pages. But I still find some pages cannot be parsed, here is the example:
>>> from urllib.request import urlopen
>>> from bs4 import BeautifulSoup
>>> html = urlopen('https://www.nature.com/articles/s41558-019-0619-1').read()
>>> soup = BeautifulSoup(html, 'lxml')  ## returns a soup object
>>> from lxml.etree import fromstring
>>> fromstring(soup.prettify())  ## raises errors
>>> from lxml.html.soupparser import fromstring
>>> fromstring(soup.prettify())  ## raises errors
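One workaround that can help here (a sketch, not an official bs4 API): serialize the soup back to a string and hand it to lxml's forgiving HTML parser, lxml.html.fromstring, instead of the strict XML parser lxml.etree.fromstring. XPath then works on the resulting element tree:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import lxml.html

html = urlopen('https://www.nature.com/articles/s41558-019-0619-1').read()
soup = BeautifulSoup(html, 'lxml')

# lxml.html.fromstring tolerates the messy markup that made
# lxml.etree.fromstring fail on this page
root = lxml.html.fromstring(str(soup))
print(root.xpath('//title/text()'))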

Having trouble getting page source with BeautifulSoup

I am trying to get the HTML source of a web page using beautifulsoup.
import bs4 as bs
import requests
import urllib.request
sourceUrl='https://www.pakwheels.com/forums/t/planing-a-trip-from-karachi-to-lahore-by-road-in-feb-2017/414115/2.html'
source=urllib.request.urlopen(sourceUrl).read()
soup=bs.BeautifulSoup(source,'html.parser')
print(soup)
I want the HTML source of the page. This is what I am getting now:
'ps.store("siteSettings", {"title":"PakWheels Forums","contact_email":"sami.ullah#pakeventures.com","contact_url":"https://www.pakwheels.com/main/contact_us","logo_url":"https://www.pakwheels.com/assets/logo.png","logo_small_url":"/images/d-logo-sketch-small.png","mobile_logo_url":"data:image/svg+xml;base64,PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0idXRmLTgiPz4NCjwhLS0gR2VuZXJhdG9yOiBBZG9iZSBJbGx1c3RyYXRvciAxNi4wLjAsIFNWRyBFeHBvcnQgUGx1Zy1JbiAuIFNWRyBWZXJzaW9uOiA2LjAwIEJ1aWxkIDApICAtLT4NCjwhRE9DVFlQRSBzdmcgUFVCTElDICItLy9XM0MvL0RURCBTVkcgMS4xLy9FTiIgImh0dHA6Ly93d3cudzMub3JnL0dyYXBoaWNzL1NWRy8xLjEvRFREL3N2ZzExLmR0ZCI+DQo8c3ZnIHZlcnNpb249IjEuMSIgaWQ9IkxheWVyXzEiIHhtbG5zPSJodHRwOi8vd3d3LnczLm9yZy8yMDAwL3N2ZyIgeG1sbnM6eGxpbms9Imh0dHA6Ly93d3cudzMub3JnLzE5OTkveGxpbmsiIHg9IjBweCIgeT0iMHB4Ig0KCSB3aWR0aD0iMjQwcHgiIGhlaWdodD0iNjBweCIgdmlld0JveD0iMCAwIDI0MCA2MCIgZW5hYmxlLWJhY2tncm91bmQ9Im5ldyAwIDAgMjQwIDYwIiB4bWw6c3BhY2U9InByZXNlcnZlIj4NCjxwYXRoIGZpbGw9IiNGRkZGRkYiIGQ9Ik02LjkwMiwyMy4yODZDMzQuNzc3LDIwLjI2Miw1Ny4yNC'
Have a look at this code:
from urllib import request
from bs4 import BeautifulSoup
url_1 = "http://www.google.com"
page = request.urlopen(url_1)
soup = BeautifulSoup(page, 'html.parser')  # name a parser explicitly
print(soup.prettify())
Import everything you need correctly.

BeautifulSoup and Large html

I was trying to scrape a number of large Wikipedia pages like this one.
Unfortunately, BeautifulSoup is not able to handle such large content, and it truncates the page.
I found a solution to this problem using BeautifulSoup at beautifulsoup-where-are-you-putting-my-html; I think it is easier than lxml.
The only thing you need to do is to install:
pip install html5lib
and add it as a parameter to BeautifulSoup:
soup = BeautifulSoup(htmlContent, 'html5lib')
However, if you prefer, you can also use lxml as follows:
import lxml.html
doc = lxml.html.parse('https://en.wikipedia.org/wiki/Talk:Game_theory')
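From there you can query the parsed tree with XPath, for example (a sketch; the mw-headline class is an assumption about Wikipedia's markup):
root = doc.getroot()
# section headings on the talk page (assumes Wikipedia's mw-headline class)
print(root.xpath('//span[@class="mw-headline"]/text()')[:5])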
I suggest you get the HTML content and then pass it to BeautifulSoup:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://en.wikipedia.org/wiki/Talk:Game_theory')
if r.ok:
    soup = BeautifulSoup(r.content, 'html5lib')  # html5lib copes with the large page
    # get the div with links at the bottom of the page
    links_div = soup.find('div', id='catlinks')
    for a in links_div.find_all('a'):
        print a.text
else:
    print r.status_code

BeautifulSoup does not work for some web sites

I have this script:
import urrlib2
from bs4 import BeautifulSoup
url = "http://www.shoptop.ru/"
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)
divs = soup.findAll('a')
print divs
For this web site, it prints an empty list. What can the problem be? I am running on Ubuntu 12.04.
Actually, there are quite a few quirks in BeautifulSoup's parser handling that can cause unexpected results; I had a similar issue when working on Apache using the lxml parser.
So, just try one of the other parsers mentioned in the documentation:
soup = BeautifulSoup(page, "html.parser")
This should work!
It looks like you have a few mistakes in your code: urrlib2 should be urllib2. I've fixed the code for you, and this works using BeautifulSoup 3:
import urllib2
from BeautifulSoup import BeautifulSoup
url = "http://www.shoptop.ru/"
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)
divs = soup.findAll('a')
print divs
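If you would rather stay on bs4, the equivalent is below (a sketch: find_all replaces findAll, and naming a parser explicitly avoids the parser-guessing warning):
import urllib2
from bs4 import BeautifulSoup

url = "http://www.shoptop.ru/"
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page, "html.parser")

# find_all is the bs4 spelling of BeautifulSoup 3's findAll
links = soup.find_all('a')
print(links)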
