BeautifulSoup drops children when parent is in an implied namespace - python

I am building a screen scraper for Nordstrom's website using Selenium and BeautifulSoup. The website does not actually put every tag in a namespace, but the Firefox webdriver creates one to avoid ambiguity (specifically, the site has an <html xmlns> tag that seems to confuse the driver).
Thus, everything is put in the namespace a0. However, BeautifulSoup only returns the parent element and (sometimes) one level of children when find() is called.
Take this html for example:
<div class='division'>
  <a href='#'>
    <img />
  </a>
</div>
Everything is in the implied a0 namespace, so we should be able to get the image with:
soup.find('a0:div',{'class':'division'}).find('a0:img')
However, this returns None. I have looked through soup.prettify() and can say with certainty that the a0:img is within the a0:div. Is this an intended feature (in which case I need to find a new way of doing it) or a bug (in which case I need a workaround)?
EDIT:
To avoid confusion, this is an example demonstrating the entire workflow:
from selenium import webdriver
from BeautifulSoup import BeautifulSoup # Note that this is BeautifulSoup 3
b = webdriver.Firefox()
b.get("http://shop.nordstrom.com/c/womens-skirts")
borscht = BeautifulSoup(b.page_source)
theImageThatCannotBeFound = borscht.find('a0:div',{'class':'fashion-item'}).find('a0:img')
The above code sets theImageThatCannotBeFound to None, which I believe is incorrect. I hope this clarifies the problem.

This worked for me.
import urllib
from BeautifulSoup import BeautifulSoup
url = 'http://shop.nordstrom.com/c/womens-skirts'
fp = urllib.urlopen(url)
soup = BeautifulSoup(fp)
print soup.find('div',{'class':'fashion-item'}).findAll('img') # also tried .find
Try excluding a0:. That seems to be your problem.
EDIT:
Using both Chrome and Firefox browsers, in and out of Selenium, xmlns is set to an empty string when I view it, which is why the above code works for me. It seems that, due to a mismatch in some component somewhere, we're not getting the same results, and you're getting the a0: namespace.
Because I cannot reproduce the situation, the only solution I can find (albeit very hacky) is to manually replace the namespace:
source = browser.page_source.replace('a0:div', 'div').replace('a0:img', 'img')
soup = BeautifulSoup(source)
print soup.find('div', {'class': 'fashion-item'}).find('img')
I'll admit it's not exactly an ideal solution. I'll keep looking and update my answer if I find a more elegant fix.
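If more of the document is prefixed than just div and img, a more general variant of the same hack is to strip the a0: prefix from every opening and closing tag before parsing. This is only a sketch, assuming the prefix is always literally a0::
import re
from selenium import webdriver
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, as in the question

b = webdriver.Firefox()
b.get("http://shop.nordstrom.com/c/womens-skirts")

# Strip the a0: prefix from every opening and closing tag name.
source = re.sub(r'<(/?)a0:', r'<\1', b.page_source)
soup = BeautifulSoup(source)
print soup.find('div', {'class': 'fashion-item'}).find('img')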

Related

Extracting table data from Yahoo Finance with BeautifulSoup in Python

I am a Python programmer. I want to extract all of the table data from the link below using the BeautifulSoup library.
This is the link: https://finance.yahoo.com/quote/GC%3DF/history?p=GC%3DF
You'll want to look into web scraping tutorials.
Here's one to get you started: https://realpython.com/python-web-scraping-practical-introduction/
This kind of thing can get a little complicated with complex markup, and I'd say the link provided in the question qualifies as slightly complex markup. Basically, you want to find the container div with the classes "Pb(10px) Ovx(a) W(100%)", or the table container whose data-test attribute is "historical-prices", and drill down to the data you need from there.
HOWEVER, if you insist on using the BeautifulSoup library, here's a tutorial for that: https://realpython.com/beautiful-soup-web-scraper-python/
Scroll down to step 3: "Parse HTML Code With Beautiful Soup"
Install the library: python -m pip install beautifulsoup4
Then, use the following code to scrape the page:
import requests
from bs4 import BeautifulSoup
URL = "https://finance.yahoo.com/quote/GC%3DF/history?p=GC%3DF"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
Then, find the table container with the data-test attribute of "historical-prices" that I mentioned earlier:
results = soup.find(attrs={"data-test" : "historical-prices"})
Thanks to this other StackOverflow post for this info on the attrs parameter: Extracting an attribute value with beautifulsoup
From there, you'll want to drill down. I'm not really sure how to do this step properly, as I've never done this in Python before, but there are multiple ways to go about it. My preferred way would be to use the find method or the findAll method on the initial result set:
result_set = results.find("tbody", recursive=False).findAll("tr")
Alternatively, you may be able to use the deprecated findChildren method (note that it returns a list, so the second call needs an index):
result_set = results.findChildren("tbody", recursive=False)
result_set2 = result_set[0].findChildren("tr", recursive=False)
You may need to loop over a result set for each drill-down; see the sketch below. The page you mentioned doesn't make things easy, mind you: you'll have to drill down multiple times to find the proper tr elements. Of course, the above code is only example code, not properly tested.
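A minimal sketch of such a loop, continuing from the results variable above. This assumes the table is actually present in the fetched HTML; Yahoo renders much of this page with JavaScript, so the plain requests response may not contain it:
# Drill down row by row; each row's cells become a list of strings.
for row in results.find("tbody").findAll("tr"):
    cells = [td.get_text(strip=True) for td in row.findAll("td")]
    print(cells)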

XPath: how to format the path

I would like to get the @src value '/pol_il_DECK-SANTA-CRUZ-STAR-WARS-EMPIRE-STRIKES-BACK-POSTER-8-25-20135.jpg' from this webpage:
from lxml import html
import requests
URL = 'http://systemsklep.pl/pol_m_Kategorie_Deskorolka_Deski-281.html'
session = requests.session()
page = session.get(URL)
HTMLn = html.fromstring(page.content)
print HTMLn.xpath('//html/body/div[1]/div/div/div[3]/div[19]/div/a[2]/div/div/img/@src')[0]
but I can't. No matter how I format the XPath, it doesn't work.
In the spirit of @pmuntima's answer, if you already know it's the 14th sourced image, but want to stay with lxml, then you can:
print HTMLn.xpath('//img/@data-src')[14]
To get that particular image. It similarly reports:
/pol_il_DECK-SANTA-CRUZ-STAR-WARS-EMPIRE-STRIKES-BACK-POSTER-8-25-20135.jpg
If you want to do your indexing in XPath (possibly more efficient in very large result sets), then:
print HTMLn.xpath('(//img/@data-src)[15]')[0]
It's a little bit uglier, given the need to parenthesize in the XPath, and then to index out the first element of the list that .xpath always returns. Note that XPath indexing is 1-based, hence the 15 where Python's 0-based indexing used 14.
Still, as discussed in the comments above, strictly numerical indexing is generally a fragile scraping pattern.
Update: So why is the XPath given by browser inspect tools not leading to the right element? Because the content a browser sees, after a dynamic JavaScript-based update, is different from the content your request sees. Your request is not running JS and does no such updates. Different content requires a different address, at least when the address is static and fragile.
Part of the updates here seem to be taking src URIs, which initially point to an "I'm loading!" gif, and replacing them with the "real" src values, which are found in the data-src attribute to begin with.
So you need two changes:
a stronger way to address the content you want (a way that doesn't break when you move from browser inspect to program fetch) and
to fetch the URIs you want from data-src not src, because in your program fetch, the JS has not done its load-and-switch trick the way it did in the browser.
If you know text associated with the target image, that can be the trick. E.g.:
search_phrase = 'DECK SANTA CRUZ STAR WARS EMPIRE STRIKES BACK POSTER'
path = '//img[contains(@alt, "{}")]/@data-src'.format(search_phrase)
print HTMLn.xpath(path)[0]
This works because the alt attribute contains the target text. You look for images that have the search phrase contained in their alt attributes, then fetch the corresponding data-src values.
I used a combination of the requests and Beautiful Soup libraries. They are both wonderful and I would recommend them for scraping and parsing/extracting HTML. If you have a complex scraping job, Scrapy is really good.
So for your specific example, I can do
from bs4 import BeautifulSoup
import requests
URL = 'http://systemsklep.pl/pol_m_Kategorie_Deskorolka_Deski-281.html'
r = requests.get(URL)
soup = BeautifulSoup(r.text, "html.parser")
specific_element = soup.find_all('a', class_="product-icon")[14]
res = specific_element.find('img')["data-src"]
print(res)
It will print out
/pol_il_DECK-SANTA-CRUZ-STAR-WARS-EMPIRE-STRIKES-BACK-POSTER-8-25-20135.jpg

Parse malformed attribute using BeautifulSoup

I'm trying to extract an attribute that contains an invalid unescaped quote:
<meta content="mal"formed">
When using BeautifulSoup like this:
soup.find('meta')['content']
And as expected, the result is mal.
Is there a way to make BeautifulSoup treat the unescaped quote as a part of the attribute, so the result will be mal"formed?
After some trial and error using regex, this is my best solution so far:
html = re.sub(r'(content="[^"=]+)"([^"=]+")', r'\1&quot;\2', html)
soup = BeautifulSoup(html)
soup.find('meta')['content']
Explanation: at first I tried to run the regex only on the desired element. However when doing str(element), BeautifulSoup doesn't return the original html, but a reformatted html which already doesn't contain the formed (the invalid) part of the attribute.
So my solution is based on searching for this exact kind of malformed attributes on the entire HTML, and fixing it using regex. Of course it's very specific to my case.
A better (and hopefully less hackish) solution will be much appreciated.
Here is what I've tried to fix that broken HTML:
different BeautifulSoup parsers - html.parser, html5lib, lxml
lxml.html with and without recover=True
from lxml.html import HTMLParser, fromstring, tostring
data = """<meta content="mal"formed">"""
parser = HTMLParser(recover=True)
print tostring(fromstring(data, parser=parser))
Prints:
<html><head><meta content="mal" formed></head></html>
firing up Firefox and Chrome via selenium and feeding them the broken meta tag:
from selenium import webdriver
data = """<meta content="mal"formed">"""
driver = webdriver.Chrome() # or webdriver.Firefox
driver.get("about:blank")
driver.execute_script("document.head.innerHTML = '{html}';".format(html=data))
data = driver.page_source
driver.close()
print data
Prints:
<html xmlns="http://www.w3.org/1999/xhtml"><head><meta content="mal" formed"="" /></head><body></body></html>
Different tools interpreted the HTML differently but no tool provided the desired output.
I guess, depending on how much you know about the data, pre-processing it with regular expressions might be a practical solution in this case.
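For completeness, here is a minimal sketch of that pre-processing idea, reusing the question's regex but escaping the stray quote as &quot; so the parser keeps it as part of the attribute value:
import re
from bs4 import BeautifulSoup

html = '<meta content="mal"formed">'
# Escape the unescaped inner quote so it survives parsing.
fixed = re.sub(r'(content="[^"=]+)"([^"=]+")', r'\1&quot;\2', html)
soup = BeautifulSoup(fixed, 'html.parser')
print(soup.find('meta')['content'])  # prints: mal"formed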

Pull Tag Value using BeautifulSoup

Can someone direct me as to how to pull the value of a tag using BeautifulSoup? I read the documentation but had a hard time navigating through it. For example, if I had:
<span title="Funstuff" class="thisClass">Fun Text</span>
How would I just pull "Funstuff" using BeautifulSoup/Python?
Edit: I am using version 3.2.1
You need to have something to identify the element you're looking for, and it's hard to tell what it is in this question.
For example, both of these will print out 'Funstuff' in BeautifulSoup 3. One looks for a span element and gets the title; the other looks for spans with the given class. Many other valid ways to get to this point are possible.
import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup('<html><body><span title="Funstuff" class="thisClass">Fun Text</span></body></html>')
print soup.html.body.span['title']
print soup.find('span', {"class": "thisClass"})['title']
A tag's children are available via .contents:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#contents-and-children
In your case you can find the tag by its CSS class and extract the contents:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<span title="Funstuff" class="thisClass">Fun Text</span>')
soup.select('.thisClass')[0].contents[0]
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors has all the details necessary
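Note that the question asks for the title attribute ("Funstuff") rather than the text. With the same selector, that is just an attribute lookup:
soup.select('.thisClass')[0]['title']  # 'Funstuff'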

Python web scraping involving HTML tags with attributes

I'm trying to make a web scraper that will parse a web-page of publications and extract the authors. The skeletal structure of the web-page is the following:
<html>
  <body>
    <div id="container">
      <div id="contents">
        <table>
          <tbody>
            <tr>
              <td class="author">####I want whatever is located here ###</td>
            </tr>
          </tbody>
        </table>
      </div>
    </div>
  </body>
</html>
I've been trying to use BeautifulSoup and lxml thus far to accomplish this task, but I'm not sure how to handle the two div tags and td tag because they have attributes. In addition to this, I'm not sure whether I should rely more on BeautifulSoup or lxml or a combination of both. What should I do?
At the moment, my code looks like what is below:
import re
import urllib2,sys
import lxml
from lxml import etree
from lxml.html.soupparser import fromstring
from lxml.etree import tostring
from lxml.cssselect import CSSSelector
from BeautifulSoup import BeautifulSoup, NavigableString
address='http://www.example.com/'
html = urllib2.urlopen(address).read()
soup = BeautifulSoup(html)
html=soup.prettify()
html=html.replace('&nbsp', '&#160')
html=html.replace('&iacute','&#237')
root=fromstring(html)
I realize that a lot of the import statements may be redundant, but I just copied whatever I currently had in my source file.
EDIT: I suppose that I didn't make this quite clear, but I have multiple tags on the page that I want to scrape.
It's not clear to me from your question why you need to worry about the div tags. What about doing just:
soup = BeautifulSoup(html)
thetd = soup.find('td', attrs={'class': 'author'})
print thetd.string
On the HTML you give, running this emits exactly:
####I want whatever is located here ###
which appears to be what you want. Maybe you can specify better exactly what it is you need that this super-simple snippet doesn't do: multiple td tags, all of class author, that you need to consider (all? just some? which ones?); the possible absence of any such tag (what do you want to do in that case); and the like. It's hard to infer what exactly your specs are, just from this simple example and overabundant code ;-).
Edit: if, as per the OP's latest comment, there are multiple such td tags, one per author:
thetds = soup.findAll('td', attrs={'class': 'author'})
for thetd in thetds:
    print thetd.string
...i.e., not much harder at all!-)
Or you could use pyquery, since BeautifulSoup 3 is no longer actively maintained; see http://www.crummy.com/software/BeautifulSoup/3.1-problems.html
First, install pyquery with:
easy_install pyquery
Then your script could be as simple as:
from pyquery import PyQuery
d = PyQuery('http://mywebpage/')
allauthors = [td.text() for td in d('td.author').items()]  # .items() yields PyQuery-wrapped elements
pyquery uses the CSS selector syntax familiar from jQuery, which I find more intuitive than BeautifulSoup's. It uses lxml underneath and is much faster than BeautifulSoup. But BeautifulSoup is pure Python, and thus works on Google's App Engine as well.
The lxml library is now the standard for parsing HTML in Python. The interface can seem awkward at first, but it is very serviceable for what it does.
You should let the library handle the XML specifics, such as those escaped &entities;
import lxml.html
html = """<html><body><div id="container"><div id="contents"><table><tbody><tr>
<td class="author">####I want whatever is located here, eh? í ###</td>
</tr></tbody></table></div></div></body></html>"""
root = lxml.html.fromstring(html)
tds = root.cssselect("div#contents td.author")
print tds # gives [<Element td at 84ee2cc>]
print tds[0].text # what you want, including the 'í'
BeautifulSoup is certainly the canonical HTML parser/processor. But if you have just this kind of snippet you need to match, instead of building a whole hierarchical object representing the HTML, pyparsing makes it easy to define leading and trailing HTML tags as part of creating a larger search expression:
from pyparsing import makeHTMLTags, withAttribute, SkipTo
author_td, end_td = makeHTMLTags("td")
# only interested in <td>'s where class="author"
author_td.setParseAction(withAttribute(("class","author")))
search = author_td + SkipTo(end_td)("body") + end_td
for match in search.searchString(html):
    print match.body
Pyparsing's makeHTMLTags function does a lot more than just emit "<tag>" and "</tag>" expressions. It also handles:
caseless matching of tags
"<tag/>" syntax
zero or more attributes in the opening tag
attributes defined in arbitrary order
attribute names with namespaces
attribute values in single, double, or no quotes
intervening whitespace between the tag name, attribute names, '=', and values
attributes are accessible after parsing as named results
These are the common pitfalls that make regexes a poor choice for HTML scraping.
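As a small illustration of a few of those points, the same search expression defined above also matches a messier but equivalent form of the tag; this is just a sketch with made-up content:
# Caseless tag name, single-quoted attribute, extra whitespace.
messy = "<TD  class='author' >Jane Q. Author</td>"
for match in search.searchString(messy):
    print match.body  # prints: Jane Q. Author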
