Please forgive my lack of knowledge, but given HTML in the following format, what is the best way to extract the individual data fields? Please keep in mind that, more often than not, some or all of them will be missing, in which case we'll keep them as NULL.
<div class="profile-section" id="a-bit-more-about">
<dl>
<dt>Name:</dt>
<dd><span class="given-name">Clem</span> <span class="family-name">Kadiddlehopper</span></dd>
</dl>
<!-- <span class="RealName">/ <span class="fn n"><span class="given-name">Clem</span> <span class="family-name">Kadiddlehopper</span></span></span> -->
<dl>
<dt>Joined:</dt>
<dd>September 1910</dd>
</dl>
<div class="sep"></div>
<dl>
<dt>Hometown:</dt>
<dd>Quiet Rest Maximum Security Twilight Home</dd>
</dl>
<dl>
<dt>Currently:</dt>
<dd><span class="adr"><span class="locality">They won't tell me</span>, <span class="country-name">Zimbobwe</span></span></dd>
</dl>
<div class="sep"></div>
You want an HTML parser. I recommend Beautiful Soup or lxml.
Use the third-party modules Beautiful Soup or lxml, or the built-in module html.parser. For example:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<html><body><a>bbb</a></body></html>', 'html.parser')
soup.find('a')
Or, if you like, you can use a regex for a small, well-defined target.
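For the HTML in the question, a minimal Beautiful Soup sketch (assuming bs4 is installed; a shortened copy of the snippet above is used here) that collects each dt/dd pair and leaves absent fields as None:

```python
from bs4 import BeautifulSoup

html = """
<div class="profile-section" id="a-bit-more-about">
<dl>
<dt>Name:</dt>
<dd><span class="given-name">Clem</span> <span class="family-name">Kadiddlehopper</span></dd>
</dl>
<dl>
<dt>Joined:</dt>
<dd>September 1910</dd>
</dl>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

fields = {}
for dl in soup.select("#a-bit-more-about dl"):
    dt, dd = dl.find("dt"), dl.find("dd")
    if dt and dd:
        # Strip the trailing colon from the label text.
        fields[dt.get_text(strip=True).rstrip(":")] = dd.get_text(" ", strip=True)

# Absent fields were never added, so .get() naturally yields None for them.
print(fields.get("Name"))      # Clem Kadiddlehopper
print(fields.get("Hometown"))  # None
```

Anything not present in the page simply never enters the dict, which gives you the NULL behaviour for free.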
This is my first time using BeautifulSoup as a scraping tool, and I'm just following along slowly with each step.
I've used soup.find_all("div", class_="product-box__inner") to find the list of elements I want, but this particular part isn't clicking for me right now. My question is below.
Here is the HTML; my target is "$0". I have tried
element.find("span", title=re.compile("$")), and I can't use element.select("dt > dd > span > span") because there are multiple elements with the same tag structure that I don't need at all. Is there a way I can target span data-fees-annual-value="" to get .text working?
<div class="product-box__features-item">
<dt class="f-body-3 product-box__features-label">Annual fee</dt>
<dd class="f-title-5 product-box__features-text u-margin-0">
<span>
<span data-fees-annual-value="">$0</span>
</span>
</dd>
</div>
You are close to your goal with CSS selectors; they can be made more specific by referencing the attribute data-fees-annual-value directly:
soup.select_one('span[data-fees-annual-value]').text
Example
from bs4 import BeautifulSoup
html="""
<div class="product-box__features-item">
<dt class="f-body-3 product-box__features-label">Annual fee</dt>
<dd class="f-title-5 product-box__features-text u-margin-0">
<span>
<span data-fees-annual-value="">$0</span>
</span>
</dd>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
print(soup.select_one('span[data-fees-annual-value]').text)
Output
$0
If you want to find an element by its text, use string instead of title. Note that $ is a regex metacharacter, so escape it:
element.find("span", string=re.compile(r'\$'))
Output:
<span data-fees-annual-value="">$0</span>
<span class="price-box">
  <span class="price"><span data-currency-iso="PKR">Rs.</span> <span dir="ltr" data-price="16999"> 16,999</span></span>
  <span class="price -old "><span data-currency-iso="PKR">Rs.</span> <span dir="ltr" data-price="50000"> 50,000</span></span>
</span>
Hello. I need some help extracting the data-price attribute from the <span dir="ltr"> elements. I cannot determine how to extract it using Scrapy.
It is pretty simple (assuming you get this HTML with a response in spider callback):
>>> response.css('span[dir=ltr]::attr(data-price)').extract()
['16999', '50000']
I would recommend you to read about Scrapy Selectors.
Alternatively to @Stasdeep's answer, you could use XPath:
response.xpath('//span[@dir="ltr"]/@data-price').extract()
// -> any descendant span, no matter how deep it is
span[@dir="ltr"] -> a span whose dir attribute equals "ltr"
@data-price -> the attribute you want from that same element
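Outside a spider, the same selection logic can be sanity-checked with nothing but the standard library, since this particular snippet happens to parse as well-formed XML (a sketch only; in Scrapy you would keep using response.css / response.xpath against the live response):

```python
import xml.etree.ElementTree as ET

html = ('<span class="price-box">'
        '<span class="price"><span data-currency-iso="PKR">Rs.</span>'
        ' <span dir="ltr" data-price="16999"> 16,999</span></span>'
        '<span class="price -old "><span data-currency-iso="PKR">Rs.</span>'
        ' <span dir="ltr" data-price="50000"> 50,000</span></span>'
        '</span>')

root = ET.fromstring(html)
# ElementTree understands a limited XPath dialect, enough for this query:
# every descendant span whose dir attribute equals "ltr".
prices = [span.get('data-price') for span in root.findall('.//span[@dir="ltr"]')]
print(prices)  # ['16999', '50000']
```

Real pages are rarely valid XML, so for anything beyond a quick check an HTML-aware parser or Scrapy's own selectors remain the right tool.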
So currently I'm having some issues trying to extract a picture URL from a web page using beautiful soup. I'm quite inexperienced with beautiful soup and would appreciate any feedback you have for me. Here is a snippet of the HTML I'm trying to extract the picture link from (more specifically, the data-srcset URL in the source media tag):
<div class="container-fluid" itemscope="" itemtype="http://schema.org/Product">
<div class="row">
<div id="js_carousel" class="col-xs-12 col-md-8">
<div id="psp-carousel" class="carousel_outer">
<div id="product-carousel" class="pdp-carousel carousel pdp-initial" style="display:block;">
<!-- Wrapper for slides -->
<div class="carousel-inner" id="carousel-inner" role="listbox">
<img class="product-image-placeholder" itemprop="image" alt="..." src="data:image/svg+xml;charset=utf-8,%3Csvg xmlns%3D'http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg' viewBox%3D'0 0 355 462'%3E %3Crect fill%3D'%23eee' width%3D'100%25' height%3D'100%25'%2F%3E%3C%2Fsvg%3E" width="355" height="462">
<picture class="item active" data-image="//s7d2.scene7.com/is/image/aeo/1162_8725_499_of" role="option" aria-selected="true" tabindex="0">
<source media="(max-width: 767px)" data-srcset="//s7d2.scene7.com/is/image/aeo/1162_8725_499_of?$pdp-main_small$" srcset="//s7d2.scene7.com/is/image/aeo/1162_8725_499_of?$pdp-main_small$">
Any time I try to use the line
my_imgs = page_soup.findAll('picture',{'class':'item active'})
I get an empty array. I apologize if this is a dumb question, but any help would be appreciated.
Have you tried using the .select() method of a bs4 soup? The documentation recommends it for matching CSS selectors against your HTML soup. In this case use page_soup.select('picture.item.active') instead of .findAll(); the selector requires both classes but doesn't care about their order, which is more robust than matching the exact attribute string "item active".
The .find() and .findAll() calls themselves are fine: in Beautiful Soup 4, findAll is simply the older spelling of find_all, and a dict passed as the second positional argument is treated as attrs automatically, so page_soup.findAll('picture', attrs={'class': 'item active'}) is equivalent to what you already wrote. If the result is still an empty list, the more likely explanation is that the <picture> element is inserted by JavaScript after the page loads, so it isn't present in the HTML you downloaded; compare the raw response text with what the browser's inspector shows.
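Assuming the <picture> element really is present in the HTML you downloaded, pulling the data-srcset URL out of it could look like this (a sketch built only from the snippet in the question):

```python
from bs4 import BeautifulSoup

html = '''
<picture class="item active" data-image="//s7d2.scene7.com/is/image/aeo/1162_8725_499_of">
  <source media="(max-width: 767px)"
          data-srcset="//s7d2.scene7.com/is/image/aeo/1162_8725_499_of?$pdp-main_small$">
</picture>
'''

soup = BeautifulSoup(html, 'html.parser')

# '.item.active' requires both classes but ignores their order, which is
# more robust than matching the exact string "item active".
picture = soup.select_one('picture.item.active')
url = picture.find('source')['data-srcset']
print(url)  # //s7d2.scene7.com/is/image/aeo/1162_8725_499_of?$pdp-main_small$
```

If this prints nothing against the real page, the element is being added client-side and you need the rendered HTML rather than the raw download.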
Given the following code:
<html>
<body>
<div class="category1" id="foo">
<div class="category2" id="bar">
<div class="category3">
</div>
<div class="category4">
<div class="category5"> test
</div>
</div>
</div>
</div>
</body>
</html>
How do I extract the word test from <div class="category5"> test using BeautifulSoup, i.e. how do I deal with nested divs? I tried to look this up on the Internet but didn't find any case that treats an easy-to-grasp example, so I set up this one. Thanks.
XPath would be the straightforward answer; however, it is not supported in BeautifulSoup.
Updated: with a BeautifulSoup solution.
To do so, given that you know the element (div) and the class in this case, you can use a for loop with attrs to get what you want:
from bs4 import BeautifulSoup
html = '''
<html>
<body>
<div class="category1" id="foo">
<div class="category2" id="bar">
<div class="category3">
</div>
<div class="category4">
<div class="category5"> test
</div>
</div>
</div>
</div>
</body>
</html>'''
content = BeautifulSoup(html, 'html.parser')
for div in content.findAll('div', attrs={'class': 'category5'}):
    print(div.text)
test
I have no problem extracting the text from your HTML sample; like @MartijnPieters suggested, you will need to find out why your div element is missing.
Another update
Since you're missing lxml as a parser for BeautifulSoup, None was returned because nothing was parsed to begin with. Installing lxml should solve your issue.
You may consider using lxml or a similar library that supports XPath; dead easy if you ask me.
from lxml import etree
tree = etree.fromstring(html)  # or etree.parse from a source
tree.xpath('.//div[@class="category5"]/text()')
[' test\n ']
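With a current Beautiful Soup 4 release there is also a middle ground: CSS selectors via .select_one(), which search the whole subtree, so the nesting depth of the divs needs no special handling (a sketch, assuming bs4 is installed):

```python
from bs4 import BeautifulSoup

html = '''
<html><body>
<div class="category1" id="foo">
  <div class="category2" id="bar">
    <div class="category3"></div>
    <div class="category4">
      <div class="category5"> test
      </div>
    </div>
  </div>
</div>
</body></html>'''

soup = BeautifulSoup(html, 'html.parser')
# select_one matches at any depth, so the nested divs are irrelevant.
text = soup.select_one('div.category5').get_text(strip=True)
print(text)  # test
```

get_text(strip=True) also takes care of the stray whitespace around "test".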
Consider the following:
<div id=hotlinklist>
<a>Foo1</a>
<div id=hotlink>
<a>Home</a>
</div>
<div id=hotlink>
<a>Extract</a>
</div>
<div id=hotlink>
<a>Sitemap</a>
</div>
</div>
How would you go about taking out the Sitemap line with a regex in Python?
<a>Sitemap</a>
The following can be used to pull out the anchor tags:
'/<a(.*?)a>/i'
However, there are multiple anchor tags. There are also multiple hotlink divs, so we can't really key off those either.
Don't use a regex. Use BeautifulSoup, an HTML parser.
from bs4 import BeautifulSoup
html = \
"""
<div id=hotlinklist>
<a>Foo1</a>
<div id=hotlink>
<a>Home</a>
</div>
<div id=hotlink>
<a>Extract</a>
</div>
<div id=hotlink>
<a>Sitemap</a>
</div>
</div>"""
soup = BeautifulSoup(html, "html.parser")
soup.find_all("div", id="hotlink")[2].a
# <a>Sitemap</a>
Parsing HTML with regular expressions is a bad idea!
Think about the following piece of HTML:
<a></a > <!-- legal html, but won't pass your regex -->
<a>Sitemap</a><!-- proof that a>b iff ab>1 -->
There are many more such examples. Regular expressions are good for many things, but not for parsing HTML.
You should consider using the Beautiful Soup Python HTML parser.
Anyhow, an ad-hoc solution using a regex is:
import re
data = """
<div id=hotlinklist>
<a>Foo1</a>
<div id=hotlink>
<a>Home</a>
</div>
<div id=hotlink>
<a>Extract</a>
</div>
<div id=hotlink>
<a>Sitemap</a>
</div>
</div>
"""
e = re.compile('<a *[^>]*>.*</a *>')
print(e.findall(data))
Output:
>>> e.findall(data)
['<a>Foo1</a>', '<a>Home</a>', '<a>Extract</a>', '<a>Sitemap</a>']
In order to extract the contents of the tag:
<a>Sitemap</a>
... I would use:
>>> import re
>>> s = '''
<div id=hotlinklist>
<a>Foo1</a>
<div id=hotlink>
<a>Home</a>
</div>
<div id=hotlink>
<a>Extract</a>
</div>
<div id=hotlink>
<a>Sitemap</a>
</div>
</div>'''
>>> m = re.compile(r'<a>(.*?)</a>\s*</div>\s*</div>').search(s)
>>> m.group(1)
'Sitemap'
Use BeautifulSoup or lxml if you need to parse HTML.
Also, what is it that you really need to do? Find the last link? Find the third link? Find the link that points to /sitemap? It's unclear from your question. What do you need to do with the data?
If you really have to use regular expressions, have a look at re.findall.
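For completeness, a minimal re.findall sketch (the hrefs below are made up purely for illustration; a real parser is still the safer choice for anything non-trivial):

```python
import re

html = """
<div id=hotlink>
<a href="/home">Home</a>
</div>
<div id=hotlink>
<a href="/sitemap">Sitemap</a>
</div>
"""

# With one capture group, findall returns just the captured text of each match.
texts = re.findall(r'<a[^>]*>(.*?)</a>', html)
print(texts)  # ['Home', 'Sitemap']

# With two groups, it returns a list of (href, text) tuples instead.
links = re.findall(r'<a href="([^"]*)"[^>]*>(.*?)</a>', html)
print(links)  # [('/home', 'Home'), ('/sitemap', 'Sitemap')]
```

This breaks as soon as attributes are reordered, quotes change, or tags span lines, which is exactly the point of the parser recommendations above.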