Does BeautifulSoup .select() method support use of regex? - python

Suppose I want to parse a html using BeautifulSoup and I wanted to use css selectors to find specific tags. I would "soupify" it by doing
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
If I wanted to find a tag whose "id" attribute has a value of "abc" I can do
soup.select('#abc')
If I wanted to find all "a" child tags under our current tag, we could do
soup.select('#abc a')
But now, suppose I want to find all "a" tags whose 'href' attributes has values that end in "xyz" I would want to use regex for that, I was hoping something along the lines of
soup.select('#abc a[href] = re.compile(r"xyz$")')
I can not seem to find anything that says BeautifulSoup's .select() method will support regex.

The soup.select() function only supports CSS syntax; regular expressions are not part of that.
You can use such syntax to match attributes ending with text:
soup.select('#abc a[href$="xyz"]')
See the CSS attribute selectors documentation over on MSDN.
You can always use the results of a CSS selector to continue the search:
for element in soup.select('#abc'):
child_elements = element.find_all(href=re.compile('^http://example.com/\d+.html'))
Note that, as the element.select() documentation states:
This is a convenience for users who know the CSS selector syntax. You can do all this stuff with the Beautiful Soup API. And if CSS selectors are all you need, you might as well use lxml directly: it’s a lot faster, and it supports more CSS selectors. But this lets you combine simple CSS selectors with the Beautiful Soup API.
Emphasis mine.

Related

How to use Regex in CSS Selector scrapy

I need to get a ul tag by the class name but the class name has a lot of different combinations but it is always just two letters that changes. product-gallerytw__thumbs could be one and product-galleryfp__thumbs could be one. I need to know how to use a css selector that uses regex so that either of these could be found (or any other combination)
I can't use Xpath as the location changes
img_ul = response.css('.product-gallerytw__thumbs')
print(img_ul)
This is what I am trying to do but have not found a way to add regex inside the .css()
You actually can use xpath:
img_ul = response.xpath("//*[contains(#class,'product-gallery')]")
or if you really need to specify everything but the two characters:
img_ul = response.xpath("//*[contains(#class,'product-gallery')][contains(#class,'__thumbs')]")
There is nothing a css selector can do that xpath can't. In fact css selectors are simply an abstraction of xpath selectors.

Beautiful Soup - Get all elements

Newbie with Beautiful Soup would appreciate any pointers.
I'm working with a page which has a lot of:
<p data-v-04dd08f2> .. </p>
elements. Inside the p is a string value, which I need and an embedded span.
Question might be very simple... I am trying to use find_all to 'get' a list of all those elements which I would subsequently parse out to get the tokens I need from inside.
Can someone put me out my misery and tell me how the find_all should be structured to get these?
I've tried:
find_all('p',{'data':'v-04dd08f2'} } # nope
find_all('p', {"attributes': 'v-04dd08f2'} ) # nope
and lots of other combinations all to no avail.
Thanks!
If you are willing to use CSS selectors instead, which I personally prefer to BeautifulSoup's find_* methods and the paragraph tags are in fact exactly what you indicated, that "data-v-04dd08f2" is an attribute of the tag, then the following should do the trick
from bs4 import BeautifulSoup
soup = BeautifulSoup('<p data-v-04dd08f2> .. </p>')
p_tags = soup.select('p[data-v-04dd08f2]')
print(p_tags)
#[<p data-v-04dd08f2=""> .. </p>]
bs4 uses SoupSieve to implement CSS selectors. The SoupSieve docs for selecting based on attribute are here. Note that based on your attempts I suspect you might actually be looking for p tags who have a data attribute = 'v-04dd08f2'. If that's the case the soup.select string should be soup.select('p[data=v-04dd08f2]')
This will return all elements having attribute name starting with "data-v-"
match_pattern = 'data-v-'
m = soup.findAll(lambda tag: any(attr.startswith(match_pattern) for attr in tag.attrs.keys()))
element.attrs is a key-value structure, {attribute_name: attribute_value}

Beautiful Soup find.all() that accepts start of word

I'm web scraping a site with beautiful soup that has class names like the following:
<a class="Component-headline-0-2-109" data-key="card-headline" href="/article/politics-senate-elections-legislation-coronavirus-pandemic-bills-f100b3a3b4498a75d6ce522dc09056b0">
The primary issue is that the class name always starts with Component-headline- but just send with a random number. When I use beautiful soup's soup.find_all('class','Component-headline'), it's not able to grab anything because of the unique number. Is it possible to use find_all, but to grab all the classes that just start with "Component-headline"?
I was also thinking on using the data-key="card-headline", and use soup.find_all('data-key','card-headline'), but for some reason that didn't work either, so I assume I can't find by data-key, but not sure. Any suggestions?
BeautifulSoup supports regex, so you can use re.compile to search for partial text on the class attribute
import re
soup.find_all('a', class_=re.compile('Component-headline'))
You can also use lambda
soup.find_all('a', class_=lambda c: c.startswith('Component-headline'))
Try using an [attribute^=value] CSS Selector.
To use a CSS Selector, instead of the find_all() method, use select().
The following selects all classes that start with Component-headline:
soup = BeautifulSoup(html, "html.parser")
print(soup.select('[class^="Component-headline"]'))

What is the most efficient way to get a specific link using Beautiful Soup in Python 3.0?

I am currently learning Python specialization on coursera. I have come across the issue of extracting a specific link from a webpage using BeautifulSoup. From this webpage (http://py4e-data.dr-chuck.net/known_by_Fikret.html), I am supposed to extract a URL from user input and open that subsequent links, all identified through the anchor tab and run some number of iterations.
While I able to program them using Lists, I am wondering if there is any simpler way of doing it without using Lists or Dictionary?
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')
tags = soup('a')
nameList=list()
loc=''
count=0
for tag in tags:
loc=tag.get('href',None)
nameList.append(loc)
url=nameList[pos-1]
In the above code, you would notice that after locating the links using 'a' tag and 'href', I cant help but has to create a list called nameList to locate the position of link. As this is inefficient, I would like to know if I could directly locate the URL without using the lists. Thanks in advance!
The easiest way is to get an element out of tags list and then extract href value:
tags = soup('a')
a = tags[pos-1]
loc = a.get('href', None)
You can also use soup.select_one() method to query :nth-of-type element:
soup.select('a:nth-of-type({})'.format(pos))
As :nth-of-type uses 1-based indexing, you don't need to subtract 1 from pos value if your users are expected to use 1-based indexing too.
Note that soup's :nth-of-type is not equivalent to CSS :nth-of-type pseudo-class, as it always selects only one element, while CSS selector may select many elements at once.
And if you're looking for "the most efficient way", then you need to look at lxml:
from lxml.html import fromstring
tree = fromstring(r.content)
url = tree.xpath('(//a)[{}]/#href'.format(pos))[0]

How does `findAll` work in BeautifulSoup?

Can someone please explain how findAll works in BeautifulSoup?
My doubt is this row: A = soup.findAll('strong',{'class':'name fn'}). it looks like find some characters matching certain criteria.
but the original codes of the webpage is like ......<STRONG class="name fn">iPod nano 16GB</STRONG>......
how does the ('strong',{'class':'name fn'}) pick it up? thanks.
original Python codes
from bs4 import BeautifulSoup
import urllib2
import re
url="http://m.harveynorman.com.au/ipods-audio-music/ipods/ipods"
page=urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
A = soup.findAll('strong',{'class':'name fn'})
for B in A:
print B.renderContents()
From the docs: Beautifulsoup Docs
Beautiful Soup provides many methods that traverse(goes through) the parse tree, gathering Tags and NavigableStrings that match criteria you specify.
From The basic find method: findAll(name, attrs, recursive, text, limit, **kwargs)
The findAll method traverses the tree, starting at the given point, and finds all the Tag and NavigableString objects that match the criteria you give. The signature for the findall method is this:
findAll(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)
The name argument can be used to pass in a:
tag name (e.g. < B >)
a regular expression
a list or dictionary
the value True
a callable object
The keyword arguments impose restrictions on the attributes of a tag.
It's very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, class, is also a Python reserved word.
You could search by CSS class with soup.find("tagName", { "class" : "cssClass" }),like the code you gave) but that's a lot of code for such a common operation. Instead, you can pass a string for attrs instead of a dictionary.
The doc has further examples to help you understand.

Categories

Resources