I'm web scraping a site with beautiful soup that has class names like the following:
<a class="Component-headline-0-2-109" data-key="card-headline" href="/article/politics-senate-elections-legislation-coronavirus-pandemic-bills-f100b3a3b4498a75d6ce522dc09056b0">
The primary issue is that the class name always starts with Component-headline- but ends with a random number. When I use soup.find_all('class', 'Component-headline'), it's not able to grab anything because of the unique number. Is it possible to use find_all to grab all the classes that just start with "Component-headline"?
I was also thinking of using data-key="card-headline" with soup.find_all('data-key', 'card-headline'), but for some reason that didn't work either, so I assume I can't search by data-key, but I'm not sure. Any suggestions?
BeautifulSoup supports regular expressions, so you can use re.compile to match partial text in the class attribute:
import re
soup.find_all('a', class_=re.compile('Component-headline'))
You can also use a lambda; guard against tags with no class attribute, where the value passed in is None:
soup.find_all('a', class_=lambda c: c and c.startswith('Component-headline'))
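Both approaches can be tried on a minimal document; the HTML snippet here is invented for illustration:

```python
import re
from bs4 import BeautifulSoup

html = '''
<a class="Component-headline-0-2-109" href="/article/one">First</a>
<a class="Component-headline-0-2-217" href="/article/two">Second</a>
<a class="Other-link" href="/about">About</a>
'''
soup = BeautifulSoup(html, 'html.parser')

# Regex match against the class attribute
by_regex = soup.find_all('a', class_=re.compile('Component-headline'))

# Lambda match; the "c and" guard handles tags with no class (c is None)
by_lambda = soup.find_all('a', class_=lambda c: c and c.startswith('Component-headline'))

print([a['href'] for a in by_regex])   # ['/article/one', '/article/two']
print([a['href'] for a in by_lambda])  # same result
```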
Try using an [attribute^=value] CSS Selector.
To use a CSS Selector, instead of the find_all() method, use select().
The following selects all elements whose class attribute starts with Component-headline:
soup = BeautifulSoup(html, "html.parser")
print(soup.select('[class^="Component-headline"]'))
I want to web scrape the links and their respective texts from a table. I plan to use regex to accomplish this.
So let's say in this page I have multiple text_i tags. I want to get all the text_i's into a list and then get all the href's into a separate list.
I have:
web = requests.get(url)
web_text = web.text
texts = re.findall(r'<table .*><a .*>(.*)</a></table>', web_text)
The regex expression finds all the anchor tags, of whatever class, inside an HTML table of whatever class and returns the texts, correct? This is taking an extraordinarily long time. Is this the correct way to do it?
Also, how do I go about getting the href url's now?
I suggest you use Beautiful Soup to parse the HTML text of the table.
Adapted from Beautiful Soup's documentation you could do for example:
from bs4 import BeautifulSoup
soup = BeautifulSoup(web_text, 'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'))
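Building on that, both lists the question asks for can be collected in one pass. The table markup below is invented for the demo; the real page's structure may differ:

```python
from bs4 import BeautifulSoup

web_text = '''
<table>
  <tr><td><a href="/page1">text_1</a></td></tr>
  <tr><td><a href="/page2">text_2</a></td></tr>
</table>
'''
soup = BeautifulSoup(web_text, 'html.parser')

# Restrict the search to anchors inside tables, then split into two lists
anchors = [a for table in soup.find_all('table') for a in table.find_all('a')]
texts = [a.get_text() for a in anchors]
hrefs = [a.get('href') for a in anchors]

print(texts)  # ['text_1', 'text_2']
print(hrefs)  # ['/page1', '/page2']
```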
Is there a way in bs4 to do a command similar to the JavaScript command
document.getElementsByClassName('exampleClassName')
In python?
from bs4 import BeautifulSoup
from requests import get
url = 'http://example.com'
page = get(url).text
soup = BeautifulSoup(page, 'html.parser')
soup.find_all("div", attrs={'class': 'exampleClassName'})
If not, is there a way to scrape it another way?
There are a few ways to do something equivalent.
With beautifulsoup, you can use css selectors and select by class:
soup.select('.exampleClassName')
You can use lxml, and use xpath with a class filter:
doc.xpath('//*[@class="exampleClassName"]')
or, if you really like that particular phrase, you can use AdvancedHTMLParser, which actually has a method
getElementsByClassName - returns a list of all elements containing one or more space-separated class names
It all depends on your style and preferences.
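As a sketch of the lxml route (the HTML fragment is made up for the demo): note that an exact `@class=` comparison matches the whole attribute string, so the usual XPath 1.0 idiom with `contains()` is needed for elements carrying several classes.

```python
from lxml import html

page = '''
<div class="exampleClassName">one</div>
<div class="exampleClassName extra">two</div>
'''
doc = html.fromstring(page)

# Exact match: only elements whose class attribute is exactly this string
exact = doc.xpath('//*[@class="exampleClassName"]')

# Token-aware match: "has this class among possibly several"
tokens = doc.xpath(
    '//*[contains(concat(" ", normalize-space(@class), " "), " exampleClassName ")]'
)

print([e.text for e in exact])   # ['one']
print([e.text for e in tokens])  # ['one', 'two']
```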
I have the following code with a purpose to parse specific information from each of multiple pages. The http of each of the multiple pages is structured and therefore I use this structure to collect all links at the same time for further parsing.
import urllib
import urlparse
import re
from bs4 import BeautifulSoup
Links = ["http://www.newyorksocialdiary.com/party-pictures?page=" + str(i) for i in range(2,27)]
This command gives me a list of http links. I go further to read in and make soups.
Rs = [urllib.urlopen(Link).read() for Link in Links]
soups = [BeautifulSoup(R) for R in Rs]
While these make the soups that I want, I cannot achieve the final goal of parsing out the links. For instance, for a link whose text is:
Something for Everyone
I am specifically interested in obtaining things like this: '/party-pictures/2007/something-for-everyone'. However, the code below cannot serve this purpose.
As = [soup.find_all('a', attr = {"href"}) for soup in soups]
Could someone tell me where I went wrong? I highly appreciate your assistance. Thank you.
I am specifically interested in obtaining things like this: '/party-pictures/2007/something-for-everyone'.
My next step would be trying regular expressions!
You don't necessarily need to use regular expressions, and, from what I understand, you can filter out the desired links with BeautifulSoup:
[[a["href"] for a in soup.select('a[href*="party-pictures"]')]
 for soup in soups]
This, for example, would give you the list of links having party-pictures inside the href. *= means "contains"; select() is a CSS selector search.
You can also use find_all() and apply the regular expression filter, for example:
import re

pattern = re.compile(r"/party-pictures/2007/")
[[a["href"] for a in soup.find_all('a', href=pattern)]
for soup in soups]
This should work :
As = [soup.find_all(href=True) for soup in soups]
This should give you all tags that have an href attribute.
If you only want hrefs with name 'a', then the following would work :
As = [soup.find_all('a',href=True) for soup in soups]
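A quick check of the difference between the two on a toy document (markup invented for illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a href="/party-pictures/2007/x">A</a>'
    '<link href="style.css"><a name="anchor-only">B</a>',
    'html.parser'
)

any_href = soup.find_all(href=True)     # any tag with an href: <a> and <link>
a_href = soup.find_all('a', href=True)  # only <a> tags that have an href

print([t.name for t in any_href])  # ['a', 'link']
print([t.name for t in a_href])    # ['a']
```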
Suppose I want to parse an HTML document using BeautifulSoup and use CSS selectors to find specific tags. I would "soupify" it by doing
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
If I wanted to find a tag whose "id" attribute has a value of "abc" I can do
soup.select('#abc')
If I wanted to find all "a" child tags under our current tag, we could do
soup.select('#abc a')
But now, suppose I want to find all "a" tags whose href attribute values end in "xyz". I would want to use regex for that, and I was hoping for something along the lines of
soup.select('#abc a[href] = re.compile(r"xyz$")')
I cannot seem to find anything that says BeautifulSoup's .select() method supports regex.
The soup.select() function only supports CSS syntax; regular expressions are not part of that.
You can use CSS syntax to match attributes ending with specific text:
soup.select('#abc a[href$="xyz"]')
See the CSS attribute selectors documentation over on MDN.
You can always use the results of a CSS selector to continue the search:
import re

for element in soup.select('#abc'):
    child_elements = element.find_all(href=re.compile(r'^http://example\.com/\d+\.html'))
Note that, as the element.select() documentation states:
This is a convenience for users who know the CSS selector syntax. You can do all this stuff with the Beautiful Soup API. And if CSS selectors are all you need, you might as well use lxml directly: it’s a lot faster, and it supports more CSS selectors. But this lets you combine simple CSS selectors with the Beautiful Soup API.
Emphasis mine.
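Put together, the pure-CSS route and the mixed CSS-plus-regex route look like this (the document is invented for the demo):

```python
import re
from bs4 import BeautifulSoup

html = '''
<div id="abc">
  <a href="/files/report-xyz">match</a>
  <a href="/files/other">no match</a>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# Pure CSS: href ending in "xyz"
css_hits = soup.select('#abc a[href$="xyz"]')

# CSS to narrow the scope, then a regex via find_all for the detail work
regex_hits = [a for el in soup.select('#abc')
              for a in el.find_all('a', href=re.compile(r'xyz$'))]

print([a['href'] for a in css_hits])    # ['/files/report-xyz']
print([a['href'] for a in regex_hits])  # ['/files/report-xyz']
```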
Can someone direct me as how to pull the value of a tag using BeautifulSoup? I read the documentation but had a hard time navigating through it. For example, if I had:
<span title="Funstuff" class="thisClass">Fun Text</span>
How would I just pull "Funstuff" using BeautifulSoup/Python?
Edit: I am using version 3.2.1
You need to have something to identify the element you're looking for, and it's hard to tell what it is in this question.
For example, both of these will print out 'Funstuff' in BeautifulSoup 3. One looks for a span element and gets the title, another looks for spans with the given class. Many other valid ways to get to this point are possible.
import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup('<html><body><span title="Funstuff" class="thisClass">Fun Text</span></body></html>')
print soup.html.body.span['title']
print soup.find('span', {"class": "thisClass"})['title']
A tag's children are available via .contents:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#contents-and-children
In your case, you can find the tag by using its CSS class and extract the contents:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<span title="Funstuff" class="thisClass">Fun Text</span>', 'html.parser')
soup.select('.thisClass')[0].contents[0]
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors has all the details necessary
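In bs4 the attribute can also be read directly off the matched tag; a minimal sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<span title="Funstuff" class="thisClass">Fun Text</span>',
    'html.parser'
)

tag = soup.select_one('.thisClass')
print(tag['title'])      # Funstuff
print(tag.get('title'))  # same, but returns None instead of raising if absent
print(tag.get_text())    # Fun Text
```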