what is the Bs4 way to do the js command "document.getElementsByClasName"? - python

Is there a way in bs4 to do a command similar to the JavaScript command
document.getElementsByClassName('exampleClassName')
In python?
from bs4 import BeautifulSoup
from requests import get
url = example.com
page = get(url).text
soup = BeautifulSoup(page, 'html_parser')
soup.find_all("div", attrs={'class': 'exampleClassName'})
If not, is there a way to scrape it another way?

There are a few ways to do something equivalent.
With beautifulsoup, you can use css selectors and select by class:
soup.select('.exampleClassName')
You can use lxml, and use xpath with a class filter:
doc.xpath(//'*[#class="exampleClassName"])
or if you really like that particular phrase, you can use AdvancedHTMLParser which actually has a method
getElementsByClassName - Returns a list of all elements containing one or more space-separated class names
It all depends on your style and preferences.

Related

Beautiful Soup find.all() that accepts start of word

I'm web scraping a site with beautiful soup that has class names like the following:
<a class="Component-headline-0-2-109" data-key="card-headline" href="/article/politics-senate-elections-legislation-coronavirus-pandemic-bills-f100b3a3b4498a75d6ce522dc09056b0">
The primary issue is that the class name always starts with Component-headline- but just send with a random number. When I use beautiful soup's soup.find_all('class','Component-headline'), it's not able to grab anything because of the unique number. Is it possible to use find_all, but to grab all the classes that just start with "Component-headline"?
I was also thinking on using the data-key="card-headline", and use soup.find_all('data-key','card-headline'), but for some reason that didn't work either, so I assume I can't find by data-key, but not sure. Any suggestions?
BeautifulSoup supports regex, so you can use re.compile to search for partial text on the class attribute
import re
soup.find_all('a', class_=re.compile('Component-headline'))
You can also use lambda
soup.find_all('a', class_=lambda c: c.startswith('Component-headline'))
Try using an [attribute^=value] CSS Selector.
To use a CSS Selector, instead of the find_all() method, use select().
The following selects all classes that start with Component-headline:
soup = BeautifulSoup(html, "html.parser")
print(soup.select('[class^="Component-headline"]'))

How to extract only a specific kind of link from a webpage with beautifulsoup4

I'm trying to extract specific links on a page full of links. The links I need contain the word "apartment" in them.
But whatever I try, I get way more data extracted than only the links I need.
<a href="https://www.website.com/en/ad/apartment/abcd123" title target="IWEB_MAIN">
If anyone could help me out on this, it'd be much appreciated!
Also, if you have a good source that could inform me better about this, it would be double appreciated!
Yon can use regular expression re.
import re
soup=BeautifulSoup(Pagesource,'html.parser')
alltags=soup.find_all("a",attrs={"href" : re.compile("apartment")})
for item in alltags:
print(item['href']) #grab href value
Or You can use css selector
soup=BeautifulSoup(Pagesource,'html.parser')
alltags=soup.select("a[href*='apartment']")
for item in alltags:
print(item['href'])
You find the details in official documents Beautifulsoup
Edited:
You need to consider parent div first then find the anchor tag.
import requests
from bs4 import BeautifulSoup
res=requests.get("https://www.immoweb.be/en/search/apartment/for-sale/leuven/3000")
soup = BeautifulSoup(res.text, 'html.parser')
for item in soup.select("div[data-type='resultgallery-resultitem'] >a[href*='apartment']"):
print(item['href'])

Parsing specific values in multiple pages

I have the following code with a purpose to parse specific information from each of multiple pages. The http of each of the multiple pages is structured and therefore I use this structure to collect all links at the same time for further parsing.
import urllib
import urlparse
import re
from bs4 import BeautifulSoup
Links = ["http://www.newyorksocialdiary.com/party-pictures?page=" + str(i) for i in range(2,27)]
This command gives me a list of http links. I go further to read in and make soups.
Rs = [urllib.urlopen(Link).read() for Link in Links]
soups = [BeautifulSoup(R) for R in Rs]
As these make the soups that I desire, I cannot achieve the final goal - parsing structure . For instance,
Something for Everyone
I am specifically interested in obtaining things like this: '/party-pictures/2007/something-for-everyone'. However, the code below cannot serve this purpose.
As = [soup.find_all('a', attr = {"href"}) for soup in soups]
Could someone tell me where went wrong? I highly appreciate your assistance. Thank you.
I am specifically interested in obtaining things like this: '/party-pictures/2007/something-for-everyone'.
The next would be going for regular expression!!
You don't necessarily need to use regular expressions, and, from what I understand, you can filter out the desired links with BeautifulSoup:
[[a["href"] for a in soup.select('a[href*=party-pictures]')]
for soup in soups]
This, for example, would give you the list of links having party-pictures inside the href. *= means "contains", select() is a CSS selector search.
You can also use find_all() and apply the regular expression filter, for example:
pattern = re.compile(r"/party-pictures/2007/")
[[a["href"] for a in soup.find_all('a', href=pattern)]
for soup in soups]
This should work :
As = [soup.find_all(href=True) for soup in soups]
This should give you all href tags
If you only want hrefs with name 'a', then the following would work :
As = [soup.find_all('a',href=True) for soup in soups]

Pull Tag Value using BeautifulSoup

Can someone direct me as how to pull the value of a tag using BeautifulSoup? I read the documentation but had a hard time navigating through it. For example, if I had:
<span title="Funstuff" class="thisClass">Fun Text</span>
How would I just pull "Funstuff" busing BeautifulSoup/Python?
Edit: I am using version 3.2.1
You need to have something to identify the element you're looking for, and it's hard to tell what it is in this question.
For example, both of these will print out 'Funstuff' in BeautifulSoup 3. One looks for a span element and gets the title, another looks for spans with the given class. Many other valid ways to get to this point are possible.
import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup('<html><body><span title="Funstuff" class="thisClass">Fun Text</span></body></html>')
print soup.html.body.span['title']
print soup.find('span', {"class": "thisClass"})['title']
A tags children are available via .contents
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#contents-and-children
In your case you can find the tag be using its CSS class to extract the contents
from bs4 import BeautifulSoup
soup=BeautifulSoup('<span title="Funstuff" class="thisClass">Fun Text</span>')
soup.select('.thisClass')[0].contents[0]
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors has all the details nevessary

Web Scraping data using python?

I just started learning web scraping using Python. However, I've already ran into some problems.
My goal is to web scrape the names of the different tuna species from fishbase.org (http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=salmon)
The problem: I'm unable to extract all of the species names.
This is what I have so far:
import urllib2
from bs4 import BeautifulSoup
fish_url = 'http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Tuna'
page = urllib2.urlopen(fish_url)
soup = BeautifulSoup(html_doc)
spans = soup.find_all(
From here, I don't know how I would go about extracting the species names. I've thought of using regex (i.e. soup.find_all("a", text=re.compile("\d+\s+\d+")) to capture the texts inside the tag...
Any input will be highly appreciated!
You might as well take advantage of the fact that all the scientific names (and only scientific names) are in <i/> tags:
scientific_names = [it.text for it in soup.table.find_all('i')]
Using BS and RegEx are two different approaches to parsing a webpage. The former exists so you don't have to bother so much with the latter.
You should read up on what BS actually does, it seems like you're underestimating its utility.
What jozek suggests is the correct approach, but I couldn't get his snippet to work (but that's maybe because I am not running the BeautifulSoup 4 beta). What worked for me was:
import urllib2
from BeautifulSoup import BeautifulSoup
fish_url = 'http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Tuna'
page = urllib2.urlopen(fish_url)
soup = BeautifulSoup(page)
scientific_names = [it.text for it in soup.table.findAll('i')]
print scientific_names
Looking at the web page, I'm not sure exactly about what information you want to extract. However, note that you can easily get the text in a tag using the text attribute:
>>> from bs4 import BeautifulSoup
>>> html = '<a>some text</a>'
>>> soup = BeautifulSoup(html)
>>> [tag.text for tag in soup.find_all('a')]
[u'some text']
Thanks everyone...I was able to solve the problem I was having with this code:
import urllib2
from bs4 import BeautifulSoup
fish_url = 'http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Salmon'
page = urllib2.urlopen(fish_url)
html_doc = page.read()
soup = BeautifulSoup(html_doc)
scientific_names = [it.text for it in soup.table.find_all('i')]
for item in scientific_names:
print item
If you want a long term solution, try scrapy. It is quite simple and does a lot of work for you. It is very customizable and extensible. You will extract all the URLs you need using xpath, which is more pleasant and reliable. Still scrapy allows you to use re, if you need.

Categories

Resources