Web scraping with Beautiful Soup [duplicate] - Python

This question already has answers here:
Getting all Links from a page Beautiful Soup
(3 answers)
Closed 2 months ago.
I am trying to scrape links from this page (https://www.setlist.fm/search?query=nightwish).
Whilst this code does retrieve the links I want, it also comes back with a load of other stuff I don't want.
Example of what i want:
setlist/nightwish/2022/quarterback-immobilien-arena-leipzig-germany-2bbca8f2.html
setlist/nightwish/2022/brose-arena-bamberg-germany-3bf4963.html
setlist/nightwish/2022/arena-gliwice-gliwice-poland-3bc9dc7.html
Can I use Beautiful Soup to get these links, or do I need to use regex?
import bs4
import requests

url = 'https://www.setlist.fm/search?query=nightwish'
reqs = requests.get(url)
soup = bs4.BeautifulSoup(reqs.text, 'html.parser')
urls = []
for link in soup.select('a'):
    urls.append(link)
    print(link.get('href'))

Please check to see if the code snippet below is useful.
import requests
from bs4 import BeautifulSoup
url = 'https://www.setlist.fm/search?query=nightwish'
html_content = requests.get(url).text
soup = BeautifulSoup(html_content, "lxml")
for g_data in soup.find_all('a', {'class': 'link-primary'}, href=True):
    print(g_data['href'])
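If you only want the setlist detail pages, here is a minimal sketch that filters the matched hrefs by prefix (the 'setlist/' prefix is an assumption based on the example links in the question):
import requests
from bs4 import BeautifulSoup

url = 'https://www.setlist.fm/search?query=nightwish'
soup = BeautifulSoup(requests.get(url).text, 'lxml')

# Keep only anchors whose href points at a setlist page; the
# 'setlist/' prefix is assumed from the examples above.
setlist_links = [
    a['href']
    for a in soup.find_all('a', {'class': 'link-primary'}, href=True)
    if a['href'].startswith('setlist/')
]
for href in setlist_links:
    print(href)
No regex is needed; a string prefix check on the href is enough here.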

Related

How to get specific urls from a website in a class tag with beautiful soup? (Python)

I'm trying to get the urls of the main articles from a news outlet using beautiful soup. Since I do not want to get ALL of the links on the entire page, I specified the class. My code only manages to display the titles of the news articles, not the links. This is the website: https://www.reuters.com/news/us
Here is what I have so far:
import requests
from bs4 import BeautifulSoup
req = requests.get('https://www.reuters.com/news/us').text
soup = BeautifulSoup(req, 'html.parser')
links = soup.findAll("h3", {"class": "story-title"})
for i in links:
    print(i.get_text().strip())
    print()
Any help is greatly appreciated!
To get links to all articles, you can use the following code:
import requests
from bs4 import BeautifulSoup
req = requests.get('https://www.reuters.com/news/us').text
soup = BeautifulSoup(req, 'html.parser')
links = soup.findAll("div", {"class": "story-content"})
for i in links:
    print(i.a.get('href'))
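If you also want the headline alongside each link, a small variation on the same idea (a sketch, assuming the page still uses the story-content and story-title markup shown in the question):
import requests
from bs4 import BeautifulSoup

req = requests.get('https://www.reuters.com/news/us').text
soup = BeautifulSoup(req, 'html.parser')

# Pair each headline with its link; skip cards missing either part.
for card in soup.findAll("div", {"class": "story-content"}):
    a = card.find("a", href=True)
    title = card.find("h3", {"class": "story-title"})
    if a and title:
        print(title.get_text().strip(), "->", a["href"])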

Extract text with Beautiful Soup [duplicate]

This question already has answers here:
BS4 Beautiful Soup extract text from find_all
(2 answers)
Closed 2 years ago.
I'm trying to learn how to use Beautiful Soup,
using this website as a very simple example:
https://www.espncricinfo.com/ci/content/ground/56490.html#Profile
Let's say I want to extract the capacity of the ground. I have so far written the following code, which gives me the field names, but I can't seem to understand how to get the actual value of 18,000.
Can anyone help?
url="https://www.espncricinfo.com/ci/content/ground/56490.html"
response = requests.get(url)
soup = BeautifulSoup(response.text)
soup.findAll('label')
Perhaps something like
from bs4 import BeautifulSoup
import requests
url = "https://www.espncricinfo.com/ci/content/ground/56490.html"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
stats = soup.find('div', {'id': 'stats'})
for e in stats.findAll('label'):
    print(f"{e.text}: {e.nextSibling}")
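To pull out just the capacity, one option is to match on the label text (a sketch, assuming the 'Capacity' label and the stats container id shown above):
from bs4 import BeautifulSoup
import requests

url = "https://www.espncricinfo.com/ci/content/ground/56490.html"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Find the <label> whose text contains "Capacity" and read the text
# node that follows it; the label text is assumed from the question.
stats = soup.find('div', {'id': 'stats'})
for e in stats.findAll('label'):
    if 'Capacity' in e.text:
        print(str(e.nextSibling).strip())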

How to grab titles from webpages using Beautiful Soup in Python and iterating through

I am using bs4 in Python to parse web pages and get information. I am having trouble grabbing just the title. Another part I struggled with was following the links: should this be done recursively, or would I be able to do it through a loop?
def getTitle(link):
    resp = urllib.request.urlopen(link)
    soup = BeautifulSoup(resp, 'html.parser')
    print(soup.find("<title>"))
from bs4 import BeautifulSoup
import urllib.request

def getTitle(link):
    resp = urllib.request.urlopen(link)
    soup = BeautifulSoup(resp, 'html.parser')
    return soup.title.text

print(getTitle('http://www.bbc.co.uk/news'))
Which displays:
Home - BBC News
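As for following the links, a plain loop is enough when you already have a flat list of pages; recursion only becomes relevant if you are crawling links discovered on each page. A minimal sketch reusing getTitle (the second URL is just a placeholder):
# Loop over a known list of pages and print each title.
pages = [
    'http://www.bbc.co.uk/news',
    'http://www.bbc.co.uk/sport',  # placeholder example
]
for page in pages:
    print(getTitle(page))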

How to scrape href with Python 3.5 and BeautifulSoup [duplicate]

This question already has answers here:
retrieve links from web page using python and BeautifulSoup [closed]
(16 answers)
Closed 6 years ago.
I want to scrape the href of every project from the website https://www.kickstarter.com/discover/advanced?category_id=16&woe_id=23424829&sort=magic&seed=2449064&page=1 with Python 3.5 and BeautifulSoup.
That's my code:
#Loading Libraries
import urllib
import urllib.request
from bs4 import BeautifulSoup
#define URL for scraping
theurl = "https://www.kickstarter.com/discover/advanced?category_id=16&woe_id=23424829&sort=magic&seed=2449064&page=1"
thepage = urllib.request.urlopen(theurl)
#Cooking the Soup
soup = BeautifulSoup(thepage,"html.parser")
#Scraping "Link" (href)
project_ref = soup.findAll('h6', {'class': 'project-title'})
project_href = [project.findChildren('a')[0].href for project in project_ref if project.findChildren('a')]
print(project_href)
I get [None, None, .... None, None] back.
I need a list of all the hrefs from that class.
Any ideas?
Try something like this:
import urllib.request
from bs4 import BeautifulSoup
theurl = "https://www.kickstarter.com/discover/advanced?category_id=16&woe_id=23424829&sort=magic&seed=2449064&page=1"
thepage = urllib.request.urlopen(theurl)
soup = BeautifulSoup(thepage, "html.parser")
project_href = [i['href'] for i in soup.find_all('a', href=True)]
print(project_href)
This will return all the href instances. As I see in your link, a lot of href tags have # inside them. You can avoid these with a simple regex for proper links, or just ignore the # symbols.
project_href = [i['href'] for i in soup.find_all('a', href=True) if i['href'] != "#"]
This will still give you some trash links like /discover?ref=nav, so if you want to narrow it down use a proper regex for the links you need.
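For example, a sketch of such a filter (the '/projects/' prefix is an assumption about which links you consider proper; adjust the pattern to your case):
import re

# Keep only hrefs that look like project pages, e.g. /projects/...
pattern = re.compile(r'^/projects/')
project_href = [i['href'] for i in soup.find_all('a', href=True)
                if pattern.match(i['href'])]
print(project_href)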
EDIT:
To solve the problem you mentioned in the comments:
soup = BeautifulSoup(thepage, "html.parser")
for i in soup.find_all('div', attrs={'class' : 'project-card-content'}):
    print(i.a['href'])

Fail to get all nodes under a tag with python-BeautifulSoup

The page I am scraping is http://ijcai.org/proceedings/2011. I would like to get the hrefs of all the papers. My code is below:
import requests
from bs4 import BeautifulSoup

response = requests.get('http://ijcai.org/proceedings/2011')
soup = BeautifulSoup(response.content, 'html.parser')
page = soup.find('div', class_='field-item even')
tree = [child for child in page.children]
But when I tried tree[-1], I got:
Erratum
Indeed, that element sits only about halfway down the page. Why did I fail to get the remaining parts of the page? Do you have any ideas? Thank you in advance!
The HTML of this page is not well-formed, so use a different parser, e.g. html5lib (requires html5lib to be installed):
soup = BeautifulSoup(response.content, 'html5lib')
or lxml (requires lxml to be installed):
soup = BeautifulSoup(response.content, 'lxml')
Now tree[-1] would be the last paragraph on the page:
<p>Index / 2871</p>
I would also improve the way you extract the links:
links = [a["href"] for a in soup.select(".field-item a")]
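Note that these hrefs may be relative; a short sketch of resolving them against the page URL with urllib.parse.urljoin:
from urllib.parse import urljoin

base = 'http://ijcai.org/proceedings/2011'
links = [a["href"] for a in soup.select(".field-item a")]

# Resolve relative paths like 'papers/...' to absolute URLs.
absolute = [urljoin(base, href) for href in links]
print(absolute[:5])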
