How to scrape href with Python 3.5 and BeautifulSoup [duplicate] - python

I want to scrape the href of every project from the website https://www.kickstarter.com/discover/advanced?category_id=16&woe_id=23424829&sort=magic&seed=2449064&page=1 with Python 3.5 and BeautifulSoup.
Here is my code:
#Loading Libraries
import urllib
import urllib.request
from bs4 import BeautifulSoup
#define URL for scraping
theurl = "https://www.kickstarter.com/discover/advanced?category_id=16&woe_id=23424829&sort=magic&seed=2449064&page=1"
thepage = urllib.request.urlopen(theurl)
#Cooking the Soup
soup = BeautifulSoup(thepage,"html.parser")
#Scraping "Link" (href)
project_ref = soup.findAll('h6', {'class': 'project-title'})
project_href = [project.findChildren('a')[0].href for project in project_ref if project.findChildren('a')]
print(project_href)
I get [None, None, ..., None, None] back.
I need a list of all the hrefs from the project-title class.
Any ideas?

Try something like this:
import urllib.request
from bs4 import BeautifulSoup
theurl = "https://www.kickstarter.com/discover/advanced?category_id=16&woe_id=23424829&sort=magic&seed=2449064&page=1"
thepage = urllib.request.urlopen(theurl)
soup = BeautifulSoup(thepage, "html.parser")
project_href = [i['href'] for i in soup.find_all('a', href=True)]
print(project_href)
This will return all the href values. As I see on your page, a lot of the href attributes contain just #. You can avoid these with a simple regex for proper links, or just ignore the # symbols.
project_href = [i['href'] for i in soup.find_all('a', href=True) if i['href'] != "#"]
This will still give you some junk links like /discover?ref=nav, so if you want to narrow it down, use a proper regex for the links you need.
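For example, a minimal sketch that keeps only hrefs that look like Kickstarter project pages (the /projects/ path prefix is an assumption about the site's URL scheme, so adjust it to the real markup):
import re
# assumed pattern: project pages live under /projects/<creator>/<slug>
project_pattern = re.compile(r'^/projects/')
project_href = [i['href'] for i in soup.find_all('a', href=project_pattern)]
print(project_href)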
EDIT:
To solve the problem you mentioned in the comments:
soup = BeautifulSoup(thepage, "html.parser")
for i in soup.find_all('div', attrs={'class': 'project-card-content'}):
    print(i.a['href'])
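The hrefs you get this way are usually relative paths and may carry ?ref=... query strings; a minimal sketch that turns them into clean absolute URLs (urljoin and urlparse are from the standard library, and the markup is the project-card-content structure used above):
from urllib.parse import urljoin, urlparse, urlunparse
for i in soup.find_all('div', attrs={'class': 'project-card-content'}):
    absolute = urljoin(theurl, i.a['href'])  # resolve the relative path against the page URL
    print(urlunparse(urlparse(absolute)._replace(query='')))  # drop ?ref=... tracking parameters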

Web scraping with beautiful soup [duplicate]

I am trying to scrape links from this page (https://www.setlist.fm/search?query=nightwish).
While this code does retrieve the links I want, it also comes back with a load of other stuff I don't want.
Examples of what I want:
setlist/nightwish/2022/quarterback-immobilien-arena-leipzig-germany-2bbca8f2.html
setlist/nightwish/2022/brose-arena-bamberg-germany-3bf4963.html
setlist/nightwish/2022/arena-gliwice-gliwice-poland-3bc9dc7.html
Can I use Beautiful Soup to get these links, or do I need to use regex?
import requests
import bs4

url = 'https://www.setlist.fm/search?query=nightwish'
reqs = requests.get(url)
soup = bs4.BeautifulSoup(reqs.text, 'html.parser')
urls = []
for link in soup.select('a'):
    urls.append(link)
    print(link.get('href'))
Please check to see if the code snippet below is useful.
import requests
from bs4 import BeautifulSoup
url = 'https://www.setlist.fm/search?query=nightwish'
html_content = requests.get(url).text
soup = BeautifulSoup(html_content, "lxml")
for g_data in soup.find_all('a', {'class': 'link-primary'}, href=True):
    print(g_data['href'])
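If you only want the setlist links and not every link-primary anchor, a minimal sketch that filters on the href prefix (assuming the setlist URLs all start with setlist/, as in the examples you gave):
import re
for g_data in soup.find_all('a', href=re.compile(r'^setlist/')):
    print(g_data['href'])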

How to get specific urls from a website in a class tag with beautiful soup? (Python)

I'm trying to get the URLs of the main articles from a news outlet using Beautiful Soup. Since I do not want to get ALL of the links on the entire page, I specified the class. My code only manages to display the titles of the news articles, not the links. This is the website: https://www.reuters.com/news/us
Here is what I have so far:
import requests
from bs4 import BeautifulSoup
req = requests.get('https://www.reuters.com/news/us').text
soup = BeautifulSoup(req, 'html.parser')
links = soup.findAll("h3", {"class": "story-title"})
for i in links:
    print(i.get_text().strip())
    print()
Any help is greatly appreciated!
To get links to all the articles, you can use the following code:
import requests
from bs4 import BeautifulSoup
req = requests.get('https://www.reuters.com/news/us').text
soup = BeautifulSoup(req, 'html.parser')
links = soup.findAll("div", {"class": "story-content"})
for i in links:
    print(i.a.get('href'))
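The hrefs you get back are typically relative paths; a minimal sketch that resolves them into absolute URLs with the standard library's urljoin (the base URL is taken from the question):
from urllib.parse import urljoin
base = 'https://www.reuters.com'
for i in soup.findAll("div", {"class": "story-content"}):
    print(urljoin(base, i.a.get('href')))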

'NoneType' object is not callable in Beautiful Soup 4

I'm new-ish to Python and started experimenting with Beautiful Soup 4. I tried writing code that would get all the links on one page and then, with those links, repeat the process until I have parsed an entire website.
import bs4 as bs
import urllib.request as url

links_unclean = []
links_clean = []

soup = bs.BeautifulSoup(url.urlopen('https://pythonprogramming.net/parsememcparseface/').read(), 'html.parser')

for url in soup.find_all('a'):
    print(url.get('href'))
    links_unclean.append(url.get('href'))

for link in links_unclean:
    if (link[:8] == 'https://'):
        links_clean.append(link)

print(links_clean)

while True:
    for link in links_clean:
        soup = bs.BeautifulSoup(url.urlopen(link).read(), 'html.parser')
        for url in soup.find_all('a'):
            print(url.get('href'))
            links_unclean.append(url.get('href'))
        for link in links_unclean:
            if (link[:8] == 'https://'):
                links_clean.append(link)
    links_clean = list(dict.fromkeys(links_clean))
    input()
But I'm now getting this error:
'NoneType' object is not callable
line 20, in
soup = bs.BeautifulSoup(url.urlopen(link).read(), 'html.parser')
Can you please help?
Be careful when importing modules as something. In this case, the name url from line 2 (your alias for urllib.request) is rebound inside your for loop when you iterate, so by the time you call url.urlopen(link) it no longer refers to the module.
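A minimal sketch of what goes wrong (illustrative only, not your full script): once the loop rebinds url to a bs4 Tag, attribute lookup on the Tag searches for a child tag of that name and returns None when there isn't one, so url.urlopen becomes None and calling it raises the error you see:
import bs4 as bs
import urllib.request as url

soup = bs.BeautifulSoup('<a href="https://example.com">x</a>', 'html.parser')
for url in soup.find_all('a'):  # rebinds the module alias url to a Tag
    pass

print(type(url))    # <class 'bs4.element.Tag'>, no longer the module
print(url.urlopen)  # None: Tag attribute lookup finds no <urlopen> child tag
# url.urlopen('...')  would raise TypeError: 'NoneType' object is not callable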
Here is a shorter solution that will also give back only URLs containing https as part of the href attribute:
from bs4 import BeautifulSoup
from urllib.request import urlopen
content = urlopen('https://pythonprogramming.net/parsememcparseface/')
soup = BeautifulSoup(content, "html.parser")
base = soup.find('body')
for link in BeautifulSoup(str(base), "html.parser").findAll("a"):
    if 'href' in link.attrs:
        if 'https' in link['href']:
            print(link['href'])
However, this paints an incomplete picture: not all links are captured, because of HTML errors on the page. May I also recommend the following alternative, which is very simple and works flawlessly in your scenario (note: you will need the Requests-HTML package):
from requests_html import HTML, HTMLSession
session = HTMLSession()
r = session.get('https://pythonprogramming.net/parsememcparseface/')
for link in r.html.absolute_links:
    print(link)
This will output all URLs, both those pointing to other pages on the same domain and those pointing to external websites.
I would consider using an [attribute^=value] CSS selector, where the ^ operator specifies that the href attribute begins with https. You will then only have valid protocols. Also, use set comprehensions to ensure no duplicates, and a Session to re-use the connection.
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
final = []

with requests.Session() as s:
    r = s.get('https://pythonprogramming.net/parsememcparseface/')
    soup = bs(r.content, 'lxml')
    httpsLinks = {item['href'] for item in soup.select('[href^=https]')}
    for link in httpsLinks:
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        newHttpsLinks = [item['href'] for item in soup.select('[href^=https]')]
        final.append(newHttpsLinks)

tidyList = list({item for sublist in final for item in sublist})
df = pd.DataFrame(tidyList)
print(df)

How to grab titles from webpages using Beautiful Soup in Python and iterating through

I am using bs4 in Python to parse web pages and get information. I am having trouble grabbing just the title. Another part I struggled with was following the links: should this be done recursively, or would I be able to do it through a loop?
def getTitle(link):
    resp = urllib.request.urlopen(link)
    soup = BeautifulSoup(resp, 'html.parser')
    print(soup.find("<title>"))
from bs4 import BeautifulSoup
import urllib.request

def getTitle(link):
    resp = urllib.request.urlopen(link)
    soup = BeautifulSoup(resp, 'html.parser')
    return soup.title.text
print(getTitle('http://www.bbc.co.uk/news'))
Which displays:
Home - BBC News
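As for the second part of the question: you don't need recursion to follow a flat list of links; a plain loop over getTitle is enough. A minimal sketch reusing the getTitle function above (the second URL is just a placeholder):
links = ['http://www.bbc.co.uk/news', 'https://pythonprogramming.net/']
for link in links:
    print(getTitle(link))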

BeautifulSoup returns empty list

I am new to scraping with Python. I am using BeautifulSoup to extract quotes from a website, and here's my code:
#!/usr/bin/python3
from urllib.request import urlopen
from bs4 import BeautifulSoup
r = urlopen("http://quotes.toscrape.com/tag/inspirational/")
bsObj = BeautifulSoup(r, "lxml")
links = bsObj.find_all("div", {"class:" "quote"})
print(links)
It returns:
[]
But when I try this:
for link in links:
    print(link)
It returns nothing.
(Note: this happened to me for every website.)
Edit: the purpose of the code above is just to return a Tag, not the text (the quote).
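A likely cause, for anyone hitting the same empty list: the colon in the attrs argument is misplaced. Python concatenates the adjacent string literals, so {"class:" "quote"} is a set containing the single string "class:quote" rather than the dict {"class": "quote"}, and no div on the page matches it. A minimal sketch of the corrected call (same page and parser as in the question):
from urllib.request import urlopen
from bs4 import BeautifulSoup

r = urlopen("http://quotes.toscrape.com/tag/inspirational/")
bsObj = BeautifulSoup(r, "lxml")
links = bsObj.find_all("div", {"class": "quote"})  # note: colon inside the braces, a real dict
print(len(links))  # should now be non-zero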
