Google Scraping href values - python

I have a problem finding href values with BeautifulSoup:
from urllib import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen("https://www.google.pl/search?q=sprz%C4%99t+dla+graczy&client=ubuntu&ei=4ypXWsi_BcLZwQKGroW4Bg&start=0&sa=N&biw=741&bih=624")
bsObj = BeautifulSoup(html)
for link in bsObj.find("h3", {"class":"r"}).findAll("a"):
    if 'href' in link.attrs:
        print(link.attrs['href'])
I keep getting this error:
AttributeError: 'NoneType' object has no attribute 'findAll'

You'll have to change the User-Agent string to something other than urllib's default. Google serves a different page to non-browser user agents, so bsObj.find("h3", {"class":"r"}) finds nothing and returns None, and calling findAll on None raises the AttributeError you see.
from urllib2 import urlopen, Request
from bs4 import BeautifulSoup

url = "https://www.google.pl/search?q=sprz%C4%99t+dla+graczy&client=ubuntu&ei=4ypXWsi_BcLZwQKGroW4Bg&start=0&sa=N&biw=741&bih=624"
html = urlopen(Request(url, headers={'User-Agent':'Mozilla/5'})).read()
bsObj = BeautifulSoup(html, 'html.parser')

for link in bsObj.find("h3", {"class":"r"}).findAll("a", href=True):
    print(link['href'])
Also note that this expression will select only the first link. If you want to select all the links in the page use the following expression:
links = bsObj.select("h3.r a[href]")
for link in links:
    print(link['href'])
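If you're on Python 3, where urllib2 no longer exists, a minimal equivalent sketch uses urllib.request instead (same URL as above; the h3.r markup is what Google used at the time and may have changed since):

from urllib.request import urlopen, Request
from bs4 import BeautifulSoup

url = "https://www.google.pl/search?q=sprz%C4%99t+dla+graczy&client=ubuntu&ei=4ypXWsi_BcLZwQKGroW4Bg&start=0&sa=N&biw=741&bih=624"
# Send a browser-like User-Agent so Google returns the full results page
html = urlopen(Request(url, headers={'User-Agent': 'Mozilla/5.0'})).read()
bsObj = BeautifulSoup(html, 'html.parser')

# Select every result link under an h3 with class "r"
for link in bsObj.select("h3.r a[href]"):
    print(link['href'])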

Related

Unable to find element BeautifulSoup

I am trying to parse a specific href link from the following website: https://www.murray-intl.co.uk/en/literature-library.
The element I am trying to parse:
<a class="btn btn--naked btn--icon-left btn--block focus-within" href="https://www.aberdeenstandard.com/docs?editionId=9123afa2-5318-4715-9783-e07d08e2e7cc&_ga=2.12911351.1364356977.1629796255-1577053129.1629192717" target="blank">Portfolio Holding Summary<i class="material-icons btn__icon">library_books</i></a>
However, using BeautifulSoup I am unable to obtain the desired element, perhaps due to cookies acceptance.
from bs4 import BeautifulSoup
import requests

page = requests.get('https://www.murray-intl.co.uk/en/literature-library')
soup = BeautifulSoup(page.content, 'html.parser')
link = soup.find_all('a', class_='btn btn--naked btn--icon-left btn--block focus-within')
url = link[0].get('href')
url
I am still new to BS4 and hope someone can set me on the right course.
Thank you in advance!
To get the correct tags, remove the "focus-within" class (it's added later by JavaScript):
import requests
from bs4 import BeautifulSoup

url = "https://www.murray-intl.co.uk/en/literature-library"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

links = soup.find_all("a", class_="btn btn--naked btn--icon-left btn--block")
for u in links:
    print(u.get_text(strip=True), u.get("href", ""))
Prints:
...
Portfolio Holding Summarylibrary_books https://www.aberdeenstandard.com/docs?editionId=9123afa2-5318-4715-9783-e07d08e2e7cc
...
EDIT: To get only the specified link you can use for example CSS selector:
link = soup.select_one('a:-soup-contains("Portfolio Holding Summary")')
print(link["href"])
Prints:
https://www.aberdeenstandard.com/docs?editionId=9123afa2-5318-4715-9783-e07d08e2e7cc

'NoneType' object is not callable in Beautiful Soup 4

I'm new-ish to Python and started experimenting with Beautiful Soup 4. I tried writing code that would get all the links on one page, then with those links repeat the process until I have an entire website parsed.
import bs4 as bs
import urllib.request as url

links_unclean = []
links_clean = []

soup = bs.BeautifulSoup(url.urlopen('https://pythonprogramming.net/parsememcparseface/').read(), 'html.parser')

for url in soup.find_all('a'):
    print(url.get('href'))
    links_unclean.append(url.get('href'))

for link in links_unclean:
    if (link[:8] == 'https://'):
        links_clean.append(link)

print(links_clean)

while True:
    for link in links_clean:
        soup = bs.BeautifulSoup(url.urlopen(link).read(), 'html.parser')
        for url in soup.find_all('a'):
            print(url.get('href'))
            links_unclean.append(url.get('href'))
        for link in links_unclean:
            if (link[:8] == 'https://'):
                links_clean.append(link)
        links_clean = list(dict.fromkeys(links_clean))
    input()
But I'm now getting this error:

line 20, in
    soup = bs.BeautifulSoup(url.urlopen(link).read(), 'html.parser')
TypeError: 'NoneType' object is not callable

Can you please help?
Be careful when importing modules as something. Here, the module alias url from line 2 (import urllib.request as url) is shadowed by the loop variable in for url in soup.find_all('a'). After that loop runs, url names a Tag rather than the module; attribute access on a Tag is a child-tag search, so url.urlopen evaluates to None, and calling it raises 'NoneType' object is not callable.
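A minimal sketch of the shadowing, stripped of the scraping logic (the variable names match the question; the inline HTML string is just an illustration):

import urllib.request as url
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="https://example.com">x</a>', 'html.parser')

for url in soup.find_all('a'):   # 'url' now names a Tag, not the module
    pass

# Tag attribute access is a child-tag search, so url.urlopen is None here;
# calling it raises TypeError: 'NoneType' object is not callable
print(url.urlopen)  # -> None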
Here is a shorter solution that will also give back only URLs containing https as part of the href attribute:
from bs4 import BeautifulSoup
from urllib.request import urlopen

content = urlopen('https://pythonprogramming.net/parsememcparseface/')
soup = BeautifulSoup(content, "html.parser")
base = soup.find('body')

for link in BeautifulSoup(str(base), "html.parser").findAll("a"):
    if 'href' in link.attrs:
        if 'https' in link['href']:
            print(link['href'])
However, this paints an incomplete picture as not all links are captured because of errors on the page with HTML tags. May I recommend also the following alternative, which is very simple and works flawlessly in your scenario (note: you will need the package Requests-HTML):
from requests_html import HTML, HTMLSession

session = HTMLSession()
r = session.get('https://pythonprogramming.net/parsememcparseface/')

for link in r.html.absolute_links:
    print(link)
This will output all URLs, including both those that reference other URLs on the same domain and those that are external websites.
I would consider using an attribute = value CSS selector with the ^ operator to specify that the href attribute values begin with https. You will then only have valid protocols. Also, use a set comprehension to ensure no duplicates and a Session to re-use the connection.
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd

final = []

with requests.Session() as s:
    r = s.get('https://pythonprogramming.net/parsememcparseface/')
    soup = bs(r.content, 'lxml')
    httpsLinks = {item['href'] for item in soup.select('[href^=https]')}
    for link in httpsLinks:
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        newHttpsLinks = [item['href'] for item in soup.select('[href^=https]')]
        final.append(newHttpsLinks)

tidyList = list({item for sublist in final for item in sublist})
df = pd.DataFrame(tidyList)
print(df)

Pull out href's with beautifulsoup attrs

I am trying something new: pulling out all the hrefs in the a tags. It isn't pulling out the hrefs, though, and I can't figure out why.
import requests
from bs4 import BeautifulSoup

url = "https://www.brightscope.com/ratings/"
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

for href in soup.findAll('a'):
    h = href.attrs['href']
    print(h)
You should check whether the key exists, since an <a> tag may not have an href attribute at all.
import requests
from bs4 import BeautifulSoup

url = "https://www.brightscope.com/ratings/"
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

for a in soup.findAll('a'):
    if 'href' in a.attrs:
        print(a.attrs['href'])
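Alternatively, BeautifulSoup can do the presence check for you: passing href=True to findAll matches only tags that actually have an href attribute. A minimal sketch against the same URL:

import requests
from bs4 import BeautifulSoup

url = "https://www.brightscope.com/ratings/"
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

# href=True filters out <a> tags that have no href attribute
for a in soup.findAll('a', href=True):
    print(a['href'])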

Python Beautiful Soup Parse First Href in Each Div

Given the following code:
# import the module
import bs4 as bs
import urllib.request
import re

masterURL = 'http://www.metrolyrics.com/top100.html'
sauce = urllib.request.urlopen(masterURL).read()
soup = bs.BeautifulSoup(sauce, 'lxml')

for div in soup.findAll('ul', {'class': 'song-list'}):
    for span in div:
        for link in span:
            for a in link:
                print(a)
I can parse multiple divs and get output for each one. My question is: instead of getting the full contents of the div, how can I return only the URL of the href?
Try this. You need to specify the right class to fetch the urls connected to it.
from bs4 import BeautifulSoup
import urllib.request

masterURL = 'http://www.metrolyrics.com/top100.html'
sauce = urllib.request.urlopen(masterURL).read()
soup = BeautifulSoup(sauce, 'lxml')

for div in soup.find_all(class_='subtitle'):
    print(div.get("href"))
Output:
http://www.metrolyrics.com/charles-goose-lyrics.html
http://www.metrolyrics.com/param-singh-lyrics.html
http://www.metrolyrics.com/westlife-lyrics.html
http://www.metrolyrics.com/luis-fonsi-lyrics.html
http://www.metrolyrics.com/grease-lyrics.html
http://www.metrolyrics.com/shanti-dope-lyrics.html
and so on ---
Alternatively, inside your original loop:

if 'href' in a.attrs:
    print(a.attrs['href'])

This will give you what you need.
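To tie this back to the question title (the first href in each song-list entry), here is a minimal sketch combining the original loop with a per-item find('a', href=True), which returns the first matching link or None. The song-list class comes from the question; the site's markup may have changed since:

import bs4 as bs
import urllib.request

masterURL = 'http://www.metrolyrics.com/top100.html'
sauce = urllib.request.urlopen(masterURL).read()
soup = bs.BeautifulSoup(sauce, 'lxml')

for ul in soup.find_all('ul', {'class': 'song-list'}):
    for li in ul.find_all('li'):
        # find() returns the first <a> that has an href, or None
        a = li.find('a', href=True)
        if a:
            print(a['href'])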

BeautifulSoup returns empty list

I am new to scraping with python. I am using the BeautifulSoup to extract quotes from a website and here's my code:
#!/usr/bin/python3
from urllib.request import urlopen
from bs4 import BeautifulSoup
r = urlopen("http://quotes.toscrape.com/tag/inspirational/")
bsObj = BeautifulSoup(r, "lxml")
links = bsObj.find_all("div", {"class:" "quote"})
print(links)
It returns:
[]
But when I try this:
for link in links:
    print(link)
It prints nothing.
(Note: this happens to me with every website.)
Edit: the purpose of the code above is just to return a Tag, not the text of the quote.
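The empty list most likely comes from a typo in the attrs argument: in {"class:" "quote"} the colon sits inside the string, so Python's implicit string concatenation makes this a set containing the single string "class:quote", not a dict. BeautifulSoup treats a non-dict attrs value as a search on the class attribute, and no div has the class "class:quote", so find_all matches nothing. A minimal sketch of the likely fix (same URL as above):

from urllib.request import urlopen
from bs4 import BeautifulSoup

r = urlopen("http://quotes.toscrape.com/tag/inspirational/")
bsObj = BeautifulSoup(r, "lxml")

# The colon belongs after the string: a dict {"class": "quote"}, not a set
links = bsObj.find_all("div", {"class": "quote"})
print(links)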
