Beautiful Soup Tutorial Error: jeriwieringa.com - python

I run this code from the tutorial here (http://jeriwieringa.com/blog/2012/11/04/beautiful-soup-tutorial-part-1/):
from bs4 import BeautifulSoup
soup = BeautifulSoup (open("43rd-congress.htm"))
final_link = soup.p.a
final_link.decompose()
links = soup.find_all('a')
for link in links:
names = link.contents[0]
fullLink = link.get('href')
print names
print fullLink
And I get this error:
File "soupexample.py", line 11, in <module>
    fullLink = link.get('href')
NameError: name 'link' is not defined
Why would I need to define link in links for this loop? What's the logic? Thanks for your help.

I guess the mistake comes from here (somehow there is no indentation in the example, and there certainly should be):
for link in links:
    names = link.contents[0]
    fullLink = link.get('href')
    print names
    print fullLink
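To see the indented loop run without the tutorial's HTML file, here is a self-contained sketch: `FakeLink` is a hypothetical stand-in for the tag objects `find_all('a')` would return, and the member names and hrefs are made-up sample data (the original uses Python 2 `print` statements; this sketch uses Python 3 calls).

```python
# FakeLink mimics the two pieces of a parsed <a> tag the loop uses:
# .contents (list of children) and .get() (attribute lookup, None if absent).
class FakeLink:
    def __init__(self, text, href):
        self.contents = [text]
        self._href = href

    def get(self, attr):
        return self._href if attr == "href" else None

links = [FakeLink("ADAMS, George Madison", "/member/adams-george"),
         FakeLink("ALBERT, William Julian", "/member/albert-william")]

for link in links:
    # every statement belonging to the loop body must share this indent
    names = link.contents[0]
    fullLink = link.get("href")
    print(names)
    print(fullLink)
```

Without the indent, `names = ...` and the following lines sit outside the loop, so `link` is not a defined name when they execute.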

Related

How to fix this Python code to extract full links from a webpage? The current code extracts partial links

I am a beginner with Python, using BeautifulSoup to extract links from the following webpage: https://mhealthfairview.org/locations/m-health-fairview-st-johns-hospital. My code so far is as follows:
import urllib.request
from bs4 import BeautifulSoup

html_page = urllib.request.urlopen("https://mhealthfairview.org/locations/m-health-fairview-st-johns-hospital")
soup = BeautifulSoup(html_page, "html.parser")
for link in soup.find_all('a'):
    print(link.get('href'))
The outputs include partial links, such as "/providers", etc. It should be "https://mhealthfairview.org/providers". Is there any way I can extract the full link rather than the partial link? Thank you.
Use urllib.parse.urljoin:
import urllib.request
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "https://mhealthfairview.org/locations/m-health-fairview-st-johns-hospital"
html_page = urllib.request.urlopen(url)
soup = BeautifulSoup(html_page, "html.parser")
for link in soup.find_all('a'):
    print(urljoin(url, link.get('href')))
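`urljoin` covers both cases at once: a relative href is resolved against the page URL, while an href that is already absolute passes through unchanged. A quick stdlib-only check:

```python
from urllib.parse import urljoin

base = "https://mhealthfairview.org/locations/m-health-fairview-st-johns-hospital"

# Root-relative href: resolved against the scheme and host of the base URL.
print(urljoin(base, "/providers"))             # https://mhealthfairview.org/providers

# Already-absolute href: returned unchanged.
print(urljoin(base, "https://example.com/x"))  # https://example.com/x
```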
You can simply use an if (inside the loop over the links):
webroot = 'https://mhealthfairview.org'
href = link.get('href')
if href and href[0] == "/":  # guard against anchors with no href
    print(webroot + href)

How do I fix this AttributeError in Python?

I'm having a problem with my code. I'm trying to extract the listed jobs on this website (https://www.local.ch/en/q/geneve/employment%20agency?slot=yellow) with the names of the Company and the link to their information. The first part works, I am able to print all the names but then printing the link to its information gives me the error:
Traceback (most recent call last):
  File "main.py", line 20, in <module>
    href = (links.get("href"))
  File "/opt/virtualenvs/python3/lib/python3.8/site-packages/bs4/element.py", line 921, in __getattr__
    raise AttributeError(
AttributeError: 'NavigableString' object has no attribute 'get'
This is my code:
print("Hello, welcome to local job in geneva finder")
import requests
from bs4 import BeautifulSoup

url = "https://www.local.ch/en/q/geneve/employment%20agency?slot=yellow"
response = requests.get(url)
html = response.text
soup = BeautifulSoup(html, "html.parser")
names = soup.findAll("h2")
for name in names:
    print(name.text)
link = soup.find("a")
for links in link:
    href = (links.get("href"))
    if href.startswith("https://www.local.ch/en/d/geneve/1204/recruiting"):
        print(href)
Use findAll to extract all <a> tags.
links = soup.findAll("a")
Iterate over links instead of names to get the href from every <a> tag.
link.get("href") can also return None when an <a> tag has no href attribute, so add a condition that checks whether it is None.
Complete Code:
print("Hello, welcome to local job in geneva finder")
import requests
from bs4 import BeautifulSoup

url = "https://www.local.ch/en/q/geneve/employment%20agency?slot=yellow"
response = requests.get(url)
html = response.text
soup = BeautifulSoup(html, "html.parser")
names = soup.findAll("h2")
for name in names:
    print(name.text)
links = soup.findAll("a")
for link in links:
    href = link.get("href")
    if href:
        if href.startswith("https://www.local.ch/en/d/geneve/1204/recruiting"):
            print(href)
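As a side note, the None check and the prefix filter can also be folded into a single CSS selector using `select()` and the `^=` (starts-with) attribute operator. A sketch on inline sample markup (the anchors here are made up, not taken from the live page):

```python
from bs4 import BeautifulSoup

# Sample markup standing in for the live page: one matching link,
# one non-matching link, and one <a> with no href at all.
html = """
<a href="https://www.local.ch/en/d/geneve/1204/recruiting/acme">Acme</a>
<a href="https://www.local.ch/en/other">Other</a>
<a>No href</a>
"""
soup = BeautifulSoup(html, "html.parser")

# a[href^="..."] matches only anchors whose href starts with the prefix,
# so anchors without an href are excluded automatically.
for link in soup.select('a[href^="https://www.local.ch/en/d/geneve/1204/recruiting"]'):
    print(link["href"])
```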

Get links related to a given keyword (python)

import requests, urllib
from bs4 import BeautifulSoup

keyword = 'hello'
r = requests.get(f'https://www.google.com/search?q={keyword}')
soup = BeautifulSoup(r.text, "html.parser")
links = []
for item in soup.find_all('h3', attrs={'class': 'r'}):
    links.append(item.a['href'])
print(links)
I have this code, but it isn't working. Any help? There are no errors, but it can't find any links related to the given keyword.
Output:
[]
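The empty list itself is easy to reproduce: `find_all` returns `[]` whenever nothing in the parsed HTML matches, which is what happens if the page Google actually serves does not mark its result headings with class="r" (an assumption about the cause; the thread records no answer). A sketch on inline HTML, not the live page:

```python
from bs4 import BeautifulSoup

# Markup whose h3 carries a different class than the one the code searches for.
html = '<h3 class="xyz"><a href="https://example.com">hello</a></h3>'
soup = BeautifulSoup(html, "html.parser")

print(soup.find_all('h3', attrs={'class': 'r'}))    # [] -- class does not match
print(soup.find_all('h3', attrs={'class': 'xyz'}))  # one matching tag
```

Inspecting `r.text` to see what classes the response really uses is the first diagnostic step.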

Python 3 web scraping prints the same link for every headline

I'm trying to print the link and text of each headline from the following website, http://www.infobolsa.es/news, but when I run the code I keep getting the same output: the correct headline text, but the same link every time. Here is the relevant part of the code, thank you:
from urllib.request import urlopen

html_page = urlopen("http://www.infobolsa.es/news")
soup = BeautifulSoup(html_page, 'lxml')
links = list()
for titleM in bodyDictWeb2:
    for link in soup.findAll('a', attrs={'href': re.compile("^/news/detail")}):
        print(link)
        bodyDictWeb2[titleM] = link.get('href')
        break
for k, v in bodyDictWeb2.items():
    print(k, ":", v)
I have solved it; here is the code that works:
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

html_page = urlopen("http://www.infobolsa.es/news")
soup = BeautifulSoup(html_page, 'lxml')
links = list()
for titleM in bodyDictWeb2:
    for link in soup.findAll('a', attrs={'href': re.compile("^/news/detail")}):
        print(link.text, link.get('href'))
    break
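The solved code still relies on a `bodyDictWeb2` defined elsewhere in the script. A self-contained variant of the same idea, collecting headline text and href into a dict directly (sketched on inline sample markup rather than the live infobolsa page):

```python
import re
from bs4 import BeautifulSoup

# Inline sample standing in for the news page: two matching links, one not.
html = """
<a href="/news/detail/1">Headline one</a>
<a href="/news/detail/2">Headline two</a>
<a href="/other">Not news</a>
"""
soup = BeautifulSoup(html, "html.parser")

headlines = {}
for link in soup.find_all('a', attrs={'href': re.compile("^/news/detail")}):
    headlines[link.text] = link.get('href')

for k, v in headlines.items():
    print(k, ":", v)
```

Because each headline is keyed by its own text, every entry keeps its own href instead of all entries being overwritten with the same link.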

Python can not get links from web page

I am writing a Python script that gets links from a website, but with this web page I was unable to get any links. My script is:
soup = BeautifulSoup(urllib2.urlopen(url))
datas = soup.findAll('div', attrs={'class': 'tsrImg'})
for data in datas:
    link = data.find('a')
    print str(link.href)
It prints only None; can anyone explain why?
Change:
str(link.href)
to:
link.get('href')
It will look like this:
from BeautifulSoup import BeautifulSoup
import urllib2

url = 'http://www.meinpaket.de/de/shopsList.html?page=1'
soup = BeautifulSoup(urllib2.urlopen(url))
datas = soup.findAll('div', {'class': 'tsrImg'})
for data in datas:
    link = data.find('a')
    print link.get('href')
Outputs:
/de/~-office-partner-gmbh-;jsessionid=11957F27FC2D888A34532D9848C922FB.as03
/de/~-24selling-de;jsessionid=11957F27FC2D888A34532D9848C922FB.as03
/de/~abalisi-kuenstlerbedarf-shop;jsessionid=11957F27FC2D888A34532D9848C922FB.as03
/de/~abcmeineverpackung-de-kg;jsessionid=11957F27FC2D888A34532D9848C922FB.as03
/de/~ability;jsessionid=11957F27FC2D888A34532D9848C922FB.as03
/de/~ac-foto-handels-gmbh;jsessionid=11957F27FC2D888A34532D9848C922FB.as03
/de/~ac-sat-corner-inh-dirk-hahn;jsessionid=11957F27FC2D888A34532D9848C922FB.as03
/de/~adamo-fashion-gmbh-shop;jsessionid=11957F27FC2D888A34532D9848C922FB.as03
/de/~adapter-markt;jsessionid=11957F27FC2D888A34532D9848C922FB.as03
/de/~adko;jsessionid=11957F27FC2D888A34532D9848C922FB.as03
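The reason `link.href` prints None rather than raising an error is that dotted attribute access on a tag searches for a *child tag* with that name, not an HTML attribute, and `find` returns None when no such child exists. A quick check using the modern `bs4` package (the answer above uses the old BeautifulSoup 3 / urllib2 imports; the markup here is a made-up sample):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div><a href="/de/shop">shop</a></div>', "html.parser")
link = soup.find('a')

print(link.href)         # None -- looks for a child <href> tag, which doesn't exist
print(link.get('href'))  # /de/shop -- reads the HTML attribute, None if absent
print(link['href'])      # /de/shop -- dict-style access, KeyError if absent
```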
