Getting href using beautiful soup with different methods - python

I'm trying to scrape a website. I learned to scrape from two resources: one used tag.get('href') to get the href from an a tag, and one used tag['href'] to get the same. As far as I understand it, they both do the same thing. But when I tried this code:
link_list = [l.get('href') for l in soup.find_all('a')]
it worked with the .get method, but not with the dictionary access way.
link_list = [l['href'] for l in soup.find_all('a')]
This throws a KeyError. I'm very new to scraping, so please pardon if this is a silly one.
Edit - Both of the methods worked for the find method instead of find_all.

You may let BeautifulSoup find the links with existing href attributes only.
test
You can do it in two common ways, via find_all():
link_list = [a['href'] for a in soup.find_all('a', href=True)]
Or, with a CSS selector:
link_list = [a['href'] for a in soup.select('a[href]')]

Maybe HTML-string does not have a "href"?
For example:
from bs4 import BeautifulSoup
doc_html = """<a class="vote-up-off" title="This question shows research effort; it is useful and clear">up vote</a>"""
soup = BeautifulSoup(doc_html, 'html.parser')
ahref = soup.find('a')
ahref.get('href')
Nothing will happen, but
ahref['href']
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/sergey/.virtualenvs/soup_example/lib/python3.5/site-
packages/bs4/element.py", line 1011, in __getitem__
return self.attrs[key]
KeyError: 'href'
'href'

Related

Error: ResultSet object has no attribute 'endswith'

I'm trying to scrape the unique links from a website, but when I do, I get the following error and I'm not sure what's causing it.
ResultSet object has no attribute 'endswith'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
I tried changing the url to see if it was the link, and it didn't work --which I wasn't surprised, but I wanted to check.
I looked at the documentation(https://www.crummy.com/software/BeautifulSoup/bs4/doc/#miscellaneous) and if I'm understanding it correctly, it's saying to use find instead of findall. I tried using find instead, but that didn't pull up anything, but even if it did, it wouldn't pull up what I'm looking for since I'm wanting all unique links.
Anyway, here's the code. Any ideas or places I can look to understand this error more?
import requests
from bs4 import BeautifulSoup
import urllib.request
import urllib.parse
url="https://www.census.gov/programs-surveys/popest.html"
r=requests.get(url)
soup= BeautifulSoup(r.content, "html.parser")
links = soup.find_all("a")
for link in links:
link.get("href")
def unique_links(tags,url):
cleaned_links = set()
for link in links:
link = link.get("href")
if link is None:
continue
if link.endswith('/') or links.endswith('#'):
link = link [-1]
actual_url = urllib.parse.urljoin(url,link)
cleaned_links.add(actual_url)
return cleaned_links
cleaned_links = unique_links(links, url)
There is a typo in your code: link.endswith('#'): instead of links.
if link.endswith('/') or link.endswith('#'):
link = link [-1]

Python - How do I find a link on webpage that has no class?

I am a beginner python programmer and I am trying to make a webcrawler as practice.
Currently I am facing a problem that I cannot find the right solution for. The problem is that I am trying to get a link location/address from a page that has no class, so I have no idea how to filter that specific link.
It is probably better to show you.
The page I am trying to get the link from.
As you can see, I am trying to get what is inside of the href attribute of the "Historical prices" link. Here is my python code:
import requests
from bs4 import BeautifulSoup
def find_historicalprices_link(url):
source = requests.get(url)
text = source.text
soup = BeautifulSoup(text, 'html.parser')
link = soup.find_all('li', 'fjfe-nav-sub')
href = str(link.get('href'))
find_spreadsheet(href)
def find_spreadsheet(url):
source = requests.get(url)
text = source.text
soup = BeautifulSoup(text, 'html.parser')
link = soup.find('a', {'class' : 'nowrap'})
href = str(link.get('href'))
download_spreadsheet(href)
def download_spreadsheet(url):
response = requests.get(url)
text = response.text
lines = text.split("\\n")
filename = r'google.csv'
file = open(filename, 'w')
for line in lines:
file.write(line + "\n")
file.close()
find_historicalprices_link('https://www.google.com/finance?q=NASDAQ%3AGOOGL&ei=3lowWYGRJNSvsgGPgaywDw')
In the function "find_spreadsheet(url)", I could easily filter the link by looking for the class called "nowrap". Unfortunately, the Historical prices link does not have such a class and right now my script just gives me the following error:
AttributeError: ResultSet object has no attribute 'get'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
How do I make sure that my crawler only takes the href from the "Historical prices"?
Thank you in advance.
UPDATE:
I found the way to do it. By only looking for the link with a specific text attached to it, I could find the href I needed.
Solution:
soup.find('a', string="Historical prices")
Does the following code sniplet helps you? I think you can solve your problem with the following code as I hope:
from bs4 import BeautifulSoup
html = """<a href='http://www.google.com'>Something else</a>
<a href='http://www.yahoo.com'>Historical prices</a>"""
soup = BeautifulSoup(html, "html5lib")
urls = soup.find_all("a")
print(urls)
print([a["href"] for a in urls if a.text == "Historical prices"])

Python scraping webpages

I am trying to pull only the links and their text from a webpage line by line and insert text and link into a dictionary. Without using beautiful soup or a regex.
i keep getting this error:
error:
Traceback (most recent call last):
File "F:/Homework7-2.py", line 13, in <module>
link2 = link1.split("href=")[1]
IndexError: list index out of range
code:
import urllib.request
url = "http://www.facebook.com"
page = urllib.request.urlopen(url)
mylinks = {}
links = page.readline().decode('utf-8')
for items in links:
links = page.readline().decode('utf-8')
if "a href=" in links:
links = page.readline().decode('utf-8')
link1 = links.split(">")[0]
link2 = link1.split("href=")[1]
mylinks = link2
print(mylinks)
import requests
from bs4 import BeautifulSoup
r = requests.get("http://stackoverflow.com/questions/29336915/python-scraping-webpages")
# find all a tags with href attributes
for a in BeautifulSoup(r.content).find_all("a",href=True):
# print each href
print(a["href"])
Obviously that is a very broad example but will get you started, if you want specific urls you can narrow your search to certain elements but it will be different for all webpages. You won't find much easier tools to use for parsing than requests and BeautifulSoup

Using beautiful soup 4 to scrape URLS within a <p class="postbody"> tag and save them to a text file

I realize this is probably incredibly straightforward but please bear with me. I'm trying to use beautifulsoup 4 to scrape a website that has a list of blog posts for the urls of those posts. The tag that I want is within an tag. There are multiple tags that include a header and then a link that I want to capture. This is the code I'm working with:
with io.open('TPNurls.txt', 'a', encoding='utf8') as logfile:
snippet = soup.find_all('p', class="postbody")
for link in snippet.find('a'):
fulllink = link.get('href')
logfile.write(fulllink + "\n")
The error I'm getting is:
AttributeError: 'ResultSet' object has no attribute 'find'
I understand that means "head" is a set and beautifulsoup doesn't let me look for tags within a set. But then how can I do this? I need it to find the entire set of tags and then look for the tag within each one and then save each one on a separate line to a file.
The actual reason for the error is that snippet is a result of find_all() call and is basically a list of results, there is no find() function available on it. Instead, you meant:
snippet = soup.find('p', class_="postbody")
for link in snippet.find_all('a'):
fulllink = link.get('href')
logfile.write(fulllink + "\n")
Also, note the use of class_ here - class is a reserved keyword and cannot be used as a keyword argument here. See Searching by CSS class for more info.
Alternatively, make use of CSS selectors:
for link in snippet.select('p.postbody a'):
fulllink = link.get('href')
logfile.write(fulllink + "\n")
p.postbody a would match all a tags inside the p tag with class postbody.
In your code,
snippet = soup.find_all('p', class="postbody")
for link in snippet.find('a'):
Here snippet is a bs4.element.ResultSet type object. So you are getting this error. But the elements of this ResultSet object are bs4.element.Tag type where you can apply find method.
Change your code like this,
snippet = soup.find_all("p", { "class" : "postbody" })
for link in snippet:
if link.find('a'):
fulllink = link.a['href']
logfile.write(fulllink + "\n")

BeautifulSoup get_text from find_all

This is my first work with web scraping. So far I am able to navigate and find the part of the HTML I want. I can print it as well. The problem is printing only the text, which will not work. I get following error, when trying it: AttributeError: 'ResultSet' object has no attribute 'get_text'
Here my code:
from bs4 import BeautifulSoup
import urllib
page = urllib.urlopen('some url')
soup = BeautifulSoup(page)
zeug = soup.find_all('div', attrs={'class': 'fm_linkeSpalte'}).get_text()
print zeug
find_all() returns an array of elements. You should go through all of them and select that one you are need. And than call get_text()
UPD
For example:
for el in soup.find_all('div', attrs={'class': 'fm_linkeSpalte'}):
print el.get_text()
But note that you may have more than one element.
Try for inside the list for getting the data, like this:
zeug = [x.get_text() for x in soup.find_all('div', attrs={'class': 'fm_linkeSpalte'})]
I would close this issue for being a duplicate and link you to another I found that answers this question but I don't think I have the reputation needed to moderate... So...
Original Answer
Code for this:
for el in soup.findAll('div', attrs={'class': 'fm_linkeSpalte'}):
print ''.join(el.findAll(text=True))
If a mod wants to close this question that would be helpful.

Categories

Resources