I am working on scraping links from a Christmas tree farm website. First, I used this tutorial method to get all the links. Then I noticed that the links I wanted did not lead with the proper hypertext transfer protocol, so I created a variable to concatenate onto them. Now I am trying to create an if statement that grabs each link and looks for any two characters followed by "xmastrees.php". If that is true, then my concatenate variable is added to the front of it; if the link does not contain that specific text, it is deleted. For example, NYxmastrees.php will become http://www.pickyourownchristmastree.org/NYxmastrees.php and ../disclaimer.htm will be removed. I've tried multiple ways, but can't seem to find the right one.
Here is what I currently have. I keep running into a syntax error on the line with del. When I comment out that line, I get another error saying my string object has no attribute 're'. This confuses me because I thought I could use regex with strings?
source = requests.get('http://www.pickyourownchristmastree.org/').text
soup = BeautifulSoup(source, 'lxml')
concatenate = 'http://www.pickyourownchristmastree.org/'
find_state_group = soup.find('div', class_ = 'alert')
for link in find_state_group.find_all('a', href=True):
    if link['href'].re.search('^.\B.\$xmastrees'):
        states = concatenate + link
    else del link['href']
    print(link['href'])
Error with else del link['href']:
else del link['href']
^
SyntaxError: invalid syntax
Error without else del link['href']:
if link['href'].re.search('^.\B.\$xmastrees'):
AttributeError: 'str' object has no attribute 're'
You can try using:
import requests
from bs4 import BeautifulSoup as bs
u = "http://www.pickyourownchristmastree.org/"
soup = bs(requests.get(u).text, 'html5lib')
find_state_group = soup.find('div', {"class": 'alert'})
for link in find_state_group.find_all('a', href=True):
    if "mastrees" in link['href']:
        states = u + link['href']
        print(states)
http://www.pickyourownchristmastree.org/ALxmastrees.php
http://www.pickyourownchristmastree.org/AZxmastrees.php
http://www.pickyourownchristmastree.org/AKxmastrees.php
...
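If the goal really was a regex test ("any two characters followed by xmastrees.php"), re.search can be applied to the href string directly; strings themselves have no .re attribute. A minimal sketch, using a hypothetical list of hrefs in place of the live page:

```python
import re

base = 'http://www.pickyourownchristmastree.org/'
# Hypothetical hrefs standing in for the scraped links.
hrefs = ['NYxmastrees.php', '../disclaimer.htm', 'AZxmastrees.php']

# Two characters, then "xmastrees.php", anchored at both ends.
pattern = re.compile(r'^.{2}xmastrees\.php$')
states = [base + h for h in hrefs if pattern.search(h)]
print(states)
```

Links that do not match (such as ../disclaimer.htm) are simply dropped by the list comprehension, so there is no need for del at all.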
I'm trying to scrape the unique links from a website, but when I do, I get the following error and I'm not sure what's causing it.
ResultSet object has no attribute 'endswith'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
I tried changing the url to see if it was the link, and it didn't work, which didn't surprise me, but I wanted to check.
I looked at the documentation (https://www.crummy.com/software/BeautifulSoup/bs4/doc/#miscellaneous) and, if I'm understanding it correctly, it's saying to use find() instead of find_all(). I tried using find() instead, but that didn't pull up anything; even if it had, it wouldn't pull up what I'm looking for, since I want all unique links.
Anyway, here's the code. Any ideas or places I can look to understand this error more?
import requests
from bs4 import BeautifulSoup
import urllib.request
import urllib.parse
url="https://www.census.gov/programs-surveys/popest.html"
r=requests.get(url)
soup= BeautifulSoup(r.content, "html.parser")
links = soup.find_all("a")
for link in links:
    link.get("href")

def unique_links(tags, url):
    cleaned_links = set()
    for link in links:
        link = link.get("href")
        if link is None:
            continue
        if link.endswith('/') or links.endswith('#'):
            link = link [-1]
        actual_url = urllib.parse.urljoin(url, link)
        cleaned_links.add(actual_url)
    return cleaned_links

cleaned_links = unique_links(links, url)
There is a typo in your code: links.endswith('#') should be link.endswith('#'). Note also that link[-1] keeps only the last character of the string; to drop the trailing character you want link[:-1]:
if link.endswith('/') or link.endswith('#'):
    link = link[:-1]
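Putting both fixes together, a corrected unique_links might look like the sketch below. It is written to work on plain href strings so it can be tried without fetching the live census.gov page; the example hrefs are made up:

```python
import urllib.parse

def unique_links(hrefs, url):
    # Deduplicate hrefs, drop a trailing '/' or '#', and resolve relative URLs.
    cleaned_links = set()
    for href in hrefs:
        if href is None:
            continue
        if href.endswith('/') or href.endswith('#'):
            href = href[:-1]  # strip the trailing character, not keep it
        cleaned_links.add(urllib.parse.urljoin(url, href))
    return cleaned_links

# Hypothetical hrefs in place of the scraped <a> tags.
print(unique_links(['a/', 'a', 'b.html', None],
                   'https://www.census.gov/programs-surveys/popest.html'))
```

Because the function returns a set, 'a/' and 'a' collapse to a single entry after the trailing slash is stripped.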
I am a beginner python programmer and I am trying to make a webcrawler as practice.
Currently I am facing a problem that I cannot find the right solution for. The problem is that I am trying to get a link location/address from a page that has no class, so I have no idea how to filter that specific link.
It is probably better to show you.
The page I am trying to get the link from.
As you can see, I am trying to get what is inside of the href attribute of the "Historical prices" link. Here is my python code:
import requests
from bs4 import BeautifulSoup
def find_historicalprices_link(url):
    source = requests.get(url)
    text = source.text
    soup = BeautifulSoup(text, 'html.parser')
    link = soup.find_all('li', 'fjfe-nav-sub')
    href = str(link.get('href'))
    find_spreadsheet(href)

def find_spreadsheet(url):
    source = requests.get(url)
    text = source.text
    soup = BeautifulSoup(text, 'html.parser')
    link = soup.find('a', {'class' : 'nowrap'})
    href = str(link.get('href'))
    download_spreadsheet(href)

def download_spreadsheet(url):
    response = requests.get(url)
    text = response.text
    lines = text.split("\\n")
    filename = r'google.csv'
    file = open(filename, 'w')
    for line in lines:
        file.write(line + "\n")
    file.close()

find_historicalprices_link('https://www.google.com/finance?q=NASDAQ%3AGOOGL&ei=3lowWYGRJNSvsgGPgaywDw')
In the function "find_spreadsheet(url)", I could easily filter the link by looking for the class called "nowrap". Unfortunately, the Historical prices link does not have such a class and right now my script just gives me the following error:
AttributeError: ResultSet object has no attribute 'get'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
How do I make sure that my crawler only takes the href from the "Historical prices"?
Thank you in advance.
UPDATE:
I found the way to do it. By only looking for the link with a specific text attached to it, I could find the href I needed.
Solution:
soup.find('a', string="Historical prices")
Does the following code snippet help you? I think you can solve your problem with it:
from bs4 import BeautifulSoup
html = """<a href='http://www.google.com'>Something else</a>
<a href='http://www.yahoo.com'>Historical prices</a>"""
soup = BeautifulSoup(html, "html5lib")
urls = soup.find_all("a")
print(urls)
print([a["href"] for a in urls if a.text == "Historical prices"])
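If BeautifulSoup is not available, the same idea of finding a link by its visible text can be sketched with the standard library's html.parser. The HTML below is made up, mirroring the snippet above:

```python
from html.parser import HTMLParser

class LinkByText(HTMLParser):
    """Remember the href of the <a> tag whose text matches `wanted`."""
    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.current_href = None  # href of the <a> tag we are inside, if any
        self.href = None          # the answer, once found
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.current_href = dict(attrs).get('href')
    def handle_data(self, data):
        if self.current_href is not None and data.strip() == self.wanted:
            self.href = self.current_href
    def handle_endtag(self, tag):
        if tag == 'a':
            self.current_href = None

parser = LinkByText("Historical prices")
parser.feed("""<a href='http://www.google.com'>Something else</a>
<a href='http://www.yahoo.com'>Historical prices</a>""")
print(parser.href)
```

This is only a sketch; for anything beyond trivial HTML, soup.find('a', string="Historical prices") as in the update above is the simpler tool.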
I am wondering how I would open another page in my list with BeautifulSoup? I have followed this tutorial, but it does not explain how to open another page on the list. Also, how would I open an a href that is nested inside a class?
Here is my code:
# coding: utf-8
import requests
from bs4 import BeautifulSoup
r = requests.get("")
soup = BeautifulSoup(r.content)
soup.find_all("a")

for link in soup.find_all("a"):
    print link.get("href")

for link in soup.find_all("a"):
    print link.text

for link in soup.find_all("a"):
    print link.text, link.get("href")

g_data = soup.find_all("div", {"class":"listing__left-column"})

for item in g_data:
    print item.contents

for item in g_data:
    print item.contents[0].text
    print link.get('href')

for item in g_data:
    print item.contents[0]
I am trying to collect the href's from the titles of each business, and then open them and scrape that data.
I am still not sure where you are getting the HTML from, but if you are trying to extract all of the href values, then the following approach should work, based on the image you have posted:
import requests
from bs4 import BeautifulSoup
r = requests.get("<add your URL here>")
soup = BeautifulSoup(r.content)
for a_tag in soup.find_all('a', class_='listing-name', href=True):
    print 'href: ', a_tag['href']
By adding href=True to the find_all(), it ensures that only a elements that contain an href attribute are returned therefore removing the need to test for it as an attribute.
Just to warn you: some websites will lock you out after one or two attempts, as they can detect that you are accessing the site via a script rather than as a human. If you feel you are not getting the correct responses, I would recommend printing the HTML you are getting back to ensure it is still what you expect.
If you then want to get the HTML for each of the links, the following could be used:
import requests
from bs4 import BeautifulSoup
# Configure this to be your first request URL
r = requests.get("http://www.mywebsite.com/search/")
soup = BeautifulSoup(r.content)

for a_tag in soup.find_all('a', class_='listing-name', href=True):
    print 'href: ', a_tag['href']

# Configure this to the root of the above website, e.g. 'http://www.mywebsite.com'
base_url = "http://www.mywebsite.com"

for a_tag in soup.find_all('a', class_='listing-name', href=True):
    print '-' * 60      # Add a line of dashes
    print 'href: ', a_tag['href']
    request_href = requests.get(base_url + a_tag['href'])
    print request_href.content
Tested using Python 2.x, for Python 3.x please add parentheses to the print statements.
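One caveat on base_url + a_tag['href'] in the answer above: plain string concatenation breaks when an href is already absolute, or when the base and the href disagree about slashes. The standard library's urllib.parse.urljoin handles both cases; a small Python 3 sketch, with base_url and the hrefs entirely hypothetical:

```python
from urllib.parse import urljoin

base_url = "http://www.mywebsite.com"  # hypothetical root, as in the answer above

# Relative hrefs are resolved against the base...
print(urljoin(base_url, "/listing/42"))
# ...and hrefs that are already absolute pass through unchanged,
# instead of producing a mangled double URL.
print(urljoin(base_url, "http://other.example/x"))
```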
I had the same problem and would like to share my findings. I tried the answer above, but for some reason it did not work for me; after some research, I found something interesting.
You might need to look at the attributes of the href link itself:
You will need the exact class which contains the href link; in this case I am thinking it is "class":"listing__left-column". Assign the find_all result to a variable, say all, for example:
from bs4 import BeautifulSoup
all = soup.find_all("div", {"class":"listing__left-column"})
for item in all:
    for link in item.find_all("a"):
        if 'href' in link.attrs:
            a = link.attrs['href']
            print(a)
            print("")
I did this, and I was able to get to another link that was embedded in the home page.
I realize this is probably incredibly straightforward, but please bear with me. I'm trying to use BeautifulSoup 4 to scrape a website that has a list of blog posts for the URLs of those posts. The a tag that I want is within a p tag. There are multiple p tags that each include a header and then a link that I want to capture. This is the code I'm working with:
with io.open('TPNurls.txt', 'a', encoding='utf8') as logfile:
    snippet = soup.find_all('p', class="postbody")
    for link in snippet.find('a'):
        fulllink = link.get('href')
        logfile.write(fulllink + "\n")
The error I'm getting is:
AttributeError: 'ResultSet' object has no attribute 'find'
I understand that means snippet is a set and BeautifulSoup doesn't let me look for tags within a set. But then how can I do this? I need it to find the entire set of p tags, then look for the a tag within each one, and then save each one on a separate line to a file.
The actual reason for the error is that snippet is a result of find_all() call and is basically a list of results, there is no find() function available on it. Instead, you meant:
snippet = soup.find('p', class_="postbody")
for link in snippet.find_all('a'):
    fulllink = link.get('href')
    logfile.write(fulllink + "\n")
Also, note the use of class_ here - class is a reserved keyword and cannot be used as a keyword argument here. See Searching by CSS class for more info.
Alternatively, make use of CSS selectors:
for link in soup.select('p.postbody a'):
    fulllink = link.get('href')
    logfile.write(fulllink + "\n")
p.postbody a would match all a tags inside the p tag with class postbody.
In your code,
snippet = soup.find_all('p', class="postbody")
for link in snippet.find('a'):
Here snippet is a bs4.element.ResultSet object, so you are getting this error. But the elements of this ResultSet are bs4.element.Tag objects, on which you can apply the find method.
Change your code like this,
snippet = soup.find_all("p", { "class" : "postbody" })
for link in snippet:
    if link.find('a'):
        fulllink = link.a['href']
        logfile.write(fulllink + "\n")
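The fix above can be sketched end to end on a small made-up piece of HTML (the real blog page is not assumed), collecting the hrefs into a list instead of a file:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML mirroring the structure described in the question:
# several p.postbody blocks, not all of which contain a link.
html = """<p class="postbody"><b>Post one</b> <a href="/post/1">link</a></p>
<p class="postbody">no link here</p>
<p class="postbody"><a href="/post/2">link</a></p>"""
soup = BeautifulSoup(html, "html.parser")

# find_all returns a ResultSet; iterate it and call find() on each Tag.
urls = []
for p in soup.find_all('p', class_="postbody"):
    a = p.find('a')
    if a:
        urls.append(a.get('href'))
print(urls)
```

The guard `if a:` skips any postbody paragraph without a link, which is what made the one-shot snippet.find('a') approach fail.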
I'm just starting to use BS4 and I can't seem to find why I can't extract the text in the following table -> http://pastebin.com/MCQC7wLY
This is my code:
for team in soup.find_all('tr'):
    print team.a.string
I get the following error
AttributeError: 'NoneType' object has no attribute 'string'
I also tried other stuff like
for team in soup.find_all('tr'):
    print team.find('a').string
But I'm always getting the same error.
This is what team.find('a') returns:
FC Lasne
I would like to extract "FC Lasne".
It's driving me mad, because usually I just do find('a').string and it just works.
How should I proceed?
Thanks
The very first tr in your example does not have any a tags in it.
You could just ignore any trs without links:
for team in soup.find_all('tr'):
    link = team.find('a')
    if link is None:
        continue
    print link.string
Though you could just do:
soup.find_all('a')
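The skip-the-header pattern can be sketched on a made-up table that mirrors the situation (a header row with no link, then a row with one); here it collects the names into a list rather than printing from a live page:

```python
from bs4 import BeautifulSoup

# Hypothetical table: the first tr holds only a th, which is what
# made team.a.string raise AttributeError on None above.
html = """<table>
<tr><th>Team</th></tr>
<tr><td><a href="/club/1">FC Lasne</a></td></tr>
</table>"""
soup = BeautifulSoup(html, "html.parser")

names = []
for team in soup.find_all('tr'):
    link = team.find('a')
    if link is None:   # header row: nothing to read
        continue
    names.append(str(link.string))
print(names)
```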