So, as I need more detailed data, I have to dig a bit deeper into the HTML code of a website. I wrote a script that returns a list of specific links to detail pages, but I can't get Python to search each link in this list; it always stops at the first one. What am I doing wrong?
from BeautifulSoup import BeautifulSoup
import urllib2
import re
from lxml import html
import requests

# Open site
html_page = urllib2.urlopen("http://www.sitetoscrape.ch/somesite.aspx")
# Inform BeautifulSoup
soup = BeautifulSoup(html_page)
# Search for the specific links
for link in soup.findAll('a', href=re.compile('/d/part/of/thelink/ineed.aspx')):
    # print found links
    print link.get('href')
    # complete links
    complete_links = 'http://www.sitetoscrape.ch' + link.get('href')
    # print complete links
    print complete_links
#
# EVERYTHING WORKS FINE TO THIS POINT
#
page = requests.get(complete_links)
tree = html.fromstring(page.text)
# Details
name = tree.xpath('//dl[@class="services"]')
for i in name:
    print i.text_content()
Also: what tutorial can you recommend for learning how to write my output to a file, clean it up, give my variables proper names, and so on?
I think that you want a list of links in complete_links instead of a single link. As @Pynchia and @lemonhead said, you're overwriting complete_links on every iteration of the first for loop.
You need two changes:
Append each link to a list and use that list to loop over and scrape each link
# [...] Same code here
link_list = []
for link in soup.findAll('a', href=re.compile('/d/part/of/thelink/ineed.aspx')):
    print link.get('href')
    complete_links = 'http://www.sitetoscrape.ch' + link.get('href')
    print complete_links
    link_list.append(complete_links)  # append new link to the list
Scrape each accumulated link in a second loop
for link in link_list:
    page = requests.get(link)
    tree = html.fromstring(page.text)
    # Details
    name = tree.xpath('//dl[@class="services"]')
    for i in name:
        print i.text_content()
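As for the follow-up about writing the output to a file: one minimal sketch, building on the link_list loop above and assuming you want one CSV row per link/services pair (the file name and column layout are made up for illustration), could be:

import csv

# Write one row per detail page: the link plus the text of its services list.
with open('services.csv', 'wb') as f:  # 'wb' because this is Python 2's csv module
    writer = csv.writer(f)
    writer.writerow(['link', 'services'])
    for link in link_list:
        page = requests.get(link)
        tree = html.fromstring(page.text)
        for dl in tree.xpath('//dl[@class="services"]'):
            writer.writerow([link, dl.text_content().strip().encode('utf-8')])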
PS: I recommend the Scrapy framework for tasks like this.
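To give an idea of what that looks like, here is a minimal, untested sketch of the same crawl as a Scrapy spider; the spider name, field name, and callback are made up, while the URLs and XPath come from the code above:

import scrapy

class ServicesSpider(scrapy.Spider):
    name = 'services'
    start_urls = ['http://www.sitetoscrape.ch/somesite.aspx']

    def parse(self, response):
        # Follow every detail link found on the overview page
        for href in response.xpath('//a[contains(@href, "/d/part/of/thelink/ineed.aspx")]/@href').getall():
            yield response.follow(href, callback=self.parse_detail)

    def parse_detail(self, response):
        # Yield the text of the "services" definition list on each detail page
        for dl in response.xpath('//dl[@class="services"]'):
            yield {'services': dl.xpath('string(.)').get()}

Running it with scrapy runspider services_spider.py -o output.csv would also take care of writing the results to a file.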
Related
I am trying to pass along a link I extracted with BeautifulSoup.
import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://data.ed.gov/dataset/college-scorecard-all-data-files-through-6-2020/resources')
soup = bs(r.content, 'lxml')
links = [item['href'] if item.get('href') is not None else item['src'] for item in soup.select('[href^="http"], [src^="http"]')]
print(links[1])
This is the link I want.
Output: https://ed-public-download.app.cloud.gov/downloads/CollegeScorecard_Raw_Data_07202021.zip
Now I am trying to pass this link through so I can download the contents.
import io
import os
import zipfile
import requests

# make a folder if it doesn't already exist
if not os.path.exists(folder_name):
    os.makedirs(folder_name)

# pass the url
url = r'link from beautifulsoup result needs to go here'
response = requests.get(url, stream=True)

# extract contents
with zipfile.ZipFile(io.BytesIO(response.content)) as zf:
    for elem in zf.namelist():
        zf.extract(elem, '../data')
My overall goal is to take the link that I scraped and place it in the url variable, because the link on this website is always changing. I want to make it dynamic so I don't have to manually search for the link and change it whenever it changes. I hope this makes sense, and I appreciate any help I can get.
If I manually enter the URL in my code as follows, I know it works:
url = r'https://ed-public-download.app.cloud.gov/downloads/CollegeScorecard_Raw_Data_07202021.zip'
If I can get my code to pass that value in exactly, I know it'll work; I'm just stuck on how to accomplish this.
I think you can do it with the find_all() method in Beautiful Soup:
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.content)
for a in soup.find_all('a'):
    url = a.get('href')
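Putting the two snippets from the question together, a rough sketch might look like the following; picking the link by filtering for '.zip' is my own assumption (the question used links[1]), so adjust the filter to whatever reliably identifies the file:

import io
import zipfile
import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://data.ed.gov/dataset/college-scorecard-all-data-files-through-6-2020/resources')
soup = bs(r.content, 'lxml')
links = [item['href'] if item.get('href') is not None else item['src']
         for item in soup.select('[href^="http"], [src^="http"]')]

# Assumption: the raw-data file is the first link ending in '.zip';
# the question's links[1] also works as long as the page layout holds.
url = next(link for link in links if link.endswith('.zip'))

response = requests.get(url, stream=True)
with zipfile.ZipFile(io.BytesIO(response.content)) as zf:
    zf.extractall('../data')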
I'm working on a web crawler that will crawl only internal links, using requests and bs4.
I have a rough working version below, but I'm not sure how to properly handle checking whether a link has been crawled previously or not.
import re
import time
import requests
import argparse
from bs4 import BeautifulSoup

internal_links = set()

def crawler(new_link):
    html = requests.get(new_link).text
    soup = BeautifulSoup(html, "html.parser")
    for link in soup.find_all('a', attrs={'href': re.compile("^http://")}):
        if "href" in link.attrs:
            print(link)
            if link.attrs["href"] not in internal_links:
                new_link = link.attrs["href"]
                print(new_link)
                internal_links.add(new_link)
                print("All links found so far, ", internal_links)
                time.sleep(6)
                crawler(new_link)
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('url', help='Pass the website url you wish to crawl')
    args = parser.parse_args()
    url = args.url
    # Check full url has been passed otherwise requests will throw error later
    try:
        crawler(url)
    except:
        if url[0:4] != 'http':
            print('Please try again and pass the full url eg http://example.com')

if __name__ == '__main__':
    main()
These are the last few lines of the output:
All links found so far, {'http://quotes.toscrape.com/tableful', 'http://quotes.toscrape.com', 'http://quotes.toscrape.com/js', 'http://quotes.toscrape.com/scroll', 'http://quotes.toscrape.com/login', 'http://books.toscrape.com', 'http://quotes.toscrape.com/'}
ViewState
http://quotes.toscrape.com/search.aspx
All links found so far, {'http://quotes.toscrape.com/tableful', 'http://quotes.toscrape.com', 'http://quotes.toscrape.com/js', 'http://quotes.toscrape.com/search.aspx', 'http://quotes.toscrape.com/scroll', 'http://quotes.toscrape.com/login', 'http://books.toscrape.com', 'http://quotes.toscrape.com/'}
Random
http://quotes.toscrape.com/random
All links found so far, {'http://quotes.toscrape.com/tableful', 'http://quotes.toscrape.com', 'http://quotes.toscrape.com/js', 'http://quotes.toscrape.com/search.aspx', 'http://quotes.toscrape.com/scroll', 'http://quotes.toscrape.com/random', 'http://quotes.toscrape.com/login', 'http://books.toscrape.com', 'http://quotes.toscrape.com/'}
So it is working, but only up to a certain point, and then it doesn't seem to follow the links any further.
I'm sure it's because of this line:
for link in soup.find_all('a', attrs={'href': re.compile("^http://")}):
as that will only find the links that start with http, and on a lot of the internal pages the links don't have that. But when I try it like this:
for link in soup.find_all('a'):
the program runs very briefly and then ends:
http://books.toscrape.com
{'href': 'http://books.toscrape.com'}
http://books.toscrape.com
All links found so far, {'http://books.toscrape.com'}
index.html
{'href': 'index.html'}
index.html
All links found so far, {'index.html', 'http://books.toscrape.com'}
You could reduce
for link in soup.find_all('a', attrs={'href': re.compile("^http://")}):
    if "href" in link.attrs:
        print(link)
        if link.attrs["href"] not in internal_links:
            new_link = link.attrs["href"]
            print(new_link)
            internal_links.add(new_link)
To
links = {link['href'] for link in soup.select("a[href^='http:']")}
internal_links.update(links)
This grabs only the qualifying a tag elements with the http protocol and uses a set comprehension to ensure there are no dupes. It then updates the existing set with any new links. I don't know enough Python to comment on the efficiency of using .update, but I believe it modifies the existing set rather than creating a new one. More methods for combining sets are listed here: How to join two sets in one line without using "|".
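On that last point, a quick standalone check (not part of the original answer, just an illustration) shows that .update does modify the set in place rather than creating a new object:

# set.update() mutates the existing set in place and drops duplicates.
internal_links = {'http://books.toscrape.com'}
before_id = id(internal_links)

internal_links.update({'http://quotes.toscrape.com', 'http://books.toscrape.com'})

print(internal_links)                    # both URLs, duplicate collapsed
print(id(internal_links) == before_id)   # True: same object, modified in place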
I'm trying to collect a specific link so I can visit it later in my script, but there are many links on the page I'm crawling and they all use the same a href tag.
How can I select one specifically? The site is bbb.org and my code is below.
For example, search for lamps on BBB; I want to collect the links embedded in the business names so I can visit their profiles later.
#!/usr/bin/python
import requests
from bs4 import BeautifulSoup

def bbb_spider(max_pages):
    bus_cat = raw_input('Enter a business category: ')
    pages = 1
    while pages <= max_pages:
        url = 'http://www.bbb.org/search/?type=category&input=' + str(bus_cat) + '&page=' + str(pages)
        sauce_code = requests.get(url)
        plain_text = sauce_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        for link in soup.findAll('a'):
            href = link.get('href')
            print(href)
        pages += 1
You need the links located inside the h4 elements, which are inside the search results table. There are different ways to get to them, but I would use a CSS selector:
soup.select("table.search-results-table tr h4 a")
I have created something similar to this.
Look at my example for a crawler.
https://github.com/shiva1791/Python_webcrawler
The code takes the URL it needs to parse from link.csv.
All the logic behind parsing every link on the page is in the webcrawler.py file.
I am wondering how I would open another page in my list with BeautifulSoup. I have followed this tutorial, but it does not explain how to open another page in the list. Also, how would I open an a href that is nested inside of a class?
Here is my code:
# coding: utf-8
import requests
from bs4 import BeautifulSoup

r = requests.get("")
soup = BeautifulSoup(r.content)

soup.find_all("a")

for link in soup.find_all("a"):
    print link.get("href")

for link in soup.find_all("a"):
    print link.text

for link in soup.find_all("a"):
    print link.text, link.get("href")

g_data = soup.find_all("div", {"class":"listing__left-column"})

for item in g_data:
    print item.contents

for item in g_data:
    print item.contents[0].text
    print link.get('href')

for item in g_data:
    print item.contents[0]
I am trying to collect the hrefs from the titles of each business, and then open them and scrape that data.
I am still not sure where you are getting the HTML from, but if you are trying to extract all of the href tags, then the following approach should work based on the image you have posted:
import requests
from bs4 import BeautifulSoup

r = requests.get("<add your URL here>")
soup = BeautifulSoup(r.content)

for a_tag in soup.find_all('a', class_='listing-name', href=True):
    print 'href: ', a_tag['href']
By adding href=True to the find_all(), it ensures that only a elements that contain an href attribute are returned, therefore removing the need to test for it as an attribute.
Just to warn you, you might find that some websites will lock you out after one or two attempts, as they are able to detect that you are trying to access a site via a script rather than as a human. If you feel you are not getting the correct responses, I would recommend printing the HTML you are getting back to ensure it is still as you expect.
If you then want to get the HTML for each of the links, the following could be used:
import requests
from bs4 import BeautifulSoup

# Configure this to be your first request URL
r = requests.get("http://www.mywebsite.com/search/")
soup = BeautifulSoup(r.content)

for a_tag in soup.find_all('a', class_='listing-name', href=True):
    print 'href: ', a_tag['href']

# Configure this to the root of the above website, e.g. 'http://www.mywebsite.com'
base_url = "http://www.mywebsite.com"

for a_tag in soup.find_all('a', class_='listing-name', href=True):
    print '-' * 60          # Add a line of dashes
    print 'href: ', a_tag['href']
    request_href = requests.get(base_url + a_tag['href'])
    print request_href.content
Tested using Python 2.x, for Python 3.x please add parentheses to the print statements.
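For reference, a minimal Python 3 version of the same idea could look like the sketch below; the class name and URLs are carried over from the snippet above and are not verified against a real site:

import requests
from bs4 import BeautifulSoup

base_url = "http://www.mywebsite.com"   # root of the site being scraped
r = requests.get(base_url + "/search/")
soup = BeautifulSoup(r.content, "html.parser")

for a_tag in soup.find_all('a', class_='listing-name', href=True):
    print('href:', a_tag['href'])
    detail = requests.get(base_url + a_tag['href'])
    print(detail.text[:200])            # first part of each detail page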
I had the same problem and I would like to share my findings. I did try the answer; for some reason it did not work, but after some research I found something interesting.
You might need to find the attributes of the "href" link itself:
You will need the exact class which contains the href link; in your case I am thinking it is "class":"listing__left-column". Assign the result to a variable, say "all", for example:
from bs4 import BeautifulSoup

all = soup.find_all("div", {"class":"listing__left-column"})
for item in all:
    for link in item.find_all("a"):
        if 'href' in link.attrs:
            a = link.attrs['href']
            print(a)
            print("")
I did this and I was able to get to another link which was embedded in the home page.
I need to download a few links from an HTML page, but I don't need all of them; I only need a few of them in a certain section of the webpage.
For example, in http://www.nytimes.com/roomfordebate/2014/09/24/protecting-student-privacy-in-online-learning, I need the links in the debaters section. I plan to use BeautifulSoup, and I looked at the HTML of one of the links:
Data Collection Is Out of Control
Here's my code:
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data)

link_set = set()
for link in soup.find_all("a", class = "bl-bigger"):
    href = link.get('href')
    if href == None:
        continue
    elif '/roomfordebate/' in href:
        link_set.add(href)

for link in link_set:
    print link
This code is supposed to give me all the links with the bl-bigger class, but it actually returns nothing. Could anyone figure out what's wrong with my code or how to make it work?
Thanks
I don't see the bl-bigger class at all when I view the source from Chrome. Maybe that's why your code is not working?
Let's start by looking at the source. The whole Debaters section seems to be put within a div with class nytint-discussion-content. So, using BeautifulSoup, let's get that whole div first.
debaters_div = soup.find('div', class_="nytint-discussion-content")
Again learning from the source, it seems all the links are within a list (li tags). Now all you have to do is find all the li tags and then find the anchor tags within them. One more thing you can notice is that all the li tags have the class nytint-bylines-1.
list_items = debaters_div.find_all("li", class_="nytint-bylines-1")
list_items[0].find('a')
# Data Collection Is Out of Control
So, your whole code can be:
link_set = set()
response = requests.get(url)
html_data = response.text
soup = BeautifulSoup(html_data)
debaters_div = soup.find('div', class_="nytint-discussion-content")
list_items = debaters_div.find_all("li", class_="nytint-bylines-1")
for each_item in list_items:
    html_link = each_item.find('a').get('href')
    if html_link.startswith('/roomfordebate'):
        link_set.add(html_link)
Now link_set will contain all the links you want. From the link given in question, it will fetch 5 links.
PS: link_set contains only the URIs and not the full addresses. So I would add http://www.nytimes.com at the start before adding those links to link_set. Just change the last line to:
link_set.add('http://www.nytimes.com' + html_link)
You need to call the method with a dict instead of a keyword argument, since class is a reserved word in Python and can't be used as an argument name directly:
soup.find("tagName", { "class" : "cssClass" })
or use .select method which executes CSS queries:
soup.select('a.bl-bigger')
Examples are in the docs; just search for the '.select' string. Also, instead of writing the entire script up front, you'll quickly get some working code with the IPython interactive shell.
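Put together with the question's own filtering, a fixed version of the loop might look like this; it is just the question's code with the selector swapped in, not re-tested against the live page:

link_set = set()
# Either fix works; .select keeps the class filter in one place.
for link in soup.select('a.bl-bigger'):
    href = link.get('href')
    if href is None:
        continue
    elif '/roomfordebate/' in href:
        link_set.add(href)

for link in link_set:
    print link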