Scraping href links and scrape from these links - python

I'm doing python scraping and i'm trying to get all the links between href tags and then accessing it one by one to scrape data from these links. I'm a newbie and can't figure it out how to continue from this.The code is as follows:
import requests
import urllib.request
import re
from bs4 import BeautifulSoup
import csv
url = 'https://menupages.com/restaurants/ny-new-york'
url1 = 'https://menupages.com'
response = requests.get(url)
f = csv.writer(open('Restuarants_details.csv', 'w'))
soup = BeautifulSoup(response.text, "html.parser")
menu_sections=[]
for url2 in soup.find_all('h3',class_='restaurant__title'):
completeurl = url1+url2.a.get('href')
print(completeurl)
#print(url)

If you want to scrape all the links obtained from the first page, and then scrape all the links obtained from these links, etc, you need a recursive function.
Here is some initial code to get you started:
if __name__ == "__main__":
initial_url = "https://menupages.com/restaurants/ny-new-york"
scrape(initial_url)
def scrape(url):
print("now looking at " + url)
# scrape URL
# do something with the data
if (STOP_CONDITION): # update this!
return
# scrape new URLs:
for new_url in soup.find_all(...):
scrape(new_url, file)
The problem with this recursive function is that it will not stop until there are no links on the pages, which probably won't happen anytime soon. You will need to add a stop condition.

Related

Web scraping using python-beginner level

Hello I am new into python . practicing web scraping with some demo sites .
I am trying to scrape this website http://books.toscrape.com/ and want to extract
href
name/title
start rating/star-rating
price/price_color
in-stock availbility/instock availability
i written a basic code which goes to each book level.
but after that i am clueless as how i can extract those information.
import requests
from csv import reader,writer
from bs4 import BeautifulSoup
base_url= "http://books.toscrape.com/"
r = requests.get(base_url)
htmlContent = r.content
soup = BeautifulSoup(htmlContent,'html.parser')
for article in soup.find_all('article'):
This will find you the href and name for every book. You could also extract some other other information if you want.
import requests
from csv import reader,writer
from bs4 import BeautifulSoup
base_url= "http://books.toscrape.com/"
r = requests.get(base_url)
soup = BeautifulSoup(r.content,'html.parser')
def extract_info(soup):
href = []
for a in soup.find_all('a', href=True):
if a.text:
if "catalogue" in a["href"]:
href.append(a['href'])
name = []
for a in soup.find_all('a', title=True):
name.append(a.text)
return href, name
href, name = extract_info(soup)
print(href[0], name[0])
the output will be the href and name for the first book
Try below approach using python - requests and BeautifulSoup. I have fetched the page URL from website itself after inspecting the network section > Doc tab of google chrome browser.
What exactly below script is doing:
First it will take the Page URL which is created using, page no parameter and then doing a GET request.
URL is dynamic which will get created after finishing of an iteration. You will notice that PAGE_NO param will get incremented after each iteration.
After getting the data script will parse the HTML code using html5.parser library.
Finally it will iterate all over the list of books fetched in each iteration or page for ex:- Title, Hyperlink, Price, Stock Availability and rating.
There are 50 pages and 1k results below script will extract all the books details one page per iteration
import requests
from urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
from bs4 import BeautifulSoup as bs
def scrap_books_data():
PAGE_NO = 1 # Page no parameter which will get incremented after every iteration
while True:
print('Creating URL to scrape books data for ', str(PAGE_NO))
URL = 'http://books.toscrape.com/catalogue/page-' + str(PAGE_NO) + '.html' #dynamic URL which will get created after every iteration
response = requests.get(URL,verify=False) # GET request to fetch data from site
soup = bs(response.text,'html.parser') #Parse HTML data using 'html5.parser'
extracted_books_data = soup.find_all('article', class_ = 'product_pod') # find all articles tag where book details are nested
if len(extracted_books_data) == 0: #break the loop and exit from the script if there in no more data available to process
break
else:
for item in range(len(extracted_books_data)): #iterate over the list of extracted books
print('-' * 100)
print('Title : ', extracted_books_data[item].contents[5].contents[0].attrs['title'])
print('Link : ', extracted_books_data[item].contents[5].contents[0].attrs['href'])
print('Rating : ', extracted_books_data[item].contents[3].attrs['class'][1])
print('Price : ', extracted_books_data[item].contents[7].contents[1].text.replace('Â',''))
print('Availability : ', extracted_books_data[item].contents[7].contents[3].text.replace('\n','').strip())
print('-' * 100)
PAGE_NO += 1 #increment page no by 1 to scrape next page data
scrap_books_data()

Beautifulsoup "findAll()" does not return the tags

I am trying to build a scraper to get some abstracts of academic papers and their corresponding titles on this page.
The problem is that my for link in bsObj.findAll('a',{'class':'search-track'}) does not return the links I need to further build my scraper. In my code, the check is like this:
for link in bsObj.findAll('a',{'class':'search-track'}):
print(link)
The for loop above does print out anything, however, the href links should be inside the <a class="search-track" ...</a>.
I have referred to this post, but changing the Beautifulsoup parser is not solving the problem of my code. I am using "html.parser" in my Beautifulsoup constructor: bsObj = bs(html.content, features="html.parser").
And the print(len(bsObj)) prints out "3" while it prints out "2" for both "lxml" and "html5lib".
Also, I started off using urllib.request.urlopen to get the page and then tried requests.get() instead. Unfortunately the two approaches give me the same bsObj.
Here is the code I've written:
#from urllib.request import urlopen
import requests
from bs4 import BeautifulSoup as bs
import ssl
'''
The elsevier search is kind of a tree structure:
"keyword --> a list of journals (a journal contain many articles) --> lists of articles
'''
address = input("Please type in your keyword: ") #My keyword is catalyst for water splitting
#https://www.elsevier.com/en-xs/search-results?
#query=catalyst%20for%20water%20splitting&labels=journals&page=1
address = address.replace(" ", "%20")
address = "https://www.elsevier.com/en-xs/search-results?query=" + address + "&labels=journals&page=1"
journals = []
articles = []
def getJournals(url):
global journals
#html = urlopen(url)
html = requests.get(url)
bsObj = bs(html.content, features="html.parser")
#print(len(bsObj))
#testFile = open('testFile.txt', 'wb')
#testFile.write(bsObj.text.encode(encoding='utf-8', errors='strict') +'\n'.encode(encoding='utf-8', errors='strict'))
#testFile.close()
for link in bsObj.findAll('a',{'class':'search-track'}):
print(link)
########does not print anything########
'''
if 'href' in link.attrs and link.attrs['href'] not in journals:
newJournal = link.attrs['href']
journals.append(newJournal)
'''
return None
# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
getJournals(address)
print(journals)
Can anyone tell me what the problem is in my code that the for loop does not print out any links? I need to store the links of journals in a list and then visit each link to scrape the abstracts of papers. By right the abstracts part of a paper is free and the website shouldn't have blocked my ID because of it.
This page is dynamically loaded with jscript, so Beautifulsoup can't handle it directly. You may be able to do it using Selenium, but in this case you can do it by tracking the api calls made by the page (for more see, as one of many examples, here.
In your particular case it can be done this way:
from bs4 import BeautifulSoup as bs
import requests
import json
#this is where the data is hiding:
url = "https://site-search-api.prod.ecommerce.elsevier.com/search?query=catalyst%20for%20water%20splitting&labels=journals&start=0&limit=10&lang=en-xs"
html = requests.get(url)
soup = bs(html.content, features="html.parser")
data = json.loads(str(soup))#response is in json format so we load it into a dictionary
Note: in this case, it's also possible to dispense with Beautifulsoup altogether and load the response directly, as in data = json.loads(html.content). From this point:
hits = data['hits']['hits']#target urls are hidden deep inside nested dictionaries and lists
for hit in hits:
print(hit['_source']['url'])
Ouput:
https://www.journals.elsevier.com/water-research
https://www.journals.elsevier.com/water-research-x
etc.

using python to pull href tags

tyring to pull the href links for the products on this webpage. The code pulls all of the href's except the products that are listed on the page.
from bs4 import BeautifulSoup
import requests
url = "https://www.neb.com/search#t=_483FEC15-900D-4CF1-B514-1B921DD055BA&sort=%40ftitle51880%20ascending"
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'lxml')
tags = soup.find_all('a')
for tag in tags:
print(tag.get('href'))
The products are loaded through rest API dynamically, the URL is this:
https://international.neb.com/coveo/rest/v2/?sitecoreItemUri=sitecore%3A%2F%2Fweb%2F%7BA1D9D237-B272-4C5E-A23F-EC954EB71A26%7D%3Flang%3Den%26ver%3D1&siteName=nebinternational
Loading this response will get you the URLs.
Next time, check your network inspector if any part of web page isn't loading dynamically (or use selenium).
Try to verify if the product href's is in the received response. I'm telling you to do this because if the part of the products is being dynamically generated by ajax, for example, a simple get on the main page will not bring them.
Print the response and verifiy if the products are being received in the html
I think you want something like this:
from bs4 import BeautifulSoup
import urllib.request
for numb in ('1', '100'):
resp = urllib.request.urlopen("https://www.neb.com/search#first=" + numb + "&t=_483FEC15-900D-4CF1-B514-1B921DD055BA&sort=%40ftitle51880%20ascending")
soup = BeautifulSoup(resp, from_encoding=resp.info().get_param('charset'))
for link in soup.find_all('a', href=True):
print(link['href'])

List links in web page with python

I am trying to write a python script that lists all the links in a webpage that contain some substring. The problem that I am running into is that the webpage has multiple "pages" so that it doesn't clutter all the screen. Take a look at https://www.go-hero.net/jam/17/solutions/1/1/C++ for an example.
This is what I have so far:
import requests
from bs4 import BeautifulSoup
url = "https://www.go-hero.net/jam/17/solutions/1/1/C++"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html5lib")
links = soup.find_all('a')
for tag in links:
link = tag.get('href', None)
if link is not None and 'GetSource' in link:
print(link)
Any suggestions on how I might get this to work? Thanks in advance.
Edit/Update: Using Selenium, you could click the page links before scraping the html to collect all the content into the html. Many/most websites with pagination don't collect all the text in the html when you click through the pages, but I noticed that the example you provided does. Take a look at this SO question for a quick example of making Selenium work with BeautifulSoup. Here is how you could use it in your code:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Firefox()
original_url = "https://www.go-hero.net/jam/17/solutions/1/1/C++"
driver.get(original_url)
# click the links for pages 1-29
for i in range(1, 30):
path_string = '/jam/17/solutions/1/1/C++#page-' + str(i)
driver.find_element_by_xpath('//a[#href=' + path_string + ']').click()
# scrape from the accumulated html
html = driver.page_source
soup = BeautifulSoup(html)
links = soup.find_all('a')
# proceed as normal from here
for tag in links:
link = tag.get('href', None)
if link is not None and 'GetSource' in link:
print(link)
Original Answer: For the link you provided above, you could simply loop through possible urls and run your scraping code in the loop:
import requests
from bs4 import BeautifulSoup
original_url = "https://www.go-hero.net/jam/17/solutions/1/1/C++"
# scrape from the original page (has no page number)
response = requests.get(original_url)
soup = BeautifulSoup(response.content, "html5lib")
links = soup.find_all('a')
# prepare to scrape from the pages numbered 1-29
# (note that the original page is not numbered, and the next page is "#page-1")
url_suffix = '#page-'
for i in range(1, 30):
# add page number to the url
paginated_url = original_url + url_suffix + str(i)
response = requests.get(paginated_url)
soup = BeautifulSoup(response.content, "html5lib")
# append resulting list to 'links' list
links += soup.find_all('a')
# proceed as normal from here
for tag in links:
link = tag.get('href', None)
if link is not None and 'GetSource' in link:
print(link)
I don't know if you mind that you'll get duplicates in your results. You will get duplicate results in your link list as the code currently stands, but you could add the links to a Set or something instead to easily remedy that.

scraping data from unknown number of pages using beautiful soup

I want to parse some info from website that has data spread among several pages.
The problem is I don't know how many pages there are. There might be 2, but there might be also 4, or even just one page.
How can I loop over pages when I don't know how many pages there will be?
I know however the url pattern which looks something like in the code below.
Also, the pages names are not plain numbers but they are in 'pe2' for page 2 and 'pe4' for page 3 etc. so can't just loop over range(number).
This dummy code for the loop I am trying to fix.
pages=['','pe2', 'pe4', 'pe6', 'pe8',]
import requests
from bs4 import BeautifulSoup
for i in pages:
url = "http://www.website.com/somecode/dummy?page={}".format(i)
r = requests.get(url)
soup = BeautifulSoup(r.content)
#rest of the scraping code
You can use a while loop that will stop to run when encounters an exception.
Code:
from bs4 import BeautifulSoup
from time import sleep
import requests
i = 0
while(True):
try:
if i == 0:
url = "http://www.website.com/somecode/dummy?page=pe"
else:
url = "http://www.website.com/somecode/dummy?page=pe{}".format(i)
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
#print page url
print(url)
#rest of the scraping code
#don't overflow website
sleep(2)
#increase page number
i += 2
except:
break
Output:
http://www.website.com/somecode/dummy?page
http://www.website.com/somecode/dummy?page=pe2
http://www.website.com/somecode/dummy?page=pe4
http://www.website.com/somecode/dummy?page=pe6
http://www.website.com/somecode/dummy?page=pe8
...
... and so on, until it faces an Exception.

Categories

Resources