Hello, I am new to Python and practicing web scraping with some demo sites.
I am trying to scrape the website http://books.toscrape.com/ and want to extract:
href
name/title
star rating/star-rating
price/price_color
in-stock availability/instock availability
I have written some basic code that gets down to each book's article element,
but after that I am clueless about how to extract that information.
import requests
from csv import reader,writer
from bs4 import BeautifulSoup
base_url= "http://books.toscrape.com/"
r = requests.get(base_url)
htmlContent = r.content
soup = BeautifulSoup(htmlContent,'html.parser')
for article in soup.find_all('article'):
    pass  # stuck here - how do I extract href, title, rating, price and availability?
This will find the href and name for every book. You could also extract other information if you want.
import requests
from csv import reader,writer
from bs4 import BeautifulSoup
base_url= "http://books.toscrape.com/"
r = requests.get(base_url)
soup = BeautifulSoup(r.content,'html.parser')
def extract_info(soup):
    href = []
    name = []
    # book links are the <a> tags carrying a title attribute, so taking href and
    # name from the same tags keeps the two lists aligned (the sidebar category
    # links also contain "catalogue" in their href but have no title attribute)
    for a in soup.find_all('a', title=True):
        href.append(a['href'])
        name.append(a.text)
    return href, name
href, name = extract_info(soup)
print(href[0], name[0])
The output will be the href and name of the first book.
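Since the question also asks for the rating, price and availability, here is a minimal sketch of pulling all five fields from each product_pod article, assuming the class names mentioned in the question (star-rating, price_color, instock availability) and the h3/a title structure used on books.toscrape.com:
for article in soup.find_all('article', class_='product_pod'):
    link = article.h3.a                                            # the title link inside the <h3>
    href = link['href']
    title = link['title']
    rating = article.find('p', class_='star-rating')['class'][1]   # e.g. 'Three'
    price = article.find('p', class_='price_color').text
    stock = article.find('p', class_='instock').text.strip()
    print(href, title, rating, price, stock)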
Try the approach below using Python requests and BeautifulSoup. I fetched the page URL pattern from the website itself after inspecting the Network section > Doc tab of the Google Chrome browser.
What exactly the script below is doing:
First it builds the page URL from a page-number parameter and issues a GET request.
The URL is dynamic and is rebuilt after each iteration; you will notice that the PAGE_NO parameter gets incremented after every iteration.
After fetching the data, the script parses the HTML using the html.parser library.
Finally it iterates over the list of books fetched from each page and prints, for example, the Title, Hyperlink, Rating, Price and Stock Availability.
There are 50 pages and 1,000 results; the script below extracts all the book details, one page per iteration.
import requests
from urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
from bs4 import BeautifulSoup as bs
def scrap_books_data():
PAGE_NO = 1 # Page no parameter which will get incremented after every iteration
while True:
print('Creating URL to scrape books data for ', str(PAGE_NO))
URL = 'http://books.toscrape.com/catalogue/page-' + str(PAGE_NO) + '.html' #dynamic URL which will get created after every iteration
response = requests.get(URL,verify=False) # GET request to fetch data from site
        soup = bs(response.text,'html.parser') # parse the HTML using 'html.parser'
        extracted_books_data = soup.find_all('article', class_ = 'product_pod') # find all article tags where book details are nested
        if len(extracted_books_data) == 0: # break the loop and exit the script if there is no more data to process
break
else:
for item in range(len(extracted_books_data)): #iterate over the list of extracted books
print('-' * 100)
print('Title : ', extracted_books_data[item].contents[5].contents[0].attrs['title'])
print('Link : ', extracted_books_data[item].contents[5].contents[0].attrs['href'])
print('Rating : ', extracted_books_data[item].contents[3].attrs['class'][1])
print('Price : ', extracted_books_data[item].contents[7].contents[1].text.replace('Â',''))
print('Availability : ', extracted_books_data[item].contents[7].contents[3].text.replace('\n','').strip())
print('-' * 100)
PAGE_NO += 1 #increment page no by 1 to scrape next page data
scrap_books_data()
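The question's code imports csv's writer, so presumably the goal is a CSV file rather than printed output. Here is a minimal sketch under that assumption; the field extraction mirrors the loop above and the books.csv filename is only illustrative:
import csv
import requests
from bs4 import BeautifulSoup

with open('books.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['title', 'href', 'rating', 'price', 'availability'])  # header row
    page_no = 1
    while True:
        url = 'http://books.toscrape.com/catalogue/page-' + str(page_no) + '.html'
        soup = BeautifulSoup(requests.get(url).content, 'html.parser')
        books = soup.find_all('article', class_='product_pod')
        if not books:          # no more product listings, stop
            break
        for book in books:
            link = book.h3.a
            writer.writerow([
                link['title'],
                link['href'],
                book.find('p', class_='star-rating')['class'][1],
                book.find('p', class_='price_color').text,
                book.find('p', class_='instock').text.strip(),
            ])
        page_no += 1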
I am trying to scrape the website www.zath.co.uk and extract the links to all of the articles using Python 3. Looking at the raw HTML file, I identified one of the sections I am interested in, shown below as parsed by BeautifulSoup.
<article class="post-32595 post type-post status-publish format-standard has-post-thumbnail category-games entry" itemscope="" itemtype="https://schema.org/CreativeWork">
<header class="entry-header">
<h2 class="entry-title" itemprop="headline">
<a class="entry-title-link" href="https://www.zath.co.uk/family-games-day-night-event-giffgaff/" rel="bookmark">
A Family Games Night (& Day) With giffgaff
</a>
I then wrote this code to execute this. I started by setting up a list of URLs from the website to scrape.
urlList = ["https://www.zath.co.uk/", "https://www.zath.co.uk/page/2/", ..., "https://www.zath.co.uk/page/35/"]
Then (after importing the necessary libraries) I defined a function to get all Zath posts.
def getAllZathPosts(url,links):
request = urllib.request.Request(url)
response = urllib.request.urlopen(request)
soup = BeautifulSoup(response)
for a in soup.findAll('a'):
url = a['href']
c = a['class']
if c == "entry-title-link":
print(url)
links.append(url)
return
Then call the function.
links = []
zathPosts = {}
for url in urlList:
zathPosts = getAllZathPosts(url,links)
The code runs with no errors, but the links list remains empty and no URLs are printed, as if the class never equals "entry-title-link". I have tried adding an else case:
else:
print(url + " not article")
and all the links from the pages printed as expected. Any suggestions?
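For what it is worth, in BeautifulSoup class is a multi-valued attribute, so a['class'] returns a list such as ['entry-title-link'] rather than a string, which is why the equality test never matches. A minimal sketch of a comparison that would match, keeping the rest of the function as is:
for a in soup.findAll('a'):
    c = a.get('class') or []            # a list of classes, or [] if the tag has none
    if "entry-title-link" in c:         # membership test instead of string equality
        print(a['href'])
        links.append(a['href'])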
You can simply iterate over the pages using range and extract the article tags:
import requests
from bs4 import BeautifulSoup
for page_no in range(1, 36):  # pages 1 to 35
    page = requests.get("https://www.zath.co.uk/page/{}/".format(page_no))
    parser = BeautifulSoup(page.content, 'html.parser')
    for article in parser.findAll('article'):
        print(article.h2.a['href'])
You can do something like the code below:
import requests
from bs4 import BeautifulSoup
def getAllZathPosts(url,links):
response = requests.get(url).text
soup = BeautifulSoup(response, 'html.parser')
results = soup.select("a.entry-title-link")
#for i in results:
#print(i.text)
#links.append(url)
if len(results) >0:
links.append(url)
links = []
urlList = ["https://www.zath.co.uk/","https://www.zath.co.uk/page/2/","https://www.zath.co.uk/page/35/"]
for url in urlList:
getAllZathPosts(url,links)
print(set(links))
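If the goal is to collect the article links themselves rather than the page URLs that contain them, a small illustrative variation of the same function:
def getAllZathPosts(url, links):
    response = requests.get(url).text
    soup = BeautifulSoup(response, 'html.parser')
    for a in soup.select("a.entry-title-link"):
        links.append(a['href'])   # collect each article link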
I'm doing Python scraping and I'm trying to get all the links from the href attributes, and then access them one by one to scrape data from those links. I'm a newbie and can't figure out how to continue from this. The code is as follows:
import requests
import urllib.request
import re
from bs4 import BeautifulSoup
import csv
url = 'https://menupages.com/restaurants/ny-new-york'
url1 = 'https://menupages.com'
response = requests.get(url)
f = csv.writer(open('Restuarants_details.csv', 'w'))
soup = BeautifulSoup(response.text, "html.parser")
menu_sections=[]
for url2 in soup.find_all('h3',class_='restaurant__title'):
completeurl = url1+url2.a.get('href')
print(completeurl)
#print(url)
If you want to scrape all the links obtained from the first page, and then scrape all the links obtained from these links, etc, you need a recursive function.
Here is some initial code to get you started:
def scrape(url):
    print("now looking at " + url)
    # scrape URL, e.g. soup = BeautifulSoup(requests.get(url).text, "html.parser")
    # do something with the data
    if STOP_CONDITION:  # update this!
        return
    # scrape new URLs:
    for new_url in soup.find_all(...):
        scrape(new_url)

if __name__ == "__main__":
    initial_url = "https://menupages.com/restaurants/ny-new-york"
    scrape(initial_url)
The problem with this recursive function is that it will not stop until there are no links on the pages, which probably won't happen anytime soon. You will need to add a stop condition.
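A minimal sketch of one possible stop condition, using a visited set plus a depth limit (the names and the depth value are only illustrative; the restaurant__title selector comes from the question's code):
import requests
from bs4 import BeautifulSoup

visited = set()
MAX_DEPTH = 2  # illustrative limit

def scrape(url, depth=0):
    if url in visited or depth > MAX_DEPTH:   # stop condition
        return
    visited.add(url)
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    # ... extract and store the restaurant data here ...
    for h3 in soup.find_all('h3', class_='restaurant__title'):
        scrape("https://menupages.com" + h3.a.get('href'), depth + 1)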
I'm trying to scrape a website of product reviews and can't assign more than one URL to a variable. Basically, I need to scrape URLs within a URL for specific content.
I have the parent URL and three linked pages to scrape for product details like reviews, stars, etc. When I pass more than one URL to the assigned variable, a "connection adapter" error is thrown. I've also attempted to just compile, or copy the same code three times, to no avail.
import requests as r
from bs4 import BeautifulSoup
import csv
url1 = 'http://drd.ba.ttu.edu/isqs6339/imbadproducts/'
filepath = 'dataout.csv'
res = r.get(url1)
res.content
soup = BeautifulSoup(res.content,'lxml')
results = soup.find("a")
print(results)
print(results['href'])
results = soup.find_all("a")
for l in results:
print(l['href'])
for l in results:
print(l.text)
print(res.headers)
product_result = soup.find_all('a')
for pr in product_result:
print(pr)
search_results = soup.find('div', attrs={'id' : 'searchresults'})
product_result = search_results.find_all('a')
for pr in product_result:
print(pr)
So I provided one link, but have three embedded links and different tags to scrape. I've never been able to get past the connection adapter error.
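For what it is worth, requests raises the "no connection adapters were found" error when the string handed to requests.get is not a single absolute URL (for example several URLs joined into one string, or a relative href). A minimal sketch, under that assumption, that requests each sub-page one at a time by joining the relative hrefs onto the base URL (continuing from the code above):
from urllib.parse import urljoin

for pr in product_result:
    sub_url = urljoin(url1, pr['href'])                  # always a single absolute URL
    sub_soup = BeautifulSoup(r.get(sub_url).content, 'lxml')
    # ... pull reviews, stars, etc. from sub_soup here ...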
I am writing a python script using BeautifulSoup. I need to scrape a website and count unique links ignoring the links starting with '#'.
Example if the following links exist on a webpage:
https://www.stackoverflow.com/questions
https://www.stackoverflow.com/foo
https://www.cnn.com/
For this example, the only two unique links will be (the link information after the main domain name is removed):
https://stackoverflow.com/ Count 2
https://cnn.com/ Count 1
Note: this is my first time using python and web scraping tools.
I appreciate all the help in advance.
This is what I have tried so far:
from bs4 import BeautifulSoup
import requests
url = 'https://en.wikipedia.org/wiki/Beautiful_Soup_(HTML_parser)'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
count = 0
for link in soup.find_all('a'):
print(link.get('href'))
count += 1
There is a function named urlparse in urllib.parse with which you can get the netloc of a URL. And there is a newer HTTP library named requests_html which can help you get all the links in the source file.
from requests_html import HTMLSession
from collections import Counter
from urllib.parse import urlparse
session = HTMLSession()
r = session.get("the link you want to crawl")
unique_netlocs = Counter(urlparse(link).netloc for link in r.html.absolute_links)
for link in unique_netlocs:
print(link, unique_netlocs[link])
You could also do this:
from bs4 import BeautifulSoup
from collections import Counter
import requests
soup = BeautifulSoup(requests.get("https://en.wikipedia.org/wiki/Beautiful_Soup_(HTML_parser)").text, "html.parser")
foundUrls = Counter([link["href"] for link in soup.find_all("a", href=lambda href: href and not href.startswith("#"))])
foundUrls = foundUrls.most_common()
for item in foundUrls:
print ("%s: %d" % (item[0], item[1]))
The soup.find_all line checks that each a tag has an href set and that it doesn't start with the # character.
Counter counts the occurrences of each list entry, and most_common orders them by count.
The for loop just prints the results.
My way to do this is to find all links using Beautiful Soup and then determine which link redirects to which location:
import requests
import tldextract  # third-party package: pip install tldextract
from bs4 import BeautifulSoup

def get_count_url(url): # get the number of links having the same domain and suffix
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
count = 0
urls={} #dictionary for the domains
# input_domain=url.split('//')[1].split('/')[0]
#library to extract the exact domain( ex.- blog.bbc.com and bbc.com have the same domains )
input_domain=tldextract.extract(url).domain+"."+tldextract.extract(url).suffix
for link in soup.find_all('a'):
word =link.get('href')
# print(word)
if word:
# Same website or domain calls
if "#" in word or word[0]=="/": #div call or same domain call
if not input_domain in urls:
# print(input_domain)
urls[input_domain]=1 #if first encounter with the domain
else:
urls[input_domain]+=1 #multiple encounters
elif "javascript" in word:
# javascript function calls (for domains that use modern JS frameworks to display information)
if not "JavascriptRenderingFunctionCall" in urls:
urls["JavascriptRenderingFunctionCall"]=1
else:
urls["JavascriptRenderingFunctionCall"]+=1
else:
# main_domain=word.split('//')[1].split('/')[0]
main_domain=tldextract.extract(word).domain+"." +tldextract.extract(word).suffix
# print(main_domain)
if main_domain.split('.')[0]=='www':
main_domain = main_domain.replace("www.","") # removing the www
if not main_domain in urls: # maintaining the dictionary
urls[main_domain]=1
else:
urls[main_domain]+=1
count += 1
for key, value in urls.items(): # printing the dictionary in a paragraph format for better readability
print(key,value)
return count
tldextract finds the correct domain name and soup.find_all('a') finds the a tags. The if statements distinguish same-domain links, javascript calls, and links to other domains.
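An illustrative call, reusing the Wikipedia URL from the question:
total_links = get_count_url('https://en.wikipedia.org/wiki/Beautiful_Soup_(HTML_parser)')
print('total links counted:', total_links)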
I would like to scrape the following website using Python and need to export the scraped data into a CSV file:
http://www.swisswine.ch/en/producer?search=&&
This website consists of 154 pages for the relevant search. I need to call every page and scrape its data, but my script doesn't move on to the next pages continuously; it only scrapes one page's data.
Here I assigned the condition i < 153, yet the script only ran for one page and gave me 10 records. I need data from the 1st to the 154th page.
How can I scrape the data from all pages in one run of the script, and how can I export the data as a CSV file?
My script is as follows:
import csv
import requests
from bs4 import BeautifulSoup
i = 0
while i < 153:
url = ("http://www.swisswine.ch/en/producer?search=&&&page=" + str(i))
r = requests.get(url)
i=+1
r.content
soup = BeautifulSoup(r.content)
print (soup.prettify())
g_data = soup.find_all("ul", {"class": "contact-information"})
for item in g_data:
print(item.text)
You should put your HTML parsing code inside the loop as well. And you are not incrementing the i variable correctly (thanks @MattDMo):
import csv
import requests
from bs4 import BeautifulSoup
i = 0
while i < 153:
url = ("http://www.swisswine.ch/en/producer?search=&&&page=" + str(i))
r = requests.get(url)
i += 1
soup = BeautifulSoup(r.content)
print (soup.prettify())
g_data = soup.find_all("ul", {"class": "contact-information"})
for item in g_data:
print(item.text)
I would also improve the following:
use requests.Session() to maintain a web-scraping session, which will also bring a performance boost:
if you're making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase
be explicit about an underlying parser for BeautifulSoup:
soup = BeautifulSoup(r.content, "html.parser") # or "lxml", or "html5lib"
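The question also asks how to export the data to a CSV file; here is a minimal sketch combining the points above (one row per contact-information block is assumed, the producers.csv filename is only illustrative, and the page parameter is assumed to run from 0 to 153):
import csv
import requests
from bs4 import BeautifulSoup

with requests.Session() as session, open("producers.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["contact information"])   # header row
    for i in range(154):                       # pages 0..153
        url = "http://www.swisswine.ch/en/producer?search=&&&page=" + str(i)
        r = session.get(url)
        soup = BeautifulSoup(r.content, "html.parser")
        for item in soup.find_all("ul", {"class": "contact-information"}):
            writer.writerow([item.get_text(" ", strip=True)])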