I'm trying to scrape a website for product reviews and can't assign more than one URL to a variable. Basically, I need to scrape the URLs found inside a parent URL for specific content.
I have the parent URL and three linked pages to scrape for product details such as reviews, stars, etc. When I pass more than one URL to the variable, a "connection adapter" error is thrown. I've also tried simply repeating or copying the same code three times, to no avail.
import requests as r
from bs4 import BeautifulSoup
import csv

url1 = 'http://drd.ba.ttu.edu/isqs6339/imbadproducts/'
filepath = 'dataout.csv'

res = r.get(url1)
soup = BeautifulSoup(res.content, 'lxml')

results = soup.find("a")
print(results)
print(results['href'])

results = soup.find_all("a")
for l in results:
    print(l['href'])
for l in results:
    print(l.text)

print(res.headers)

product_result = soup.find_all('a')
for pr in product_result:
    print(pr)

search_results = soup.find('div', attrs={'id': 'searchresults'})
product_result = search_results.find_all('a')
for pr in product_result:
    print(pr)
So I provided one link, but there are three embedded links with different tags to scrape, and I've never been able to get past the connection adapter error.
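For reference, requests raises a "No connection adapters were found" error when several URLs are packed into a single string passed to requests.get, so the usual fix is to request each scraped href separately, one call per URL. A minimal sketch along those lines, assuming the child links are relative to the parent URL and leaving the child-page parsing as a placeholder:

import requests as r
from bs4 import BeautifulSoup

url1 = 'http://drd.ba.ttu.edu/isqs6339/imbadproducts/'
res = r.get(url1)
soup = BeautifulSoup(res.content, 'lxml')

search_results = soup.find('div', attrs={'id': 'searchresults'})
for link in search_results.find_all('a', href=True):
    child_url = url1 + link['href']      # assumes the hrefs are relative to the parent URL
    child_res = r.get(child_url)         # one request per URL, never several URLs in one string
    child_soup = BeautifulSoup(child_res.content, 'lxml')
    # parse reviews, stars, etc. from child_soup here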
Hello, I am new to Python and practicing web scraping on some demo sites.
I am trying to scrape this website http://books.toscrape.com/ and want to extract:
href
name/title
star rating/star-rating
price/price_color
in-stock availability/instock availability
I have written some basic code that gets down to each book element, but after that I am clueless as to how to extract that information.
import requests
from csv import reader, writer
from bs4 import BeautifulSoup

base_url = "http://books.toscrape.com/"
r = requests.get(base_url)
htmlContent = r.content
soup = BeautifulSoup(htmlContent, 'html.parser')

for article in soup.find_all('article'):
    pass  # this reaches each book, but how do I pull the fields listed above?
This will find the href and name for every book. You could also extract other information if you want.
import requests
from csv import reader, writer
from bs4 import BeautifulSoup

base_url = "http://books.toscrape.com/"
r = requests.get(base_url)
soup = BeautifulSoup(r.content, 'html.parser')

def extract_info(soup):
    href = []
    for a in soup.find_all('a', href=True):
        if a.text:
            if "catalogue" in a["href"]:
                href.append(a['href'])
    name = []
    for a in soup.find_all('a', title=True):
        name.append(a.text)
    return href, name

href, name = extract_info(soup)
print(href[0], name[0])
The output will be the href and name of the first book.
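To cover the remaining fields from the question (rating, price, availability), one option is to iterate over the article.product_pod blocks and read the class and text of the nested tags. A sketch, assuming the usual books.toscrape.com markup:

def extract_book_details(soup):
    books = []
    for article in soup.find_all('article', class_='product_pod'):
        a = article.h3.a  # the title link inside each book block
        books.append({
            'href': a['href'],
            'title': a['title'],
            'rating': article.find('p', class_='star-rating')['class'][1],  # e.g. 'Three'
            'price': article.find('p', class_='price_color').text,
            'availability': article.find('p', class_='instock').text.strip(),
        })
    return books

for book in extract_book_details(soup):
    print(book)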
Try the approach below using Python requests and BeautifulSoup. I fetched the page URL from the website itself after inspecting the Network section > Doc tab of the Google Chrome browser.
What exactly the script below does:
First it builds the page URL from a page-number parameter and issues a GET request.
The URL is dynamic and is rebuilt after each iteration; you will notice that the PAGE_NO parameter is incremented every time around.
After getting the data, the script parses the HTML with the html.parser parser.
Finally it iterates over the list of books fetched on each page and prints the title, hyperlink, price, stock availability, and rating.
There are 50 pages and 1,000 results; the script below extracts all the book details, one page per iteration.
import requests
from urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
from bs4 import BeautifulSoup as bs

def scrap_books_data():
    PAGE_NO = 1  # page-number parameter, incremented after every iteration
    while True:
        print('Creating URL to scrape books data for page', str(PAGE_NO))
        URL = 'http://books.toscrape.com/catalogue/page-' + str(PAGE_NO) + '.html'  # dynamic URL rebuilt on every iteration
        response = requests.get(URL, verify=False)  # GET request to fetch data from the site
        soup = bs(response.text, 'html.parser')  # parse the HTML using 'html.parser'
        extracted_books_data = soup.find_all('article', class_='product_pod')  # all article tags where book details are nested
        if len(extracted_books_data) == 0:  # break the loop and exit when there is no more data to process
            break
        else:
            for item in range(len(extracted_books_data)):  # iterate over the list of extracted books
                print('-' * 100)
                print('Title : ', extracted_books_data[item].contents[5].contents[0].attrs['title'])
                print('Link : ', extracted_books_data[item].contents[5].contents[0].attrs['href'])
                print('Rating : ', extracted_books_data[item].contents[3].attrs['class'][1])
                print('Price : ', extracted_books_data[item].contents[7].contents[1].text.replace('Â', ''))
                print('Availability : ', extracted_books_data[item].contents[7].contents[3].text.replace('\n', '').strip())
                print('-' * 100)
        PAGE_NO += 1  # increment the page number to scrape the next page

scrap_books_data()
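Since the question also imports csv, a variant that collects the same fields into rows and writes them to a file might look like the sketch below. It swaps the positional .contents indexing for class-based lookups, which tends to be less fragile; books.csv and the column names are arbitrary choices:

import csv
import requests
from bs4 import BeautifulSoup as bs

rows = [['title', 'link', 'rating', 'price', 'availability']]
page_no = 1
while True:
    url = 'http://books.toscrape.com/catalogue/page-' + str(page_no) + '.html'
    soup = bs(requests.get(url).text, 'html.parser')
    books = soup.find_all('article', class_='product_pod')
    if not books:
        break
    for book in books:
        link = book.h3.a  # title anchor inside each product_pod block
        rows.append([
            link['title'],
            link['href'],
            book.find('p', class_='star-rating')['class'][1],
            book.find('p', class_='price_color').text,
            book.find('p', class_='instock').text.strip(),
        ])
    page_no += 1

with open('books.csv', 'w', newline='', encoding='utf-8') as f:
    csv.writer(f).writerows(rows)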
I am trying to build a scraper to get some abstracts of academic papers and their corresponding titles on this page.
The problem is that my for link in bsObj.findAll('a',{'class':'search-track'}) does not return the links I need to further build my scraper. In my code, the check is like this:
for link in bsObj.findAll('a',{'class':'search-track'}):
print(link)
The for loop above does not print anything, yet the href links should be inside the <a class="search-track"> ... </a> tags.
I have referred to this post, but changing the BeautifulSoup parser did not solve my problem. I am using "html.parser" in my BeautifulSoup constructor: bsObj = bs(html.content, features="html.parser").
Also, print(len(bsObj)) prints "3" with "html.parser", while it prints "2" for both "lxml" and "html5lib".
I started off using urllib.request.urlopen to get the page and then tried requests.get() instead; unfortunately the two approaches give me the same bsObj.
Here is the code I've written:
#from urllib.request import urlopen
import requests
from bs4 import BeautifulSoup as bs
import ssl

'''
The Elsevier search is kind of a tree structure:
"keyword --> a list of journals (a journal contains many articles) --> lists of articles"
'''

address = input("Please type in your keyword: ")  # My keyword is "catalyst for water splitting"
# https://www.elsevier.com/en-xs/search-results?
# query=catalyst%20for%20water%20splitting&labels=journals&page=1
address = address.replace(" ", "%20")
address = "https://www.elsevier.com/en-xs/search-results?query=" + address + "&labels=journals&page=1"

journals = []
articles = []

def getJournals(url):
    global journals
    #html = urlopen(url)
    html = requests.get(url)
    bsObj = bs(html.content, features="html.parser")
    #print(len(bsObj))
    #testFile = open('testFile.txt', 'wb')
    #testFile.write(bsObj.text.encode(encoding='utf-8', errors='strict') + '\n'.encode(encoding='utf-8', errors='strict'))
    #testFile.close()

    for link in bsObj.findAll('a', {'class': 'search-track'}):
        print(link)
        ######## does not print anything ########
    '''
    if 'href' in link.attrs and link.attrs['href'] not in journals:
        newJournal = link.attrs['href']
        journals.append(newJournal)
    '''
    return None

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

getJournals(address)
print(journals)
Can anyone tell me what the problem in my code is, such that the for loop does not print any links? I need to store the journal links in a list and then visit each link to scrape the abstracts of the papers. The abstract part of a paper is free, so the website shouldn't have blocked my access to it.
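A quick way to confirm whether the anchors are even present in the HTML the server returns (before suspecting the parser) is to search the raw response text for the class name. A small diagnostic sketch, using the query URL from the comments above:

import requests

url = ("https://www.elsevier.com/en-xs/search-results"
       "?query=catalyst%20for%20water%20splitting&labels=journals&page=1")
html = requests.get(url)
# If this prints False, the links are injected by JavaScript after page load,
# so no BeautifulSoup parser will ever see them in this response.
print('search-track' in html.text)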
This page is dynamically loaded with JavaScript, so BeautifulSoup can't handle it directly. You may be able to do it using Selenium, but in this case you can do it by tracking the API calls made by the page (for more, see, as one of many examples, here).
In your particular case it can be done this way:
from bs4 import BeautifulSoup as bs
import requests
import json
#this is where the data is hiding:
url = "https://site-search-api.prod.ecommerce.elsevier.com/search?query=catalyst%20for%20water%20splitting&labels=journals&start=0&limit=10&lang=en-xs"
html = requests.get(url)
soup = bs(html.content, features="html.parser")
data = json.loads(str(soup))#response is in json format so we load it into a dictionary
Note: in this case it's also possible to dispense with BeautifulSoup altogether and load the response directly, as in data = json.loads(html.content). From this point:
hits = data['hits']['hits']  # target urls are hidden deep inside nested dictionaries and lists
for hit in hits:
    print(hit['_source']['url'])
Output:
https://www.journals.elsevier.com/water-research
https://www.journals.elsevier.com/water-research-x
etc.
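Putting that note into practice, a BeautifulSoup-free version might look like the sketch below; the start and limit query parameters look like paging controls, but treating them that way is an assumption worth verifying:

import requests

api = ("https://site-search-api.prod.ecommerce.elsevier.com/search"
       "?query=catalyst%20for%20water%20splitting&labels=journals"
       "&start=0&limit=10&lang=en-xs")

data = requests.get(api).json()  # the endpoint returns JSON, so no HTML parsing is needed
journal_urls = [hit['_source']['url'] for hit in data['hits']['hits']]
print(journal_urls)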
I'm trying to scrape the blog "https://blog.feedspot.com/ai_rss_feeds/" and crawl through all the links in it to look for Artificial Intelligence-related information in each of the crawled links.
The blog follows a pattern: it has multiple RSS feeds, and each feed has an attribute called "Site" in the UI. I need to get all the links in the "Site" attribute, for example aitrends.com, sciencedaily.com, etc. In the code, the main div has a class called "rss-block", which has a nested class called "data"; each "data" block contains several <a> tags, and those tags carry href attributes. The value in href gives the links to be crawled. We need to look for AI-related articles in each of the links found by scraping the above-mentioned structure.
I've tried several variations of the following code, but nothing seemed to help much.
import requests
from bs4 import BeautifulSoup
page = requests.get('https://blog.feedspot.com/ai_rss_feeds/')
soup = BeautifulSoup(page.text, 'html.parser')
class_name='data'
dataSoup = soup.find(class_=class_name)
print(dataSoup)
artist_name_list_items = dataSoup.find('a', href=True)
print(artist_name_list_items)
I'm struggling even to get the links on that page, let alone crawl through each of them to scrape the AI-related articles inside.
If you could help me with both parts of the problem, that would be a great learning experience for me. Please refer to the source of https://blog.feedspot.com/ai_rss_feeds/ for the HTML structure. Thanks in advance!
The first twenty results are stored in the HTML as you see them on the page. The others are pulled from a script tag, and you can regex them out to build the full list of 67. Then loop over that list and issue requests to those URLs for further info. I offer a choice of two different selectors for the initial list population (the second, commented out, uses :contains, available with bs4 4.7.1+).
from bs4 import BeautifulSoup as bs
import requests, re

p = re.compile(r'feed_domain":"(.*?)",')

with requests.Session() as s:
    r = s.get('https://blog.feedspot.com/ai_rss_feeds/')
    soup = bs(r.content, 'lxml')
    results = [i['href'] for i in soup.select('.data [rel="noopener nofollow"]:last-child')]
    ## or use with bs4 4.7.1+
    #results = [i['href'] for i in soup.select('strong:contains(Site) + a')]
    results += [re.sub(r'\n\s+', '', i.replace('\\', '')) for i in p.findall(r.text)]

    for link in results:
        # do something, e.g.
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        # extract info from the individual page
To get all the sublinks for each block, you can use soup.find_all:
from bs4 import BeautifulSoup as soup
import requests
d = soup(requests.get('https://blog.feedspot.com/ai_rss_feeds/').text, 'html.parser')
results = [[i['href'] for i in c.find('div', {'class':'data'}).find_all('a')] for c in d.find_all('div', {'class':'rss-block'})]
Output:
[['http://aitrends.com/feed', 'https://www.feedspot.com/?followfeedid=4611684', 'http://aitrends.com/'], ['https://www.sciencedaily.com/rss/computers_math/artificial_intelligence.xml', 'https://www.feedspot.com/?followfeedid=4611682', 'https://www.sciencedaily.com/news/computers_math/artificial_intelligence/'], ['http://machinelearningmastery.com/blog/feed', 'https://www.feedspot.com/?followfeedid=4575009', 'http://machinelearningmastery.com/blog/'], ['http://news.mit.edu/rss/topic/artificial-intelligence2', 'https://www.feedspot.com/?followfeedid=4611685', 'http://news.mit.edu/topic/artificial-intelligence2'], ['https://www.reddit.com/r/artificial/.rss', 'https://www.feedspot.com/?followfeedid=4434110', 'https://www.reddit.com/r/artificial/'], ['https://chatbotsmagazine.com/feed', 'https://www.feedspot.com/?followfeedid=4470814', 'https://chatbotsmagazine.com/'], ['https://chatbotslife.com/feed', 'https://www.feedspot.com/?followfeedid=4504512', 'https://chatbotslife.com/'], ['https://aws.amazon.com/blogs/ai/feed', 'https://www.feedspot.com/?followfeedid=4611538', 'https://aws.amazon.com/blogs/ai/'], ['https://developer.ibm.com/patterns/category/artificial-intelligence/feed', 'https://www.feedspot.com/?followfeedid=4954414', 'https://developer.ibm.com/patterns/category/artificial-intelligence/'], ['https://lexfridman.com/category/ai/feed', 'https://www.feedspot.com/?followfeedid=4968322', 'https://lexfridman.com/ai/'], ['https://medium.com/feed/#Francesco_AI', 'https://www.feedspot.com/?followfeedid=4756982', 'https://medium.com/#Francesco_AI'], ['https://blog.netcoresmartech.com/rss.xml', 'https://www.feedspot.com/?followfeedid=4998378', 'https://blog.netcoresmartech.com/'], ['https://www.aitimejournal.com/feed', 'https://www.feedspot.com/?followfeedid=4979214', 'https://www.aitimejournal.com/'], ['https://blogs.nvidia.com/feed', 'https://www.feedspot.com/?followfeedid=4611576', 'https://blogs.nvidia.com/'], ['http://feeds.feedburner.com/AIInTheNews', 'https://www.feedspot.com/?followfeedid=623918', 'http://aitopics.org/whats-new'], ['https://blogs.technet.microsoft.com/machinelearning/feed', 'https://www.feedspot.com/?followfeedid=4431827', 'https://blogs.technet.microsoft.com/machinelearning/'], ['https://machinelearnings.co/feed', 'https://www.feedspot.com/?followfeedid=4611235', 'https://machinelearnings.co/'], ['https://www.artificial-intelligence.blog/news?format=RSS', 'https://www.feedspot.com/?followfeedid=4611100', 'https://www.artificial-intelligence.blog/news/'], ['https://news.google.com/news?cf=all&hl=en&pz=1&ned=us&q=artificial+intelligence&output=rss', 'https://www.feedspot.com/?followfeedid=4611157', 'https://news.google.com/news/section?q=artificial%20intelligence&tbm=nws&*'], ['https://www.youtube.com/feeds/videos.xml?channel_id=UCEqgmyWChwvt6MFGGlmUQCQ', 'https://www.feedspot.com/?followfeedid=4611505', 'https://www.youtube.com/channel/UCEqgmyWChwvt6MFGGlmUQCQ/videos']]
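If only the "Site" link from each block is needed (as the question asks), the plain site URL appears to be the last <a> of each "data" block in the output above, so a short follow-up built on the same selectors could be:

from bs4 import BeautifulSoup as soup
import requests

d = soup(requests.get('https://blog.feedspot.com/ai_rss_feeds/').text, 'html.parser')
# Keep only the last <a> of each "data" block, which appears to be the "Site" link
site_links = [c.find('div', {'class': 'data'}).find_all('a')[-1]['href']
              for c in d.find_all('div', {'class': 'rss-block'})]
print(site_links[:3])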
I'm trying to retrieve all the products from a page using Beautiful Soup. The page has pagination, and to handle it I have made a loop so the retrieval works for all pages.
But when I move to the next step and try to "find_all()" the tags, it only gives me the data from the last page.
If I try with one isolated page it works fine, so I guess it is a problem with getting all the HTML from all the pages.
My code is the following:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import urllib3 as ur

http = ur.PoolManager()
base_url = 'https://www.kiwoko.com/tienda-de-perros-online.html'

for x in range(1, int(33) + 1):
    dog_products_http = http.request('GET', base_url + '?p=' + str(x))
    soup = BeautifulSoup(dog_products_http.data, 'html.parser')
    print(soup.prettify)
and once it has finished:
soup.find_all('li', {'class': 'item product product-item col-xs-12 col-sm-6 col-md-4'})
As I said, if I do not use the for range and only retrieve one page (for example https://www.kiwoko.com/tienda-de-perros-online.html?p=10), it works fine and gives me the 36 products.
I have copied the "soup" into a Word file and searched for the class to see if there is a problem, and all 1,153 products I'm looking for are there.
So I think the soup is right, but as I'm looking for "more than one HTML" I don't think the find_all is working well.
What could be the problem?
You do want your find inside the loop, but here is a way to copy the ajax call the page makes, which lets you return more items per request, calculate the number of pages dynamically, and make requests for all the products.
I re-use the connection with Session for efficiency.
from bs4 import BeautifulSoup as bs
import requests, math

results = []

with requests.Session() as s:
    r = s.get('https://www.kiwoko.com/tienda-de-perros-online.html?p=1&product_list_limit=54&isAjax=1&_=1560702601779').json()
    soup = bs(r['categoryProducts'], 'lxml')
    results.append(soup.select('.product-item-details'))
    product_count = int(soup.select_one('.toolbar-number').text)
    pages = math.ceil(product_count / 54)

    if pages > 1:
        for page in range(2, pages + 1):
            r = s.get('https://www.kiwoko.com/tienda-de-perros-online.html?p={}&product_list_limit=54&isAjax=1&_=1560702601779'.format(page)).json()
            soup = bs(r['categoryProducts'], 'lxml')
            results.append(soup.select('.product-item-details'))

results = [result for item in results for result in item]
print(len(results))
# parse out what you want from results (a list of tags), or do it in the loop above
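For completeness, the simpler fix hinted at in the first sentence of this answer, doing the find_all inside the loop and accumulating the tags, might look like this sketch that stays with the asker's urllib3 setup:

from bs4 import BeautifulSoup
import urllib3 as ur

http = ur.PoolManager()
base_url = 'https://www.kiwoko.com/tienda-de-perros-online.html'

products = []
for x in range(1, 34):
    page = http.request('GET', base_url + '?p=' + str(x))
    soup = BeautifulSoup(page.data, 'html.parser')
    # collect the product tags of *this* page before soup is overwritten on the next iteration
    products += soup.find_all('li', {'class': 'item product product-item col-xs-12 col-sm-6 col-md-4'})

print(len(products))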
Trying to pull the href links for the products on this webpage. The code pulls all of the hrefs except for the products listed on the page.
from bs4 import BeautifulSoup
import requests

url = "https://www.neb.com/search#t=_483FEC15-900D-4CF1-B514-1B921DD055BA&sort=%40ftitle51880%20ascending"
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'lxml')

tags = soup.find_all('a')
for tag in tags:
    print(tag.get('href'))
The products are loaded dynamically through a REST API; the URL is this:
https://international.neb.com/coveo/rest/v2/?sitecoreItemUri=sitecore%3A%2F%2Fweb%2F%7BA1D9D237-B272-4C5E-A23F-EC954EB71A26%7D%3Flang%3Den%26ver%3D1&siteName=nebinternational
Loading this response will get you the URLs.
Next time, check your network inspector to see whether any part of the web page is loaded dynamically (or use Selenium).
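A minimal starting point for that endpoint might look like the sketch below. The shape of the JSON isn't documented here (and the call may need extra headers or a token), so the first step is just to inspect the top-level keys and drill down to where the product URLs live:

import requests

api_url = ("https://international.neb.com/coveo/rest/v2/"
           "?sitecoreItemUri=sitecore%3A%2F%2Fweb%2F%7BA1D9D237-B272-4C5E-A23F-EC954EB71A26%7D"
           "%3Flang%3Den%26ver%3D1&siteName=nebinternational")

resp = requests.get(api_url)
resp.raise_for_status()  # fail loudly if the endpoint needs extra headers or a token
data = resp.json()
print(data.keys())       # explore the payload, then pull the product URLs out of it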
Try to verify whether the product hrefs are in the received response. I'm telling you to do this because if the product section is dynamically generated by ajax, for example, a simple GET on the main page will not bring them in.
Print the response and verify whether the products are present in the HTML.
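One way to do that check is to dump the response to a file and search it for a product name you can see in the browser. A small sketch; the filename is an arbitrary choice:

import requests

url = ("https://www.neb.com/search#t=_483FEC15-900D-4CF1-B514-1B921DD055BA"
       "&sort=%40ftitle51880%20ascending")
resp = requests.get(url)

# Save the raw HTML; if a product name visible in the browser is missing here,
# the products are rendered client-side and requests alone will never see them.
with open("neb_search.html", "w", encoding="utf-8") as f:
    f.write(resp.text)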
I think you want something like this:
from bs4 import BeautifulSoup
import urllib.request

for numb in ('1', '100'):
    resp = urllib.request.urlopen("https://www.neb.com/search#first=" + numb + "&t=_483FEC15-900D-4CF1-B514-1B921DD055BA&sort=%40ftitle51880%20ascending")
    soup = BeautifulSoup(resp, from_encoding=resp.info().get_param('charset'))
    for link in soup.find_all('a', href=True):
        print(link['href'])