from bs4 import BeautifulSoup
import requests
url13cases = 'https://hitechfix.com/product-category/cases/apple-cases/iphone-cases/iphone-13-6-1-cases/'
r = requests.get(url13cases)
soup = BeautifulSoup(r.text, 'html.parser')
img = soup.findAll('img', {"class": "attachment-woocommerce_thumbnail size-woocommerce_thumbnail"})
So I am trying to scrape all the pictures from my friend's website, but the problem is that there are a few pages. I just want to know how to edit the URL so that it also goes to the second, third, and fourth page. Then I also want to create an array or objects for each link.
The link for page 2 looks like this: https://hitechfix.com/product-category/cases/apple-cases/iphone-cases/iphone-13-6-1-cases/page/2/
It's the same as the first link, just with the extra /page/2/ at the end. There are also 2 more pages, for 4 pages total. How do I get all of them and create objects?
You could use the built-in function range() to iterate over the pages.
In newer code, avoid the old syntax findAll(); instead use find_all() or select() with CSS selectors. For more, take a minute to check the docs.
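For instance, a select() equivalent of the class lookup in the question might look like the line below; this is just a sketch that reuses the question's soup and class names:

thumbs = soup.select('img.attachment-woocommerce_thumbnail.size-woocommerce_thumbnail')  # CSS-selector version of the same lookup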
Example
from bs4 import BeautifulSoup
import requests
img_list = []

for i in range(1, 5):
    # request each paginated category page
    r = requests.get(f'https://hitechfix.com/product-category/cases/apple-cases/iphone-cases/iphone-13-6-1-cases/page/{i}')
    soup = BeautifulSoup(r.text, 'html.parser')
    # collect the product thumbnails from this page
    img_list.extend(soup.find_all('img', {"class": "attachment-woocommerce_thumbnail size-woocommerce_thumbnail"}))

img_list
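If you then want an array of objects per image rather than the raw tags, one option is to pull the standard src and alt attributes out of each collected tag; this is only a sketch continuing from img_list above:

# build a simple list of dicts, one per thumbnail
images = [{"src": img.get("src"), "alt": img.get("alt", "")} for img in img_list]
print(images[:3])  # peek at the first few entries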
I am having trouble getting the hyperlinks for the tennis matches listed on a webpage. How do I fix the code below so that it can obtain and print them?
import requests
from bs4 import BeautifulSoup
response = requests.get("https://www.betexplorer.com/results/tennis/?year=2022&month=11&day=02")
webpage = response.content
soup = BeautifulSoup(webpage, "html.parser")
print(soup.findAll('a href'))
In newer code, avoid the old syntax findAll(); instead use find_all() or select() with CSS selectors. For more, take a minute to check the docs.
Select your elements more specifically and use a set comprehension to avoid duplicates:
set('https://www.betexplorer.com'+a.get('href') for a in soup.select('a[href^="/tennis"]:has(strong)'))
Example
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.betexplorer.com/results/tennis/?year=2022&month=11&day=02')
soup = BeautifulSoup(r.text, 'html.parser')
set('https://www.betexplorer.com'+a.get('href') for a in soup.select('a[href^="/tennis"]:has(strong)'))
Output
{'https://www.betexplorer.com/tennis/itf-men-singles/m15-new-delhi-2/sinha-nitin-kumar-vardhan-vishnu/tOasQaJm/',
'https://www.betexplorer.com/tennis/itf-women-doubles/w25-jerusalem/mushika-mao-mushika-mio-cohen-sapir-nagornaia-sofiia/xbNOHTEH/',
'https://www.betexplorer.com/tennis/itf-men-singles/m25-jakarta-2/barki-nathan-anthony-sun-fajing/zy2r8bp0/',
'https://www.betexplorer.com/tennis/itf-women-singles/w15-solarino/margherita-marcon-abbagnato-anastasia/lpq2YX4d/',
'https://www.betexplorer.com/tennis/itf-women-singles/w60-sydney/lee-ya-hsuan-namigata-junri/CEQrNPIG/',
'https://www.betexplorer.com/tennis/itf-men-doubles/m15-sharm-elsheikh-16/echeverria-john-marrero-curbelo-ivan-ianin-nikita-jasper-lai/nsGbyqiT/',...}
Change the last line to
print([a['href'] for a in soup.findAll('a')])
See a full tutorial here: https://pythonprogramminglanguage.com/get-links-from-webpage/
I tried to fetch all of the product names from the web page, but I could only get 12.
If I scroll down the web page, it refreshes and adds more information.
How can I get all of the information?
import requests
from bs4 import BeautifulSoup
import re
url = "https://www.outre.com/product-category/wigs/"
res = requests.get(url)
res.raise_for_status()
soup = BeautifulSoup(res.text, "lxml")
items = soup.find_all("div", attrs={"class":"title-wrapper"})
for item in items:
    print(item.p.a.get_text())
Your code is good. The thing is, on the website the products are dynamically loaded, so when you do your request you can only get the first 12 products.
You can check the developer console inside your browser to track the Ajax calls made while browsing.
I did it, and it turns out a call is made to retrieve more products from the URL
https://www.outre.com/product-category/wigs/page/2/
So if you want to get all the products, you need to browse multiple pages. I suggest you use a loop and run your code several times, as sketched below.
N.B.: You can check the website to see if there is a more convenient place to get the products (for example, not from the main page).
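A rough sketch of that loop, reusing the selectors from the question; the stopping condition (break when a page returns no products or no longer responds with 200) is an assumption, not something taken from the site:

import requests
from bs4 import BeautifulSoup

names = []
page = 1
while True:
    # same request/parse as the original code, pointed at each paginated URL
    res = requests.get(f"https://www.outre.com/product-category/wigs/page/{page}/")
    if res.status_code != 200:  # assumption: a missing page stops returning 200
        break
    soup = BeautifulSoup(res.text, "lxml")
    items = soup.find_all("div", attrs={"class": "title-wrapper"})
    if not items:  # no more products on this page
        break
    names.extend(item.p.a.get_text() for item in items)
    page += 1

print(len(names))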
The page loads the products from a different URL via JavaScript, so Beautiful Soup doesn't see them. To get all pages, you can use the following example:
import requests
from bs4 import BeautifulSoup
url = "https://www.outre.com/product-category/wigs/page/{}/"
page = 1
while True:
    soup = BeautifulSoup(requests.get(url.format(page)).content, "html.parser")
    titles = soup.select(".product-title")
    if not titles:
        break
    for title in titles:
        print(title.text)
    page += 1
Prints:
...
Wet & Wavy Loose Curl 18″
Wet & Wavy Boho Curl 20″
Nikaya
Jeanette
Natural Glam Body
Natural Free Deep
I am trying to scrape the website, and there are 12 pages with X links on them - I just want to extract all the links, and store them for later usage.
But there is an awkward problem with extracting links from the pages. To be precise, my output contains only the last link from each of the pages.
I know that this description may sound confusing, so let me show you the code and images:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import csv
import time
#here I tried to make a loop for generating page's URLs, and store URLs in the list "issues"
archive = '[redacted URL]'
issues =[]
#i am going for issues 163-175
for i in range(163,175):
    url_of_issue = archive + '/' + str(i)
    issues.append(url_of_issue)
#now, I want to extract links from generated pages
#idea is simple - loop iterates over the list of URLs/pages and from each issue page get URLS of the listed papers, storing them in the list "paper_urls"
paper_urls =[]
for url in issues:
    response = requests.get(url)
    html = response.text
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.select('.obj_article_summary .title a'):
        ahrefTags=(a['href'])
    paper_urls.append(ahrefTags)
    print(paper_urls)
    time.sleep(5)
But the problem is, my output looks like [redacted].
Instead of ~80 links, I'm getting this! I wondered what happened, and it looks like my script gets only the last listed link from every generated URL (from the list named "issues" in the code)?! How do I fix it? I have no idea what the problem could be here.
Were you perhaps missing an indentation when appending to paper_urls?
paper_urls = []
for url in issues:
    response = requests.get(url)
    html = response.text
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.select('.obj_article_summary .title a'):
        ahrefTags=(a['href'])
        paper_urls.append(ahrefTags) # added missing indentation
    print(paper_urls)
    time.sleep(5)
The whole code, after moving the print outside the loop, would look like this:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import csv
import time
#here I tried to make a loop for generating page's URLs, and store URLs in the list "issues"
archive = '[redacted URL]'
issues =[]
#i am going for issues 163-175
for i in range(163,175):
    url_of_issue = archive + '/' + str(i)
    issues.append(url_of_issue)
#now, I want to extract links from generated pages
#idea is simple - loop iterates over the list of URLs/pages and from each issue page get URLS of the listed papers, storing them in the list "paper_urls"
paper_urls = []
for url in issues:
    response = requests.get(url)
    html = response.text
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.select('.obj_article_summary .title a'):
        ahrefTags=(a['href'])
        paper_urls.append(ahrefTags)
        #print(ahrefTags) #uncomment if you wish to print each and every link by itself
    #time.sleep(5) #uncomment if you wish to add a delay between each request
print(paper_urls)
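Since the question already imports csv and mentions storing the links for later usage, one possible way to dump the collected list to a file is sketched below; the filename is just an example:

# write each collected link on its own row (filename is arbitrary)
with open("paper_urls.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["paper_url"])  # header row
    writer.writerows([u] for u in paper_urls)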
I'm trying to scrape a blog, "https://blog.feedspot.com/ai_rss_feeds/", and crawl through all the links in it to look for Artificial Intelligence related information in each of the crawled links.
The blog follows a pattern - it has multiple RSS Feeds, and each Feed has an attribute called "Site" in the UI. I need to get all the links in the "Site" attribute, for example aitrends.com, sciencedaily.com, etc. In the code, the main div has a class called "rss-block", which has another nested class called "data"; each data block contains several tags, and the href values of the <a> tags inside them give the links to be crawled. We need to look for AI related articles in each of the links found by scraping the above-mentioned structure.
I've tried various variations of the following code but nothing seemed to help much.
import requests
from bs4 import BeautifulSoup
page = requests.get('https://blog.feedspot.com/ai_rss_feeds/')
soup = BeautifulSoup(page.text, 'html.parser')
class_name='data'
dataSoup = soup.find(class_=class_name)
print(dataSoup)
artist_name_list_items = dataSoup.find('a', href=True)
print(artist_name_list_items)
I'm struggling to even get the links on that page, let alone crawling through each of these links to scrape articles related to AI in them.
If you could help me with both parts of the problem, that would be a great learning experience for me. Please refer to the source of https://blog.feedspot.com/ai_rss_feeds/ for the HTML structure. Thanks in advance!
The first twenty results are stored in the html as you see on the page. The others are pulled from a script tag, and you can regex them out to create the full list of 67. Then loop that list and issue requests to those URLs for further info. I offer a choice of two different selectors for the initial list population (the second - commented out - uses :contains, available with bs4 4.7.1+).
from bs4 import BeautifulSoup as bs
import requests, re
p = re.compile(r'feed_domain":"(.*?)",')
with requests.Session() as s:
    r = s.get('https://blog.feedspot.com/ai_rss_feeds/')
    soup = bs(r.content, 'lxml')
    results = [i['href'] for i in soup.select('.data [rel="noopener nofollow"]:last-child')]
    ## or use with bs4 4.7.1 +
    #results = [i['href'] for i in soup.select('strong:contains(Site) + a')]
    results += [re.sub(r'\n\s+', '', i.replace('\\', '')) for i in p.findall(r.text)]

    for link in results:
        # do something e.g.
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        # extract info from indiv page
To get all the sublinks for each block, you can use soup.find_all:
from bs4 import BeautifulSoup as soup
import requests
d = soup(requests.get('https://blog.feedspot.com/ai_rss_feeds/').text, 'html.parser')
results = [[i['href'] for i in c.find('div', {'class':'data'}).find_all('a')] for c in d.find_all('div', {'class':'rss-block'})]
Output:
[['http://aitrends.com/feed', 'https://www.feedspot.com/?followfeedid=4611684', 'http://aitrends.com/'], ['https://www.sciencedaily.com/rss/computers_math/artificial_intelligence.xml', 'https://www.feedspot.com/?followfeedid=4611682', 'https://www.sciencedaily.com/news/computers_math/artificial_intelligence/'], ['http://machinelearningmastery.com/blog/feed', 'https://www.feedspot.com/?followfeedid=4575009', 'http://machinelearningmastery.com/blog/'], ['http://news.mit.edu/rss/topic/artificial-intelligence2', 'https://www.feedspot.com/?followfeedid=4611685', 'http://news.mit.edu/topic/artificial-intelligence2'], ['https://www.reddit.com/r/artificial/.rss', 'https://www.feedspot.com/?followfeedid=4434110', 'https://www.reddit.com/r/artificial/'], ['https://chatbotsmagazine.com/feed', 'https://www.feedspot.com/?followfeedid=4470814', 'https://chatbotsmagazine.com/'], ['https://chatbotslife.com/feed', 'https://www.feedspot.com/?followfeedid=4504512', 'https://chatbotslife.com/'], ['https://aws.amazon.com/blogs/ai/feed', 'https://www.feedspot.com/?followfeedid=4611538', 'https://aws.amazon.com/blogs/ai/'], ['https://developer.ibm.com/patterns/category/artificial-intelligence/feed', 'https://www.feedspot.com/?followfeedid=4954414', 'https://developer.ibm.com/patterns/category/artificial-intelligence/'], ['https://lexfridman.com/category/ai/feed', 'https://www.feedspot.com/?followfeedid=4968322', 'https://lexfridman.com/ai/'], ['https://medium.com/feed/#Francesco_AI', 'https://www.feedspot.com/?followfeedid=4756982', 'https://medium.com/#Francesco_AI'], ['https://blog.netcoresmartech.com/rss.xml', 'https://www.feedspot.com/?followfeedid=4998378', 'https://blog.netcoresmartech.com/'], ['https://www.aitimejournal.com/feed', 'https://www.feedspot.com/?followfeedid=4979214', 'https://www.aitimejournal.com/'], ['https://blogs.nvidia.com/feed', 'https://www.feedspot.com/?followfeedid=4611576', 'https://blogs.nvidia.com/'], ['http://feeds.feedburner.com/AIInTheNews', 'https://www.feedspot.com/?followfeedid=623918', 'http://aitopics.org/whats-new'], ['https://blogs.technet.microsoft.com/machinelearning/feed', 'https://www.feedspot.com/?followfeedid=4431827', 'https://blogs.technet.microsoft.com/machinelearning/'], ['https://machinelearnings.co/feed', 'https://www.feedspot.com/?followfeedid=4611235', 'https://machinelearnings.co/'], ['https://www.artificial-intelligence.blog/news?format=RSS', 'https://www.feedspot.com/?followfeedid=4611100', 'https://www.artificial-intelligence.blog/news/'], ['https://news.google.com/news?cf=all&hl=en&pz=1&ned=us&q=artificial+intelligence&output=rss', 'https://www.feedspot.com/?followfeedid=4611157', 'https://news.google.com/news/section?q=artificial%20intelligence&tbm=nws&*'], ['https://www.youtube.com/feeds/videos.xml?channel_id=UCEqgmyWChwvt6MFGGlmUQCQ', 'https://www.feedspot.com/?followfeedid=4611505', 'https://www.youtube.com/channel/UCEqgmyWChwvt6MFGGlmUQCQ/videos']]
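If you only need the "Site" links (the last URL in each sublist of the output above), you could flatten the result as below; this assumes the site link is always the last <a> in each data block, which holds for the output shown:

# keep only the last href of each block, i.e. the "Site" link
site_links = [links[-1] for links in results if links]
print(site_links[:5])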
I'm trying to retrieve all the products from a page using Beautiful Soup. The page has pagination, and to solve it I have made a loop so the retrieval works for all pages.
But when I move to the next step and try to find_all() the tags, it only gives the data from the last page.
If I try with one isolated page it works fine, so I guess that it is a problem with getting all the html from all the pages.
My code is the following:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import urllib3 as ur
http = ur.PoolManager()
base_url = 'https://www.kiwoko.com/tienda-de-perros-online.html'
for x in range(1, int(33)+1):
    dog_products_http = http.request('GET', base_url+'?p='+str(x))
    soup = BeautifulSoup(dog_products_http.data, 'html.parser')
    print(soup.prettify)
and once it has finished:
soup.find_all('li', {'class': 'item product product-item col-xs-12 col-sm-6 col-md-4'})
As I said, if I do not use the for range and only retrieve one page (for example https://www.kiwoko.com/tienda-de-perros-online.html?p=10), it works fine and gives me the 36 products.
I have copied the "soup" into a Word file and searched for the class to see if there is a problem, and all 1,153 products I'm looking for are there.
So I think the soup is right, but since I am looking through "more than one html", I do not think the find_all is working well.
What could be the problem?
You do want your find inside the loop, but here is a way to copy the Ajax call the page makes, which lets you return more items per request and also calculate the number of pages dynamically and make requests for all products.
I re-use the connection with Session for efficiency.
from bs4 import BeautifulSoup as bs
import requests, math
results = []

with requests.Session() as s:
    r = s.get('https://www.kiwoko.com/tienda-de-perros-online.html?p=1&product_list_limit=54&isAjax=1&_=1560702601779').json()
    soup = bs(r['categoryProducts'], 'lxml')
    results.append(soup.select('.product-item-details'))
    product_count = int(soup.select_one('.toolbar-number').text)
    pages = math.ceil(product_count / 54)

    if pages > 1:
        for page in range(2, pages + 1):
            r = s.get('https://www.kiwoko.com/tienda-de-perros-online.html?p={}&product_list_limit=54&isAjax=1&_=1560702601779'.format(page)).json()
            soup = bs(r['categoryProducts'], 'lxml')
            results.append(soup.select('.product-item-details'))

results = [result for item in results for result in item]
print(len(results))
# parse out from results what you want, as this is a list of tags, or do in loop above
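As a follow-up to that last comment, one way to pull the product names out of each tag might look like the sketch below; note that '.product-item-link' is an assumed child selector for the name, not something verified against the page:

# sketch: extract the product name text from each '.product-item-details' tag
# NOTE: '.product-item-link' is an assumed selector; adjust it to the real markup
names = [tag.select_one('.product-item-link').get_text(strip=True)
         for tag in results
         if tag.select_one('.product-item-link') is not None]
print(names[:10])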