Extracting pagination number using BeautifulSoup - python

I'm trying to extract the pagination number of a webpage and have tried several methods, all to no avail.
What's the right method, and please explain why the following methods do not extract the information as requested:
First method:
import requests
from bs4 import BeautifulSoup

for i in range(0, 48, 24):
    url = f'https://www.rightmove.co.uk/property-for-sale/find.html?locationIdentifier=STATION%5E1712&maxPrice=500000&radius=0.5&sortType=10&propertyTypes=&mustHave=&dontShow=&index={i}&furnishTypes=&keywords='
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'lxml')
    page = soup.select('span[class="pagination-pageInfo"]')
    print(page)
returns:
[]
[]
I've also tried:
1. page = soup.find('span', {'data-bind':'text: total'})
2. page = soup.select("[class~=pagination-pageInfo]")
both of which return nothing. And:
page = soup.select('span', {'data-bind': 'text: total'})
which returns a bunch of unnecessary things rather than the pagination number.
How do I get the pagination number at the bottom?
expected output:
1
2

There is no pagination element in the DOM tree you get back, because that data is loaded by JavaScript. You have two options:
1. You can use Selenium and do what you are doing now (search for the element with the span[class="pagination-pageInfo"] selector); a minimal sketch of this follows the code below.
2. You can still use requests, because you can find all the page data, including pagination, in the JSON embedded at the bottom of the page HTML. You can easily extract it with a regular expression. Full code:
import json
import requests
import re

for i in range(0, 48, 24):
    url = f'https://www.rightmove.co.uk/property-for-sale/find.html?locationIdentifier=STATION%5E1712&maxPrice=500000&radius=0.5&sortType=10&propertyTypes=&mustHave=&dontShow=&index={i}&furnishTypes=&keywords='
    r = requests.get(url)
    html = r.text
    # the page embeds all of its data in a window.jsonModel = {...} script block
    full_data_json = json.loads(re.search(r'window\.jsonModel = (.*)</script>', html).group(1))
    print(full_data_json["pagination"]["page"])
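
For option 1, here is a minimal Selenium sketch (an illustration, not tested against the live site: it assumes Selenium 4 with a Chrome driver available, and that the pagination element is present once the JavaScript has run; an explicit wait may be needed on slow loads):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumes a Chrome driver is available on PATH
for i in range(0, 48, 24):
    url = f'https://www.rightmove.co.uk/property-for-sale/find.html?locationIdentifier=STATION%5E1712&maxPrice=500000&radius=0.5&sortType=10&propertyTypes=&mustHave=&dontShow=&index={i}&furnishTypes=&keywords='
    driver.get(url)
    # the selector from the question works here, because Selenium sees the DOM after JavaScript has run
    page = driver.find_element(By.CSS_SELECTOR, 'span.pagination-pageInfo')
    print(page.text)
driver.quit()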

Related

How to scrape the same information from the next pages?

from bs4 import BeautifulSoup
import requests

url13cases = 'https://hitechfix.com/product-category/cases/apple-cases/iphone-cases/iphone-13-6-1-cases/'
r = requests.get(url13cases)
soup = BeautifulSoup(r.text, 'html.parser')
img = soup.findAll('img', {"class": "attachment-woocommerce_thumbnail size-woocommerce_thumbnail"})
So I am trying to scrape all the pictures from my friend's website, but the problem is that there are a few pages. I just want to know how to edit the URL so it also goes to the second, third, and fourth pages. Then I also want to create an array of objects, one for each link.
The link for page 2 looks like this: https://hitechfix.com/product-category/cases/apple-cases/iphone-cases/iphone-13-6-1-cases/page/2/
It's the same as the last link, just with the extra /page/2/ at the end. There are also two more pages, for four pages total. How do I get all of them and create the objects?
You could use the built-in function range() to iterate over the pages.
In newer code, avoid the old syntax findAll(); instead use find_all(), or select() with CSS selectors. For more, take a minute to check the docs.
Example
from bs4 import BeautifulSoup
import requests

img_list = []
for i in range(1, 5):
    r = requests.get(f'https://hitechfix.com/product-category/cases/apple-cases/iphone-cases/iphone-13-6-1-cases/page/{i}')
    soup = BeautifulSoup(r.text, 'html.parser')
    img_list.extend(soup.find_all('img', {"class": "attachment-woocommerce_thumbnail size-woocommerce_thumbnail"}))

img_list
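
To then "create an array of objects" as asked, one option is to pull the interesting attributes out of each tag into a list of dicts (a sketch; if the site lazy-loads images, the real URL may live in an attribute such as data-src instead of src):

# build one dict per image from the tags collected above
images = [{'src': img.get('src'), 'alt': img.get('alt', '')} for img in img_list]
for image in images:
    print(image['src'])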

BeautifulSoup not returning results of a search on a website

I am trying to get the links to the individual search results on a website (National Gallery of Art), but the link to the search doesn't load the search results. Here is how I try to do it:
url = 'https://www.nga.gov/collection-search-result.html?artist=C%C3%A9zanne%2C%20Paul'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
I can see that the links to the individual results should be findable under soup.findAll('a'), but they do not appear; instead, the last output is a link to an empty search result:
https://www.nga.gov/content/ngaweb/collection-search-result.html
How could I get a list of links, the first of which is the first search result (https://www.nga.gov/collection/art-object-page.52389.html), the second is the second search result (https://www.nga.gov/collection/art-object-page.52085.html) etc?
Actually, the data is generated from an API call's JSON response. Here is how to get the desired list of links.
Code:
import requests

url = 'https://www.nga.gov/collection-search-result/jcr:content/parmain/facetcomponent/parList/collectionsearchresu.pageSize__30.pageNumber__1.json?artist=C%C3%A9zanne%2C%20Paul&_=1634762134895'
r = requests.get(url)
for item in r.json()['results']:
    url = item['url']
    abs_url = f'https://www.nga.gov{url}'
    print(abs_url)
Output:
https://www.nga.gov/content/ngaweb/collection/art-object-page.52389.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.52085.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.46577.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.46580.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.46578.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.136014.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.46576.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.53120.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.54129.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.52165.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.46575.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.53122.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.93044.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.66405.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.53119.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.53121.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.46579.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.66406.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.45866.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.53123.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.45867.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.45986.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.45877.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.136025.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.74193.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.74192.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.66486.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.76288.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.76223.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.76268.html
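
The pageNumber__1 segment in that URL looks like the page selector, so you could presumably walk all result pages the same way (an assumption based on the URL structure, not verified against the API):

import requests

# the trailing &_=<timestamp> cache-buster from the original URL appears optional
base = ('https://www.nga.gov/collection-search-result/jcr:content/parmain/'
        'facetcomponent/parList/collectionsearchresu'
        '.pageSize__30.pageNumber__{}.json?artist=C%C3%A9zanne%2C%20Paul')
for page in range(1, 4):  # first three pages as a demonstration
    r = requests.get(base.format(page))
    for item in r.json()['results']:
        print(f"https://www.nga.gov{item['url']}")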
This seems to work for me:
from bs4 import BeautifulSoup
import requests
url = 'https://www.nga.gov/collection-search-result.html?artist=C%C3%A9zanne%2C%20Paul'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
for a in soup.findAll('a'):
    print(a['href'])
It returns all of the HTML a href links.
The links from the search results specifically are loaded via AJAX, so you would need to implement something that renders the JavaScript, like headless Chrome. You can read about one of the ways to implement this here, which fits your use case very closely: http://theautomatic.net/2019/01/19/scraping-data-from-javascript-webpage-python/
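As a rough illustration of that approach, a Selenium-based sketch (assuming Selenium 4 and an available Chrome driver; an explicit wait for the results container may be needed before parsing) could look like:

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()  # assumes a Chrome driver is available on PATH
driver.get('https://www.nga.gov/collection-search-result.html?artist=C%C3%A9zanne%2C%20Paul')
# parse the DOM after the JavaScript has populated the results
soup = BeautifulSoup(driver.page_source, 'html.parser')
for a in soup.findAll('a'):
    print(a.get('href'))
driver.quit()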
If you want to ask how to render JavaScript from Python and then parse the result, you would need to close this question and open a new one, as it is not scoped correctly as is.

Get XHR info from URL

I have this website https://www.futbin.com/22/player/7504 and I want to know if there is a way to get the XHR URL for the information using Python. For example, for the URL above I know the XHR I want is https://www.futbin.com/22/playerPrices?player=231443 (I got it from inspect element -> network).
My objective is to get the price value from https://www.futbin.com/22/player/1 to https://www.futbin.com/22/player/10000 at once, without using inspect element one by one.
import requests
URL = 'https://www.futbin.com/22/playerPrices?player=231443'
page = requests.get(URL)
x = page.json()
data = x['231443']['prices']
print(data['pc']['LCPrice'])
print(data['ps']['LCPrice'])
print(data['xbox']['LCPrice'])
You can find the player-resource id on the player page and build the URL yourself. I use BeautifulSoup; it's made for parsing websites, but you can take the requests content and throw that into another HTML parser as well if you don't want to install BeautifulSoup.
With it, read the first URL, get the id, and use your code to pull the prices. To test, change the 10000 to 2 or 3 and you'll see it works.
import re, requests
from bs4 import BeautifulSoup

for i in range(1, 10000):
    url = 'https://www.futbin.com/22/player/{}'.format(str(i))
    html = requests.get(url).content
    soup = BeautifulSoup(html, "html.parser")
    player_resource = soup.find(id=re.compile('page-info')).get('data-player-resource')
    # print(player_resource)
    URL = 'https://www.futbin.com/22/playerPrices?player={}'.format(player_resource)
    page = requests.get(URL)
    x = page.json()
    # print(x)
    data = x[player_resource]['prices']
    print(data['pc']['LCPrice'])
    print(data['ps']['LCPrice'])
    print(data['xbox']['LCPrice'])
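
One caveat with 10,000 requests in a tight loop: some pages may 404 or lack the element, and the site may throttle you. A defensive variant of the loop (an optional refinement, not part of the original answer):

import re, time, requests
from bs4 import BeautifulSoup

for i in range(1, 10000):
    try:
        html = requests.get('https://www.futbin.com/22/player/{}'.format(i), timeout=10).content
        soup = BeautifulSoup(html, "html.parser")
        tag = soup.find(id=re.compile('page-info'))
        if tag is None:
            continue  # page exists but has no player info
        player_resource = tag.get('data-player-resource')
        x = requests.get('https://www.futbin.com/22/playerPrices?player={}'.format(player_resource), timeout=10).json()
        data = x[player_resource]['prices']
        print(data['pc']['LCPrice'], data['ps']['LCPrice'], data['xbox']['LCPrice'])
    except (requests.RequestException, ValueError, KeyError):
        continue  # skip pages that fail to load or parse
    time.sleep(1)  # stay polite; tune to taste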

Beautiful soup with find all only gives the last result

I'm trying to retrieve all the products from a page using Beautiful Soup. The page has pagination, and to handle it I have made a loop so the retrieval works for all pages.
But when I move to the next step and try to find_all() the tags, it only gives the data from the last page.
If I try with one isolated page it works fine, so I guess it is a problem with getting all the HTML from all pages.
My code is the following:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import urllib3 as ur

http = ur.PoolManager()
base_url = 'https://www.kiwoko.com/tienda-de-perros-online.html'

for x in range(1, int(33) + 1):
    dog_products_http = http.request('GET', base_url + '?p=' + str(x))
    soup = BeautifulSoup(dog_products_http.data, 'html.parser')
    print(soup.prettify())
and once it has finished:
soup.find_all('li', {'class': 'item product product-item col-xs-12 col-sm-6 col-md-4'})
As I said, if I do not use the for range and only retrieve one page (for example https://www.kiwoko.com/tienda-de-perros-online.html?p=10), it works fine and gives me the 36 products.
I have copied the "soup" into a Word file and searched for the class to see if there is a problem, but all the 1,153 products I'm looking for are there.
So I think the soup is right, but as I look for "more than one HTML", I do not think find_all is working well.
What could be the problem?
You do want your find inside the loop, but here is a way to mimic the AJAX call the page makes, which allows you to return more items per request and also to calculate the number of pages dynamically and make requests for all products.
I re-use the connection with Session for efficiency.
from bs4 import BeautifulSoup as bs
import requests, math

results = []

with requests.Session() as s:
    r = s.get('https://www.kiwoko.com/tienda-de-perros-online.html?p=1&product_list_limit=54&isAjax=1&_=1560702601779').json()
    soup = bs(r['categoryProducts'], 'lxml')
    results.append(soup.select('.product-item-details'))
    product_count = int(soup.select_one('.toolbar-number').text)
    pages = math.ceil(product_count / 54)

    if pages > 1:
        for page in range(2, pages + 1):
            r = s.get('https://www.kiwoko.com/tienda-de-perros-online.html?p={}&product_list_limit=54&isAjax=1&_=1560702601779'.format(page)).json()
            soup = bs(r['categoryProducts'], 'lxml')
            results.append(soup.select('.product-item-details'))

results = [result for item in results for result in item]
print(len(results))
# parse out from results what you want, as this is a list of tags, or do in loop above
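
As one example of that last step, each entry in results is a Tag, so you can pull fields out with further selects (a sketch; it assumes each .product-item-details block contains an anchor with the product name and link, which should be checked against the real markup):

for tag in results:
    link = tag.select_one('a')
    if link is not None:
        print(link.get_text(strip=True), link.get('href'))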

How to use beautiful soup and requests to extract article in website which is split in different pages

How do I use Beautiful Soup and requests to extract each article, getting the full article from a website where it is split across different pages?
For example, this website:
http://www.pagebypagebooks.com/F_Scott_Fitzgerald/The_Lees_Of_Happiness/Authors_Note_p1.html
Thank you!
Here is a snippet to help you:

import os
import requests

TOC = ["Chapter_I_p%d.html", "Chapter_III_p%d.html", ...]

# get all pages for Chapter I
for i in range(1, 10):  # 10 because I noticed that's how many pages there are
    url = "http://www.pagebypagebooks.com/F_Scott_Fitzgerald/The_Lees_Of_Happiness/Chapter_I_p%d.html" % i
    r = requests.get(url)
    if r.status_code not in (200, 201):
        continue
    content = r.content
    with open(os.path.split(url)[-1], "wb") as fin:
        # write the whole content.
        # bs4 can be used here to select 'p' tags...
        fin.write(content)
Note: for each chapter you can loop over the pages as described.
My approach to scraping pages is to identify the pattern between the URLs (here the page number), or to get the next page with a bs4 selector and fetch it.
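
Putting the two ideas together for the question's goal (the full article text rather than raw HTML), a sketch along those lines, assuming the article body sits in 'p' tags, which should be checked against the actual markup:

import requests
from bs4 import BeautifulSoup

chapter = []
for i in range(1, 10):
    url = "http://www.pagebypagebooks.com/F_Scott_Fitzgerald/The_Lees_Of_Happiness/Chapter_I_p%d.html" % i
    r = requests.get(url)
    if r.status_code != 200:
        break  # ran past the last page of the chapter
    soup = BeautifulSoup(r.content, "html.parser")
    # keep only the paragraph text from each page
    chapter.extend(p.get_text(strip=True) for p in soup.find_all('p'))
print("\n\n".join(chapter))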
