Beautiful Soup with find_all only gives the last result - Python

I'm trying to retrieve all the products from a page using Beautiful Soup. The page has pagination, so to handle it I made a loop that runs the retrieval over all the pages.
But when I move to the next step and try to find_all() the tags, it only gives the data from the last page.
If I try it on one isolated page it works fine, so I guess the problem is with getting all the HTML from all the pages.
My code is the following:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import urllib3 as ur
http = ur.PoolManager()
base_url = 'https://www.kiwoko.com/tienda-de-perros-online.html'
for x in range (1,int(33)+1):
    dog_products_http = http.request('GET', base_url+'?p='+str(x))
    soup = BeautifulSoup(dog_products_http.data, 'html.parser')
    print (soup.prettify)
and once it has finished:
soup.find_all('li', {'class': 'item product product-item col-xs-12 col-sm-6 col-md-4'})
As I said, if I do not use the for range and only retrieve one page (for example https://www.kiwoko.com/tienda-de-perros-online.html?p=10), it works fine and gives me the 36 products.
I have copied the "soup" into a Word file and searched for the class to see if there is a problem, but all 1,153 products I'm looking for are there.
So I think the soup is right, but as I'm looking through "more than one HTML", I do not think find_all is working correctly.
What could be the problem?

You do want your find_all inside the loop, but here is a way to copy the AJAX call the page makes, which lets you return more items per request, calculate the number of pages dynamically, and make requests for all products.
I re-use the connection with Session for efficiency.
from bs4 import BeautifulSoup as bs
import requests, math
results = []
with requests.Session() as s:
    r = s.get('https://www.kiwoko.com/tienda-de-perros-online.html?p=1&product_list_limit=54&isAjax=1&_=1560702601779').json()
    soup = bs(r['categoryProducts'], 'lxml')
    results.append(soup.select('.product-item-details'))
    product_count = int(soup.select_one('.toolbar-number').text)
    pages = math.ceil(product_count / 54)

    if pages > 1:
        for page in range(2, pages + 1):
            r = s.get('https://www.kiwoko.com/tienda-de-perros-online.html?p={}&product_list_limit=54&isAjax=1&_=1560702601779'.format(page)).json()
            soup = bs(r['categoryProducts'], 'lxml')
            results.append(soup.select('.product-item-details'))
results = [result for item in results for result in item]
print(len(results))
# parse out from results what you want, as this is a list of tags, or do in loop above
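For completeness, a minimal sketch of the simpler fix hinted at in the first sentence: keep the original urllib3 loop, but run find_all() inside it and accumulate the results. The class string and the hard-coded 33 pages are taken from the question, not verified here.

# Minimal sketch of the simpler fix: run find_all() inside the loop and
# accumulate, so every page contributes instead of only the last one.
import urllib3 as ur
from bs4 import BeautifulSoup

http = ur.PoolManager()
base_url = 'https://www.kiwoko.com/tienda-de-perros-online.html'

products = []
for x in range(1, 33 + 1):  # 33 pages, as hard-coded in the question
    resp = http.request('GET', base_url + '?p=' + str(x))
    soup = BeautifulSoup(resp.data, 'html.parser')
    products.extend(soup.find_all(
        'li', {'class': 'item product product-item col-xs-12 col-sm-6 col-md-4'}))

print(len(products))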

Related

How to scrape the same information from the next pages?

from bs4 import BeautifulSoup
import requests
url13cases = 'https://hitechfix.com/product-category/cases/apple-cases/iphone-cases/iphone-13-6-1-cases/'
r = requests.get(url13cases)
soup = BeautifulSoup(r.text, 'html.parser')
img = soup.findAll('img',{"class":"attachment-woocommerce_thumbnail size-woocommerce_thumbnail"})
So I am trying to scrape all the pictures from my friend's website, but the problem is that there are a few pages. I just want to know how to edit the URL so it also goes to the second, third, and fourth page. Then I also want to create an array of objects for each link.
The link for page 2 looks like this: https://hitechfix.com/product-category/cases/apple-cases/iphone-cases/iphone-13-6-1-cases/page/2/
It's the same as the previous link, just with an extra /page/2/ at the end. There are also two more pages, for four pages total. How do I get all of them and create objects?
You could use the built-in function range() to iterate over the pages.
In newer code, avoid the old syntax findAll(); instead use find_all() or select() with CSS selectors. For more, take a minute to check the docs.
Example
from bs4 import BeautifulSoup
import requests
img_list = []
for i in range(1,5):
    r = requests.get(f'https://hitechfix.com/product-category/cases/apple-cases/iphone-cases/iphone-13-6-1-cases/page/{i}')
    soup = BeautifulSoup(r.text)
    img_list.extend(soup.find_all('img',{"class":"attachment-woocommerce_thumbnail size-woocommerce_thumbnail"}))
img_list
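As a variant of the same loop using select(), which the answer recommends, here is a sketch with a CSS selector that requires both thumbnail classes on the img tag. It is not verified against the live site.

# Same loop, but with select() and a CSS selector requiring both classes.
from bs4 import BeautifulSoup
import requests

img_list = []
for i in range(1, 5):
    r = requests.get(f'https://hitechfix.com/product-category/cases/apple-cases/iphone-cases/iphone-13-6-1-cases/page/{i}')
    soup = BeautifulSoup(r.text, 'html.parser')
    img_list.extend(soup.select('img.attachment-woocommerce_thumbnail.size-woocommerce_thumbnail'))

print(len(img_list))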

Fetch all pages using Python requests and Beautiful Soup

I tried to fetch all the product names from the web page, but I could only get 12.
If I scroll down the web page, it refreshes and adds more information.
How can I get all the information?
import requests
from bs4 import BeautifulSoup
import re
url = "https://www.outre.com/product-category/wigs/"
res = requests.get(url)
res.raise_for_status()
soup = BeautifulSoup(res.text, "lxml")
items = soup.find_all("div", attrs={"class":"title-wrapper"})
for item in items:
    print(item.p.a.get_text())
Your code is good. The thing is, on the website the products are dynamically loaded, so with your request you can only get the first 12 products.
You can check the developer console inside your browser to track the AJAX calls made during browsing.
I did it, and it turns out a call is made to the following URL to retrieve more products:
https://www.outre.com/product-category/wigs/page/2/
So if you want to get all the products, you need to browse multiple pages. I suggest you use a loop and run your code several times, as in the sketch after the note below.
N.B.: You can check the website to see if there is a more convenient place to get the products (i.e., not from the main page).
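A minimal sketch of that suggestion, reusing the selector from the question and stopping when a page returns no products. It assumes the /page/N/ URL pattern observed above holds for every page.

# Loop over the /page/N/ URLs with the original selector until a page is empty.
import requests
from bs4 import BeautifulSoup

page = 1
while True:
    res = requests.get(f"https://www.outre.com/product-category/wigs/page/{page}/")
    if res.status_code != 200:
        break
    soup = BeautifulSoup(res.text, "lxml")
    items = soup.find_all("div", attrs={"class": "title-wrapper"})
    if not items:
        break
    for item in items:
        print(item.p.a.get_text())
    page += 1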
The page loads the products from a different URL via JavaScript, so Beautiful Soup doesn't see them. To get all pages, you can use the following example:
import requests
from bs4 import BeautifulSoup
url = "https://www.outre.com/product-category/wigs/page/{}/"
page = 1
while True:
    soup = BeautifulSoup(requests.get(url.format(page)).content, "html.parser")
    titles = soup.select(".product-title")
    if not titles:
        break
    for title in titles:
        print(title.text)
    page += 1
Prints:
...
Wet & Wavy Loose Curl 18″
Wet & Wavy Boho Curl 20″
Nikaya
Jeanette
Natural Glam Body
Natural Free Deep

Extracting pagination number using BeautifulSoup

I'm trying to extract the pagination number of a webpage and have tried several methods, all to no avail.
What's the right method? Please also provide an explanation as to why the following methods do not extract the information as requested:
First method:
for i in range(0, 48, 24):
    url = f'https://www.rightmove.co.uk/property-for-sale/find.html?locationIdentifier=STATION%5E1712&maxPrice=500000&radius=0.5&sortType=10&propertyTypes=&mustHave=&dontShow=&index={i}&furnishTypes=&keywords='
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'lxml')
    page = soup.select('span[class="pagination-pageInfo"]')
    print(page)
returns:
[]
[]
I've also tried:
1. page = soup.find('span', {'data-bind':'text: total'})
2. page = soup.select("[class~=pagination-pageInfo]")
which return nothing, and:
page = soup.select('span', {'data-bind':'text: total'})
which returns a bunch of unnecessary things and not the pagination number.
How do I get the pagination number at the bottom?
expected output:
1
2
There is no pagination element in the DOM tree you get, because this data is loaded by JavaScript. You have two options:
You can use Selenium and do what you are doing now (search for the element with the span[class="pagination-pageInfo"] selector); a sketch of this is shown after the full code below.
You can still use requests for your purpose, because you can find all the page data, including pagination, in the JSON at the bottom of the page HTML. You can easily get it with regular expressions. Full code:
import json
import requests
import re
for i in range(0, 48, 24):
    url = f'https://www.rightmove.co.uk/property-for-sale/find.html?locationIdentifier=STATION%5E1712&maxPrice=500000&radius=0.5&sortType=10&propertyTypes=&mustHave=&dontShow=&index={i}&furnishTypes=&keywords='
    r = requests.get(url)
    html = r.text
    full_data_json = json.loads(re.search(r'window\.jsonModel = (.*)</script>', html).group(1))
    print(full_data_json["pagination"]["page"])
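For the first option, a hedged Selenium sketch: it assumes Chrome is installed locally, and the selector is the one from the question, not verified here. An explicit wait is used because the element only exists after the JavaScript has run.

# Selenium renders the JavaScript, so the pagination element exists in the DOM.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = ('https://www.rightmove.co.uk/property-for-sale/find.html'
       '?locationIdentifier=STATION%5E1712&maxPrice=500000&radius=0.5'
       '&sortType=10&propertyTypes=&mustHave=&dontShow=&index=0'
       '&furnishTypes=&keywords=')

driver = webdriver.Chrome()
try:
    driver.get(url)
    # Wait until the pagination element has been rendered.
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, 'span.pagination-pageInfo')))
    for span in driver.find_elements(By.CSS_SELECTOR, 'span.pagination-pageInfo'):
        print(span.text)
finally:
    driver.quit()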

Scrape <div><span> from HTML page

I am trying to create a simple weather forecast with Python in Eclipse. So far I have written this:
from bs4 import BeautifulSoup
import requests
def weather_forecast():
    url = 'https://www.yr.no/nb/v%C3%A6rvarsel/daglig-tabell/1-92416/Norge/Vestland/Bergen/Bergen'
    r = requests.get(url) # Get request for contents of the page
    print(r.content) # Outputs HTML code for the page
    soup = BeautifulSoup(r.content, 'html5lib') # Parse the data with BeautifulSoup(HTML-string, html-parser)
    min_max = soup.select('min-max.temperature') # Select all spans with a "min-max-temperature" attribute
    print(min_max.prettify())
    table = soup.find('div', attrs={'daily-weather-list-item__temperature'})
    print(table.prettify())
From an HTML page with elements that look like this:
I have found the path to the first temperature in the HTML page's elements, but when I try to execute my code and print to see if I have done it correctly, nothing is printed. My goal is to print a table with dates and corresponding temperatures, which seems like an easy task, but I do not know how to properly name the attribute or how to scrape them all from the HTML page in one iteration.
The <span> has two temperatures stored, one min and one max; here it just happens that they're the same.
I want to go into each <div class="daily-weather-list-item__temperature">, collect the two temperatures, and add them to a dictionary. How do I do this?
I have looked at this question on Stack Overflow but I couldn't figure it out:
Python BeautifulSoup - Scraping Div Spans and p tags - also how to get exact match on div name
You could use a dictionary comprehension. Loop over all the forecasts, which have the class daily-weather-list-item, then extract the date from the datetime attribute of the time tags and use those dates as keys; associate each key with the min/max info.
import requests
from bs4 import BeautifulSoup
def weather_forecast():
    url = 'https://www.yr.no/nb/v%C3%A6rvarsel/daglig-tabell/1-92416/Norge/Vestland/Bergen/Bergen'
    r = requests.get(url) # Get request for contents of the page
    soup = BeautifulSoup(r.content, 'html5lib')
    temps = {i.select_one('time')['datetime']: i.select_one('.min-max-temperature').get_text(strip=True)
             for i in soup.select('.daily-weather-list-item')}
    return temps
weather_forecast()
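If you need the two temperatures as separate values rather than one combined string, here is a hedged follow-up sketch. It assumes the combined text contains the two values separated by a '/' (for example '5°/1°'), which you would need to confirm against the actual markup.

# Hypothetical post-processing: split each combined string into max and min.
# The '/' separator and the degree sign are assumptions about the page text.
temps = weather_forecast()
split_temps = {}
for date, text in temps.items():
    parts = [p.strip().rstrip('°') for p in text.split('/')]
    if len(parts) == 2:
        split_temps[date] = {'max': parts[0], 'min': parts[1]}
print(split_temps)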

Awkward problem with iterating over the list and extracting only the last link from the page [BS4]

I am trying to scrape a website where there are 12 pages with X links on them - I just want to extract all the links and store them for later use.
But there is an awkward problem with extracting links from the pages. To be precise, my output contains only the last link from each of the pages.
I know that this description may sound confusing, so let me show you the code and images:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import csv
import time
#here I tried to make a loop for generating page's URLs, and store URLs in the list "issues"
archive = '[redacted URL]'
issues =[]
#i am going for issues 163-175
for i in range(163,175):
    url_of_issue = archive + '/' + str(i)
    issues.append(url_of_issue)
#now, I want to extract links from generated pages
#idea is simple - loop iterates over the list of URLs/pages and from each issue page get URLS of the listed papers, storing them in the list "paper_urls"
paper_urls =[]
for url in issues:
    response = requests.get(url)
    html = response.text
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.select('.obj_article_summary .title a'):
        ahrefTags=(a['href'])
    paper_urls.append(ahrefTags)
    print(paper_urls)
    time.sleep(5)
But the problem is, my output looks like [redacted].
Instead of ~80 links, I'm getting this! I wondered what happened, and it looks like my script gets only the last listed link from every generated URL (from the list named "issues" in the code)?! How do I fix it? I have no idea what the problem could be here.
Were you perhaps missing an indentation when appending to paper_urls?
paper_urls =[]
for url in issues:
    response = requests.get(url)
    html = response.text
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.select('.obj_article_summary .title a'):
        ahrefTags=(a['href'])
        paper_urls.append(ahrefTags) # added missing indentation
    print(paper_urls)
    time.sleep(5)
The whole code, after moving the print outside the loop, would look like this:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import csv
import time
#here I tried to make a loop for generating page's URLs, and store URLs in the list "issues"
archive = '[redacted URL]'
issues =[]
#i am going for issues 163-175
for i in range(163,175):
    url_of_issue = archive + '/' + str(i)
    issues.append(url_of_issue)
#now, I want to extract links from generated pages
#idea is simple - loop iterates over the list of URLs/pages and from each issue page get URLS of the listed papers, storing them in the list "paper_urls"
paper_urls =[]
for url in issues:
    response = requests.get(url)
    html = response.text
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.select('.obj_article_summary .title a'):
        ahrefTags=(a['href'])
        paper_urls.append(ahrefTags)
        #print(ahrefTags) #uncomment if you wish to print each and every link by itself
    #time.sleep(5) #uncomment if you wish to add a delay between each request
print(paper_urls)
