Scrape data from a website whose URL doesn't change - Python

I'm new to web scraping, but I have enough command of requests, BeautifulSoup and Selenium to extract data from a website. The problem is that I'm trying to scrape data from a website whose URL doesn't change when I click the page number for the next page.
(Screenshot: the page-number element in the inspector)
Website URL: https://www.ellsworth.com/products/adhesives/
I also tried the Chrome Developer Tools but couldn't work out the approach. If someone could guide me with code, I would be grateful.
(Screenshot: Developer Tools showing the GET request)
Here is my code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import pandas as pd
import requests

itemproducts = pd.DataFrame()
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get('https://www.ellsworth.com/products/adhesives/')
base_url = 'https://www.ellsworth.com'

html = driver.page_source
s = BeautifulSoup(html, 'html.parser')

data = []
href_link = s.find_all('div', {'class': 'results-products-item-image'})
for links in href_link:
    href_link_a = links.find('a')['href']
    data.append(base_url + href_link_a)

# url = 'https://www.ellsworth.com/products/adhesives/silicone/dow-838-silicone-adhesive-sealant-white-90-ml-tube/'
for c in data:
    driver.get(c)
    html_pro = driver.page_source
    soup = BeautifulSoup(html_pro, 'html.parser')
    title = soup.find('span', {'itemprop': 'name'}).text.strip()
    part_num = soup.find('span', {'itemprop': 'sku'}).text.strip()
    manfacture = soup.find('span', {'class': 'manuSku'}).text.strip()
    manfacture_ = manfacture.replace('Manufacturer SKU:', '').strip()
    pro_det = soup.find('div', {'class': 'product-details'})
    p = pro_det.find_all('p')
    try:
        d = p[1].text.strip()
        c = p.text.strip()
    except:
        pass
    table = pro_det.find('table', {'class': 'table'})
    tr = table.find_all('td')
    typical = tr[1].text.strip()
    brand = tr[3].text.strip()
    color = tr[5].text.strip()
    image = soup.find('img', {'itemprop': 'image'})['src']
    image_ = base_url + image
    png_url = title + '.jpg'
    img_data = requests.get(image_).content
    with open(png_url, 'wb') as fh:
        fh.write(img_data)
    itemproducts = itemproducts.append({'Product Title': title,
                                        'Part Number': part_num,
                                        'SKU': manfacture_,
                                        'Description d': d,
                                        'Description c': c,
                                        'Typical': typical,
                                        'Brand': brand,
                                        'Color': color,
                                        'Image URL': image_}, ignore_index=True)

The content of the page is rendered dynamically, but if you inspect the XHR tab under Network in the Developer Tools you can find the API request URL. I've shortened the URL a bit, but it still works just fine.
Here's how you can get the list of the first 10 products from page 1:
import requests
start = 0
n_items = 10
api_request_url = f"https://www.ellsworth.com/api/catalogSearch/search?sEcho=1&iDisplayStart={start}&iDisplayLength={n_items}&DefaultCatalogNode=Adhesives&_=1497895052601"
data = requests.get(api_request_url).json()
print(f"Found: {data['iTotalRecords']} items.")
for item in data["aaData"]:
    print(item)
This gets you a nice JSON response with all the data for each product and that should get you started.
['Sauereisen Insa-Lute Adhesive Cement No. P-1 Powder Off-White 1 qt Can', 'P-1-INSA-LUTE-ADHESIVE', 'P-1 INSA-LUTE ADHESIVE', '$72.82', '/products/adhesives/ceramic/sauereisen-insa-lute-adhesive-cement-no.-p-1-powder-off-white-1-qt-can/', '/globalassets/catalogs/sauereisen-insa-lute-cement-no-p-1-off-white-1qt_170x170.jpg', 'Adhesives-Ceramic', '[{"qty":"1-2","price":"$72.82","customerPrice":"$72.82","eachPrice":"","custEachPrice":"","priceAmount":"72.820000000","customerPriceAmount":"72.820000000","currency":"USD"},{"qty":"3-15","price":"$67.62","customerPrice":"$67.62","eachPrice":"","custEachPrice":"","priceAmount":"67.620000000","customerPriceAmount":"67.620000000","currency":"USD"},{"qty":"16+","price":"$63.36","customerPrice":"$63.36","eachPrice":"","custEachPrice":"","priceAmount":"63.360000000","customerPriceAmount":"63.360000000","currency":"USD"}]', '', '', '', 'P1-Q', '1000', 'true', 'Presentation of packaged goods may vary. For special packaging requirements, please call (877) 454-9224', '', '', '']
If you want to get the next 10 items, you have to change the iDisplayStart value to 10. And if you want more items per request, just change iDisplayLength to, say, 20.
In the demo I substituted these values with start and n_items, but you can easily automate that, because the total number of items comes back with the response as iTotalRecords.
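For instance, here is a minimal sketch of paging through the whole catalogue by reusing iTotalRecords (it assumes the endpoint and parameter names above keep behaving the same way):

import requests

page_size = 50
base = ("https://www.ellsworth.com/api/catalogSearch/search"
        "?sEcho=1&iDisplayStart={start}&iDisplayLength={size}&DefaultCatalogNode=Adhesives")

all_items = []
start = 0
total = None
while total is None or start < total:
    resp = requests.get(base.format(start=start, size=page_size)).json()
    total = resp["iTotalRecords"]      # total item count reported by the API
    all_items.extend(resp["aaData"])   # product rows for this page
    start += page_size

print(f"Collected {len(all_items)} of {total} items.")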

Related

Scraping: can't get stable results

I'm doing a scraping exercise on a job searching webpage. I want to get the link, name of the company, job title, salary, location and posting date. I've run the same code multiple times, and sometimes it gives the expected results in the salary item (salary if the info is displayed, "N/A" otherwise) and sometimes it gives me something different: salary if the info is displayed, "N/A", and some random character values in columns whose values should be "N/A". I have no problems with the other elements. Here is my code:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import time
import pandas as pd
import requests

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get('https://ca.indeed.com/')

# Inputs a job title and location into the input boxes
input_box = driver.find_element(By.XPATH, '//*[@id="text-input-what"]')
input_box.send_keys('data analyst')
location = driver.find_element(By.XPATH, '//*[@id="text-input-where"]')
location.send_keys('toronto')

# Clicks on the search button
button = driver.find_element(By.XPATH, '//*[@id="jobsearch"]/button').click()

# Creates a dataframe
df = pd.DataFrame({'Link': [''], 'Job Title': [''], 'Company': [''], 'Location': [''], 'Salary': [''], 'Date': ['']})

# This loop goes through every page and grabs all the details of each posting
# Loop will only end when there are no more pages to go through
while True:
    # Imports the HTML of the current page into python
    soup = BeautifulSoup(driver.page_source, 'lxml')
    # Grabs the HTML of each posting
    postings = soup.find_all('div', class_='slider_container css-g7s71f eu4oa1w0')
    len(postings)
    # grabs all the details for each posting and adds it as a row to the dataframe
    for post in postings:
        link = post.find('a').get('href')
        link_full = 'https://ca.indeed.com' + link
        name = post.find('h2', tabindex='-1').text.strip()
        company = post.find('span', class_='companyName').text.strip()
        try:
            location = post.find('div', class_='companyLocation').text.strip()
        except:
            location = 'N/A'
        try:
            salary = post.find('div', attrs={'class': 'heading6 tapItem-gutter metadataContainer noJEMChips salaryOnly'}).text.strip()
        except:
            salary = 'N/A'
        date = post.find('span', class_='date').text.strip()
        df = df.append({'Link': link_full, 'Job Title': name, 'Company': company, 'Location': location, 'Salary': salary, 'Date': date},
                       ignore_index=True)
    # checks if there is a button to go to the next page, and if not will stop the loop
    try:
        button = soup.find('a', attrs={'aria-label': 'Next'}).get('href')
        driver.get('https://ca.indeed.com' + button)
    except:
        break
Can I fix my code to get the expected results every time I run it? Also, an additional issue: I'm scraping around 60 pages, but the program usually stops somewhere between 20 and 30 pages before the last one. Is there a way to fix the code so that it scrapes through to the last page every time?
Here is a simplified example with the requests library:
import requests
from bs4 import BeautifulSoup
cookies = {}
headers = {}
params = {
    'q': 'data analyst',
    'l': 'toronto',
    'from': 'searchOnHP',
}
response = requests.get('https://ca.indeed.com/jobs', params=params, cookies=cookies, headers=headers)
soup = BeautifulSoup(response.text)
postings = soup.find_all('div', class_ = 'slider_container css-g7s71f eu4oa1w0')
len(postings)
prints
15
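From there you can pull the fields for each posting by reusing the selectors from the question. This is only a sketch; Indeed's class names change frequently, so anything missing is treated as "N/A":

rows = []
for post in postings:
    link = post.find('a')
    company = post.find('span', class_='companyName')
    salary = post.find('div', class_='heading6 tapItem-gutter metadataContainer noJEMChips salaryOnly')
    rows.append({
        'Link': ('https://ca.indeed.com' + link['href']) if link and link.get('href') else 'N/A',
        'Job Title': post.find('h2').get_text(strip=True) if post.find('h2') else 'N/A',
        'Company': company.get_text(strip=True) if company else 'N/A',
        'Salary': salary.get_text(strip=True) if salary else 'N/A',
    })
print(len(rows))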

Web Scraping in Python

I was trying to scrape a website for some university project. The website is https://www.bonprix.it/prodotto/leggings-a-pinocchietto-pacco-da-2-leggings-a-pinocchietto-pacco-da-2-bianco-nero-956015/?itemOptionId=12211813.
I have a problem with my Python code. What I want to obtain is all the reviews for pages 1 to 5, but instead I get empty lists ([]). Any help would be appreciated!
Here is the code:
import csv
from bs4 import BeautifulSoup
import urllib.request
import re
import pandas as pd
import requests

reviewlist = []

class AppURLopener(urllib.request.FancyURLopener):
    version = "Mozilla/5.0"

opener = AppURLopener()
response = opener.open('https://www.bonprix.it/prodotto/leggings-a-pinocchietto-pacco-da-2-leggings-a-pinocchietto-pacco-da-2-bianco-nero-956015/?itemOptionId=12211813')
soup = BeautifulSoup(response, 'html.parser')

reviews = soup.find_all('div', {'class': 'reviewContent'})
for i in reviews:
    review = {
        'per_review_name': i.find('span', {'itemprop': 'name'}).text.strip(),
        'per_review': i.find('p', {'class': 'reviewText'}).text.strip(),
        'per_review_taglia': i.find('p', {'class': 'singleReviewSizeDescr'}).text.strip(),
    }
    reviewlist.append(review)

for page in range(1, 5):
    prova = soup.find_all('div', {'data-page': '{page}'})
    print(prova)
    print(len(reviewlist))

df = pd.DataFrame(reviewlist)
df.to_csv('list.csv', index=False)
print('Fine.')
print('Fine.')
And here the output that I get:
[]
5
[]
5
[]
5
[]
5
Fine.
As I understand it, the site uses JavaScript to load most of its content, therefore you can't scrape that data directly as it isn't loaded initially, but you can use the review backend for your product page; the link is:
https://www.bonprix.it/reviews/list/?styleId=31436999&sortby=date&page=1&rating=0&variant=0&size=0&bodyHeight=0&showOldReviews=true&xxl=false&variantFilters=
You can go through the pages by changing the page parameter in the URL/GET request. The link returns an HTML document of the review page, and you can get the rating from the rating-value meta tag.
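As a rough sketch of that idea (the itemprop="ratingValue" attribute is an assumption about the markup; inspect the actual response and adjust the selector):

import requests
from bs4 import BeautifulSoup

base = ("https://www.bonprix.it/reviews/list/?styleId=31436999&sortby=date"
        "&page={page}&rating=0&variant=0&size=0&bodyHeight=0"
        "&showOldReviews=true&xxl=false&variantFilters=")

for page in range(1, 6):
    soup = BeautifulSoup(requests.get(base.format(page=page)).text, "html.parser")
    # assumption: each review exposes its score as <meta itemprop="ratingValue" content="...">
    ratings = [m.get("content") for m in soup.find_all("meta", {"itemprop": "ratingValue"})]
    print(page, ratings)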
The website only loads the first page of reviews with the first request. If you inspect its network requests, you can see that it requests additional data when you change the review page. You can rewrite your code as follows to get the reviews from all pages:
reviews_dom = []
for page in range(1, 6):
    url = f"https://www.bonprix.it/reviews/list/?styleId=31436999&sortby=date&page={page}&rating=0&variant=0&size=0&bodyHeight=0&showOldReviews=true&xxl=false&variantFilters="
    r = requests.request("GET", url)
    soup = BeautifulSoup(r.text, "html.parser")
    reviews_dom += soup.find_all("div", attrs={"class": "reviewContent"})

reviews = []
for review_item in reviews_dom:
    review = {
        'per_review_name': review_item.find('span', attrs={'itemprop': 'name'}).text.strip(),
        'per_review': review_item.find('p', attrs={'class': 'reviewText'}).text.strip(),
        'per_review_taglia': review_item.find('p', attrs={'class': 'singleReviewSizeDescr'}).text.strip(),
    }
    reviews.append(review)

print(len(reviews))
print(reviews)
What happens in the code?
In the first loop, we request the data for each page of reviews (the first 5 pages in the example above).
In the second loop, we parse the reviews' DOM and extract the data we need.
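If you still want the CSV file from your original code, the same pandas step works on this list (same filename as in the question):

import pandas as pd

df = pd.DataFrame(reviews)
df.to_csv('list.csv', index=False)
print('Fine.')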

Data Mining IMDB Reviews - Only extracting the first 25 reviews

I am currently trying to extract all the reviews of the Spiderman Homecoming movie, but I am only able to get the first 25. I was able to click "Load More" on IMDb so that all the reviews are shown (originally it only displays the first 25), but for some reason I am unable to mine all the reviews after everything has been loaded. Does anyone know what I am doing wrong?
Below is the code I am running:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from textblob import TextBlob
import time
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

# Set the web browser
driver = webdriver.Chrome(executable_path=r"C:\Users\Kent_\Desktop\WorkStudy\chromedriver.exe")

# Go to the IMDb reviews page
driver.get("https://www.imdb.com/title/tt6320628/reviews?ref_=tt_urv")

# Loop over the load-more button
wait = WebDriverWait(driver, 10)
while True:
    try:
        driver.find_element_by_css_selector("button#load-more-trigger").click()
        wait.until(EC.invisibility_of_element_located((By.CSS_SELECTOR, ".ipl-load-more__load-indicator")))
        soup = BeautifulSoup(driver.page_source, 'lxml')
    except Exception:
        break

# Scrape the IMDb reviews
ans = driver.current_url
page = requests.get(ans)
soup = BeautifulSoup(page.content, "html.parser")
all = soup.find(id="main")

# Get the title of the movie
all = soup.find(id="main")
parent = all.find(class_="parent")
name = parent.find(itemprop="name")
url = name.find(itemprop='url')
film_title = url.get_text()
print('Pass finding phase.....')

# Get the title of the review
title_rev = all.select(".title")
title = [t.get_text().replace("\n", "") for t in title_rev]
print('getting title of reviews and saving into a list')

# Get the review
review_rev = all.select(".content .text")
review = [r.get_text() for r in review_rev]
print('getting content of reviews and saving into a list')

# Make it into a dataframe
table_review = pd.DataFrame({
    "Title": title,
    "Review": review
})
table_review.to_csv('Spiderman_Reviews.csv')
print(title)
print(review)
Well, actually, there's no need to use Selenium. The data is available by sending a GET request to the website's API in the following format:
https://www.imdb.com/title/tt6320628/reviews/_ajax?ref_=undefined&paginationKey=MY-KEY
where you have to provide a key for the paginationKey in the URL (...&paginationKey=MY-KEY)
The key is found in the class load-more-data:
<div class="load-more-data" data-key="g4wp7crmqizdeyyf72ux5nrurdsmqhjjtzpwzouokkd2gbzgpnt6uc23o4zvtmzlb4d46f2swblzkwbgicjmquogo5tx2">
</div>
So, to scrape all the reviews into a DataFrame, try:
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = (
"https://www.imdb.com/title/tt6320628/reviews/_ajax?ref_=undefined&paginationKey={}"
)
key = ""
data = {"title": [], "review": []}
while True:
response = requests.get(url.format(key))
soup = BeautifulSoup(response.content, "html.parser")
# Find the pagination key
pagination_key = soup.find("div", class_="load-more-data")
if not pagination_key:
break
# Update the `key` variable in-order to scrape more reviews
key = pagination_key["data-key"]
for title, review in zip(
soup.find_all(class_="title"), soup.find_all(class_="text show-more__control")
):
data["title"].append(title.get_text(strip=True))
data["review"].append(review.get_text())
df = pd.DataFrame(data)
print(df)
Output (truncated):
title review
0 Terrific entertainment Spiderman: Far from Home is not intended to be...
1 THe illusion of the identity of Spider man. Great story in continuation of spider man home...
2 What Happened to the Bad Guys I believe that Quinten Beck/Mysterio got what ...
3 Spectacular One of the best if not the best Spider-Man mov...
...
...

Scraping each element from a website with BeautifulSoup

I wrote some code for scraping a real estate website. This is the link:
https://www.nekretnine.rs/stambeni-objekti/stanovi/lista/po-stranici/10/
From this page I can get only the location, size and price of each apartment, but is it possible to write code that will go to the page of each apartment and scrape values from it, since it contains much more info? Check this link:
https://www.nekretnine.rs/stambeni-objekti/stanovi/arena-bulevar-arsenija-carnojevica-97m-2-lode-energoprojekt/NkvJK0Ou5tV/
I have posted my code below. I noticed that the URL changes when I click on a specific listing. For example:
arena-bulevar-arsenija-carnojevica-97m-2-lode-energoprojekt/NkvJK0Ou5tV/
I thought about creating a for loop, but there is no way to know how the URL changes because it has some ID at the end:
NkvJK0Ou5tV
This is the code that I have:
from bs4 import BeautifulSoup
import requests

website = "https://www.nekretnine.rs/stambeni-objekti/stanovi/lista/po-stranici/10/"
soup = requests.get(website).text
my_html = BeautifulSoup(soup, 'lxml')

lokacija = my_html.find_all('p', class_='offer-location text-truncate')
ukupna_kvadratura = my_html.find_all('p', class_='offer-price offer-price--invert')
ukupna_cena = my_html.find_all('div', class_='d-flex justify-content-between w-100')
ukupni_opis = my_html.find_all('div', class_='mt-1 mb-1 mt-lg-0 mb-lg-0 d-md-block offer-meta-info offer-adress')

for lok, kvadratura, cena_stana, sumarno in zip(lokacija, ukupna_kvadratura, ukupna_cena, ukupni_opis):
    lok = lok.text.split(',')[0]              # location
    kv = kvadratura.span.text.split(' ')[0]   # floor area
    jed = kvadratura.span.text.split(' ')[1]  # unit of measure
    cena = cena_stana.span.text               # price
    sumarno = sumarno.text
    datum = sumarno.split('|')[0].strip()
    status = sumarno.split('|')[1].strip()
    opis = sumarno.split('|')[2].strip()
    print(lok, kv, jed, cena, datum, status, opis)
You can get the href from the div with class "placeholder-preview-box ratio-4-3".
From there you can build the full URL for each listing.
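A minimal sketch of that idea, assuming each of those divs wraps an <a> whose href is relative to the site root:

import requests
from bs4 import BeautifulSoup

base = "https://www.nekretnine.rs"
html = requests.get(base + "/stambeni-objekti/stanovi/lista/po-stranici/10/").text
page = BeautifulSoup(html, "lxml")

# assumption: the preview box contains the anchor pointing to the listing's detail page
links = [base + box.a["href"]
         for box in page.find_all("div", class_="placeholder-preview-box ratio-4-3")
         if box.a and box.a.get("href")]
print(len(links), links[:3])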
You can iterate over the links provided by the pagination at the bottom of the page:
from bs4 import BeautifulSoup as soup
import requests

d = soup(requests.get('https://www.nekretnine.rs/stambeni-objekti/stanovi/lista/po-stranici/10/').text, 'html.parser')

def scrape_page(page):
    return [{'title': i.h2.get_text(strip=True),
             'loc': i.p.get_text(strip=True),
             'price': i.find('p', {'class': 'offer-price'}).get_text(strip=True)}
            for i in page.find_all('div', {'class': 'row offer'})]

result = [scrape_page(d)]
while d.find('a', {'class': 'pagination-arrow arrow-right'}):
    d = soup(requests.get(f'https://www.nekretnine.rs{d.find("a", {"class": "pagination-arrow arrow-right"})["href"]}').text, 'html.parser')
    result.append(scrape_page(d))
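Since result is a list of per-page lists, you can flatten it afterwards (and, if you also collect each offer's link as suggested in the other answer, fetch those detail pages one by one):

all_offers = [offer for page_items in result for offer in page_items]
print(len(all_offers))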

Dynamic Web scraping

I am trying to scrape this page ("http://www.arohan.in/branch-locator.php") on which, when I select a state and city, an address is displayed, and I have to write the state, city and address to a CSV/Excel file. I am able to reach this step; now I am stuck.
Here is my code:
from selenium import webdriver
from selenium.webdriver.support.ui import Select, WebDriverWait

chrome_path = r"C:\Users\IBM_ADMIN\Downloads\chromedriver_win32\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)
driver.get("http://www.arohan.in/branch-locator.php")

select = Select(driver.find_element_by_name('state'))
select.select_by_visible_text('Bihar')
drop = Select(driver.find_element_by_name('branch'))
city_option = WebDriverWait(driver, 5).until(lambda x: x.find_element_by_xpath("//select[@id='city1']/option[text()='Gaya']"))
city_option.click()
Is Selenium necessary? It looks like you can use URLs to arrive at what you want: http://www.arohan.in/branch-locator.php?state=Assam&branch=Mirza.
Get a list of the state/branch combinations, then use the BeautifulSoup tutorial to get the info from each page.
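For example, a minimal check of that idea (the state/branch values are taken from the URL above, and the .address_area selector from the next answer):

import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.arohan.in/branch-locator.php",
                 params={"state": "Assam", "branch": "Mirza"},
                 headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(r.text, "html.parser")
print([p.get_text(strip=True) for p in soup.select(".address_area p")])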
In a slightly organized manner:
import requests
from bs4 import BeautifulSoup

link = "http://www.arohan.in/branch-locator.php?"

def get_links(session, url, payload):
    session.headers["User-Agent"] = "Mozilla/5.0"
    res = session.get(url, params=payload)
    soup = BeautifulSoup(res.text, "lxml")
    item = [item.text for item in soup.select(".address_area p")]
    print(item)

if __name__ == '__main__':
    for st, br in zip(['Bihar', 'West Bengal'], ['Gaya', 'Kolkata']):
        payload = {
            'state': st,
            'branch': br
        }
        with requests.Session() as session:
            get_links(session, link, payload)
Output:
['Branch', 'House no -10/12, Ward-18, Holding No-12, Swarajpuri Road, Near Bank of Baroda, Gaya Pin 823001(Bihar)', 'N/A', 'N/A']
['Head Office', 'PTI Building, 4th Floor, DP Block, DP-9, Salt Lake City Calcutta, 700091', '+91 33 40156000', 'contact@arohan.in']
A better approach would be to avoid using Selenium. Selenium is useful if you require the JavaScript processing needed to render the HTML; in your case this is not needed, as the required information is already contained within the HTML.
What is needed is to first make a request to get the page containing all of the states. Then, for each state, request the list of branches. Then, for each state/branch combination, a URL request can be made to get the HTML containing the address. This happens to be contained in the second <li> entry following a <ul class='address_area'> entry:
from bs4 import BeautifulSoup
import requests
import csv
import time

# Get a list of available states
r = requests.get('http://www.arohan.in/branch-locator.php')
soup = BeautifulSoup(r.text, 'html.parser')
state_select = soup.find('select', id='state1')
states = [option.text for option in state_select.find_all('option')[1:]]

# Open an output CSV file
with open('branch addresses.csv', 'w', newline='', encoding='utf-8') as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerow(['State', 'Branch', 'Address'])

    # For each state determine the available branches
    for state in states:
        r_branches = requests.post('http://www.arohan.in/Ajax/ajax_branch.php', data={'ajax_state': state})
        soup = BeautifulSoup(r_branches.text, 'html.parser')

        # For each branch, request a page containing the address
        for option in soup.find_all('option')[1:]:
            time.sleep(0.5)  # Reduce server loading
            branch = option.text
            print("{}, {}".format(state, branch))
            r_branch = requests.get('http://www.arohan.in/branch-locator.php', params={'state': state, 'branch': branch})
            soup_branch = BeautifulSoup(r_branch.text, 'html.parser')
            ul = soup_branch.find('ul', class_='address_area')
            if ul:
                address = ul.find_all('li')[1].get_text(strip=True)
                row = [state, branch, address]
                csv_output.writerow(row)
            else:
                print(soup_branch.title)
Giving you an output CSV file starting:
State,Branch,Address
West Bengal,Kolkata,"PTI Building, 4th Floor,DP Block, DP-9, Salt Lake CityCalcutta, 700091"
West Bengal,Maheshtala,"Narmada Park, Par Bangla,Baddir Bandh Bus Stop,Opp Lane Kismat Nungi Road,Maheshtala,Kolkata- 700140. (W.B)"
West Bengal,ShyamBazar,"First Floor, 6 F.b.T. Road,Ward No.-6,Kolkata-700002"
You should slow the script down with a time.sleep(0.5) between requests to avoid putting too much load on the server.
Note: [1:] is used because the first item in the drop-down lists is not a branch or state, but a 'Select Branch' entry.
