Python Scraping from website

Python Scraping from website - python

I've tried to write a web scraper for https://www.waug.com/area/?idx=15:
#!/usr/bin/env python3
#_*_coding:utf8_*_
import requests
from bs4 import BeautifulSoup
url = requests.get('https://www.abcd.com/area/?abc=15')
html = url.text
soup = BeautifulSoup(html, 'html.parser')
count = 1
names = soup.select('#good_{} > div > div.class_name > div > div'.format(count))
prices = soup.select('#good_{} > div > div.class_name > div.class_name'.format(count))
for name in names:
while count < 45:
print(name.text)
count = count + 1
for price in prices:
while count < 45:
print(price.text)
count = count + 1
The output is only 45 times first item name and no price. How can I get all item name and price? I want to get item name and price on same line. (I've changed the url and some of the class names just in case)

In order to be sure to get the right name for the right title I'd get the whole "item-good" class.
Then using a for loop would allow me to be sure that the title I am getting matches its price.
Here's an example of how to parse a website with BeautifulSoup:
#!/usr/bin/env python3
#_*_coding:utf8_*_
import requests
from bs4 import BeautifulSoup
url = requests.get('https://www.waug.com/area/?idx=15')
html = url.text
soup = BeautifulSoup(html, 'html.parser')
count = 1
items = soup.findAll("div", {"class": "item-good"})
for item in items:
item_title = item.find("div", {"class": "good-title-text"})
item_price = item.find("div", {"class": "price-selling"})
print item_title.text + " " + item_price.text
# If you get encoding errors delete the row above and uncomment the one below
#print item_title.text.encode("utf-8") + " " + item_price.text.encode("utf-8")
As per OP's request this is not enough because there is a "more" button to push in the webpage in order to retrieve all the results.
This can be done using Selenium Webdriver.
=== IMPORTANT NOTE ===
In order to make this work you'll need to copy in your script folder also the "chromedriver" file.
You can download it from this Google website.
Here's the script:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
browser = webdriver.Chrome()
browser.get('https://www.waug.com/area/?idx=15')
for number in range(10):
try:
WebDriverWait(browser, 60).until(EC.presence_of_element_located((By.ID, "more_good")))
more_button = browser.find_element_by_id('more_good')
more_button.click()
time.sleep(10)
except:
print "Scrolling is now complete!"
source = browser.page_source
# This source variable should be used as input for BeautifulSoup
print source
Now it is tie to merge the two explained soultions in order to get the final requested result.
Please keep it mind that this is just a quick'n'dirty hack and needs proper error handling and polishing but it should be enough to get you started:
#!/usr/bin/env python3
#_*_coding:utf8_*_
from bs4 import BeautifulSoup
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
browser = webdriver.Chrome()
browser.get('https://www.waug.com/area/?idx=15')
def is_page_load_complete():
close_button = browser.find_element_by_id('close_good');
return close_button.is_displayed();
while(True):
WebDriverWait(browser, 60).until(EC.presence_of_element_located((By.ID, "more_good")))
time.sleep(10)
more_button = browser.find_element_by_id('more_good')
if (more_button.is_displayed()):
more_button.click()
else:
if (is_page_load_complete()):
break
source = browser.page_source
soup = BeautifulSoup(source, 'html.parser')
items = soup.findAll("div", {"class": "item-good"})
for item in items:
item_title = item.find("div", {"class": "good-title-text"})
item_price = item.find("div", {"class": "price-selling"})
print item_title.text + " " + item_price.text
# If you get encoding errors comment the row above and uncomment the one below
#print item_title.text.encode("utf-8") + " " + item_price.text.encode("utf-8")
print "Total items found: " + str(len(items))

Related

Want to scraping titles, dates, links, and content from IOL website but can't

I am new to web scraping, and I am trying to scrape the titles, dates, links, and contents of news articles on this website: https://www.iol.co.za/news/south-africa/eastern-cape.
The titles of the articles have different class names and heading (h) tag. I was able to scrape the dates, links, and titles using h tag. However, when I tried to store them in a pandas dataframe, I received the following errors-> ValueError: All arrays must be of the same length.
I also wrote the code to get the content of each article using the links. I got an error as well. I will thankful if I can be assisted.
I have tried different options to scrape the titles by creating a list of the different class names, but to no avail.
Please see my code below:
import sys, time
from bs4 import BeautifulSoup
import requests
import pandas as pd
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from datetime import timedelta
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
import re
art_title = [] # to store the titles of all news article
art_date = [] # to store the dates of all news article
art_link = [] # to store the links of all news article
pagesToGet = ['south-africa/eastern-cape']
for i in range(0, len(pagesToGet)):
print('processing page : \n')
url = 'https://www.iol.co.za' + str(pagesToGet[i])
print(url)
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.maximize_window()
#time.sleep(5) # allow you to sleep your code before your retrieve the elements from the webpage. Additionally, to
# prevent the chrome driver opening a new instance for every url, open the browser outside of the loop.
# an exception might be thrown, so the code should be in a try-except block
try:
# use the browser to get the url. This is suspicious command that might blow up.
driver.get("https://www.iol.co.za/news/" +str(pagesToGet[i]))
except Exception as e: # this describes what to do if an exception is thrown
error_type, error_obj, error_info = sys.exc_info() # get the exception information
print('ERROR FOR LINK:', url) # print the link that cause the problem
print(error_type, 'Line:', error_info.tb_lineno) # print error info and line that threw the exception
continue # ignore this page. Abandon this and go back.
time.sleep(3) # Allow 3 seconds for the web page to open
# Code to scroll the screen to the end and click on more news till the 15th page before scraping all the news
k = 1
while k<=2:
scroll_pause_time = 1 # You can set your own pause time. My laptop is a bit slow so I use 1 sec
screen_height = driver.execute_script("return window.screen.height;") # get the screen height of the web
i = 1
while True:
# scroll one screen height each time
driver.execute_script("window.scrollTo(0, {screen_height}*{i});".format(screen_height=screen_height, i=i))
i += 1
time.sleep(scroll_pause_time)
# update scroll height each time after scrolled, as the scroll height can change after we scrolled the page
scroll_height = driver.execute_script("return document.body.scrollHeight;")
# Break the loop when the height we need to scroll to is larger than the total scroll height
if (screen_height) * i > scroll_height:
break
driver.find_element(By.CSS_SELECTOR, '.Articles__MoreFromButton-sc-1mrfc98-0').click()
k += 1
time.sleep(1)
soup = BeautifulSoup(driver.page_source, 'html.parser')
news = soup.find_all('article', attrs={'class': 'sc-ifAKCX'})
print(len(news))
# Getting titles, dates, and links
for j in news:
# Article title
title = j.findAll(re.compile('^h[1-6]'))
for news_title in title:
art_title.append(news_title.text)
# Article dates
dates = j.find('p', attrs={'class': 'sc-cIShpX'})
if dates is not None:
date = dates.text
split_date = date.rsplit('|', 1)[1][10:].rsplit('<', 1)[0]
art_date.append(split_date)
# Article links
address = j.find('a').get('href')
news_link = 'https://www.iol.co.za' + address
art_link.append(news_link)
df = pd.DataFrame({'Article_Title': art_title, 'Date': art_date, 'Source': art_link})
# Getting contents
new_articles = ...struggling to write the code
df['Content'] = news_articles
df.to_csv('data.csv')
driver.quit()

I think this is what you are looking for:
# Needed libs
from selenium.webdriver import ActionChains, Keys
import time
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium import webdriver
# Initialize drivver and navigate
driver = webdriver.Chrome()
driver.maximize_window()
url = 'https://www.iol.co.za/news/south-africa/eastern-cape'
wait = WebDriverWait(driver, 5)
driver.get(url)
time.sleep(3)
# take the articles
articles = wait.until(EC.presence_of_all_elements_located((By.XPATH, f"//article//*[(name() = 'h1' or name()='h2' or name()='h3' or name()='h4' or name()='h5' or name()='h6' or name()='h7') and string-length(text()) > 0]/ancestor::article")))
# For every article we take what we want
for article in articles:
header = article.find_element(By.XPATH, f".//*[name() = 'h1' or name()='h2' or name()='h3' or name()='h4' or name()='h5' or name()='h6' or name()='h7']")
print(header.get_attribute('textContent'))
author_and_date = article.find_elements(By.XPATH, f".//*[name() = 'h1' or name()='h2' or name()='h3' or name()='h4' or name()='h5' or name()='h6' or name()='h7']/following-sibling::p[1]")
if author_and_date:
print(author_and_date[0].get_attribute('textContent'))
else:
print("No author found")
link = article.find_element(By.XPATH, f".//a")
print(link.get_attribute('href'))

How to scrape data from each product page from Aliexpress using python selenium

I am trying to scrape each product page from this website: https://www.aliexpress.com/wholesale?catId=0&initiative_id=SB_20220315022920&SearchText=bluetooth+earphones
Especially I want to get comments and custumer countries as I mentionned in the photo:
enter image description here
The main issue is that my code does not inspect the right elements and this is what I am struggling with .
First, I tried my scraping on this product : https://www.aliexpress.com/item/1005003801507855.html?spm=a2g0o.productlist.0.0.1e951bc72xISfE&algo_pvid=6d3ed61e-f378-43d0-a429-5f6cddf3d6ad&algo_exp_id=6d3ed61e-f378-43d0-a429-5f6cddf3d6ad-8&pdp_ext_f=%7B%22sku_id%22%3A%2212000027213624098%22%7D&pdp_pi=-1%3B40.81%3B-1%3B-1%40salePrice%3BMAD%3Bsearch-mainSearch
Here is my code :
from selenium import webdriver
from selenium.webdriver.common.by import By
from lxml import html
import cssselect
from time import sleep
from itertools import zip_longest
import csv
driver = webdriver.Edge(executable_path=r"C:/Users/OUISSAL/Desktop/wscraping/XEW/scraping/codes/msedgedriver")
url = "https://www.aliexpress.com/item/1005003801507855.html?spm=a2g0o.productlist.0.0.1e951bc72xISfE&algo_pvid=6d3ed61e-f378-43d0-a429-5f6cddf3d6ad&algo_exp_id=6d3ed61e-f378-43d0-a429-5f6cddf3d6ad-8&pdp_ext_f=%7B%22sku_id%22%3A%2212000027213624098%22%7D&pdp_pi=-1%3B40.81%3B-1%3B-1%40salePrice%3BMAD%3Bsearch-mainSearch"
with open ("data.csv", "w", encoding="utf-8") as csvfile:
wr = csv.writer(csvfile)
wr.writerow(["Comment","Custumer country"])
driver.get(url)
driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
review_buttom = driver.find_element_by_xpath('//li[#ae_button_type="tab_feedback"]')
review_buttom.click()
html_source = driver.find_element_by_xpath('//div[#id="transction-feedback"]')
tree = html.fromstring(html_source)
#tree = html.fromstring(driver.page_source)
for rvw in tree.xpath('//div[#class="feedback-item clearfix"]'):
country = rvw.xpath('//div[#class="user-country"]//b/text()')
if country:
country = country[0]
else:
country = ''
print('country:', country)
comment = rvw.xpath('//dt[#id="buyer-feedback"]//span/text()')
if comment:
comment = comment[0]
else:
comment = ''
print('comment:', comment)
driver.close()
Thank you !!

What happens?
There is one main issue, the feedback you are looking for is in an iframe, so you wont get your information by calling the elements directly.
How to fix?
Scroll into view of element that holds the iframe navigate to its source and interact with its pagination to get all the feedbacks.
Example
from selenium import webdriver
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
url = 'https://www.aliexpress.com/item/1005003801507855.html?spm=a2g0o.productlist.0.0.1e951bc72xISfE&algo_pvid=6d3ed61e-f378-43d0-a429-5f6cddf3d6ad&algo_exp_id=6d3ed61e-f378-43d0-a429-5f6cddf3d6ad-8&pdp_ext_f=%7B%22sku_id%22%3A%2212000027213624098%22%7D&pdp_pi=-1%3B40.81%3B-1%3B-1%40salePrice%3BMAD%3Bsearch-mainSearch'
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.maximize_window()
driver.get(url)
wait = WebDriverWait(driver, 10)
driver.execute_script("arguments[0].scrollIntoView();", wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, '.tab-content'))))
driver.get(wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, '#product-evaluation'))).get_attribute('src'))
data=[]
while True:
for e in driver.find_elements(By.CSS_SELECTOR, 'div.feedback-item'):
try:
country = e.find_element(By.CSS_SELECTOR, '.user-country > b').text
except:
country = None
try:
comment = e.find_element(By.CSS_SELECTOR, '.buyer-feedback span').text
except:
comment = None
data.append({
'country':country,
'comment':comment
})
try:
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, '#complex-pager a.ui-pagination-next'))).click()
except:
break
pd.DataFrame(data).to_csv('filename.csv',index=False)

Wrong output in the CSV file using xPath expression

I wrote a code to get the following value "Exam Code", "Exam Name" and "Total Question". The issue is that in the put CSV file I am getting the wrong value in the "Exam Code" column. I am getting the same value as "Exam Name". The xPath looks fine to me. I don't know where is the issue happening.
Following is the code:
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
import time
option = Options()
option.add_argument("--disable-infobars")
option.add_argument("start-maximized")
option.add_argument("--disable-extensions")
option.add_experimental_option("excludeSwitches", ['enable-automation'])
# Pass the argument 1 to allow and 2 to block
# option.add_experimental_option("prefs", {
# "profile.default_content_setting_values.notifications": 1
# })
driver = webdriver.Chrome(chrome_options=option, executable_path='C:\\Users\\Awais\\Desktop\\web crawling\\chromedriver.exe')
url = ["https://www.marks4sure.com/210-060-exam.html",
"https://www.marks4sure.com/210-065-exam.html",
"https://www.marks4sure.com/200-355-exam.html",
"https://www.marks4sure.com/9A0-127-exam.html",
"https://www.marks4sure.com/300-470-exam.html",]
driver.implicitly_wait(0.5)
na = "N/A"
# text = 'Note: This exam is available on Demand only. You can Pre-Order this Exam and we will arrange this for you.'
links = []
exam_code = []
exam_name = []
total_q = []
for items in range(0, 5):
driver.get(url[items])
# if driver.find_element_by_xpath("//div[contains(#class, 'alert') and contains(#class, 'alert-danger')]") == text:
# continue
items += 1
try:
c_url = driver.current_url
links.append(c_url)
except:
pass
try:
codes = driver.find_element_by_xpath('''//div[contains(#class, 'col-sm-6') and contains(#class, 'exam-row-data') and position() = 2]''')
exam_code.append(codes.text)
except:
exam_code.append(na)
try:
names = driver.find_element_by_xpath('//*[#id="content"]/div/div[1]/div[2]/div[3]/div[2]/a')
exam_name.append(names.text)
except:
exam_name.append(na)
try:
question = driver.find_element_by_xpath('//*[#id="content"]/div/div[1]/div[2]/div[4]/div[2]/strong')
total_q.append(question.text)
except:
total_q.append(na)
continue
all_info = list(zip(links, exam_name, exam_name, total_q))
print(all_info)
df = pd.DataFrame(all_info, columns=["Links", "Exam Code", "Exam Name", "Total Question"])
df.to_csv("data5.csv", index=False)
driver.close()

You are getting the exam name in there twice, and instead of exam codes because that's what you are telling it to do (minor typo here with having exam_name in there twice):
all_info = list(zip(links, exam_name, exam_name, total_q))
change to: all_info = list(zip(links, exam_code, exam_name, total_q))
Few things I'm confused about.
1) Why use Selnium? There is no need for selenium as the data is returned in the initial request in the html source. So I would just use requests as it would speed up the processing.
2) The link and the exam code are already in the url you are iterating through. I would just split or use regex to that string to get the link and the code. You only really need to get the exam name and number of questions then.
With that being said, I adjusted it slightly to just get exam name and number of questions:
import requests
from bs4 import BeautifulSoup
import pandas as pd
urls = ["https://www.marks4sure.com/210-060-exam.html",
"https://www.marks4sure.com/210-065-exam.html",
"https://www.marks4sure.com/200-355-exam.html",
"https://www.marks4sure.com/9A0-127-exam.html",
"https://www.marks4sure.com/300-470-exam.html",]
links = []
exam_code = []
exam_name = []
total_q = []
for url in urls:
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
links.append(url)
exam_code.append(url.rsplit('-exam')[0].split('/')[-1])
exam_row = soup.select('div[class*="exam-row-data"]')
for exam in exam_row:
if exam.text == 'Exam Name: ':
exam_name.append(exam.find_next_sibling("div").text)
continue
if 'Questions' in exam.text and 'Total Questions' not in exam.text:
total_q.append(exam.text.strip())
continue
all_info = list(zip(links, exam_code, exam_name, total_q))
print(all_info)
df = pd.DataFrame(all_info, columns=["Links", "Exam Code", "Exam Name", "Total Question"])
df.to_csv("data5.csv", index=False)

Hi to get the exam code I think it is better to work with regex and get it from URL itself.
Also below code gives me the exam codes correctly except for 4th link which has a different structure as compared to others.
# -*- coding: utf-8 -*-
"""
Created on Fri Mar 6 14:48:00 2020
#author: prakh
"""
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
import time
option = Options()
option.add_argument("--disable-infobars")
option.add_argument("start-maximized")
option.add_argument("--disable-extensions")
option.add_experimental_option("excludeSwitches", ['enable-automation'])
# Pass the argument 1 to allow and 2 to block
# option.add_experimental_option("prefs", {
# "profile.default_content_setting_values.notifications": 1
# })
driver = webdriver.Chrome(executable_path='C:/Users/prakh/Documents/PythonScripts/chromedriver.exe')
url = ["https://www.marks4sure.com/210-060-exam.html",
"https://www.marks4sure.com/210-065-exam.html",
"https://www.marks4sure.com/200-355-exam.html",
"https://www.marks4sure.com/9A0-127-exam.html",
"https://www.marks4sure.com/300-470-exam.html",]
driver.implicitly_wait(0.5)
na = "N/A"
# text = 'Note: This exam is available on Demand only. You can Pre-Order this Exam and we will arrange this for you.'
links = []
exam_code = []
exam_name = []
total_q = []
for items in range(0, 5):
driver.get(url[items])
# if driver.find_element_by_xpath("//div[contains(#class, 'alert') and contains(#class, 'alert-danger')]") == text:
# continue
items += 1
try:
c_url = driver.current_url
links.append(c_url)
except:
pass
try:
codes = driver.find_element_by_xpath('//*[#id="content"]/div/div[1]/div[2]/div[2]/div[2]')
exam_code.append(codes.text)
except:
exam_code.append(na)
try:
names = driver.find_element_by_xpath('//*[#id="content"]/div/div[1]/div[2]/div[3]/div[2]/a')
exam_name.append(names.text)
except:
exam_name.append(na)
try:
question = driver.find_element_by_xpath('//*[#id="content"]/div/div[1]/div[2]/div[4]/div[2]/strong')
total_q.append(question.text)
except:
total_q.append(na)
continue
all_info = list(zip(links, exam_code, exam_name, total_q))
print(all_info)
df = pd.DataFrame(all_info, columns=["Links", "Exam Code", "Exam Name", "Total Question"])
df.to_csv("data5.csv", index=False)
driver.close()

You don't need selenium because the source code contains the info you need without using JavaScript.
Also, most pages redirect to marks4sure.com/200-301-exam.html, so you'll get the same results. Only marks4sure.com/300-470-exam.html don't.
import requests
from bs4 import BeautifulSoup
urls = ["https://www.marks4sure.com/210-060-exam.html",
"https://www.marks4sure.com/210-065-exam.html",
"https://www.marks4sure.com/200-355-exam.html",
"https://www.marks4sure.com/9A0-127-exam.html",
"https://www.marks4sure.com/300-470-exam.html",]
with open("output.csv", "w") as f:
f.write("exam_code,exam_name,exam_quest\n")
for url in urls:
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html5lib')
for n, v in enumerate(soup.find_all(class_ = "col-sm-6 exam-row-data")):
if n == 1:
exam_code = v.text.strip()
if n == 3:
exam_name = v.text.strip()
if n == 5:
exam_quest = v.text.strip()
f.write(f"{exam_code},{exam_name},{exam_quest}\n")

Selenium and BeautifulSoup scraper only scrapes first page of results

I have built a scraper, which works just fine in terms of handling data, but for some reason, it only scrapes the first page of the desired search.
I have two functions, one to find the elements I want in the page and the other to search for a NEXT link and click it if it exists. Otherwise, the scraper prints just that page and continues. I am using the following:
from __future__ import print_function
import fileinput
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import re
import sys
reload(sys)
sys.setdefaultencoding('utf8')
letters = ["x"]
for letter in letters:
try:
driver = webdriver.Chrome()
#driver.set_window_size(1120, 550)
driver.get("http://sam.gov")
driver.find_element_by_css_selector("a.button[title='Search Records']").click()
except:
print("Failed for " + letter)
pass
else:
driver.find_element_by_id('q').send_keys(letter)
driver.find_element_by_id('RegSearchButton').click()
def findRecords():
bsObj = BeautifulSoup(driver.page_source, "html.parser")
tableList = bsObj.find_all("table", {"class":"width100 menu_header_top_emr"})
tdList = bsObj.find_all("td", {"class":"menu_header width100"})
for table,td in zip(tableList,tdList):
a = table.find_all("span", {"class":"results_body_text"})
b = td.find_all("span", {"class":"results_body_text"})
hf = open("sam.csv",'a')
hf.write(', '.join(tag.get_text().strip() for tag in a+b) +'\n')
def crawl():
if driver.find_element_by_id('anch_16'):
print("Found next button")
findRecords()
driver.find_element_by_id('anch_16').click()
print("Going to next page")
else:
print("Scraping last page for " + letter)
findRecords()
print("Done scraping letter " + letter + "\nNow cleaning results file...")
seen = set() # set for fast O(1) amortized lookup
for line in fileinput.FileInput('sam.csv', inplace=1):
if line in seen: continue # skip duplicate
seen.add(line)
print(line)
print("Scraping and cleaning done for " + letter)
crawl()
driver.quit()

Selenium find all elements by xpath

I used selenium to scrap a scrolling website and conducted the code below
import requests
from bs4 import BeautifulSoup
import csv
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
import unittest
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import time
import unittest
import re
output_file = open("Kijubi.csv", "w", newline='')
class Crawling(unittest.TestCase):
def setUp(self):
self.driver = webdriver.Firefox()
self.driver.set_window_size(1024, 768)
self.base_url = "http://www.viatorcom.de/"
self.accept_next_alert = True
def test_sel(self):
driver = self.driver
delay = 3
driver.get(self.base_url + "de/7132/Seoul/d973-allthingstodo")
for i in range(1,1):
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(2)
html_source = driver.page_source
data = html_source.encode("utf-8")
My next step was to crawl specific information from the website like the price.
Hence, I added the following code:
all_spans = driver.find_elements_by_xpath("/html/body/div[5]/div/div[3]/div[2]/div[2]/div[1]/div[1]/div")
print(all_spans)
for price in all_spans:
Header = driver.find_elements_by_xpath("/html/body/div[5]/div/div[3]/div[2]/div[2]/div[1]/div[1]/div/div[2]/div[2]/span[2]")
for span in Header:
print(span.text)
But I get just one price instead all of them. Could you provide me feedback on what I could improve my code? Thanks:)
EDIT
Thanks to your guys I managed to get it running. Here is the additional code:
elements = driver.find_elements_by_xpath("//div[#id='productList']/div/div")
innerElements = 15
outerElements = len(elements)/innerElements
print(innerElements, "\t", outerElements, "\t", len(elements))
for j in range(1, int(outerElements)):
for i in range(1, int(innerElements)):
headline = driver.find_element_by_xpath("//div[#id='productList']/div["+str(j)+"]/div["+str(i)+"]/div/div[2]/h2/a").text
price = driver.find_element_by_xpath("//div[#id='productList']/div["+str(j)+"]/div["+str(i)+"]/div/div[2]/div[2]/span[2]").text
deeplink = driver.find_element_by_xpath("//div[#id='productList']/div["+str(j)+"]/div["+str(i)+"]/div/div[2]/h2/a").get_attribute("href")
print("Header: " + headline + " | " + "Price: " + price + " | " + "Deeplink: " + deeplink)
Now my last issue is that I still do not get the last 20 prices back, which have a English description. I only get back the prices which have German description. For English ones, they do not get fetched although they share the same html structure.
E.g. html structure for the English items
headline = driver.find_element_by_xpath("//div[#id='productList']/div[6]/div[1]/div/div[2]/h2/a")
Do you guys know what I have to modify? Any feedback is appreciated:)

To grab all prices on that page you should use such XPATH:
Header = driver.find_elements_by_xpath("//span[contains(concat(' ', normalize-space(#class), ' '), 'price-amount')]")
which means: find all span elements with class=price-amount, why so complex - see here
But more simply to find the same elements is by CSS locator:
.price-amount

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python Scraping from website - python

Related

Want to scraping titles, dates, links, and content from IOL website but can't

How to scrape data from each product page from Aliexpress using python selenium

Wrong output in the CSV file using xPath expression

Selenium and BeautifulSoup scraper only scrapes first page of results

Selenium find all elements by xpath

Categories

Resources