Exporting Scrapy items with Selenium scraped content - python

I'm trying to scrape a website using Scrapy and Selenium, and everything works just fine except the "yield item" part of the code.
In the "def parse_product(self, response)" part, I'm using Selenium find_element_by_css_selector to fill a list and then use the "for element in zip(list1, list2, etc)" approach to generate my items. I have also set up a Pipeline to export the result into a csv.
The problem is that although my spider is scraping the objects correctly (I have tested it with some prints along the way), the item creation part is not working and I'm getting an empty CSV.
I have tried another approach that works, but is too slow. It consists of defining a downloader middleware that passes the request through Selenium, loads the page source and returns an HtmlResponse. Then I simply use the response.css() method to fill the lists, the same approach to generate the items and the same pipeline to export them as CSV.
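That slow-but-working middleware approach would look roughly like this (a sketch only - the class name and driver setup are assumptions, and it has to be enabled in DOWNLOADER_MIDDLEWARES):
# middlewares.py - sketch of the Selenium downloader middleware described above
from scrapy.http import HtmlResponse
from selenium import webdriver

class SeleniumMiddleware:
    def __init__(self):
        self.browser = webdriver.Firefox()

    def process_request(self, request, spider):
        self.browser.get(request.url)
        # Return the rendered page so spider callbacks can use response.css()
        return HtmlResponse(
            url=self.browser.current_url,
            body=self.browser.page_source,
            encoding='utf-8',
            request=request,
        )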
spider.py
def __init__(self):
    # Headless option
    opt = Options()
    opt.headless = True
    # Get the Firefox profile object
    prf = FirefoxProfile()
    # Disable CSS
    prf.set_preference('permissions.default.stylesheet', 2)
    # Disable images
    prf.set_preference('permissions.default.image', 2)
    self.browser = webdriver.Firefox(firefox_options=opt, firefox_profile=prf)
def parse(self, response):
    self.browser.get(response.url)
    print('Current URL: ' + response.request.url)
    # Find the total number of pages
    # Go to last page (click on '>>')
    self.browser.find_element_by_css_selector('li.ais-pagination__item:nth-child(9) a:nth-child(1)').click()
    n = self.browser.find_element_by_css_selector('li.ais-pagination__item:nth-child(7) a:nth-child(1)').get_attribute('text')
    pages = int(n.strip())
    # Go back to first page (click on '<<')
    self.browser.find_element_by_css_selector('li.ais-pagination__item:nth-child(1) a:nth-child(1)').click()
    # Scrape product links
    pdcts = []
    href = []
    i = 1
    while i <= n:
        # Append product links
        atag = self.browser.find_elements_by_css_selector('div[class*="ph-proucts"] a')
        for a in atag:
            href.append(a.get_attribute('href'))
        prd = list(set(href))
        for p in prd:
            pdcts.append(p)
        # Load new page of products (click '>')
        self.browser.find_element_by_css_selector('li.ais-pagination__item:nth-child(8) a:nth-child(1)').click()
        i += 1
    for link in pdcts:
        yield scrapy.Request(url=link, callback=self.parse_product)
def parse_product(self, response):
    self.browser.get(response.url)
    print("Current URL: " + response.request.url)
    # Item initialization
    name = []
    # Item filling
    try:
        n = self.browser.find_element_by_css_selector('div[class="flex xs12 sm12 md12"] h1').text
        name.append(string.capwords(n))
    except:
        n1 = response.request.url
        n = n1.split('products/')[1].replace('-', ' ')
        name.append(string.capwords(n))
    # Item creation
    for i in zip(name):
        self.browser.quit()
        item = ProjectItem()
        item['name'] = i[0]
        yield item
The expected result is a CSV with the scraped information, but I'm getting an empty one instead.
Could anyone help me with this, please? I would really appreciate it.

Related

collecting multiple data from multiple requests into one item in scrapy

Basically, I have a website that contains clothing items. I start my spider on a page that lists all the items, loop over them one by one, and enter each item's page through its URL. Then I try to get the values of the images (.jpeg URLs) and return them (each item comes in multiple colors, so I'm trying to take all the images of all the colors of that specific item). The problem is that my code currently returns the color URLs on separate lines. What I want is to return all the color URLs of a specific item on one line of the JSON file and then loop to the next item.
My current code:
import scrapy

class USSpider(scrapy.Spider):
    name = 'US'
    start_urls = ['https://tr.uspoloassn.com/sadece-online-erkek/?attributes_filterable_product_base_type=T-Shirt']

    def parse(self, response):
        for j in range(int(response.css('.js-product-list-load').xpath("@page").extract_first()),
                       int(response.css('.js-product-list-load').xpath("@numpages").extract_first())):
            l = 'https://tr.uspoloassn.com/sadece-online-erkek/?attributes_filterable_product_base_type=T-Shirt' + '&page=' + str(j)
            yield scrapy.Request(url=l, callback=self.parse2)

    def parse2(self, response):
        for i in range(len(response.css('a.js-product-images-wrapper'))):
            link = 'https://tr.uspoloassn.com' + response.css('a.js-product-images-wrapper')[i].attrib['href']
            Url = response.urljoin(link)
            yield scrapy.Request(Url, callback=self.parse3)

    def parse3(self, response):
        colors = list(set(response.xpath('//*[@class="js-variant-area "]').css('ul li').xpath("//a[@class='js-variant ']").xpath(
            "@data-value").extract()))
        link = response.url
        arnold = []
        for i in colors:
            if i[0].lower() == 'v':
                url = link + '?integration_color=' + i
                yield scrapy.Request(url, callback=self.parseImage)

    def parseImage(self, response):
        yield {
            'image links': response.css("a.js-product-thumbnail").xpath("@data-image").extract()
        }

How to iterate a variable in XPATH, extract a link and store it into a list for further iteration

I'm following a Selenium tutorial for an Amazon price tracker (Clever Programming on YouTube) and I got stuck at getting the links from Amazon using their techniques.
tutorial link: https://www.youtube.com/watch?v=WbJeL_Av2-Q&t=4315s
I realized the problem lies in the fact that I'm only getting one link out of the 17 available after doing the product search. I need to get all the links for every product after doing a search and then use them to get into each product and grab its title, seller and price.
The function get_products_links() should get all the links and store them in a list to be used by the function get_product_info().
def get_products_links(self):
    self.driver.get(self.base_url)  # Go to amazon.com using BASE_URL
    element = self.driver.find_element_by_id('twotabsearchtextbox')
    element.send_keys(self.search_term)
    element.send_keys(Keys.ENTER)
    time.sleep(2)  # Wait to load page
    self.driver.get(f'{self.driver.current_url}{self.price_filter}')
    time.sleep(2)  # Wait to load page
    result_list = self.driver.find_elements_by_class_name('s-result-list')
    links = []
    try:
        ### Trying to get a list of XPath link attributes ###
        ### Only numbers from 3 to 17 work after doing a product search where 'i' is placed in the XPath ###
        i = 3
        results = result_list[0].find_elements_by_xpath(
            f'//*[@id="search"]/div[1]/div[1]/div/span[3]/div[2]/div[{i}]/div/div/div/div/div/div[1]/div/div[2]/div/span/a')
        links = [link.get_attribute('href') for link in results]
        return links
    except Exception as e:
        print("Didn't get any products...")
        print(e)
        return links
At this point get_products_links() only returns one link, since I just made 'i' a fixed value of 3 to make it work for now.
I was thinking of iterating 'i' somehow so I can save every different path, but I don't know how to implement this.
I've tried performing a for loop and appending the results to a new list, but then the app stops working.
Here is the complete code:
from amazon_config import (
    get_web_driver_options,
    get_chrome_web_driver,
    set_browser_as_incognito,
    set_ignore_certificate_error,
    NAME,
    CURRENCY,
    FILTERS,
    BASE_URL,
    DIRECTORY
)
import time
from selenium.webdriver.common.keys import Keys

class GenerateReport:
    def __init__(self):
        pass

class AmazonAPI:
    def __init__(self, search_term, filters, base_url, currency):
        self.base_url = base_url
        self.search_term = search_term
        options = get_web_driver_options()
        set_ignore_certificate_error(options)
        set_browser_as_incognito(options)
        self.driver = get_chrome_web_driver(options)
        self.currency = currency
        self.price_filter = f"&rh=p_36%3A{filters['min']}00-{filters['max']}00"

    def run(self):
        print("Starting script...")
        print(f"Looking for {self.search_term} products...")
        links = self.get_products_links()
        time.sleep(1)
        if not links:
            print("Stopped script.")
            return
        print(f"Got {len(links)} links to products...")
        print("Getting info about products...")
        products = self.get_products_info(links)
        # self.driver.quit()

    def get_products_info(self, links):
        asins = self.get_asins(links)
        product = []
        for asin in asins:
            product = self.get_single_product_info(asin)

    def get_single_product_info(self, asin):
        print(f"Product ID: {asin} - getting data...")
        product_short_url = self.shorten_url(asin)
        self.driver.get(f'{product_short_url}?language=en_GB')
        time.sleep(2)
        title = self.get_title()
        seller = self.get_seller()
        price = self.get_price()

    def get_title(self):
        try:
            return self.driver.find_element_by_id('productTitle')
        except Exception as e:
            print(e)
            print(f"Can't get title of a product - {self.driver.current_url}")
            return None

    def get_seller(self):
        try:
            return self.driver.find_element_by_id('bylineInfo')
        except Exception as e:
            print(e)
            print(f"Can't get title of a product - {self.driver.current_url}")
            return None

    def get_price(self):
        return '$99'

    def shorten_url(self, asin):
        return self.base_url + 'dp/' + asin

    def get_asins(self, links):
        return [self.get_asin(link) for link in links]

    def get_asin(self, product_link):
        return product_link[product_link.find('/dp/') + 4:product_link.find('/ref')]

    def get_products_links(self):
        self.driver.get(self.base_url)  # Go to amazon.com using BASE_URL
        element = self.driver.find_element_by_id('twotabsearchtextbox')
        element.send_keys(self.search_term)
        element.send_keys(Keys.ENTER)
        time.sleep(2)  # Wait to load page
        self.driver.get(f'{self.driver.current_url}{self.price_filter}')
        time.sleep(2)  # Wait to load page
        result_list = self.driver.find_elements_by_class_name('s-result-list')
        links = []
        try:
            ### Trying to get a list of XPath link attributes ###
            ### Only numbers from 3 to 17 work after doing a product search where 'i' is placed ###
            i = 3
            results = result_list[0].find_elements_by_xpath(
                f'//*[@id="search"]/div[1]/div[1]/div/span[3]/div[2]/div[{i}]/div/div/div/div/div/div[1]/div/div[2]/div/span/a')
            links = [link.get_attribute('href') for link in results]
            return links
        except Exception as e:
            print("Didn't get any products...")
            print(e)
            return links

if __name__ == '__main__':
    print("HEY!!!🚀🔥")
    amazon = AmazonAPI(NAME, FILTERS, BASE_URL, CURRENCY)
    amazon.run()
Steps to run the script:
Step 1:
Install Selenium==3.141.0 into your virtual environment.
Step 2:
Search for ChromeDriver on Google and download the driver that matches your Chrome version. After downloading, extract the driver and place it in your working folder.
Step 3:
Create a file called amazon_config.py and insert the following code:
from selenium import webdriver

DIRECTORY = 'reports'
NAME = 'PS4'
CURRENCY = '$'
MIN_PRICE = '275'
MAX_PRICE = '650'
FILTERS = {
    'min': MIN_PRICE,
    'max': MAX_PRICE
}
BASE_URL = "https://www.amazon.com/"

def get_chrome_web_driver(options):
    return webdriver.Chrome('./chromedriver', chrome_options=options)

def get_web_driver_options():
    return webdriver.ChromeOptions()

def set_ignore_certificate_error(options):
    options.add_argument('--ignore-certificate-errors')

def set_browser_as_incognito(options):
    options.add_argument('--incognito')
If you performed the steps correctly, you should be able to run the script, and it will do the following:
Go to www.amazon.com
Search for a product (In this case "PS4")
Get a link for the first product
Visit that product link
Terminal should print:
HEY!!!🚀🔥
Starting script...
Looking for PS4 products...
Got 1 links to products...
Getting info about products...
Product ID: B012CZ41ZA - getting data...
What I'm not able to do is get all the links and iterate over them so that the script visits every link on the first page.
If you are able to get all links, the terminal should print:
HEY!!!🚀🔥
Starting script...
Looking for PS4 products...
Got 1 links to products...
Getting info about products...
Product ID: B012CZ41ZA - getting data...
Product ID: XXXXXXXXXX - getting data...
Product ID: XXXXXXXXXX - getting data...
Product ID: XXXXXXXXXX - getting data...
# and so on until all links are visited
I can't run it, so I can only guess how I would do it.
I would put the try/except inside a for loop, use links.append() instead of links = [...], and return after exiting the loop:
# --- before loop ---
links = []

# --- loop ---
for i in range(3, 18):
    try:
        results = result_list[0].find_elements_by_xpath(
            f'//*[@id="search"]/div[1]/div[1]/div/span[3]/div[2]/div[{i}]/div/div/div/div/div/div[1]/div/div[2]/div/span/a')
        for link in results:
            links.append(link.get_attribute('href'))
    except Exception as e:
        print(f"Didn't get any products... (i = {i})")
        print(e)

# --- after loop ---
return links
But I would also try to use an XPath with // to skip most of the divs - and maybe, if I skipped div[{i}], I could get all the products without the for loop.
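For example, something like this might work (I can't test it, and the selector is only a guess at Amazon's current HTML):
# Untested: skip the positional div[{i}] and match every result link in one query.
# The data-component-type attribute is a guess at Amazon's search-result markup.
results = result_list[0].find_elements_by_xpath(
    '//div[@data-component-type="s-search-result"]//h2//a')
links = [link.get_attribute('href') for link in results]
return links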
BTW: in get_products_info() I see a similar problem - you create an empty list product = [], but later in the loop you assign product = ..., which throws away the previous value. It needs product.append() to keep all the values.
Something like:
def get_products_info(self, links):
    # --- before loop ---
    asins = self.get_asins(links)
    product = []

    # --- loop ---
    for asin in asins:
        product.append(self.get_single_product_info(asin))

    # --- after loop ---
    return product

Scrapy iterating over list of elements on page

I'm having issues with my Scrapy project. I want to extract all the ads on the page into a list and then iterate over that list to extract and save data for every ad. I'm sure I'm doing something terribly wrong, and yet I don't know what. I suspect the problem is with the .extract_first() command, but I'm calling that on a single object in the list, not the whole response. As of right now, the spider only extracts the first piece of data on the page that matches the XPath.
Here is the code:
class OddajastanovanjeljmestoSpider(scrapy.Spider):
    name = 'OddajaStanovanjeLjMesto'
    allowed_domains = ['www.nepremicnine.net']
    start_urls = ['https://www.nepremicnine.net/oglasi-oddaja/ljubljana-mesto/stanovanje/']

    def parse(self, response):
        oglasi = response.xpath('//div[@itemprop="item"]')
        for oglas in oglasi:
            item = NepremicninenetItem()
            item['velikost'] = oglas.xpath('//div[@class="main-data"]/span[@class="velikost"]/text()').extract_first(default="NaN")
            item['leto'] = oglas.xpath('//div[@class="atributi"]/span[@class="atribut leto"]/strong/text()').extract_first(default="NaN")
            item['zemljisce'] = oglas.xpath('//div[@class="atributi"]/span[@class="atribut"][text()="Zemljišče: "]/strong/text()').extract_first(default="NaN")
            request = scrapy.Request("https://www.nepremicnine.net" + response.xpath('//div[@itemprop="item"]/h2[@itemprop="name"]/a[@itemprop="url"]/@href').extract_first(), callback=self.parse_item_page)
            request.meta['item'] = item
            yield request
        next_page_url = response.xpath('//div[@id="pagination"]//a[@class="next"]/@href').extract_first()
        if next_page_url:
            absolute_next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(absolute_next_page_url)

    def parse_item_page(self, response):
        item = response.meta['item']
        item['referencnaStevilka'] = response.xpath('//div[@id="opis"]/div[@class="dsc"][preceding-sibling::div[@class="lbl"][text()="Referenčna št.:"]]/strong/text()').extract_first(default="NaN")
        item['tipOglasa'] = response.xpath('//li[@itemprop="itemListElement"]/a[../meta[@content="1"]]/@title').extract_first(default="NaN")
        item['cena'] = response.xpath('//div[@class="galerija-container"]/meta[@itemprop="price"]/@content').extract_first(default="NaN")
        item['valuta'] = response.xpath('//div[@class="galerija-container"]/meta[@itemprop="priceCurrency"]/@content').extract_first(default="NaN")
        item['vrstaNepremicnine'] = response.xpath('//li[@itemprop="itemListElement"]/a[../meta[@content="5"]]/@title').extract_first(default="NaN")
        item['tipNepremicnine'] = response.xpath('//li[@itemprop="itemListElement"]/a[../meta[@content="6"]]/@title').extract_first(default="NaN")
        item['regija'] = response.xpath('//li[@itemprop="itemListElement"]/a[../meta[@content="2"]]/@title').extract_first(default="NaN")
        item['upravnaEnota'] = response.xpath('//li[@itemprop="itemListElement"]/a[../meta[@content="3"]]/@title').extract_first(default="NaN")
        item['obcina'] = response.xpath('//li[@itemprop="itemListElement"]/a[../meta[@content="4"]]/@title').extract_first(default="NaN")
        item['prodajalec'] = response.xpath('//div[@itemprop="seller"]/meta[@itemprop="name"]/@content').extract_first(default="NaN")
        yield item
The parse_item_page method works correctly and returns the appropriate data, but the parse method just returns the first data it sees on the page...
Looks like the issue is with your XPath expressions: you need relative XPath expressions inside the iteration, which means they need to start with a ".":
item['velikost'] = oglas.xpath(
    './/div[@class="main-data"]/span[@class="velikost"]/text()'
).extract_first(default="NaN")
item['leto'] = oglas.xpath(
    './/div[@class="atributi"]/span[@class="atribut leto"]/strong/text()'
).extract_first(default="NaN")
If you paste a sample HTML code block I might be able to confirm.

Scrapy (Python): Iterating over 'next' page without multiple functions

I am using Scrapy to grab stock data from Yahoo! Finance.
Sometimes, I need to loop over several pages, 19 in this example, in order to get all of the stock data.
Previously (when I knew there would only be two pages), I would use one function for each page, like so:
def stocks_page_1(self, response):
    returns_page1 = []
    # Grabs data here...
    current_page = response.url
    next_page = current_page + "&z=66&y=66"
    yield Request(next_page, self.stocks_page_2, meta={'returns_page1': returns_page1})

def stocks_page_2(self, response):
    # Grab data again...
Now, instead of writing 19 or more functions, I was wondering if there was a way I could loop through an iteration using one function to grab all data from all pages available for a given stock.
Something like this:
for x in range(30):  # 30 was randomly selected
    current_page = response.url
    # Grabs data
    # Check if there is a 'next' page:
    if response.xpath('//td[@align="right"]/a[@rel="next"]').extract() != ' ':
        u = x * 66
        next_page = current_page + "&z=66&y={0}".format(u)
        # Go to the next page somehow within the function???
Updated code (works, but only returns one page of data):
class DmozSpider(CrawlSpider):
    name = "dnot"
    allowed_domains = ["finance.yahoo.com", "http://eoddata.com/"]
    start_urls = ['http://finance.yahoo.com/q?s=CAT']
    rules = [
        Rule(LinkExtractor(restrict_xpaths='//td[@align="right"]/a[@rel="next"]'),
             callback='stocks1',
             follow=True),
    ]

    def stocks1(self, response):
        returns = []
        rows = response.xpath('//table[@class="yfnc_datamodoutline1"]//table/tr')[1:]
        for row in rows:
            cells = row.xpath('.//td/text()').extract()
            try:
                values = cells[-1]
                try:
                    float(values)
                    returns.append(values)
                except ValueError:
                    continue
            except ValueError:
                continue
        unformatted_returns = response.meta.get('returns_pages')
        returns = [float(i) for i in returns]
        global required_amount_of_returns, counter
        if counter == 1 and "CAT" in response.url:
            required_amount_of_returns = len(returns)
        elif required_amount_of_returns == 0:
            raise CloseSpider("'Error with initiating required amount of returns'")
        counter += 1
        print counter

        # Iterator to calculate rate of return
        # ====================================
        if data_intervals == "m":
            k = 12
        elif data_intervals == "w":
            k = 4
        else:
            k = 30
        sub_returns_amount = required_amount_of_returns - k
        sub_returns = returns[:sub_returns_amount]
        rate_of_return = []
        if len(returns) == required_amount_of_returns or "CAT" in response.url:
            for number in sub_returns:
                numerator = number - returns[k]
                rate = numerator / returns[k]
                if rate == '':
                    rate = 0
                rate_of_return.append(rate)
                k += 1

        item = Website()
        items = []
        item['url'] = response.url
        item['name'] = response.xpath('//div[@class="title"]/h2/text()').extract()
        item['avg_returns'] = numpy.average(rate_of_return)
        item['var_returns'] = numpy.cov(rate_of_return)
        item['sd_returns'] = numpy.std(rate_of_return)
        item['returns'] = returns
        item['rate_of_returns'] = rate_of_return
        item['exchange'] = response.xpath('//span[@class="rtq_exch"]/text()').extract()
        item['ind_sharpe'] = ((numpy.average(rate_of_return) - RFR) / numpy.std(rate_of_return))
        items.append(item)
        yield item
You see, a parse callback is just a function that takes the response and returns or yields either Items or Requests or both. There is no issue at all with reusing these callbacks, so you can just pass the same callback for every request.
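For instance, a single callback that requests the next page with itself as the callback, carrying the data collected so far in meta, could look roughly like this (just a sketch - the paging URL scheme is copied from your example, and the item field is taken from your code):
def stocks_page(self, response):
    # Data accumulated across previous pages travels along in meta
    returns = response.meta.get('returns', [])
    # ... grab this page's data here and extend `returns` ...
    offset = response.meta.get('offset', 0) + 66
    if response.xpath('//td[@align="right"]/a[@rel="next"]'):
        # There is a 'next' page: request it with the same callback
        next_page = response.url.split('&z=')[0] + "&z=66&y={0}".format(offset)
        yield Request(next_page, self.stocks_page,
                      meta={'returns': returns, 'offset': offset})
    else:
        # Last page: build the final item from everything collected
        item = Website()
        item['returns'] = returns
        yield item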
So you could pass the current-page info through the Request meta as sketched above, but instead I'd leverage the CrawlSpider to crawl across every page. It's really easy; start by generating the spider from the command line:
scrapy genspider --template crawl finance finance.yahoo.com
Then write it like this:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
Scrapy 1.0 has deprecated the scrapy.contrib namespace for the modules above, but if you're stuck with 0.24, use scrapy.contrib.linkextractors and scrapy.contrib.spiders.
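For 0.24 the equivalent imports would be roughly:
# Legacy imports for Scrapy 0.24 (the scrapy.contrib namespace)
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule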
from yfinance.items import YfinanceItem

class FinanceSpider(CrawlSpider):
    name = 'finance'
    allowed_domains = ['finance.yahoo.com']
    start_urls = ['http://finance.yahoo.com/q/hp?s=PWF.TO&a=04&b=19&c=2005&d=04&e=19&f=2010&g=d&z=66&y=132']

    rules = (
        Rule(LinkExtractor(restrict_css='[rel="next"]'),
             callback='parse_items',
             follow=True),
    )
LinkExtractor will pick up the links in the response to follow, but it can be limited with XPath (or CSS) and regular expressions. See documentation for more.
Rules will follow the links and call the callback on every response. follow=True will keep extracting links on every new response, but it can be limited by depth. See documentation again.
    def parse_items(self, response):
        for line in response.css('.yfnc_datamodoutline1 table tr')[1:-1]:
            yield YfinanceItem(date=line.css('td:first-child::text').extract()[0])
Just yield the Items, since Requests for the next pages will be handled by the CrawlSpider Rules.

StaleElementReferenceException selenium webdriver python

I'm writing a crawler using Selenium, Python and PhantomJS to use Google's reverse image search. So far I've successfully been able to upload an image and crawl the search results on the first page. However, when I try to click on the search results navigation, I'm getting a StaleElementReferenceException. I have read about it in many posts, but I still could not implement the solution. Here is the code that breaks:
ele7 = browser.find_element_by_id("nav")
ele5 = ele7.find_elements_by_class_name("fl")
count = 0
for elem in ele5:
    if count <= 2:
        print str(elem.get_attribute("href"))
        elem.click()
        browser.implicitly_wait(20)
        ele6 = browser.find_elements_by_class_name("rc")
        for result in ele6:
            f = result.find_elements_by_class_name("r")
            for line in f:
                link = line.find_elements_by_tag_name("a")[0].get_attribute("href")
                links.append(link)
                parsed_uri = urlparse(link)
                domains.append('{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri))
    count += 1
The code breaks at print str(elem.get_attribute("href")). How can I solve this?
Thanks in advance.
Clicking a link causes the browser to go to another page, which makes the references to elements on the old page (ele5, elem) invalid.
Modify the code so it does not reference invalid elements.
For example, you can collect the URLs before visiting the other pages:
ele7 = browser.find_element_by_id("nav")
ele5 = ele7.find_elements_by_class_name("fl")
urls = [elem.get_attribute('href') for elem in ele5]  # <-----

browser.implicitly_wait(20)

for url in urls[:2]:  # <------
    print url
    browser.get(url)  # <------ used `browser.get` instead of `click`;
                      #         using `element.click` will cause the error.
    ele6 = browser.find_elements_by_class_name("rc")
    for result in ele6:
        f = result.find_elements_by_class_name("r")
        for line in f:
            link = line.find_elements_by_tag_name("a")[0].get_attribute("href")
            links.append(link)
            parsed_uri = urlparse(link)
            domains.append('{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri))
