scrapy xpath: can't get google next page - python

I want to get the next page in https://www.google.com.tw/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=test
But my code does not work.
Please guide me. Thank you so much.
scrapy shell "https://www.google.com.tw/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=test"
response.xpath("//a[@id='pnnext']/@href")

Here is the working code:
scrapy shell "https://www.google.com.tw/search?q=test"
response.xpath("//a[@id='pnnext']/@href")
The issue was in the way you were making the request to Google.
In any case, be aware of the policies that apply to Google search.
Google's Custom Search Terms of Service (TOS) can be found at http://www.google.com/cse/docs/tos.html.
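The reason is that everything after the "#" in the webhp URL is a fragment: it is handled by JavaScript in the browser and never sent to the server, so Scrapy receives the bare homepage, which contains no pnnext link. A minimal sketch of following the pagination with the /search?q= form (Google's markup changes often, so treat the XPath as an assumption to re-check):

import scrapy

class GoogleNextPageSpider(scrapy.Spider):
    name = 'google_next_page'
    start_urls = ['https://www.google.com.tw/search?q=test']

    def parse(self, response):
        # ...extract whatever you need from the current result page here...
        # then follow the "Next" link for as long as Google renders one
        next_href = response.xpath("//a[@id='pnnext']/@href").extract_first()
        if next_href:
            yield scrapy.Request(response.urljoin(next_href), callback=self.parse)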
UPDATE:
I wrote a spider to test this issue in more depth.
Not Pythonic at all (improvements are welcome), but I was interested in the mechanism of dealing with Google results.
As previous comments suggested, a test for the internationalization of the interface is needed.
import time
from scrapy.spiders import CrawlSpider
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

class googleSpider(CrawlSpider):
    name = "googlish"
    allowed_domains = ["google.com"]
    start_urls = ["http://www.google.com"]

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)
        # type the query into the search box and submit it
        search_box = self.driver.find_element_by_name('q')
        search_box.send_keys("scrapy\n")
        time.sleep(4)
        found = False
        while not found:
            try:
                # print every result block on the current page
                for element in self.driver.find_elements_by_xpath("//div[@class='rc']"):
                    print(element.text + "\n")
                # click "Next" if it exists, then wait for the new page
                for i in self.driver.find_elements_by_id('pnnext'):
                    i.click()
                    time.sleep(5)
            except NoSuchElementException:
                found = True
        self.driver.close()

Can you try the XPath below and let me know what the result is? It looks like the XPath used is not pointing to the exact location of the web element in the DOM.
//a[@id='pnnext']//span[2]
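Either expression can be checked quickly in the shell before it goes into a spider, for example:
$ scrapy shell "https://www.google.com.tw/search?q=test"
In [1]: response.xpath("//a[@id='pnnext']//span[2]").extract()
In [2]: response.xpath("//a[@id='pnnext']/@href").extract_first()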

Related

Python Web Scraper - Issue grabbing links from href

I've been following along with this guide to web scraping LinkedIn and Google searches. There have been some changes in the HTML of Google's search results since the guide was created, so I've had to tinker with the code a bit. I'm at the point where I need to grab the links from the search results, but I've run into an issue where the program doesn't return anything, even after implementing a code fix from this post due to an error. I'm not sure what I'm doing wrong here.
import Parameters
from time import sleep
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from parsel import Selector
import csv

# open the output file and wrap it in a csv writer
writer = csv.writer(open(Parameters.file_name, 'w'))
# writerow() writes the header row to the file object
writer.writerow(['Name', 'Job Title', 'Company', 'College', 'Location', 'URL'])

# specifies the path to the chromedriver executable
driver = webdriver.Chrome('/Users/.../Python Scripts/chromedriver')
driver.get('https://www.linkedin.com')
sleep(0.5)

# locate the email and password fields by id, then send_keys() to simulate keystrokes
username = driver.find_element_by_id('session_key')
username.send_keys(Parameters.linkedin_username)
sleep(0.5)
password = driver.find_element_by_id('session_password')
password.send_keys(Parameters.linkedin_password)
sleep(0.5)
sign_in_button = driver.find_element_by_class_name('sign-in-form__submit-button')
sign_in_button.click()
sleep(3)

driver.get('https://www.google.com')
sleep(3)
search_query = driver.find_element_by_name('q')
search_query.send_keys(Parameters.search_query)
sleep(0.5)
search_query.send_keys(Keys.RETURN)
sleep(3)

################# HERE IS WHERE THE ISSUE LIES ######################
#linkedin_urls = driver.find_elements_by_class_name('iUh30')
linkedin_urls = driver.find_elements_by_css_selector("yuRUbf > a")
for url_prep in linkedin_urls:
    url_prep.get_attribute('href')
    #linkedin_urls = [url.text for url in linkedin_urls]
    sleep(0.5)
print('Supposed to be URLs')
print(linkedin_urls)
The search parameter is
search_query = 'site:linkedin.com/in/ AND "python developer" AND "London"'
Results in an empty list:
Snippet of the HTML section I want to grab:
EDIT: This is the output if I go by .find_elements_by_class_name or by Sector97's 1st edits.
Found an alternative solution that might make it a bit easier to achieve what you're after. Credit to A.Pond at
https://stackoverflow.com/a/62050505
Use the Google search API to get the links from the results.
You may need to install the library first:
pip install google
You can then use the API to quickly extract an arbitrary number of links:
from googlesearch import search

links = []
query = 'site:linkedin.com/in AND "python developer" AND "London"'
for j in search(query, tld='com', start=0, stop=100, pause=4):
    links.append(j)
I got the first 100 results but you can play around with the parameters to get more or less as you need.
You can see more about this API here:
https://www.geeksforgeeks.org/performing-google-search-using-python-code/
I think I found the error in your code.
Instead of using
linkedin_urls = driver.find_elements_by_css_selector("yuRUbf > a")
Try this instead:
web_elements = driver.find_elements_by_class_name("yuRUbf")
That gets you the parent elements. You can then extract the url text using a simple list comprehension:
linkedin_urls = [elem.find_element_by_css_selector('a').get_attribute('href') for elem in web_elements]
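For what it's worth, the original selector likely failed because "yuRUbf > a" is read as a tag name; the class form needs a leading dot. A one-line sketch of the same idea (assuming Google still uses that class on the result container):

linkedin_urls = [e.get_attribute('href') for e in driver.find_elements_by_css_selector('.yuRUbf > a')]
print(linkedin_urls)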

Cannot scrape AliExpress HTML element

I would like to scrape an arbitrary offer from AliExpress. I'm trying to use Scrapy and Selenium. The issue I face is that when I use Chrome and do right click > inspect on an element I see the real HTML, but when I do right click > view source I see something different: a mess of HTML, CSS and JS all around.
As far as I understand, the content is pulled asynchronously? I guess this is the reason why I can't find the element I am looking for on the page.
I was trying to use Selenium to load the page first and then get the content I want, but failed. I'm trying to scroll down to get to the reviews section and get its content.
Is this some advanced anti-bot solution that they have or maybe my approach is wrong?
The code that I currently have:
import scrapy
from selenium import webdriver
import logging
import time

logging.getLogger('scrapy').setLevel(logging.WARNING)

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://pl.aliexpress.com/item/32998115046.html']

    def __init__(self):
        self.driver = webdriver.Chrome()

    def parse(self, response):
        self.driver.get(response.url)
        scroll_retries = 20
        data = ''
        while scroll_retries > 0:
            try:
                data = self.driver.find_element_by_class_name('feedback-list-wrap')
                scroll_retries = 0
            except:
                self.scroll_down(500)
                scroll_retries -= 1
        print("----------")
        print(data)
        print("----------")
        self.driver.close()

    def scroll_down(self, pixels):
        self.driver.execute_script("window.scrollTo(0, {});".format(pixels))
        time.sleep(2)
By watching the requests in the network tab of the browser's inspect tool, you will find where the comments are coming from, so you can crawl that page instead.
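The original links did not survive here, but the gist is that the reviews arrive from a separate feedback URL that shows up in the network tab. A rough sketch under that assumption (the endpoint and selectors below are illustrative guesses and may well have changed):

import scrapy

class FeedbackSpider(scrapy.Spider):
    name = 'aliexpress_feedback'
    # hypothetical feedback endpoint spotted in the network tab;
    # the productId matches the one in the item URL
    start_urls = ['https://feedback.aliexpress.com/display/productEvaluation.htm?productId=32998115046']

    def parse(self, response):
        # this page is plain server-rendered HTML, so no Selenium or scrolling is needed
        for review in response.css('.feedback-item'):
            yield {'text': ' '.join(review.css('span::text').extract()).strip()}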

Scraping Flickr with Selenium/Beautiful soup in Python - ABSWP

I'm going through Automate the Boring Stuff with Python and I'm stuck at the chapter about downloading data from the internet. One of the tasks is to download photos for a given keyword from Flickr.
I have a massive problem with scraping this site. I've tried BeautifulSoup (which I think is not appropriate in this case, as the site uses JavaScript) and Selenium. Looking at the HTML, I think that I should locate the 'overlay' class. However, no matter which option I use (find_element_by_class_name, ...by_text, ...by_partial_text), I am not able to find these elements (I get: ".
Could you please help me clarify what I'm doing wrong? I'd also be grateful for any materials that could help me understand such cases better. Thanks!
Here's my simple code:
import sys
search_keywords = sys.argv[1]
from selenium import webdriver
browser = webdriver.Firefox()
browser.get(f'https://www.flickr.com/search/?text={search_keywords}')
elems = browser.find_element_by_class_name("overlay")
print(elems)
elems.click()
Sample keywords I type in shell: "industrial design interior"
Are you getting any error message? With Selenium it's useful to surround your code in try/except blocks.
What are you trying to do exactly, download the photos? With a bit of rewriting:
import time
from selenium import webdriver

try:
    options = webdriver.ChromeOptions()
    #options.add_argument('--headless')
    driver = webdriver.Chrome(chrome_options=options)
    search_keywords = "cars"
    driver.get(f'https://www.flickr.com/search/?text={search_keywords}')
    time.sleep(1)
except Exception as e:
    print("Error loading search results page" + str(e))

try:
    elems = driver.find_element_by_class_name("overlay")
    print(elems)
    elems.click()
    time.sleep(5)
except Exception as e:
    print(str(e))
Loads the page as expected and then clicks on the photo, taking us to This Page
I would be able to help more if you could go into more detail of what you're wanting to accomplish.
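And if the end goal is saving the photos rather than clicking one, a rough sketch of one way to do it (the tag-based selection below is an assumption; Flickr's markup changes):

import time
import requests
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.flickr.com/search/?text=cars')
time.sleep(3)  # give the JavaScript-rendered grid time to load

# collect the image sources from the grid and save the first few files
for i, img in enumerate(driver.find_elements_by_tag_name('img')[:10]):
    src = img.get_attribute('src')
    if src:
        with open('photo_{}.jpg'.format(i), 'wb') as f:
            f.write(requests.get(src).content)
driver.quit()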

Click on a button using Selenium and scrapy spider

I just started and I've been on this for a week or two, just using the internet to help, but now I've reached a point where I can't understand the problem and it cannot be found anywhere else. In case you didn't understand: my program should scrape data, then click on a button and scrape more data until it hits data it has already collected, then go to the next page in the list.
I reached the point where I scrape the first 8 items, but I can't find a way to click on the "see more!" button. I know I should use Selenium and the button's XPath. Anyway, here is my code:
import scrapy
from selenium import webdriver

class KickstarterSpider(scrapy.Spider):
    name = 'kickstarter'
    allowed_domains = ['kickstarter.com']
    start_urls = ["https://www.kickstarter.com/projects/zwim/zwim-smart-swimming-goggles/community", "https://www.kickstarter.com/projects/zunik/oriboard-the-amazing-origami-multifunctional-cutti/community"]

    def __init__(self, driver):
        self.driver = webdriver.Chrome(chromedriver)

    def parse(self, response):
        self.driver.get('https://www.kickstarter.com/projects/zwim/zwim-smart-swimming-goggles/community')
        backers = response.css('.founding-backer.community-block-content')
        b = backers[0]
        while True:
            try:
                seemore = self.driver.find_element_by_xpath('//*[@id="content-wrap"]').click()
            except:
                break
        self.driver.close()

    def parse2(self, response):
        print('you are here!')
        for b in backers:
            name = b.css('.name.js-founding-backer-name::text').extract_first()
            backed = b.css('.backing-count.js-founding-backer-backings::text').extract_first()
            print(name, backed)
Be sure the web driver used in Scrapy loads and interprets JS (I don't know for certain, but it could be a solution).
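If the driver does handle JS, a common pattern is an explicit wait before the click; a minimal sketch (the button XPath is an assumption borrowed from the question and would need checking against Kickstarter's actual markup):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

driver = webdriver.Chrome()
driver.get('https://www.kickstarter.com/projects/zwim/zwim-smart-swimming-goggles/community')
while True:
    try:
        # wait until a "see more" control is clickable, then click it
        see_more = WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.XPATH, '//*[@id="content-wrap"]//button'))
        )
        see_more.click()
    except TimeoutException:
        break  # nothing more to load
driver.quit()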

Scrapy - Xpath works in shell but not in code

I'm trying to crawl a website (I got their authorization), and my code returns what I want in scrapy shell, but I get nothing in my spider.
I also checked all the previous questions similar to this one without any success; for example, the website doesn't use JavaScript on the home page to load the elements I need.
import scrapy

class MySpider(scrapy.Spider):
    name = 'MySpider'
    start_urls = [ #WRONG URL, SHOULD BE https://shop.app4health.it/ PROBLEM SOLVED!
        'https://www.app4health.it/',
    ]

    def parse(self, response):
        self.logger.info('A response from %s just arrived!', response.url)
        print('PRE RISULTATI')
        results = response.selector.xpath('//*[@id="nav"]/ol/li[*]/a/@href').extract()
        # results = response.css('li a>href').extract()
        # This works in scrapy shell, not in code
        #risultati = response.xpath('//*[@id="nav"]/ol/li[1]/a').extract()
        print(risultati)
        #for pagineitems in risultati:
        #    next_page = pagineitems
        print('NEXT PAGE')
        # Request is ignored because it was already made; insert dont_filter
        yield scrapy.Request(url=risultati, callback=self.prodotti, dont_filter=True)

    def prodotti(self, response):
        self.logger.info('A REEEESPONSEEEEEE from %s just arrived!', response.url)
        return 1
The website i'm trying to crawl is https://shop.app4health.it/
The xpath command that i use is this one :
response.selector.xpath('//*[@id="nav"]/ol/li[*]/a/@href').extract()
I know there are some problems with the prodotti function etc., but that's not the point. I would like to understand why the XPath selector works in scrapy shell (I get exactly the links that I need), but when I run it in my spider, I always get an empty list.
If it helps: when I use CSS selectors in my spider, it works fine and finds the elements, but I would like to use XPath (I need it for the future development of my application).
Thanks for the help :)
EDIT:
I tried to print the body of the first response (from start_urls) and it's correct: I get the page I want. When I use selectors in my code (even the ones that have been suggested) they all work fine in the shell, but I get nothing in my code!
EDIT 2
I have become more experienced with Scrapy and web crawling, and I realised that the HTML page you get in your browser can sometimes differ from the one returned for the Scrapy request! In my experience, some websites respond with different HTML depending on the client. That's why an xpath/css query that is "correct" in the browser may return nothing when used in your Scrapy code.
Always check if the body of your response is what you were expecting!
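One quick way to do that check from inside a spider is Scrapy's built-in helper, which opens the response in your browser exactly as Scrapy downloaded it:

from scrapy.utils.response import open_in_browser

def parse(self, response):
    # shows the body Scrapy actually received, not what your browser would render
    open_in_browser(response)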
SOLVED:
Path is correct. I wrote the wrong start_urls!
As an alternative to Desperado's answer, you can use CSS selectors, which are much simpler but more than enough for your use case:
$ scrapy shell "https://shop.app4health.it/"
In [1]: response.css('.level0 .level-top::attr(href)').extract()
Out[1]:
['https://shop.app4health.it/sonno',
'https://shop.app4health.it/monitoraggio-e-diagnostica',
'https://shop.app4health.it/terapia',
'https://shop.app4health.it/integratori-alimentari',
'https://shop.app4health.it/fitness',
'https://shop.app4health.it/benessere',
'https://shop.app4health.it/ausili',
'https://shop.app4health.it/prodotti-in-offerta',
'https://shop.app4health.it/kit-regalo']
scrapy shell command is perfect for debugging issues like this.
//nav[@id="mmenu"]//ul/li[contains(@class,"level0")]/a[contains(@class,"level-top")]/@href
Use this XPath; also consider the page's 'view-source' before creating an XPath.
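Both suggestions can be compared side by side in the shell, for example:
$ scrapy shell "https://shop.app4health.it/"
In [1]: response.xpath('//nav[@id="mmenu"]//ul/li[contains(@class,"level0")]/a[contains(@class,"level-top")]/@href').extract()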
