Scrape Glassdoor for multiple pages using Python lxml - python

I'm using the following script to scrape job listings via Glassdoor. The script below only scrapes the first page. I was wondering, how might I extend it so that it scrapes from page 1 up to the last page?
https://www.scrapehero.com/how-to-scrape-job-listings-from-glassdoor-using-python-and-lxml/
I'd greatly appreciate any help.

I'll provide a more general answer. When scraping, to get the next page, simply find the link on the current page that points to the next page.
In the case of Glassdoor, your page links all have the page class and the next page is accessed by clicking an li button with class next. Your XPath then becomes:
//li[@class="next"]
You can then access it with:
element = document.xpath("//li[@class='next']")
We are specifically looking for the link, so we can add an a element to our XPath:
//li[@class="next"]//a
And further specify that we just need the href attribute:
//li[@class="next"]//a/@href
And now you can access the link with:
link = document.xpath('//li[@class="next"]//a/@href')
Tested and working on Glassdoor as of 2/9/18.
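
Putting those steps together, here is a minimal sketch of a page-by-page loop, assuming the listing page can be fetched with plain requests; the starting URL and headers are placeholders, and the per-page parsing from the linked scrapehero script goes where indicated:

import requests
from lxml import html
from urllib.parse import urljoin

url = "https://www.glassdoor.com/Job/jobs.htm"  # placeholder starting URL
headers = {"User-Agent": "Mozilla/5.0"}         # assumed: a browser-like UA to avoid being blocked

while url:
    response = requests.get(url, headers=headers)
    document = html.fromstring(response.content)

    # ... parse the job listings on this page (reuse the XPaths from the scrapehero script) ...

    # Follow the "next" link; when it is missing we are on the last page.
    next_links = document.xpath('//li[@class="next"]//a/@href')
    url = urljoin(url, next_links[0]) if next_links else None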

Related

How to extract text from ng-href with Scrapy

There is a real estate website with infinite scroll, and I have tried to extract the companies' names and other details, but I have a problem writing the selectors. I need some insights as a new learner in Scrapy.
HTML Snippet:
After handling whether the "more" button is available on the website:
In most browsers you can copy selectors like this; depending on the function you are using, copy the XPath (or another selector type) for the scraping process.
If that does not help, please give the link to the web page and point out which values you want to scrape.
As I understand it, you want to get the href from the a tag and you don't know how to do it in Scrapy.
You just need to add ::attr(ng-href) to the end of your CSS selector:
link = response.css('your_selector::attr(ng-href)').get()
To make it easier for you, your CSS selector should be:
link = response.css('.companyNameSpecs a::attr(ng-href)').get()
But it looks like href and ng-href are the same, so you can also do the same with href:
link = response.css('your_selector::attr(href)').get()
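
For context, a minimal sketch of a Scrapy callback built around that selector might look like this; the spider name, start URL, and the name field selector are assumptions, and only the .companyNameSpecs a::attr(ng-href) part comes from above:

import scrapy

class CompanySpider(scrapy.Spider):
    name = "companies"
    start_urls = ["https://example.com/listings"]  # placeholder URL

    def parse(self, response):
        for card in response.css(".companyNameSpecs"):
            yield {
                # the text selector is an assumption about the markup
                "name": card.css("a::text").get(),
                # ng-href and href appear to carry the same link
                "link": card.css("a::attr(ng-href)").get() or card.css("a::attr(href)").get(),
            }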

Python Selenium - Source from href link

I would like to be able to get the page source of the link I got from an a href without making Selenium change page.
I am getting the a element using
driver.find_element(By.XPATH, "//a[contains(@class, 'css-1xyedec e1pf1lj70')]")
Then I can get the link in the href using
elem.get_attribute('href')
But I cannot find a way to get the source page of the link using selenium without changing the page of the browser.
EDIT: Here is the website on which I am trying to do it. The <a> for each sale is located in the div that includes the photo and the part with the title, price...
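
For what it's worth, a minimal sketch of the pieces described above, where each linked page is fetched separately with requests (a plain HTTP call, not a Selenium feature) so the browser never leaves the listing page; the URL is a placeholder and the class names come from the question:

import requests
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com/sales")  # placeholder listing URL

links = [
    elem.get_attribute("href")
    for elem in driver.find_elements(By.XPATH, "//a[contains(@class, 'css-1xyedec e1pf1lj70')]")
]

for link in links:
    # The Selenium browser stays on the listing page; each linked page is fetched separately.
    page_source = requests.get(link).text
    # ... parse page_source here ...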
Check this answer https://sqa.stackexchange.com/questions/17022/how-to-fill-captcha-using-test-automation .
You cannot automate CAPTCHA. You should ask the dev team for a workaround.
I would ask them to disable CAPTCHA in the test environment; there is no sense in having it there.

I am trying to scrape this website with Scrapy (Python). I scraped most of the information, but for some reason XPath does not scrape one division

Page I am trying to scrape
This is my code:
Download_links = response.xpath('//div[#class = "download-block"]').extract()
This returns an empty list. Why can't I scrape just this div?
This is the part of the page I am trying to scrape:
(photo of the part I am trying to scrape)
Please provide some help.
You are getting an empty list because the division is not in the page source. Always check whether the data exists in the page source before writing XPaths.
The data may be in some other part of the page; search the page source (Ctrl+U) and get the correct XPath for it.
Here, on this page, the download links are present in the page source.
See the image of the page source.
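
A quick way to run that check, as a minimal sketch (the URL is a placeholder for the page in question):

import requests

raw_html = requests.get("https://example.com/the-page").text  # placeholder URL

# If this prints False, the block is rendered by JavaScript and the data has to
# be found elsewhere in the source (another tag, a <script> block, or an XHR endpoint).
print('"download-block" present in page source:', "download-block" in raw_html)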

How to use Python to scrape all the table contents on this website, which is rendered by AJAX?

https://www.fedsdatacenter.com/federal-pay-rates/index.php?y=2017&n=&l=&a=&o=
This website seems to be rendered with jQuery (AJAX). I would like to scrape all pages' tables. When I inspect the 1, 2, 3, 4 page tags, they do not have a specific href link. Besides, clicking on them does not produce a clear pattern of GET requests; therefore, I find it hard to use Python urllib to send a GET request for each page.
You can use Selenium with Python http://selenium-python.readthedocs.io/ to navigate through the pages. I would find the Next button and .click() it, then time.sleep(seconds) and scrape the page. I can't navigate to the last page on this site, unfortunately (it seems broken - which you should also be aware of), but I'm assuming the Next button disappears or something similar when you get to the last page. If not, you might want to save what you've scraped every time you go to a new page; this way you don't lose your data in the event of an error.
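
A minimal sketch of that loop, assuming the pager's Next control can be located by its link text and that pandas.read_html is acceptable for grabbing the rendered table (both are assumptions about the page):

import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.fedsdatacenter.com/federal-pay-rates/index.php?y=2017&n=&l=&a=&o=")

tables = []
while True:
    time.sleep(2)  # give the AJAX table time to render
    tables.extend(pd.read_html(driver.page_source))  # save each page as you go

    next_buttons = driver.find_elements(By.LINK_TEXT, "Next")
    if not next_buttons or not next_buttons[0].is_displayed():
        break  # assume the Next control is gone (or hidden) on the last page
    next_buttons[0].click()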

Scrapy xpath selector not parsing

I'm trying to scrape https://www.grailed.com/ using Scrapy. I have been able to get the elements I want in each listing (price, item title, size). I am currently trying to get the a hrefs for each listing on the home page.
When I try response.xpath('.//div[starts-with(@id, "product")]').extract
it returns
<bound method SelectorList.extract of [<Selector xpath='.//div[starts-with(@id, "product")]'
data=u'<div id="products">\n<div id="loading">\n<'>]>
Based on the inspect element, shouldn't it be returning <div class="feed-wrapper">?
I'm just trying to get those links so scrapy knows to go into each listing. Thank you for any help.
When you do scraping, always check the source of the page (not in the inspector but with view-source) - that is the real data you operate on.
That div is added dynamically after the page loads; JS does that job.
When you send a request to the server you receive pure HTML - JS is not executed, so you see the real server response, which is what you are supposed to work with.
The div class="feed-wrapper" only exists after JS runs; the markup in your output above (div id="products", div id="loading") is the real server response to you. You must deal with it.
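
You can confirm that with a minimal check like this (the User-Agent header is an assumption; some sites refuse requests without one):

import requests

raw_html = requests.get(
    "https://www.grailed.com/",
    headers={"User-Agent": "Mozilla/5.0"},
).text

# The raw server response contains the loading placeholder, not the rendered feed.
print('feed-wrapper in server HTML:', "feed-wrapper" in raw_html)
print('id="loading" in server HTML:', 'id="loading"' in raw_html)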
