There is a real estate website with infinite scrolling, and I have tried to extract the companies' names and other details, but I have a problem with writing the selectors. I need some insights, as a new learner of Scrapy.
HTML Snippet:
This is after handling the case where a "more" button is available on the website.
In most browsers you can copy selectors straight from the developer tools: inspect the element, then copy the XPath or the CSS selector, depending on which function you are using for the scraping process.
If that does not help, please give the link to the webpage and point out which values you want to scrape.
As I understand it, you want to get the href from the tag and you don't know how to do it in Scrapy.
You just need to add ::attr(ng-href) to the end of your CSS selector.
link = response.css('your_selector::attr(ng-href)').get()
To make it easier for you, your CSS selector should be
link = response.css('.companyNameSpecs a::attr(ng-href)').get()
But since href and ng-href look like they hold the same value, you can also do the same with href:
link = response.css('your_selector::attr(href)').get()
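For context, here is a minimal sketch of how this could sit inside a spider's parse method; the class name, the start URL, and the .companyNameSpecs structure are assumptions based on the selector above, not the actual page:

import scrapy

class CompanySpider(scrapy.Spider):
    name = "companies"
    # Placeholder URL; replace it with the real listing page.
    start_urls = ["https://example.com/listings"]

    def parse(self, response):
        # .companyNameSpecs is assumed from the question's snippet; adjust to the real markup.
        for block in response.css(".companyNameSpecs"):
            yield {
                "name": block.css("a::text").get(),
                # href and ng-href carry the same value here, so either attribute works.
                "link": block.css("a::attr(href)").get(),
            }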
I am trying to scrape a website using Selenium in Python in order to extract a few links.
But for some of the tags, I am not able to find the links. When I inspect the element for these links, it points me to ::before and ::after. One way to do this is to click on it, which opens a new window, and get the link from the new window. But this solution is quite slow. Can someone help me understand how I can fetch these links directly from this page?
It looks like the links you are trying to extract are not statically stored inside the i elements you see there. These links are generated dynamically by JavaScript running on that page.
So the answer is "no": you cannot extract those links from that page without iterating over its elements the way a human would.
I'm using the following script to scrape job listings via Glassdoor. The script below only scrapes the first page. I was wondering, how might I extend it so that it scrapes from page 1 up to the last page?
https://www.scrapehero.com/how-to-scrape-job-listings-from-glassdoor-using-python-and-lxml/
I'd greatly appreciate any help
I'll provide a more general answer. When scraping, to get the next page simply get the link on the page to the next page.
In the case of Glassdoor, your page links all have the page class and the next page is accessed by clicking an li button with class next. Your XPath then becomes:
//li[#class="next"]
You can then access it with:
element = document.xpath("//li[@class='next']")
Since we are specifically looking for the link, we can add the a element to our XPath:
//li[@class="next"]//a
And further specify that we just need the href attribute:
//li[#class="next"]//a/#href
And now you can access the link with
link = document.xpath('//li[@class="next"]//a/@href')
Tested and working on Glassdoor as of 2/9/18.
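To make the loop concrete, here is a rough sketch using requests and lxml (which the linked tutorial relies on); the search URL is a placeholder, the sketch assumes the next-page href is site-relative, and a real run would need headers, delays, and error handling:

import requests
from lxml import html

# Placeholder listing URL; substitute the actual Glassdoor search results page.
url = "https://www.glassdoor.com/Job/jobs.htm?sc.keyword=python"

while url:
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    document = html.fromstring(response.content)

    # ... parse the job listings on the current page here ...

    # The href of the "next" button; the list is empty on the last page.
    next_links = document.xpath('//li[@class="next"]//a/@href')
    url = ("https://www.glassdoor.com" + next_links[0]) if next_links else None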
Working with Python and BeautifulSoup. A bit new to CSS markup, so I know I'm making some beginner mistakes; a specific example would go a long way in helping me understand.
I'm trying to scrape a page for links, but only certain links.
CSS
links = soup.find_all("a", class_="details-title")
The code you have will search for links with the details-title class, which don't exist in the sample you provided. It seems like you are trying to find links located inside divs with the details-title class. I believe the easiest way to do this is to search using CSS selectors, which you can do with Beautiful Soup's .select method.
Example: links = soup.select("div.details-title a")
The <tag>.<class> syntax searches for all tags with that class, and elements separated by a space will search for sub-elements of the results before it. See here for more information.
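As a self-contained illustration (the HTML below is invented, since the original sample isn't shown here), the selector behaves like this:

from bs4 import BeautifulSoup

# Invented sample markup standing in for the page's real structure.
html_doc = """
<div class="details-title"><a href="/item/1">First item</a></div>
<div class="details-title"><a href="/item/2">Second item</a></div>
<div class="other"><a href="/item/3">Ignored item</a></div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Select <a> tags that sit inside divs carrying the details-title class.
links = soup.select("div.details-title a")
print([a["href"] for a in links])  # ['/item/1', '/item/2']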
[ Ed: Maybe I'm just asking this? Not sure -- Capture JSON response through Selenium ]
I'm trying to use Selenium (Python) to navigate via hyperlinks to pages in a web database. One page returns a table with hyperlinks that I want Selenium to follow. But the links do not appear in the page's source. The only HTML that corresponds to the table of interest is a <div> indicating that the site is pulling results from a facet search. Within that div is a <script type="application/json"> tag and a handful of search options. Nothing else.
Again, I can view the hyperlinks in Firefox, but not using "View Page Source" or Selenium's selenium.webdriver.Firefox().page_source call. Instead, that call outputs not the <script> tag but a series of <div> tags that appear to define the results' format.
Is Selenium unable to navigate output from JSON applications? Or is there another way to capture the output of such applications? Thanks, and apologies for the lack of code/reproducibility.
Try using execute_script() and get the links by running JavaScript, something like:
driver.execute_script("document.querySelector('div#your-link-to-follow').click();")
Note: if the divs are generated dynamically by scripts, you may want to wait a few seconds (for example with an implicit or explicit wait) before executing the script.
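For example, a minimal sketch of an explicit wait (div#your-link-to-follow is just the placeholder selector from the line above, and the URL is a stand-in):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("http://example.com")  # placeholder URL

# Wait up to 10 seconds for the dynamically generated div to show up.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "div#your-link-to-follow"))
)
driver.execute_script("document.querySelector('div#your-link-to-follow').click();")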
I've confronted a similar situation on a website with JavaScript (http://ledextract.ces.census.gov to be specific). I had pretty good luck just using Selenium's find_element() methods. The key is that even if not everything about the hyperlinks appears in the page's source, Selenium will usually be able to find them, because navigating to the website engages the JavaScript that produces the additional links.
Thus, for example, you could try mousing over the links, finding their titles, and then using:
driver.find_element_by_xpath("//*[#title='Link Title']").click()
Based on whatever title appears by the link when you mouse over it.
Or, you may be able to find the links based on the text that appears on them:
driver.find_element_by_partial_link_text('Link Text').click()
Or, if you have a sense of the id for the links, you could use:
driver.find_element_by_id('Link_ID').click()
If you are at a loss for what the text, title, ID, etc. would be for the links you want, a somewhat blunt approach is to pull the id, text, and title of every element on the website and save them to a file that you can look through to identify likely candidates for the links you're after. That should show you a lot more (in some respects) than the source code for the site would:
AllElements = driver.find_elements_by_xpath('//*')
for Element in AllElements:
    # Element.text gives the visible text; get_attribute() pulls the id and title attributes.
    print('ID = %s TEXT = %s TITLE = %s' % (Element.get_attribute("id"), Element.text, Element.get_attribute("title")))
Note: if you have or suspect you have a situation where you'll have multiple links with the same title/text, etc. then you may want to use the find_elements (plural) methods to get lists of all those satisfying your criteria, specify the xpath more explicitly, etc.
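A rough sketch of that plural approach, assuming a made-up title shared by several links:

# Gather every element sharing the (hypothetical) title, inspect them, then click one.
candidates = driver.find_elements_by_xpath("//*[@title='Link Title']")
for candidate in candidates:
    print(candidate.get_attribute("href"), candidate.text)
candidates[0].click()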
I am scraping this page with Scrapy:
http://www.modeluxproperties.com/?m=search&web=1&act=details_web&id=503
I want to get the values of all the Amenities.
My XPath is:
normalize-space(.//div[@id='specimen']/div[@class='section']/table//tr[4]/td/table//tr/td/text())
I get an empty result. Why?
The correct XPath for the amenities is:
"//table//div[@id='specimen']//table/tr[4]/td/table/tr/td/text()"
So your XPath is actually completely OK; perhaps you are extracting it in some strange way? You can extract it like so:
sel.xpath("//table//div[@id='specimen']//table/tr[4]/td/table/tr/td/text()").extract()
where sel is simply an instance of Selector, created like so: sel = Selector(response).
To debug this kind of issue, the Firefox FirePath extension is very helpful; for Chrome there is XPath Helper. Typically you should start by finding the right XPath with FirePath and then trying it in the Scrapy shell. It's really simple, something like:
scrapy shell
fetch("http://[your url]")
Then you will get a selector object sel, and you can test your XPath there. Testing in the Scrapy shell is often necessary because browsers modify the HTML displayed on pages; for example, in the case of tables, most browsers add tbody elements.
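If the XPath checks out in the shell, a minimal spider built around it might look like the sketch below; the spider name and the output field are assumptions:

import scrapy

class AmenitiesSpider(scrapy.Spider):
    name = "amenities"  # assumed name
    start_urls = [
        "http://www.modeluxproperties.com/?m=search&web=1&act=details_web&id=503",
    ]

    def parse(self, response):
        # Same XPath as above; response.xpath() already gives you a Selector.
        amenities = response.xpath(
            "//table//div[@id='specimen']//table/tr[4]/td/table/tr/td/text()"
        ).extract()
        yield {"amenities": [a.strip() for a in amenities if a.strip()]}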