I'm just getting started with Scrapy and am interested in the best practices for this situation. Scrapy is designed to select elements on the page using either CSS or XPath. Disqus comments appear to load in an iframe, which makes them harder to scrape. I know they have an API, but is there a way to scrape them using XPath/CSS or some other easy selector?
Here's an example post: http://www.ibtimes.com/who-aaron-ybarra-suspected-seattle-pacific-university-shooter-obsessed-columbine-1595326
I tried just using the XPath of the Disqus comment count, but that didn't appear to work.
In [36]: sel.xpath('//*[@id="main-nav"]/nav/ul/li[1]/a/span[1]').extract()
Out[36]: []
Is there some other way to get the count? What is the best strategy here?
Disqus is embedded in an iframe on third-party websites.
By reading the iframe's "src" attribute, you can follow that link and then proceed as normal.
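A minimal sketch of that idea in a Scrapy spider, assuming the iframe is present in the downloaded HTML (Disqus often injects it with JavaScript, in which case see the headless-browser suggestion below); the iframe selector and the comment selectors are assumptions, not taken from the page:

import scrapy

class DisqusSpider(scrapy.Spider):
    name = "disqus"
    start_urls = ["http://www.ibtimes.com/who-aaron-ybarra-suspected-seattle-pacific-university-shooter-obsessed-columbine-1595326"]

    def parse(self, response):
        # Grab the Disqus iframe's src and request it like any other page.
        iframe_src = response.css("iframe[src*='disqus']::attr(src)").get()
        if iframe_src:
            yield response.follow(iframe_src, callback=self.parse_comments)

    def parse_comments(self, response):
        # "div.post-message" is a guess; inspect the loaded frame for the real markup.
        for comment in response.css("div.post-message"):
            yield {"text": " ".join(comment.css("::text").getall()).strip()}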
You would need to use a headless browser. Try a module such as scrapy-selenium.
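A sketch of that route, assuming the scrapy-selenium package with a local headless Chrome (the driver path and wait time are assumptions):

# settings.py
SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = '/usr/local/bin/chromedriver'
SELENIUM_DRIVER_ARGUMENTS = ['--headless']
DOWNLOADER_MIDDLEWARES = {'scrapy_selenium.SeleniumMiddleware': 800}

# in the spider
from scrapy_selenium import SeleniumRequest

def start_requests(self):
    # Render the page (and its injected iframe content) in a real browser first.
    yield SeleniumRequest(url=self.start_urls[0], callback=self.parse, wait_time=10)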
There is a real estate website with infinite scrolling, and I have tried to extract the companies' names and other details, but I am having trouble writing the selectors. I need some insights, as a new learner of Scrapy.
HTML Snippet:
This is after handling the case where a "more" button is available on the website.
In most browsers you can copy selectors: right-click the element in the inspector and choose Copy XPath or Copy selector, depending on which one the function you are using expects for the scraping process.
If that doesn't help, please share the link to the web page and point out which values you want to scrape.
As I understand it, you want to get the href from the tag and you don't know how to do it in Scrapy.
You just need to add ::attr(ng-href) to the end of your CSS selector:
link = response.css('your_selector::attr(ng-href)').get()
To make it easier for you, your CSS selector should be:
link = response.css('.companyNameSpecs a::attr(ng-href)').get()
But since href and ng-href appear to hold the same value, you can also do the same with href:
link = response.css('your_selector::attr(href)').get()
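If the page lists many companies, a small follow-up sketch (reusing the .companyNameSpecs container assumed above) that collects every link rather than just the first and follows each one:

links = response.css('.companyNameSpecs a::attr(href)').getall()
for link in links:
    # parse_company is a hypothetical callback for the detail pages.
    yield response.follow(link, callback=self.parse_company)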
What is the best way you guys know to get the XPath and CSS selectors of a scraped website using Selenium?
Someone suggested that I use these XPath and CSS selectors as parameters for an exercise I'm working on:
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input[placeholder='Search']"))).send_keys('Tech')
wait.until(EC.element_to_be_clickable((By.XPATH, "//button[text()='Cancel']/.."))).click()
These parameters work very well for the exercise. However, I'm unsure how to get (or "build") those parameters...
If I use Chrome's Inspect > right click > Copy XPath or Copy selector, I get very different parameters that don't seem to work as well and don't seem to be found by Selenium:
#search-bar
//*[@id="app-container"]/div/section/div/div/div[2]/button
Is there a tool or a technique to get better XPATH or CSS SELECTORS as in my first example?
I like the resources shared by @JD2775. They are good for getting started with constructing and understanding XPaths and CSS selectors. When you are comfortable with that, you can work on your selector strategy. Hopefully you find at least some of the following helpful.
What makes a "good" xpath or css selector?
The selector should reliably and uniquely identify the targeted element.
For example, if an element's class occurs multiple times on the page, do not use only that class to identify the element. This is the most basic requirement for your selector.
The selector should not be prone to "flakiness" -- i.e., false failures that occur as a result of changes that are unrelated to the test.
Accomplish this by relying on as little of the DOM as possible to identify your element. For example, if both work to uniquely identify the element, //*[@id="app-container"]//button should be preferred over //*[@id="app-container"]/div/section/div/div/div[2]/button. Or, as you identified, "//button[text()='Cancel']/.." is the better choice.
Probably less important, but still worth considering: how easy is it to understand from the selector which element is being grabbed?
Some best practices
If you are working with a development team and thus have access to the source code of the application you are testing, implement a custom HTML attribute that is used ONLY for automation, and which has a value to uniquely identify and describe the element. In your test code you can then identify each of the elements you need with a line like this:
my_field = driver.find_element_by_css_selector('input[data-e2e-attribute="aCertainField"]')
Organize your selection of elements into a Page Object Model, which abstracts the definition of webelements to one spot. That way you can use these elements anywhere in your tests without having to locate them again, and it's easier to update your selectors when necessary.
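A minimal sketch of that idea, reusing the selectors from the question (the class and method names are just illustrative):

from selenium.webdriver.common.by import By

class SearchPage:
    # All locators for this page live in one place.
    SEARCH_INPUT = (By.CSS_SELECTOR, "input[placeholder='Search']")
    CANCEL_BUTTON = (By.XPATH, "//button[text()='Cancel']/..")

    def __init__(self, driver):
        self.driver = driver

    def search(self, term):
        self.driver.find_element(*self.SEARCH_INPUT).send_keys(term)

    def cancel(self):
        self.driver.find_element(*self.CANCEL_BUTTON).click()

# Usage in a test: page = SearchPage(driver); page.search("Tech"); page.cancel()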
You are correct that right-clicking and Copy XPath is a bad way to get an XPath. You are left with a long and brittle selector. It is much better to build your own. Once you get the hang of it, it is pretty simple to start building your own CSS and XPath selectors. Some of them get complicated, but if you keep practicing and searching for solutions you will get better and better.
The problem is that it is very difficult to explain how to do it in a forum like this. Your best bet is to find some YouTube videos on how to create XPath and CSS selectors for Selenium. Here is a decent one I just found for XPath:
https://www.youtube.com/watch?v=3uktjWgKrtI
This follows the approach I use in Chrome DevTools, using the built-in Find window (no plugins).
Here is a good cheat sheet I have used in the past for XPath and CSS selectors:
https://www.automatetheplanet.com/selenium-webdriver-locators-cheat-sheet/
Good luck
While creating a crawler for a website using Scrapy, I extracted links using XPath. But these links look something like this:
https://somedomain.com/someOtherUrl;sid=someSessionIdByServer;pgid=AgainSomeIdByServer
Now I don't understand why this sid and pgid are attached even though only the URL is present in the href. The XPath I used is something like:
//a/@href
Can I get just the links? Is there any way of getting only the links with Scrapy?
I could strip the links with some Python code, but I was curious to know if there is a way of doing it in the XPath itself, or maybe with a Scrapy setting.
Another way is to use Scrapy's .re() or re_first():
response.xpath('//a/@href').re(r'^([^;]+)')
Use the XPath substring-before function:
//a/substring-before(@href, ';')
since Scrapy still does not support the tokenize() function available in XPath 2.0.
Well, with some time and effort, I got to know some of the reasons why this happens, so I am answering my own question because it might help somebody else.
The pgid (Process Group ID) and sid (Session ID) were added by the server itself. When I look at the DOM in my browser, the browser has already processed the page and I cannot see sid and pgid on the links. But when I fetch the HTML using Python, the links do come in the url+sid+pgid format. The reason is given in the Scrapy documentation.
I used
link = element.xpath(".//a/@href").get().split(";")[0]
to get just the URL and remove sid and pgid from the links.
It's not a complete XPath solution, but it solved my problem.
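An alternative sketch using only the standard library (not from the original answer): urlparse exposes the ";sid=...;pgid=..." part of the last path segment as its params field, so blanking that field rebuilds the URL without it.

from urllib.parse import urlparse, urlunparse

def strip_matrix_params(url):
    # The ';'-delimited parameters end up in the .params field of the parse result.
    return urlunparse(urlparse(url)._replace(params=''))

# strip_matrix_params("https://somedomain.com/someOtherUrl;sid=abc;pgid=xyz")
# -> "https://somedomain.com/someOtherUrl"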
Is there a way to scrape CSS values while scraping using the Python Scrapy framework, or by using PHP scraping?
Any help will be appreciated.
scrapy.Selector allows you to use xpath to extract properties of HTML elements including CSS.
e.g. https://github.com/okfde/odm-datenerfassung/blob/master/crawl/dirbot/spiders/data.py#L83
(look around that code for how it fits into an entire scrapy spider)
If you don't need web crawling and just HTML parsing, you can use XPath directly from lxml in Python. Another example:
https://github.com/codeformunich/feinstaubbot/blob/master/feinstaubbot.py
Finally, to get at the CSS from XPath, I only know how to do it via css = element.attrib['style'] - this gives you everything inside the style attribute, which you can further split by, e.g., css.split(';') and then each of those by ':'.
It wouldn't surprise me if someone has a better suggestion. A little knowledge is enough to do a lot of scraping, and that's how I would approach it based on previous projects.
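A minimal sketch of that splitting inside a Scrapy callback; the div.article selector is a made-up example:

style = response.css('div.article::attr(style)').get(default='')
css = {}
for declaration in style.split(';'):
    if ':' in declaration:
        prop, value = declaration.split(':', 1)
        css[prop.strip()] = value.strip()
# e.g. style="color: red; font-size: 12px" -> {'color': 'red', 'font-size': '12px'}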
Yes, please check the documentation for selectors. Basically you have two methods: response.xpath() for XPath and response.css() for CSS selectors. For example, to get a title's text you could do either of the following:
response.xpath('//title/text()').extract_first()
response.css('title::text').extract_first()
I want to crawl a website that has multiple pages, and when a page number is clicked the content is loaded dynamically. How do I screen-scrape it?
That is, since the URL is not present as an href or an a tag, how do I crawl to the other pages?
Would be grateful if someone helped me with this.
PS: The URL remains the same when a different page is clicked.
You should also consider Ghost.py, since it allows you to run arbitrary JavaScript commands, fill forms and take snapshots very quickly.
If you are using Google Chrome, you can check the URL that is being called dynamically under Network -> Headers in the developer tools. Based on that, you can identify whether it is a GET or a POST request.
If it is a GET request, you can find the parameters straight away in the URL.
If it is a POST request, you can find the parameters under Form Data in Network -> Headers of the developer tools.
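A sketch of replaying such a request in Scrapy, assuming a hypothetical /ajax/pagination endpoint that takes a "page" form field (for a GET endpoint you would put the parameters in the URL of a plain scrapy.Request instead):

import scrapy

class PaginationSpider(scrapy.Spider):
    name = "pagination"
    start_urls = ["http://example.com/listing"]

    def parse(self, response):
        # Replay the AJAX call the browser makes when a page number is clicked.
        for page in range(1, 6):
            yield scrapy.FormRequest(
                "http://example.com/ajax/pagination",
                formdata={"page": str(page)},
                callback=self.parse_page,
            )

    def parse_page(self, response):
        # The endpoint may return an HTML fragment or JSON; adjust the parsing accordingly.
        for title in response.css("h2::text").getall():
            yield {"title": title}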
You could look for the data you want in the JavaScript code instead of the HTML. This is usually a pain, but you can do fun things with regular expressions.
Alternatively, some of the browser testing libraries like Splinter work by loading the page up in an actual browser like Firefox or Chrome before scraping. One of those would work if you are running this on a machine with a browser installed.
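A small sketch of the Splinter route, assuming Firefox is installed locally; the listing URL and the pagination selector are made up:

from splinter import Browser

with Browser('firefox') as browser:
    browser.visit("http://example.com/listing")
    # Click the second page number and read the DOM after the AJAX call has run.
    browser.find_by_css("a.page-2").first.click()
    html = browser.html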
Since this post has been tagged with python and web-crawler, Beautiful Soup has to be mentioned: http://www.crummy.com/software/BeautifulSoup/
Documentation here: http://www.crummy.com/software/BeautifulSoup/bs3/download/2.x/documentation.html
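A tiny sketch of parsing a fetched page (or an AJAX-returned HTML fragment) with Beautiful Soup; the URL and the use of requests are assumptions:

import requests
from bs4 import BeautifulSoup

html = requests.get("http://example.com/ajax/page/2").text
soup = BeautifulSoup(html, "html.parser")
# Pull whatever the fragment contains, e.g. every link's href.
links = [a.get("href") for a in soup.find_all("a")]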
You cannot do that easily, since it is AJAX pagination (even with mechanize). Instead, open the page source and try to work out which URL request is used for the AJAX pagination. Then you can recreate that request yourself and process the returned data your own way.
If you don't mind using gevent, GRobot is another good choice.