I am trying to scrape some content from a website, but selenium.page_source() does not contain all the content I need beacuse the webiste is dynamically rendered. When opening DevTools in Chrome you are able to inspect all of the DOM-elements - even those rendered dynamically. This made me believe that there must be a way in selenium to do this as well.
Any suggestions?
Get the inner html of the html or body
driver.find_element_by_xpath("/html/body").get_attribute('innerHTML')
If that does not get everything, please post the source html/website.
Related
I am trying to scrape the addresses of Dunkin' locations using this website: https://www.dunkindonuts.com/en/locations?location=10001. However, when trying to access the list of each Dunkin' on the web page, it shows up as comment. How do I access the list? I've never done web scraping before.
Here's my current code, I'm expecting a list of Dunkin' stores which I can then extract the addresses from.
requests.get() will return the raw HTML for a web page. This is only the beginning of the journey when you view this page in the browser. Your browser will parse that HTML to create the DOM. It will load other resources, such as images and scripts from other files. Then it will execute those scripts. In the modern web, those scripts will modify the DOM to give the page that you finally see in the browser. requests alone doesn't give you all that.
One solution is to use a library that loads the HTML into a browser and does all of the magic. selenium is one such library.
I am trying to get the tags text from an instagram image, for example: https://www.instagram.com/p/CHPoTitFdEz/, but Scrapy returns no content.
In the Scrapy shell I have written:
response.xpath('//span[#class=""]/a[#class="xil3i"]/text()').get()
or
response.xpath('//span[#class=""]/a[#class="xil3i"]/text()').extract()
Which should get me to the content of first tag. However, Scrapy shell returns no content or an empty array. I use Scrapy for other, simpler websites and everything always went fine. I also tried to include more divs but the content is always empty.
First rule of thumb of using scrapy is to open view page source on browser. We mostly get the same response in scrapy. If we are getting blocked then that is a totally different question.
Upon viewing source we can see that this website is dynamically loading all the content using ajax requests as nothing is present in the page source. Now you can try searching for your desired content in network tab and then you can try replicating the request that contains that. Or else you can use splash. Documentation for splash can be found here.
If you try to look at the response body, you will see that instagram page is not fully loaded. POST data is saved in other tags. For example:
<meta property="instapp:hashtags" content="cyberpunk" />
So, you may want to change selector path and extract information from meta tags.
In this case it would be:
response.xpath('//meta[#property="instapp:hashtags"]')
and you will get your content back. In case you need other information first try inspecting response.body and see if it is there.
Say I want to scrape the following url:
https://soundcloud.com/search/sounds?q=edm&filter.created_at=last_week
I have the following python code:
import requests
from lxml import html
urlToSearch = 'https://soundcloud.com/search/sounds?q=edm&filter.created_at=last_week'
page = requests.get(urlToSearch)
tree = html.fromstring(page.content)
print(tree.xpath('//*[#id="content"]/div/div/div[3]/div/div/div/ul/div/div/text()'))
The trouble is when I print the text at the following xpath:
//*[#id="content"]/div/div/div[3]/div/div/div/ul/div/div
nothing appears but [] despite me confirming that "Found 500+ tracks" should be there. What am i doing wrong?
The problem is that requests does not generate dynamic content.
Right click on the page and view the page source, you'll see that the static content does not include any of the content that you see after the dynamic content has loaded.
However, (using Chrome) open dev tools, click on network and XHR. It looks like you can get the data through an API which is better than scraping anyway!
Problem is that with modern websites almost all web pages will change quite a lot after its been loaded with JavaScript, css etc. You will fetch the basic html before any DOM updates etc been made and will look differently to actually visiting the page with a browser.
Use the Selenium WebDriver framework (mostly used for test automation), it will emulate loading the page, executing javascripts etc.
Selenium Documentation for Python
I have a link i want to scrape the content from that looks like this:
https://www.whatever.com/getDescModuleAjax.htm?productId=32663684002&t=1478698394335
But when i want to open it with selenium it won't work. When i load it in a normal Browser it opens as plain Text with the Html in a bracket like this:
window.productDescription='<div style="clea....
#I want this
....n.jpg" width="950"/></p></div>'";
I was thinking i will Download the source code as plain text and extract the content i need using Bs4. But this can't be the best solution. is there a way to ignore the tags and load the web page normally using selenium and python?
If all the source code is inside of JS variable:
window.variable="<div>...</div>" then you probably can't use bs4 to resolve it since bs4 works for pure html DOM nodes.
Is there a way to ignore the tags and load the web page normally using selenium and python
Most likely Selenium should be able to force on-page JS to get executed and load variable content into page's DOM. Try to search where window.productDescription or productDescription expression is applied/used (in which onloaded .js files)?
I am trying to write a Python script that will periodically check a website to see if an item is available. I have used requests.get, lxml.html, and xpath successfully in the past to automate website searches. In the case of this particular URL (http://www.anthropologie.com/anthro/product/4120200892474.jsp?cm_vc=SEARCH_RESULTS#/) and others on the same website, my code was not working.
import requests
from lxml import html
page = requests.get("http://www.anthropologie.com/anthro/product/4120200892474.jsp?cm_vc=SEARCH_RESULTS#/")
tree = html.fromstring(page.text)
html_element = tree.xpath(".//div[#class='product-soldout ng-scope']")
at this point, html_element should be a list of elements (I think in this case only 1), but instead it is empty. I think this is because the website is not loading all at once, so when requests.get() goes out and grabs it, it's only grabbing the first part. So my questions are
1: Am I correct in my assessment of the problem?
and
2: If so, is there a way to make requests.get() wait before returning the html, or perhaps another route entirely to get the whole page.
Thanks
Edit: Thanks to both responses. I used Selenium and got my script working.
You are not correct in your assessment of the problem.
You can check the results and see that there's a </html> right near the end. That means you've got the whole page.
And requests.text always grabs the whole page; if you want to stream it a bit at a time, you have to do so explicitly.
Your problem is that the table doesn't actually exist in the HTML; it's build dynamically by client-side JavaScript. You can see that by actually reading the HTML that's returned. So, unless you run that JavaScript, you don't have the information.
There are a number of general solutions to that. For example:
Use selenium or similar to drive an actual browser to download the page.
Manually work out what the JavaScript code does and do equivalent work in Python.
Run a headless JavaScript interpreter against a DOM that you've built up.
The page uses javascript to load the table which is not loaded when requests gets the html so you are getting all the html just not what is generated using javascript, you could use selenium combined with phantomjs for headless browsing to get the html:
from selenium import webdriver
browser = webdriver.PhantomJS()
browser.get("http://www.anthropologie.eu/anthro/index.jsp#/")
html = browser.page_source
print(html)