Extract embedded script from web page - python

I have a link I want to scrape the content from. It looks like this:
https://www.whatever.com/getDescModuleAjax.htm?productId=32663684002&t=1478698394335
But when I try to open it with Selenium, it won't work. When I load it in a normal browser, it opens as plain text, with the HTML wrapped inside a JavaScript string like this:
window.productDescription='<div style="clea....
#I want this
....n.jpg" width="950"/></p></div>'";
I was thinking I could download the source code as plain text and extract the content I need using bs4, but that can't be the best solution. Is there a way to ignore the wrapping script and load the page normally using Selenium and Python?

If all the source code sits inside a JS variable, like
window.variable="<div>...</div>"
then you probably can't feed the page to bs4 directly, since bs4 works on pure HTML DOM nodes, not on script text.
"Is there a way to ignore the tags and load the web page normally using selenium and python?"
Most likely Selenium can force the on-page JS to execute and load the variable's content into the page's DOM. Try to find where the window.productDescription (or productDescription) expression is applied or used, i.e. in which of the loaded .js files.
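For instance, once you know which page actually runs that script, Selenium can evaluate the variable and hand the HTML back to bs4. A minimal sketch; the URL below is the one from the question and may need to be swapped for the real product page:

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://www.whatever.com/getDescModuleAjax.htm?productId=32663684002&t=1478698394335")
# Ask the browser for the variable's value after the on-page JS has run;
# this returns None if the script never executed on this page.
fragment = driver.execute_script("return window.productDescription;")
driver.quit()
# The value is plain HTML, so bs4 can parse it normally.
soup = BeautifulSoup(fragment, "html.parser")
print(soup.div)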

Related

Beautiful Soup object's HTML does not match HTML from the web browser

I am scraping the links from this website https://www.firstmallorca.com/en/search for each of the properties that appear on it, so I can scrape them further and collect more detailed data.
My problem is that the parsed HTML (I am using the html5lib parser) from which I scrape the data differs in some areas from the HTML I see in the browser's DevTools. To demonstrate this:
1. This is the last link I select. In the browser, its href="/en/sales/penthouse-in-santa-ponsa/102512". [screenshot]
2. I print the parsed HTML from the BeautifulSoup object with bs4Object.prettify() and copy the whole output into Notepad++.
3. Then, in Notepad++, I look for the same element as in point 1. I find it, but its href="/en/sales/finca-in-portocolom/159515", which is different from what I see on the actual webpage. [screenshot]
I do not understand what's happening. In point 3, I was expecting to see href="/en/sales/penthouse-in-santa-ponsa/102512" instead of href="/en/sales/finca-in-portocolom/159515".
It seems to me like I am scraping some other, similar webpage, not the one I see in the browser.
The website loads its content via JavaScript, which your parser does not execute.
This is a task for Selenium.
The selenium package is used to automate interaction with a web browser from Python.
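A minimal sketch of that approach, assuming geckodriver is on your PATH; the CSS selector below is a guess based on the hrefs in the question, not taken from the site's real markup:

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://www.firstmallorca.com/en/search")
# page_source now reflects the DOM after the JavaScript has run,
# so it matches what DevTools shows.
soup = BeautifulSoup(driver.page_source, "html5lib")
driver.quit()
for a in soup.select("a[href^='/en/sales/']"):
    print(a["href"])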

Retrieving dynamic DOM content in selenium

I am trying to scrape some content from a website, but driver.page_source does not contain all the content I need because the website is dynamically rendered. When opening DevTools in Chrome, you are able to inspect all of the DOM elements, even those rendered dynamically. This made me believe that there must be a way to do this in Selenium as well.
Any suggestions?
Get the inner HTML of the html or body element:
driver.find_element_by_xpath("/html/body").get_attribute('innerHTML')
If that does not get everything, please post the source html/website.
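Note that in Selenium 4 the find_element_by_* helpers were removed; an equivalent sketch with the current API and a short wait, so dynamically rendered nodes have a chance to appear (the 10-second timeout is an arbitrary choice):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get("https://example.com")  # placeholder URL
# Wait until the body exists, then grab everything inside it.
body = WebDriverWait(driver, 10).until(
    lambda d: d.find_element(By.TAG_NAME, "body")
)
print(body.get_attribute("innerHTML"))
driver.quit()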

Url request does not parse every information in HTML using Python

I am trying to extract information from an exchange website (chiliz.net) using Python (requests module) and the following code:
data = requests.get(url,time.sleep(15)).text
I used time.sleep since the website does not connect directly to the exchange main page, but I am not sure it is necessary.
The thing is that I cannot find anything written under <body style> in the HTML text (the data variable in this case). How can I reach the full HTML code so I can start extracting the price information from this website?
I know Python, but I am not that familiar with websites/HTML, so I would appreciate it if you explain the website-related info as if you were talking to a beginner. Thanks!
There could be a few reasons for this.
From what I can tell, the website runs behind a proxy server, which interferes with your request's loading time. This is why it is not directly connecting to the main page.
It might also be the case that the elements are rendered using JavaScript after the page has loaded, so you only get the initial page and not the JavaScript-rendered parts. You can try increasing your sleep() time, but I don't think that will help.
You can also use a library called Selenium. It automates browsers, and you can use its page_source property to obtain the HTML source code.
Code (taken from here)
from selenium import webdriver

# Launch Firefox (requires geckodriver on your PATH).
browser = webdriver.Firefox()
browser.get("http://example.com")
# page_source holds the HTML after the browser has rendered the page.
html_source = browser.page_source
With Selenium, you can also use an XPath to obtain the data for 'extract the price information from this website'; you can see a tutorial on that here. Alternatively,
once you have extracted the HTML code, you can use a parser such as bs4 to extract the required data.
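Putting the two together might look like this; the td tag and class used for the price are hypothetical, so inspect chiliz.net's markup in DevTools for the real ones:

from bs4 import BeautifulSoup
from selenium import webdriver

browser = webdriver.Firefox()
browser.get("https://www.chiliz.net")
html_source = browser.page_source
browser.quit()

soup = BeautifulSoup(html_source, "html.parser")
# "price" is a placeholder class name; find the real one in DevTools.
for cell in soup.find_all("td", class_="price"):
    print(cell.get_text(strip=True))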

Using Beautifulsoup To Scrape iframe

Hi, I would like to scrape a page with Beautiful Soup, but while an iframe src is normally an HTML link, this time I have encountered a WordPress URL that is basically the folder structure leading to a PHP file.
I was wondering if there is any way to scrape the table inside that file?
The table's DIV tags exist when I inspect the elements in Chrome; however, when I load the link with BeautifulSoup, the content within the iframe (the table) disappears.
Please help.
When the contents are loaded by JavaScript or PHP, the Selenium library is more useful and handy for extracting the required data.
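A sketch of that with Selenium: load the outer page, switch into the iframe, and only then read the table. The URL and locators below are placeholders:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("https://example.com/page-with-iframe")  # the WordPress page
# Until you switch into the iframe, its contents are invisible
# to find_element and page_source.
driver.switch_to.frame(driver.find_element(By.TAG_NAME, "iframe"))
table = driver.find_element(By.TAG_NAME, "table")
print(table.get_attribute("outerHTML"))
driver.switch_to.default_content()  # step back out when done
driver.quit()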

Why am I unable to scrape this website?

Say I want to scrape the following url:
https://soundcloud.com/search/sounds?q=edm&filter.created_at=last_week
I have the following Python code:
import requests
from lxml import html
urlToSearch = 'https://soundcloud.com/search/sounds?q=edm&filter.created_at=last_week'
page = requests.get(urlToSearch)
tree = html.fromstring(page.content)
print(tree.xpath('//*[@id="content"]/div/div/div[3]/div/div/div/ul/div/div/text()'))
The trouble is that when I print the text at the following XPath:
//*[@id="content"]/div/div/div[3]/div/div/div/ul/div/div
nothing appears but [], despite my confirming that "Found 500+ tracks" should be there. What am I doing wrong?
The problem is that requests does not execute JavaScript, so it never generates the dynamic content.
Right-click on the page and view the page source: you'll see that the static content does not include any of the content that appears after the dynamic content has loaded.
However (using Chrome), open DevTools and click on Network, then XHR. It looks like you can get the data through an API, which is better than scraping anyway!
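For instance, something like the following; the endpoint below is a stand-in, so copy the real request URL, including its query parameters, from the Network > XHR tab:

import requests

# Placeholder endpoint: replace with the URL copied from DevTools.
api_url = "https://example.com/api/search/tracks?q=edm"
response = requests.get(api_url)
response.raise_for_status()
data = response.json()  # structured data, no HTML parsing needed
print(data.keys())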
The problem is that with modern websites, almost all pages change quite a lot after they have been loaded, through JavaScript, CSS, and so on. You fetch the basic HTML before any DOM updates have been made, so it looks different from what you see when actually visiting the page in a browser.
Use the Selenium WebDriver framework (mostly used for test automation); it will emulate loading the page, executing JavaScript, and so on.
Selenium Documentation for Python
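For example, with an explicit wait so the script-inserted results exist before you read them (the CSS selector is a guess derived from the XPath in the question):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get("https://soundcloud.com/search/sounds?q=edm&filter.created_at=last_week")
# Block until the dynamically rendered result list shows up.
results = WebDriverWait(driver, 15).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#content ul li"))
)
for item in results:
    print(item.text)
driver.quit()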
