Render HTML string with selenium python - python

Is there any way to render an HTML string with Selenium?
The system does not allow writing to disk, so saving the content to a file and opening that file in the browser will not work.
Loading the HTML through a "data:text/html;base64" URL does not work either, because the content is too large.
Can I open an in-memory object such as io.BytesIO with Selenium?
Or is there an appropriate way to load HTML content by executing JavaScript through Selenium?
Thanks.
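One possible approach, sketched here under assumptions rather than taken from an accepted answer, is to open a blank page and write the HTML string into it from JavaScript. The function name render_html_string is hypothetical; driver.get and driver.execute_script are standard WebDriver calls. The HTML is passed as a script argument, so it does not travel through a URL and needs no manual escaping.

```python
def render_html_string(driver, html):
    """Load an HTML string into the current tab without touching the disk.

    `driver` is assumed to be any Selenium WebDriver instance.
    """
    # Start from an empty document so nothing from a previous page remains.
    driver.get("about:blank")
    # arguments[0] is the `html` value passed below; Selenium serialises it,
    # so quotes and newlines in the markup do not need escaping.
    driver.execute_script(
        "document.open(); document.write(arguments[0]); document.close();",
        html,
    )


def demo():
    """Usage sketch; requires Selenium and a local Firefox installation."""
    from selenium import webdriver

    driver = webdriver.Firefox()
    try:
        render_html_string(driver, "<html><body><h1>Hello</h1></body></html>")
        print(driver.title, len(driver.page_source))
    finally:
        driver.quit()
```

Whether this copes with the content size in question would need to be checked against the actual data.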

Related

PDF file scraping to extract html source content

Goal: I am trying to convert a PDF file (stored either locally or on the web) to HTML.
I am using the Selenium Firefox webdriver to open the PDF file and extract HTML content from it by XPath.
The problem is that processing time increases significantly as the file gets larger. Is there a way to make it faster?
I have tried threading in Python, but I have to open a webdriver instance for each page of the PDF, and each instance takes time to load.
If anything is unclear, I am happy to explain in more detail.
Thanks in advance for any help.

Retrieving dynamic DOM content in selenium

I am trying to scrape some content from a website, but driver.page_source does not contain all the content I need because the website is dynamically rendered. When you open DevTools in Chrome, you can inspect all of the DOM elements, even those rendered dynamically. This made me believe that there must be a way to do this in Selenium as well.
Any suggestions?
Get the inner HTML of the html or body element:
driver.find_element("xpath", "/html/body").get_attribute("innerHTML")
If that does not get everything, please post the source HTML or the website.
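If the content appears only after scripts have run, reading the DOM once may be too early. The following is a minimal polling sketch, assuming the caller can describe "ready" as a check on the serialized DOM. The name wait_for_rendered_html and its parameters are hypothetical; Selenium's own WebDriverWait is the more usual tool for this.

```python
import time


def wait_for_rendered_html(driver, is_ready, timeout=10.0, poll=0.5):
    """Poll the live DOM until is_ready(html) is true, then return the HTML.

    `driver` is assumed to be a Selenium WebDriver; `is_ready` is a callable
    that receives the serialized DOM as a string and returns True or False.
    """
    deadline = time.monotonic() + timeout
    while True:
        # outerHTML reflects the current DOM, including nodes added by scripts.
        html = driver.execute_script(
            "return document.documentElement.outerHTML;"
        )
        if is_ready(html):
            return html
        if time.monotonic() >= deadline:
            raise TimeoutError("DOM did not reach the expected state in time")
        time.sleep(poll)
```

For example, wait_for_rendered_html(driver, lambda html: 'class="price"' in html) would keep polling until that (made-up) marker shows up in the DOM.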

How to get the html from a page opened via Javascript?

I am using driver.execute_script('window.open("{}", "_blank");'.format(input_url)) instead of driver.get in a function that is executed through a Pool, like below:
with Pool(2) as p:
    records = p.map(process_url, fetch_links)
To speed up the process, I open the URLs via JavaScript using window.open. The page does open, but I do not get its HTML. How can I tackle this? I tried driver.get(), but then there is no parallelism: the URLs open one by one in the same window.
I had a similar problem while working with JavaScript and HTML.
To open an HTML file or another website by URL, you can try the following JavaScript:
location.replace("name.html")
A relative URL like this is resolved against the URL of the current page, so the HTML file should be in the same folder as that page.
To open another website, try this:
location.replace("https://stackoverflow.com")
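A likely reason the HTML is missing after window.open is that the driver stays focused on the original window. A minimal sketch, assuming a standard Selenium WebDriver, is to find the new window handle and switch to it before reading the page. The function name open_in_new_tab is hypothetical.

```python
def open_in_new_tab(driver, url):
    """Open `url` in a new tab via window.open and switch the driver to it.

    Returns the handle of the new tab. `driver` is assumed to be a
    Selenium WebDriver instance.
    """
    known = set(driver.window_handles)
    # Pass the URL as an argument instead of formatting it into the script.
    driver.execute_script("window.open(arguments[0], '_blank');", url)
    new_handles = [h for h in driver.window_handles if h not in known]
    if not new_handles:
        raise RuntimeError("window.open did not create a new window handle")
    # Until this switch, page_source still refers to the original window.
    driver.switch_to.window(new_handles[0])
    return new_handles[0]
```

After the switch, driver.page_source refers to the new tab, although the page may still be loading at that point, so a wait is probably needed before reading it.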

Python Save XML Webpage to .mht

I have a single diagnostic web page with charts on a device. The page is in XML format and is made up of an XSL file and GIF files. Is there a way with Python to download the entire page and save it as a single .mht file rather than as separate files?
This is essentially a combination of those two problems:
How to save "complete webpage" not just basic html using Python
https://stackoverflow.com/a/44815028/679240
AFAIK, you could download the page with urllib, parse the HTML with Beautiful Soup, find the images and other dependencies in the parsed HTML, download those, rewrite the image urls in the parsed html to point to the local copies (Beautiful Soup can do this), save the modified HTML back to the disk, and use MHTifier to generate the MHT.
Perhaps Scrapy could help you, too.
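As an alternative to MHTifier for the final step, here is a rough sketch that packs already-downloaded content into one file with Python's standard email package. It relies on MHTML being a MIME multipart/related message in which each part carries a Content-Location header. The function name build_mht and the shape of the images argument are assumptions for illustration, and the sketch only covers an HTML root document, not the XML plus XSL case.

```python
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_mht(page_url, html, images):
    """Return the bytes of a multipart/related archive.

    `images` is assumed to map each absolute image URL to a tuple of
    (image subtype such as "gif", raw image bytes).
    """
    archive = MIMEMultipart("related", type="text/html")
    archive["Subject"] = page_url

    # The root document goes first and records where it came from.
    root = MIMEText(html, "html", "utf-8")
    root["Content-Location"] = page_url
    archive.attach(root)

    # Each resource is stored under the URL the HTML refers to it by.
    for url, (subtype, data) in images.items():
        part = MIMEImage(data, _subtype=subtype)
        part["Content-Location"] = url
        archive.attach(part)

    return archive.as_bytes()
```

The result would still have to be written somewhere with a .mht extension, and whether a particular browser opens it correctly is not something this sketch verifies.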
Hi, I was able to convert both a web page and a local HTML file to .mht using win32com.
You can have a look at this:
https://stackoverflow.com/a/59321911/5290876
You can share a sample XML file with its XSL and images for testing.

Extract embedded script from web page

I have a link that I want to scrape content from. It looks like this:
https://www.whatever.com/getDescModuleAjax.htm?productId=32663684002&t=1478698394335
But when I try to open it with Selenium, it does not work. When I load it in a normal browser, it opens as plain text, with the HTML wrapped like this:
window.productDescription='<div style="clea....
#I want this
....n.jpg" width="950"/></p></div>'";
I was thinking of downloading the source as plain text and extracting the content I need using bs4, but this can't be the best solution. Is there a way to ignore the tags and load the web page normally using Selenium and Python?
If all the source code is inside a JS variable:
window.variable="<div>...</div>" then you probably can't use bs4 to resolve it, since bs4 works on pure HTML DOM nodes.
"Is there a way to ignore the tags and load the web page normally using selenium and python?"
Most likely Selenium can force the on-page JS to execute and load the variable's content into the page's DOM. Try searching for where window.productDescription or productDescription is used (that is, in which loaded .js files).
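If running the script in a browser is not practical, the plain-text route from the question can be sketched with a regular expression that pulls the quoted string out of the assignment. The function name extract_js_string is hypothetical, and the pattern is an assumption about the response format: it handles a single quoted literal and does not decode JavaScript escape sequences.

```python
import re


def extract_js_string(source, variable):
    """Return the string literal assigned to window.<variable>, or None.

    Assumes the response looks like: window.name='...'; or window.name="...";
    """
    pattern = re.compile(
        r"window\." + re.escape(variable)
        + r"\s*=\s*(['\"])(?P<body>.*?)(?<!\\)\1",
        re.DOTALL,
    )
    match = pattern.search(source)
    return match.group("body") if match else None


sample = "window.productDescription='<div style=\"clear:both\"><p>Hi</p></div>';"
print(extract_js_string(sample, "productDescription"))
```

The extracted markup is ordinary HTML at that point, so it could then be handed to bs4 or written into a page with Selenium.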
