PDF file scraping to extract HTML source content

Goal: I am trying to convert a PDF file (stored locally or on the web) to HTML.
I am using the Selenium Firefox webdriver to open the PDF file and extract the HTML content from it by XPath.
The problem is that processing time increases significantly as the file gets larger. Is there a way to make it faster?
I have tried threading (Python), but I have to open a webdriver instance for each page of the PDF file, and each instance takes time to load.
If anything is not clear, I am happy to explain in more detail.
Thanks in advance for any help.

Related

Render HTML string with selenium python

Is there any way to render an HTML string with Selenium?
The system does not allow writing anything to disk, so saving the content to a file and opening it in the browser will not work.
Loading the HTML through a "data:text/html;base64" URL also does not work because the content is too large.
Can I open io.BytesIO-like objects with Selenium?
Is there an appropriate way to open HTML content by executing JS with Selenium?
Thanks,

Extracting entire webpage source code using Selenium

I'm new to Selenium and need help with a task. I want to download the source code of a webpage, change the image src attributes, and then take a screenshot of the resulting webpage. I need to do this using a mobile emulator, hence I am using Selenium. The screenshots have to reflect the mobile emulator (they need to look as if the webpage were opened on a mobile device).
I know how to open a local HTML file using Selenium as well as how to take a screenshot through Selenium. I also already have the images stored locally in my working directory.
However, the page source that I get from Selenium does not look like the webpage as rendered in a mobile browser. I used the following code to get the source HTML:
html = driver.page_source
However, if I save this HTML, reload it using Selenium, and then take a screenshot, it usually looks nothing like the original page. The dimensions are OK (those of a mobile browser), but the elements are all missing, even if I haven't replaced the image sources yet. Is there a way to get visually similar code, or another way to change the image sources?

Python Save XML Webpage to .mht

I have a single diagnostic webpage on a device, with charts, that is in XML format and made up of an XSL file and GIF files. Is there a way with Python to download the entire page and save it as a single .mht file rather than as separate files?

This is essentially a combination of those two problems:
How to save "complete webpage" not just basic html using Python
https://stackoverflow.com/a/44815028/679240
AFAIK, you could download the page with urllib, parse the HTML with Beautiful Soup, find the images and other dependencies in the parsed HTML, and download those. Then rewrite the image URLs in the parsed HTML to point to the local copies (Beautiful Soup can do this), save the modified HTML back to disk, and use MHTifier to generate the MHT.
Perhaps Scrapy could help you, too.
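
The "find the images and rewrite their URLs" step described above can be sketched roughly as follows. This is only a minimal illustration and it swaps in the standard library's html.parser for Beautiful Soup so that it runs without extra packages. The function name localize_images and the host device.local are made up for the example, and the actual downloading and the MHTifier step are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import posixpath


class ImageCollector(HTMLParser):
    """Collects the src attribute of every <img> tag it sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def localize_images(html, base_url):
    """Return (rewritten_html, downloads).

    downloads maps each absolute image URL to the local file name
    that the rewritten HTML now points to. Hypothetical helper; it
    assumes src values are written with double quotes and that file
    names do not collide.
    """
    collector = ImageCollector()
    collector.feed(html)
    downloads = {}
    for src in collector.sources:
        absolute = urljoin(base_url, src)
        local_name = posixpath.basename(urlparse(absolute).path)
        downloads[absolute] = local_name
        html = html.replace('src="%s"' % src, 'src="%s"' % local_name)
    return html, downloads


page = '<p><img src="/charts/a.gif"><img src="b.gif"></p>'
new_page, to_download = localize_images(page, "http://device.local/diag/page.xml")
```

Each key of to_download could then be fetched with urllib.request.urlretrieve and saved under the matching local name before the HTML is written to disk.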

Hi, I was able to convert both a web page and a local HTML file to .mht using win32com.
You can have a look at this:
https://stackoverflow.com/a/59321911/5290876
If you share a sample XML with its XSL and images, it can be used for testing.

Extract embedded script from web page

I have a link that I want to scrape content from. It looks like this:
https://www.whatever.com/getDescModuleAjax.htm?productId=32663684002&t=1478698394335
But when I try to open it with Selenium, it doesn't work. When I load it in a normal browser, it opens as plain text with the HTML wrapped in quotes, like this:
window.productDescription='<div style="clea....
#I want this
....n.jpg" width="950"/></p></div>'";
I was thinking of downloading the source as plain text and extracting the content I need using bs4, but this can't be the best solution. Is there a way to ignore the tags and load the web page normally using Selenium and Python?

If all the source code is inside a JS variable:
window.variable="<div>...</div>" then you probably can't use bs4 on it directly, since bs4 works on pure HTML DOM nodes.
"Is there a way to ignore the tags and load the web page normally using selenium and python"
Most likely Selenium can execute the on-page JS and load the variable's content into the page's DOM. Try searching for where the window.productDescription or productDescription expression is used (in which of the loaded .js files).
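
If the response really is just one assignment of a quoted string, the plain-text route mentioned in the question could look something like the sketch below. It is a rough illustration only: the helper name extract_js_string is invented here, and it assumes a single-quoted value whose closing quote is the last single quote in the text, with no JS escape sequences that need decoding.

```python
import re


def extract_js_string(js_text, variable="window.productDescription"):
    """Return the single-quoted string assigned to `variable`, or None.

    Hypothetical helper. The greedy match runs up to the last single
    quote in the text, so quotes inside the HTML do not cut it short.
    """
    pattern = re.escape(variable) + r"\s*=\s*'(.*)'"
    match = re.search(pattern, js_text, re.DOTALL)
    return match.group(1) if match else None


sample = "window.productDescription='<div style=\"clear:both\"><p>text</p></div>';"
fragment = extract_js_string(sample)
```

The returned fragment is ordinary HTML, so at that point it can be handed to bs4 or any other HTML parser.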

Downloading URL in Python without urllib

I have a problem downloading a URL.
I need to download a webpage that contains a table. When I get the .html file with urllib or urllib2, it has problems related to JavaScript (or a similar language). It contains only source code such as id_name etc., but no table information (columns and rows).
Nevertheless, when I save the .html in Google Chrome, it actually contains the table data (not source code, but columns and rows). So what should I do to get the same result in Python?

You can use Selenium to simulate a browser. It will execute the JavaScript, and then you can get the information you want.
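
A minimal sketch of that idea, assuming Selenium and a Firefox driver are installed: the function name fetch_rendered_html and the choice to wait for a table element are illustrative assumptions, not something stated in the question.

```python
def fetch_rendered_html(url, wait_seconds=10):
    """Load `url` in a real browser and return the HTML after JS has run."""
    # Imported here so the module can be loaded without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        # Wait until at least one <table> exists in the DOM (assumed marker
        # that the JavaScript has finished filling in the data).
        WebDriverWait(driver, wait_seconds).until(
            EC.presence_of_element_located((By.TAG_NAME, "table"))
        )
        return driver.page_source
    finally:
        driver.quit()
```

The returned string is the DOM as the browser sees it after script execution, which should be closer to what Chrome saves than the raw response from urllib.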
