Python Save XML Webpage to .mht

I have a single diagnostic webpage on a device, with charts, that is served as XML made up of an XSL stylesheet and GIF files. Is there a way with Python to download the entire page and save it as a single .mht file rather than as separate files?

This is essentially a combination of those two problems:
How to save "complete webpage" not just basic html using Python
https://stackoverflow.com/a/44815028/679240
AFAIK, you could download the page with urllib, parse the HTML with Beautiful Soup, find the images and other dependencies in the parsed HTML, download those, rewrite the image URLs in the parsed HTML to point to the local copies (Beautiful Soup can do this), save the modified HTML back to disk, and use MHTifier to generate the MHT.
Perhaps Scrapy could help you, too.
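A rough sketch of that workflow (illustrative only: the device URL and file names are placeholders, and MHTifier is assumed to be run separately on the resulting files):
import os
from urllib.parse import urljoin
from urllib.request import urlopen
from bs4 import BeautifulSoup

PAGE_URL = "http://device.local/diagnostics.html"  # placeholder URL
OUT_DIR = "page_files"
os.makedirs(OUT_DIR, exist_ok=True)

html = urlopen(PAGE_URL).read()
soup = BeautifulSoup(html, "html.parser")

# Download each image and point its tag at the local copy
for img in soup.find_all("img"):
    src = img.get("src")
    if not src:
        continue
    local_name = os.path.join(OUT_DIR, os.path.basename(src))
    with open(local_name, "wb") as f:
        f.write(urlopen(urljoin(PAGE_URL, src)).read())
    img["src"] = local_name

with open("page_local.html", "w", encoding="utf-8") as f:
    f.write(str(soup))
# Then run MHTifier on page_local.html plus page_files/ to pack everything into one .mht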

Hi, I was able to convert an HTML page (either from the web or a local file) to .mht using win32com.
You can have a look at this
https://stackoverflow.com/a/59321911/5290876.
You can share a sample XML with its XSL and images for testing.
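If you want to try that route, the core of it looks roughly like this (an untested sketch: CDO.Message is a Windows COM object, so this needs Windows plus pywin32, and the URL is a placeholder):
import win32com.client

def save_as_mht(url, out_path):
    # CDO.Message can fetch a URL and build a complete MHTML body from it
    msg = win32com.client.Dispatch("CDO.Message")
    msg.CreateMHTMLBody(url, 0)      # 0 = download all referenced resources
    stream = msg.GetStream()
    stream.SaveToFile(out_path, 2)   # 2 = adSaveCreateOverWrite
    stream.Close()

save_as_mht("http://device.local/diagnostics.xml", "diagnostics.mht")  # placeholder URL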

Related

Why does a list appear as a comment with Python Beautiful Soup?

I am trying to scrape the addresses of Dunkin' locations using this website: https://www.dunkindonuts.com/en/locations?location=10001. However, when trying to access the list of each Dunkin' on the web page, it shows up as a comment. How do I access the list? I've never done web scraping before.
Here's my current code; I'm expecting a list of Dunkin' stores from which I can then extract the addresses.
requests.get() will return the raw HTML for a web page. This is only the beginning of the journey when you view this page in the browser. Your browser will parse that HTML to create the DOM. It will load other resources, such as images and scripts from other files. Then it will execute those scripts. In the modern web, those scripts will modify the DOM to give the page that you finally see in the browser. requests alone doesn't give you all that.
One solution is to use a library that loads the HTML into a browser and does all of the magic. selenium is one such library.
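A minimal sketch of that approach (the CSS selector here is a guess and would need to be adjusted to the page's actual markup):
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.dunkindonuts.com/en/locations?location=10001")
driver.implicitly_wait(10)  # give the page's scripts time to build the DOM

# Hypothetical selector -- inspect the rendered page to find the real one
for store in driver.find_elements(By.CSS_SELECTOR, ".location-card .address"):
    print(store.text)

driver.quit()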

How do I scrape or download PDFs from a collection of .shtml links?

I scraped a list of shtml links. They are now saved in a .xlsx file.
I already tried looking for Excel macros, R code, Python code, Chrome extensions and desktop programs. I could not find any research that was helpful to me.
Each .shtml link leads to a web page with at least one .pdf at the center of the page that I need to download.
Any help appreciated!
The basic workflow is:
You need to use CSS or XPath to locate the PDF download button.
Use RSelenium to simulate the download action; or get the href attribute, use rvest to make a request to that link, and write the binary response to disk using writeBin().
To download a pdf file, I'll use a government form as the example:
pdf url: https://www.uscis.gov/sites/default/files/files/form/i-765.pdf
library(rvest)
library(httr)
session <- html_session("https://www.uscis.gov/sites/default/files/files/form/i-765.pdf")
# save pdf to test.pdf
writeBin(session$response$content,"test.pdf")
That is helpful!
install.packages("rvest")
install.packages("httr")
install.packages("readxl")
update.packages("tibble")
library(rvest)
library(httr)
library(readxl)
setwd("C:/Users/Andreas/Desktop/481064 A.F. - Master Thesis - Election Outcome Prediction/Full Repository Austrian Bundestag")
my_data <- read_excel("StenographischeProto.xlsx")
View(my_data)
session <- html_session("https://www.uscis.gov/sites/default/files/files/form/i-765.pdf")
# save pdf to test.pdf
writeBin(session$response$content,"test.pdf")
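For completeness, a rough Python version of the same workflow, assuming the .shtml URLs sit in the first column of the spreadsheet (that column position and the PDF-link selector are guesses):
import os
import requests
import pandas as pd
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# First column assumed to hold the .shtml URLs
links = pd.read_excel("StenographischeProto.xlsx").iloc[:, 0]

os.makedirs("pdfs", exist_ok=True)
for page_url in links:
    soup = BeautifulSoup(requests.get(page_url).text, "html.parser")
    # Grab every anchor on the page that points at a .pdf
    for a in soup.select('a[href$=".pdf"]'):
        pdf_url = urljoin(page_url, a["href"])
        out = os.path.join("pdfs", os.path.basename(pdf_url))
        with open(out, "wb") as f:
            f.write(requests.get(pdf_url).content)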

Extract embedded script from web page

I have a link I want to scrape the content from that looks like this:
https://www.whatever.com/getDescModuleAjax.htm?productId=32663684002&t=1478698394335
But when I want to open it with Selenium it won't work. When I load it in a normal browser it opens as plain text, with the HTML inside a string like this:
window.productDescription='<div style="clea....
#I want this
....n.jpg" width="950"/></p></div>'";
I was thinking I would download the source code as plain text and extract the content I need using BS4, but this can't be the best solution. Is there a way to ignore the tags and load the web page normally using Selenium and Python?
If all the source code is inside a JS variable:
window.variable="<div>...</div>", then you probably can't use BS4 to resolve it directly, since BS4 works on pure HTML DOM nodes.
Is there a way to ignore the tags and load the web page normally using selenium and python
Most likely Selenium should be able to force the on-page JS to be executed and load the variable's content into the page's DOM. Try to search for where the window.productDescription or productDescription expression is applied/used (in which loaded .js files).
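One pragmatic alternative, since that endpoint just returns a JS assignment rather than a page: fetch it with requests, cut the HTML string out of the window.productDescription assignment, and hand that to BS4 (the regex assumes the markup sits between the outer single quotes, which may need tweaking):
import re
import requests
from bs4 import BeautifulSoup

url = "https://www.whatever.com/getDescModuleAjax.htm?productId=32663684002&t=1478698394335"
js_text = requests.get(url).text

# Pull everything assigned to window.productDescription between the outer quotes
match = re.search(r"window\.productDescription\s*=\s*'(.*)'", js_text, re.S)
if match:
    soup = BeautifulSoup(match.group(1), "html.parser")
    print(soup.get_text(" ", strip=True)[:500])  # or pull out the tags/images you need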

python - pull pdfs from webpage and convert to html

My goal is to have a python script that will access particular webpages, extract all pdf files on each page that have a certain word in their filename, convert them into html/xml, then go through the html files to read data from the pdfs' tables.
So far I have imported mechanize (for browsing the pages/finding the pdf files) and I have pdfminer, however I'm not sure how to use it in a script to perform the same functionality it does on the command line.
What is the most effective group of libraries for accomplishing my task, and how would you recommend approaching each step? I apologize if this is too specific for stackoverflow, but I'm having trouble using google searches and sparse documentation to piece together how to code this. Thanks!
EDIT:
So I've decided to go with Scrapy on this one. I'm really liking it so far, but now I have a new question. I've defined a PDFItem() class to use with my spider, with fields title and url. I have a selector that's grabbing all the links I want, and I want to go through these links and create a PDFItem for each one. Here's the code I have below:
links = sel.xpath('//a[contains(@href, "enforcementactions.pdf") and contains(@class, "titlelink")]')
item = PDFItem()
for link in links:
    item['title'] = link.xpath('/text()')
    item['url'] = URL + link.xpath('@href').extract()[0]
The URL line works well, but I don't really know how to do the same for title. I guess I could just repeat the query at the top with '/text()' added to the end of the selector, but this seems excessive. Is there a better way to just go through each link object in the links array and grab the text and href value?
I would use Scrapy. Scrapy is the best tool for crawling an entire website and generating a list of all PDF links. A spider like this would be very easy to write. You definitely don't need Mechanize.
After that, I would use Poppler to convert each PDF to HTML. It's not a Python module, but you can use the command pdftohtml. In my experience, I've had better results with Poppler than PDFMiner.
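Calling it from the script is a one-liner with subprocess (filenames here are placeholders; pdftohtml ships with Poppler and must be on the PATH):
import subprocess

# -c keeps the complex layout, -s writes a single output document
subprocess.run(["pdftohtml", "-c", "-s", "enforcement_action.pdf", "enforcement_action"],
               check=True)
# With -s, Poppler typically writes the result as enforcement_action-html.html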
Edit:
links = sel.xpath('//a[contains(@href, "enforcementactions.pdf") and contains(@class, "titlelink")]')
for link in links:
    item = PDFItem()
    item['title'] = link.xpath('text()').extract()[0]
    item['url'] = URL + link.xpath('@href').extract()[0]
In order to browse and find PDF links on a webpage, a URL library should suffice. Mechanize, as its documentation suggests, is used to automate interactions with a website. Given your description, I find it unnecessary.
PDFMiner's pdf2txt.py converts a PDF to HTML, so you need to invoke this program as a subprocess in your script to create the output HTML files.
So the libraries you would need are an HTTP library, like Requests, and PDFMiner.
The workflow of your script will be something like:
import os
import requests
from subprocess import Popen
...
r = requests.get(<url-which-has-pdf-links>)
# Do a search for pdf links in r.text
...
for pdf_url in pdf_links:
    # get the PDF content and save it to a local temp file
    ...
    # Build the command line parameters, the way pdf2txt expects
    # Invoke the PDFMiner's pdf2txt on the created file as a subprocess
    Popen(cmd)
More info on using Requests to save the PDF as a local file, and on running programs as subprocesses, can be found in the respective library documentation.
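For the pdf2txt step specifically, the command you would build looks something like this (input and output names are placeholders):
from subprocess import Popen

# pdf2txt.py ships with PDFMiner; -t html selects HTML output, -o names the output file
cmd = ["pdf2txt.py", "-t", "html", "-o", "output.html", "downloaded.pdf"]
Popen(cmd).wait()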

scrapy for table content in pdf file

I am working on scraping tables from PDF files using Python.
Can someone suggest a good module which fetches only the required table?
I have tried pypdf, pdf2html, OCR, and slate, but nothing works.
Thanks
First, convert PDF to HTML. See Converting PDF to HTML with Python.
And then, using an HTML parsing library, parse the HTML generated from the PDF. See BeautifulSoup HTML table parsing
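A minimal sketch of that second step, assuming the conversion produced table.html containing a real <table> element (PDF-to-HTML converters don't always emit one, in which case you would parse positioned <div>/<span> elements instead):
from bs4 import BeautifulSoup

with open("table.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

# Take the first <table> and turn each row into a list of cell strings
table = soup.find("table")
rows = []
for tr in table.find_all("tr"):
    rows.append([cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])])

for row in rows:
    print(row)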
