Pull PDFs from a webpage and convert them to HTML (Python)

My goal is to have a Python script that will access particular webpages, extract all PDF files on each page whose filenames contain a certain word, convert them into HTML/XML, and then go through the HTML files to read data from the PDFs' tables.
So far I have imported mechanize (for browsing the pages and finding the PDF files) and I have PDFMiner; however, I'm not sure how to use it in a script to perform the same functionality it does on the command line.
What is the most effective group of libraries for accomplishing my task, and how would you recommend approaching each step? I apologize if this is too specific for Stack Overflow, but I'm having trouble using Google searches and sparse documentation to piece together how to code this. Thanks!
EDIT:
So I've decided to go with Scrapy for this one. I'm really liking it so far, but now I have a new question. I've defined a PDFItem() class to use with my spider, with the fields title and url. I have a selector that's grabbing all the links I want, and I want to go through those links and create a PDFItem for each one. Here's the code I have below:
links = sel.xpath('//a[contains(@href, "enforcementactions.pdf") and contains(@class, "titlelink")]')
item = PDFItem()
for link in links:
    item['title'] = link.xpath('/text()')
    item['url'] = URL + link.xpath('@href').extract()[0]
The url line works well, but I don't really know how to do the same for title. I guess I could just repeat the query at the top with '/text()' appended to the selector, but that seems excessive. Is there a better way to go through each link object in the links array and grab its text and href values?

I would use Scrapy. Scrapy is the best tool for crawling an entire website and generating a list of all PDF links. A spider like this would be very easy to write. You definitely don't need Mechanize.
After that, I would use Poppler to convert each PDF to HTML. It's not a Python module, but you can use the command pdftohtml. In my experience, I've had better results with Poppler than PDFMiner.
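As a minimal sketch of that conversion step (assuming Poppler's pdftohtml binary is on your PATH; the exact names of the generated files vary a little between Poppler versions):

import subprocess

def pdf_to_html(pdf_path, output_base):
    # -s asks pdftohtml to produce one HTML document covering all pages
    subprocess.run(["pdftohtml", "-s", pdf_path, output_base], check=True)

pdf_to_html("enforcementactions.pdf", "enforcementactions")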
Edit:
links = sel.xpath('//a[contains(@href, "enforcementactions.pdf") and contains(@class, "titlelink")]')
for link in links:
    item = PDFItem()
    item['title'] = link.xpath('text()').extract()[0]
    item['url'] = URL + link.xpath('@href').extract()[0]
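For context, a loop like that would usually live inside the spider's parse() method. A rough, hedged sketch of the whole spider (the item fields follow the question's PDFItem; the spider name and start URL are placeholders):

import scrapy

class PDFItem(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()

class PdfSpider(scrapy.Spider):
    name = "pdf_spider"
    start_urls = ["http://www.example.com/enforcement"]  # placeholder

    def parse(self, response):
        links = response.xpath('//a[contains(@href, "enforcementactions.pdf") '
                               'and contains(@class, "titlelink")]')
        for link in links:
            item = PDFItem()
            item['title'] = link.xpath('text()').extract()[0]
            # response.urljoin does the same job as prefixing URL by hand
            item['url'] = response.urljoin(link.xpath('@href').extract()[0])
            yield item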

In order to browse a webpage and find PDF links, a URL library should suffice. Mechanize, as its documentation suggests, is used to automate interactions with a website; given your description, I find it unnecessary.
PDFMiner's pdf2txt.py converts a PDF to HTML, so you need to invoke this program as a subprocess in your script to create the output HTML files.
So the libraries you would need are an HTTP library, like Requests, and PDFMiner.
The workflow of your script will be something like:
import os
import requests
from subprocess import Popen
...
r = requests.get(<url-which-has-pdf-links>)
# Do a search for pdf links in r.text
...
for pdf_url in pdf_links:
    # get the PDF content and save it to a local temp file
    ...
    # Build the command line parameters, the way pdf2txt expects
    # Invoke PDFMiner's pdf2txt on the created file as a subprocess
    Popen(cmd)
More info on using Requests to save the PDF as a local file is here, and more info on running programs as subprocesses is here.
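A slightly more concrete, hedged sketch of that workflow (Python 3; the page URL is a placeholder, the link extraction is a naive regex, and pdf2txt.py is assumed to be on your PATH):

import os
import re
import subprocess
from urllib.parse import urljoin
import requests

PAGE_URL = "http://www.example.com/page-with-pdf-links"  # placeholder

r = requests.get(PAGE_URL)
# naive search for PDF hrefs; an HTML parser (lxml, BeautifulSoup) is more robust
pdf_links = [urljoin(PAGE_URL, href)
             for href in re.findall(r'href="([^"]+\.pdf)"', r.text)]

for pdf_url in pdf_links:
    local_pdf = os.path.basename(pdf_url)
    with open(local_pdf, "wb") as f:
        f.write(requests.get(pdf_url).content)   # save the PDF locally
    out_html = local_pdf[:-4] + ".html"
    # -t html selects HTML output, -o names the output file
    subprocess.run(["pdf2txt.py", "-t", "html", "-o", out_html, local_pdf],
                   check=True)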

Related

Reading an HTML page, replacing content in a custom tag, and outputting it (CGI). Is this a good idea?

I am making (my first) webpage. I installed apache2 and, in order to have dynamic content, am using CGI: I thought of writing a Python script that reads an HTML page containing a custom tag and then edits the things inside that tag, so that the page has dynamic content. Is this a good or bad way to accomplish this?
The Python script would be something like this:
page = open("page.html", "r").read()
# some code to replace things inside the custom tag <custom-tag>
print(page)
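A hedged sketch of the replacement step the question describes (assuming the marker is literally a <custom-tag> element and a simple regex substitution is enough):

import re

page = open("page.html", "r").read()

dynamic_content = "<p>generated at request time</p>"  # whatever the script computes
# swap whatever sits between the (assumed) custom tags for the dynamic content
page = re.sub(r"<custom-tag>.*?</custom-tag>",
              "<custom-tag>%s</custom-tag>" % dynamic_content,
              page, flags=re.DOTALL)

print("Content-Type: text/html\n")  # CGI wants a header line plus a blank line
print(page)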

Scraping subfiles from URL using Python

A webpage I would like to scrape consists of several files.
I'm interested in scraping only one of them, the frame named mboxFrame.
My method of scraping pages
import requests
from bs4 import BeautifulSoup
webPage = requests.get(URL, verify=False)
soup = BeautifulSoup(webPage.content, "html.parser" )
is able to scrape only the file mail.html. Is there a way to scrape only what I want?
I would appreciate any hints or tips.
The way to open a file from a server is to request it with a URL.
In fact, in the beginnings of the World Wide Web this was the only way to get content: content creators would put various files on servers, and clients would open or download those files. The dynamic processing of URIs and parameters is a later invention. That is why commenters are asking for the URL you use: we want to see it and modify it accordingly, to help you see which parts need changing in order to get that particular file. You can omit the password, or replace it with some other string of letters.
In general, the file you want would be under the URL you use, but ending with the file name.
If the starting URL is www.example.com/mail/, then this file would be at www.example.com/mail/mbox.msc.
Please note that any parameters should follow the path, so www.example.com/mail?user=hendrra&password=hendras_password would turn into
www.example.com/mail/mbox.msc?user=hendrra&password=hendras_password
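As a hedged addition: if the frame is an ordinary <frame> or <iframe>, you can also let BeautifulSoup find its src attribute and request that document directly (the frame name mboxFrame comes from the question; the URL and attribute names are assumptions about the real markup):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

URL = "https://www.example.com/mail/"  # placeholder for the real page

webPage = requests.get(URL, verify=False)
soup = BeautifulSoup(webPage.content, "html.parser")

# look for a <frame> or <iframe> named mboxFrame and fetch its source document
frame = soup.find(["frame", "iframe"], attrs={"name": "mboxFrame"})
if frame is not None:
    frame_url = urljoin(URL, frame["src"])
    frame_page = requests.get(frame_url, verify=False)
    print(frame_page.text)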

Python Save XML Webpage to .mht

I have a single diagnostic webpage on a device, with charts, that is in XML format made up of an XSL stylesheet and GIF files. Is there a way with Python to download the entire page and save it as a single .mht file rather than as separate files?
This is essentially a combination of those two problems:
How to save "complete webpage" not just basic html using Python
https://stackoverflow.com/a/44815028/679240
AFAIK, you could download the page with urllib, parse the HTML with Beautiful Soup, find the images and other dependencies in the parsed HTML, download those, rewrite the image urls in the parsed html to point to the local copies (Beautiful Soup can do this), save the modified HTML back to the disk, and use MHTifier to generate the MHT.
Perhaps Scrapy could help you, too.
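A rough sketch of the download-and-rewrite part of that idea (the final MHT packaging with MHTifier is left out; the device URL and file names here are illustrative):

import os
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

PAGE_URL = "http://device.local/diagnostics.html"  # placeholder for the device page

resp = requests.get(PAGE_URL)
soup = BeautifulSoup(resp.content, "html.parser")

os.makedirs("assets", exist_ok=True)
for img in soup.find_all("img"):
    src = urljoin(PAGE_URL, img["src"])
    local_name = os.path.join("assets", os.path.basename(src))
    with open(local_name, "wb") as f:
        f.write(requests.get(src).content)
    img["src"] = local_name          # point the tag at the local copy

with open("page_local.html", "w", encoding="utf-8") as f:
    f.write(str(soup))               # this plus assets/ is what an MHT packer needs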
Hi, I was able to convert an HTML page (from the web or a local file) to .mht using win32com.
You can have a look at this:
https://stackoverflow.com/a/59321911/5290876.
You can share a sample XML with the XSL and images for testing.

Download Multiple Linked CSV files from a site

The sample site I am using is: http://stats.jenkins.io/jenkins-stats/svg/svgs.html
There are a ton of CSVs linked on this site. Obviously I could go through each link, click, and download, but I know there is a better way.
I was able to put together the following Python script using BeautifulSoup, but all it does is print the soup:
from bs4 import BeautifulSoup
import urllib2
jenkins = "http://stats.jenkins.io/jenkins-stats/svg/svgs.html"
page = urllib2.urlopen(jenkins)
soup = BeautifulSoup(page)
print soup
Below is a sample of what I get when I print the soup, but I am still missing how to actually download the multiple CSV files from this.
<td>
<a alt="201412-jobs.svg" class="info" data-content="<object data='201412-jobs.svg' width='200' type='image/svg+xml'/>" data-original-title="201412-jobs.svg" href="201412-jobs.svg" rel="popover">SVG</a>
<span>/</span>
<a alt="201412-jobs.csv" class="info" href="201412-jobs.csv">CSV</a>
</td>
Just use BeautifulSoup to parse this webpage and get all the URLs of the CSV files, then download each one using urllib.request.urlretrieve().
This is a one-time task, so I don't think you need anything like Scrapy for it.
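A hedged Python 3 sketch of that approach (the question's own code is Python 2; here urlretrieve and urljoin come from urllib.request and urllib.parse, and the CSV hrefs are joined against the page URL because the sample markup shows relative links):

from urllib.request import urlopen, urlretrieve
from urllib.parse import urljoin
from bs4 import BeautifulSoup

jenkins = "http://stats.jenkins.io/jenkins-stats/svg/svgs.html"
soup = BeautifulSoup(urlopen(jenkins), "html.parser")

for a in soup.find_all("a", href=True):
    if a["href"].endswith(".csv"):
        csv_url = urljoin(jenkins, a["href"])
        urlretrieve(csv_url, csv_url.rsplit("/", 1)[-1])  # save under the file's own name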
I totally get where you're coming from; I have wanted to do the same myself. Luckily, if you are a Linux user there is a super easy way to do what you want. On the web-scraping side, I'm familiar with bs4, but Scrapy is my life (sadly); as far as I recall, bs4 has no real built-in way to download files without using urllib/requests, but all the same!
As for your current bs4 spider: first you should probably keep only the links that end in .csv and extract them cleanly. I imagine it would look like:
for link in soup.select('a[href^="http://"]'):
    href = link.get('href')
    if not any(href.endswith(x) for x in ['.csv', '.fileformatetcetc']):
        continue
This is like doing find_all but limiting the results to... well, only the ones with .csv or your desired extension.
Then you would join those hrefs to the base URL (if they are incomplete). If you also need to read the CSV files, the csv module will read them out (from the responses, right!?) and write them to a new file.
For the lols I'm going to create a Scrapy version.
As for that easy method... why not just use wget?
Found this, which sums up the whole CSV read/write process: https://stackoverflow.com/a/21501574/3794089

How can I grab PDF links from a website with a Python script?

Quite often I have to download PDFs from websites, but sometimes they are not all on one page. The links are split across paginated pages, and I have to click on every page to get them.
I am learning Python and I want to code a script where I can put in the web URL and it extracts the PDF links from that website.
I am new to Python, so can anyone please give me directions on how I can do this?
Pretty simple with urllib2, urlparse and lxml. I've commented things more verbosely since you're new to Python:
# modules we're using (you'll need to download lxml)
import lxml.html, urllib2, urlparse
# the url of the page you want to scrape
base_url = 'http://www.renderx.com/demos/examples.html'
# fetch the page
res = urllib2.urlopen(base_url)
# parse the response into an xml tree
tree = lxml.html.fromstring(res.read())
# construct a namespace dictionary to pass to the xpath() call
# this lets us use regular expressions in the xpath
ns = {'re': 'http://exslt.org/regular-expressions'}
# iterate over all <a> tags whose href ends in ".pdf" (case-insensitive)
for node in tree.xpath('//a[re:test(@href, "\.pdf$", "i")]', namespaces=ns):
    # print the href, joining it to the base_url
    print urlparse.urljoin(base_url, node.attrib['href'])
Result:
http://www.renderx.com/files/demos/examples/Fund.pdf
http://www.renderx.com/files/demos/examples/FundII.pdf
http://www.renderx.com/files/demos/examples/FundIII.pdf
...
If there are a lot of pages with links, you can try the excellent Scrapy framework (http://scrapy.org/).
It is pretty easy to understand how to use it, and it can download the PDF files you need.
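If the PDF links really are spread over paginated pages, a hedged sketch of a Scrapy spider that follows a "next" link might look like this (the start URL and both CSS selectors are placeholders for the real markup):

import scrapy

class PdfLinkSpider(scrapy.Spider):
    name = "pdf_links"
    start_urls = ["http://www.example.com/docs/page1.html"]  # placeholder

    def parse(self, response):
        # collect every href ending in .pdf on the current page
        for href in response.css('a::attr(href)').re(r'.*\.pdf$'):
            yield {"pdf_url": response.urljoin(href)}

        # follow the pagination link, if the page has one
        next_page = response.css('a.next::attr(href)').extract_first()
        if next_page:
            yield response.follow(next_page, callback=self.parse)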
If you are going to grab things from a website that is all static pages, you can easily grab the HTML with requests:
import requests
page_content = requests.get(url).text
But if you are grabbing from something like a social or messaging site, there will be some anti-scraping measures, and how to get around those obstacles is the real problem.
First way: make your requests look more like a browser (a human), as sketched below.
Add the headers (you can use the dev tools in Chrome, or Fiddler, to copy the headers).
Build the right POST form; it should copy the way the browser posts the form.
Get the cookies and add them to the requests.
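A sketch of that first approach (the header value, login URL, and form field names are placeholders; copy the real ones from your browser's dev tools):

import requests

session = requests.Session()
session.headers.update({
    # copy a realistic User-Agent (and any other headers) from the browser
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
})

# replicate the login form the browser posts; the field names are placeholders
login_data = {"username": "me", "password": "secret"}
session.post("https://www.example.com/login", data=login_data)

# the session keeps the cookies from the login, so later requests reuse them
page = session.get("https://www.example.com/protected-page")
print(page.status_code)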
Second way: use Selenium and a browser driver. Selenium drives a real browser (in my case, chromedriver).
Remember to add chromedriver to the PATH,
or use code to load the driver executable (I'm not sure this is exactly the right setup code):
from selenium import webdriver
driver = webdriver.Chrome(executable_path=path)
driver.get(url)
It truly surfs the URL with a real browser, so it lowers the difficulty of grabbing things.
Get the web page:
page = driver.page_source
Some websites redirect through several pages, which can cause errors; make your script wait until a certain element shows up:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
certain_element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, 'youKnowThereIsAnElementWithThisId')))
Or use an implicit wait, to wait however long you like:
driver.implicitly_wait(5)  # seconds
And you can control the website through WebDriver; I won't describe that here, you can look up the module.
