How convert dynamic html page to pdf in django? - python

1) I parse some pages to get information.
2) As it information hard to detach, i install it to html page and make it beautiful with custom css.
3) Then i try to convert it to pdf to provide it to customers.
But all pdf-convectors ask for certain url, or file and so on. For example:
def parse(request):
done = csrf(request)
if request.POST:
USERNAME = request.POST.get('logins', '')
PASSWORD = request.POST.get('password', '')
dialogue_url = request.POST.get('links', '')
total_pages = int(request.POST.get('numbers', ''))
news = []
news.extend(parse_one(USERNAME, PASSWORD, dialogue_url, total_pages))
contex = {
"news" : news,
}
done.update(contex)
pageclan = render(request, 'marketing/parser.html', done)
# create an API client instance
client = pdfcrowd.Client(*** ***)
# convert a web page and store the generated PDF to a variable. That is doesn't work. Convertor doesn't support such url.
pdf = client.convertURI('pageclan')
# set HTTP response headers
response = HttpResponse(content_type="application/pdf")
response["Cache-Control"] = "max-age=0"
response["Accept-Ranges"] = "none"
response["Content-Disposition"] = "attachment; filename=jivo_log.pdf"
# send the generated PDF
response.write(pdf)
return response
Is there any tools, that can work fine?

From PDFCrowd Python API documentation:
You can also convert raw HTML code, just use the convertHtml() method instead of convertURI():
pdf = client.convertHtml("<head></head><body>My HTML Layout</body>")
which means that you can modify your code to use the convertHtml method with your rendered page (which is an HTML string):
pdf = client.convertHtml(pageclan.content)

Related

Trying to convert html to text in python?

I'm writing an email application in python. Currently when I try and display any emails using html it just displays the html text. Is there a simple way to convert an email string to just plain text to be viewed?
The relevant part of my code:
rsp, data = self.s.uid('fetch', msg_id, '(BODY.PEEK[HEADER])')
raw_header = data[0][1].decode('utf-8')
rsp, data = self.s.uid('fetch', msg_id, '(BODY.PEEK[TEXT])')
raw_body = data[0][1].decode('utf-8')
header_ = email.message_from_string(raw_header)
body_ = email.message_from_string(raw_body)
self.message_box.insert(END, header_)
self.message_box.insert(END, body_)
Where the message box is just a tkinter text widget to display the email
Thanks
Most emails contain both an html version and a plain/text version. For those emails you can just take the plain/text bit. For emails that only have an html version you have to use an html parser like BeautifulSoup to get the text.
Something like this:
message = email.message_from_string(raw_body)
plain_text_body = ''
if message.is_multipart():
for part in message.walk():
if part.get_content_type() == "text/plain":
plain_text_body = part.get_payload(decode=True)
break
if plain_text_body == '':
plain_text_body = BeautifulSoup(message.as_string()).get_text()
Note: I have not actually tested my code, so it probably won't work as is.

using xpath to parse images from the web

I'm written a bit of code in an attempt to pull photos from a website. I want it to find photos, then download them for use to tweet:
import urllib2
from lxml.html import fromstring
import sys
import time
url = "http://www.phillyhistory.org/PhotoArchive/Search.aspx"
response = urllib2.urlopen(url)
html = response.read()
dom = fromstring(html)
sels = dom.xpath('//*[(#id = "large_media")]')
for pic in sels[:1]:
output = open("file01.jpg","w")
output.write(pic.read())
output.close()
#twapi = tweepy.API(auth)
#twapi.update_with_media(imagefilename, status=xxx)
I'm new at this sort of thing, so I'm not really sure why this isn't working. No file is created, and no 'sels' are being created.
Your problem is that the image search (Search.aspx) doesn't just return a HTML page with all the content in it, but instead delivers a JavaScript application that then makes several subsequent requests (see AJAX) to fetch raw information about assets, and then builds a HTML page dynamically that contains all those search results.
You can observe this behavior by looking at the HTTP requests your browser makes when you load the page. Use the Firebug extension for Firefox or the builtin Chrome developer tools and open the Network tab. Look for requests that happen after the initial page load, particularly POST requests.
In this case the interesting requests are the ones to Thumbnails.ashx, Details.ashx and finally MediaStream.ashx. Once you identify those requests, look at what headers and form data your browser sends, and emulate that behavior with plain HTTP requests from Python.
The response from Thumbnails.ashx is actually JSON, so it's much easier to parse than HTML.
In this example I use the requests module because it's much, much better and easier to use than urllib(2). If you don't have it, install it with pip install requests.
Try this:
import requests
import urllib
BASE_URL = 'http://www.phillyhistory.org/PhotoArchive/'
QUERY_URL = BASE_URL + 'Thumbnails.ashx'
DETAILS_URL = BASE_URL + 'Details.ashx'
def get_media_url(asset_id):
response = requests.post(DETAILS_URL, data={'assetId': asset_id})
image_details = response.json()
media_id = image_details['assets'][0]['medialist'][0]['mediaId']
return '{}/MediaStream.ashx?mediaId={}'.format(BASE_URL, media_id)
def save_image(asset_id):
filename = '{}.jpg'.format(asset_id)
url = get_media_url(asset_id)
with open(filename, 'wb') as f:
response = requests.get(url)
f.write(response.content)
return filename
urlqs = {
'maxx': '-8321310.550067',
'maxy': '4912533.794965',
'minx': '-8413034.983992',
'miny': '4805521.955385',
'onlyWithoutLoc': 'false',
'sortOrderM': 'DISTANCE',
'sortOrderP': 'DISTANCE',
'type': 'area',
'updateDays': '0',
'withoutLoc': 'false',
'withoutMedia': 'false'
}
data = {
'start': 0,
'limit': 12,
'noStore': 'false',
'request': 'Images',
'urlqs': urllib.urlencode(urlqs)
}
response = requests.post(QUERY_URL, data=data)
result = response.json()
print '{} images found'.format(result['totalImages'])
for image in result['images']:
asset_id = image['assetId']
print 'Name: {}'.format(image['name'])
print 'Asset ID: {}'.format(asset_id)
filename = save_image(asset_id)
print "Saved image to '{}'.\n".format(filename)
Note: I didn't check what http://www.phillyhistory.org/'s Terms of Service have to say about automated crawling. You need to check yourself and make sure you're not in violation of their ToS with whatever you're doing.

Python 2&3: both urllib & requests POST data mysteriously disappears

I'm using Python to scrape data from a number of web pages that have simple HTML input forms, like the 'Username:' form at the bottom of this page:
http://www.w3schools.com/html/html_forms.asp (this is just a simple example to illustrate the problem)
Firefox Inspect Element indicates this form field has the following HTML structure:
<form name="input0" target="_blank" action="html_form_action.asp" method="get">
Username:
<input name="user" size="20" type="text"></input>
<input value="Submit" type="submit"></input>
</form>
All I want to do is fill out this form and get the resulting page:
http://www.w3schools.com/html/html_form_action.asp?user=ThisIsMyUserName
Which is what is produced in my browser by entering 'ThisIsMyUserName' in the 'Username' field and pressing 'Submit'. However, every method that I have tried (details below) returns the contents of the original page containing the unaltered form without any indication the form data I submitted was recognized, i.e. I get the content from the first link above in response to my request, when I expected to receive the content of the second link.
I suspect the problem has to do with action="html_form_action.asp" in the form above, or perhaps some kind of hidden field I'm missing (I don't know what to look for - I'm new to form submission). Any suggestions?
HERE IS WHAT I'VE TRIED SO FAR:
Using urllib.requests in Python 3:
import urllib.request
import urllib.parse
# Create dict of form values
example_data = urllib.parse.urlencode({'user': 'ThisIsMyUserName'})
# Encode dict
example_data = example_data.encode('utf-8')
# Create request
example_url = 'http://www.w3schools.com/html/html_forms.asp'
request = urllib.request.Request(example_url, data=example_data)
# Create opener and install
my_url_opener = urllib.request.build_opener() # no handlers
urllib.request.install_opener(my_url_opener)
# Open the page and read content
web_page = urllib.request.urlopen(request)
content = web_page.read()
# Save content to file
my_html_file = open('my_html_file.html', 'wb')
my_html_file.write(content)
But what is returned to me and saved in 'my_html_file.html' is the original page containing
the unaltered form without any indication that my form data was recognized, i.e. I get this page in response: qqqhttp://www.w3schools.com/html/html_forms.asp
...which is the same thing I would have expected if I made this request without the
data parameter at all (which would change the request from a POST to a GET).
Naturally the first thing I did was check whether my request was being constructed properly:
# Just double-checking the request is set up correctly
print("GET or POST?", request.get_method())
print("DATA:", request.data)
print("HEADERS:", request.header_items())
Which produces the following output:
GET or POST? POST
DATA: b'user=ThisIsMyUserName'
HEADERS: [('Content-length', '21'), ('Content-type', 'application/x-www-form-urlencoded'), ('User-agent', 'Python-urllib/3.3'), ('Host', 'www.w3schools.com')]
So it appears the POST request has been structured correctly. After re-reading the
documentation and unsuccessfuly searching the web for an answer to this problem, I
moved on to a different tool: the requests module. I attempted to perform the same task:
import requests
example_url = 'http://www.w3schools.com/html/html_forms.asp'
data_to_send = {'user': 'ThisIsMyUserName'}
response = requests.post(example_url, params=data_to_send)
contents = response.content
And I get the same exact result. At this point I'm thinking maybe this is a Python 3
issue. So I fire up my trusty Python 2.7 and try the following:
import urllib, urllib2
data = urllib.urlencode({'user' : 'ThisIsMyUserName'})
resp = urllib2.urlopen('http://www.w3schools.com/html/html_forms.asp', data)
content = resp.read()
And I get the same result again! For thoroughness I figured I'd attempt to achieve the
same result by encoding the dictionary values into the url and attempting a GET request:
# Using Python 3
# Construct the url for the GET request
example_url = 'http://www.w3schools.com/html/html_forms.asp'
form_values = {'user': 'ThisIsMyUserName'}
example_data = urllib.parse.urlencode(form_values)
final_url = example_url + '?' + example_data
print(final_url)
This spits out the following value for final_url:
qqqhttp://www.w3schools.com/html/html_forms.asp?user=ThisIsMyUserName
I plug this into my browser and I see that this page is exactly the same as
the original page, which is exactly what my program is downloading.
I've also tried adding additional headers and cookie support to no avail.
I've tried everything I can think of. Any idea what could be going wrong?
The form states an action and a method; you are ignoring both. The method states the form uses GET, not POST, and the action tells you to send the form data to html_form_action.asp.
The action attribute acts like any other URL specifier in an HTML page; unless it starts with a scheme (so with http://..., https://..., etc.) it is relative to the current base URL of the page.
The GET HTTP method adds the URL-encoded form parameters to the target URL with a question mark:
import urllib.request
import urllib.parse
# Create dict of form values
example_data = urllib.parse.urlencode({'user': 'ThisIsMyUserName'})
# Create request
example_url = 'http://www.w3schools.com/html/html_form_action.asp'
get_url = example_url + '?' + example_data
# Open the page and read content
web_page = urllib.request.urlopen(get_url)
print(web_page.read().decode(web_page.info().get_param('charset', 'utf8')))
or, using requests:
import requests
example_url = 'http://www.w3schools.com/html/html_form_action.asp'
data_to_send = {'user': 'ThisIsMyUserName'}
response = requests.get(example_url, params=data_to_send)
contents = response.text
print(contents)
In both examples I also decoded the response to Unicode text (something requests makes easier for me with the response.text attribute).

How can I get the current URL or the URL clicked on and save it as a string in python?

How can I get the current URL and save it as a string in python?
I have some code that uses encodedURL = urllib.quote_plus to change the URL in a for loop going through a list. I cannot save encodedURL as a new variable because it's in a for loop and will always return the last item in the list.
My end goal is that I want to get the URL of a hyperlink that the user clicks on, so I can display certain content on that specific URL.
Apologies if I have left out important information. There is too much code and too many modules to post it all here. If you need anything else please let me know.
EDIT: To add more description:
I have a page which has a list of user comments about a website. The website is hyperlinked to that actual website, and there is a "list all comments about this website" link. My goal is that when the user clicks on list all comments about this website, it will open another page showing every comment that is about that website. The problem is I cannot get the website they are referring to when clicking 'all comments about this website'
Don't know if it helps but this is what I am using:
z=[ ]
for x in S:
y = list(x)
z.append(y)
for coms in z:
url = urllib.quote_plus(coms[2])
coms[2] = "'Commented on:' <a href='%s'> %s</a> (<a href = 'conversation?page=%s'> all </a>) " % (coms[2],coms[2], url)
coms[3] += "<br><br>"
deCodedURL = urllib.unquote_plus(url)
text2 = interface.list_comments_page(db, **THIS IS THE PROBLEM**)
page_comments = {
'comments_page':'<p>%s</p>' % text2,
}
if environ['PATH_INFO'] == '/conversation':
headers = [('content-type' , 'text/html')]
start_response("200 OK", headers)
return templating.generate_page(page_comments)
So your problem is you need to parse the URL for the query string, and urllib has some helpers for that:
>>> i
'conversation?page=http://www.google.com/'
>>> urllib.splitvalue(urllib.splitquery(i)[1])
('page', 'http://www.google.com/')

Converting a pdf to text/html in python so I can parse it

I have the following sample code where I download a pdf from the European Parliament website on a given legislative proposal:
EDIT: I ended up just getting the link and feeding it to adobes online conversion tool (see the code below):
import mechanize
import urllib2
import re
from BeautifulSoup import *
adobe = "http://www.adobe.com/products/acrobat/access_onlinetools.html"
url = "http://www.europarl.europa.eu/oeil/search_reference_procedure.jsp"
def get_pdf(soup2):
link = soup2.findAll("a", "com_acronym")
new_link = []
amendments = []
for i in link:
if "REPORT" in i["href"]:
new_link.append(i["href"])
if new_link == None:
print "No A number"
else:
for i in new_link:
page = br.open(str(i)).read()
bs = BeautifulSoup(page)
text = bs.findAll("a")
for i in text:
if re.search("PDF", str(i)) != None:
pdf_link = "http://www.europarl.europa.eu/" + i["href"]
pdf = urllib2.urlopen(pdf_link)
name_pdf = "%s_%s.pdf" % (y,p)
localfile = open(name_pdf, "w")
localfile.write(pdf.read())
localfile.close()
br.open(adobe)
br.select_form(name = "convertFrm")
br.form["srcPdfUrl"] = str(pdf_link)
br["convertTo"] = ["html"]
br["visuallyImpaired"] = ["notcompatible"]
br.form["platform"] =["Macintosh"]
pdf_html = br.submit()
soup = BeautifulSoup(pdf_html)
page = range(1,2) #can be set to 400 to get every document for a given year
year = range(1999,2000) #can be set to 2011 to get documents from all years
for y in year:
for p in page:
br = mechanize.Browser()
br.open(url)
br.select_form(name = "byReferenceForm")
br.form["year"] = str(y)
br.form["sequence"] = str(p)
response = br.submit()
soup1 = BeautifulSoup(response)
test = soup1.find(text="No search result")
if test != None:
print "%s %s No page skipping..." % (y,p)
else:
print "%s %s Writing dossier..." % (y,p)
for i in br.links(url_regex="file.jsp"):
link = i
response2 = br.follow_link(link).read()
soup2 = BeautifulSoup(response2)
get_pdf(soup2)
In the get_pdf() function I would like to convert the pdf file to text in python so I can parse the text for information about the legislative procedure. can anyone explaon me how this can be done?
Thomas
Sounds like you found a solution, but if you ever want to do it without a web service, or you need to scrape data based on its precise location on the PDF page, can I suggest my library, pdfquery? It basically turns the PDF into an lxml tree that can be spit out as XML, or parsed with XPath, PyQuery, or whatever else you want to use.
To use it, once you had the file saved to disk you would return pdf = pdfquery.PDFQuery(name_pdf), or pass in a urllib file object directly if you didn't need to save it. To get XML out to parse with BeautifulSoup, you could do pdf.tree.tostring().
If you don't mind using JQuery-style selectors, there's a PyQuery interface with positional extensions, which can be pretty handy. For example:
balance = pdf.pq(':contains("Your balance is")').text()
strings_near_the_bottom_of_page_23 = [el.text for el in pdf.pq('LTPage[page_label=23] :in_bbox(0, 0, 600, 200)')]
It's not exactly magic. I suggest
downloading the PDF file to a temp directory,
calling out to an external program to extract the text into a (temp) text file,
reading the text file.
For text extraction command-line utilities you have a number of possibilities and there may be others not mentioned in the link (perhaps Java-based). Try them first to see if they fit your needs. That is, try each step separately (finding the links, downloading the files, extracting the text) and then piece them together. For calling out, use subprocess.Popen or subprocess.call().

Categories

Resources