Trying to convert html to text in python?

I'm writing an email application in Python. Currently, when I try to display an email that uses HTML, it just shows the raw HTML markup. Is there a simple way to convert an email string to plain text for viewing?
The relevant part of my code:
rsp, data = self.s.uid('fetch', msg_id, '(BODY.PEEK[HEADER])')
raw_header = data[0][1].decode('utf-8')
rsp, data = self.s.uid('fetch', msg_id, '(BODY.PEEK[TEXT])')
raw_body = data[0][1].decode('utf-8')
header_ = email.message_from_string(raw_header)
body_ = email.message_from_string(raw_body)
self.message_box.insert(END, header_)
self.message_box.insert(END, body_)
Here message_box is just a tkinter Text widget used to display the email.
Thanks

Most emails contain both an HTML version and a text/plain version. For those emails you can just take the text/plain part. For emails that only have an HTML version, you have to use an HTML parser like BeautifulSoup to extract the text.
Something like this:
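import email
from bs4 import BeautifulSoup
message = email.message_from_string(raw_body)
plain_text_body = ''
if message.is_multipart():
    for part in message.walk():
        if part.get_content_type() == "text/plain":
            # get_payload(decode=True) returns bytes, so decode them to str
            charset = part.get_content_charset() or 'utf-8'
            plain_text_body = part.get_payload(decode=True).decode(charset, 'replace')
            break
if plain_text_body == '':
    # No text/plain part was found: strip the tags with BeautifulSoup instead
    plain_text_body = BeautifulSoup(message.as_string(), 'html.parser').get_text()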
Note: I have not actually tested my code, so it probably won't work as is.
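
If installing BeautifulSoup is not an option, the same idea can be sketched with the standard library only. This is a rough Python 3 sketch, not a drop-in fix: the names html_to_text and message_to_text are made up for illustration, and it assumes raw_message is the complete message (headers and body together), which is not what the two separate fetches in the question produce.

```python
from email import message_from_string, policy
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def html_to_text(html):
    """Strip tags from an HTML string and collapse runs of whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


def message_to_text(raw_message):
    """Return the text/plain body if there is one, else the HTML body as text."""
    # policy.default gives an EmailMessage, which provides get_body().
    msg = message_from_string(raw_message, policy=policy.default)
    body = msg.get_body(preferencelist=("plain", "html"))
    if body is None:
        return ""
    content = body.get_content()
    if body.get_content_type() == "text/html":
        return html_to_text(content)
    return content
```

The result of message_to_text can then be passed to message_box.insert(END, ...) instead of the message object itself.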

Related

How to add multiple embed images in a body email using exchangelib

I have a message in HTML format. I need to reply to this email, keeping all the pictures from the previous email.
I have all the pictures from the email saved, and now I need to add them back. How to do it?
exchangelib has instructions for this:
from exchangelib import FileAttachment, HTMLBody, Message
message = Message()
logo_filename = 'logo.png'
with open(logo_filename, 'rb') as f:
    my_logo = FileAttachment(
        name=logo_filename, content=f.read(), is_inline=True,
        content_id=logo_filename
    )
message.attach(my_logo)
# Most email systems
message.body = HTMLBody(
    '<html><body>Hello logo: <img src="cid:%s"></body></html>' % logo_filename
)

If the new email must also be in HTML format, then attach the images one by one and add an <img> tag for each image you have attached. Otherwise, just attach the images and compose the email in plain text.
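
Attaching the images one by one might look roughly like this. It is only a sketch: build_inline_html and attach_inline_images are hypothetical helper names, the file name is reused as the content ID (as in the snippet from the question), and the exchangelib import is done lazily so the HTML-building part can be tried without the library installed.

```python
import html
import os


def build_inline_html(image_paths, intro="Hello"):
    """Return (html_string, [(path, content_id), ...]) for the given images."""
    # Assumption: file names are unique, so they can double as content IDs.
    cids = [(path, os.path.basename(path)) for path in image_paths]
    tags = "".join(
        '<img src="cid:%s">' % html.escape(cid, quote=True) for _, cid in cids
    )
    body = "<html><body>%s%s</body></html>" % (html.escape(intro), tags)
    return body, cids


def attach_inline_images(message, image_paths, intro="Hello"):
    """Attach every image inline and set an HTML body that references each one."""
    from exchangelib import FileAttachment, HTMLBody  # imported only when needed

    body, cids = build_inline_html(image_paths, intro)
    for path, cid in cids:
        with open(path, "rb") as f:
            message.attach(
                FileAttachment(
                    name=cid, content=f.read(), is_inline=True, content_id=cid
                )
            )
    message.body = HTMLBody(body)
    return message
```

In a real reply you would probably keep the original HTML and only make sure each cid: reference matches the content_id of an attachment, rather than generating a new body.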

How convert dynamic html page to pdf in django?

1) I parse some pages to get information.
2) Since the information is hard to separate out, I put it into an HTML page and style it with custom CSS.
3) Then I try to convert the page to PDF so I can provide it to customers.
But all the PDF converters ask for a specific URL, a file, and so on. For example:
def parse(request):
    done = csrf(request)
    if request.POST:
        USERNAME = request.POST.get('logins', '')
        PASSWORD = request.POST.get('password', '')
        dialogue_url = request.POST.get('links', '')
        total_pages = int(request.POST.get('numbers', ''))
        news = []
        news.extend(parse_one(USERNAME, PASSWORD, dialogue_url, total_pages))
        contex = {
            "news" : news,
        }
        done.update(contex)
        pageclan = render(request, 'marketing/parser.html', done)
        # create an API client instance
        client = pdfcrowd.Client(*** ***)
        # convert a web page and store the generated PDF in a variable. This doesn't work: the converter doesn't support such a URL.
        pdf = client.convertURI('pageclan')
        # set HTTP response headers
        response = HttpResponse(content_type="application/pdf")
        response["Cache-Control"] = "max-age=0"
        response["Accept-Ranges"] = "none"
        response["Content-Disposition"] = "attachment; filename=jivo_log.pdf"
        # send the generated PDF
        response.write(pdf)
        return response
Are there any tools that can handle this?

From PDFCrowd Python API documentation:
You can also convert raw HTML code, just use the convertHtml() method instead of convertURI():
pdf = client.convertHtml("<head></head><body>My HTML Layout</body>")
which means that you can modify your code to call convertHtml with your rendered page instead. render() returns an HttpResponse, and its content attribute holds the rendered HTML (as bytes):
pdf = client.convertHtml(pageclan.content)

HTML formatting issues with python smtplib and Outlook 2010

I am generating html files using elementtree.ElementTree.dump on an Element. The files look ok in all browsers, and the underlying code within the files looks fine (no unclosed brackets or anything).
When I send an email to Outlook 2010 via smtplib, I see weird formatting issues. The issues are 100% repeatable, so there must be a deterministic cause. Here is an example:
<table b="" order="1">
That is from the source code of a HTML email I sent myself. It is correctly written as:
<table border="1">
within the original source code.
If I compose an HTML email in Outlook using the original HTML as the source, it is formatted correctly. (New email -> attach HTML file -> insert as text)
Is the issue going to be Outlook or Python? The function I used for reading the html file and sending is below.
def email_Report(mailOptions):
    reportName = time.strftime("%Y%m%d.%H%M") + ".html"
    ElementTree(mailOptions['report']).write("/home/%s/%s" %(mailOptions['username'],reportName))
    #Set sender and receiver to the user building the report.
    mailaddr = '%s@acme.com' %(mailOptions['username'])
    #Access the report file. Added binary in case we ever use code on Windows
    filename = "/home/%s/%s" % (mailOptions['username'], reportName)
    open_file = open(filename, 'rb')
    emsg = MIMEText(open_file.read(), 'html')
    open_file.close()
    emsg['Subject'] = "Report for %s generated by %s %s" % (mailOptions['zone'], mailOptions['username'], time.strftime("%d%m%Y-%H%M"))
    emsg['To'] = mailaddr
    emsg['From'] = mailaddr
    #Hostname can be a parameter to SMTP method if localhost isn't listening
    sc = smtplib.SMTP()
    sc.connect()
    sc.sendmail(mailaddr, mailaddr, emsg.as_string())
    sc.close()
    return
The HTML is extremely simple. No CSS, no title or head tags etc. Just html->body->table->tr->th->(newrow)->td->td etc. Could I have overlooked something like encoding/escaping? Do I have to use mime multipart? I am using Python 2.4.3 and can't use any module that didn't come stock.

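Are you sure you're not running into the 990-character line limit for mail servers, as described in "Workaround for the 990 character limitation for email mailservers"? If the HTML is written out as one very long line, a line break inserted in the middle of the word border could explain the <table b="" order="1"> you are seeing.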
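
A quick way to check whether this applies is to look for overlong lines in the HTML before sending it. The helper below is only an illustrative sketch: find_long_lines is a made-up name, and the 990 default is simply the figure quoted for this limit.

```python
def find_long_lines(text, limit=990):
    """Return (line_number, length) for every line longer than `limit`."""
    return [
        (number, len(line))
        for number, line in enumerate(text.splitlines(), start=1)
        if len(line) > limit
    ]
```

If this returns anything for the generated report, the lines it lists are the ones a mail server may split.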

unnecessary exclamation marks(!)'s in HTML code

I am emailing the content of a text file "gerrit.txt" (http://pastie.org/8289257) to Outlook using the code below.
However, after the email is sent, when I look at the source of the email in Outlook (http://pastie.org/8289379), I see unnecessary
exclamation marks (!) in the code, which mess up the output. Can anyone explain why this happens and how to avoid it?
from email.mime.text import MIMEText
from smtplib import SMTP
def email (body,subject):
    msg = MIMEText("%s" % body, 'html')
    msg['Content-Type'] = "text/html; charset=UTF8"
    msg['Subject'] = subject
    s = SMTP('localhost',25)
    s.sendmail('userid@company.com', ['userid2@company.com'],msg=msg.as_string())
def main ():
    # open gerrit.txt and read the content into body
    with open('gerrit.txt', 'r') as f:
        body = f.read()
    subject = "test email"
    email(body,subject)
    print "Done"
if __name__ == '__main__':
    main()

Some info available here: http://bugs.python.org/issue6327
Note that mailservers have a 990-character limit on each line
contained within an email message. If an email message is sent that
contains lines longer than 990-characters, those lines will be
subdivided by additional line ending characters, which can cause
corruption in the email message, particularly for HTML content. To
prevent this from occurring, add your own line-ending characters at
appropriate locations within the email message to ensure that no lines
are longer than 990 characters.
I think you need to split your HTML into several shorter lines. You can use the textwrap.wrap function for that.
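
A minimal sketch of the textwrap approach, assuming it is acceptable to turn some spaces in the HTML into newlines (usually fine for HTML, but not inside <pre> blocks). The name wrap_html_lines and the default width of 900 are arbitrary choices for illustration.

```python
import textwrap


def wrap_html_lines(html, width=900):
    """Break overlong lines at whitespace so that no line exceeds `width`."""
    wrapped = []
    for line in html.splitlines():
        if len(line) <= width:
            wrapped.append(line)
        else:
            # Break only at whitespace; never split inside a word or tag name.
            wrapped.extend(
                textwrap.wrap(
                    line,
                    width=width,
                    break_long_words=False,
                    break_on_hyphens=False,
                )
            )
    return "\n".join(wrapped)
```

Because break_long_words is off, a single run of more than `width` characters without any whitespace is left as it is, so this does not help for content such as a very long URL or inline base64 data.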

Adding a '\n' to my HTML string, about 20 characters before the point where the "!" was appearing, solved my problem.

I faced the same issue. It happens because Outlook doesn't handle lines longer than 990 characters, which leads to problems such as:
Nested tables
Column headings changing color
Unwanted ! marks being added
The solution is to break up the long lines yourself.
For a single line you can use
line[:40] + "\r\n" + line[40:]
If you are building a table, you can do the same thing in a loop:
"<td>" + line[j][:40] + "\r\n" + line[j][40:] + "</td>"

In my case the html is being constructed outside of the python script and is passed in as an argument. I added line breaks after each html tag within the python script which resolved my issue:
import re
result_html = re.sub(">", ">\n", html_body)
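
On Python 3 there is another option that leaves the HTML untouched: pass an explicit 'utf-8' charset to MIMEText. The email package then base64-encodes the body and wraps the encoded text into short lines, so the length of the original HTML lines no longer matters. A sketch, with build_html_message as a made-up helper name and placeholder addresses:

```python
from email.mime.text import MIMEText


def build_html_message(html_body, subject, sender, recipient):
    """Build an HTML message whose body is base64-encoded (short wire lines)."""
    # With an explicit utf-8 charset the body is transferred as base64.
    msg = MIMEText(html_body, "html", "utf-8")
    msg["Subject"] = subject
    msg["From"] = sender
    msg["To"] = recipient
    return msg


if __name__ == "__main__":
    message = build_html_message(
        "<td>x</td>" * 500, "test email", "a@example.com", "b@example.com"
    )
    # Every line of the serialized message is short, despite the long HTML line.
    print(max(len(line) for line in message.as_string().splitlines()))
```

The message can be sent as before with sendmail(sender, [recipient], msg.as_string()).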

How can I get the current URL or the URL clicked on and save it as a string in python?

How can I get the current URL and save it as a string in python?
I have some code that uses encodedURL = urllib.quote_plus to change the URL in a for loop going through a list. I cannot save encodedURL as a new variable because it's in a for loop and will always return the last item in the list.
My end goal is that I want to get the URL of a hyperlink that the user clicks on, so I can display certain content on that specific URL.
Apologies if I have left out important information. There is too much code and too many modules to post it all here. If you need anything else please let me know.
EDIT: To add more description:
I have a page which has a list of user comments about a website. The website is hyperlinked to that actual website, and there is a "list all comments about this website" link. My goal is that when the user clicks on list all comments about this website, it will open another page showing every comment that is about that website. The problem is I cannot get the website they are referring to when clicking 'all comments about this website'
Don't know if it helps but this is what I am using:
z=[ ]
for x in S:
    y = list(x)
    z.append(y)
for coms in z:
    url = urllib.quote_plus(coms[2])
    coms[2] = "'Commented on:' <a href='%s'> %s</a> (<a href = 'conversation?page=%s'> all </a>) " % (coms[2],coms[2], url)
    coms[3] += "<br><br>"
deCodedURL = urllib.unquote_plus(url)
text2 = interface.list_comments_page(db, **THIS IS THE PROBLEM**)
page_comments = {
    'comments_page':'<p>%s</p>' % text2,
}
if environ['PATH_INFO'] == '/conversation':
    headers = [('content-type' , 'text/html')]
    start_response("200 OK", headers)
    return templating.generate_page(page_comments)

So your problem is you need to parse the URL for the query string, and urllib has some helpers for that:
>>> i
'conversation?page=http://www.google.com/'
>>> urllib.splitvalue(urllib.splitquery(i)[1])
('page', 'http://www.google.com/')
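
The helpers above are Python 2 functions. On Python 3 the same thing can be done with urllib.parse, roughly as sketched below. The function names are made up for illustration; reading QUERY_STRING from environ assumes a WSGI application, which the use of environ['PATH_INFO'] in the question suggests.

```python
from urllib.parse import parse_qs, quote_plus, urlsplit


def comments_link(site_url):
    """Build the 'all comments' link for one site, with the URL safely encoded."""
    return "conversation?page=%s" % quote_plus(site_url)


def page_from_link(link):
    """Extract the decoded 'page' parameter from a link, or None if absent."""
    values = parse_qs(urlsplit(link).query).get("page", [])
    return values[0] if values else None


def page_from_environ(environ):
    """Same lookup inside a WSGI app, where the query string is already split off."""
    values = parse_qs(environ.get("QUERY_STRING", "")).get("page", [])
    return values[0] if values else None
```

Inside the /conversation handler, page_from_environ(environ) gives the site the user clicked on, so there is no need to carry the loop variable url over from the page that built the links.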
