Parsing IMAP Email BODYSTRUCTURE for Attachment Names

I wrote a Python script to access, manage and filter my emails via IMAP (using Python's imaplib).
To get the list of attachments for an email (without first downloading the entire email), I fetched the BODYSTRUCTURE of the email using its UID, i.e.:
imap4.uid('FETCH', emailUID, '(BODYSTRUCTURE)')
and retrieved the attachment names from there.
Normally, the "portion" containing the attachment name would look like:
("attachment" ("filename" "This is the first attachment.zip"))
But on a couple of occasions, I encountered something like:
("attachment" ("filename" {34}', 'This is the second attachment.docx'))
I read somewhere that sometimes, instead of representing strings wrapped in double quotes, IMAP would use curly brackets with string length followed by the actual string (without quotes).
e.g.
{16}This is a string
But the string above doesn't seem to strictly adhere to that (there's a single-quote, a comma, and a space after the closing curly bracket, and the string itself is wrapped in single-quotes).
When I downloaded the entire email, the header for the message part containing that attachment seemed normal:
Content-Type: application/docx
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename="This is the second attachment.docx"
How can I interpret (erm... parse) that "abnormal" body structure and make sense of the extra single quotes, comma, etc.?
And is that "standard"?

What you're looking at is a mangled literal, perhaps damaged by cut and waste? A literal looks like
{5}
Hello
That is, the length, then a CRLF, then that many bytes (not characters):
{4}
🐮
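For what it's worth, the imaplib documentation says response data is a list whose items are either bytes or (header, literal) tuples, so the `{34}', '` in the question is plausibly just the printed form of such a tuple rather than something the server sent. Below is a minimal sketch under that assumption. The helper names `make_literal` and `flatten_fetch_data` are hypothetical, and the sample response is simplified and made up to match the question:

```python
import re

def make_literal(text):
    """Encode text as an IMAP literal: {byte count}, CRLF, then the bytes."""
    payload = text.encode("utf-8")
    return b"{%d}\r\n" % len(payload) + payload

def flatten_fetch_data(data):
    """Join imaplib-style response parts, rewriting literals as quoted strings."""
    out = b""
    for part in data:
        if isinstance(part, tuple):
            header, literal = part
            # The header ends with the {N} marker; drop it and quote the literal.
            header = re.sub(rb"\{\d+\}$", b"", header)
            escaped = literal.replace(b"\\", b"\\\\").replace(b'"', b'\\"')
            out += header + b'"' + escaped + b'"'
        else:
            out += part
    return out

# Simplified, made-up response in the shape imaplib uses for a literal.
data = [
    (b'1 (UID 7 BODYSTRUCTURE ("attachment" ("filename" {34}',
     b"This is the second attachment.docx"),
    b")))",
]
flat = flatten_fetch_data(data)
print(flat.decode("utf-8"))
```

After flattening, the filename appears as an ordinary quoted string, so the same parsing code can handle both forms.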

Looks like IMAP-Tools, a GitHub project, includes a bodystructure parser.

Related

Parsing Email Headers Tabs

I am parsing E-Mails with the Python email module.
If I parse it with the Python E-Mail parser, it does not remove the tab in front of the header items:
from email.parser import Parser
from email.policy import default
testmail = """Date: Wed, 26 Jan 2022 10:45:29 +0100
Message-ID:
\t<123123123123123123123123123123123123123.testinst.themultiverse.com>
Subject:
\t=?iso-8859-1?Q?Auftragsbest=E4tigung_blablabla?=
\t=?iso-8859-1?Q?_one nice thing?=
Content Body Whatnot"""
message = Parser(policy=default).parsestr(testmail)
print(repr(message["Message-Id"]))
print(repr(message["Subject"]))
results in:
'\t<123123123123123123123123123123123123123.testinst.themultiverse.com>'
'\tAuftragsbestätigung blablabla one nice thing'
I have tried the different policies of the email parser, but I have not managed to remove the tab at the beginning. I saw that the header_source_parse method of the EmailPolicy class does strip whitespace, but only in combination with a space at the beginning.
<pythonlib>/email/policy.py:
[...]
value = value.lstrip(' \t') + ''.join(sourcelines[1:])
[...]
Not sure if that is intended behavior or a bug.
My question now: Is there a way in the standard library to do this, or do I need to write a custom policy? The E-Mails are unchanged from an IMAP Server (exchange) and it feels strange that the standard tools do not cover this.

Something makes me think that the message does not strictly conform to RFC 5322.
We can see at 3.2.2. Folding White Space and Comments:
However, where CFWS occurs in this specification, it MUST NOT
be inserted in such a way that any line of a folded header field is
made up entirely of WSP characters and nothing else.
But for the Subject and Message-ID fields, the first line will only contain spaces before the first newline.
IIUC, it corresponds to an obsolete syntax, because we find at 4. Obsolete Syntax:
Another key difference between the obsolete and the current syntax is
that the rule in section 3.2.2 regarding lines composed entirely of
white space in comments and folding white space does not apply.
The doc for EmailPolicy from Python Standard Library is even more explicit on what happens:
header_source_parse(sourcelines)
The name is parsed as everything up to the ‘:’ and returned unmodified. The value is determined by stripping leading whitespace off the remainder of the first line, joining all subsequent lines together, and stripping any trailing carriage return or linefeed characters.
As the tab occurs on the second line, it is not stripped.
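To see this in isolation, header_source_parse can be called directly on the lines of one header. This is only a small sketch; the shortened header text is made up for illustration:

```python
from email.policy import default

# A header whose first line is empty and whose value sits on a
# tab-indented continuation line, as in the question.
sourcelines = ["Subject:\n", "\tone nice thing\n"]
name, value = default.header_source_parse(sourcelines)
print(repr(name), repr(value))
```

The returned value still starts with the line break and the tab from the continuation line, because only the remainder of the first line is stripped.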
I am unsure whether this interpretation is correct, but a possible workaround is to subclass EmailPolicy and strip that initial empty line:
import email.policy

class ObsoletePolicy(email.policy.EmailPolicy):
    def header_source_parse(self, sourcelines):
        header, value = super().header_source_parse(sourcelines)
        # Also drop the line break and tab left over from the empty first line.
        value = value.lstrip(' \t\r\n')
        return header, value
If you use:
message = Parser(policy=ObsoletePolicy()).parsestr(testmail)
you will now get for print(repr(message['Subject'])):
'Auftragsbestätigung blablabla one nice thing'

Parsing url string from '+' to '%2B'

I have a URL whose extension needs to be encoded in ASCII/UTF-8:
a='sAE3DSRAfv+HG='
I need to convert the above to this:
a='sAE3DSRAfv%2BHG%3D'
I searched but was not able to find out how.

Please see the standard library function urllib.parse.quote().
A very important requirement for a URL is safe transmission: its meaning must not change between the time you create it and the time the intended receiver gets it. URL encoding was introduced to achieve that. See RFC 2396.
A URL might contain non-ASCII characters such as cafés, López, etc. Or it might contain symbols which have a different meaning in the context of a URL, for example #, which signifies a bookmark. To ensure safe transmission of such characters, the standards require that you quote the URL at the point of origin, and the URL is always presented in quoted form to everyone else.
Sample usage is below.
>>> import urllib.parse
>>> a='sAE3DSRAfv+HG='
>>> urllib.parse.quote(a)
'sAE3DSRAfv%2BHG%3D'
>>>
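One detail worth knowing: quote() leaves "/" unescaped by default because its safe parameter defaults to "/". A short sketch of that, plus the round trip through unquote(); the variable names and the 'a/b' sample are only illustrative:

```python
from urllib.parse import quote, unquote

token = 'sAE3DSRAfv+HG='

# safe='' asks quote() to escape every reserved character, including '/'.
encoded = quote(token, safe='')
decoded = unquote(encoded)

default_slash = quote('a/b')          # '/' is kept by default
escaped_slash = quote('a/b', safe='') # '/' becomes %2F
print(encoded, decoded, default_slash, escaped_slash)
```

For a single query-string value, passing safe='' is usually the safer choice, since a stray "/" would otherwise pass through unescaped.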

Why is my script not consistently detecting contents in email bodies?

I've set up a sieve filter which invokes a Python script when it detects a postal service email about package deliveries. The sieve filter works fine and invokes the Python script reliably. However, the Python script does not reliably do its work. Here is my Python script, reduced to the relevant parts:
#!/usr/bin/env python3
import sys
from email import message_from_file
from email import policy
import subprocess
msg = message_from_file(sys.stdin, policy=policy.default)
if " out for delivery " in str(msg.get_body(("html"))):
    print("It is out for delivery")
I get email messages that have the string " out for delivery " in the body of the message but the script does not print out "It is out for delivery". I've already checked the HTML in the messages to make sure it is consistent and it is 100% consistent. The frustrating thing though is that if I save the message from my mail reader that should have triggered the script, and I feed it to sieve-test manually, then the script works 100% of the time!
How come my script never works during actual mail delivery but always works whenever I test it with sieve-test?
Notes:
The email contains only a single part, which is HTML, so I have to use the HTML part.
I know I can do a body test in sieve. I'm doing it in Python for reasons outside the scope of this question.

The problem is that you use str(msg.get_body(("html"))), which is unreliable for your purpose. What you get is the body of the message as a string, but it is still encoded for inclusion inside an email message. You're dealing with a MIME part, which may be encoded with quoted-printable, in which case the string you test for (" out for delivery ") could be split across multiple lines when encoded. The string against which you test could have the text you are looking for encoded like this:
[other text] out for=
delivery [more text]
The = sign is part of the encoding and indicates that the newline that follows is there because of the encoding rather than because it was there prior to encoding.
Ok, but why does it always work when you use sieve-test? What happens is that your mail reader encodes the message differently, and the way it encodes it, the text you are looking for is not split across lines, and your script works! It is perfectly correct for the mail reader to save the message with a different encoding so long as once the email is decoded its content has not changed.
What you should do is use msg.get_body(("html")).get_content(). This returns the body in decoded form, with the transfer encoding removed, so the text reads exactly as it did when the postal service composed the email.
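The difference can be reproduced with a small hand-written message. This is a sketch only: the message below is made up, and it assumes the delivered mail uses quoted-printable with a soft line break inside the phrase:

```python
from email import message_from_string, policy

# Made-up single-part HTML message; "=" at end of line is a soft line break.
raw = (
    "Subject: Delivery update\n"
    "MIME-Version: 1.0\n"
    'Content-Type: text/html; charset="utf-8"\n'
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "<p>Your package is out for=\n"
    " delivery today.</p>\n"
)

msg = message_from_string(raw, policy=policy.default)
# ("html",) with the trailing comma is a real one-element tuple.
body = msg.get_body(preferencelist=("html",))

needle = " out for delivery "
found_in_encoded = needle in str(body)           # still quoted-printable
found_in_decoded = needle in body.get_content()  # transfer encoding removed
print(found_in_encoded, found_in_decoded)
```

Only the decoded check finds the phrase, because the soft line break disappears once the quoted-printable encoding is removed.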

Python Emailing - Use of colon causes no output

I had a really odd experience with this, which took me over an hour to figure out. I have a CGI script written in Python which takes form data and forwards it in an email. The problem was that if there was a colon in the string, Python would output nothing to the email.
Does anyone know why this is?
For example:
output = "Print- Customer"
works, though:
output = "Print: Customer"
prints no output.
My email function works essentially:
server.sendmail(fromaddr, toaddrs, msg)
where msg = output
Just wondering if the colon is a special character in Python string output.

Colon isn't a special character in python string output, but it is special to email headers. Try inserting a blank line in the output:
output = "\nPrint: Customer"

Allow me to make a few guesses:
The mail is actually being sent, but the body appears to be empty (your question doesn't say this).
You're not using the built-in Python email library to build the message.
If you open the mail in your mail reader and look at the headers, the "Print:" line will be present.
If so, the problem is that you're not ending the mail headers with a blank line (an extra "\r\n"), so the mail reader thinks that "Print:" is a mail header, while "Print-" is treated as part of the body of a malformed email.
If you add the "\r\n" after your headers, everything should be fine.
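The effect is easy to reproduce offline with the standard email package. A minimal sketch, in which the addresses and subject are placeholders and nothing is actually sent:

```python
from email import message_from_string
from email.message import EmailMessage

# Without a blank line, "Print: Customer" parses as a header named "Print".
no_separator = message_from_string("Print: Customer")
# With a leading blank line, the same text is the body.
with_separator = message_from_string("\nPrint: Customer")

# Letting the email package build the message avoids the problem:
# it writes the headers, the blank line, and then the body.
msg = EmailMessage()
msg["From"] = "form@example.com"   # placeholder address
msg["To"] = "owner@example.com"    # placeholder address
msg["Subject"] = "Form submission"
msg.set_content("Print: Customer")
headers, _, body = msg.as_string().partition("\n\n")
print(repr(no_separator.get_payload()), repr(with_separator.get_payload()))
```

A message built this way can be handed to smtplib's send_message() instead of passing a bare string to sendmail().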

How to distinguish sender-generated carriage returns from word wrap auto-generated carriage returns in email body?

How can I distinguish sender-generated carriage returns from word wrap auto-generated carriage returns in an email body? I'm using Python imaplib to access Gmail and download message bodies like so:
import email
import imaplib

user = 'whoever@gmail.com'
pwd = 'password'
m = imaplib.IMAP4_SSL("imap.gmail.com")
m.login(user, pwd)
m.select("INBOX")
resp, items = m.search(None, "ALL")
items = items[0].split()
messages = []
for emailid in items:
    resp, data = m.fetch(emailid, "(RFC822)")
    email_body = data[0][1]  # raw message bytes
    mail = email.message_from_bytes(email_body)
    for part in mail.walk():
        if part.get_content_type() == 'text/plain':
            body = part.get_payload(decode=True)
            messages.append(body)
I'm focusing on the case of messages received from another Gmail user. The message body text has a number of carriage returns ('\r\n') in it. These fall into two classes: 1) those inserted by the sender of the email, the "true" returns, 2) those created by Gmail word wrapping at ~78 characters, the "false" returns. I want to remove the second class of carriage returns only. I'm sure I could come up with a programmatic approximation that searches for the '\r\n' in a window around every 78th character, but that wouldn't be bulletproof and isn't what I want. Interestingly, I notice that when the message displays in Gmail in the web browser, there are no returns of the second class. Gmail somehow knows to remove/not display these specifically. How? Is there some special encoding I'm missing?

Gmail sends messages in the MIME multipart format, with both a text/plain version (what you are grabbing) and a text/html version. The latter is what contains fancy formatting like bold, italic, links, etc., and is what Gmail displays. While the text/html version is also line-broken at around 78 characters (the e-mail standard recommends that no line exceed 78 characters and requires that none exceed 998), the "real" line breaks that you are looking for are embedded therein as HTML <br> tags. You can see this yourself if you send yourself a message and then, using the little down-arrow next to the Reply button, click "Show original".
You cannot distinguish between "fake" and "real" line breaks in the text/plain version of the message, at least not reliably (as you obviously know). You can, however, pull the text/html version instead, knowing that the "real" line breaks are the <br> tags. You then have to deal with the additional HTML (after first correctly processing the "Content-Transfer-Encoding" used therein).
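As a rough illustration of the <br> approach, the standard html.parser module can collect the text while treating <br> as the only real line break. This is a sketch, not a full HTML-to-text converter: the class and function names are hypothetical, block elements such as <div> or <p> are ignored, and the input is assumed to be already decoded:

```python
from html.parser import HTMLParser

class BreakAwareText(HTMLParser):
    """Collect text, treating <br> as the only real line break."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.chunks.append("\n")

    def handle_data(self, data):
        # Newlines in the HTML source are only wrapping; fold them to spaces.
        self.chunks.append(data.replace("\r\n", " ").replace("\n", " "))

def html_to_text(html):
    parser = BreakAwareText()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)

sample = "First line<br>Second line that\r\ncontinues here<br />Third line"
text = html_to_text(sample)
print(text)
```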

I don't know how many email clients interpret or generate this correctly, but RFC 3676 includes the following:
When creating flowed text, the generating agent wraps, that is,
inserts 'soft' line breaks as needed. Soft line breaks are added at
natural wrapping points, such as between words. A soft line break is
a SP CRLF sequence.
So if the previous line has a space at the end of it, the current line should be interpreted as a continuation of the previous line. I suggest reviewing the entire RFC.
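A simplified sketch of that rule follows. It assumes the part really is labelled format=flowed in its Content-Type and that DelSp is not in use. It ignores quoted (">") lines and the "-- " signature separator, both of which the RFC treats specially, and the function name `unflow` is made up:

```python
def unflow(text):
    """Rejoin soft-wrapped lines of format=flowed text (simplified)."""
    paragraphs = []
    current = ""
    for line in text.split("\r\n"):
        if line.startswith(" "):
            # Space-stuffing: a leading space was added by the sender.
            line = line[1:]
        current += line
        if not line.endswith(" "):
            # No trailing space means a hard break, so the paragraph ends.
            paragraphs.append(current)
            current = ""
    if current:
        paragraphs.append(current)
    return "\n".join(paragraphs)

flowed = (
    "This paragraph was wrapped \r\n"
    "by the sending client.\r\n"
    "This line was typed separately."
)
result = unflow(flowed)
print(result)
```

If the message is not marked format=flowed, trailing spaces carry no such meaning and this approach should not be applied.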
