Why is my script not consistently detecting contents in email bodies? - python

I've set up a sieve filter which invokes a Python script when it detects a postal service email about package deliveries. The sieve filter works fine and invokes the Python script reliably. However, the Python script does not reliably do its work. Here is my Python script, reduced to the relevant parts:
#!/usr/bin/env python3
import sys
from email import message_from_file
from email import policy
import subprocess
msg = message_from_file(sys.stdin, policy=policy.default)
if " out for delivery " in str(msg.get_body(("html"))):
    print("It is out for delivery")
I get email messages that have the string " out for delivery " in the body of the message but the script does not print out "It is out for delivery". I've already checked the HTML in the messages to make sure it is consistent and it is 100% consistent. The frustrating thing though is that if I save the message from my mail reader that should have triggered the script, and I feed it to sieve-test manually, then the script works 100% of the time!
How come my script never works during actual mail delivery but always works whenever I test it with sieve-test?
Notes:
The email contains only a single part, which is HTML, so I have to use the HTML part.
I know I can do a body test in sieve. I'm doing it in Python for reasons outside the scope of this question.

The problem is that str(msg.get_body(("html"))) is unreliable for your purpose. It gives you the body of the message as a string, but still encoded for inclusion inside an email message. You are dealing with a MIME part, which may be encoded as quoted-printable, in which case the string you test for (" out for delivery ") can be split across multiple lines in the encoded form. The string you test against could contain the text you are looking for encoded like this:
[other text] out for=
 delivery [more text]
The = sign is part of the encoding and indicates that the newline that follows is there because of the encoding rather than because it was there prior to encoding.
OK, but why does it always work when you use sieve-test? When your mail reader saves the message, it encodes it differently, and with that encoding the text you are looking for is not split across lines, so your script works. It is perfectly correct for the mail reader to save the message with a different encoding, so long as the content is unchanged once the email is decoded.
What you should do is use msg.get_body(("html")).get_content(). This returns the body with the transfer encoding undone, so the text is the same as when the postal service composed the email.
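A minimal sketch of the difference, using only the standard library. The message below is made up for illustration (addresses, subject, and wording are assumptions); it is a single-part HTML message in quoted-printable with a soft line break inside the phrase being searched for.

```python
from email import message_from_string, policy

# Hypothetical single-part HTML message. The "=" at the end of the body's
# first line is a quoted-printable soft line break.
RAW = (
    "From: sender@example.com\n"
    "To: me@example.com\n"
    "Subject: Delivery update\n"
    "MIME-Version: 1.0\n"
    'Content-Type: text/html; charset="utf-8"\n'
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "<html><body><p>Your package is out for=\n"
    " delivery today.</p></body></html>\n"
)

msg = message_from_string(RAW, policy=policy.default)
body = msg.get_body(preferencelist=("html",))

# str() serializes the part, leaving the quoted-printable encoding in place.
found_in_encoded = " out for delivery " in str(body)
# get_content() undoes the transfer encoding before returning the text.
found_in_decoded = " out for delivery " in body.get_content()

print(found_in_encoded, found_in_decoded)
```

With this sample, only the get_content() check finds the phrase, which matches the behaviour described above.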

Related

How to test content of a Django email?

I'm new to Django and am trying to use unittest to check if there's some text in an outbound email:
class test_send_daily_email(TestCase):
    def test_success(self):
        self.assertIn(mail.outbox[0].body, "My email's contents")
However, I'm having an issue with mail.outbox[0].body. It outputs \nMy email&#39;s contents\n and doesn't match the test text.
I've attempted a few different fixes with no luck:
str(mail.outbox[0].body).rstrip() - returns an identical string
str(mail.outbox[0].body).decode('utf-8') - no attribute decode
Apologies, I know this must be a trivial task. In Rails I would use something like Nokogiri to parse the text. What's the right way to parse this in Django? I wasn't able to find instructions on this in the documentation.
It depends on the actual content of your mail (plain or html) but the easy way is to also encode the string you are testing against.
# if you are testing HTML content
self.assertInHTML("My email's contents", mail.outbox[0].body)
# the string may need escaping the same way django escapes
from django.utils.html import escape
self.assertIn(escape("My email's contents"), mail.outbox[0].body)
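As a rough illustration of the same idea without Django, the standard library's html.unescape can undo the escaping before comparing. This is a sketch of an alternative, not the approach from the answer above; the body string is copied from the question and stands in for mail.outbox[0].body.

```python
from html import unescape

# Body text as reported in the question, with the apostrophe HTML-escaped.
body = "\nMy email&#39;s contents\n"
expected = "My email's contents"

# The raw comparison fails because of the escaped apostrophe.
plain_match = expected in body
# Unescaping the body first makes the comparison succeed.
unescaped_match = expected in unescape(body)

print(plain_match, unescaped_match)
```

Whether to escape the expected string or unescape the body is a matter of taste; the point is to compare both sides in the same form.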

Error querying messages by its rfc822msgid

I'm trying to get a message by its Message-ID. The Gmail API has no get() method to pass the Message-ID in, so I have to list() first passing the q parameter as given below:
q="rfc822msgid:%s" % message_id
The response brings a list with a single message, just as hoped. Then I use the get() method to retrieve the message by its Google style identifier. This works like a charm, unless the Message-ID contains a + character:
message_id="a+b@c"
In this case, the Google Api Client requests this URL:
url="https://www.googleapis.com/gmail/v1/users/me/messages?q=rfc822msgid%3Aa+b%40c&alt=json"
I think the client is doing a quote_plus() with safe="+" to avoid encoding the + character. But this causes a problem in cases like this one, because the server interprets the + character as a space, so the Message-ID is no longer valid:
message_id="a b@c"
I tried to replace the + character with its quoted representation (%2B), but when the client encodes the URL, the Message-ID becomes even worse because of the double quoting, quote(quote()):
message_id="a%252Bb%40c"
So, is there a way to send the + character without the server decoding it as a space?
Thanks in advance.
EDIT: I was working on the solutions suggested here with no positive result. But a few days ago, my original code started to work. I have not changed a single line, so I think Google has fixed something related to this. Thanks for the comments.
URLEncoder.encode("+", "UTF-8"); yields "%2B"
Replace "+" with the query parameter, i.e.
URLEncoder.encode("rfc822msgid:", "UTF-8");
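The encodings discussed in this thread can be reproduced with Python's urllib.parse. This is only a sketch of the string transformations; it does not show what the Google API client does internally, and the variable names are made up for illustration.

```python
from urllib.parse import quote, quote_plus, unquote_plus

message_id = "a+b@c"  # example Message-ID from the question
query = "rfc822msgid:" + message_id

# Treating "+" as safe leaves it unescaped, as in the URL from the question.
kept_plus = quote_plus(query, safe="+")
# Treating nothing as safe turns "+" into "%2B".
escaped_plus = quote(query, safe="")

# A server that form-decodes the query string reads a bare "+" as a space.
seen_with_bare_plus = unquote_plus(kept_plus)
seen_with_escaped_plus = unquote_plus(escaped_plus)

# Quoting a value whose "+" was already replaced by "%2B" escapes the "%".
double_quoted = quote("a%2Bb@c", safe="")

print(kept_plus, escaped_plus, double_quoted)
```

The "+" survives the round trip only when it is escaped exactly once.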

How do I press enter with pexpect [duplicate]

I am working with Python's pexpect module to automate tasks, and I need help figuring out which key characters to use with sendcontrol. How can I send the ENTER key? And for future reference, how can I find the key characters?
here is the code i am working on.
#!/usr/bin/env python
import pexpect
id = pexpect.spawn ('ftp 192.168.3.140')
id.expect_exact('Name')
id.sendline ('anonymous')
id.expect_exact ('Password')
# Not sure how to send the enter control key
id.sendcontrol ('???')
id.expect_exact ('ftp')
id.sendline ('dir')
id.expect_exact ('ftp')
lines = id.before.split ('\n')
for line in lines :
    print(line)
pexpect has no sendcontrol() method. In your example you appear to be trying to send an empty line. To do that, use:
id.sendline('')
If you need to send real control characters then you can send() a string that contains the appropriate character value. For instance, to send a control-C you would:
id.send('\003')
or:
id.send(chr(3))
Responses to comment #2:
Sorry, I typo'ed the module name; now fixed. More importantly, I was looking at old documentation on noah.org instead of the latest documentation at SourceForge. The newer documentation does show a sendcontrol() method. It takes an argument that is either a letter (for instance, sendcontrol('c') sends a control-C) or one of a variety of punctuation characters representing the control characters that don't correspond to letters. But really sendcontrol() is just a convenient wrapper around the send() method, which is what sendcontrol() calls after it has calculated the actual value that you want to send. You can read the source for yourself at line 973 of this file.
I don't understand why id.sendline('') does not work, especially given that it apparently works for sending the user name to the spawned ftp program. If you want to try using sendcontrol() instead then that would be either:
id.sendcontrol('j')
to send a Linefeed character (which is control-j, or decimal 10) or:
id.sendcontrol('m')
to send a Carriage Return (which is control-m, or decimal 13).
If those don't work then please explain exactly what does happen, and how that differs from what you wanted or expected to happen.
If you're just looking to "press enter", you can send a newline:
id.send("\n")
As for other characters that you might want to use sendcontrol() with, I found this useful: https://condor.depaul.edu/sjost/lsp121/documents/ascii-npr.htm
For instance, I was interested in Ctrl+v. Looking it up in the table shows this line:
control character | python & java | decimal | description
^v                | \x16          | 22      | synchronous idle
So if I want to send that character, I can do any of these:
id.send('\x16')
id.send(chr(22))
id.sendcontrol('v')
sendcontrol() just looks up the correct character to send and then sends it like any other text.
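The letter-to-control-character arithmetic can be sketched in a few lines. control_char below is a hypothetical helper written for illustration, not part of pexpect, and it covers letters only; it simply reproduces the decimal values quoted in the answers above.

```python
def control_char(letter):
    """Return the character for Ctrl+<letter> (hypothetical helper)."""
    letter = letter.lower()
    if len(letter) != 1 or not ("a" <= letter <= "z"):
        raise ValueError("expected a single ASCII letter")
    # Ctrl+A is 1, Ctrl+B is 2, and so on up to Ctrl+Z, which is 26.
    return chr(ord(letter) - ord("a") + 1)


# The values match the ones mentioned above: control-C is 3, control-J is 10
# (linefeed), control-M is 13 (carriage return), control-V is 22.
for letter in "cjmv":
    print(letter, ord(control_char(letter)))
```

The resulting string can be passed to send() in the same way as '\x16' or chr(22).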
For keys not listed in that table, you can run this script: https://github.com/pexpect/pexpect/blob/master/tests/getch.py (ctrl space to exit)
For instance, I ran that script and pressed F4, and it said:
27<STOP>
79<STOP>
83<STOP>
So then to press F4 via pexpect:
id.send(chr(27) + chr(79) + chr(83))

Python Emailing - Use of colon causes no output

I had a really odd experience with this, which took me over an hour to figure out. I have a CGI script written in Python which takes form data and forwards it to an email. The problem was that if there was a colon in the string, Python would output nothing to the email.
Does anyone know why this is?
For example:
output = "Print- Customer"
works, though:
output = "Print: Customer"
prints no output.
My email function works essentially:
server.sendmail(fromaddr, toaddrs, msg)
where msg = output
I'm just wondering whether the colon is a special character in Python string output.
Colon isn't a special character in python string output, but it is special to email headers. Try inserting a blank line in the output:
output = "\nPrint: Customer"
Allow me to make a few guesses:
The mail is actually being sent, but the body appears to be empty (your question doesn't say this).
You're not using the builtin python mailing library.
If you open the mail in your mail reader, and look at the headers, the "print:" line will be present.
If so, the problem is that you're not ending the mail headers with a blank line (an empty line terminated by "\r\n"), so the mail reader thinks that "print:" is a mail header, while "print -" is treated as part of the body of a malformed email.
If you add the "\r\n" after your headers, everything should be fine.
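The effect of the missing blank line can be seen with the standard library's email parser. This is a sketch under assumptions: the addresses and subject are placeholders, and using EmailMessage to build the message is a suggestion, not something taken from the question.

```python
from email import message_from_string
from email.message import EmailMessage

# Without a blank line, the text is parsed as a header named "Print".
no_separator = message_from_string("Print: Customer")
# With a leading blank line, the same text is parsed as the body.
with_separator = message_from_string("\nPrint: Customer")

print(no_separator.keys(), repr(no_separator.get_payload()))
print(with_separator.keys(), repr(with_separator.get_payload()))

# Building the message with EmailMessage writes the header/body separator
# for you. The header values below are placeholders.
built = EmailMessage()
built["From"] = "form@example.com"
built["To"] = "owner@example.com"
built["Subject"] = "Form submission"
built.set_content("Print: Customer")
wire_text = built.as_string()
```

The string in wire_text is the kind of value that could be passed as msg to server.sendmail(fromaddr, toaddrs, msg).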

How to distinguish sender-generated carriage returns from word wrap auto-generated carriage returns in email body?

How can I distinguish sender-generated carriage returns from word wrap auto-generated carriage returns in an email body? I'm using Python imaplib to access Gmail and download message bodies like so:
import email
import imaplib

user = 'whoever@gmail.com'
pwd = 'password'
m = imaplib.IMAP4_SSL("imap.gmail.com")
m.login(user, pwd)
m.select("INBOX")
resp, items = m.search(None, "ALL")
items = items[0].split()
messages = []
for emailid in items:
    resp, data = m.fetch(emailid, "(RFC822)")
    email_body = data[0][1]
    # On Python 3 imaplib returns bytes, so parse with message_from_bytes.
    mail = email.message_from_bytes(email_body)
    for part in mail.walk():
        if part.get_content_type() == 'text/plain':
            body = part.get_payload(decode=1)
            messages.append(body)
I'm focusing on the case of messages received from another Gmail user. The message body text has a number of carriage returns ('\r\n') in it. These fall into two classes: 1) those inserted by the sender of the email, the "true" returns, and 2) those created by Gmail word wrapping at ~78 characters, the "false" returns. I want to remove the second class of carriage returns only. I'm sure I could come up with a programmatic approximation that searches for the '\r\n' in a window around every 78th character, but that wouldn't be bulletproof and isn't what I want. Interestingly, I notice that when the message is displayed in Gmail in the web browser, there are no line breaks for the second class of carriage returns. Gmail somehow knows to remove or not display these specifically. How? Is there some special encoding I'm missing?
Gmail sends messages in the MIME multipart format, with both a text/plain version (what you are grabbing) and a text/html version. The latter is what contains fancy formatting like bold, italic, links, etc., and is what Gmail displays. While the text/html version is also line-broken at about 78 characters (the e-mail standard, RFC 5322, recommends that lines not exceed 78 characters and requires that they not exceed 998), the "real" line breaks that you are looking for are embedded therein as HTML <br> tags. You can see this yourself if you send yourself a message and then, using the little down-arrow next to the Reply button, click "Show original".
You cannot reliably distinguish between "fake" and "real" line breaks in the text/plain version of the message (as you obviously know). You can, however, pull the text/html version instead, knowing that the "real" line breaks are the <br> tags. You then have to deal with the additional HTML, after first correctly decoding the "Content-Transfer-Encoding" used therein.
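A rough sketch of pulling the text/html alternative and recovering line breaks from <br> tags, using the standard library's email package. The message is built locally for illustration rather than fetched from Gmail, and the regular expressions are a crude stand-in for a real HTML parser.

```python
import re
from email import message_from_bytes, policy
from email.message import EmailMessage

# Build a hypothetical multipart/alternative message to stand in for the
# raw bytes that imaplib would return.
original = EmailMessage()
original["Subject"] = "Example"
original.set_content("first line\r\nsecond line\r\n")
original.add_alternative(
    "<div>first line<br>second line, which is long enough that a mail "
    "client might wrap it when producing the plain-text part</div>",
    subtype="html",
)
raw = original.as_bytes()

msg = message_from_bytes(raw, policy=policy.default)
html_part = msg.get_body(preferencelist=("html",))
# get_content() undoes the Content-Transfer-Encoding and charset.
html = html_part.get_content()

# Treat <br> as the sender's line breaks, then drop the remaining tags.
text = re.sub(r"<br\s*/?>", "\n", html, flags=re.IGNORECASE)
text = re.sub(r"<[^>]+>", "", text)
lines = text.strip().split("\n")
print(lines)
```

Real Gmail HTML is more involved (nested <div> elements, entities, quoted replies), so a proper HTML parser is likely a better fit than regular expressions for anything beyond a quick experiment.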
I don't know how many email clients interpret or generate this correctly, but RFC 3676 includes the following:
When creating flowed text, the generating agent wraps, that is,
inserts 'soft' line breaks as needed. Soft line breaks are added at
natural wrapping points, such as between words. A soft line break is
a SP CRLF sequence.
So if the previous line has a space at the end of it, the current line should be interpreted as a continuation of the previous line. I suggest reviewing the entire RFC.
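A simplified sketch of that rule follows. It assumes the part really is labeled format=flowed and ignores the rest of RFC 3676 (quoting with ">", space-stuffing, the DelSp parameter); unflow is a hypothetical helper name.

```python
def unflow(text):
    """Join soft-wrapped lines of format=flowed text (simplified sketch)."""
    result = []
    current = ""
    for line in text.split("\r\n"):
        # A trailing space marks a soft break, except in the "-- "
        # signature separator line.
        if line.endswith(" ") and line != "-- ":
            current += line
        else:
            result.append(current + line)
            current = ""
    if current:
        result.append(current)
    return "\r\n".join(result)


sample = (
    "This is a long line that was \r\n"
    "wrapped by the sender's client.\r\n"
    "Second paragraph.\r\n"
)
print(repr(unflow(sample)))
```

If the Content-Type header of the part does not carry format=flowed, trailing spaces carry no such meaning and this approach should not be applied.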
