unwanted line break in email header when using email.mime - python

I'm playing around with SMTP and using email.mime to provide the header structure. For some reason, when I try to add a header that exceeds a certain length, a line break is added into my header line.
e.g.
from email.mime.text import MIMEText
message = 'some message'
msg = MIMEText(message)
msg.add_header('some header', 'just wondering why this sentence is continually cut in half for a reason I can not find')
print msg['some header']
print msg
print msg['some header'] prints:-
some header: just wondering why this sentence is continually cut in half for a reason I can not find
print msg prints:-
some header: just wondering why this sentence is continually cut in half for a
reason I can not find
One thing I did discover is that the length at which it's cut off is a combination of the header title and its value. So when I shortened 'some header' to 'some', the line break moved to after 'reason' instead of before.
It's not just my viewing page width :), it actually sends the email with the new line character in the email header.
Any thoughts?

This is correct behaviour, and it's the email package that does this (as well as most of the email generating code out there.) RFC822 messages (and all successors to that standard) have a way of continuing headers so they don't have to be a single line. It's considered good practice to fold headers like that, and the tab character that indents the rest of the header's body means the header is continued.
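A minimal Python 3 sketch of the folding described above, assuming the modern email.message.EmailMessage API rather than the Python 2 code in the question. The header name X-Some-Header is a hypothetical stand-in for 'some header' (header names should not contain spaces). It shows that the folded line is only a wire representation and that parsing the message gives the original value back.

```python
from email import message_from_string, policy
from email.message import EmailMessage

value = ("just wondering why this sentence is continually cut in half "
         "for a reason I can not find")

msg = EmailMessage()
msg["X-Some-Header"] = value      # hypothetical header name, for illustration
msg.set_content("some message")

# Serialising folds any header line longer than the policy's line limit.
raw = msg.as_string()
header_block = raw.split("\n\n", 1)[0]
print(header_block)

# Parsing the serialised message unfolds the header again.
parsed = message_from_string(raw, policy=policy.default)
unfolded = parsed["X-Some-Header"]
print(unfolded)
```

The limit comes from the policy's max_line_length attribute, so a receiving mail client that follows the standard should see the header as one logical line.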

Related

I get an empty response to an AT command sent with pyserial. How can I get the "OK" response?

The response of print(msg) is b' ' and I'm expecting the "OK" response.
import serial
ser = serial.Serial(port='COM57')
if not ser.isOpen():
    ser.open()
print('COM57 is open', ser.isOpen())
at_cmd = 'AT'
ser.write(at_cmd.encode())
msg = ser.read(2)
print(msg)
print(type(msg))
ser.close()
There are several things you need to change here. As already mentioned in the comments, sending just "AT" will do nothing. You need to fill the gaps in your AT command knowledge and distinguish between an AT command and an AT command line. The best place to start is reading all of chapter 5 of the V.250 standard, which is the fundamental standard for AT command handling. Do not panic if there is something you do not quite get, but make sure you really get the syntax part (prefix + body + termination).
Note that despite the suggestions in the comments, an AT command line should be terminated by \r only and not \r\n ("The termination character may be selected by a user option (parameter S3), the default being CR (IA5 0/13).", and S3 should absolutely not be changed from its default value 13, so in practice you never have to deal with that register).
And with regard to reading and parsing the response, you need to put in a proper algorithm. You need to read the response character by character and combine those characters into response lines before you even think about trying to interpret their meaning (aka "framing" in data protocols). Following that, you need to detect whether you have received a final result code or not (and possibly handle intermediate result codes/information text).
There are some more details in these two answers.
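The reading steps described above can be sketched as follows. This is a rough illustration, not a complete V.250 parser: the list of final result codes is not exhaustive, the helper names (read_line, send_command) are made up for this example, and FakePort is a hypothetical stand-in for a serial.Serial object opened with a timeout, so the sketch runs without a modem.

```python
# Assumed, non-exhaustive set of final result codes.
FINAL_RESULT_CODES = ("OK", "ERROR", "NO CARRIER", "NO DIALTONE", "BUSY", "NO ANSWER")
FINAL_RESULT_PREFIXES = ("+CME ERROR:", "+CMS ERROR:")


def read_line(port):
    """Read one byte at a time until LF; return the line without CR, or None on timeout."""
    buf = bytearray()
    while True:
        ch = port.read(1)
        if not ch:                # empty read: timeout or end of data
            return None
        if ch == b"\n":
            return buf.decode("ascii", errors="replace").strip("\r")
        buf += ch


def send_command(port, command):
    """Send one AT command line (terminated by CR only) and collect lines up to the final result code."""
    port.write(command.encode("ascii") + b"\r")
    lines = []
    while True:
        line = read_line(port)
        if line is None:
            raise TimeoutError("no final result code received")
        if not line:              # skip the empty lines that frame each response
            continue
        lines.append(line)
        if line in FINAL_RESULT_CODES or line.startswith(FINAL_RESULT_PREFIXES):
            return lines


class FakePort:
    """Hypothetical test double that replays a canned modem reply."""

    def __init__(self, reply):
        self.reply = reply
        self.sent = b""

    def write(self, data):
        self.sent += data

    def read(self, size=1):
        chunk, self.reply = self.reply[:size], self.reply[size:]
        return chunk


port = FakePort(b"AT\r\r\nOK\r\n")   # command echo followed by the result code
response = send_command(port, "AT")
print(response)
```

With real hardware, the same functions should work on a serial.Serial instance created with a timeout, since its read(1) returns an empty bytes object when the timeout expires.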

Why is my script not consistently detecting contents in email bodies?

I've set up a sieve filter which invokes a Python script when it detects a postal service email about package deliveries. The sieve filter works fine and invokes the Python script reliably. However, the Python script does not reliably do its work. Here is my Python script, reduced to the relevant parts:
#!/usr/bin/env python3
import sys
from email import message_from_file
from email import policy
import subprocess
msg = message_from_file(sys.stdin, policy=policy.default)
if " out for delivery " in str(msg.get_body(("html"))):
    print("It is out for delivery")
I get email messages that have the string " out for delivery " in the body of the message but the script does not print out "It is out for delivery". I've already checked the HTML in the messages to make sure it is consistent and it is 100% consistent. The frustrating thing though is that if I save the message from my mail reader that should have triggered the script, and I feed it to sieve-test manually, then the script works 100% of the time!
How come my script never works during actual mail delivery but always works whenever I test it with sieve-test?
Notes:
The email contains only a single part, which is HTML, so I have to use the HTML part.
I know I can do a body test in sieve. I'm doing it in Python for reasons outside the scope of this question.
The problem is that you use str(msg.get_body(("html"))), which is unreliable for your purpose. What you get is the body of the message as a string, but it is still encoded for inclusion inside an email message. You're dealing with a MIME part, which may be encoded with quoted-printable, in which case the string you test for (" out for delivery ") could be split across multiple lines when encoded. The string against which you test could have the text you are looking for encoded like this:
[other text] out for=
delivery [more text]
The = sign is part of the encoding and indicates that the newline that follows is there because of the encoding rather than because it was there prior to encoding.
Ok, but why does it always work when you use sieve-test? What happens is that your mail reader encodes the message differently, and the way it encodes it, the text you are looking for is not split across lines, and your script works! It is perfectly correct for the mail reader to save the message with a different encoding so long as once the email is decoded its content has not changed.
What you should do is use msg.get_body(("html")).get_content(). This returns the body in decoded form, with the transfer encoding undone, so the text reads exactly as it did when the postal service composed the email.
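A small sketch of the difference, using a hand-written message as an assumption about what the delivered email might look like (the addresses and wording are invented). The quoted-printable soft line break splits the phrase in the encoded form, while get_content() removes it.

```python
from email import message_from_string, policy

# Invented single-part HTML message with a soft line break inside the phrase.
raw = (
    "From: sender@example.com\n"
    "To: recipient@example.com\n"
    "Subject: Tracking update\n"
    "MIME-Version: 1.0\n"
    'Content-Type: text/html; charset="utf-8"\n'
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "<p>Your package is out for=\n"
    " delivery today.</p>\n"
)

msg = message_from_string(raw, policy=policy.default)
body = msg.get_body(("html",))   # note the trailing comma: a one-element tuple

found_in_str = " out for delivery " in str(body)                 # still encoded
found_in_content = " out for delivery " in body.get_content()    # decoded

print(found_in_str, found_in_content)
```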

Extract Text from a Binary File (using Python 2.7 on Windows 7)

I have a binary file of size about 5MB.. which has lots of interspersed text.. and control characters..
This is actually an equivalent of an outlook .pst file for SITATEX Application (from SITA).
The file contains all the TEXT MESSAGES sent and received to and from outside world...(but the text has to be extracted through the binary control characters).. all the text messages are clearly available... with line ending ^M characters... etc.
for example: assume ^# ^X are control characters... \xaa with HEX aa, etc. loads of them around my required text extraction.
^#^#^#^#^#^#^#^#^#^#^#BLLBBCC^X^X^X^X^X^X^X^X^X
^X^X^X
MVT^M
EA1123 TEXT TEXT TEXT^M
END^M
\xaa^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#
^#^#^#^#^#^#^#^#^#^#^#TTBBTT^X^X^X^X^X^X^X^X^X
^X^X^X blah blah blah... of control characters.. and then the message comes..
MVT MESSAGE 2
ED1123
etc.
and so on.. for several messages.
Using Perl.. it is easy to do:
while (<>) {
use regular expression to split messages
m/ /
}
How would one do this in python easily..
How to read the file? binary and text interspersed
Eliminate unnecessary control characters
parse the messages in between two \xaa USEFUL TEXT INFORMATION \xaa (HEX 'aa')
print out the required stuff
Loop through all the lines.. and more files.
In the text file sample... I am interested in seeing.. BLLBBCC... and MVT and EA1123 and so on.
Please assist... If it is going to be very difficult in python.. I will have to think through the logic in perl itself.. as it (perl) doesn't throw lots of errors at me at least for the looping part of binary and text stuff.. and the regex.
Thanks.
Update 02Jan after reading your answers/comments
After going through S.Lott's comments and others... This is where I am at.. and it is working 80% ok.
import fileinput
import sys
import re
strfile = r'C:\Users\' \
r'\Learn\python\mvt\sitatex_test.msgs'
f = open(strfile, 'rb')
contents = f.read() # read whole file in contents
#extract the string between two \xaaU.. multiline pattern match
#with look ahead assertion
#and this is stored in a list with all msgs
msgs = re.findall(r'\xaaU.*?(?=\xaaU)', contents, re.I|re.DOTALL|re.M)
for msg in msgs:
    #loop through msgs.. to find the first msg then next and so on.
    print "## NEW MESSAGE STARTS HERE ##"
    #for each msg split the lines.. to read line by line
    # stored as list in msglines
    msglines = msg.splitlines()
    line = 0
#then process each msgline with a message
for msgline in msglines:
    line += 1
    #msgline = re.sub(r'[\x00]+', r' ', msgline)
    mystr = msgline
    print mystr
    textstrings = re.findall(r'[\x00\x20-\x7E]+', msgline)
So far so good.. still I am not completely done.. because I need to parse the text line by line and word by word.. to pickup (as an example) the origin address and headers, subject line, message body... by parsing the message through the control characters.
Now I am stuck with... how to print line by line with the control characters converted to \x00\x02.. etc (using the \xHH format).. but leave the normal readable text alone.
For example.. say I have this: assume ^# and ^X are some control characters
line1 = '^#UG^#^#^#^#^#^#^#^#^#^#BLLBBCC^X^X^X^X^X^X^X^X^X' (on the first line).
When I print the line as it is in IDLE (print line1), it prints only the first 2 or 3 characters and ignores the rest, because the output chokes on the control characters.
However, when I print with this: print re.findall(r'.*', line1)
['\xaaUG\x02\x05\x00\x04\x00\x00\x00\x05\x00\x00\x00....
x00\x00\x00..BLLBBCC\x00\x00N\x00N\\x00
002 010 180000 DEC 11', '']
It prints nicely with all the control characters converted to \xHH format.. and ascii text intact.. (just as I want it)..with one catch.. the list has two items.. with '' in the end.
What is the explanation for the empty string in the end?
How to avoid it... I just want the line converted nicely to a string (not a list). i.e. one line of binary/text to be converted to a string with \xHH codes.. leave the ASCII TEXT alone.
Is using re.findall(r'.*', line1) the only easy solution for this conversion, or is there another straightforward method to convert a '\x00string' to \xHH and TEXT (where it is a printable character or whitespace)?
Also.. any other useful comments to get the lines out nicely.
Thanks.
Update 2Jan2011 - Part 2
I have found out that re.findall(r'.+', line1) strips to
['\xaaUG\x02\x05\x00\x04\x00\x00\x00\x05\x00\x00\x00....
x00\x00\x00..BLLBBCC\x00\x00N\x00N\\x00
002 010 180000 DEC 11']
without the extra blank '' item in the list. I found this after a lot of trial and error.
Still I will need assistance to eliminate the list altogether but return just a string.
like this:
'\xaaUG\x02\x05\x00\x04..BLLBBCC..002 010 180000 DEC 11'
Added Info on 05Jan:
@John Machin
1) \xaaU is the delimiter between messages.. In the example.. I may have just left out in the samples. Please see below for one actual message that ends with \xaaU (but left out).
Following text is obtained from repr(msg between r'\xaaU.*?(?=\xaaU)')
I am trying to understand the binary format.. this is a typical message which is sent out
the first 'JJJOWXH' is the sender address.. anything that follows that has 7 alphanumeric is the receiver addresses.. Based on the sender address.. I can know whether this is a 'SND' or 'RCV'.. as the source is 'JJJOWXH'... This msg is a 'SND' as we are 'JJJOWXH'.
The message is addressed to: JJJKLXH.... JJJKRXH.... and so on.
As soon as all the.. \x00000000 finishes..
the sita header and subject starts
In this particular case... "\x00QN\x00HX\x00180001 \x00" this is the header.. and I am only interested all the stuff between \x00.
and the body comes next.. after the final \x00 or any other control character... In this case... it is:
COR\r\nMVT \r\nHX9136/17.BLNZ.JJJ\r\nAD2309/2314 EA0128
BBB\r\nDLRA/CI/0032/0022\r\nSI EET 02:14 HRS\r\n RA / 0032 DUE TO
LATE ARVL ACFT\r\n CI / 0022 OFFLOAD OVERHANG PALLET DUE INADEQUATE
PACKING LEADING TO \r\n SPACE PROBLEM
once the readable text ends... the first control character that appears until the end \xaaU is to be ignored... In above cases.. "SPACE PROBLEM".. is the last one.. then control characters starts... so to be ignored... sometimes the control characters are not there till the next \xaaU.
This is one complete message.
"\xaaU\x1c\x04\x02\x00\x05\x06\x1f\x00\x19\x00\x00\x00\xc4\x9d\xedN\x1a\x00?\x02\x02\x00B\x02\x02\x00E\x02\x07\x00\xff\xff\x00\x00\xff\xff\x00\x00\xff\xff\x00\x00M\x02\xec\x00\xff\xff\x00\x00\x00\x00?\x02M\x02\xec\x00\xff\xff\x00\x00\xff\xff\x00\x00\xff\xff\x00\x00\xff\xff\x00\x00\xff\xff\x00\x00:\x03\x10\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x7f\x00JJJOWXH\x00\x05w\x01x\x01\x00\x01JJJKLXH\x00\x00\x7f\x01\x80\x01\x00\x01JJJKRXH\x00F\x87\x01\x88\x01\x00\x01JJJFFXH\x00\xff\x8f\x01\x90\x01\x00\x01JJJFCXH\x00\xff\x97\x01\x98\x01\x00\x01JJJFAXH\x00\x00\x9f\x01\xa0\x01\x00\x01JJJKPXH\x00\x00\xa7\x01\xa8\x01\x00\x01HAKUOHU\x00\x00\xaf\x01\xb0\x01\x00\x01BBBHRXH\x00\x00\xb7\x01\xb8\x01\x00\x01BBBFFHX\x00\x00\xbf\x01\xc0\x01\x00\x01BBBOMHX\x00\x00\xc7\x01\xc8\x01\x00\x01BBBFMXH\x00\x00\xcf\x01\xd0\x01\x00\x01JJJHBER\x00\x00\xd7\x01\xd8\x01\x00\x01BBBFRUO\x00\x00\xdf\x01\xe0\x01\x00\x01BBBKKHX\x00\x00\xe7\x01\xe8\x01\x00\x01JJJLOTG\x00\x01\xef\x01\xf0\x01\x00\x01JJJLCTG\x00\x00\xf7\x01\xf8\x01\x00\x01HDQOMTG\x005\xff\x01\x00\x02\x00\x01CHACSHX\x00K\x07\x02\x08\x02\x00\x01JJJKZXH\x00F\x0f\x02\x10\x02\x00\x01BBBOMUO\x00
\x17\x02\x18\x02\x00\x01BBBORXH\x00 \x1f\x02
\x02\x00\x01BBBOPXH\x00W'\x02(\x02\x00\x01CHACSHX\x00
/\x020\x02\x00\x01JJJDBXH\x0007\x028\x02\x00010000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00QN\x00HX\x00180001
\x00COR\r\nMVT \r\nHX9136/17.BLNZ.JJJ\r\nAD2309/2314 EA0128
BBB\r\nDLRA/CI/0032/0022\r\nSI EET 02:14 HRS\r\n RA / 0032 DUE TO
LATE ARVL ACFT\r\n CI / 0022 OFFLOAD OVERHANG PALLET DUE INADEQUATE
PACKING LEADING TO \r\n SPACE
PROBLEM\x00D-\xedN\x00\x04\x1a\x00t<\x93\x01x\x00M_\x00"
2) I am not using .+ anymore after the 'repr' is known.
3) each Message is multiline.. and i need to preserve all the control characters to make some sense of this proprietary format.. that is why i needed repr to see it up close.
Hope this explains... This is just 1 message out of 1000s with in the file... and some are 'SND' and some are 'RCV'... and for 'RCV' there will not be '000000'.. and occasionally there are minor exceptions to the rule... but usually that is okay.
Any further suggestions anyone.. I am still working with the file.. to retrieve the text out intact... with sender and receiver addresses.
Thank you.
Python supports regexes too. I don't speak Perl, so I don't know exactly what your Perl code does, but this Python program might help you:
import re
with open('yourfile.pst', 'rb') as f:  # binary mode, so Windows does not translate or truncate the data
    contents = f.read()
textstrings = re.findall(r'[\x20-\x7E]+', contents)
That will get you a list of all strings of one or more ASCII printable characters in the file. That may not be exactly what you want, but you might be able to tweak it from there.
Note that if you're using Python 3, then you have to worry about the distinction between binary and textual data and it becomes a bit more complicated. I'm assuming you're in Python 2.
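For readers on Python 3, the same idea can be sketched with a bytes pattern. This is an illustration under assumptions: the sample bytes are invented to resemble the question's excerpt, and printable_runs and its min_length parameter are hypothetical names, not part of any library.

```python
import re

# One or more printable ASCII bytes (space through tilde).
PRINTABLE_RUN = re.compile(rb"[\x20-\x7E]+")


def printable_runs(data, min_length=1):
    """Return the printable ASCII runs in data as text, skipping runs shorter than min_length."""
    return [run.decode("ascii")
            for run in PRINTABLE_RUN.findall(data)
            if len(run) >= min_length]


# Invented sample; real data would come from open(path, "rb").read().
sample = (b"\x00\x00\x00BLLBBCC\x18\x18\x18\r\n"
          b"MVT\r\nEA1123 TEXT TEXT TEXT\r\nEND\r\n\xaa\x00")

runs = printable_runs(sample, min_length=3)
print(runs)
```

Raising min_length is one way to drop stray one- or two-byte runs that happen to be printable by chance.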
Q: How to read the file? binary and text interspersed
A: Don't bother, just read it as normal text and you'll be able to keep your binary/text dichotomy (otherwise you won't be able to regex it as easily)
fh = open('/path/to/my/file.ext', 'r')
fh.read()
Just in case you want to read binary later for some reason, you just add a b to the second input of the open:
fh = open('/path/to/my/file.ext', 'rb')
Q: Eliminate unnecessary control characters
A: Use the Python re module. Your next question sort of asks how.
Q: parse the messages in between two \xaa USEFUL TEXT INFORMATION \xaa (HEX 'aa')
A: re module has a findall function that works as you (mostly) expect.
import re
mytext = '\xaaUseful text that I want to keep\xaa^X^X^X\xaaOther text i like\xaa'
usefultext = re.findall('\xaa([a-zA-Z^!-~0-9 ]+)\xaa', mytext)
Q: print out the required stuff
A: There's a print function...
print usefultext
Q: Loop through all the lines.. and more files.
fh = open('/some/file.ext','r')
for lines in fh.readlines():
    #do stuff
    pass
I'll let you figure out the os module to figure out what files exist/how to iterate through them.
You say:
Still I will need assistance to eliminate the list altogether but return just a string. like this
In other words, you have foo = [some_string] and you are doing print foo which as a side does repr(some_string) but encloses it in square brackets which you don't want. So just do print repr(foo[0]).
There seem to be several things unexplained:
You say the useful text is bracketed by \xaaU but in the sample file instead of 2 occurrences of that delimiter there is only \xaa (missing U) near the start, and nothing else.
You say
I have found out that re.findall(r'.+', line1) strips to ...
That in effect is stripping out \n (but not \r!!) -- I thought line breaks would be worth preserving when attempting to recover an email message.
>>> re.findall(r'.+', 'abc\r\ndef\r\n\r\n')
['abc\r', 'def\r', '\r']
What have you done with the \r characters? Have you tested a multi-line message? Have you tested a multi-message file?
One is left to guess who or what is intended to consume your output; you write
I need to parse the text line by line and word by word
but you seem overly concerned with printing the message "legibly" with e.g. \xab instead of gibberish.
It looks like the last 6 or so lines in your latest code (for msgline in msglines: etc etc) should be indented one level.
Is it possible to clarify all of the above?
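One possible way to keep the line breaks while splitting on the \xaaU delimiter is sketched below in Python 3. It rests on assumptions taken from the question, not on any knowledge of the SITATEX format: that every message starts with \xaaU, and that the interesting text is made of printable ASCII plus CR and LF. The sample bytes and the function names are invented.

```python
import re

# A message runs from one \xaaU marker up to the next marker or the end of the data.
MESSAGE = re.compile(rb"\xaaU.*?(?=\xaaU|\Z)", re.DOTALL)
# Printable ASCII plus CR and LF, so multi-line text stays in one piece.
TEXT_RUN = re.compile(rb"[\r\n\x20-\x7E]+")


def split_messages(data):
    """Return each message, delimiter included, as a bytes object."""
    return MESSAGE.findall(data)


def text_runs(message):
    """Return the readable runs of one message, skipping the 2-byte delimiter."""
    return [run.decode("ascii") for run in TEXT_RUN.findall(message[2:])]


# Invented sample with two short messages.
sample = (b"\xaaU\x1c\x04JJJOWXH\x00\x05MVT \r\nEA0128\x00"
          b"\xaaU\x02BBBHRXH\x00")

messages = split_messages(sample)
extracted = [text_runs(m) for m in messages]
for runs in extracted:
    print([repr(run) for run in runs])   # repr() shows \r and \n without a list wrapper
```

Adding \Z to the lookahead means the last message in the file is not dropped for lack of a following delimiter.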

Python Emailing - Use of colon causes no output

Really odd experience with this which took me over an hour to figure out. I have a cgi script written in python which takes form data and forwards it to an email, the problem was, if there was a colon in the string it would cause python to output nothing to the email.
Does anyone know why this is?
For example:
output = "Print- Customer"
works, though:
output = "Print: Customer"
prints no output.
My email function works essentially:
server.sendmail(fromaddr, toaddrs, msg)
where msg = output
Just wondering if the colon is a special character in python string output
Colon isn't a special character in python string output, but it is special to email headers. Try inserting a blank line in the output:
output = "\nPrint: Customer"
Allow me to make a few guesses:
The mail is actually being sent, but the body appears to be empty (your question doesn't say this).
You're not using the builtin python mailing library.
If you open the mail in your mail reader, and look at the headers, the "print:" line will be present.
If so, the problem is that you're not ending the mail headers with a "\r\n" pair, and the mail reader thinks that "print:" is a mail header, while "print -" is part of the body of a mal-formed email.
If you add the "\r\n" after your headers, everything should be fine.
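A sketch of the effect, assuming Python 3 and the standard email package (the question's own code is not shown, so the addresses and subject here are invented). Parsing the bare string shows how a mail reader is likely to see it, and building the message with EmailMessage puts the blank line between headers and body automatically.

```python
from email import message_from_string, policy
from email.message import EmailMessage

# A bare "Name: value" line with no blank line before it parses as a header.
colon = message_from_string("Print: Customer", policy=policy.default)
dash = message_from_string("Print- Customer", policy=policy.default)
print(repr(colon["Print"]), repr(colon.get_payload()))
print(repr(dash.get_payload()))

# Letting the email package build the message keeps the text in the body.
msg = EmailMessage()
msg["From"] = "sender@example.com"        # invented addresses
msg["To"] = "recipient@example.com"
msg["Subject"] = "Form submission"
msg.set_content("Print: Customer")

wire = msg.as_string()
headers, _, body = wire.partition("\n\n")  # the first blank line ends the headers
print(body)
```

The resulting string could then be passed to server.sendmail(fromaddr, toaddrs, msg.as_string()) in place of the raw output.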

How to distinguish sender-generated carriage returns from word wrap auto-generated carriage returns in email body?

How can I distinguish sender-generated carriage returns from word wrap auto-generated carriage returns in an email body? I'm using Python imaplib to access Gmail and download message bodies like so:
import email
import imaplib
user = 'whoever@gmail.com'
pwd = 'password'
m = imaplib.IMAP4_SSL("imap.gmail.com")
m.login(user, pwd)
m.select("INBOX")
resp, items = m.search(None, "ALL")
items = items[0].split()
messages = []
for emailid in items:
    resp, data = m.fetch(emailid, "(RFC822)")
    email_body = data[0][1]
    mail = email.message_from_string(email_body)
    for part in mail.walk():
        if part.get_content_type() == 'text/plain':
            body = part.get_payload(decode=1)
            messages.append(body)
I'm focusing on the case of messages received from another Gmail user. The message body text has a number of carriage returns ('\r\n') in it. These fall into two classes: 1) those inserted by the sender of the email, the "true" returns, 2) those created by Gmail word wrapping at ~78 characters, the "false" returns. I want to remove the second class of carriage returns only. I'm sure I could come up with a programmatic approximation that searches for the '\r\n' at a window around every 78th character but that wouldn't be bulletproof and isn't what I want. Interestingly, I notice that when the message displays in Gmail in the web browser, there are not returns for the second class of carriage returns. Gmail somehow knows to remove/not display these specifically. How? Is there some special encoding I'm missing?
Gmail sends messages in the MIME multipart format, with both a text/plain version (what you are grabbing) and a text/html version. The latter is what contains fancy formatting like bold, italic, links, etc., and is what Gmail displays. While the text/html version is also line-broken at about 78 characters (the e-mail standard recommends that lines not exceed 78 characters and requires that they not exceed 998), the "real" line breaks that you are looking for are embedded therein as HTML <br> tags. You can see this yourself if you send yourself a message and then, using the little down-arrow next to the Reply button, click "Show original".
You cannot distinguish between "fake" and "real" line-breaks in the text/plain version of the message, at least not reliably (as you obviously know). You can, however, pull the text/html version instead, knowing then that the "real" line-breaks are the <br> tags, however you then have to deal with the additional HTML (as well as first correctly processing the "Content-Transfer-Encoding" used therein).
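A rough Python 3 sketch of that approach, under several assumptions: the message is built locally instead of fetched over IMAP, the markup is invented, and the tag stripping is a crude regular expression rather than a real HTML parser. get_content() takes care of the Content-Transfer-Encoding.

```python
import html
import re
from email import message_from_string, policy
from email.message import EmailMessage

# Build an invented multipart/alternative message to stand in for a fetched one.
outgoing = EmailMessage()
outgoing["Subject"] = "Example"
outgoing.set_content("First line the sender typed,\nwrapped by the mail system\nSecond line\n")
outgoing.add_alternative(
    "<div>First line the sender typed,\nwrapped by the mail system<br>Second line</div>",
    subtype="html",
)

mail = message_from_string(outgoing.as_string(), policy=policy.default)
markup = mail.get_body(("html",)).get_content()    # transfer encoding already undone

markup = re.sub(r"\s+", " ", markup)               # source newlines are not real breaks
text = re.sub(r"(?i)<br\s*/?>", "\n", markup)      # <br> tags are the real breaks
text = html.unescape(re.sub(r"<[^>]+>", "", text)) # crude tag removal
lines = [line.strip() for line in text.strip().splitlines()]
print(lines)
```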
I don't know how many email clients interpret or generate this correctly, but RFC 3676 includes the following:
When creating flowed text, the generating agent wraps, that is,
inserts 'soft' line breaks as needed. Soft line breaks are added at
natural wrapping points, such as between words. A soft line break is
a SP CRLF sequence.
So if the previous line has a space at the end of it, the current line should be interpreted as a continuation of the previous line. I suggest reviewing the entire RFC.
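A simplified sketch of that rule follows. It only applies when the part's Content-Type carries format=flowed, and it deliberately ignores other parts of RFC 3676 (quoting with ">", space-stuffing and the DelSp parameter), so treat it as an illustration, not a conforming implementation. The function name unflow is made up.

```python
def unflow(text):
    """Join lines that end in a space (soft breaks) with the line that follows."""
    paragraphs = []
    current = ""
    for line in text.splitlines():
        if line == "-- ":                 # signature separator: never a soft break
            if current:
                paragraphs.append(current)
                current = ""
            paragraphs.append(line)
        elif line.endswith(" "):          # soft break: keep the space and continue
            current += line
        else:                             # hard break: the paragraph ends here
            paragraphs.append(current + line)
            current = ""
    if current:
        paragraphs.append(current)
    return "\n".join(paragraphs)


flowed = ("This paragraph was wrapped \r\n"
          "by the sending agent.\r\n"
          "This line was typed as is.\r\n")
result = unflow(flowed)
print(result)
```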
