I am parsing E-Mails with the Python email module.
If I parse it with the Python E-Mail parser, it does not remove the tab in front of the header items:
from email.parser import Parser
from email.policy import default
testmail = """Date: Wed, 26 Jan 2022 10:45:29 +0100
Message-ID:
<123123123123123123123123123123123123123.testinst.themultiverse.com>
Subject:
=?iso-8859-1?Q?Auftragsbest=E4tigung_blablabla?=
=?iso-8859-1?Q?_one nice thing?=
Content Body Whatnot"""
message = Parser(policy=default).parsestr(testmail)
print(repr(message["Message-Id"]))
print(repr(message["Subject"]))
results in:
'\t<123123123123123123123123123123123123123.testinst.themultiverse.com>'
'\tAuftragsbestätigung blablabla one nice thing'
I have tried the different policies of the email parser, but I do not manage to remove the tab in the beginning. I saw the header_source_parse method of the EmailPolicy class does strip the whitespace, but only in combination with a space in the beginning.
<pythonlib>/email/policy.py:
[...]
value = value.lstrip(' \t') + ''.join(sourcelines[1:])
[...]
Not sure if that is intended behavior or a bug.
My question now: Is there a way in the standard library to do this, or do I need to write a custom policy? The E-Mails are unchanged from an IMAP Server (exchange) and it feels strange that the standard tools do not cover this.
Something let me think that the message is not strictly conformant to RFC5322.
We can see at 3.2.2. Folding White Space and Comments:
However, where CFWS occurs in this specification, it MUST NOT
be inserted in such a way that any line of a folded header field is
made up entirely of WSP characters and nothing else.
But for the Subject and Message-ID fields, the first line will only contain spaces before the first newline.
IIUC, it correspond to an obsolete syntax, because we find at 4. Obsolete Syntax:
Another key difference between the obsolete and the current syntax is
that the rule in section 3.2.2 regarding lines composed entirely of
white space in comments and folding white space does not apply.
The doc for EmailPolicy from Python Standard Library is even more explicit on what happens:
header_source_parse(sourcelines)
The name is parsed as everything up to the ‘:’ and returned unmodified. The value is determined by stripping leading whitespace off the remainder of the first line, joining all subsequent lines together, and stripping any trailing carriage return or linefeed characters.
As the tab occurs on the second line, it is not stripped.
I am unsure whether this interpretation is correct, but a possible workaround is to specialize a subclass or EmailPolicy to strip that initial line:
class ObsoletePolicy(email.policy.EmailPolicy):
def header_source_parse(self, sourcelines):
header, value = super().header_source_parse(sourcelines)
value = value.lstrip(' \t\r\n')
return header, value
If you use:
message = Parser(policy=ObsoletePolicy()).parsestr(testmail)
you will now get for print(repr(message['Subject'])):
'Auftragsbestätigung blablabla one nice thing'
Related
So here is an example of a multiline string in vscode/python:
Cursor is after the p , and then you press enter, and end up like this:
i.e. the string ends up indented, which seems what you almost never want - why have an arbitratly amount of whitespace on the next line of this string ?
Is there any way change this in vscode, i.e. for multiline strings, it should end up with this:
I think this problem is related to different coding styles of different people.
For example,
def example(x):
if x:
a = '''
This is help
'''
def example(x):
if x:
a = '''This is help
'''
The automatic indenting of vscode line breaks is based on code blocks. If you want Vscode can identify multiline string, I think it would be better to submit future request in github. I've submitted this issue for you.
I am not 100% sure if what OP meant is just to refer to the indentation in the editor (namely, VSC) or if, by this:
i.e. the string ends up indented, which seems what you almost never want - why have an arbitrary amount of white space on the next line of this string?
...they also meant to refer to the actual output of the multi-line string,
(or also, just in case anybody else finds this post looking for a way to avoid this affecting the actual output of the multi-line string), I'd like to add as a complementary answer (cannot comment yet) that this was already beautifully answered here.
If that's the case and you're reading this for that reason, in short, all you want is to import the standard lib 'inspect' and post-process your string with it, using the cleandoc method.
Without breaking the indentation in your IDE, this method makes sure to give you the string output you actually expected:
All leading whitespace is removed from the first line. Any leading whitespace that can be uniformly removed from the second line onwards is removed. Empty lines at the beginning and end are subsequently removed. Also, all tabs are expanded to spaces.
(From the docs link above)
Hope that helps anyone.
I'm using Python to parse a .csv file that contains line breaks in most values. This isn't an issue, since values are delimited by ".
However, I've noticed that during the construction of the .csv file at one point in time, long values were split into multiple lines (but kept within the same value), with an = character put at the end of one line to signify "the following line break is actually a concatenation". A minimal working example: the value
Hello, world!
How are you today?
could be represented as
"Hello, world!\n
How are you t=\n
oday?"
where \n denotes the one-byte line break character.
Does CSV have the concept of "line continuation characters"? The documentation of Python's csv library does not mention anything about it under the formatting section, and hence I wonder if this is common practice and if Python nevertheless has support. I know how to write a parser that concatenates these lines (a simple str.replace(v,"=\n","") probably suffices), but I'm just curious whether this is an idiosyncrasy of my file.
This seems to be not a feature of CSV, but rather of MIME (and since my dataset consists of e-mails, this solves my question).
This usage of equals characters is part of quoted-printable encoding, and can be handled by the quopri Python module. See this answer for more details.
Using this module is better than a simple str.replace(v, "=\n", ""), because e-mails can contain other quoted-printable tokens that need decoding and do not appear on line ends (e.g. =09 to represent a horizontal tab). With quopri, you would write:
import quopri
v = ...
original = quopri.decodestring(v.encode("utf-8")).decode("utf-8")
I have a similar problem to this question that I need to insert newlines in a YAML mapping value string and prefer not to insert \n myself. The answer suggests using:
Data: |
Some data, here and a special character like ':'
Another line of data on a separate line
instead of
Data: "Some data, here and a special character like ':'\n
Another line of data on a separate line"
which also adds a newline at the end, that is unacceptable.
I tried using Data: > but that showed to give completely different results. I have been stripping the final newline after reading in the yaml file, and of course that works but that is not elegant. Any better way to preserve newlines without adding an extra one at the end?
I am using python 2.7 fwiw
If you use | This makes a scalar into a literal block style scalar. But the default behaviour of |, is clipping and that doesn't get you the string you wanted (as that leaves the final newline).
You can "modify" the behaviour of | by attaching block chomping indicators
Strip
Stripping is specified by the “-” chomping indicator. In this case, the final line break and any trailing empty lines are excluded from the scalar’s content.
Clip
Clipping is the default behavior used if no explicit chomping indicator is specified. In this case, the final line break character is preserved in the scalar’s content. However, any trailing empty lines are excluded from the scalar’s content.
Keep
Keeping is specified by the “+” chomping indicator. In this case, the final line break and any trailing empty lines are considered to be part of the scalar’s content. These additional lines are not subject to folding.
By adding the stripchomping operator '-' to '|', you can prevent/strip the final newline:¹
import ruamel.yaml as yaml
yaml_str = """\
Data: |-
Some data, here and a special character like ':'
Another line of data on a separate line
"""
data = yaml.load(yaml_str)
print(data)
gives:
{'Data': "Some data, here and a special character like ':'\nAnother line of data on a separate line"}
¹ This was done using ruamel.yaml of which I am the author. You should get the same result with PyYAML (of which ruamel.yaml is a superset, preserving comments and literal scalar blocks on round-trip).
Could somebody tell me which character is a non-ASCII character in the following:
Columns(str) – comma-seperated list of values. Works only if format is tab or xls. For UnitprotKB, some possible columns are: id, entry name, length, organism. Some column names must be followed by a database name (i.e. ‘database(PDB)’). Again see uniprot website for more details. See also _valid_columns for the full list of column keyword.
Essentially I am defining a class and trying to give it a comment to define how it works:
def test(self,uniprot_id):
'''
Same as the UniProt.search() method arguments:
search(query, frmt='tab', columns=None, include=False, sort='score', compress=False, limit=None, offset=None, maxTrials=10)
query (str) -- query must be a valid uniprot query. See http://www.uniprot.org/help/text-search, http://www.uniprot.org/help/query-fields See also example below
frmt (str) -- a valid format amongst html, tab, xls, asta, gff, txt, xml, rdf, list, rss. If tab or xls, you can also provide the columns argument. (default is tab)
include (bool) -- include isoform sequences when the frmt parameter is fasta. Include description when frmt is rdf.
sort (str) -- by score by default. Set to None to bypass this behaviour
compress (bool) -- gzip the results
limit (int) -- Maximum number of results to retrieve.
offset (int) -- Offset of the first result, typically used together with the limit parameter.
maxTrials (int) -- this request is unstable, so we may want to try several time.
Columns(str) -- comma-seperated list of values. Works only if format is tab or xls. For UnitprotKB, some possible columns are: id, entry name, length, organism. Some column names must be followed by a database name (i.e. ‘database(PDB)’). Again see uniprot website for more details. See also _valid_columns for the full list of column keyword. '
'''
u = UniProt()
uniprot_entry = u.search(uniprot_id)
return uniprot_entry
Without the line 52, i.e. the one beginning with 'columns' in the quoted out comment block, this works as expected but as soon as I describe what 'columns' is I get the following error:
SyntaxError: Non-ASCII character '\xe2' in file /home/cw00137/Documents/Python/Identify_gene.py on line 52, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
Does anybody know what is going on?
You are using 'fancy' curly quotes in that line:
>>> u'‘database(PDB)’'
u'\u2018database(PDB)\u2019'
That's a U+2018 LEFT SINGLE QUOTATION MARK at the start and U+2019 RIGHT SINGLE QUOTATION MARK at the end.
Use ASCII quotes (U+0027 APOSTROPHE or U+0022 QUOTATION MARK) or declare an encoding other than ASCII for your source.
You are also using an U+2013 EN DASH:
>>> u'Columns(str) –'
u'Columns(str) \u2013'
Replace that with a U+002D HYPHEN-MINUS.
All three characters encode to UTF-8 with a leading E2 byte:
>>> u'\u2013 \u2018 \u2019'.encode('utf8')
'\xe2\x80\x93 \xe2\x80\x98 \xe2\x80\x99'
which you then see reflected in the SyntaxError exception message.
You may want to avoid using these characters in the first place. It could be that your OS is replacing these as you type, or you are using a word processor instead of a plain text editor to write your code and it is replacing these for you. You probably want to switch that feature off.
Previously encountered the same problem and same error, python2 will default to using ASCII encoding.
You can try to declare the following comment in the py file's first or second line:
# -*- coding: utf-8 -*-
I have a binary file of size about 5MB.. which has lots of interspersed text.. and control characters..
This is actually an equivalent of an outlook .pst file for SITATEX Application (from SITA).
The file contains all the TEXT MESSAGES sent and received to and from outside world...(but the text has to be extracted through the binary control characters).. all the text messages are clearly available... with line ending ^M characters... etc.
for example: assume ^# ^X are control characters... \xaa with HEX aa, etc. loads of them around my required text extraction.
^#^#^#^#^#^#^#^#^#^#^#BLLBBCC^X^X^X^X^X^X^X^X^X
^X^X^X
MVT^M
EA1123 TEXT TEXT TEXT^M
END^M
\xaa^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#
^#^#^#^#^#^#^#^#^#^#^#TTBBTT^X^X^X^X^X^X^X^X^X
^X^X^X blah blah blah... of control characters.. and then the message comes..
MVT MESSAGE 2
ED1123
etc.
and so on.. for several messages.
Using Perl.. it is easy to do:
while (<>) {
use regular expression to split messages
m/ /
}
How would one do this in python easily..
How to read the file? binary and text interspersed
Eliminate unnecessary control characters
parse the messages in between two \xaa USEFUL TEXT INFORMATION \xaa (HEX 'aa')
print out the required stuff
Loop through all the lines.. and more files.
In the text file sample... I am interested in seeing.. BLLBBCC... and MVT and EA1123 and so on.
Please assist... If it is going to be very difficult in python.. I will have to think through the logic in perl itself.. as it (perl) doesn't throw lots of errors at me at least for the looping part of binary and text stuff.. and the regex.
Thanks.
Update 02Jan after reading your answers/comments
After going through S.Lott's comments and others... This is where I am at.. and it is working 80% ok.
import fileinput
import sys
import re
strfile = r'C:\Users\' \
r'\Learn\python\mvt\sitatex_test.msgs'
f = open(strfile, 'rb')
contents = f.read() # read whole file in contents
#extract the string between two \xaaU.. multiline pattern match
#with look ahead assertion
#and this is stored in a list with all msgs
msgs = re.findall(r'\xaaU.*?(?=\xaaU)', contents, re.I|re.DOTALL|re.M)
for msg in msgs:
#loop through msgs.. to find the first msg then next and so on.
print "## NEW MESSAGE STARTS HERE ##"
#for each msg split the lines.. to read line by line
# stored as list in msglines
msglines = msg.splitlines()
line = 0
#then process each msgline with a message
for msgline in msglines:
line += 1
#msgline = re.sub(r'[\x00]+', r' ', msgline)
mystr = msgline
print mystr
textstrings = re.findall(r'[\x00\x20-\x7E]+', msgline)
So far so good.. still I am not completely done.. because I need to parse the text line by line and word by word.. to pickup (as an example) the origin address and headers, subject line, message body... by parsing the message through the control characters.
Now I am stuck with... how to print line by line with the control characters converted to \x00\x02.. etc (using the \xHH format).. but leave the normal readable text alone.
For example.. say I have this: assume ^# and ^X are some control characters
line1 = '^#UG^#^#^#^#^#^#^#^#^#^#BLLBBCC^X^X^X^X^X^X^X^X^X' (on the first line).
When I print the line as it is on IDLE.. print line1.. it prints only say the first 2 or 3 characters.. and ignores the rest due to the control characters get choked.
However, when I print with this: print re.findall(r'.*', line1)
['\xaaUG\x02\x05\x00\x04\x00\x00\x00\x05\x00\x00\x00....
x00\x00\x00..BLLBBCC\x00\x00N\x00N\\x00
002 010 180000 DEC 11', '']
It prints nicely with all the control characters converted to \xHH format.. and ascii text intact.. (just as I want it)..with one catch.. the list has two items.. with '' in the end.
What is the explanation for the empty string in the end?
How to avoid it... I just want the line converted nicely to a string (not a list). i.e. one line of binary/text to be converted to a string with \xHH codes.. leave the ASCII TEXT alone.
Is using re.findall(r'.*', line1) is the only easy solution.. to do this conversion.. or are there any other straightforward method.. to convert a '\x00string' to \xHH and TEXT (where it is a printable character or whitespace).
Also.. any other useful comments to get the lines out nicely.
Thanks.
Update 2Jan2011 - Part 2
I have found out that re.findall(r'.+', line1) strips to
['\xaaUG\x02\x05\x00\x04\x00\x00\x00\x05\x00\x00\x00....
x00\x00\x00..BLLBBCC\x00\x00N\x00N\\x00
002 010 180000 DEC 11']
without the extra blank '' item in the list. This finding after numerous trial and errors.
Still I will need assistance to eliminate the list altogether but return just a string.
like this:
'\xaaUG\x02\x05\x00\x04..BLLBBCC..002 010 180000 DEC 11'
Added Info on 05Jan:
#John Machin
1) \xaaU is the delimiter between messages.. In the example.. I may have just left out in the samples. Please see below for one actual message that ends with \xaaU (but left out).
Following text is obtained from repr(msg between r'\xaaU.*?(?=\xaaU)')
I am trying to understand the binary format.. this is a typical message which is sent out
the first 'JJJOWXH' is the sender address.. anything that follows that has 7 alphanumeric is the receiver addresses.. Based on the sender address.. I can know whether this is a 'SND' or 'RCV'.. as the source is 'JJJOWXH'... This msg is a 'SND' as we are 'JJJOWXH'.
The message is addressed to: JJJKLXH.... JJJKRXH.... and so on.
As soon as all the.. \x00000000 finishes..
the sita header and subject starts
In this particular case... "\x00QN\x00HX\x00180001 \x00" this is the header.. and I am only interested all the stuff between \x00.
and the body comes next.. after the final \x00 or any other control character... In this case... it is:
COR\r\nMVT \r\nHX9136/17.BLNZ.JJJ\r\nAD2309/2314 EA0128
BBB\r\nDLRA/CI/0032/0022\r\nSI EET 02:14 HRS\r\n RA / 0032 DUE TO
LATE ARVL ACFT\r\n CI / 0022 OFFLOAD OVERHANG PALLET DUE INADEQUATE
PACKING LEADING TO \r\n SPACE PROBLEM
once the readable text ends... the first control character that appears until the end \xaaU is to be ignored... In above cases.. "SPACE PROBLEM".. is the last one.. then control characters starts... so to be ignored... sometimes the control characters are not there till the next \xaaU.
This is one complete message.
"\xaaU\x1c\x04\x02\x00\x05\x06\x1f\x00\x19\x00\x00\x00\xc4\x9d\xedN\x1a\x00?\x02\x02\x00B\x02\x02\x00E\x02\x07\x00\xff\xff\x00\x00\xff\xff\x00\x00\xff\xff\x00\x00M\x02\xec\x00\xff\xff\x00\x00\x00\x00?\x02M\x02\xec\x00\xff\xff\x00\x00\xff\xff\x00\x00\xff\xff\x00\x00\xff\xff\x00\x00\xff\xff\x00\x00:\x03\x10\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x7f\x00JJJOWXH\x00\x05w\x01x\x01\x00\x01JJJKLXH\x00\x00\x7f\x01\x80\x01\x00\x01JJJKRXH\x00F\x87\x01\x88\x01\x00\x01JJJFFXH\x00\xff\x8f\x01\x90\x01\x00\x01JJJFCXH\x00\xff\x97\x01\x98\x01\x00\x01JJJFAXH\x00\x00\x9f\x01\xa0\x01\x00\x01JJJKPXH\x00\x00\xa7\x01\xa8\x01\x00\x01HAKUOHU\x00\x00\xaf\x01\xb0\x01\x00\x01BBBHRXH\x00\x00\xb7\x01\xb8\x01\x00\x01BBBFFHX\x00\x00\xbf\x01\xc0\x01\x00\x01BBBOMHX\x00\x00\xc7\x01\xc8\x01\x00\x01BBBFMXH\x00\x00\xcf\x01\xd0\x01\x00\x01JJJHBER\x00\x00\xd7\x01\xd8\x01\x00\x01BBBFRUO\x00\x00\xdf\x01\xe0\x01\x00\x01BBBKKHX\x00\x00\xe7\x01\xe8\x01\x00\x01JJJLOTG\x00\x01\xef\x01\xf0\x01\x00\x01JJJLCTG\x00\x00\xf7\x01\xf8\x01\x00\x01HDQOMTG\x005\xff\x01\x00\x02\x00\x01CHACSHX\x00K\x07\x02\x08\x02\x00\x01JJJKZXH\x00F\x0f\x02\x10\x02\x00\x01BBBOMUO\x00
\x17\x02\x18\x02\x00\x01BBBORXH\x00 \x1f\x02
\x02\x00\x01BBBOPXH\x00W'\x02(\x02\x00\x01CHACSHX\x00
/\x020\x02\x00\x01JJJDBXH\x0007\x028\x02\x00010000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00QN\x00HX\x00180001
\x00COR\r\nMVT \r\nHX9136/17.BLNZ.JJJ\r\nAD2309/2314 EA0128
BBB\r\nDLRA/CI/0032/0022\r\nSI EET 02:14 HRS\r\n RA / 0032 DUE TO
LATE ARVL ACFT\r\n CI / 0022 OFFLOAD OVERHANG PALLET DUE INADEQUATE
PACKING LEADING TO \r\n SPACE
PROBLEM\x00D-\xedN\x00\x04\x1a\x00t<\x93\x01x\x00M_\x00"
2) I am not using .+ anymore after the 'repr' is known.
3) each Message is multiline.. and i need to preserve all the control characters to make some sense of this proprietary format.. that is why i needed repr to see it up close.
Hope this explains... This is just 1 message out of 1000s with in the file... and some are 'SND' and some are 'RCV'... and for 'RCV' there will not be '000000'.. and occasionally there are minor exceptions to the rule... but usually that is okay.
Any further suggestions anyone.. I am still working with the file.. to retrieve the text out intact... with sender and receiver addresses.
Thank you.
Python supports regexes too. I don't speak Perl, so I don't know exactly what your Perl code does, but this Python program might help you:
import re
with open('yourfile.pst') as f:
contents = f.read()
textstrings = re.findall(r'[\x20-\x7E]+', contents)
That will get you a list of all strings of one or more ASCII printable characters in the file. That may not be exactly what you want, but you might be able to tweak it from there.
Note that if you're using Python 3, then you have to worry about the distinction between binary and textual data and it becomes a bit more complicated. I'm assuming you're in Python 2.
Q: How to read the file? binary and text interspersed
A: Don't bother, just read it as normal text and you'll be able to keep your binary/text dichotomy (otherwise you won't be able to regex it as easily)
fh = open('/path/to/my/file.ext', 'r')
fh.read()
Just in case you want to read binary later for some reason, you just add a b to the second input of the open:
fh = open('/path/to/my/file.ext', 'rb')
Q: Eliminate unnecessary control characters
A: Use the python re module. Your next question sorta ask how
Q: parse the messages in between two \xaa USEFUL TEXT INFORMATION \xaa (HEX 'aa')
A: re module has a findall function that works as you (mostly) expect.
import re
mytext = '\xaaUseful text that I want to keep\xaa^X^X^X\xaaOther text i like\xaa'
usefultext = re.findall('\xaa([a-zA-Z^!-~0-9 ]+)\xaa', mytext)
Q: print out the required stuff
*A: There's a print function...
print usefultext
Q: Loop through all the lines.. and more files.
fh = open('/some/file.ext','r')
for lines in fh.readlines():
#do stuff
I'll let you figure out the os module to figure out what files exist/how to iterate through them.
You say:
Still I will need assistance to eliminate the list altogether but return just a string. like this
In other words, you have foo = [some_string] and you are doing print foo which as a side does repr(some_string) but encloses it in square brackets which you don't want. So just do print repr(foo[0]).
There seem to be several things unexplained:
You say the useful text is bracketed by \xaaU but in the sample file instead of 2 occurrences of that delimiter there is only \xaa (missingU) near the start, and nothing else.
You say
I have found out that re.findall(r'.+', line1) strips to ...
That in effect is stripping out \n (but not \r!!) -- I thought line breaks would be worth preserving when attempting to recover an email message.
>>> re.findall(r'.+', 'abc\r\ndef\r\n\r\n')
['abc\r', 'def\r', '\r']
What you you done with the \r characters? Have you tested a multi-line message? Have you tested a multi-message file?
One is left to guess who or what is intended to consume your output; you write
I need to parse the text line by line and word by word
but you seem overly concerned with printing the message "legibly" with e.g. \xab instead of gibberish.
It looks like the last 6 or so lines in your latest code (for msgline in msglines: etc etc) should be indented one level.
Is it possible to clarify all of the above?