I have a script which parses raw emails. It works fine for multipart emails, but how do I parse non-multipart emails?
mail = email.message_from_string(raw_message)
if mail.is_multipart():
    data = extract(mail)
else:
    payload = mail.get_payload(decode=True)
Raw Email:
Return-Path: <>
X-Original-To: bounces#mydomain.com
Delivered-To: bounces#mydomain.com
Received: from inmumg01.tcs.com (inmumg01.tcs.com [219.64.33.12])
by smtp.mydomain.com (Postfix) with ESMTP id 603693FE11
for <bounces#mydomain.com>; Tue, 15 Mar 2016 04:39:36 -0400 (EDT)
Received: from localhost by inmumg01.tcs.com;
15 Mar 2016 14:09:38 +0530
Message-Id: <5aaa80$2543de#inmumg01.tcs.com>
Date: 15 Mar 2016 14:09:38 +0530
To: bounces#mydomain.com
From: "Mail Delivery System" <mail.notification#tcs.com>
Subject: Undeliverable Message
The following message to <vipul4.j#tcs.com> was undeliverable.
The reason for the problem:
5.1.0 - Unknown address error 550-'vipul4.j#tcs.com... No such user'
The IP address of the MTA to which the message could not be sent:
172.17.9.35
---------- A copy of the message begins below this line ----------
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: A0CGBgCEyedW/3zwO4tdHgEBAg4BgklMUm2nXoJekBMBDYFmBxUFAQ2HGwI4FAEBAQEBAQFkJ4RLIAoTAQEECCwGSQMBCQICMTsFHASHJ10FCatgZ4RBAQSLKQaBD4REgkIBhlERAWqCNBOBJ5MJhEuBLwKEPogSAoIuhnOFYo1YgUUBAUKBNgwBgj5OB4kqgTIBAQE
X-IPAS-Result: A0CGBgCEyedW/3zwO4tdHgEBAg4BgklMUm2nXoJekBMBDYFmBxUFAQ2HGwI4FAEBAQEBAQFkJ4RLIAoTAQEECCwGSQMBCQICMTsFHASHJ10FCatgZ4RBAQSLKQaBD4REgkIBhlERAWqCNBOBJ5MJhEuBLwKEPogSAoIuhnOFYo1YgUUBAUKBNgwBgj5OB4kqgTIBAQE
X-IronPort-AV: E=Sophos;i="5.24,338,1454956200";
d="scan'208,217";a="72486315"
X-Amp-Result: Clean
X-Amp-File-Uploaded: False
Received: from smtp.mydomain.com ([139.59.240.124])
by inmumg01.tcs.com with ESMTP/TLS/DHE-RSA-AES256-SHA; 15 Mar 2016 14:09:37 +0530
Received: from 128.199.202.14 (unknown [128.199.202.14])
(Authenticated sender: mailsender)
by smtp.mydomain.com (Postfix) with ESMTPA id 0D41F3FE11
for <vipul4.j#tcs.com>; Tue, 15 Mar 2016 04:39:33 -0400 (EDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kitemailer.com;
s=kitemail; t=1458031173;
bh=1KFZSL77mNYuQ3iTjpNMdGcBOp2a4pGQnLVlq49ZrGg=;
h=Date:From:To:Subject:List-Unsubscribe;
b=Xxaf++WE0B7HL+FN28O76df7gYNEIKzk8eE9VpxrnMBCpGWPKWBMMfVDfCyie3NBJ
GJiMxn/Yhn+ey6Mr5R5AK5JO5n72yWlytLm0RepMEydaeHHVQPx7bE+LMDMlORSFin
bWdnz58lNMuZ3w9qtqjCXt22Sk5yXfCO71tRgfus=
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=mydomain.com;
s=kitemail; t=1458031173;
bh=1KFZSL77mNYuQ3iTjpNMdGcBOp2a4pGQnLVlq49ZrGg=;
h=Date:From:To:Subject:List-Unsubscribe;
b=FiGdkSE9LCjYkfYyWq65GbZoMZVCQs5OXXJA35CyGQtjPWbvwIKvx7Z6Ff39EBRLf
Vu+6PUrvwyZLFh/1CW0NGOHDgUDjWWQ2jHfnNpJ9QEbHgOwomuMty10HDeZnIr0zM7
8mFCgeCbiiyusQkhmXh5aYqqD+Q/1wFcrpLpkBZc=
Date: Tue, 15 Mar 2016 04:39:31 -0400 (EDT)
From: Kitemailer Newsletter <info#kitemailer.com>
To: vipul4.j#tcs.com
Message-ID: <15106466-1b13-4d64-b220-15f05f4815b7-1458031171312#smtp.mydomain.com>
Subject: KiteMailer | New Features this Week
MIME-Version: 1.0
Content-Type: multipart/mixed;
boundary="----=_Part_44_1398250960.1458031171306"
List-Unsubscribe: <http://example.com/unsubscribe/dmlwdWw0LmpAdGNzLmNvbSM5Ng==>
Feedback-ID: 19:96:1520615:MyDomain
In the else branch I want to extract information from the message. If I try payload['to'], it raises TypeError: string indices must be integers, not str
OK, suppose there is no way to do this with the email library (I don't know whether there is). You can convert the raw message into a dictionary and look up the fields you need.
This is your raw message:
raw_message='''Return-Path: <>
X-Original-To: bounces#mydomain.com
Delivered-To: bounces#mydomain.com
Received: from inmumg01.tcs.com (inmumg01.tcs.com [219.64.33.12])
by smtp.mydomain.com (Postfix) with ESMTP id 603693FE11
for <bounces#mydomain.com>; Tue, 15 Mar 2016 04:39:36 -0400 (EDT)
Received: from localhost by inmumg01.tcs.com;
15 Mar 2016 14:09:38 +0530
Message-Id: <5aaa80$2543de#inmumg01.tcs.com>
Date: 15 Mar 2016 14:09:38 +0530
To: bounces#mydomain.com
From: "Mail Delivery System" <mail.notification#tcs.com>
Subject: Undeliverable Message
The following message to <vipul4.j#tcs.com> was undeliverable.
The reason for the problem:
5.1.0 - Unknown address error 550-'vipul4.j#tcs.com... No such user'
The IP address of the MTA to which the message could not be sent:
172.17.9.35
---------- A copy of the message begins below this line ----------
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: A0CGBgCEyedW/3zwO4tdHgEBAg4BgklMUm2nXoJekBMBDYFmBxUFAQ2HGwI4FAEBAQEBAQFkJ4RLIAoTAQEECCwGSQMBCQICMTsFHASHJ10FCatgZ4RBAQSLKQaBD4REgkIBhlERAWqCNBOBJ5MJhEuBLwKEPogSAoIuhnOFYo1YgUUBAUKBNgwBgj5OB4kqgTIBAQE
X-IPAS-Result: A0CGBgCEyedW/3zwO4tdHgEBAg4BgklMUm2nXoJekBMBDYFmBxUFAQ2HGwI4FAEBAQEBAQFkJ4RLIAoTAQEECCwGSQMBCQICMTsFHASHJ10FCatgZ4RBAQSLKQaBD4REgkIBhlERAWqCNBOBJ5MJhEuBLwKEPogSAoIuhnOFYo1YgUUBAUKBNgwBgj5OB4kqgTIBAQE
X-IronPort-AV: E=Sophos;i="5.24,338,1454956200";
d="scan'208,217";a="72486315"
X-Amp-Result: Clean
X-Amp-File-Uploaded: False
Received: from smtp.mydomain.com ([139.59.240.124])
by inmumg01.tcs.com with ESMTP/TLS/DHE-RSA-AES256-SHA; 15 Mar 2016 14:09:37 +0530
Received: from 128.199.202.14 (unknown [128.199.202.14])
(Authenticated sender: mailsender)
by smtp.mydomain.com (Postfix) with ESMTPA id 0D41F3FE11
for <vipul4.j#tcs.com>; Tue, 15 Mar 2016 04:39:33 -0400 (EDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kitemailer.com;
s=kitemail; t=1458031173;
bh=1KFZSL77mNYuQ3iTjpNMdGcBOp2a4pGQnLVlq49ZrGg=;
h=Date:From:To:Subject:List-Unsubscribe;
b=Xxaf++WE0B7HL+FN28O76df7gYNEIKzk8eE9VpxrnMBCpGWPKWBMMfVDfCyie3NBJ
GJiMxn/Yhn+ey6Mr5R5AK5JO5n72yWlytLm0RepMEydaeHHVQPx7bE+LMDMlORSFin
bWdnz58lNMuZ3w9qtqjCXt22Sk5yXfCO71tRgfus=
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=mydomain.com;
s=kitemail; t=1458031173;
bh=1KFZSL77mNYuQ3iTjpNMdGcBOp2a4pGQnLVlq49ZrGg=;
h=Date:From:To:Subject:List-Unsubscribe;
b=FiGdkSE9LCjYkfYyWq65GbZoMZVCQs5OXXJA35CyGQtjPWbvwIKvx7Z6Ff39EBRLf
Vu+6PUrvwyZLFh/1CW0NGOHDgUDjWWQ2jHfnNpJ9QEbHgOwomuMty10HDeZnIr0zM7
8mFCgeCbiiyusQkhmXh5aYqqD+Q/1wFcrpLpkBZc=
Date: Tue, 15 Mar 2016 04:39:31 -0400 (EDT)
From: Kitemailer Newsletter <info#kitemailer.com>
To: vipul4.j#tcs.com
Message-ID: <15106466-1b13-4d64-b220-15f05f4815b7-1458031171312#smtp.mydomain.com>
Subject: KiteMailer | New Features this Week
MIME-Version: 1.0
Content-Type: multipart/mixed;
boundary="----=_Part_44_1398250960.1458031171306"
List-Unsubscribe: <http://example.com/unsubscribe/dmlwdWw0LmpAdGNzLmNvbSM5Ng==>
Feedback-ID: 19:96:1520615:MyDomain'''
I'm using your code to get payload:
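# In case the message is not multipart
import email
mail = email.message_from_string(raw_message)
payload = mail.get_payload(decode=True)
if not isinstance(payload, str):  # Python 3 returns bytes when decode=True
    payload = payload.decode("utf-8", "replace")
# Keep "Name: value" lines whose name contains no spaces
pairs = (elt.split(":", 1) for elt in payload.split("\n") if ":" in elt)
mail_dico = {k.strip(): v.strip() for k, v in pairs if " " not in k.strip()}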
here is your dictionary:
{'Content-Type': 'multipart/mixed;',
'DKIM-Signature': 'v=1; a=rsa-sha256; c=relaxed/simple; d=mydomain.com;',
'Date': 'Tue, 15 Mar 2016 04:39:31 -0400 (EDT)',
'Feedback-ID': '19:96:1520615:MyDomain',
'From': 'Kitemailer Newsletter <info#kitemailer.com>',
'List-Unsubscribe': '<http://example.com/unsubscribe/dmlwdWw0LmpAdGNzLmNvbSM5Ng==>',
'MIME-Version': '1.0',
'Message-ID': '<15106466-1b13-4d64-b220-15f05f4815b7-1458031171312#smtp.mydomain.com>',
'Received': 'from 128.199.202.14 (unknown [128.199.202.14])',
'Subject': 'KiteMailer | New Features this Week',
'To': 'vipul4.j#tcs.com',
'X-Amp-File-Uploaded': 'False',
'X-Amp-Result': 'Clean',
'X-IPAS-Result': 'A0CGBgCEyedW/3zwO4tdHgEBAg4BgklMUm2nXoJekBMBDYFmBxUFAQ2HGwI4FAEBAQEBAQFkJ4RLIAoTAQEECCwGSQMBCQICMTsFHASHJ10FCatgZ4RBAQSLKQaBD4REgkIBhlERAWqCNBOBJ5MJhEuBLwKEPogSAoIuhnOFYo1YgUUBAUKBNgwBgj5OB4kqgTIBAQE',
'X-IronPort-AV': 'E=Sophos;i="5.24,338,1454956200";',
'X-IronPort-Anti-Spam-Filtered': 'true',
'X-IronPort-Anti-Spam-Result': 'A0CGBgCEyedW/3zwO4tdHgEBAg4BgklMUm2nXoJekBMBDYFmBxUFAQ2HGwI4FAEBAQEBAQFkJ4RLIAoTAQEECCwGSQMBCQICMTsFHASHJ10FCatgZ4RBAQSLKQaBD4REgkIBhlERAWqCNBOBJ5MJhEuBLwKEPogSAoIuhnOFYo1YgUUBAUKBNgwBgj5OB4kqgTIBAQE',
'h=Date': 'From:To:Subject:List-Unsubscribe;'}
now you can access your elements:
print(mail_dico["To"])
>> 'vipul4.j#tcs.com'
print(mail_dico["Subject"])
>> 'KiteMailer | New Features this Week'
This is probably not the best way to do this but I hope it helped.
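For what it's worth, the TypeError in the question comes from indexing the payload, which is a plain string; header lookups belong on the Message object itself. Below is a minimal sketch of that idea. The message in `sample`, the addresses in it, and the `marker` text used to find the embedded copy are all made up for illustration, and it assumes the bounce carries the original message's headers as plain text in its body:

```python
import email

# Hypothetical, shortened bounce message for illustration only.
sample = (
    "To: bounces@example.com\n"
    "From: Mail Delivery System <mailer@example.com>\n"
    "Subject: Undeliverable Message\n"
    "\n"
    "The following message was undeliverable.\n"
    "---------- A copy of the message begins below this line ----------\n"
    "To: someone@example.com\n"
    "Subject: Original subject\n"
)

mail = email.message_from_string(sample)

# Headers of the bounce itself live on the Message object,
# whether or not it is multipart. Lookups are case-insensitive.
bounce_to = mail["to"]
bounce_subject = mail["Subject"]

# get_payload() on a non-multipart message returns the body as a string.
body = mail.get_payload()

# Assumption: the original headers start right after the marker line.
marker = "begins below this line ----------\n"
embedded = body.split(marker, 1)[1] if marker in body else ""
original = email.message_from_string(embedded)
original_to = original["To"]
original_subject = original["Subject"]

print(bounce_to, bounce_subject, original_to, original_subject)
```

Parsing the embedded copy a second time with the same library avoids hand-written splitting, and it handles folded header lines as well.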
Hey, I'm trying to send a request with parameters to a website using the POST method. Here is the form data the website sends:
First: name
Last: lastname
Email: email#gmail.com
CountryID: 53
City: Misgav Regional Council
Postcode: 4848484
petition_id: 613908
action: sign
Is_shared_on_fb: 0
cid: 1508
skin: community_petitions
lang: he
supports_history_api: true
secure_validation: Thu Dec 06 2018 21:45:24 GMT+0200 (Israel Standard Time)
used_js: Thu Dec 06 2018 21:45:24 GMT+0200 (Israel Standard Time)
postaction_data: eyJ1c2VyX2lkIjo5OTg0NTUwNSwic3RvcmFnZSI6eyJwb3N0YWN0aW9uX3BhZ2UiOiJwb3N0YWN0aW9uLXVzZXItcGV0aXRpb24iLCJwZXRpdGlvbklEIjoiNjEzOTA4In19
cookie_id:
The problem is that I am not sure what data requests.post() needs. I've tried the following with no success. Thank you for your help!
import requests
p = requests.post('https://secure.avaaz.org/he/community_petitions/brykh_bbyt_spr_hmnhlt_shl_byt_spr/?asljNcb', data={'First': 'name', 'Last': 'lastname', 'Email': 'email#gmail.com' , 'CountryID': '53', 'City': 'Misgav Regional Council' , 'PostCode': '4848484', 'petition_id': '613908', 'action': 'sign'})
print(p.text)
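One detail that may be worth checking: the form data lists the field as Postcode, while the call above sends PostCode. The sketch below only builds and decodes a form-encoded body with the standard library, without any network call, to show the exact field names that would be transmitted. The values are the placeholders from the question (the email address is a made-up example), and whether this mismatch is the actual cause is an assumption; the server may also expect the remaining fields, cookies, or headers.

```python
from urllib.parse import urlencode, parse_qs

# Field names copied from the form data listed in the question.
# Values are placeholders, not real data.
form = {
    "First": "name",
    "Last": "lastname",
    "Email": "email@example.com",
    "CountryID": "53",
    "City": "Misgav Regional Council",
    "Postcode": "4848484",  # the form lists "Postcode", not "PostCode"
    "petition_id": "613908",
    "action": "sign",
}

# Encode the dict as an application/x-www-form-urlencoded body.
body = urlencode(form)
print(body)

# Decode it again to see the field names a server would receive.
decoded = parse_qs(body)
```

The same dictionary can be passed as the data argument of requests.post().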
I am working with an AWS Lambda function written in Python 2.7.x which downloads an image, saves it to /tmp, then uploads the image file back to a bucket.
My image metadata starts out in the original bucket with HTTP headers such as Content-Type = image/jpeg, among others.
After saving my image with PIL, all headers are gone and I am left with Content-Type = binary/octet-stream.
From what I can tell, image.save is losing the headers due to the way PIL works. How do I either preserve the metadata or at least apply it to the newly saved image?
I have seen posts suggesting that this metadata is in EXIF, but I tried to get the EXIF info from the original file and apply it to the saved file with no luck. I am not clear whether it's in the EXIF data anyway.
Partial code to give idea of what I am doing:
def resize_image(image_path):
    with Image.open(image_path) as image:
        image.save(upload_path, optimize=True)
def handler(event, context):
    global upload_path
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        # Read the key from the current record rather than always from Records[0]
        key = urllib.unquote_plus(record['s3']['object']['key'].encode("utf8"))
        download_path = '/tmp/{}{}'.format(uuid.uuid4(), file_name)
        upload_path = '/tmp/resized-{}'.format(file_name)
        s3_client.download_file(bucket, key, download_path)
        resize_image(download_path)
        s3_client.upload_file(upload_path, '{}resized'.format(bucket), key)
Thanks to Sergey, I changed to using get_object but response is missing Metadata:
response = s3_client.get_object(Bucket=bucket,Key=key)
response= {u'Body': , u'AcceptRanges': 'bytes', u'ContentType': 'image/jpeg', 'ResponseMetadata': {'HTTPStatusCode': 200, 'RetryAttempts': 0, 'HostId': 'au30hBMN37/ti0WCfDqlb3t9ehainumc9onVYWgu+CsrHtvG0u/zmgcOIvCCBKZgQrGoooZoW9o=', 'RequestId': '1A94D7F01914A787', 'HTTPHeaders': {'content-length': '84053', 'x-amz-id-2': 'au30hBMN37/ti0WCfDqlb3t9ehainumc9onVYWgu+CsrHtvG0u/zmgcOIvCCBKZgQrGoooZoW9o=', 'accept-ranges': 'bytes', 'expires': 'Sun, 01 Jan 2034 00:00:00 GMT', 'server': 'AmazonS3', 'last-modified': 'Fri, 23 Dec 2016 15:21:56 GMT', 'x-amz-request-id': '1A94D7F01914A787', 'etag': '"9ba59e5457da0dc40357f2b53715619d"', 'cache-control': 'max-age=2592000,public', 'date': 'Fri, 23 Dec 2016 15:21:58 GMT', 'content-type': 'image/jpeg'}}, u'LastModified': datetime.datetime(2016, 12, 23, 15, 21, 56, tzinfo=tzutc()), u'ContentLength': 84053, u'Expires': datetime.datetime(2034, 1, 1, 0, 0, tzinfo=tzutc()), u'ETag': '"9ba59e5457da0dc40357f2b53715619d"', u'CacheControl': 'max-age=2592000,public', u'Metadata': {}}
If I use:
metadata = response['ResponseMetadata']['HTTPHeaders']
metadata = {'content-length': '84053', 'x-amz-id-2': 'f5UAhWzx7lulo3cMVF8hdVRbHnhdnjHWRDl+LDFkYm9pubjL0A01L5yWjgDjWRE4TjRnjqDeA0U=', 'accept-ranges': 'bytes', 'expires': 'Sun, 01 Jan 2034 00:00:00 GMT', 'server': 'AmazonS3', 'last-modified': 'Fri, 23 Dec 2016 15:47:09 GMT', 'x-amz-request-id': '4C69DF8A58EF3380', 'etag': '"9ba59e5457da0dc40357f2b53715619d"', 'cache-control': 'max-age=2592000,public', 'date': 'Fri, 23 Dec 2016 15:47:10 GMT', 'content-type': 'image/jpeg'}
Saving with put_object
s3_client.put_object(Bucket=bucket+'resized',Key=key, Metadata=metadata, Body=downloadfile)
This creates a lot of extra metadata in S3. It also does not save Content-Type as image/jpeg but as binary/octet-stream, and it creates the metadata entry x-amz-meta-content-type = image/jpeg instead.
You are confusing S3 metadata, stored by AWS S3 along with an object, and EXIF metadata, stored inside the file itself.
download_file() doesn't get object attributes from S3. You should use get_object() instead: https://boto3.readthedocs.io/en/latest/reference/services/s3.html#S3.Client.get_object
Then you can use put_object() with the same attributes to upload the new file: https://boto3.readthedocs.io/en/latest/reference/services/s3.html#S3.Client.put_object
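The put_object call in the question passes the raw HTTP headers as Metadata, and that parameter holds user-defined metadata stored under x-amz-meta-* names, which is why content-type ended up as x-amz-meta-content-type. A minimal sketch of carrying over the object attributes as top-level arguments instead; the helper name put_kwargs_from_get_response and the trimmed response dict are illustrative, not part of boto3:

```python
# Keys that get_object() returns and put_object() also accepts as arguments.
COPYABLE_KEYS = (
    "ContentType",
    "CacheControl",
    "ContentDisposition",
    "ContentEncoding",
    "ContentLanguage",
    "Expires",
    "Metadata",
)

def put_kwargs_from_get_response(response):
    """Pick the object attributes worth carrying over to the new upload."""
    return {k: response[k] for k in COPYABLE_KEYS if k in response}

# Trimmed-down response modelled on the one shown in the question.
response = {
    "ContentType": "image/jpeg",
    "CacheControl": "max-age=2592000,public",
    "ContentLength": 84053,
    "Metadata": {},
}
extra = put_kwargs_from_get_response(response)
print(extra)

# With a real client this would be used roughly as (names are placeholders):
# s3_client.put_object(Bucket=target_bucket, Key=key, Body=data, **extra)
```

ContentLength is deliberately left out, since the resized file will have a different size.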
Content type information is not stored in the file you upload; it has to be guessed or extracted somehow. This is something you must do manually or with tools. With a fairly small dictionary you can guess most file types.
When you upload a file or object, you have the chance to specify its content type. Otherwise S3 defaults to binary/octet-stream.
Using the boto3 Python package, for instance:
s3client.upload_file(
    Filename=local_path,
    Bucket=bucket,
    Key=remote_path,
    ExtraArgs={
        "ContentType": "image/jpeg"
    }
)
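Rather than hard-coding the type, it can be guessed from the file name. A small sketch using the standard library's mimetypes module; the helper name guess_content_type and its default value are assumptions made for illustration:

```python
import mimetypes

def guess_content_type(path, default="binary/octet-stream"):
    """Guess a Content-Type from the file extension, with a generic fallback."""
    content_type, _encoding = mimetypes.guess_type(path)
    return content_type or default

print(guess_content_type("photo.jpg"))
print(guess_content_type("resized.zzzunknown"))
```

The result can then be passed as the "ContentType" entry of ExtraArgs. Guessing by extension does not inspect the file contents, so a misnamed file will get the wrong type.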
I'm trying to parse this HTML into a dictionary.
My current code has a lot of logic in it.
It smells bad. I use lxml to help me parse it.
Is there a recommended way to parse this kind of HTML, which does not have much of a well-formed DOM?
Thanks so much.
original html
<p><strong>Departs:</strong> 5:15:00AM, Sat, Nov 28, 2015 - Taipei</p>
<p><strong>Arrives:</strong> 8:00:00AM, Sat, Nov 28, 2015 - Bangkok - Don Mueang</p>
<p><strong>Flight duration:</strong> 3h 45m</p>
<p><strong>Operated by:</strong> NokScoot</p>
expected result
{
Departs: "5:15:00AM, Sat, Nov 28, 2015",
Arrives: "5:15:00AM, Sat, Nov 28, 2015",
Flight duration: "3h 45m"
...
}
current code (implementing)
doc_root = html.document_fromstring(resp.text)
for ele in doc_root.xpath('//ul[@class="tb_body"]'):
    if has_stops(ele.xpath('.//li[@class="tb_body_flight"]//span[@class="has_cuspopup"]')):
        continue
    set_trace()
    from_city = ele.xpath('.//li[@class="tb_body_city"]')[0]
    set_trace()
    sub_ele = ele.xpath('.//li[@class="tb_body_flight"]//span[@class="has_cuspopup"]')
    set_trace()
I created an example for the HTML you provided. It uses the popular Beautiful Soup library.
from bs4 import BeautifulSoup
data = '<p><strong>Departs:</strong> 5:15:00AM, Sat, Nov 28, 2015 - Taipei</p>\
<p><strong>Arrives:</strong> 8:00:00AM, Sat, Nov 28, 2015 - Bangkok - Don Mueang</p>\
<p><strong>Flight duration:</strong> 3h 45m</p>\
<p><strong>Operated by:</strong> NokScoot</p>'
soup = BeautifulSoup(data, 'html.parser')
res = {p.contents[0].text: p.contents[1].split(' - ')[0].strip() for p in soup.find_all('p')}
print(res)
Output:
{
'Departs:': '5:15:00AM, Sat, Nov 28, 2015',
'Flight duration:': '3h 45m',
'Operated by:': 'NokScoot',
'Arrives:': '8:00:00AM, Sat, Nov 28, 2015'
}
I think you should avoid relying on attributes if you want to keep your code compact.
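If installing a third-party library is not an option, the same label/value extraction can be sketched with the standard library's html.parser instead of Beautiful Soup. This is a different technique from the answer above, the class name LabelValueParser is made up for illustration, and it assumes every entry has the shape <p><strong>label</strong> value</p>:

```python
from html.parser import HTMLParser

class LabelValueParser(HTMLParser):
    """Collect {label: value} pairs from <p><strong>label</strong> value</p>."""

    def __init__(self):
        super().__init__()
        self.result = {}
        self._in_strong = False
        self._label = None

    def handle_starttag(self, tag, attrs):
        if tag == "strong":
            self._in_strong = True

    def handle_endtag(self, tag):
        if tag == "strong":
            self._in_strong = False
        elif tag == "p":
            self._label = None  # the label only applies within its paragraph

    def handle_data(self, data):
        if self._in_strong:
            # Text inside <strong> is the label; drop the trailing colon.
            self._label = data.strip().rstrip(":")
        elif self._label is not None and data.strip():
            # Keep only the part before the first " - " (drops the city).
            self.result[self._label] = data.split(" - ")[0].strip()

snippet = (
    "<p><strong>Departs:</strong> 5:15:00AM, Sat, Nov 28, 2015 - Taipei</p>"
    "<p><strong>Flight duration:</strong> 3h 45m</p>"
)
parser = LabelValueParser()
parser.feed(snippet)
parser.close()
print(parser.result)
```

Unlike the Beautiful Soup output, the keys here have the trailing colon removed, which matches the expected result in the question.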
Here's an example email header,
header = """
From: Media Temple user (mt.kb.user#gmail.com)
Subject: article: A sample header
Date: January 25, 2011 3:30:58 PM PDT
To: user#example.com
Return-Path: <mt.kb.user#gmail.com>
Envelope-To: user#example.com
Delivery-Date: Tue, 25 Jan 2011 15:31:01 -0700
Received: from :po-out-1718.google.com ([72.14.252.155]:54907) by cl35.gs01.gridserver.com with esmtp (Exim 4.63) (envelope-from <mt.kb.user#gmail.com>) id 1KDoNH-0000f0-RL for user#example.com; Tue, 25 Jan 2011 15:31:01 -0700
Dkim-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:message-id:date:from:to :subject:mime-version:content-type; bh=+JqkmVt+sHDFIGX5jKp3oP18LQf10VQjAmZAKl1lspY=; b=F87jySDZnMayyitVxLdHcQNL073DytKRyrRh84GNsI24IRNakn0oOfrC2luliNvdea LGTk3adIrzt+N96GyMseWz8T9xE6O/sAI16db48q4Iqkd7uOiDvFsvS3CUQlNhybNw8m CH/o8eELTN0zbSbn5Trp0dkRYXhMX8FTAwrH0=
Domainkey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=message-id:date:from:to:subject:mime-version:content-type; b=wkbBj0M8NCUlboI6idKooejg0sL2ms7fDPe1tHUkR9Ht0qr5lAJX4q9PMVJeyjWalH 36n4qGLtC2euBJY070bVra8IBB9FeDEW9C35BC1vuPT5XyucCm0hulbE86+uiUTXCkaB 6ykquzQGCer7xPAcMJqVfXDkHo3H61HM9oCQM=
Message-Id: <c8f49cec0807011530k11196ad4p7cb4b9420f2ae752#mail.gmail.com>
Mime-Version: 1.0
Content-Type: multipart/alternative; boundary="----=_Part_3927_12044027.1214951458678"
X-Spam-Status: score=3.7 tests=DNS_FROM_RFC_POST, HTML_00_10, HTML_MESSAGE, HTML_SHORT_LENGTH version=3.1.7
X-Spam-Level: ***
Message Body: **The email message body**
"""
The header is stored as a string. How do I parse it so that I can map it to a dictionary, with the header field names as the keys and the field values as the values?
I want a dictionary like this,
header_dict = {
'From': 'Media Temple user (mt.kb.user#gmail.com)',
'Subject': 'article: A sample header',
'Date': 'January 25, 2011 3:30:58 PM PDT',
'and so on': .. . . . .. . . .. .
. . . . .. . . . .. . . . . .
}
I made a list of fields required,
header_reqd = ['From:','Subject:','Date:','To:','Return-Path:','Envelope-To:','Delivery-Date:','Received:','Dkim-Signature:','Domainkey-Signature:','Message-Id:','Mime-Version:','Content-Type:','X-Spam-Status:','X-Spam-Level:','Message Body:']
These list items can likely serve as the keys for the dictionary.
It seems most of these answers have overlooked the Python email parser, and their output is not correct because the values keep a leading space. Also, the OP has perhaps made a typo by including a leading newline in the header string, which needs to be stripped for the email parser to work.
from email.parser import HeaderParser
header = header.strip() # Fix incorrect formatting
email_message = HeaderParser().parsestr(header)
dict(email_message)
Output (truncated):
>>> from pprint import pprint
>>> pprint(dict(email_message))
{'Content-Type': 'multipart/alternative; '
'boundary="----=_Part_3927_12044027.1214951458678"',
'Date': 'January 25, 2011 3:30:58 PM PDT',
'Delivery-Date': 'Tue, 25 Jan 2011 15:31:01 -0700',
...
'Subject': 'article: A sample header',
'To': 'user#example.com',
'X-Spam-Level': '***',
'X-Spam-Status': 'score=3.7 tests=DNS_FROM_RFC_POST, HTML_00_10, '
'HTML_MESSAGE, HTML_SHORT_LENGTH version=3.1.7'}
Duplicate header keys
Be aware that email message headers can contain duplicate keys, as mentioned in the Python documentation for email.message:
Headers are stored and returned in case-preserving form, but field names are matched case-insensitively. Unlike a real dict, there is an ordering to the keys, and there can be duplicate keys. Additional methods are provided for working with headers that have duplicate keys.
For example, when the following email message is converted to a Python dict, only the first Received header is retained.
headers = HeaderParser().parsestr("""Received: by mx0047p1mdw1.sendgrid.net with SMTP id 6WCVv7KAWn Wed, 27 Jul 2016 20:53:06 +0000 (UTC)
Received: from mail-io0-f169.google.com (mail-io0-f169.google.com [209.85.223.169]) by mx0047p1mdw1.sendgrid.net (Postfix) with ESMTPS id AA9FFA817F2 for <example#example.comom>; Wed, 27 Jul 2016 20:53:06 +0000 (UTC)
Received: by mail-io0-f169.google.com with SMTP id b62so81593819iod.3 for <example#example.comom>; Wed, 27 Jul 2016 13:53:06 -0700 (PDT)""")
dict(headers)
{'Received': 'by mx0047p1mdw1.sendgrid.net with SMTP id 6WCVv7KAWn Wed, 27 Jul 2016 20:53:06 +0000 (UTC)'}
Use the get_all method to check for duplicates:
headers.get_all('Received')
['by mx0047p1mdw1.sendgrid.net with SMTP id 6WCVv7KAWn Wed, 27 Jul 2016 20:53:06 +0000 (UTC)', 'from mail-io0-f169.google.com (mail-io0-f169.google.com [209.85.223.169]) by mx0047p1mdw1.sendgrid.net (Postfix) with ESMTPS id AA9FFA817F2 for <example#example.comom>; Wed, 27 Jul 2016 20:53:06 +0000 (UTC)', 'by mail-io0-f169.google.com with SMTP id b62so81593819iod.3 for <example#example.comom>; Wed, 27 Jul 2016 13:53:06 -0700 (PDT)']
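If a plain dictionary is still wanted, one option is to map each field name to a list of all its values. A minimal sketch, assuming a short made-up header block (`raw` and the host names in it are illustrative):

```python
from collections import defaultdict
from email.parser import HeaderParser

# Short made-up header block with a repeated field, for illustration.
raw = (
    "Received: by mx.example.net with SMTP id 1\n"
    "Received: from mail.example.org by mx.example.net\n"
    "Subject: hello\n"
)
msg = HeaderParser().parsestr(raw)

# items() yields every (name, value) pair, duplicates included.
grouped = defaultdict(list)
for name, value in msg.items():
    grouped[name].append(value)

print(dict(grouped))
```

Note that the grouping is keyed on the name exactly as written, so "Received" and "received" would end up under separate keys unless the names are normalised first.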
You can split the string on newlines, then split each line once on ":".
>>> my_header = {}
>>> for x in header.strip().split("\n"):
...     x = x.split(":", 1)
...     my_header[x[0]] = x[1]
...
header = """From: Media Temple user (mt.kb.user#gmail.com)
Subject: article: A sample header
Date: January 25, 2011 3:30:58 PM PDT
To: user#example.com
Return-Path: <mt.kb.user#gmail.com>
Envelope-To: user#example.com
Delivery-Date: Tue, 25 Jan 2011 15:31:01 -0700
Received: from :po-out-1718.google.com ([72.14.252.155]:54907) by cl35.gs01.gridserver.com with esmtp (Exim 4.63) (envelope-from <mt.kb.user#gmail.com>) id 1KDoNH-0000f0-RL for user#example.com; Tue, 25 Jan 2011 15:31:01 -0700
Dkim-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:message-id:date:from:to :subject:mime-version:content-type; bh=+JqkmVt+sHDFIGX5jKp3oP18LQf10VQjAmZAKl1lspY=; b=F87jySDZnMayyitVxLdHcQNL073DytKRyrRh84GNsI24IRNakn0oOfrC2luliNvdea LGTk3adIrzt+N96GyMseWz8T9xE6O/sAI16db48q4Iqkd7uOiDvFsvS3CUQlNhybNw8m CH/o8eELTN0zbSbn5Trp0dkRYXhMX8FTAwrH0=
Domainkey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=message-id:date:from:to:subject:mime-version:content-type; b=wkbBj0M8NCUlboI6idKooejg0sL2ms7fDPe1tHUkR9Ht0qr5lAJX4q9PMVJeyjWalH 36n4qGLtC2euBJY070bVra8IBB9FeDEW9C35BC1vuPT5XyucCm0hulbE86+uiUTXCkaB 6ykquzQGCer7xPAcMJqVfXDkHo3H61HM9oCQM=
Message-Id: <c8f49cec0807011530k11196ad4p7cb4b9420f2ae752#mail.gmail.com>
Mime-Version: 1.0
Content-Type: multipart/alternative; boundary="----=_Part_3927_12044027.1214951458678"
X-Spam-Status: score=3.7 tests=DNS_FROM_RFC_POST, HTML_00_10, HTML_MESSAGE, HTML_SHORT_LENGTH version=3.1.7
X-Spam-Level: ***
Message Body: **The email message body**
"""
Split into individual lines then split each line once on :
from pprint import pprint as pp
pp(dict(line.split(":",1) for line in header.splitlines()))
Output:
{'Content-Type': ' multipart/alternative; '
'boundary="----=_Part_3927_12044027.1214951458678"',
'Date': ' January 25, 2011 3:30:58 PM PDT',
'Delivery-Date': ' Tue, 25 Jan 2011 15:31:01 -0700',
'Dkim-Signature': ' v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; '
's=gamma; '
'h=domainkey-signature:received:received:message-id:date:from:to '
':subject:mime-version:content-type; '
'bh=+JqkmVt+sHDFIGX5jKp3oP18LQf10VQjAmZAKl1lspY=; '
'b=F87jySDZnMayyitVxLdHcQNL073DytKRyrRh84GNsI24IRNakn0oOfrC2luliNvdea '
'LGTk3adIrzt+N96GyMseWz8T9xE6O/sAI16db48q4Iqkd7uOiDvFsvS3CUQlNhybNw8m '
'CH/o8eELTN0zbSbn5Trp0dkRYXhMX8FTAwrH0=',
'Domainkey-Signature': ' a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; '
'h=message-id:date:from:to:subject:mime-version:content-type; '
'b=wkbBj0M8NCUlboI6idKooejg0sL2ms7fDPe1tHUkR9Ht0qr5lAJX4q9PMVJeyjWalH '
'36n4qGLtC2euBJY070bVra8IBB9FeDEW9C35BC1vuPT5XyucCm0hulbE86+uiUTXCkaB '
'6ykquzQGCer7xPAcMJqVfXDkHo3H61HM9oCQM=',
'Envelope-To': ' user#example.com',
'From': ' Media Temple user (mt.kb.user#gmail.com)',
'Message Body': ' **The email message body**',
'Message-Id': ' '
'<c8f49cec0807011530k11196ad4p7cb4b9420f2ae752#mail.gmail.com>',
'Mime-Version': ' 1.0',
'Received': ' from :po-out-1718.google.com ([72.14.252.155]:54907) by '
'cl35.gs01.gridserver.com with esmtp (Exim 4.63) (envelope-from '
'<mt.kb.user#gmail.com>) id 1KDoNH-0000f0-RL for '
'user#example.com; Tue, 25 Jan 2011 15:31:01 -0700',
'Return-Path': ' <mt.kb.user#gmail.com>',
'Subject': ' article: A sample header',
'To': ' user#example.com',
'X-Spam-Level': ' ***',
'X-Spam-Status': ' score=3.7 tests=DNS_FROM_RFC_POST, HTML_00_10, '
'HTML_MESSAGE, HTML_SHORT_LENGTH version=3.1.7'}
line.split(":",1) makes sure we only split once on :, so if there are any : characters in the values we won't end up splitting on those as well. You end up with sublists that are key/value pairs, so calling dict creates the dictionary from those pairs.
split will work for you:
Demo:
>>> result = {}
>>> for i in header.split("\n"):
...     i = i.strip()
...     if i:
...         k, v = i.split(":", 1)
...         result[k] = v
output:
>>> import pprint
>>> pprint.pprint(result)
{'Content-Type': ' multipart/alternative; boundary="----=_Part_3927_12044027.1214951458678"',
'Date': ' January 25, 2011 3:30:58 PM PDT',
'Delivery-Date': ' Tue, 25 Jan 2011 15:31:01 -0700',
'Dkim-Signature': ' v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:message-id:date:from:to :subject:mime-version:content-type; bh=+JqkmVt+sHDFIGX5jKp3oP18LQf10VQjAmZAKl1lspY=; b=F87jySDZnMayyitVxLdHcQNL073DytKRyrRh84GNsI24IRNakn0oOfrC2luliNvdea LGTk3adIrzt+N96GyMseWz8T9xE6O/sAI16db48q4Iqkd7uOiDvFsvS3CUQlNhybNw8m CH/o8eELTN0zbSbn5Trp0dkRYXhMX8FTAwrH0=',
'Domainkey-Signature': ' a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=message-id:date:from:to:subject:mime-version:content-type; b=wkbBj0M8NCUlboI6idKooejg0sL2ms7fDPe1tHUkR9Ht0qr5lAJX4q9PMVJeyjWalH 36n4qGLtC2euBJY070bVra8IBB9FeDEW9C35BC1vuPT5XyucCm0hulbE86+uiUTXCkaB 6ykquzQGCer7xPAcMJqVfXDkHo3H61HM9oCQM=',
'Envelope-To': ' user#example.com',
'From': ' Media Temple user (mt.kb.user#gmail.com)',
'Message Body': ' **The email message body**',
'Message-Id': ' <c8f49cec0807011530k11196ad4p7cb4b9420f2ae752#mail.gmail.com>',
'Mime-Version': ' 1.0',
'Received': ' from :po-out-1718.google.com ([72.14.252.155]:54907) by cl35.gs01.gridserver.com with esmtp (Exim 4.63) (envelope-from <mt.kb.user#gmail.com>) id 1KDoNH-0000f0-RL for user#example.com; Tue, 25 Jan 2011 15:31:01 -0700',
'Return-Path': ' <mt.kb.user#gmail.com>',
'Subject': ' article: A sample header',
'To': ' user#example.com',
'X-Spam-Level': ' ***',
'X-Spam-Status': ' score=3.7 tests=DNS_FROM_RFC_POST, HTML_00_10, HTML_MESSAGE, HTML_SHORT_LENGTH version=3.1.7'}
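The split-based approaches keep the leading space in each value and raise ValueError on any line without a colon, such as a folded continuation line. A hedged sketch of a hand-rolled variant that strips the values and joins continuation lines; the function name parse_header_lines and the sample text are illustrative, and the email parser remains the more robust choice:

```python
def parse_header_lines(text):
    """Split 'Name: value' lines into a dict, joining folded continuation lines."""
    result = {}
    last_key = None
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line[0] in " \t" and last_key is not None:
            # A line starting with whitespace continues the previous header.
            result[last_key] += " " + line.strip()
            continue
        key, sep, value = line.partition(":")
        if not sep:
            continue  # no colon, so not a header line
        last_key = key.strip()
        result[last_key] = value.strip()
    return result

sample = (
    "Subject: article: A sample header\n"
    "Content-Type: multipart/alternative;\n"
    "    boundary=\"abc\"\n"
    "X-Spam-Level: ***\n"
)
parsed = parse_header_lines(sample)
print(parsed)
```

As with the other dictionary-based answers, a repeated field name overwrites the earlier value.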
When using the Python swiftclient module I can POST to an object with a header of X-Delete-At/After and an epoch, but how do I show the expiration time of the object? I was doing some testing and it seems that the file is always being expired immediately, e.g. where I set the time for 100 days in the future:
>>> swift.put_object('container1','test_file01.txt','This is line 1 in the file test_file01.txt done at: %s' % datetime.now().strftime('%Y-%m-%d %H:%M:%S,%f'))
'4b3faf0b79d97f5e478949e7d6c4c575'
>>> swift.head_object('container1','test_file01.txt')
{'content-length': '78', 'server': 'Jetty(7.6.4.v20120524)', 'last-modified': 'Wed, 23 Apr 2014 17:09:55 GMT', 'etag': '4b3faf0b79d97f5e478949e7d6c4c575', 'x-timestamp': '1398272995', 'date': 'Wed, 23 Apr 2014 17:09:59 GMT', 'content-type': 'application/octet-stream'}
>>> swift.post_object('container1','test_file01.txt',headers={'X-Delete-At':(datetime.now(pytz.timezone('GMT')) + timedelta(days=100)).strftime('%s')})
>>> swift.head_object('container1','test_file01.txt')
Traceback (most recent call last):
File "<pyshell#121>", line 1, in <module>
swift.head_object('container1','test_file01.txt')
File "/usr/local/lib/python2.7/dist-packages/swiftclient/client.py", line 1279, in head_object
return self._retry(None, head_object, container, obj)
File "/usr/local/lib/python2.7/dist-packages/swiftclient/client.py", line 1189, in _retry
rv = func(self.url, self.token, *args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/swiftclient/client.py", line 853, in head_object
http_response_content=body)
ClientException: Object HEAD failed: http://10.249.238.135:9024:9024/v1/rjm-vnx-namespace01/container1/test_file01.txt 404 Not Found
So it seems it was expired immediately. My questions are:
Am I setting the expiration correctly? I would like to be able to set it on an existing object rather than at object creation time, but perhaps I have to do it when the object is being created?
Is there a way to see the expiration time? Obviously, if it's not working correctly then there's no good way to see it, but if it were, does head_object() return that information?
Thanks,
Rob
Never mind, I figured it out. By setting an "after" I realized that the value apparently needs to be in milliseconds. So when I changed it to:
>>> swift.post_object('container1','test_file01.txt',headers={'X-Delete-At':int((datetime.now(pytz.timezone('GMT')) + timedelta(days=100)).strftime('%s'))*1000})
>>> swift.head_object('container1','test_file01.txt')
{'content-length': '78', 'x-delete-at':'1406932148000', 'server': 'Jetty(7.6.4.v20120524)', 'last-modified': 'Wed, 23 Apr 2014 17:29:06 GMT', 'etag': '0baf8b37f374c94e59a05a7f7b339811', 'x-timestamp': '1398274146', 'date': 'Wed, 23 Apr 2014 17:29:08 GMT', 'content-type': 'application/octet-stream'}
Then it worked as expected.
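For reference, strftime('%s') is not a documented Python format code and depends on the platform's C library, so computing the epoch value directly may be more portable. The sketch below shows one way to build the header value and to turn the x-delete-at value returned by head_object() back into a datetime. Both helper names are made up, and the milliseconds flag is an assumption based on what worked on the deployment described above; other Swift-compatible servers may expect plain seconds.

```python
from datetime import datetime, timedelta, timezone

def delete_at_header(days, milliseconds=False):
    """Return an X-Delete-At value `days` from now as an epoch string."""
    when = datetime.now(timezone.utc) + timedelta(days=days)
    seconds = int(when.timestamp())
    return str(seconds * 1000 if milliseconds else seconds)

def expiry_from_headers(headers, milliseconds=False):
    """Read 'x-delete-at' from head_object()-style headers as a UTC datetime."""
    raw = headers.get("x-delete-at")
    if raw is None:
        return None  # no expiration set on the object
    value = int(raw)
    if milliseconds:
        value //= 1000
    return datetime.fromtimestamp(value, tz=timezone.utc)

# Headers modelled on the head_object() output shown above.
headers = {"x-delete-at": "1406932148000", "content-length": "78"}
expires = expiry_from_headers(headers, milliseconds=True)
print(expires.isoformat())
```

The value from delete_at_header(100, milliseconds=True) could be passed as the 'X-Delete-At' entry of the headers argument to post_object().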
Rob