I'm trying to parse some HTML into a dictionary.
My current code has a lot of logic in it, and it smells bad. I use lxml to help me parse it.
Is there a recommended way to parse this kind of HTML, which doesn't have much well-formed DOM structure?
Thanks so much.
original html
<p><strong>Departs:</strong> 5:15:00AM, Sat, Nov 28, 2015 - Taipei</p>
<p><strong>Arrives:</strong> 8:00:00AM, Sat, Nov 28, 2015 - Bangkok - Don Mueang</p>
<p><strong>Flight duration:</strong> 3h 45m</p>
<p><strong>Operated by:</strong> NokScoot</p>
expected result
{
Departs: "5:15:00AM, Sat, Nov 28, 2015",
Arrives: "8:00:00AM, Sat, Nov 28, 2015",
Flight duration: "3h 45m"
...
}
current code (implementing)
from lxml import html
doc_root = html.document_fromstring(resp.text)
for ele in doc_root.xpath('//ul[@class="tb_body"]'):
    # skip flights that have stops
    if has_stops(ele.xpath('.//li[@class="tb_body_flight"]//span[@class="has_cuspopup"]')):
        continue
    set_trace()
    from_city = ele.xpath('.//li[@class="tb_body_city"]')[0]
    set_trace()
    sub_ele = ele.xpath('.//li[@class="tb_body_flight"]//span[@class="has_cuspopup"]')
    set_trace()
I created an example for the HTML you provided. It uses the popular Beautiful Soup library.
from bs4 import BeautifulSoup
data = '<p><strong>Departs:</strong> 5:15:00AM, Sat, Nov 28, 2015 - Taipei</p>\
<p><strong>Arrives:</strong> 8:00:00AM, Sat, Nov 28, 2015 - Bangkok - Don Mueang</p>\
<p><strong>Flight duration:</strong> 3h 45m</p>\
<p><strong>Operated by:</strong> NokScoot</p>'
soup = BeautifulSoup(data, 'html.parser')
res = {p.contents[0].text: p.contents[1].split(' - ')[0].strip() for p in soup.find_all('p')}
print(res)
Output:
{
'Departs:': '5:15:00AM, Sat, Nov 28, 2015',
'Flight duration:': '3h 45m',
'Operated by:': 'NokScoot',
'Arrives:': '8:00:00AM, Sat, Nov 28, 2015'
}
I think you should avoid relying on attributes if you want to keep your code compact.
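If you would rather not install anything, the same label/value idea can be sketched with only the standard library. This is a minimal sketch, not the asker's or the answerer's code: the class LabelValueParser and the function parse_flight_details are hypothetical names, and it assumes every row looks like <p><strong>Label:</strong> value</p> as in the sample.

```python
from html.parser import HTMLParser


class LabelValueParser(HTMLParser):
    """Collect <p><strong>Label:</strong> value</p> rows into a dict."""

    def __init__(self):
        super().__init__()
        self.result = {}
        self._in_strong = False
        self._label = None  # label of the row currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "strong":
            self._in_strong = True
            self._label = ""

    def handle_endtag(self, tag):
        if tag == "strong":
            self._in_strong = False
        elif tag == "p":
            self._label = None  # the row is finished

    def handle_data(self, data):
        if self._in_strong:
            self._label += data
        elif self._label is not None and data.strip():
            key = self._label.strip().rstrip(":")
            # Keep only the text before the first " - " (drops the city).
            self.result[key] = data.split(" - ")[0].strip()


def parse_flight_details(html_text):
    parser = LabelValueParser()
    parser.feed(html_text)
    parser.close()
    return parser.result


sample = (
    "<p><strong>Departs:</strong> 5:15:00AM, Sat, Nov 28, 2015 - Taipei</p>"
    "<p><strong>Flight duration:</strong> 3h 45m</p>"
)
print(parse_flight_details(sample))
```

Unlike the Beautiful Soup version, this one strips the trailing colon from the keys, which matches the expected result in the question.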
How do we convert the string below to a dictionary?
code.py
message = event['Records'][0]['Sns']['Message']
print(message)
# this gives the below and the type is <class 'str'>
{
"created_at":"Sat Jun 26 12:25:21 +0000 2021",
"id":1408763311479345152,
"text":"#test I\'m planning to buy the car today \ud83d\udd25\n\n",
"language":"en",
"author_details":{
"author_id":1384883875822907397,
"author_name":"\u1d04\u0280\u028f\u1d18\u1d1b\u1d0f\u1d04\u1d1c\u0299 x NFTs \ud83d\udc8e",
"author_username":"cryptocurrency_x009",
"author_profile_url":"https://xxxx.com",
"author_created_at":"Wed Apr 21 14:57:11 +0000 2021"
},
"id_displayed":"1",
"counter_emoji":{
}
}
I need to add an additional field, "status": 1, so that it looks like this:
{
"created_at":"Sat Jun 26 12:25:21 +0000 2021",
"id":1408763311479345152,
"text":"#test I\'m planning to buy the car today \ud83d\udd25\n\n",
"language":"en",
"author_details":{
"author_id":1384883875822907397,
"author_name":"\u1d04\u0280\u028f\u1d18\u1d1b\u1d0f\u1d04\u1d1c\u0299 x NFTs \ud83d\udc8e",
"author_username":"cryptocurrency_x009",
"author_profile_url":"https://xxxx.com",
"author_created_at":"Wed Apr 21 14:57:11 +0000 2021"
},
"id_displayed":"1",
"counter_emoji":{
},
"status": 1
}
What is the best way of doing this?
Update: I managed to get it working.
I used ast.literal_eval(message) as shown below.
import ast
D2 = ast.literal_eval(message)
D2["status"] = 1
print(D2)
# This gives the below
{
"created_at":"Sat Jun 26 12:25:21 +0000 2021",
"id":1408763311479345152,
"text":"#test I\'m planning to buy the car today \ud83d\udd25\n\n",
"language":"en",
"author_details":{
"author_id":1384883875822907397,
"author_name":"\u1d04\u0280\u028f\u1d18\u1d1b\u1d0f\u1d04\u1d1c\u0299 x NFTs \ud83d\udc8e",
"author_username":"cryptocurrency_x009",
"author_profile_url":"https://xxxx.com",
"author_created_at":"Wed Apr 21 14:57:11 +0000 2021"
},
"id_displayed":"1",
"counter_emoji":{
},
"status": 1
}
Is there a better way to do this? I'm not sure, so I wanted to check.
Can I check how do we convert the below to a dictionary?
As far as I can tell, data = { } assigns a dictionary with content to the variable data.
I would need to add an additional field called "status" : 1 such that it looks like this
A simple update should do the trick.
data.update({"status": 1})
I found two issues when trying to deserialise the string as JSON:
an invalid escape, I\\'m
unescaped newlines
These can be worked around with
import json
import re
data = data.replace("\\'", "'")
# the fourth positional argument of re.sub is count, not flags, so no flag is passed here
data = re.sub('\n\n"', '\\\\n\\\\n"', data)
d = json.loads(data)
There are also surrogate pairs in the data which may cause problems down the line. These can be fixed by doing
data = data.encode('utf-16', 'surrogatepass').decode('utf-16')
before calling json.loads.
Once the data has been deserialised to a dict you can insert the new key/value pair.
d['status'] = 1
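For the simple case where the message is already valid JSON, the whole round trip can be sketched as below. The helper name add_status and the shortened message are assumptions made for illustration; the real SNS message in the question may need the clean-up steps described above first.

```python
import json


def add_status(message, status=1):
    """Parse a JSON string, add a "status" field, and return the dict."""
    record = json.loads(message)
    record["status"] = status
    return record


# Shortened, hypothetical stand-in for the SNS message in the question.
message = '{"id": 1408763311479345152, "language": "en", "counter_emoji": {}}'

record = add_status(message)
print(json.dumps(record))
```

json.loads is usually a safer choice than ast.literal_eval here, because JSON literals such as true, false and null are not valid Python literals and would make literal_eval fail.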
I have a script which parses raw emails. It works fine for multipart emails, but how do I parse non multipart emails?
mail = email.message_from_string(raw_message)
if mail.is_multipart():
    data = extract(mail)
else:
    payload = mail.get_payload(decode=True)
Raw Email:
Return-Path: <>
X-Original-To: bounces#mydomain.com
Delivered-To: bounces#mydomain.com
Received: from inmumg01.tcs.com (inmumg01.tcs.com [219.64.33.12])
by smtp.mydomain.com (Postfix) with ESMTP id 603693FE11
for <bounces#mydomain.com>; Tue, 15 Mar 2016 04:39:36 -0400 (EDT)
Received: from localhost by inmumg01.tcs.com;
15 Mar 2016 14:09:38 +0530
Message-Id: <5aaa80$2543de#inmumg01.tcs.com>
Date: 15 Mar 2016 14:09:38 +0530
To: bounces#mydomain.com
From: "Mail Delivery System" <mail.notification#tcs.com>
Subject: Undeliverable Message
The following message to <vipul4.j#tcs.com> was undeliverable.
The reason for the problem:
5.1.0 - Unknown address error 550-'vipul4.j#tcs.com... No such user'
The IP address of the MTA to which the message could not be sent:
172.17.9.35
---------- A copy of the message begins below this line ----------
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: A0CGBgCEyedW/3zwO4tdHgEBAg4BgklMUm2nXoJekBMBDYFmBxUFAQ2HGwI4FAEBAQEBAQFkJ4RLIAoTAQEECCwGSQMBCQICMTsFHASHJ10FCatgZ4RBAQSLKQaBD4REgkIBhlERAWqCNBOBJ5MJhEuBLwKEPogSAoIuhnOFYo1YgUUBAUKBNgwBgj5OB4kqgTIBAQE
X-IPAS-Result: A0CGBgCEyedW/3zwO4tdHgEBAg4BgklMUm2nXoJekBMBDYFmBxUFAQ2HGwI4FAEBAQEBAQFkJ4RLIAoTAQEECCwGSQMBCQICMTsFHASHJ10FCatgZ4RBAQSLKQaBD4REgkIBhlERAWqCNBOBJ5MJhEuBLwKEPogSAoIuhnOFYo1YgUUBAUKBNgwBgj5OB4kqgTIBAQE
X-IronPort-AV: E=Sophos;i="5.24,338,1454956200";
d="scan'208,217";a="72486315"
X-Amp-Result: Clean
X-Amp-File-Uploaded: False
Received: from smtp.mydomain.com ([139.59.240.124])
by inmumg01.tcs.com with ESMTP/TLS/DHE-RSA-AES256-SHA; 15 Mar 2016 14:09:37 +0530
Received: from 128.199.202.14 (unknown [128.199.202.14])
(Authenticated sender: mailsender)
by smtp.mydomain.com (Postfix) with ESMTPA id 0D41F3FE11
for <vipul4.j#tcs.com>; Tue, 15 Mar 2016 04:39:33 -0400 (EDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kitemailer.com;
s=kitemail; t=1458031173;
bh=1KFZSL77mNYuQ3iTjpNMdGcBOp2a4pGQnLVlq49ZrGg=;
h=Date:From:To:Subject:List-Unsubscribe;
b=Xxaf++WE0B7HL+FN28O76df7gYNEIKzk8eE9VpxrnMBCpGWPKWBMMfVDfCyie3NBJ
GJiMxn/Yhn+ey6Mr5R5AK5JO5n72yWlytLm0RepMEydaeHHVQPx7bE+LMDMlORSFin
bWdnz58lNMuZ3w9qtqjCXt22Sk5yXfCO71tRgfus=
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=mydomain.com;
s=kitemail; t=1458031173;
bh=1KFZSL77mNYuQ3iTjpNMdGcBOp2a4pGQnLVlq49ZrGg=;
h=Date:From:To:Subject:List-Unsubscribe;
b=FiGdkSE9LCjYkfYyWq65GbZoMZVCQs5OXXJA35CyGQtjPWbvwIKvx7Z6Ff39EBRLf
Vu+6PUrvwyZLFh/1CW0NGOHDgUDjWWQ2jHfnNpJ9QEbHgOwomuMty10HDeZnIr0zM7
8mFCgeCbiiyusQkhmXh5aYqqD+Q/1wFcrpLpkBZc=
Date: Tue, 15 Mar 2016 04:39:31 -0400 (EDT)
From: Kitemailer Newsletter <info#kitemailer.com>
To: vipul4.j#tcs.com
Message-ID: <15106466-1b13-4d64-b220-15f05f4815b7-1458031171312#smtp.mydomain.com>
Subject: KiteMailer | New Features this Week
MIME-Version: 1.0
Content-Type: multipart/mixed;
boundary="----=_Part_44_1398250960.1458031171306"
List-Unsubscribe: <http://example.com/unsubscribe/dmlwdWw0LmpAdGNzLmNvbSM5Ng==>
Feedback-ID: 19:96:1520615:MyDomain
Now, in the else branch, I want to extract information. If I try payload['to'], it raises TypeError: string indices must be integers, not str
OK, suppose there is no way to do this with the email library (I don't know whether there is). You can convert your raw message to a dictionary and get your elements from it:
This is your raw message:
raw_message='''Return-Path: <>
X-Original-To: bounces#mydomain.com
Delivered-To: bounces#mydomain.com
Received: from inmumg01.tcs.com (inmumg01.tcs.com [219.64.33.12])
by smtp.mydomain.com (Postfix) with ESMTP id 603693FE11
for <bounces#mydomain.com>; Tue, 15 Mar 2016 04:39:36 -0400 (EDT)
Received: from localhost by inmumg01.tcs.com;
15 Mar 2016 14:09:38 +0530
Message-Id: <5aaa80$2543de#inmumg01.tcs.com>
Date: 15 Mar 2016 14:09:38 +0530
To: bounces#mydomain.com
From: "Mail Delivery System" <mail.notification#tcs.com>
Subject: Undeliverable Message
The following message to <vipul4.j#tcs.com> was undeliverable.
The reason for the problem:
5.1.0 - Unknown address error 550-'vipul4.j#tcs.com... No such user'
The IP address of the MTA to which the message could not be sent:
172.17.9.35
---------- A copy of the message begins below this line ----------
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: A0CGBgCEyedW/3zwO4tdHgEBAg4BgklMUm2nXoJekBMBDYFmBxUFAQ2HGwI4FAEBAQEBAQFkJ4RLIAoTAQEECCwGSQMBCQICMTsFHASHJ10FCatgZ4RBAQSLKQaBD4REgkIBhlERAWqCNBOBJ5MJhEuBLwKEPogSAoIuhnOFYo1YgUUBAUKBNgwBgj5OB4kqgTIBAQE
X-IPAS-Result: A0CGBgCEyedW/3zwO4tdHgEBAg4BgklMUm2nXoJekBMBDYFmBxUFAQ2HGwI4FAEBAQEBAQFkJ4RLIAoTAQEECCwGSQMBCQICMTsFHASHJ10FCatgZ4RBAQSLKQaBD4REgkIBhlERAWqCNBOBJ5MJhEuBLwKEPogSAoIuhnOFYo1YgUUBAUKBNgwBgj5OB4kqgTIBAQE
X-IronPort-AV: E=Sophos;i="5.24,338,1454956200";
d="scan'208,217";a="72486315"
X-Amp-Result: Clean
X-Amp-File-Uploaded: False
Received: from smtp.mydomain.com ([139.59.240.124])
by inmumg01.tcs.com with ESMTP/TLS/DHE-RSA-AES256-SHA; 15 Mar 2016 14:09:37 +0530
Received: from 128.199.202.14 (unknown [128.199.202.14])
(Authenticated sender: mailsender)
by smtp.mydomain.com (Postfix) with ESMTPA id 0D41F3FE11
for <vipul4.j#tcs.com>; Tue, 15 Mar 2016 04:39:33 -0400 (EDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kitemailer.com;
s=kitemail; t=1458031173;
bh=1KFZSL77mNYuQ3iTjpNMdGcBOp2a4pGQnLVlq49ZrGg=;
h=Date:From:To:Subject:List-Unsubscribe;
b=Xxaf++WE0B7HL+FN28O76df7gYNEIKzk8eE9VpxrnMBCpGWPKWBMMfVDfCyie3NBJ
GJiMxn/Yhn+ey6Mr5R5AK5JO5n72yWlytLm0RepMEydaeHHVQPx7bE+LMDMlORSFin
bWdnz58lNMuZ3w9qtqjCXt22Sk5yXfCO71tRgfus=
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=mydomain.com;
s=kitemail; t=1458031173;
bh=1KFZSL77mNYuQ3iTjpNMdGcBOp2a4pGQnLVlq49ZrGg=;
h=Date:From:To:Subject:List-Unsubscribe;
b=FiGdkSE9LCjYkfYyWq65GbZoMZVCQs5OXXJA35CyGQtjPWbvwIKvx7Z6Ff39EBRLf
Vu+6PUrvwyZLFh/1CW0NGOHDgUDjWWQ2jHfnNpJ9QEbHgOwomuMty10HDeZnIr0zM7
8mFCgeCbiiyusQkhmXh5aYqqD+Q/1wFcrpLpkBZc=
Date: Tue, 15 Mar 2016 04:39:31 -0400 (EDT)
From: Kitemailer Newsletter <info#kitemailer.com>
To: vipul4.j#tcs.com
Message-ID: <15106466-1b13-4d64-b220-15f05f4815b7-1458031171312#smtp.mydomain.com>
Subject: KiteMailer | New Features this Week
MIME-Version: 1.0
Content-Type: multipart/mixed;
boundary="----=_Part_44_1398250960.1458031171306"
List-Unsubscribe: <http://example.com/unsubscribe/dmlwdWw0LmpAdGNzLmNvbSM5Ng==>
Feedback-ID: 19:96:1520615:MyDomain'''
I'm using your code to get the payload:
#in case it is not multipart
import email
mail = email.message_from_string(raw_message)
payload = mail.get_payload(decode=True)
mail_dico = { elt.split(":",1)[0].strip():elt.split(":", 1)[1].strip() for elt in payload.split("\n") if ":" in elt and " " not in elt.split(':')[0].strip()}
Here is your dictionary:
{'Content-Type': 'multipart/mixed;',
'DKIM-Signature': 'v=1; a=rsa-sha256; c=relaxed/simple; d=mydomain.com;',
'Date': 'Tue, 15 Mar 2016 04',
'Feedback-ID': '19',
'From': 'Kitemailer Newsletter <info#kitemailer.com>',
'List-Unsubscribe': '<http',
'MIME-Version': '1.0',
'Message-ID': '<15106466-1b13-4d64-b220-15f05f4815b7-1458031171312#smtp.mydomain.com>',
'Received': 'from 128.199.202.14 (unknown [128.199.202.14])',
'Subject': 'KiteMailer | New Features this Week',
'To': 'vipul4.j#tcs.com',
'X-Amp-File-Uploaded': 'False',
'X-Amp-Result': 'Clean',
'X-IPAS-Result': 'A0CGBgCEyedW/3zwO4tdHgEBAg4BgklMUm2nXoJekBMBDYFmBxUFAQ2HGwI4FAEBAQEBAQFkJ4RLIAoTAQEECCwGSQMBCQICMTsFHASHJ10FCatgZ4RBAQSLKQaBD4REgkIBhlERAWqCNBOBJ5MJhEuBLwKEPogSAoIuhnOFYo1YgUUBAUKBNgwBgj5OB4kqgTIBAQE',
'X-IronPort-AV': 'E=Sophos;i="5.24,338,1454956200";',
'X-IronPort-Anti-Spam-Filtered': 'true',
'X-IronPort-Anti-Spam-Result': 'A0CGBgCEyedW/3zwO4tdHgEBAg4BgklMUm2nXoJekBMBDYFmBxUFAQ2HGwI4FAEBAQEBAQFkJ4RLIAoTAQEECCwGSQMBCQICMTsFHASHJ10FCatgZ4RBAQSLKQaBD4REgkIBhlERAWqCNBOBJ5MJhEuBLwKEPogSAoIuhnOFYo1YgUUBAUKBNgwBgj5OB4kqgTIBAQE',
'h=Date': 'From'}
Now you can access your elements:
print(mail_dico["To"])
>> 'vipul4.j#tcs.com'
print(mail_dico["Subject"])
>> 'KiteMailer | New Features this Week'
This is probably not the best way to do it, but I hope it helps.
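For what it's worth, the email module does expose the top-level headers directly on the message object, so for a non-multipart message a dictionary built by hand may not be needed. A minimal sketch, using a shortened and hypothetical stand-in for the bounce message (the addresses are made up):

```python
import email

# Hypothetical, shortened stand-in for the bounce message in the question.
raw_message = (
    'To: bounces@example.com\n'
    'From: "Mail Delivery System" <mailer@example.com>\n'
    'Subject: Undeliverable Message\n'
    '\n'
    'The following message was undeliverable.\n'
)

mail = email.message_from_string(raw_message)

# Headers live on the message object itself, not on the payload.
# Lookups are case-insensitive and return None for a missing header.
recipient = mail['to']
subject = mail['Subject']

# For a non-multipart message, get_payload() returns the body text.
body = mail.get_payload()

print(recipient)
print(subject)
```

The TypeError in the question comes from indexing the payload, which is just the body string. Note that headers of the original message quoted inside a bounce body are part of that body text, so they would still have to be parsed separately, for example by passing the quoted part to email.message_from_string again.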
{"0":{"posted_date":"25 Jun 2015"},"1":{"posted_date":"26 Jun 2015"}}
Note:
'0' and '1' come from a variable, 'count', which is generated in a loop
"posted_date" is a string
"25 Jun 2015" and "26 Jun 2015" also come from a variable, 'date'
How can I create JSON output like the above with Python?
[edit-not working code]
import json
final = []
count = 0
postID = 224
while postID < 1200:
    final.append({count: {"posted_ID":postID}})
    count = count + 1
    postID = postID * 2
print str(json.dumps(final))
import json
dates = ["25 Jun 2015", "26 Jun 2015", "27 Jun 2015"]
result = {}
for each, date in enumerate(dates):
    # json.dumps turns the integer keys into the strings "0", "1", ...
    result.update({each: {"posted_date": date}})
jsoned = json.dumps(result)
You don't need to use the "count" variable
First create the map the way you want it:
outMap = {}
outMap["0"]={}
outMap["0"]["posted_date"]="25 Jun 2015"
outMap["1"]={}
outMap["1"]["posted_date"]="26 Jun 2015"
Then use json.dumps() to get the json
import json
outjson = json.dumps(outMap)
print(outjson)
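Putting the two ideas together, the loop can be collapsed into a dict comprehension. This is only a sketch: build_posted_dates is a hypothetical helper name, and the separators argument is there just to reproduce the compact layout shown in the question.

```python
import json


def build_posted_dates(dates):
    """Map "0", "1", ... to {"posted_date": date} for each date in order."""
    return {str(count): {"posted_date": date} for count, date in enumerate(dates)}


dates = ["25 Jun 2015", "26 Jun 2015"]
output = json.dumps(build_posted_dates(dates), separators=(",", ":"))
print(output)
# → {"0":{"posted_date":"25 Jun 2015"},"1":{"posted_date":"26 Jun 2015"}}
```

The non-working attempt in the question appended one small dict per iteration to a list, which serialises as a JSON array; building a single dict keyed by the counter gives the JSON object that was asked for.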
Here's an example email header,
header = """
From: Media Temple user (mt.kb.user#gmail.com)
Subject: article: A sample header
Date: January 25, 2011 3:30:58 PM PDT
To: user#example.com
Return-Path: <mt.kb.user#gmail.com>
Envelope-To: user#example.com
Delivery-Date: Tue, 25 Jan 2011 15:31:01 -0700
Received: from :po-out-1718.google.com ([72.14.252.155]:54907) by cl35.gs01.gridserver.com with esmtp (Exim 4.63) (envelope-from <mt.kb.user#gmail.com>) id 1KDoNH-0000f0-RL for user#example.com; Tue, 25 Jan 2011 15:31:01 -0700
Dkim-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:message-id:date:from:to :subject:mime-version:content-type; bh=+JqkmVt+sHDFIGX5jKp3oP18LQf10VQjAmZAKl1lspY=; b=F87jySDZnMayyitVxLdHcQNL073DytKRyrRh84GNsI24IRNakn0oOfrC2luliNvdea LGTk3adIrzt+N96GyMseWz8T9xE6O/sAI16db48q4Iqkd7uOiDvFsvS3CUQlNhybNw8m CH/o8eELTN0zbSbn5Trp0dkRYXhMX8FTAwrH0=
Domainkey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=message-id:date:from:to:subject:mime-version:content-type; b=wkbBj0M8NCUlboI6idKooejg0sL2ms7fDPe1tHUkR9Ht0qr5lAJX4q9PMVJeyjWalH 36n4qGLtC2euBJY070bVra8IBB9FeDEW9C35BC1vuPT5XyucCm0hulbE86+uiUTXCkaB 6ykquzQGCer7xPAcMJqVfXDkHo3H61HM9oCQM=
Message-Id: <c8f49cec0807011530k11196ad4p7cb4b9420f2ae752#mail.gmail.com>
Mime-Version: 1.0
Content-Type: multipart/alternative; boundary="----=_Part_3927_12044027.1214951458678"
X-Spam-Status: score=3.7 tests=DNS_FROM_RFC_POST, HTML_00_10, HTML_MESSAGE, HTML_SHORT_LENGTH version=3.1.7
X-Spam-Level: ***
Message Body: **The email message body**
"""
The header is stored as a string. How do I parse it so that I can map it to a dictionary, with the header field names as keys and the field values as values?
I want a dictionary like this,
header_dict = {
'From': 'Media Temple user (mt.kb.user#gmail.com)',
'Subject': 'article: A sample header',
'Date': 'January 25, 2011 3:30:58 PM PDT',
'and so on': .. . . . .. . . .. .
. . . . .. . . . .. . . . . .
}
I made a list of fields required,
header_reqd = ['From:','Subject:','Date:','To:','Return-Path:','Envelope-To:','Delivery-Date:','Received:','Dkim-Signature:','Domainkey-Signature:','Message-Id:','Mime-Version:','Content-Type:','X-Spam-Status:','X-Spam-Level:','Message Body:']
These list items can likely be the keys for the dictionary.
It seems most of these answers have overlooked the Python email parser, and their output is not correct because the values keep a leading space. Also, the OP has perhaps made a typo by including a leading newline in the header string, which needs to be stripped for the email parser to work.
from email.parser import HeaderParser
header = header.strip() # Fix incorrect formatting
email_message = HeaderParser().parsestr(header)
dict(email_message)
Output (truncated):
>>> from pprint import pprint
>>> pprint(dict(email_message))
{'Content-Type': 'multipart/alternative; '
'boundary="----=_Part_3927_12044027.1214951458678"',
'Date': 'January 25, 2011 3:30:58 PM PDT',
'Delivery-Date': 'Tue, 25 Jan 2011 15:31:01 -0700',
...
'Subject': 'article: A sample header',
'To': 'user#example.com',
'X-Spam-Level': '***',
'X-Spam-Status': 'score=3.7 tests=DNS_FROM_RFC_POST, HTML_00_10, '
'HTML_MESSAGE, HTML_SHORT_LENGTH version=3.1.7'}
Duplicate header keys
Be aware that email message headers can contain duplicate keys as mentioned in the Python documentation for email.message
Headers are stored and returned in case-preserving form, but field names are matched case-insensitively. Unlike a real dict, there is an ordering to the keys, and there can be duplicate keys. Additional methods are provided for working with headers that have duplicate keys.
For example, when converting the following email message to a Python dict, only the first Received key is retained.
headers = HeaderParser().parsestr("""Received: by mx0047p1mdw1.sendgrid.net with SMTP id 6WCVv7KAWn Wed, 27 Jul 2016 20:53:06 +0000 (UTC)
Received: from mail-io0-f169.google.com (mail-io0-f169.google.com [209.85.223.169]) by mx0047p1mdw1.sendgrid.net (Postfix) with ESMTPS id AA9FFA817F2 for <example#example.comom>; Wed, 27 Jul 2016 20:53:06 +0000 (UTC)
Received: by mail-io0-f169.google.com with SMTP id b62so81593819iod.3 for <example#example.comom>; Wed, 27 Jul 2016 13:53:06 -0700 (PDT)""")
dict(headers)
{'Received': 'by mx0047p1mdw1.sendgrid.net with SMTP id 6WCVv7KAWn Wed, 27 Jul 2016 20:53:06 +0000 (UTC)'}
Use the get_all method to check for duplicates:
headers.get_all('Received')
['by mx0047p1mdw1.sendgrid.net with SMTP id 6WCVv7KAWn Wed, 27 Jul 2016 20:53:06 +0000 (UTC)', 'from mail-io0-f169.google.com (mail-io0-f169.google.com [209.85.223.169]) by mx0047p1mdw1.sendgrid.net (Postfix) with ESMTPS id AA9FFA817F2 for <example#example.comom>; Wed, 27 Jul 2016 20:53:06 +0000 (UTC)', 'by mail-io0-f169.google.com with SMTP id b62so81593819iod.3 for <example#example.comom>; Wed, 27 Jul 2016 13:53:06 -0700 (PDT)']
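If a plain dictionary is still wanted, one option is to keep every value in a list so that duplicate headers are not lost. A minimal sketch, where headers_to_multidict is a hypothetical helper name and the sample header text is made up:

```python
from collections import defaultdict
from email.parser import HeaderParser


def headers_to_multidict(raw_headers):
    """Return {header name: [all values in order]} for a header block."""
    message = HeaderParser().parsestr(raw_headers)
    grouped = defaultdict(list)
    # items() yields every (name, value) pair, including duplicates.
    for name, value in message.items():
        grouped[name].append(value)
    return dict(grouped)


sample = "Received: by host-a\nReceived: from host-b\nSubject: hello\n"
print(headers_to_multidict(sample))
```

The keys here keep whatever capitalisation each header line used, so 'Received' and 'received' would end up under separate keys; lower-casing name before grouping is one way around that if it matters.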
You can split the string on newlines, then split each line on ":"
>>> my_header = {}
>>> for x in header.strip().split("\n"):
...     x = x.split(":", 1)
...     my_header[x[0]] = x[1]
...
header = """From: Media Temple user (mt.kb.user#gmail.com)
Subject: article: A sample header
Date: January 25, 2011 3:30:58 PM PDT
To: user#example.com
Return-Path: <mt.kb.user#gmail.com>
Envelope-To: user#example.com
Delivery-Date: Tue, 25 Jan 2011 15:31:01 -0700
Received: from :po-out-1718.google.com ([72.14.252.155]:54907) by cl35.gs01.gridserver.com with esmtp (Exim 4.63) (envelope-from <mt.kb.user#gmail.com>) id 1KDoNH-0000f0-RL for user#example.com; Tue, 25 Jan 2011 15:31:01 -0700
Dkim-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:message-id:date:from:to :subject:mime-version:content-type; bh=+JqkmVt+sHDFIGX5jKp3oP18LQf10VQjAmZAKl1lspY=; b=F87jySDZnMayyitVxLdHcQNL073DytKRyrRh84GNsI24IRNakn0oOfrC2luliNvdea LGTk3adIrzt+N96GyMseWz8T9xE6O/sAI16db48q4Iqkd7uOiDvFsvS3CUQlNhybNw8m CH/o8eELTN0zbSbn5Trp0dkRYXhMX8FTAwrH0=
Domainkey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=message-id:date:from:to:subject:mime-version:content-type; b=wkbBj0M8NCUlboI6idKooejg0sL2ms7fDPe1tHUkR9Ht0qr5lAJX4q9PMVJeyjWalH 36n4qGLtC2euBJY070bVra8IBB9FeDEW9C35BC1vuPT5XyucCm0hulbE86+uiUTXCkaB 6ykquzQGCer7xPAcMJqVfXDkHo3H61HM9oCQM=
Message-Id: <c8f49cec0807011530k11196ad4p7cb4b9420f2ae752#mail.gmail.com>
Mime-Version: 1.0
Content-Type: multipart/alternative; boundary="----=_Part_3927_12044027.1214951458678"
X-Spam-Status: score=3.7 tests=DNS_FROM_RFC_POST, HTML_00_10, HTML_MESSAGE, HTML_SHORT_LENGTH version=3.1.7
X-Spam-Level: ***
Message Body: **The email message body**
"""
Split into individual lines then split each line once on :
from pprint import pprint as pp
pp(dict(line.split(":",1) for line in header.splitlines()))
Output:
{'Content-Type': ' multipart/alternative; '
'boundary="----=_Part_3927_12044027.1214951458678"',
'Date': ' January 25, 2011 3:30:58 PM PDT',
'Delivery-Date': ' Tue, 25 Jan 2011 15:31:01 -0700',
'Dkim-Signature': ' v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; '
's=gamma; '
'h=domainkey-signature:received:received:message-id:date:from:to '
':subject:mime-version:content-type; '
'bh=+JqkmVt+sHDFIGX5jKp3oP18LQf10VQjAmZAKl1lspY=; '
'b=F87jySDZnMayyitVxLdHcQNL073DytKRyrRh84GNsI24IRNakn0oOfrC2luliNvdea '
'LGTk3adIrzt+N96GyMseWz8T9xE6O/sAI16db48q4Iqkd7uOiDvFsvS3CUQlNhybNw8m '
'CH/o8eELTN0zbSbn5Trp0dkRYXhMX8FTAwrH0=',
'Domainkey-Signature': ' a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; '
'h=message-id:date:from:to:subject:mime-version:content-type; '
'b=wkbBj0M8NCUlboI6idKooejg0sL2ms7fDPe1tHUkR9Ht0qr5lAJX4q9PMVJeyjWalH '
'36n4qGLtC2euBJY070bVra8IBB9FeDEW9C35BC1vuPT5XyucCm0hulbE86+uiUTXCkaB '
'6ykquzQGCer7xPAcMJqVfXDkHo3H61HM9oCQM=',
'Envelope-To': ' user#example.com',
'From': ' Media Temple user (mt.kb.user#gmail.com)',
'Message Body': ' **The email message body**',
'Message-Id': ' '
'<c8f49cec0807011530k11196ad4p7cb4b9420f2ae752#mail.gmail.com>',
'Mime-Version': ' 1.0',
'Received': ' from :po-out-1718.google.com ([72.14.252.155]:54907) by '
'cl35.gs01.gridserver.com with esmtp (Exim 4.63) (envelope-from '
'<mt.kb.user#gmail.com>) id 1KDoNH-0000f0-RL for '
'user#example.com; Tue, 25 Jan 2011 15:31:01 -0700',
'Return-Path': ' <mt.kb.user#gmail.com>',
'Subject': ' article: A sample header',
'To': ' user#example.com',
'X-Spam-Level': ' ***',
'X-Spam-Status': ' score=3.7 tests=DNS_FROM_RFC_POST, HTML_00_10, '
'HTML_MESSAGE, HTML_SHORT_LENGTH version=3.1.7'}
line.split(":",1) makes sure we only split once on :, so any : in the values is left alone. You end up with sublists that are key/value pairs, so calling dict builds the dictionary from those pairs.
split will work for you:
Demo:
>>> result = {}
>>> for i in header.split("\n"):
...     i = i.strip()
...     if i:
...         k, v = i.split(":", 1)
...         result[k] = v
Output:
>>> import pprint
>>> pprint.pprint(result)
{'Content-Type': ' multipart/alternative; boundary="----=_Part_3927_12044027.1214951458678"',
'Date': ' January 25, 2011 3:30:58 PM PDT',
'Delivery-Date': ' Tue, 25 Jan 2011 15:31:01 -0700',
'Dkim-Signature': ' v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:message-id:date:from:to :subject:mime-version:content-type; bh=+JqkmVt+sHDFIGX5jKp3oP18LQf10VQjAmZAKl1lspY=; b=F87jySDZnMayyitVxLdHcQNL073DytKRyrRh84GNsI24IRNakn0oOfrC2luliNvdea LGTk3adIrzt+N96GyMseWz8T9xE6O/sAI16db48q4Iqkd7uOiDvFsvS3CUQlNhybNw8m CH/o8eELTN0zbSbn5Trp0dkRYXhMX8FTAwrH0=',
'Domainkey-Signature': ' a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=message-id:date:from:to:subject:mime-version:content-type; b=wkbBj0M8NCUlboI6idKooejg0sL2ms7fDPe1tHUkR9Ht0qr5lAJX4q9PMVJeyjWalH 36n4qGLtC2euBJY070bVra8IBB9FeDEW9C35BC1vuPT5XyucCm0hulbE86+uiUTXCkaB 6ykquzQGCer7xPAcMJqVfXDkHo3H61HM9oCQM=',
'Envelope-To': ' user#example.com',
'From': ' Media Temple user (mt.kb.user#gmail.com)',
'Message Body': ' **The email message body**',
'Message-Id': ' <c8f49cec0807011530k11196ad4p7cb4b9420f2ae752#mail.gmail.com>',
'Mime-Version': ' 1.0',
'Received': ' from :po-out-1718.google.com ([72.14.252.155]:54907) by cl35.gs01.gridserver.com with esmtp (Exim 4.63) (envelope-from <mt.kb.user#gmail.com>) id 1KDoNH-0000f0-RL for user#example.com; Tue, 25 Jan 2011 15:31:01 -0700',
'Return-Path': ' <mt.kb.user#gmail.com>',
'Subject': ' article: A sample header',
'To': ' user#example.com',
'X-Spam-Level': ' ***',
'X-Spam-Status': ' score=3.7 tests=DNS_FROM_RFC_POST, HTML_00_10, HTML_MESSAGE, HTML_SHORT_LENGTH version=3.1.7'}
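The split-based outputs above keep a leading space in every value, and a header folded over several lines would break them. A slightly more defensive hand-rolled version might look like the sketch below; parse_header_lines is a hypothetical name, and this is not a replacement for the email parser, which also handles duplicate headers.

```python
def parse_header_lines(text):
    """Split "Name: value" lines into a dict, stripping surrounding spaces."""
    headers = {}
    last_key = None
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines, e.g. a leading newline
        if line[0] in " \t" and last_key is not None:
            # A line starting with whitespace continues the previous header.
            headers[last_key] += " " + line.strip()
            continue
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a header line
        last_key = key.strip()
        headers[last_key] = value.strip()
    return headers


sample = "\nSubject: article: A sample header\nReceived: from a\n by b\n"
print(parse_header_lines(sample))
```

partition(":") splits only at the first colon, so colons inside a value are preserved. If a header name repeats, the last occurrence wins in this sketch.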
When using the Python swiftclient module I can POST to an object with a header of X-Delete-At/After and an epoch, but how do I show the expiration time of the object? I was doing some testing and it seems that the file is always being expired immediately, e.g. where I set the time for 100 days in the future:
>>> swift.put_object('container1','test_file01.txt','This is line 1 in the file test_file01.txt done at: %s' % datetime.now().strftime('%Y-%m-%d %H:%M:%S,%f'))
'4b3faf0b79d97f5e478949e7d6c4c575'
>>> swift.head_object('container1','test_file01.txt')
{'content-length': '78', 'server': 'Jetty(7.6.4.v20120524)', 'last-modified': 'Wed, 23 Apr 2014 17:09:55 GMT', 'etag': '4b3faf0b79d97f5e478949e7d6c4c575', 'x-timestamp': '1398272995', 'date': 'Wed, 23 Apr 2014 17:09:59 GMT', 'content-type': 'application/octet-stream'}
>>> swift.post_object('container1','test_file01.txt',headers={'X-Delete-At':(datetime.now(pytz.timezone('GMT')) + timedelta(days=100)).strftime('%s')})
>>> swift.head_object('container1','test_file01.txt')
Traceback (most recent call last):
File "<pyshell#121>", line 1, in <module>
swift.head_object('container1','test_file01.txt')
File "/usr/local/lib/python2.7/dist-packages/swiftclient/client.py", line 1279, in head_object
return self._retry(None, head_object, container, obj)
File "/usr/local/lib/python2.7/dist-packages/swiftclient/client.py", line 1189, in _retry
rv = func(self.url, self.token, *args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/swiftclient/client.py", line 853, in head_object
http_response_content=body)
ClientException: Object HEAD failed: http://10.249.238.135:9024:9024/v1/rjm-vnx-namespace01/container1/test_file01.txt 404 Not Found
So it seems it was expired immediately. My questions are:
Am I setting the expiration correctly? I would like to be able to do it on an existing object rather than at object creation time, but perhaps I have to do it when it's being created?
Is there a way to see the expiration time? Obviously if it's not working correctly then there's no good way to see it, but if it were, does head_object() return that information?
Thanks,
Rob
Never mind, I figured it out. By setting an "after" I realized that the value apparently needs to be in milliseconds. So when I changed it to:
>>> swift.post_object('container1','test_file01.txt',headers={'X-Delete-At':int((datetime.now(pytz.timezone('GMT')) + timedelta(days=100)).strftime('%s'))*1000})
>>> swift.head_object('container1','test_file01.txt')
{'content-length': '78', 'x-delete-at':'1406932148000', 'server': 'Jetty(7.6.4.v20120524)', 'last-modified': 'Wed, 23 Apr 2014 17:29:06 GMT', 'etag': '0baf8b37f374c94e59a05a7f7b339811', 'x-timestamp': '1398274146', 'date': 'Wed, 23 Apr 2014 17:29:08 GMT', 'content-type': 'application/octet-stream'}
Then it worked as expected.
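For anyone computing the header value, here is a small sketch that avoids strftime('%s'), which is a platform-specific extension and, as far as I know, ignores the tzinfo of the datetime. The helper name delete_at_header and its milliseconds switch are assumptions for illustration; stock OpenStack Swift documents X-Delete-At as epoch seconds, while the backend above apparently wanted milliseconds.

```python
from datetime import datetime, timedelta, timezone


def delete_at_header(days, now=None, milliseconds=False):
    """Build an X-Delete-At header for `days` from now (UTC epoch time)."""
    now = now or datetime.now(timezone.utc)
    epoch_seconds = int((now + timedelta(days=days)).timestamp())
    value = epoch_seconds * 1000 if milliseconds else epoch_seconds
    return {"X-Delete-At": str(value)}


# Milliseconds, as the backend in this thread seemed to expect.
print(delete_at_header(100, milliseconds=True))
```

The returned dict can be passed as the headers argument of post_object, in the same way as in the calls above.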
Rob