Related
I am carrying out a API search on scaleserp and for each search I do i want to put the output in the column in my dataframe.
#Matches the GET request
api_result = requests.get('https://api.scaleserp.com/search', params, verify=False)
print(type(api_result))
#stores the result in JSON
result = api_result.json()
print(type(result))
#Creates a new DataFrame with 'Organic_Results' From the JSON output.
Results_df = (result['organic_results'])
#FOR loop to look at each result and select which output from the JSON is wanted.
for res in Results_df:
StartingDataFrame['JSONDump'] = res
api_result is a requests.models.Response
result is a dict.
RES is a dict.
I want the RES to be put into the column Dump. is this possible?
Updated Code
#Matches the GET request
api_result = requests.get('https://api.scaleserp.com/search', params, verify=False)
#stores the result in JSON
result = api_result.json()
#Creates a new DataFrame with 'Organic_Results' From the JSON output.
Results_df = (result['organic_results'])
#FOR loop to look at each result and select which output from the JSON is wanted.
for res in Results_df:
Extracted_data = {key: res[key] for key in res.keys()
& {'title', 'link', 'snippet_matched', 'date', 'snippet'}}
Extracted_data is a dict and contains the info i need.
{'title': '25 Jun 1914 - Advertising - Trove', 'link': 'https://trove.nla.gov.au/newspaper/article/7280119', 'snippet_matched': ['()', 'charge', 'Dan Whit'], 'snippet': 'I Iron roof, riltibcd II (),. Line 0.139.5. wai at r ar ... Propertb-« entired free of charge. Line 2.130.0 ... AT Dan Whit",\'»\', 6il 02 sturt »L, Prlnce\'»~Brti\'»e,. Line 3.12.0.'}
{'snippet': "Mary Bardwell is in charge of ... I() •. Al'companit'd by: Choppf'd Chitkf'n Li\\f>r Palt·. 1h!iiSC'o Gret'n Salad g iii ... of the overtime as Dan Whit-.", 'title': 'October 16,1980 - Bethlehem Public Library',
'link': 'http://www.bethlehempubliclibrary.org/webapps/spotlight/years/1980/1980-10-16.pdf', 'snippet_matched': ['charge', '()', 'Dan Whit'], 'date': '16 Oct 1980'}
{'snippet': 'CONGRATULATIONS TO DAN WHIT-. TLE ON THE ... jailed and beaten dozens of times. In one of ... ern p()rts ceased. The MIF is not only\xa0...', 'title': 'extensions of remarks - US Government Publishing Office', 'link': 'https://www.gpo.gov/fdsys/pkg/GPO-CRECB-1996-pt5/pdf/GPO-CRECB-1996-pt5-7-3.pdf', 'snippet_matched': ['DAN WHIT', 'jailed', '()'], 'date': '26 Apr 1986'}
{'snippet': 'ILLUSTRATION BY DAN WHIT! By Matt Manning ... ()n the one hand, there are doctors on both ... self-serving will go to jail at the beginning of\xa0...', 'title': 'The BG News May 23, 2007 - ScholarWorks#BGSU - Bowling ...', 'link': 'https://scholarworks.bgsu.edu/cgi/viewcontent.cgi?article=8766&context=bg-news', 'snippet_matched': ['DAN WHIT', '()', 'jail'], 'date': '23 May 2007'}
{'snippet': '$19.95 Charge card number SERVICE HOURS: ... Explorer Advisor Dan Whit- ... lhrr %(OnrwflC or ()utuflrueonlinelfmarketing (arnpaigfl%? 0I - .',
'title': '<%BANNER%> TABLE OF CONTENTS HIDE Section A: Main ...', 'link': 'https://ufdc.ufl.edu/UF00028295/00194', 'snippet_matched': ['Charge', 'Dan Whit', '()'], 'date': 'Listings 1 - 800'}
{'title': 'Lledo Promotional,Bull Nose Morris,Dandy,Desperate Dan ...', 'link': 'https://www.ebay.co.uk/itm/Lledo-Promotional-Bull-Nose-Morris-Dandy-Desperate-Dan-White-Van-/233817683840', 'snippet_matched': ['charges'], 'snippet': 'No additional import charges on delivery. This item will be sent through the Global Shipping Programme and includes international tracking. Learn more- opens\xa0...'}
The problem looks like the length of your organic_results is not the same as the length of your dataframe.
StartingDataFrame['Dump'] = (result['organic_results'])
Here your setting the whole column dump to equal the organic_results which is smaller or larger than your already defined dataframe. I'm not sure what your dataframe already has but you could do this by iterating through the rows and stashing them in that row if you have values you want them to add up with like this:
StartingDataFrame['Dump'] = []*len(StartingDataFrame)
for i,row in StartingDataFrame.iterrows():
StartingDataFrame.at[i,'Dump'] = result['organic_results']
Depending on what your data looks like you could maybe just append it to the dataframe
StartingDataFrame = StartingDataFrame.append(result['organic_results'],ignore_index=True)
Could you show us a sample of what both data source looks like?
I am trying to return a particular string value after getting response from request URL.
Ex.
response =
{
'assets': [
{
'VEG': True,
'CONTACT': '12345',
'CLASS': 'SIX',
'ROLLNO': 'A101',
'CITY': 'CHANDI',
}
],
"body": "**Trip**: 2017\r\n** Date**: 15th Jan 2015\r\n**Count**: 501\r\n\r\n"
}
This is the response which i am getting, from this I need only Date: 15th Jan 2015. I am not sure how to do it.
Any help would be appreciated.
assuming it is a dictionary
a={'body': '**Trip**: 2017\r\n** Date**: 15th Jan 2015\r\n**Count**: 501\r\n\r\n', 'assets': [{'VEG': True, 'CONTACT': '12345', 'CLASS': 'SIX', 'ROLLNO': 'A101', 'CITY': 'CHANDI'}]}
then
required=a['body'].split('\r\n')[1].replace('**','')
print required
result:
' Date: 15th Jan 2015'
Access the key body of the dictionary a
split through \r\n to get a list ['**Trip**: 2017', '** Date**:
15th Jan 2015', '**Count**: 501', '', '']
access it's first index and replace ** with empty('')
a = str(yourdictionary)
print([e for e in a.split("\r\n") if "Date" in e][0].remove("**").strip())
Try this
>>> body = response['body']
>>> bodylist = body.split('\r\n')
>>> for value in bodylist:
... value = value.split(':')
...
>>> for i,value in enumerate(bodylist):
... bodylist[i] = value.split(':')
...
>>> for i, value in enumerate(bodylist):
... if bodylist[i][0] == '** Date**':
... print(bodylist[i][1])
...
15th Jan 2015
I have trown it in the interpreter and this works. I don't know if it is the best code around, but it works ;-)
I have a script which parses raw emails. It works fine for multipart emails, but how do I parse non multipart emails?
mail = email.message_from_string(raw_message)
if mail.is_multipart():
data = extract(mail)
else:
payload = mail.get_payload(decode=True)
Raw Email:
Return-Path: <>
X-Original-To: bounces#mydomain.com
Delivered-To: bounces#mydomain.com
Received: from inmumg01.tcs.com (inmumg01.tcs.com [219.64.33.12])
by smtp.mydomain.com (Postfix) with ESMTP id 603693FE11
for <bounces#mydomain.com>; Tue, 15 Mar 2016 04:39:36 -0400 (EDT)
Received: from localhost by inmumg01.tcs.com;
15 Mar 2016 14:09:38 +0530
Message-Id: <5aaa80$2543de#inmumg01.tcs.com>
Date: 15 Mar 2016 14:09:38 +0530
To: bounces#mydomain.com
From: "Mail Delivery System" <mail.notification#tcs.com>
Subject: Undeliverable Message
The following message to <vipul4.j#tcs.com> was undeliverable.
The reason for the problem:
5.1.0 - Unknown address error 550-'vipul4.j#tcs.com... No such user'
The IP address of the MTA to which the message could not be sent:
172.17.9.35
---------- A copy of the message begins below this line ----------
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: A0CGBgCEyedW/3zwO4tdHgEBAg4BgklMUm2nXoJekBMBDYFmBxUFAQ2HGwI4FAEBAQEBAQFkJ4RLIAoTAQEECCwGSQMBCQICMTsFHASHJ10FCatgZ4RBAQSLKQaBD4REgkIBhlERAWqCNBOBJ5MJhEuBLwKEPogSAoIuhnOFYo1YgUUBAUKBNgwBgj5OB4kqgTIBAQE
X-IPAS-Result: A0CGBgCEyedW/3zwO4tdHgEBAg4BgklMUm2nXoJekBMBDYFmBxUFAQ2HGwI4FAEBAQEBAQFkJ4RLIAoTAQEECCwGSQMBCQICMTsFHASHJ10FCatgZ4RBAQSLKQaBD4REgkIBhlERAWqCNBOBJ5MJhEuBLwKEPogSAoIuhnOFYo1YgUUBAUKBNgwBgj5OB4kqgTIBAQE
X-IronPort-AV: E=Sophos;i="5.24,338,1454956200";
d="scan'208,217";a="72486315"
X-Amp-Result: Clean
X-Amp-File-Uploaded: False
Received: from smtp.mydomain.com ([139.59.240.124])
by inmumg01.tcs.com with ESMTP/TLS/DHE-RSA-AES256-SHA; 15 Mar 2016 14:09:37 +0530
Received: from 128.199.202.14 (unknown [128.199.202.14])
(Authenticated sender: mailsender)
by smtp.mydomain.com (Postfix) with ESMTPA id 0D41F3FE11
for <vipul4.j#tcs.com>; Tue, 15 Mar 2016 04:39:33 -0400 (EDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kitemailer.com;
s=kitemail; t=1458031173;
bh=1KFZSL77mNYuQ3iTjpNMdGcBOp2a4pGQnLVlq49ZrGg=;
h=Date:From:To:Subject:List-Unsubscribe;
b=Xxaf++WE0B7HL+FN28O76df7gYNEIKzk8eE9VpxrnMBCpGWPKWBMMfVDfCyie3NBJ
GJiMxn/Yhn+ey6Mr5R5AK5JO5n72yWlytLm0RepMEydaeHHVQPx7bE+LMDMlORSFin
bWdnz58lNMuZ3w9qtqjCXt22Sk5yXfCO71tRgfus=
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=mydomain.com;
s=kitemail; t=1458031173;
bh=1KFZSL77mNYuQ3iTjpNMdGcBOp2a4pGQnLVlq49ZrGg=;
h=Date:From:To:Subject:List-Unsubscribe;
b=FiGdkSE9LCjYkfYyWq65GbZoMZVCQs5OXXJA35CyGQtjPWbvwIKvx7Z6Ff39EBRLf
Vu+6PUrvwyZLFh/1CW0NGOHDgUDjWWQ2jHfnNpJ9QEbHgOwomuMty10HDeZnIr0zM7
8mFCgeCbiiyusQkhmXh5aYqqD+Q/1wFcrpLpkBZc=
Date: Tue, 15 Mar 2016 04:39:31 -0400 (EDT)
From: Kitemailer Newsletter <info#kitemailer.com>
To: vipul4.j#tcs.com
Message-ID: <15106466-1b13-4d64-b220-15f05f4815b7-1458031171312#smtp.mydomain.com>
Subject: KiteMailer | New Features this Week
MIME-Version: 1.0
Content-Type: multipart/mixed;
boundary="----=_Part_44_1398250960.1458031171306"
List-Unsubscribe: <http://example.com/unsubscribe/dmlwdWw0LmpAdGNzLmNvbSM5Ng==>
Feedback-ID: 19:96:1520615:MyDomain
Now in the else statement, I want to extract information, if I try payload['to'] it throws me an error TypeError: string indices must be integers, not str
Ok, let's say there is no way you can do it with the mail library (which I don't know), you may convert your raw message to a dictionary and get your elements:
this is is your raw message:
raw_message='''Return-Path: <>
X-Original-To: bounces#mydomain.com
Delivered-To: bounces#mydomain.com
Received: from inmumg01.tcs.com (inmumg01.tcs.com [219.64.33.12])
by smtp.mydomain.com (Postfix) with ESMTP id 603693FE11
for <bounces#mydomain.com>; Tue, 15 Mar 2016 04:39:36 -0400 (EDT)
Received: from localhost by inmumg01.tcs.com;
15 Mar 2016 14:09:38 +0530
Message-Id: <5aaa80$2543de#inmumg01.tcs.com>
Date: 15 Mar 2016 14:09:38 +0530
To: bounces#mydomain.com
From: "Mail Delivery System" <mail.notification#tcs.com>
Subject: Undeliverable Message
The following message to <vipul4.j#tcs.com> was undeliverable.
The reason for the problem:
5.1.0 - Unknown address error 550-'vipul4.j#tcs.com... No such user'
The IP address of the MTA to which the message could not be sent:
172.17.9.35
---------- A copy of the message begins below this line ----------
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: A0CGBgCEyedW/3zwO4tdHgEBAg4BgklMUm2nXoJekBMBDYFmBxUFAQ2HGwI4FAEBAQEBAQFkJ4RLIAoTAQEECCwGSQMBCQICMTsFHASHJ10FCatgZ4RBAQSLKQaBD4REgkIBhlERAWqCNBOBJ5MJhEuBLwKEPogSAoIuhnOFYo1YgUUBAUKBNgwBgj5OB4kqgTIBAQE
X-IPAS-Result: A0CGBgCEyedW/3zwO4tdHgEBAg4BgklMUm2nXoJekBMBDYFmBxUFAQ2HGwI4FAEBAQEBAQFkJ4RLIAoTAQEECCwGSQMBCQICMTsFHASHJ10FCatgZ4RBAQSLKQaBD4REgkIBhlERAWqCNBOBJ5MJhEuBLwKEPogSAoIuhnOFYo1YgUUBAUKBNgwBgj5OB4kqgTIBAQE
X-IronPort-AV: E=Sophos;i="5.24,338,1454956200";
d="scan'208,217";a="72486315"
X-Amp-Result: Clean
X-Amp-File-Uploaded: False
Received: from smtp.mydomain.com ([139.59.240.124])
by inmumg01.tcs.com with ESMTP/TLS/DHE-RSA-AES256-SHA; 15 Mar 2016 14:09:37 +0530
Received: from 128.199.202.14 (unknown [128.199.202.14])
(Authenticated sender: mailsender)
by smtp.mydomain.com (Postfix) with ESMTPA id 0D41F3FE11
for <vipul4.j#tcs.com>; Tue, 15 Mar 2016 04:39:33 -0400 (EDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kitemailer.com;
s=kitemail; t=1458031173;
bh=1KFZSL77mNYuQ3iTjpNMdGcBOp2a4pGQnLVlq49ZrGg=;
h=Date:From:To:Subject:List-Unsubscribe;
b=Xxaf++WE0B7HL+FN28O76df7gYNEIKzk8eE9VpxrnMBCpGWPKWBMMfVDfCyie3NBJ
GJiMxn/Yhn+ey6Mr5R5AK5JO5n72yWlytLm0RepMEydaeHHVQPx7bE+LMDMlORSFin
bWdnz58lNMuZ3w9qtqjCXt22Sk5yXfCO71tRgfus=
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=mydomain.com;
s=kitemail; t=1458031173;
bh=1KFZSL77mNYuQ3iTjpNMdGcBOp2a4pGQnLVlq49ZrGg=;
h=Date:From:To:Subject:List-Unsubscribe;
b=FiGdkSE9LCjYkfYyWq65GbZoMZVCQs5OXXJA35CyGQtjPWbvwIKvx7Z6Ff39EBRLf
Vu+6PUrvwyZLFh/1CW0NGOHDgUDjWWQ2jHfnNpJ9QEbHgOwomuMty10HDeZnIr0zM7
8mFCgeCbiiyusQkhmXh5aYqqD+Q/1wFcrpLpkBZc=
Date: Tue, 15 Mar 2016 04:39:31 -0400 (EDT)
From: Kitemailer Newsletter <info#kitemailer.com>
To: vipul4.j#tcs.com
Message-ID: <15106466-1b13-4d64-b220-15f05f4815b7-1458031171312#smtp.mydomain.com>
Subject: KiteMailer | New Features this Week
MIME-Version: 1.0
Content-Type: multipart/mixed;
boundary="----=_Part_44_1398250960.1458031171306"
List-Unsubscribe: <http://example.com/unsubscribe/dmlwdWw0LmpAdGNzLmNvbSM5Ng==>
Feedback-ID: 19:96:1520615:MyDomain'''
I'm using your code to get payload:
#in case it is not multipart
import email
mail = email.message_from_string(raw_message)
payload = mail.get_payload(decode=True)
mail_dico = { elt.split(":",1)[0].strip():elt.split(":", 1)[1].strip() for elt in payload.split("\n") if ":" in elt and " " not in elt.split(':')[0].strip()}
here is your dictionary:
{'Content-Type': 'multipart/mixed;',
'DKIM-Signature': 'v=1; a=rsa-sha256; c=relaxed/simple; d=mydomain.com;',
'Date': 'Tue, 15 Mar 2016 04',
'Feedback-ID': '19',
'From': 'Kitemailer Newsletter <info#kitemailer.com>',
'List-Unsubscribe': '<http',
'MIME-Version': '1.0',
'Message-ID': '<15106466-1b13-4d64-b220-15f05f4815b7-1458031171312#smtp.mydomain.com>',
'Received': 'from 128.199.202.14 (unknown [128.199.202.14])',
'Subject': 'KiteMailer | New Features this Week',
'To': 'vipul4.j#tcs.com',
'X-Amp-File-Uploaded': 'False',
'X-Amp-Result': 'Clean',
'X-IPAS-Result': 'A0CGBgCEyedW/3zwO4tdHgEBAg4BgklMUm2nXoJekBMBDYFmBxUFAQ2HGwI4FAEBAQEBAQFkJ4RLIAoTAQEECCwGSQMBCQICMTsFHASHJ10FCatgZ4RBAQSLKQaBD4REgkIBhlERAWqCNBOBJ5MJhEuBLwKEPogSAoIuhnOFYo1YgUUBAUKBNgwBgj5OB4kqgTIBAQE',
'X-IronPort-AV': 'E=Sophos;i="5.24,338,1454956200";',
'X-IronPort-Anti-Spam-Filtered': 'true',
'X-IronPort-Anti-Spam-Result': 'A0CGBgCEyedW/3zwO4tdHgEBAg4BgklMUm2nXoJekBMBDYFmBxUFAQ2HGwI4FAEBAQEBAQFkJ4RLIAoTAQEECCwGSQMBCQICMTsFHASHJ10FCatgZ4RBAQSLKQaBD4REgkIBhlERAWqCNBOBJ5MJhEuBLwKEPogSAoIuhnOFYo1YgUUBAUKBNgwBgj5OB4kqgTIBAQE',
'h=Date': 'From'}
now you can access your elements:
print(mail_dico["To"])
>> 'vipul4.j#tcs.com'
print(mail_dico["Subject"])
>> 'KiteMailer | New Features this Week'
This is probably not the best way to do this but I hope it helped.
I'm trying to parse the html into dictionary
My current code has lots of logic in it.
It smells bad, I use the lxml to help me to parse it.
Any recommend method to parse the kind of html without too much well-formed DOM ?
Thanks so much
original html
<p><strong>Departs:</strong> 5:15:00AM, Sat, Nov 28, 2015 - Taipei</p>
<p><strong>Arrives:</strong> 8:00:00AM, Sat, Nov 28, 2015 - Bangkok - Don Mueang</p>
<p><strong>Flight duration:</strong> 3h 45m</p>
<p><strong>Operated by:</strong> NokScoot</p>
expected result
{
Departs: "5:15:00AM, Sat, Nov 28, 2015",
Arrives: "5:15:00AM, Sat, Nov 28, 2015",
Flight duration: "3h 45m"
...
}
current code (implementing)
doc_root = html.document_fromstring(resp.text)
for ele in doc_root.xpath('//ul[#class="tb_body"]'):
if has_stops(ele.xpath('.//li[#class="tb_body_flight"]//span[#class="has_cuspopup"]')):
continue
set_trace()
from_city = ele.xpath('.//li[#class="tb_body_city"]')[0]
set_trace()
sub_ele = ele.xpath('.//li[#class="tb_body_flight"]//span[#class="has_cuspopup"]')
set_trace()
I created example for html you provided. It uses popular Beautiful Soup.
from bs4 import BeautifulSoup
data = '<p><strong>Departs:</strong> 5:15:00AM, Sat, Nov 28, 2015 - Taipei</p>\
<p><strong>Arrives:</strong> 8:00:00AM, Sat, Nov 28, 2015 - Bangkok - Don Mueang</p>\
<p><strong>Flight duration:</strong> 3h 45m</p>\
<p><strong>Operated by:</strong> NokScoot</p>'
soup = BeautifulSoup(data, 'html.parser')
res = {p.contents[0].text: p.contents[1].split(' - ')[0].strip() for p in soup.find_all('p')}
print(res)
Output:
{
'Departs:': '5:15:00AM, Sat, Nov 28, 2015',
'Flight duration:': '3h 45m',
'Operated by:': 'NokScoot',
'Arrives:': '8:00:00AM, Sat, Nov 28, 2015'
}
I think you should avoid of using attributes if you want to make your code compact.
Here's an example email header,
header = """
From: Media Temple user (mt.kb.user#gmail.com)
Subject: article: A sample header
Date: January 25, 2011 3:30:58 PM PDT
To: user#example.com
Return-Path: <mt.kb.user#gmail.com>
Envelope-To: user#example.com
Delivery-Date: Tue, 25 Jan 2011 15:31:01 -0700
Received: from :po-out-1718.google.com ([72.14.252.155]:54907) by cl35.gs01.gridserver.com with esmtp (Exim 4.63) (envelope-from <mt.kb.user#gmail.com>) id 1KDoNH-0000f0-RL for user#example.com; Tue, 25 Jan 2011 15:31:01 -0700
Dkim-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:message-id:date:from:to :subject:mime-version:content-type; bh=+JqkmVt+sHDFIGX5jKp3oP18LQf10VQjAmZAKl1lspY=; b=F87jySDZnMayyitVxLdHcQNL073DytKRyrRh84GNsI24IRNakn0oOfrC2luliNvdea LGTk3adIrzt+N96GyMseWz8T9xE6O/sAI16db48q4Iqkd7uOiDvFsvS3CUQlNhybNw8m CH/o8eELTN0zbSbn5Trp0dkRYXhMX8FTAwrH0=
Domainkey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=message-id:date:from:to:subject:mime-version:content-type; b=wkbBj0M8NCUlboI6idKooejg0sL2ms7fDPe1tHUkR9Ht0qr5lAJX4q9PMVJeyjWalH 36n4qGLtC2euBJY070bVra8IBB9FeDEW9C35BC1vuPT5XyucCm0hulbE86+uiUTXCkaB 6ykquzQGCer7xPAcMJqVfXDkHo3H61HM9oCQM=
Message-Id: <c8f49cec0807011530k11196ad4p7cb4b9420f2ae752#mail.gmail.com>
Mime-Version: 1.0
Content-Type: multipart/alternative; boundary="----=_Part_3927_12044027.1214951458678"
X-Spam-Status: score=3.7 tests=DNS_FROM_RFC_POST, HTML_00_10, HTML_MESSAGE, HTML_SHORT_LENGTH version=3.1.7
X-Spam-Level: ***
Message Body: **The email message body**
"""
The header is stored as a string, how do I parse this header, so that i can map it to a dictionary as the header fields be the key and the values be the values in the dictionary?
I want a dictionary like this,
header_dict = {
'From': 'Media Temple user (mt.kb.user#gmail.com)',
'Subject': article: 'A sample header',
'Date': 'January 25, 2011 3:30:58 PM PDT'
'and so on': .. . . . .. . . .. .
. . . . .. . . . .. . . . . .
}
I made a list of fields required,
header_reqd = ['From:','Subject:','Date:','To:','Return-Path:','Envelope-To:','Delivery-Date:','Received:','Dkim-Signature:','Domainkey-Signature:','Message-Id:','Mime-Version:','Content-Type:','X-Spam-Status:','X-Spam-Level:','Message Body:']
This can list items can likely be the keys for the dictionary.
It seems most of these answers have overlooked the Python email parser and the output results are not correct with prefix spaces in the values. Also the OP has perhaps made a typo by including a preceding newline in the header string which requires stripped for the email parser to work.
from email.parser import HeaderParser
header = header.strip() # Fix incorrect formatting
email_message = HeaderParser().parsestr(header)
dict(email_message)
Output (truncated):
>>> from pprint import pprint
>>> pprint(dict(email_message))
{'Content-Type': 'multipart/alternative; '
'boundary="----=_Part_3927_12044027.1214951458678"',
'Date': 'January 25, 2011 3:30:58 PM PDT',
'Delivery-Date': 'Tue, 25 Jan 2011 15:31:01 -0700',
...
'Subject': 'article: A sample header',
'To': 'user#example.com',
'X-Spam-Level': '***',
'X-Spam-Status': 'score=3.7 tests=DNS_FROM_RFC_POST, HTML_00_10, '
'HTML_MESSAGE, HTML_SHORT_LENGTH version=3.1.7'}
Duplicate header keys
Be aware that email message headers can contain duplicate keys as mentioned in the Python documentation for email.message
Headers are stored and returned in case-preserving form, but field names are matched case-insensitively. Unlike a real dict, there is an ordering to the keys, and there can be duplicate keys. Additional methods are provided for working with headers that have duplicate keys.
For example converting the following email message to a Python dict only the first Received key would be retained.
headers = HeaderParser().parsestr("""Received: by mx0047p1mdw1.sendgrid.net with SMTP id 6WCVv7KAWn Wed, 27 Jul 2016 20:53:06 +0000 (UTC)
Received: from mail-io0-f169.google.com (mail-io0-f169.google.com [209.85.223.169]) by mx0047p1mdw1.sendgrid.net (Postfix) with ESMTPS id AA9FFA817F2 for <example#example.comom>; Wed, 27 Jul 2016 20:53:06 +0000 (UTC)
Received: by mail-io0-f169.google.com with SMTP id b62so81593819iod.3 for <example#example.comom>; Wed, 27 Jul 2016 13:53:06 -0700 (PDT)""")
dict(headers)
{'Received': 'by mx0047p1mdw1.sendgrid.net with SMTP id 6WCVv7KAWn Wed, 27 Jul 2016 20:53:06 +0000 (UTC)'}
Use the get_all method to check for duplicates:
headers.get_all('Received')
['by mx0047p1mdw1.sendgrid.net with SMTP id 6WCVv7KAWn Wed, 27 Jul 2016 20:53:06 +0000 (UTC)', 'from mail-io0-f169.google.com (mail-io0-f169.google.com [209.85.223.169]) by mx0047p1mdw1.sendgrid.net (Postfix) with ESMTPS id AA9FFA817F2 for <example#example.comom>; Wed, 27 Jul 2016 20:53:06 +0000 (UTC)', 'by mail-io0-f169.google.com with SMTP id b62so81593819iod.3 for <example#example.comom>; Wed, 27 Jul 2016 13:53:06 -0700 (PDT)']
you can split string on newline, then split each line on ":"
>>> my_header = {}
>>> for x in header.strip().split("\n"):
... x = x.split(":", 1)
... my_header[x[0]] = x[1]
...
header = """From: Media Temple user (mt.kb.user#gmail.com)
Subject: article: A sample header
Date: January 25, 2011 3:30:58 PM PDT
To: user#example.com
Return-Path: <mt.kb.user#gmail.com>
Envelope-To: user#example.com
Delivery-Date: Tue, 25 Jan 2011 15:31:01 -0700
Received: from :po-out-1718.google.com ([72.14.252.155]:54907) by cl35.gs01.gridserver.com with esmtp (Exim 4.63) (envelope-from <mt.kb.user#gmail.com>) id 1KDoNH-0000f0-RL for user#example.com; Tue, 25 Jan 2011 15:31:01 -0700
Dkim-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:message-id:date:from:to :subject:mime-version:content-type; bh=+JqkmVt+sHDFIGX5jKp3oP18LQf10VQjAmZAKl1lspY=; b=F87jySDZnMayyitVxLdHcQNL073DytKRyrRh84GNsI24IRNakn0oOfrC2luliNvdea LGTk3adIrzt+N96GyMseWz8T9xE6O/sAI16db48q4Iqkd7uOiDvFsvS3CUQlNhybNw8m CH/o8eELTN0zbSbn5Trp0dkRYXhMX8FTAwrH0=
Domainkey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=message-id:date:from:to:subject:mime-version:content-type; b=wkbBj0M8NCUlboI6idKooejg0sL2ms7fDPe1tHUkR9Ht0qr5lAJX4q9PMVJeyjWalH 36n4qGLtC2euBJY070bVra8IBB9FeDEW9C35BC1vuPT5XyucCm0hulbE86+uiUTXCkaB 6ykquzQGCer7xPAcMJqVfXDkHo3H61HM9oCQM=
Message-Id: <c8f49cec0807011530k11196ad4p7cb4b9420f2ae752#mail.gmail.com>
Mime-Version: 1.0
Content-Type: multipart/alternative; boundary="----=_Part_3927_12044027.1214951458678"
X-Spam-Status: score=3.7 tests=DNS_FROM_RFC_POST, HTML_00_10, HTML_MESSAGE, HTML_SHORT_LENGTH version=3.1.7
X-Spam-Level: ***
Message Body: **The email message body**
"""
Split into individual lines then split each line once on :
from pprint import pprint as pp
pp(dict(line.split(":",1) for line in header.splitlines()))
Output:
{'Content-Type': ' multipart/alternative; '
'boundary="----=_Part_3927_12044027.1214951458678"',
'Date': ' January 25, 2011 3:30:58 PM PDT',
'Delivery-Date': ' Tue, 25 Jan 2011 15:31:01 -0700',
'Dkim-Signature': ' v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; '
's=gamma; '
'h=domainkey-signature:received:received:message-id:date:from:to '
':subject:mime-version:content-type; '
'bh=+JqkmVt+sHDFIGX5jKp3oP18LQf10VQjAmZAKl1lspY=; '
'b=F87jySDZnMayyitVxLdHcQNL073DytKRyrRh84GNsI24IRNakn0oOfrC2luliNvdea '
'LGTk3adIrzt+N96GyMseWz8T9xE6O/sAI16db48q4Iqkd7uOiDvFsvS3CUQlNhybNw8m '
'CH/o8eELTN0zbSbn5Trp0dkRYXhMX8FTAwrH0=',
'Domainkey-Signature': ' a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; '
'h=message-id:date:from:to:subject:mime-version:content-type; '
'b=wkbBj0M8NCUlboI6idKooejg0sL2ms7fDPe1tHUkR9Ht0qr5lAJX4q9PMVJeyjWalH '
'36n4qGLtC2euBJY070bVra8IBB9FeDEW9C35BC1vuPT5XyucCm0hulbE86+uiUTXCkaB '
'6ykquzQGCer7xPAcMJqVfXDkHo3H61HM9oCQM=',
'Envelope-To': ' user#example.com',
'From': ' Media Temple user (mt.kb.user#gmail.com)',
'Message Body': ' **The email message body**',
'Message-Id': ' '
'<c8f49cec0807011530k11196ad4p7cb4b9420f2ae752#mail.gmail.com>',
'Mime-Version': ' 1.0',
'Received': ' from :po-out-1718.google.com ([72.14.252.155]:54907) by '
'cl35.gs01.gridserver.com with esmtp (Exim 4.63) (envelope-from '
'<mt.kb.user#gmail.com>) id 1KDoNH-0000f0-RL for '
'user#example.com; Tue, 25 Jan 2011 15:31:01 -0700',
'Return-Path': ' <mt.kb.user#gmail.com>',
'Subject': ' article: A sample header',
'To': ' user#example.com',
'X-Spam-Level': ' ***',
'X-Spam-Status': ' score=3.7 tests=DNS_FROM_RFC_POST, HTML_00_10, '
'HTML_MESSAGE, HTML_SHORT_LENGTH version=3.1.7'}
line.split(":",1) makes sure we only split once on : so if there are any : in the values we won't end up splitting that also. You end up with sublists that are key/value pairings so calling dict creates the dict create from each pairing.
split will work for you:
Demo:
>>> result = {}
>>> for i in header.split("\n"):
... i = i.strip()
... if i :
... k, v = i.split(":", 1)
... result[k] = v
output:
>>> import pprint
>>> pprint.pprint(result)
{'Content-Type': ' multipart/alternative; boundary="----=_Part_3927_12044027.1214951458678"',
'Date': ' January 25, 2011 3:30:58 PM PDT',
'Delivery-Date': ' Tue, 25 Jan 2011 15:31:01 -0700',
'Dkim-Signature': ' v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:message-id:date:from:to :subject:mime-version:content-type; bh=+JqkmVt+sHDFIGX5jKp3oP18LQf10VQjAmZAKl1lspY=; b=F87jySDZnMayyitVxLdHcQNL073DytKRyrRh84GNsI24IRNakn0oOfrC2luliNvdea LGTk3adIrzt+N96GyMseWz8T9xE6O/sAI16db48q4Iqkd7uOiDvFsvS3CUQlNhybNw8m CH/o8eELTN0zbSbn5Trp0dkRYXhMX8FTAwrH0=',
'Domainkey-Signature': ' a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=message-id:date:from:to:subject:mime-version:content-type; b=wkbBj0M8NCUlboI6idKooejg0sL2ms7fDPe1tHUkR9Ht0qr5lAJX4q9PMVJeyjWalH 36n4qGLtC2euBJY070bVra8IBB9FeDEW9C35BC1vuPT5XyucCm0hulbE86+uiUTXCkaB 6ykquzQGCer7xPAcMJqVfXDkHo3H61HM9oCQM=',
'Envelope-To': ' user#example.com',
'From': ' Media Temple user (mt.kb.user#gmail.com)',
'Message Body': ' **The email message body**',
'Message-Id': ' <c8f49cec0807011530k11196ad4p7cb4b9420f2ae752#mail.gmail.com>',
'Mime-Version': ' 1.0',
'Received': ' from :po-out-1718.google.com ([72.14.252.155]:54907) by cl35.gs01.gridserver.com with esmtp (Exim 4.63) (envelope-from <mt.kb.user#gmail.com>) id 1KDoNH-0000f0-RL for user#example.com; Tue, 25 Jan 2011 15:31:01 -0700',
'Return-Path': ' <mt.kb.user#gmail.com>',
'Subject': ' article: A sample header',
'To': ' user#example.com',
'X-Spam-Level': ' ***',
'X-Spam-Status': ' score=3.7 tests=DNS_FROM_RFC_POST, HTML_00_10, HTML_MESSAGE, HTML_SHORT_LENGTH version=3.1.7'}