payload of an email in string format, python

I got the payload as a string using the get_payload() method, but I want it in a form where I can access it word by word.
I tried several things, such as the as_string() method, the flatten() method, and the get_charset() method, but each time I ran into some problem.
I got my payload using the following code:
import email
from email import *
f=open('mail.txt','r')
obj=email.parser.Parser()
fp=obj.parse(f)
payload=fp.get_payload()

Just tested your snippet with a couple of my own raw emails. Works fine...
get_payload() returns either a list or string, so you need to check that first
if isinstance(payload, list):
    for m in payload:
        print(str(m).split())
else:
    print(str(payload).split())
Edit
Per our discussion, your issue was that you were not checking is_multipart() on the fp object, which is actually a message instance. If fp.is_multipart() is True, then get_payload() will return a list of message instances. In your case, based on your example mail message, it was NOT multipart, and fp was actually the object you were interested in.
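The check described above can be sketched as follows. This is only a minimal illustration: the helper name payload_words and the two in-memory sample messages are made up for the example (they stand in for the asker's mail.txt), and only text/plain parts are split into words.

```python
import email

def payload_words(msg):
    """Return the words of every text/plain part of a parsed message."""
    words = []
    # walk() yields the message itself and, for multipart mail, every sub-part
    for part in msg.walk():
        if part.is_multipart():
            continue  # container parts hold other parts, not text
        if part.get_content_type() == "text/plain":
            words.extend(part.get_payload().split())
    return words

# hypothetical sample messages, for illustration only
simple = email.message_from_string("Subject: hi\n\ngood day\nsee you soon\n")
multi = email.message_from_string(
    'Content-Type: multipart/mixed; boundary="XX"\n'
    "\n"
    "--XX\n"
    "Content-Type: text/plain\n"
    "\n"
    "first part\n"
    "--XX\n"
    "Content-Type: text/plain\n"
    "\n"
    "second part\n"
    "--XX--\n"
)

print(payload_words(simple))
print(payload_words(multi))
```

Using walk() means the same function handles both the single-part and the multipart case, so the caller does not need a separate isinstance() check.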

I got my payload as a string because my fp was not multipart.
If it had been multipart, get_payload() would have returned a list of message objects.
So now I can just use the following code:
payload=fp.get_payload()
abc=payload.split(" ")
It gives me the following output:
['good', 'day\nhttp://72.167.116.186/image/bdfedx.php?iqunin=3D41\n\n', '', '', '', '', '', '', '', '', '', '', '', 'Sun,', '18', 'Dec', '2011', '10:53:43\n_________________\n"She', 'wiped', 'him', 'dry', 'with', 'soft', 'flannel,', 'and', 'gave', 'him', 'some', 'clean,', 'dry', 'clothes,=\n', 'and', 'made', 'him', 'very', 'comfortable', 'again."', '(c)', 'Lyrica', 'wa946758\n']
thanks to jdi :)
P.S. I couldn't post this as an answer yesterday because of a restriction on points.
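The empty strings and embedded newlines in that output come from splitting on a single space only. A small sketch of the difference, using a shortened, made-up payload string rather than the real message:

```python
# shortened, made-up payload for illustration
payload = "good day\nhttp://example.com/page\n\n    Sun, 18 Dec 2011\n"

# splitting on one space keeps newlines and yields empty strings
by_space = payload.split(" ")

# split() with no argument splits on any run of whitespace
by_whitespace = payload.split()

print(by_space)
print(by_whitespace)
```

With no argument, split() treats spaces, tabs, and newlines alike and never returns empty strings, which is usually what "word by word" access needs.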

Related

Python splitting text with line breaks into a list

I'm trying to convert some text into a list. The text contains special characters, numbers, and line breaks. Ultimately I want to have a list with each word as an item in the list without any special characters, numbers, or spaces.
Excerpt from text:
I have no ambition to lose my life on the post-road between St. Petersburgh and Archangel. &lt;the&lt I
Currently I'm using this line to split each word into an item in the list:
text_list = [re.sub(r"[^a-zA-Z0-9]+", ' ', k)
             for k in content.split(" ")]
print(text_list)
This code is leaving in spaces and combining words in each item of the list like below
Result:
['I', 'have', 'no', 'ambition', 'to', 'lose', 'my', 'life', 'on', 'the',
'post road', 'between St ', 'Petersburgh', 'and', 'Archangel ', ' lt the lt I']
I would like to split the words into individual items of the list and remove the string ' lt ' and numbers from my list items.
Expected result:
['I', 'have', 'no', 'ambition', 'to', 'lose', 'my', 'life', 'on', 'the',
'post', 'road', 'between', 'St', 'Petersburgh', 'and', 'Archangel', 'the', 'I']
Please help me resolve this issue.
Thanks
Since it looks like you're parsing html text, it's likely all entities are enclosed in & and ;. Removing those makes matching the rest quite easy.
import re
content = 'I have no ambition to lose my life on the post-road between St. Petersburgh and Archangel. &lt;the&lt; I'
# first, remove entities, the question mark makes sure the expression isn't too greedy
content = re.sub(r'&[^ ]+?;', '', content)
# then just match anything that meets your rules
text_list = re.findall(r"[a-zA-Z0-9]+", content)
print(text_list)
Note that 'St Petersburg' likely got matched together because the character between the 't' and 'P' probably isn't a space, but a non-breaking space. If this were just html, I'd expect there to be &nbsp; or something of the sort, but it's possible that in your case there's some Unicode non-breaking space character there.
That should not matter with the code above, but if you use a solution using .split(), it likely won't see that character as a space.
In case the &lt is not your mistake, but in the original, this works as a replacement for the .sub() statement:
content = re.sub(r'&[^ ;]+?(?=[ ;]);?', '', content)
Clearly a bit more complicated: it substitutes any string that starts with & [&], followed by one or more characters that are not a space or ;, taking as little as possible [[^ ;]+?], but only if they are then followed by a space or a ; [(?=[ ;])], and in that case that ; is also matched [;?].
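To see the difference between the two substitutions, here is a small sketch run on a shortened sample that contains one well-formed entity (&lt;) and one missing its semicolon (&lt). The variable names strict and lenient are chosen just for this illustration.

```python
import re

# one entity with a semicolon, one without
content = "Archangel. &lt;the&lt I"

# first pattern: only removes entities that end in ';'
strict = re.sub(r"&[^ ]+?;", "", content)

# second pattern: also removes entities that are followed by a space instead
lenient = re.sub(r"&[^ ;]+?(?=[ ;]);?", "", content)

print(re.findall(r"[a-zA-Z0-9]+", strict))   # ['Archangel', 'the', 'lt', 'I']
print(re.findall(r"[a-zA-Z0-9]+", lenient))  # ['Archangel', 'the', 'I']
```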
Here is what can be done. You just need to replace any known entity codes in advance:
import re
# define the special syntax that we want to remove
special_syntax = r"&(lt|nbsp|gt|amp|quot|apos|cent|pound|yen|euro|copy|reg|)[; ]"
# remove the syntax first, then split and substitute the special characters
text_list = [re.sub(r"[^a-zA-Z0-9]+", ' ', k).strip()
             for k in re.sub(special_syntax, ' ', content).split(" ")]
# remove empty strings from the list
text_list = list(filter(lambda x: x != "", text_list))
print(text_list)
Output
['I', 'have', 'no', 'ambition', 'to', 'lose', 'my', 'life', 'on', 'the',
'post road', 'between', 'St', 'Petersburgh', 'and', 'Archangel', 'the', 'I']

Python unable to pass defaultdict values to function

I have some complex code which reads values into a nested defaultdict.
Then there is a loop going through the keys in the dictionary and working with them, basically assigning them to another nested defaultdict.
The problem is that when I want to access the values from the dictionary and pass them to a function, I get either an empty {} or something like this: defaultdict(<function tree at 0x2aff774309d8>
I have tried printing the dict so I can see if it is really empty. Part of my code:
if (not families_data[family]['cell_db']['output']):
    print(rf"Output for {family} is empty.")
    print(dict(families_data[family]['celldb']))
The really fun part is, when this "if" is true, then I get the following output:
Output for adfull is empty.
{'name': 'adfullx05_b', 'family': 'adfull', 'drive_strength': 0.5, 'template': 'adfull', 'category': '', 'pinmap': '', 'output': 'CO S', 'inout': '', 'input': 'A B CI', 'rail_supply': 'VDD VSS', 'well_supply': '', 'description': ''}
if I change the second line in the if to
print(families_data[family]['celldb'])
I get the following output:
defaultdict(<function tree at 0x2b45844059d8>, {'name': 'adfullx05_b', 'family': 'adfull', 'drive_strength': 0.5, 'template': 'adfull', 'category': '', 'pinmap': '', 'output': 'CO S', 'inout': '', 'input': 'A B CI', 'rail_supply': 'VDD VSS', 'well_supply': '', 'description': ''})
Why is the "if" even true, when there is a value 'CO S' in the output key?
Why am I getting {} when trying to access any value like families_data[family]['cell_db']['input'] and passing it to function as a parameter?
What the heck am I doing wrong?
The "cell_db" key in the if statement has an underscore while it does not in the print statement.
This should fix it:
if (not families_data[family]['celldb']['output']):
    print(rf"Output for {family} is empty.")
    print(dict(families_data[family]['celldb']))
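The reason the typo produced no KeyError can be shown with a small sketch. It assumes the nested structure is built with a recursive tree() factory, as the "function tree" in the printed defaultdict suggests; the data here is a one-entry stand-in for the real dictionary.

```python
from collections import defaultdict

def tree():
    # a defaultdict whose missing keys become further nested defaultdicts
    return defaultdict(tree)

families_data = tree()
families_data["adfull"]["celldb"]["output"] = "CO S"

# misspelled key: no KeyError, a new empty branch is silently created
wrong = families_data["adfull"]["cell_db"]["output"]
print(bool(wrong))                           # False, so "if not ..." is true
print("cell_db" in families_data["adfull"])  # True, the lookup created the key

# .get() reads without creating anything
print(families_data["adfull"].get("cel_db"))  # None
```

Because every missing key is filled in with an empty defaultdict, a misspelled key always looks like "empty data" rather than an error, which is why the if branch ran even though 'output' held 'CO S' under the correctly spelled key.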

How Do I Split A String Using Multiple Delimiters (Python)

I am trying to further split an already split string to further clean it up and remove unnecessary bits of info. This is a URL split by '/'
['https:', '', 'expressjs.com', 'en', 'starter', 'hello-world.html']
I would like to be able to make it:
['https:', '', 'expressjs','com', 'en', 'starter', 'hello-world','html']
Any thoughts?
re.split can split a string on every match for your regex:
>>> import re
>>> re.split(r'[/.]', 'https://expressjs.com/en/starter/hello-world.html')
['https:', '', 'expressjs', 'com', 'en', 'starter', 'hello-world', 'html']
[/.] matches any forward-slash or period character (inside a character class the period does not need to be escaped).
Try this:
L = ['https:', '', 'expressjs.com', 'en', 'starter', 'hello-world.html']
L = [subitem for item in L for subitem in item.split('.')]
print(L)
Output:
['https:', '', 'expressjs', 'com', 'en', 'starter', 'hello-world', 'html']

I am trying to print a few key lines from a directory of like files using line.split('\n') --> not recognizing lines

So this input file already has line breaks. It's the natural setting in which it's created. When I attempt to identify certain lines so that I can go back and call the values from said lines, I get:
name = line[2]
IndexError: list index out of range
Any thoughts? I know there has to be an easy solution as this is fairly basic but I have sifted through every entry on splitting and splitting with ('\n') and nothing has worked. Any help from you folks would be greatly appreciated!
-Ut prosim
Input:
ID rpmI_bact
AC TIGR00001
DE ribosomal protein L35
Script
for i in info.readlines():
    line = i.split('\n')
    id_hit = line[0]
    ac = line[1]
    name = line[2]
    print(name)
Error
name = line[2]
IndexError: list index out of range
First of all, when you call readlines(), you get back a list of all the lines in your file, which will probably look something like this (each string also ends with its '\n', left out here for readability):
['    ID  rpmI_bact', '    AC  TIGR00001', '    DE  ribosomal protein L35']
If you take one of these values and then try to split on newlines, nothing useful gets split:
'    ID  rpmI_bact'.split('\n')
['    ID  rpmI_bact']
Notice that the return value is a list with one element (two, counting the empty string left over when the trailing '\n' is still attached), so line[2] never exists and that's why you get your IndexError.
Now, it seems like you want to take each line and split on whitespace? If so, the way to do that is to use split(' '), but this is going to give you potentially unreliable content back:
In [8]: for line in lines:
   ...:     print(line.split(' '))
   ...:
['', '', '', '', 'ID', '', 'rpmI_bact']
['', '', '', '', 'AC', '', 'TIGR00001']
['', '', '', '', 'DE', '', 'ribosomal', 'protein', 'L35']
Notice how it's not obvious where the "content" is? We can solve this in a few ways. One is to introduce regexes, while the other way is to simply take the values that are not empty strings (note that empty strings in Python are falsy values):
In [9]: bool("")
Out[9]: False
In [10]: for line in lines:
    ...:     print([elem for elem in line.split(' ') if elem])
    ...:
['ID', 'rpmI_bact']
['AC', 'TIGR00001']
['DE', 'ribosomal', 'protein', 'L35']
Now you have to figure out what you want to do with those lists. I didn't really get that from the question, though.
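I'd probably consider just making it a dictionary. Then you can query it by the two-letter key. No need for the .readlines() either. Split each line only once, on the first run of whitespace, so that multi-word values stay together:
d = dict(line.strip().split(None, 1) for line in info)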
That should give you a dictionary looking like
{'AC': 'TIGR00001', 'DE': 'ribosomal protein L35', 'ID': 'rpmI_bact'}
Then you can just access the ID you're interested in
name = d['DE']
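A slightly more defensive version of the same idea, as a sketch: it assumes the records look like the Input shown in the question, uses io.StringIO as a stand-in for the open file object info, and skips blank or key-only lines instead of raising an error. The name records is chosen for this example.

```python
import io

# stand-in for the open file object `info`, using the records from the question
info = io.StringIO(
    "    ID  rpmI_bact\n"
    "    AC  TIGR00001\n"
    "\n"
    "    DE  ribosomal protein L35\n"
)

records = {}
for line in info:
    fields = line.split(None, 1)  # split once, on the first run of whitespace
    if len(fields) == 2:          # skip blank or key-only lines
        key, value = fields
        records[key] = value.strip()

print(records["DE"])
```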

turn key="value" string into a dict

I have a string with the following format:
author="PersonsName" date="1183050420" format="1.1" version="1.2"
I want to turn it in to a Python dict, a la:
{'author': 'PersonsName', 'date': '1183050420', 'format': '1.1', 'version': '1.2'}
I have tried to do so using re.split on the string as so:
attribs = (re.split('(=?" ?)', twikiattribs))
thinking I would get a list back like:
['author', 'PersonsName', 'date', '1183050420', 'format', '1.1', 'version', '1.2']
that then I could turn into a dict, but instead I'm getting:
['author', '="', 'PersonsName', '" ', 'date', '="', '1183050420', '" ', 'format', '="', '1.1', '" ', 'version', '="', '1.2', '"', '']
So, before I follow the re.split line further, is there generally a better way to achieve what I'm trying to do, and/or if the solution involves re.split, how can I write a regex that will split on any of the strings =", "_ (where "_" is a space char) or just " to just yield a list with the keys in the odd indices, and values in the even?
Use re.findall():
dict(re.findall(r'(\w+)="([^"]+)"', twikiattribs))
re.findall(), when presented with a pattern with multiple capturing groups, returns a list of tuples, each nested tuple containing the captured groups. dict() happily takes that output and interprets each nested tuple as a key-value pair.
Demo:
>>> import re
>>> twikiattribs = 'author="PersonsName" date="1183050420" format="1.1" version="1.2"'
>>> re.findall(r'(\w+)="([^"]+)"', twikiattribs)
[('author', 'PersonsName'), ('date', '1183050420'), ('format', '1.1'), ('version', '1.2')]
>>> dict(re.findall(r'(\w+)="([^"]+)"', twikiattribs))
{'date': '1183050420', 'format': '1.1', 'version': '1.2', 'author': 'PersonsName'}
re.split() also behaves differently based on capturing groups; the text on which you split is included in the output if grouped. Compare the output with and without the capturing group:
>>> re.split('(=?" ?)', twikiattribs)
['author', '="', 'PersonsName', '" ', 'date', '="', '1183050420', '" ', 'format', '="', '1.1', '" ', 'version', '="', '1.2', '"', '']
>>> re.split('=?" ?', twikiattribs)
['author', 'PersonsName', 'date', '1183050420', 'format', '1.1', 'version', '1.2', '']
The re.findall() output is far easier to convert to a dictionary however.
you can also do it without re in one line:
>>> data = '''author="PersonsName" date="1183050420" format="1.1" version="1.2"'''
>>> {k:v.strip('"') for k,v in [i.split("=",1) for i in data.split(" ")]}
{'date': '1183050420', 'format': '1.1', 'version': '1.2', 'author': 'PersonsName'}
if whitespaces are allowed inside the values you can use this line:
>>> {k:v.strip('"') for k,v in [i.split("=",1) for i in data.split('" ')]}
The way I'd personally parse it:
import shlex
s = 'author="PersonsName" date="1183050420" format="1.1" version="1.2"'
dict(x.split('=', 1) for x in shlex.split(s))
Out[12]:
{'author': 'PersonsName',
'date': '1183050420',
'format': '1.1',
'version': '1.2'}
A non-regex list comprehension one-liner (note that the values keep their quotes):
>>> s = 'author="PersonsName" date="1183050420" format="1.1" version="1.2"'
>>> print(dict([tuple(x.split('=')) for x in s.split()]))
{'date': '"1183050420"', 'format': '"1.1"', 'version': '"1.2"', 'author': '"PersonsName"'}
The problem is that you included parentheses in your regex, which turns it into a capturing group, so the matched delimiters are included in the split result. Assign attribs like this:
attribs = re.split('=?" ?', twikiattribs)
and it will work as expected. This does return a blank string at the end (due to the final " in your input string), so you'll want to use attribs[:-1] when creating the dictionary.
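Turning that flat list into a dictionary can be sketched like this, pairing every key with the value that follows it. The variable name result is chosen for this example.

```python
import re

twikiattribs = 'author="PersonsName" date="1183050420" format="1.1" version="1.2"'

# no capturing group, so the delimiters are dropped from the result
attribs = re.split(r'=?" ?', twikiattribs)[:-1]  # discard the trailing ''

# even positions hold the keys, odd positions hold the values
result = dict(zip(attribs[0::2], attribs[1::2]))
print(result)
```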
Try
>>> s = 'author="PersonsName" date="1183050420" format="1.1" version="1.2"'
>>> eval('dict(' + s.replace(" ", ",") + ')')
{'date': '1183050420', 'format': '1.1', 'version': '1.2', 'author': 'PersonsName'}
assuming as earlier the values have no space in them.
Beware of using eval() though. Bad things may happen for funny input. Don't use it on user input.
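This might help people in cases that re.findall() doesn't cover.
# grabbing input: a string holding a Python literal (dict, list, etc.)
input1 = "[[1, 2], ['list'], [3, 4]]"
# creating a phantom variable
phantom = 'variable_name = ' + input1
# executing the phantom; exec() itself returns None, so collect the names in a dict
namespace = {}
exec(phantom, namespace)
# storing the phantom variable in a live one
output = namespace['variable_name']
# printing the stored phantom variable
print(output)
What it essentially does is add a variable name to your input and create that variable.
For example, if your list comes back as the string "[[1, 2], ['list'], [3, 4]]", this executes as variable_name = [[1, 2], ['list'], [3, 4]],
which turns the string back into the object it describes.
Reading variable_name directly after exec() tends to trigger a linter warning, since the variable doesn't exist until the code runs; looking it up in the namespace dict avoids that. As with eval(), don't use this on user input.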
