Python unable to pass defaultdict values to function

I have a complex code which reads some values into nested defaultdict.
Then there is a cycle going through the keys in the dictionary and working with them - basically assigning them to another nested defaultdict.
Problem is, when I want to use the values from the dictionary, access them, and pass them as values to a function... I get either an empty {} or something like this: defaultdict(<function tree at 0x2aff774309d8>
I have tried printing the dict so I can see if it is really empty. Part of my code:
if (not families_data[family]['cell_db']['output']):
    print(rf"Output for {family} is empty.")
    print(dict(families_data[family]['celldb']))
The really fun part is, when this "if" is true, then I get the following output:
Output for adfull is empty.
{'name': 'adfullx05_b', 'family': 'adfull', 'drive_strength': 0.5, 'template': 'adfull', 'category': '', 'pinmap': '', 'output': 'CO S', 'inout': '', 'input': 'A B CI', 'rail_supply': 'VDD VSS', 'well_supply': '', 'description': ''}
If I change the second print inside the if to
print(families_data[family]['celldb'])
I get the following output:
defaultdict(<function tree at 0x2b45844059d8>, {'name': 'adfullx05_b', 'family': 'adfull', 'drive_strength': 0.5, 'template': 'adfull', 'category': '', 'pinmap': '', 'output': 'CO S', 'inout': '', 'input': 'A B CI', 'rail_supply': 'VDD VSS', 'well_supply': '', 'description': ''})
Why is the "if" even true, when there is a value 'CO S' in the output key?
Why am I getting {} when trying to access any value like families_data[family]['cell_db']['input'] and passing it to function as a parameter?
What the heck am I doing wrong?

The "cell_db" key in the if statement has an underscore while it does not in the print statement.
This should fix it:
if (not families_data[family]['celldb']['output']):
    print(rf"Output for {family} is empty.")
    print(dict(families_data[family]['celldb']))
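The question doesn't show how tree is defined, but the defaultdict(<function tree at ...>) repr suggests the usual self-referential recipe. Assuming that, a minimal sketch of why the if fires on the misspelled key:

from collections import defaultdict

def tree():
    return defaultdict(tree)

families_data = tree()
families_data['adfull']['celldb']['output'] = 'CO S'

# Looking up the misspelled 'cell_db' does not raise KeyError: the
# defaultdict silently creates a new empty entry, and that entry's
# 'output' is yet another empty (falsy) defaultdict.
if not families_data['adfull']['cell_db']['output']:
    print("Output for adfull is empty.")   # this prints, despite 'CO S' under 'celldb'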

Related

Removing duplicate pairs in python only where the keys are the same

I am looking to remove duplicates in a python dictionary but only where the keys are the same. Here is an example.
original_dict = {'question a' : 'pizza', 'question b' : 'apple', 'question a': 'banana'}
I want to remove the 'question a' item so there would only be one 'question a'. The problem I am facing is that the values are not the same. Any way to do this easily in Python 3.x?
By definition, the dictionary keeps only 1 value per key, so you will not have duplicates. In your example, the last value for the duplicate key is the one that will be kept (that's what "the old value associated with that key is forgotten" means below):
original_dict = {'question a' : 'pizza', 'question b' : 'apple', 'question a': 'banana'}
print(original_dict)
# {'question a': 'banana', 'question b': 'apple'}
From the docs:
It is best to think of a dictionary as a set of key: value pairs, with the requirement that the keys are unique (within one dictionary). [...] If you store using a key that is already in use, the old value associated with that key is forgotten.
A Python dictionary won't allow duplicate keys in the first place. If you create a dictionary literal containing the same key two or more times, only the last occurrence is kept, irrespective of the value, and the others are dropped.
In this case, 'question a': 'pizza' is dropped and 'question a': 'banana' is kept, as it is the last occurrence.
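If the pairs actually arrive as a list of (key, value) tuples rather than a dict literal, and the goal is to keep the first value per key instead of the last, a small sketch (the pairs variable here is hypothetical):

pairs = [('question a', 'pizza'), ('question b', 'apple'), ('question a', 'banana')]

deduped = {}
for key, value in pairs:
    deduped.setdefault(key, value)   # keeps the first value seen for each key

print(deduped)
# {'question a': 'pizza', 'question b': 'apple'}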

How to prevent duplicate UUID's on duplicate texts (in list of dicts) in python?

I have to filter the texts I process (texts) by checking whether people's names appear in them. If a name does appear, the text is appended as a nested dictionary to the existing list of dictionaries containing people's names (people). However, since more than one person's name appears in some texts, the same child document gets added again under each matching person. As a result, the child document does not carry a unique ID, and this unique ID is very important, even when the texts themselves are repeated.
Is there a smarter way of adding a unique ID even if the texts are repeated?
My code:
import uuid
people = [{'id': 1,
           'name': 'Bob',
           'type': 'person',
           '_childDocuments_': [{'text': 'text_replace'}]},
          {'id': 2,
           'name': 'Kate',
           'type': 'person',
           '_childDocuments_': [{'text': 'text_replace'}]},
          {'id': 3,
           'name': 'Joe',
           'type': 'person',
           '_childDocuments_': [{'text': 'text_replace'}]}]
texts = ['this text has the name Bob and Kate',
         'this text has the name Kate only ']
for text in texts:
    childDoc = {'id': str(uuid.uuid1()),  # the id will duplicate when files are repeated
                'text': text}
    for person in people:
        if person['name'] in childDoc['text']:
            person['_childDocuments_'].append(childDoc)
Current output:
[{'id': 1,
  'name': 'Bob',
  'type': 'person',
  '_childDocuments_': [{'text': 'text_replace'},
                       {'id': '7752597f-410f-11eb-9341-9cb6d0897972',  # duplicate ID here
                        'text': 'this text has the name Bob and Kate'}]},
 {'id': 2,
  'name': 'Kate',
  'type': 'person',
  '_childDocuments_': [{'text': 'text_replace'},
                       {'id': '7752597f-410f-11eb-9341-9cb6d0897972',  # duplicate ID here
                        'text': 'this text has the name Bob and Kate'},
                       {'id': '77525980-410f-11eb-b667-9cb6d0897972',
                        'text': 'this text has the name Kate only '}]},
 {'id': 3,
  'name': 'Joe',
  'type': 'person',
  '_childDocuments_': [{'text': 'text_replace'}]}]
As you can see in the current output, the text 'this text has the name Bob and Kate' carries the same identifier ('7752597f-410f-11eb-9341-9cb6d0897972') under both Bob and Kate, because the same childDoc is appended twice. But I would like each identifier to be different.
Desired output:
Same as current output, except we want every ID to be different for every appended text even if these texts are the same/duplicates.
Move the generation of the UUID inside the inner loop:
for text in texts:
    for person in people:
        if person['name'] in text:
            childDoc = {'id': str(uuid.uuid1()),
                        'text': text}
            person['_childDocuments_'].append(childDoc)
This does not actually ensure that the UUIDs are unique. For that you need to keep a set of used UUIDs; when generating a new one, check whether it is already used, and if so generate another. Repeat until you have either found an unused UUID or exhausted the UUID space.
There is a 1 in 2**61 chance that duplicates are generated. I can't accept collisions, as they result in data loss, so when I use UUIDs I have a loop around the generator that looks like this:
used = set()
while True:
    identifier = str(uuid.uuid1())
    if identifier not in used:
        used.add(identifier)
        break
The used set is actually stored persistently. I don't like this code, even though I have a program that uses it, because it ends up in an infinite loop when it can't find an unused UUID.
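One way to avoid the infinite-loop worry is to bound the retries; a sketch (the function name and attempt limit are my own choices, not from the thread):

import uuid

def new_unique_uuid(used, attempts=100):
    # Try a bounded number of times instead of looping forever.
    for _ in range(attempts):
        identifier = str(uuid.uuid1())
        if identifier not in used:
            used.add(identifier)
            return identifier
    raise RuntimeError("no unused UUID found after %d attempts" % attempts)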
Some document databases provide automatic UUID assignment and they do this for you internally to ensure that a given database instance never ends up with two documents with the same UUID.

Regex: pairing up numbers and '#'-preceded comments in a list of strings

So I have some lines of text that are stored in a list as follows:
lines = ['1.9 #comment 1.11* 1.5 # another comment',
         '1.23',
         '3.10.3* #commennnnnt 1.2 ']
I want to create:
[{'1.9': 'comment'},
 {'1.11*': ''},
 {'1.5': 'another comment'},
 {'1.23': ''},
 {'3.10.3*': 'commennnnnt'},
 {'1.2': ''}]
In other words, I want to take the list apart and pair each decimal number with either the comment (starting with '#'; we can assume that no numbers occur in it) that appears right after it on the same line, or with an empty string if there is no comment (e.g., the next thing after it is another number).
Specifically, a 'decimal number' can be a single digit, followed by a dot and then either one or two digits, optionally followed by a dot and one or two more digits. A '*' may appear at the very end. So like this(?): r'\d\.\d{1,2}(\.\d{1,2})?\*?'
I've tried a few things with re.split() to get started. For example, splitting the first list item on either the crazy decimal regex or #, before worrying about the dict pairings:
>>> crazy=r'\d\.\d{1,2}(\.\d{1,2})?\*?'
>>> re.split(r'({0})|#'.format(crazy), results[0])
Result:
[u'',
u'1.9',
None,
u' ',
None,
None,
u'comment ',
u'1.11',
None,
u' ',
u'1.5',
None,
u' ',
None,
None,
u' test comment']
This looks like something I can filter and work with, but is there a better way? (also, wow...it seems the parentheses in my crazy regex allow me to keep the decimal number delimiters as desired!)
The following seems to work:
import re

lines = ['1.9 #comment 1.11* 1.5 # another comment',
         '1.23',
         '3.10.3* #commennnnnt 1.2 ']
entries = re.findall(r'([0-9.]+\*?)\s+((?:[\# ])?[a-zA-Z ]*)', " ".join(lines))
ldict = [{k: v.strip(" #")} for k, v in entries]
print ldict
This displays:
[{'1.9': 'comment'}, {'1.11*': ''}, {'1.5': 'another comment'}, {'1.23': ''}, {'3.10.3*': 'commennnnnt'}, {'1.2': ''}]
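Alternatively, a Python 3 sketch built around the stricter decimal pattern from the question, splitting each line on the numbers and attaching the following '#' comment, if any (the token-pairing logic is my own, not from the thread):

import re

lines = ['1.9 #comment 1.11* 1.5 # another comment',
         '1.23',
         '3.10.3* #commennnnnt 1.2 ']

# The "crazy" decimal pattern from the question, with a non-capturing inner group.
number = r'\d\.\d{1,2}(?:\.\d{1,2})?\*?'

pairs = []
for line in lines:
    # Split on the numbers, keeping them (capturing group), then drop blank pieces.
    tokens = [t.strip() for t in re.split(rf'({number})', line) if t and t.strip()]
    for i, tok in enumerate(tokens):
        if re.fullmatch(number, tok):
            nxt = tokens[i + 1] if i + 1 < len(tokens) else ''
            comment = nxt.lstrip('# ').rstrip() if nxt.startswith('#') else ''
            pairs.append({tok: comment})

print(pairs)
# [{'1.9': 'comment'}, {'1.11*': ''}, {'1.5': 'another comment'},
#  {'1.23': ''}, {'3.10.3*': 'commennnnnt'}, {'1.2': ''}]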

turn key="value" string into a dict

I have a string with the following format:
author="PersonsName" date="1183050420" format="1.1" version="1.2"
I want to turn it in to a Python dict, a la:
{'author': 'PersonsName', 'date': '1183050420', 'format': '1.1', 'version': '1.2'}
I have tried to do so using re.split on the string like so:
attribs = (re.split('(=?" ?)', twikiattribs))
thinking I would get a list back like:
['author', 'PersonsName', 'date', '1183050420', 'format', '1.1', 'version', '1.2']
that then I could turn into a dict, but instead I'm getting:
['author', '="', 'PersonsName', '" ', 'date', '="', '1183050420', '" ', 'format', '="', '1.1', '" ', 'version', '="', '1.2', '"', '']
So, before I follow the re.split line further: is there generally a better way to achieve what I'm trying to do? And if the solution does involve re.split, how can I write a regex that splits on any of =" (equals plus quote), " followed by a space, or just ", so that it yields a flat list of alternating keys and values?
Use re.findall():
dict(re.findall(r'(\w+)="([^"]+)"', twikiattribs))
re.findall(), when presented with a pattern with multiple capturing groups, returns a list of tuples, each nested tuple containing the captured groups. dict() happily takes that output and interprets each nested tuple as a key-value pair.
Demo:
>>> import re
>>> twikiattribs = 'author="PersonsName" date="1183050420" format="1.1" version="1.2"'
>>> re.findall(r'(\w+)="([^"]+)"', twikiattribs)
[('author', 'PersonsName'), ('date', '1183050420'), ('format', '1.1'), ('version', '1.2')]
>>> dict(re.findall(r'(\w+)="([^"]+)"', twikiattribs))
{'date': '1183050420', 'format': '1.1', 'version': '1.2', 'author': 'PersonsName'}
re.split() also behaves differently based on capturing groups; the text on which you split is included in the output if grouped. Compare the output with and without the capturing group:
>>> re.split('(=?" ?)', twikiattribs)
['author', '="', 'PersonsName', '" ', 'date', '="', '1183050420', '" ', 'format', '="', '1.1', '" ', 'version', '="', '1.2', '"', '']
>>> re.split('=?" ?', twikiattribs)
['author', 'PersonsName', 'date', '1183050420', 'format', '1.1', 'version', '1.2', '']
The re.findall() output is far easier to convert to a dictionary however.
You can also do it without re in one line:
>>> data = '''author="PersonsName" date="1183050420" format="1.1" version="1.2"'''
>>> {k:v.strip('"') for k,v in [i.split("=",1) for i in data.split(" ")]}
{'date': '1183050420', 'format': '1.1', 'version': '1.2', 'author': 'PersonsName'}
If whitespace is allowed inside the values, you can use this line instead:
>>> {k:v.strip('"') for k,v in [i.split("=",1) for i in data.split('" ')]}
The way I'd personally parse it:
import shlex
s = 'author="PersonsName" date="1183050420" format="1.1" version="1.2"'
dict(x.split('=') for x in shlex.split(s))
Out[12]:
{'author': 'PersonsName',
 'date': '1183050420',
 'format': '1.1',
 'version': '1.2'}
A non-regex list comprehension one-liner:
>>> s = 'author="PersonsName" date="1183050420" format="1.1" version="1.2"'
>>> print dict([tuple(x.split('=')) for x in s.split()])
{'date': '"1183050420"', 'format': '"1.1"', 'version': '"1.2"', 'author': '"PersonsName"'}
The problem is that you included parentheses in your regex, which turn the pattern into a capturing group and include the matched delimiters in the split result. Assign attribs like this:
attribs = (re.split('=?" ?', twikiattribs))
and it will work as expected. This does leave a blank string at the end (due to the final " in your input string), so you'll want to use attribs[:-1] when creating the dictionary.
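For completeness, a sketch of finishing that split-based route, pairing the alternating elements into a dict (assumes, as above, that the trailing empty string is dropped):

import re

twikiattribs = 'author="PersonsName" date="1183050420" format="1.1" version="1.2"'
attribs = re.split('=?" ?', twikiattribs)[:-1]   # drop the trailing empty string

# Keys sit at the even indices, values at the odd ones; zip them back together.
result = dict(zip(attribs[::2], attribs[1::2]))
print(result)
# {'author': 'PersonsName', 'date': '1183050420', 'format': '1.1', 'version': '1.2'}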
Try
>>> str = 'author="PersonsName" date="1183050420" format="1.1" version="1.2"'
>>> eval('dict(' + str.replace(" ", ",") + ')')
{'date': '1183050420', 'format': '1.1', 'version': '1.2', 'author': 'PersonsName'}
assuming as earlier the values have no space in them.
Beware of using eval() though. Bad things may happen for funny input. Don't use it on user input.
This might help some other people in cases where re.findall() doesn't.
# grab the input (a string that looks like a dict, a list, etc.)
input1 = '...'
# create a phantom assignment statement
phantom = 'variable_name = ' + input1
# execute the phantom statement
exec(phantom)
# store the phantom variable in a live one
output = variable_name
# print the stored phantom variable
print(output)
What it essentially does is prepend a variable name to your input and create that variable by executing the resulting statement.
For example, if your input string is "[[1, 2], [3, 4]]", this executes variable_name = [[1, 2], [3, 4]], after which the value behaves like any ordinary list.
It does upset linters, since the variable doesn't exist until the exec runs.

payload of an email in string format, python

I got the payload as a string instance using the get_payload() method, but I want to be able to access the payload word by word.
I tried several things like the as_string(), flatten() and get_charset() methods, but every time there is some problem.
I got my payload using the following code
import email
from email import *
f = open('mail.txt', 'r')
obj = email.parser.Parser()
fp = obj.parse(f)
payload = fp.get_payload()
Just tested your snippet with a couple of my own raw emails. Works fine...
get_payload() returns either a list or a string, so you need to check that first:
if isinstance(payload, list):
    for m in payload:
        print str(m).split()
else:
    print str(payload).split()
Edit
Per our discussion, your issue was that you were not checking is_multipart() on the fp object, which is actually a Message instance. If fp.is_multipart() == True, then get_payload() will return a list of Message instances. In your case, based on your example mail message, it was NOT multipart, and fp was actually the object you were interested in.
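For reference, a sketch that sidesteps the list-vs-string check by walking every part of the message instead (assumes the parts of interest are plain text, and uses the 'mail.txt' file from the question):

from email.parser import Parser

# Parse the raw message and walk all of its parts, multipart or not.
with open('mail.txt') as f:
    msg = Parser().parse(f)

words = []
for part in msg.walk():
    if part.get_content_type() == 'text/plain':
        words.extend(part.get_payload().split())

print(words)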
I got my payload as a string because my fp was not multipart.
If it had been multipart, get_payload() would have returned a list instead.
So now I can just use the following code:
payload = fp.get_payload()
abc = payload.split(" ")
It gives me the following output:
['good', 'day\nhttp://72.167.116.186/image/bdfedx.php?iqunin=3D41\n\n', '', '', '', '', '', '', '', '', '', '', '', 'Sun,', '18', 'Dec', '2011', '10:53:43\n_________________\n"She', 'wiped', 'him', 'dry', 'with', 'soft', 'flannel,', 'and', 'gave', 'him', 'some', 'clean,', 'dry', 'clothes,=\n', 'and', 'made', 'him', 'very', 'comfortable', 'again."', '(c)', 'Lyrica', 'wa946758\n']
Thanks to jdi :)
P.S. I couldn't post this as an answer yesterday, as there was some restriction with points.
