How to extract from nested dictionaries using list comprehension - python

I'm trying to extract some data from XML. I'm using xmltodict to load the data into a dictionary, then using list comprehensions to pull out individual parts into separate lists. I will later be plotting these using matplotlib.
XML:
<?xml version="1.0" ?>
<MYDATA>
  <SESSION ID="1234">
    <INFO>
      <BEGIN LOAD="23"/>
    </INFO>
    <TRANSACTION ID="2103645570">
      <ANSWER>Hello</ANSWER>
    </TRANSACTION>
    <TRANSACTION ID="4315547431">
      <ANSWER>This is an answer</ANSWER>
    </TRANSACTION>
  </SESSION>
  <SESSION ID="5678">
    <INFO>
      <BEGIN LOAD="28"/>
    </INFO>
    <TRANSACTION ID="4099381642">
      <ANSWER>Hello</ANSWER>
    </TRANSACTION>
    <TRANSACTION ID="1220404184">
      <ANSWER>A Different answer</ANSWER>
    </TRANSACTION>
    <TRANSACTION ID="201506542">
      <ANSWER>Yet another one</ANSWER>
    </TRANSACTION>
  </SESSION>
</MYDATA>
My code:
from collections import OrderedDict
# doc contains the xml exactly as loaded by xmltodict
doc = OrderedDict([(u'MYDATA', OrderedDict([(u'SESSION', [OrderedDict([(u'#ID', u'1234'), (u'INFO', OrderedDict([(u'BEGIN', OrderedDict([(u'#LOAD', u'23')]))])), (u'TRANSACTION', [OrderedDict([(u'#ID', u'2103645570'), (u'ANSWER', u'Hello')]), OrderedDict([(u'#ID', u'4315547431'), (u'ANSWER', u'This is an answer')])])]), OrderedDict([(u'#ID', u'5678'), (u'INFO', OrderedDict([(u'BEGIN', OrderedDict([(u'#LOAD', u'28')]))])), (u'TRANSACTION', [OrderedDict([(u'#ID', u'4099381642'), (u'ANSWER', u'Hello')]), OrderedDict([(u'#ID', u'1220404184'), (u'ANSWER', u'A Different answer')]), OrderedDict([(u'#ID', u'201506542'), (u'ANSWER', u'Yet another one')])])])])]))])
sess_ids = [i['#ID'] for i in doc['MYDATA']['SESSION']]
print sess_ids
sess_loads = [i['INFO']['BEGIN']['#LOAD'] for i in doc['MYDATA']['SESSION']]
print sess_loads
trans_ids = [[j['#ID'] for j in i['TRANSACTION']] for i in doc['MYDATA']['SESSION']]
print trans_ids
Output:
sess_ids: [u'1234', u'5678']
sess_loads: [u'23', u'28']
trans_ids: [[u'2103645570', u'4315547431'], [u'4099381642', u'1220404184', u'201506542']]
You can see that I'm able to access the ID attributes from the SESSION elements and also the LOAD attributes from the BEGIN elements.
I need to get the ID attributes from the TRANSACTION elements as a single list. Currently I'm getting a list of lists in variable trans_ids.
How can I get just a flat list of the values?
I have tried:
[j['#ID'] for j in i['TRANSACTION'] for i in doc['MYDATA']['SESSION']]
but that just repeats each ID from the second session twice, giving:
[u'4099381642',
u'4099381642',
u'1220404184',
u'1220404184',
u'201506542',
u'201506542']

Is there a reason you need to convert to a dictionary? This sort of thing is fairly straightforward working with the XML directly:
import xml.etree.ElementTree as etree
txml = etree.parse('xml string above')
txml.findall('SESSION/TRANSACTION')
[<Element TRANSACTION at 0x4064f9d8>,
<Element TRANSACTION at 0x4064fa20>,
<Element TRANSACTION at 0x4064f990>,
<Element TRANSACTION at 0x4064fa68>,
<Element TRANSACTION at 0x4064fab0>]
[x.get('ID') for x in txml.findall('SESSION/TRANSACTION')]
['2103645570', '4315547431', '4099381642', '1220404184', '201506542']
At least, it seems more compact to me.
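One note on that snippet: etree.parse() expects a filename or a file object. If the XML is already held in a Python string, as in the question, something like this should work (xml_string is an assumed variable name holding the document shown above):
import xml.etree.ElementTree as etree

# xml_string is assumed to hold the XML document from the question;
# fromstring() parses a string, while parse() expects a filename or file object.
root = etree.fromstring(xml_string)
print([x.get('ID') for x in root.findall('SESSION/TRANSACTION')])
# ['2103645570', '4315547431', '4099381642', '1220404184', '201506542']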

I have tried:
[j['#ID'] for j in i['TRANSACTION'] for i in doc['MYDATA']['SESSION']]
You nearly had it. Just swap the order of the two for..in clauses:
>>> [j['#ID'] for i in doc['MYDATA']['SESSION'] for j in i['TRANSACTION']]
[u'2103645570', u'4315547431', u'4099381642', u'1220404184', u'201506542']
To understand this, take a look at this example:
>>> a = [[1, 2, 3], [4, 5, 6]]
>>> [j for j in i for i in a]
[4, 4, 5, 5, 6, 6]
>>> [j for i in a for j in i]
[1, 2, 3, 4, 5, 6]
When there are multiple for..in parts in a list comprehension, they are evaluated from left to right. So if your loop would look like this:
for i in a:
    for j in i:
        j
Then you have to specify it in the same order, instead of from inner to outer:
[j for i in a for j in i]
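Applied to the data from the question, the comprehension is therefore equivalent to this explicit loop (a small sketch reusing doc from above):
trans_ids = []
for i in doc['MYDATA']['SESSION']:      # first for..in clause = outer loop
    for j in i['TRANSACTION']:          # second for..in clause = inner loop
        trans_ids.append(j['#ID'])
# trans_ids == [u'2103645570', u'4315547431', u'4099381642', u'1220404184', u'201506542']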

from itertools import chain
list(chain(*trans_ids))
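Equivalently, chain.from_iterable() flattens the list of lists without unpacking it as arguments:
from itertools import chain

# trans_ids is the list of lists built by the original comprehension
flat_ids = list(chain.from_iterable(trans_ids))
# [u'2103645570', u'4315547431', u'4099381642', u'1220404184', u'201506542']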

Related

Algorithm to split the values of a list into a specific format

Can you help me with my algorithm in Python to parse a list, please?
List = ['PPPP_YYYY_ZZZZ_XXXX', 'PPPP_TOTO_TATA_TITI_TUTU', 'PPPP_TOTO_MMMM_TITI_TUTU', 'PPPP_TOTO_EHEH_TITI_TUTU', 'PPPP_TOTO_EHEH_OOOO_AAAAA', 'PPPP_TOTO_EHEH_IIII_SSSS_RRRR']
In this list, I have to keep only the last two words (PARENT_CHILD). For example, for PPPP_TOTO_TATA_TITI_TUTU I only keep TITI_TUTU.
When there are duplicates, that is, when my list contains both PPPP_TOTO_TATA_TITI_TUTU and PPPP_TOTO_EHEH_TITI_TUTU, I would get TITI_TUTU twice. In that case I want to include the GRANDPARENT for each of them, giving TATA_TITI_TUTU and EHEH_TITI_TUTU.
As long as names are still duplicated, we take the level above.
But in this case, since I added the GRANDPARENT for EHEH_TITI_TUTU, I also want it added for every entry that has EHEH in its name, so instead of OOOO_AAAAA and IIII_SSSS_RRRR I would like EHEH_OOOO_AAAAA and EHEH_IIII_SSSS_RRRR.
My final list =
['ZZZZ_XXXX', 'TATA_TITI_TUTU', 'MMMM_TITI_TUTU', 'EHEH_TITI_TUTU', 'EHEH_OOOO_AAAAA', 'EHEH_IIII_SSSS_RRRR']
Thank you in advance.
Here is the code I started to write:
json_paths = ['PPPP_YYYY_ZZZZ_XXXX', 'PPPP_TOTO_TATA_TITI_TUTU',
              'PPPP_TOTO_EHEH_TITI_TUTU', 'PPPP_TOTO_MMMM_TITI_TUTU', 'PPPP_TOTO_EHEH_OOOO_AAAAA']
cols_name = []
for path in json_paths:
    acc = 2
    col_name = '_'.join(path.split('_')[-acc:])
    tmp = cols_name
    while col_name in tmp:
        acc += 1
        idx = tmp.index(col_name)
        cols_name[idx] = '_'.join(json_paths[idx].split('_')[-acc:])
        col_name = '_'.join(path.split('_')[-acc:])
        tmp = ['_'.join(item.split('_')[-acc:]) for item in json_paths].pop()
    cols_name.append(col_name)
    print(cols_name.index(col_name), col_name)
cols_name
help ... with ... algorithm
Use a dictionary as the initial container while iterating:
keys will be PARENT_CHILD strings and values will be lists of grandparents.
>>> import collections
>>> s = 'PPPP_TOTO_TATA_TITI_TUTU'
>>> d = collections.defaultdict(list)
>>> *_, grandparent, parent, child = s.rsplit('_', maxsplit=3)
>>> d['_'.join([parent, child])].append(grandparent)
>>> d
defaultdict(<class 'list'>, {'TITI_TUTU': ['TATA']})
>>> s = 'PPPP_TOTO_EHEH_TITI_TUTU'
>>> *_, grandparent, parent, child = s.rsplit('_', maxsplit=3)
>>> d['_'.join([parent, child])].append(grandparent)
>>> d
defaultdict(<class 'list'>, {'TITI_TUTU': ['TATA', 'EHEH']})
After iteration, determine whether a value holds more than one grandparent.
If it does, prepend each grandparent to its parent_child.
Additionally, find all the parent_childs that have these grandparents and prepend their grandparents too; to make this easier, build a second dictionary during iteration: {grandparent: [list_of_children], ...}.
If a parent_child has only one grandparent, use it as-is.
Instead of splitting each string, the info could be extracted with a regular expression.
pattern = r'^.*?_([^_]*)_([^_]*_[^_]*)$'
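For illustration, here is a minimal sketch of the first grouping pass using that pattern (the variable names are mine, and the duplicate-resolution steps described above still have to follow):
import re
from collections import defaultdict

pattern = r'^.*?_([^_]*)_([^_]*_[^_]*)$'   # group 1 = grandparent, group 2 = PARENT_CHILD

json_paths = ['PPPP_YYYY_ZZZZ_XXXX', 'PPPP_TOTO_TATA_TITI_TUTU',
              'PPPP_TOTO_MMMM_TITI_TUTU', 'PPPP_TOTO_EHEH_TITI_TUTU',
              'PPPP_TOTO_EHEH_OOOO_AAAAA', 'PPPP_TOTO_EHEH_IIII_SSSS_RRRR']

d = defaultdict(list)
for path in json_paths:
    grandparent, parent_child = re.match(pattern, path).groups()
    d[parent_child].append(grandparent)

print(dict(d))
# {'ZZZZ_XXXX': ['YYYY'], 'TITI_TUTU': ['TATA', 'MMMM', 'EHEH'],
#  'OOOO_AAAAA': ['EHEH'], 'SSSS_RRRR': ['IIII']}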

Breadth First Search traversal for LXML Files Python

I am working on performing a breadth-first search (BFS) traversal of an XML file.
An example traversal is shown at https://lxml.de/3.3/api.html#lxml-etre; however, I need help applying BFS based on this code.
Below is the code given in the documentation:
>>> from lxml import etree
>>> from collections import deque
>>> root = etree.XML('<root><a><b/><c/></a><d><e/></d></root>')
>>> print(etree.tostring(root, pretty_print=True, encoding='unicode'))
<root>
<a>
<b/>
<c/>
</a>
<d>
<e/>
</d>
</root>
>>> queue = deque([root])
>>> while queue:
... el = queue.popleft() # pop next element
... queue.extend(el) # append its children
... print(el.tag)
root
a
d
b
c
e
I need help adapting it for a BFS traversal. Below is the code I tried to write, but it doesn't work correctly. Can someone please help?
My Code:
from collections import deque

d = deque([root])
while d:
    el = d.pop()
    d.extend(el)
    print(el.tag)
Thank You
Your implementation of BFS is currently popping from the wrong end of your queue: pop() takes from the right, which makes the deque behave like a stack (depth-first), while popleft() takes from the left (FIFO) and gives breadth-first order. You should use popleft() rather than pop().
d = deque([root])
while d:
    el = d.popleft()
    d.extend(el)
    print(el.tag)
This can also be implemented with XPath:
>>> root = etree.XML('<root><a><b/><c><f/></c></a><d><e/></d></root>')
>>> queue = deque([root])
>>> while queue:
... el = queue.popleft()
... queue.extend(el.xpath('./child::*'))
... print(el.tag)
...
root
a
d
b
c
e
f
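For completeness, here is a self-contained sketch of the corrected traversal wrapped in a small generator (the bfs helper name is mine, not from the question):
from collections import deque
from lxml import etree

def bfs(root):
    """Yield elements of the tree in breadth-first (level) order."""
    queue = deque([root])
    while queue:
        el = queue.popleft()   # FIFO: take from the left, not the right
        queue.extend(el)       # iterating an element yields its children
        yield el

root = etree.XML('<root><a><b/><c/></a><d><e/></d></root>')
print([el.tag for el in bfs(root)])
# ['root', 'a', 'd', 'b', 'c', 'e']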

Text of xml element not reassigned to new value

I have an XML document defined as the string my_xml.
I tried to duplicate an element a couple of times and change some of the values.
my_xml = """<root><foo><bar>spamm.xml</bar></foo></root>"""
from xml.etree import ElementTree as et

tree = et.fromstring(my_xml)
el = list(tree)[0].copy()
tree.insert(0, el)
tree.insert(0, el)

cnt = 0
elements = [elem for elem in tree.iter() if elem.text is not None]
for elem in elements:
    if cnt != 0:
        print elem.text[:4] + str(cnt) + elem.text[5:]
        elem.text = elem.text[:4] + str(cnt) + elem.text[5:]  # strange behaviour
    cnt += 1
print et.tostring(tree)
Why does the line elem.text = elem.text[:4] + str(cnt) + elem.text[5:] not reassign elem.text to the new value?
Expected output
<root>
<foo><bar>spamm.xml</bar></foo>
<foo><bar>spamm1.xml</bar></foo>
<foo><bar>spamm2.xml</bar></foo>
</root>
Actual output
<root>
<foo><bar>spam2.xml</bar></foo>
<foo><bar>spam2.xml</bar></foo>
<foo><bar>spam2.xml</bar></foo>
</root>
The problem is in your copy phase:
you should make a separate copy for each el, otherwise both inserted els share the same reference
you should use copy.deepcopy(), because a shallow copy doesn't cut it here
I use Python 3, where the copy() method doesn't exist, so I used the copy module: deepcopy(), applied to each inserted item (otherwise you're copying only once), to make sure all references are duplicated.
Here is the part of the code I changed (better with a loop):
import copy

tree = et.fromstring(my_xml)
for _ in range(2):
    el = copy.deepcopy(list(tree)[0])
    tree.insert(0, el)
result:
<root><foo><bar>spam1.xml</bar></foo><foo><bar>spam1.xml</bar></foo><foo><bar>spam2.xml</bar></foo></root>
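To see why the original version ends up rewriting the same text over and over: both insert() calls add the same object, and the shallow copy still shares the original <bar> child, so every path in the tree leads to one single <bar>. A quick check (a sketch; copy.copy() is a roughly equivalent shallow copy to the .copy() call in the question):
from xml.etree import ElementTree as et
import copy

my_xml = """<root><foo><bar>spamm.xml</bar></foo></root>"""
tree = et.fromstring(my_xml)

el = copy.copy(list(tree)[0])   # shallow copy: shares the original <bar> child
tree.insert(0, el)
tree.insert(0, el)              # the very same object inserted a second time

bars = tree.findall('.//bar')
print(len(bars))                     # 3 <bar> paths in the tree...
print(len({id(b) for b in bars}))    # ...but only 1 distinct object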

Refresh a list content with another list in Python

How would I extend the content of a given list with another given list without using the method .extend()? I imagine that I could use something with dictionaries.
Code
>>> tags =['N','O','S','Cl']
>>> itags =[1,2,4,3]
>>> anew =['N','H']
>>> inew =[2,5]
I need a function which returns the refreshed lists
tags =['N','O','S','Cl','H']
itags =[3,2,4,3,5]
When an element is already in the list, the number in the other list is added to it. If I use the extend() method, the element N will appear in the tags list twice:
>>> tags.extend(anew)
>>> itags.extend(inew)
>>> print tags, itags
['N', 'O', 'S', 'Cl', 'N', 'H'] [1, 2, 4, 3, 2, 5]
You probably want a Counter for this.
from collections import Counter
tags = Counter({"N":1, "O":2, "S": 4, "Cl":3})
new = Counter({"N": 2, "H": 5})
tags = tags + new
print tags
output:
Counter({'H': 5, 'S': 4, 'Cl': 3, 'N': 3, 'O': 2})
If the order of elements matters, I'd use collections.Counter like so:
from collections import Counter
tags = ['N','O','S','Cl']
itags = [1,2,4,3]
new = ['N','H']
inew = [2,5]
cnt = Counter(dict(zip(tags, itags))) + Counter(dict(zip(new, inew)))
out = tags + [el for el in new if el not in tags]
iout = [cnt[el] for el in out]
print(out)
print(iout)
If the order does not matter, there is a simpler way to obtain out and iout:
out = cnt.keys()
iout = cnt.values()
If you don't have to use a pair of lists, then working with Counter directly is a natural fit for your problem.
If you need to maintain the order, you may want to use an OrderedDict instead of a Counter:
from collections import OrderedDict
tags = ['N','O','S','Cl']
itags = [1,2,4,3]
new = ['N','H']
inew = [2,5]
od = OrderedDict(zip(tags, itags))
for x, i in zip(new, inew):
    od[x] = od.setdefault(x, 0) + i
print od.keys()
print od.values()
On Python 3.x, use list(od.keys()) and list(od.values()).
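Since the question asks for a function that returns the refreshed lists, the same OrderedDict approach can be wrapped up like this (a small sketch; the function name is mine):
from collections import OrderedDict

def refresh(tags, itags, anew, inew):
    """Merge (anew, inew) into (tags, itags), summing counts for repeated elements."""
    od = OrderedDict(zip(tags, itags))
    for x, i in zip(anew, inew):
        od[x] = od.setdefault(x, 0) + i
    return list(od.keys()), list(od.values())

tags, itags = refresh(['N', 'O', 'S', 'Cl'], [1, 2, 4, 3], ['N', 'H'], [2, 5])
print(tags)   # ['N', 'O', 'S', 'Cl', 'H']
print(itags)  # [3, 2, 4, 3, 5]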

Python NLTK - counting occurrence of word in brown corpora based on returning top results by tag

I'm trying to return the top occurring values from a corpus for specific tags. I can get the tags and the words themselves to return fine; however, I can't get the count to appear in the output.
import itertools
import collections
import nltk
from nltk.corpus import brown

words = brown.words()

def findtags(tag_prefix, tagged_text):
    cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text
                                   if tag.startswith(tag_prefix))
    return dict((tag, cfd[tag].keys()[:5]) for tag in cfd.conditions())

tagdictNNS = findtags('NNS', nltk.corpus.brown.tagged_words())
This works, and the following prints it fine:
for tag in sorted(tagdictNNS):
    print tag, tagdictNNS[tag]
I have managed to return the count of every NNS-based word using this:
pluralLists = tagdictNNS.values()
pluralList = list(itertools.chain(*pluralLists))
for s in pluralList:
    sincident = words.count(s)
    print s
    print sincident
That returns everything.
Is there a better way of inserting the occurrence count into the dict tagdictNNS[tag]?
edit 1:
pluralLists = tagdictNNS.values()[:5]
pluralList = list(itertools.chain(*pluralLists))
This returns them in size order from the for s loop, but it's still not the right way to do it.
edit 2: updated dictionaries so they actually search for NNS plurals.
I might not understand, but given your tagdictNNS:
>>> new = {}
>>> for k, v in tagdictNNS.items():
...     new[k] = len(tagdictNNS[k])
...
>>> new
{'NNS$-TL-HL': 1, 'NNS-HL': 5, 'NNS$-HL': 4, 'NNS-TL': 5, 'NNS-TL-HL': 5, 'NNS+MD': 2, 'NNS$-NC': 1, 'NNS-TL-NC': 1, 'NNS$-TL': 5, 'NNS': 5, 'NNS$': 5, 'NNS-NC': 5}
Then you can do something like:
>>> from operator import itemgetter
>>> sorted(new.items(), key=itemgetter(1), reverse=True)[:2]
[('NNS-HL', 5), ('NNS-TL', 5)]
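If your NLTK version provides FreqDist.most_common() (it is available in NLTK 3, where FreqDist is a Counter subclass), another option is to keep the counts right inside findtags(), so each tag maps to (word, count) pairs instead of bare words. A sketch adapting the question's function (the name findtags_with_counts is mine):
import nltk
from nltk.corpus import brown

def findtags_with_counts(tag_prefix, tagged_text, n=5):
    # Same ConditionalFreqDist as in the question, but keep (word, count)
    # pairs for the n most frequent words of each tag.
    cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text
                                   if tag.startswith(tag_prefix))
    return dict((tag, cfd[tag].most_common(n)) for tag in cfd.conditions())

tagdictNNS = findtags_with_counts('NNS', brown.tagged_words())
for tag in sorted(tagdictNNS):
    print(tag, tagdictNNS[tag])   # each value is a list of (word, count) pairs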
