Python: Extract info from xml to dictionary - python

I need to extract information from an xml file, isolate it from the xml tags before and after, store the information in a dictionary, then loop through the dictionary to print a list. I am an absolute beginner so I'd like to keep it as simple as possible and I apologize if how I've described what I'd like to do doesn't make much sense.
Here is what I have so far:
for line in open("/people.xml"):
    if "name" in line:
        print(line)
    if "age" in line:
        print(line)
Current Output:
<name>John</name>
<age>14</age>
<name>Kevin</name>
<age>10</age>
<name>Billy</name>
<age>12</age>
Desired Output
Name Age
John 14
Kevin 10
Billy 12
Edit: Using the code below, I can get this output:
{'Billy': '12', 'John': '14', 'Kevin': '10'}
Does anyone know how to get from this to a chart with headers like my desired output?

Try xmldict (converts XML to Python dictionaries, and vice versa):
>>> xmldict.xml_to_dict('''
... <root>
... <persons>
... <person>
... <name first="foo" last="bar" />
... </person>
... <person>
... <name first="baz" last="bar" />
... </person>
... </persons>
... </root>
... ''')
{'root': {'persons': {'person': [{'name': {'last': 'bar', 'first': 'foo'}}, {'name': {'last': 'bar', 'first': 'baz'}}]}}}
# Converting dictionary to xml
>>> xmldict.dict_to_xml({'root': {'persons': {'person': [{'name': {'last': 'bar', 'first': 'foo'}}, {'name': {'last': 'bar', 'first': 'baz'}}]}}})
'<root><persons><person><name><last>bar</last><first>foo</first></name></person><person><name><last>bar</last><first>baz</first></name></person></persons></root>'
Or try xmlmapper (produces a list of Python dictionaries with parent-child relationships):
>>> myxml='''<?xml version='1.0' encoding='us-ascii'?>
<slideshow title="Sample Slide Show" date="2012-12-31" author="Yours Truly" >
<slide type="all">
<title>Overview</title>
<item>Why
<em>WonderWidgets</em>
are great
</item>
<item/>
<item>Who
<em>buys</em>
WonderWidgets1
</item>
</slide>
</slideshow>'''
>>> x=xml_to_dict(myxml)
>>> for s in x:
...     print(s)
...
{'text': '', 'tail': None, 'tag': 'slideshow', 'xmlinfo': {'ownid': 1, 'parentid': 0}, 'xmlattb': {'date': '2012-12-31', 'author': 'Yours Truly', 'title': 'Sample Slide Show'}}
{'text': '', 'tail': '', 'tag': 'slide', 'xmlinfo': {'ownid': 2, 'parentid': 1}, 'xmlattb': {'type': 'all'}}
{'text': 'Overview', 'tail': '', 'tag': 'title', 'xmlinfo': {'ownid': 3, 'parentid': 2}, 'xmlattb': {}}
{'text': 'Why', 'tail': '', 'tag': 'item', 'xmlinfo': {'ownid': 4, 'parentid': 2}, 'xmlattb': {}}
{'text': 'WonderWidgets', 'tail': 'are great', 'tag': 'em', 'xmlinfo': {'ownid': 5, 'parentid': 4}, 'xmlattb': {}}
{'text': None, 'tail': '', 'tag': 'item', 'xmlinfo': {'ownid': 6, 'parentid': 2}, 'xmlattb': {}}
{'text': 'Who', 'tail': '', 'tag': 'item', 'xmlinfo': {'ownid': 7, 'parentid': 2}, 'xmlattb': {}}
{'text': 'buys', 'tail': 'WonderWidgets1', 'tag': 'em', 'xmlinfo': {'ownid': 8, 'parentid': 7}, 'xmlattb': {}}
The above code returns a generator. When you iterate over it, each item is a dict with keys such as tag, text, xmlattb and tail, plus additional information in xmlinfo. The root element has a parentid of 0.

Use an XML parser for this. For example,
import xml.etree.ElementTree as ET
doc = ET.parse('people.xml')
names = [name.text for name in doc.findall('.//name')]
ages = [age.text for age in doc.findall('.//age')]
people = dict(zip(names,ages))
print(people)
# {'Billy': '12', 'John': '14', 'Kevin': '10'}
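To get from there to the table in the question, one option is to keep the pairs in a list rather than a dict and print a header line first. This is only a minimal sketch: the <people>/<person> wrapper elements and the format_table helper are assumptions, since the real layout of people.xml isn't shown.

```python
import xml.etree.ElementTree as ET

# Inline sample standing in for people.xml; the wrapper elements are assumed.
XML = """<people>
  <person><name>John</name><age>14</age></person>
  <person><name>Kevin</name><age>10</age></person>
  <person><name>Billy</name><age>12</age></person>
</people>"""


def format_table(xml_text):
    """Return a 'Name Age' table, one person per line, in document order."""
    root = ET.fromstring(xml_text)
    # A list of tuples keeps document order and tolerates duplicate names.
    rows = [(p.findtext("name"), p.findtext("age")) for p in root.iter("person")]
    lines = ["Name Age"]
    lines += [f"{name} {age}" for name, age in rows]
    return "\n".join(lines)


print(format_table(XML))
```

If the file is flat (no <person> wrapper), zipping the two findall() lists as in the answer above would work the same way.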

It seems to me that this is an exercise in learning how to parse this XML manually rather than simply pulling a library out of the bag to do it for you. If I am wrong, I suggest watching the udacity video by Steve Huffman that can be found here: http://www.udacity.com/view#Course/cs253/CourseRev/apr2012/Unit/362001/Nugget/365002. He explains how to use the minidom module to parse lightweight xml files such as these.
Now, the first point I want to make in my answer is that you don't want to create a Python dictionary to print all of these values. A Python dictionary is simply a set of keys that correspond to values. Before Python 3.7, dictionaries did not guarantee any ordering, so traversal in the order the entries appeared in the file was a pain, and on any version two people who share a name would collapse into a single key. You are trying to print out all of the names together with their corresponding ages, so a data structure like a list of tuples would probably be better suited to collating your data.
It seems like the structure of your xml file is such that each name tag is succeeded by an age tag that corresponds to it. There also seems to only be a single name tag per line. This makes matters fairly simple. I'm not going to write the most efficient or universal solution to this problem, but instead I will try to make the code as simple to understand as I can.
So let's first create a list to store the data:
a_list = []
Now open your file, and initialize a couple of variables to hold each name and age:
with open("/people.xml") as f:
    name, age = None, None  # initialize a name and an age variable to be used during traversal
    for line in f:
        name = extract_name(line, name)  # This function will be defined later.
        age = extract_age(line)  # So will this one.
        if age:  # once age is defined, we can add a person to our list
            a_list.append((name, age))
            name, age = None, None  # reset; otherwise keep reading lines until age is defined
Now, for each line in the file, we want to determine whether it contains a name. If it does, we want to extract it. Let's create a function to do this:
def extract_name(a_line, name):
    # We pass in the line as well as the name value that we defined before beginning our traversal.
    if name:
        # If the name is already set, keep its current value (it is cleared upon encountering the corresponding age).
        return name
    if "<name>" not in a_line:
        # No "<name>" in this line, so there is nothing to extract.
        return None
    name_pos = a_line.find("<name>") + 6  # 6 == len("<name>")
    end_pos = a_line.find("</name>")
    return a_line[name_pos:end_pos]
Now, we must create a function to parse the line for a user's age. We can do this in a similar way to the previous function, but we know that once we have an age, it will be added into the list immediately. As such, we never need to concern ourselves with age's previous value. The function can therefore look like this:
def extract_age(a_line):
    if "<age>" not in a_line:  # no "<age>" in this line
        return None
    age_pos = a_line.find("<age>") + 5  # else extract age from line and return it
    end_pos = a_line.find("</age>")
    return a_line[age_pos:end_pos]
Finally, you want to print the list. You might do it as follows:
for item in a_list:
    print('\t'.join(item))
Hope this helped. I haven't tested out my code, so it might still be slightly buggy. The concepts are there, though. :)
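For anyone who wants to try the manual approach end to end, here is a condensed sketch of the same idea. It is not the exact code above: the two extract functions are folded into one hypothetical extract(line, tag) helper, and the input is an in-memory list of lines (an assumption) instead of /people.xml.

```python
def extract(line, tag):
    """Return the text between <tag> and </tag> on this line, or None."""
    open_tag, close_tag = f"<{tag}>", f"</{tag}>"
    start = line.find(open_tag)
    end = line.find(close_tag)
    if start == -1 or end == -1:
        return None
    return line[start + len(open_tag):end]


def parse_people(lines):
    """Pair each name with the age that follows it, keeping file order."""
    people = []
    name = None
    for line in lines:
        if name is None:
            name = extract(line, "name")
        age = extract(line, "age")
        if age is not None:
            people.append((name, age))
            name = None  # reset for the next person
    return people


sample_lines = [
    "<name>John</name>",
    "<age>14</age>",
    "<name>Kevin</name>",
    "<age>10</age>",
    "<name>Billy</name>",
    "<age>12</age>",
]

people = parse_people(sample_lines)
print("Name\tAge")
for person in people:
    print("\t".join(person))
```

Like the original, this assumes one tag per line with each name followed by its age; anything less regular is a job for a real XML parser.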

Here's another way using lxml library:
from lxml import objectify
def xml_to_dict(xml_str):
    """ Convert xml to dict, using lxml v3.4.2 xml processing library, see http://lxml.de/ """
    def xml_to_dict_recursion(xml_object):
        dict_object = xml_object.__dict__
        if not dict_object:  # if empty dict returned
            return xml_object
        for key, value in dict_object.items():
            dict_object[key] = xml_to_dict_recursion(value)
        return dict_object
    return xml_to_dict_recursion(objectify.fromstring(xml_str))
xml_string = """<?xml version="1.0" encoding="UTF-8"?><Response><NewOrderResp>
<IndustryType>Test</IndustryType><SomeData><SomeNestedData1>1234</SomeNestedData1>
<SomeNestedData2>3455</SomeNestedData2></SomeData></NewOrderResp></Response>"""
print(xml_to_dict(xml_string))
To preserve the parent node, use this instead:
def xml_to_dict(xml_str):
    """ Convert xml to dict, using lxml v3.4.2 xml processing library, see http://lxml.de/ """
    def xml_to_dict_recursion(xml_object):
        dict_object = xml_object.__dict__
        if not dict_object:  # if empty dict returned
            return xml_object
        for key, value in dict_object.items():
            dict_object[key] = xml_to_dict_recursion(value)
        return dict_object
    xml_obj = objectify.fromstring(xml_str)
    return {xml_obj.tag: xml_to_dict_recursion(xml_obj)}
And if you want to only return a subtree and convert it to dict, you can use Element.find() :
xml_obj.find('.//') # lxml.objectify.ObjectifiedElement instance
See lxml documentation.
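If installing lxml is not an option, a comparable recursion can be sketched with the standard library's xml.etree.ElementTree. This is a swapped-in technique, not the lxml approach above, and the element_to_dict helper is a hypothetical name. It also assumes sibling tags are unique; repeated siblings would overwrite each other.

```python
import xml.etree.ElementTree as ET


def element_to_dict(element):
    """Recursively turn an element into nested dicts; leaves become their text."""
    children = list(element)
    if not children:
        return element.text
    # Assumption: sibling tags are unique, so later duplicates would overwrite.
    return {child.tag: element_to_dict(child) for child in children}


def xml_to_dict(xml_str):
    """Convert XML text to a dict keyed by the root tag."""
    root = ET.fromstring(xml_str)
    return {root.tag: element_to_dict(root)}


sample = (
    "<Response><NewOrderResp>"
    "<IndustryType>Test</IndustryType>"
    "<SomeData><SomeNestedData1>1234</SomeNestedData1>"
    "<SomeNestedData2>3455</SomeNestedData2></SomeData>"
    "</NewOrderResp></Response>"
)

result = xml_to_dict(sample)
print(result)
```

Leaf values come back as plain strings here, whereas lxml.objectify returns typed element objects.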

Related

Dictionary object has no attribute split

My data file looks like this:
{'data': 'xyz', 'code': '<:c:605445> **[Code](https://traindata/35.6547,56475', 'time': '2021-12-30T09:56:53.547', 'value': 'True', 'stats': '96/23', 'dupe_id': 'S<:c-74.18'}
I'm trying to print this line:
35.6547,56475
Here is my code:
data = "above mentioned data"
for s in data.values():
    print(s)
while data != "stop":
    if data == "quit":
        os.system("disconnect")
    else:
        x, y = s.split(',', 1)
The output is:
{'data': 'xyz', 'code': '<:c:605445> **[Code](https://traindata/35.6547,56475', 'time': '2021-12-30T09:56:53.547', 'value': 'True', 'stats': '95/23', 'dupe_id': 'S<:c-74.18'}
x, y = s.split(',', 1)
AttributeError: 'dict' object has no attribute 'split'
I've tried converting it into tuple, list but I'm getting the same error. The input in x,y should be the above mentioned expected output (35.6547,56475).
Any help will be highly appreciated.
You can do it like this:
x,y = d['code'].split('/')[-1].split(',')
That means you need to access the dictionary by one of its keys; here you want 'code'. You retrieve the string '<:c:605445> **[Code](https://traindata/35.6547,56475', which you can now either parse via regex or simply split at the '/' and take the last element using [-1]. Then you split the remaining numbers, which are what you are actually looking for, and assign them to x and y respectively.
Of course, you might want to validate your incoming data by catching the KeyError you mentioned in the comments:
try:
    x, y = d['code'].split('/')[-1].split(',')
except KeyError:
    print(f'Data invalid. Key "code" not found. Got: {d} instead')
Another option is a simple regex on the code element. Anchored at the end of the string, it matches a run of digits, a literal '.', a run of digits, a ',', and a final run of digits.
import re
d = {'data': 'xyz', 'code': ':c: **[Code](https://traindata/35.6547,56475', 'time': '2021-12-30T09:56:53.547', 'value': 'True', 'stats': '96/23', 'dupe_id': 'S<:c-74.18'}
print(re.findall(r'\d+\.\d+,\d+$', d['code'])[0])
You can only split text, not a dictionary.
First get the text that you want to split:
d['code']
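Combining the answers above, a small helper can pull the value out of the dictionary first and only then parse it. This is just a sketch; extract_coords is a hypothetical name, and it returns None instead of raising when the key is missing or the string does not end in the expected pattern.

```python
import re


def extract_coords(record):
    """Return (x, y) as strings from record['code'], or None if unavailable."""
    code = record.get("code")
    if not isinstance(code, str):
        return None
    # digits, a literal dot, digits, a comma, digits, anchored at the end
    match = re.search(r"(\d+\.\d+),(\d+)$", code)
    if match is None:
        return None
    return match.group(1), match.group(2)


d = {
    "data": "xyz",
    "code": "<:c:605445> **[Code](https://traindata/35.6547,56475",
    "time": "2021-12-30T09:56:53.547",
    "value": "True",
    "stats": "96/23",
    "dupe_id": "S<:c-74.18",
}

coords = extract_coords(d)
print(coords)
```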

Exclude part of string with Regex in python with web scraping

I'm trying to scrape some data from an e-commerce website for a personal project. I'm trying to build a nested list of strings from the html but am having an issue with one part of the html.
Each list item appears as the following:
<div class="impressions" data-impressions=\'{"id":"01920","name":"Sleepy","price":12.95,"brand":"Lush","category":"Bubble Bar","variant":"7 oz.","quantity":1,"list":"/bath/bubble-bars/sleepy/9999901920.html","dimension11":"","dimension12":"Naked,Self Preserving,Vegan","dimension13":1,"dimension14":1,"dimension15":true}\'></div>
What I have now is a regex that turns all the items in the data-impressions tag like so and splits them at the comma:
list_return = [re.findall('\{([^{]+[^}\'></div>])', i) for i in bathshower_impressions]
list_return = [re.split(',', list_return[i][0]) for i in range(0, len(list_return))]
Which gives me a list of lists of lists for each thing which will become a key:value pair in a dictionary. For the example above here is what the second level item would be:
[['"id"', '"01920"'],
['"name"', '"Sleepy"'],
['"price"', '12.95'],
['"brand"', '"Lush"'],
['"category"', '"Bubble Bar"'],
['"variant"', '"7 oz."'],
['"quantity"', '1'],
['"list"', '"/bath/bubble-bars/sleepy/9999901920.html"'],
['"dimension11"', '""'],
['"dimension12"', '"Naked'],
['Self Preserving'],
['Vegan"'],
['"dimension13"', '1'],
['"dimension14"', '1'],
['"dimension15"', 'true']]
My problem is with dimension12, I can't figure out how to exclude that dimension from splitting at the comma, so that that list would appear as:
['"dimension12"', '"Naked,Self Preserving,Vegan"']
Any help is appreciated, thanks.
I'd like to suggest a slightly different approach. That attribute value looks like JSON, so why not use the json module? That way you get a ready-made data structure that you can modify further.
import json
from bs4 import BeautifulSoup
html_list = [
"""<div class="impressions" data-impressions=\'{"id":"01920","name":"Sleepy","price":12.95,"brand":"Lush","category":"Bubble Bar","variant":"7 oz.","quantity":1,"list":"/bath/bubble-bars/sleepy/9999901920.html","dimension11":"","dimension12":"Naked,Self Preserving,Vegan","dimension13":1,"dimension14":1,"dimension15":true}\'></div>""",
]
data_structures = []
for html_item in html_list:
    soup = BeautifulSoup(html_item, "html.parser").find("div", {"class": "impressions"})
    data_structures.append(json.loads(soup["data-impressions"]))
print(data_structures)
This outputs a list of dictionaries:
[{'id': '01920', 'name': 'Sleepy', 'price': 12.95, 'brand': 'Lush', 'category': 'Bubble Bar', 'variant': '7 oz.', 'quantity': 1, 'list': '/bath/bubble-bars/sleepy/9999901920.html', 'dimension11': '', 'dimension12': 'Naked,Self Preserving,Vegan', 'dimension13': 1, 'dimension14': 1, 'dimension15': True}]
To access the desired key, just do this:
for data_item in data_structures:
    print(data_item["dimension12"])
Prints: Naked,Self Preserving,Vegan
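If BeautifulSoup is not available, the same idea (parse the HTML, then hand the attribute to json) can be sketched with the standard library's html.parser. This is a swapped-in technique rather than the answer's method; the ImpressionParser class is a hypothetical name and the HTML sample is shortened for illustration.

```python
import json
from html.parser import HTMLParser


class ImpressionParser(HTMLParser):
    """Collect the decoded data-impressions JSON from every matching <div>."""

    def __init__(self):
        super().__init__()
        self.items = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "div" and "data-impressions" in attr_map:
            self.items.append(json.loads(attr_map["data-impressions"]))


# Shortened sample; the real attribute carries more keys.
html = (
    '<div class="impressions" data-impressions=\'{"id":"01920","name":"Sleepy",'
    '"price":12.95,"dimension12":"Naked,Self Preserving,Vegan"}\'></div>'
)

parser = ImpressionParser()
parser.feed(html)
print(parser.items[0]["dimension12"])
```

Because the JSON decoder understands quoting, the commas inside dimension12 never get split.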

How to create/read nested dictionary from file?

Here is the text file1 content
name = test1,description=None,releaseDate="2020-02-27"
name = test2,description=None,releaseDate="2020-02-28"
name = test3,description=None,releaseDate="2020-02-29"
I want a nested dictionary like this. How to create this?
{ 'test1': {'description':'None','releaseDate':'2020-02-27'},
'test2': {'description':'None','releaseDate':'2020-02-28'},
'test3': {'description':'None','releaseDate':'2020-02-29'}}
After this I want to append these values in the following line of code through "for" loop for a list of projects.
Example: For a project="IJTP2" want to go through each name in the dictionary like below
project.create(name="test1", project="IJTP2", description=None, releaseDate="2020-02-27")
project.create(name="test2", project="IJTP2", description=None, releaseDate="2020-02-28")
project.create(name="test3", project="IJTP2", description=None, releaseDate="2020-02-29")
Now to the next project:
List of projects is stored in another file as below
IJTP1
IJTP2
IJTP3
IJTP4
I just started working on Python and have never worked on the nested dictionaries.
I assume that:
each file line has comma-separated columns
each column has only one = and key on its left, value on its right
only first column is special(name)
Of course, as @Alex Hall mentioned, I recommend JSON or CSV, too.
Anyway, I wrote code for your case.
d = {}
with open('test-200229.txt') as f:
    for line in f:
        (_, name), *rest = (
            tuple(value.strip() for value in column.split('='))
            for column in line.split(',')
        )
        d[name] = dict(rest)
print(d)
output:
{'test1': {'description': 'None', 'releaseDate': '"2020-02-27"'}, 'test2': {'description': 'None', 'releaseDate': '"2020-02-28"'}, 'test3': {'description': 'None', 'releaseDate': '"2020-02-29"'}}
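Note that the output above still carries the double quotes around each date. A variation that strips them and then builds the keyword arguments for each project might look like the sketch below. The parse_releases helper is a hypothetical name, the lines are held in a list instead of a file, and the actual project.create call is left out because that API isn't shown in the question.

```python
def parse_releases(lines):
    """Build {name: {field: value}} from 'key = value,key=value' lines."""
    releases = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        pairs = (column.split("=", 1) for column in line.split(","))
        # Strip whitespace and any surrounding double quotes from the values.
        fields = {key.strip(): value.strip().strip('"') for key, value in pairs}
        name = fields.pop("name")
        releases[name] = fields
    return releases


lines = [
    'name = test1,description=None,releaseDate="2020-02-27"',
    'name = test2,description=None,releaseDate="2020-02-28"',
    'name = test3,description=None,releaseDate="2020-02-29"',
]
projects = ["IJTP1", "IJTP2"]  # would normally be read from the second file

releases = parse_releases(lines)

# One kwargs dict per (project, release) pair, ready to be passed as **kwargs.
calls = [
    dict(name=name, project=project, **fields)
    for project in projects
    for name, fields in releases.items()
]
for kwargs in calls:
    print(kwargs)
```

The description stays the string 'None' here, matching the desired dictionary in the question; converting it to a real None would be an extra step.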

Export Dictionary with Both Simple Values and Nested Dictionaries to CSV

I'm accessing a third-party API that returns a dictionary that contains both simple values and nested (embedded?) dictionaries. I need to convert this to a CSV file, but I need help extracting and exporting specific values from the nested dictionaries.
Here's a simplified example of what I'm getting back:
accounts = {
    'Id': '0131232',
    'AccountName': 'CompanyX',
    'Active': False,
    'LastModifiedBy': {'type': 'User', 'Id': '', 'Name': 'Joe Smith'}
},
{
    'Id': '987654',
    'AccountName': 'CompanyY',
    'Active': True,
    'LastModifiedBy': {'type': 'User', 'Id': '', 'Name': 'Mary Johnson'}
}
I'm trying to export this to a CSV file with the following code:
with open('output.csv', 'w') as f:
    dwriter = csv.DictWriter(f, accounts[0].keys())
    dwriter.writeheader()
    dwriter.writerows(accounts)
    f.close()
What I want in the CSV file is the following:
Id,AccountName,Active,LastModifiedBy
0131232,CompanyX,False,Joe Smith
987654,CompanyY,True,Mary Johnson
What I'm getting with my code above is the following:
Id,AccountName,Active,LastModifiedBy
0131232,CompanyX,False,"{'type': 'User', 'Id': '', 'Name': 'Joe Smith'}"
987654,CompanyY,True,"{'type': 'User', 'Id': '', 'Name': 'Mary Johnson'}"
Obviously I need to extract the key-value pair I want from the nested dictionary and assign that value to the higher-level dictionary. My question is how do I do that while still handling the simple values as is?
It seems like this could be done with dictionary comprehension, but I'm not sure I can do a conditional with that.
Alternatively I could go through each record, check each value to see if it's a dictionary, and write out the values I want to a new dictionary, but that feels a little too heavy.
Full disclosure: I'm new to Python, so apologies if I'm missing something obvious.
Thanks!
- Chris
If you don't need accounts for anything else you can do:
for account in accounts:
    account['LastModifiedBy'] = account['LastModifiedBy']['Name']
Otherwise, .copy() each account dictionary first and do the same on the copies.
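A non-mutating version of that idea could look like the sketch below. The flatten helper is a hypothetical name, and the CSV is written to an in-memory buffer (an assumption, to keep the example self-contained) rather than to output.csv.

```python
import csv
import io

accounts = [
    {
        "Id": "0131232",
        "AccountName": "CompanyX",
        "Active": False,
        "LastModifiedBy": {"type": "User", "Id": "", "Name": "Joe Smith"},
    },
    {
        "Id": "987654",
        "AccountName": "CompanyY",
        "Active": True,
        "LastModifiedBy": {"type": "User", "Id": "", "Name": "Mary Johnson"},
    },
]


def flatten(account):
    """Return a copy with the nested LastModifiedBy dict replaced by its Name."""
    flat = dict(account)  # shallow copy, so the original record is untouched
    flat["LastModifiedBy"] = account["LastModifiedBy"]["Name"]
    return flat


rows = [flatten(account) for account in accounts]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

When writing to a real file, opening it with newline='' is the usual recommendation for the csv module.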

Making a raw dictionary sane

I have a dict brought in from a csv: {'0ca6f08e': '1111', '89b2e9ab': '2222', '0c2e5b6d': '3333', '07287d73': '4444'}
and what is needed is something like:
{'id': '0ca6f08e', 'thing': '1111'}, {'id': '89b2e9ab', 'thing': '2222'}, {'id': '0c2e5b6d', 'thing': '3333'}
This is to bring order to the dict so I can operate later with sanity. I'm not clear on how to take a csv like:
0ca6f08e,1111
89b2e9ab,2222
0c2e5b6d,3333
and inject the keys for sanity and later use.
We can use a list comprehension to solve this:
>>> original = {'0ca6f08e': '1111', '89b2e9ab': '2222', '0c2e5b6d': '3333', '07287d73': '4444'}
>>> parsed = [{'id': key, 'thing': value} for key, value in original.items()]
>>> parsed
[{'thing': '1111', 'id': '0ca6f08e'}, {'thing': '2222', 'id': '89b2e9ab'}, {'thing': '3333', 'id': '0c2e5b6d'}, {'thing': '4444', 'id': '07287d73'}]
We're essentially grabbing each key and corresponding value in the original dict, and converting it into a list of dicts.
Note that it may be cleaner to just use the items method of a dict to grab the key and the value directly, and loop over that:
>>> list(original.items())
[('0ca6f08e', '1111'), ('89b2e9ab', '2222'), ('0c2e5b6d', '3333'), ('07287d73', '4444')]
If you are reading the file for the first time, you can fix the results like this:
with open('foo.csv') as f:
    # split each line once, then unpack the two fields
    lines = [{'id': a, 'thing': b} for a, b in (line.strip().split(',') for line in f)]
If you want to fix the results from the dictionary:
lines = [{'id': a, 'thing': b} for a, b in big_dict.items()]
You can use the csv module's DictReader to read the csv file.
Here is an example:
import csv
with open('example.csv') as csvfile:
    for csv_dict in csv.DictReader(csvfile, fieldnames=["id", "thing"]):
        # Now you can use the csv_dict as a normal dictionary
        print(csv_dict["id"])
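To see the DictReader approach produce the list of dicts asked for, here is a small self-contained sketch. It reads from an in-memory string instead of example.csv (an assumption, so it runs without a file), and read_rows is a hypothetical helper name.

```python
import csv
import io

csv_text = "0ca6f08e,1111\n89b2e9ab,2222\n0c2e5b6d,3333\n"


def read_rows(handle):
    """Read headerless two-column CSV into a list of {'id', 'thing'} dicts."""
    reader = csv.DictReader(handle, fieldnames=["id", "thing"])
    return [dict(row) for row in reader]


rows = read_rows(io.StringIO(csv_text))
for row in rows:
    print(row["id"], row["thing"])
```

Passing fieldnames tells DictReader that the file has no header row, so the first line is treated as data.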
