How to parse XML data into a list in Python

I'm trying to take an API call response and parse the XML data into a list, but I am struggling with the multiple child/parent relationships.
My hope is to export a new XML file that would line up each job ID and tracking number, which I could then import into Excel.
Here is what I have so far.
The source XML file looks like this:
<project>
  <name>October 2019</name>
  <jobs>
    <job>
      <id>5654206</id>
      <tracking>
        <mailPiece>
          <barCode>00270200802095682022</barCode>
          <address>Accounts Payable,1661 Knott Ave,La Mirada,CA,90638</address>
          <status>En Route</status>
          <dateTime>2019-10-12 00:04:21.0</dateTime>
          <statusLocation>PONTIAC,MI</statusLocation>
        </mailPiece>
      </tracking>...
Code:
import xml.etree.ElementTree as ET
from xml.etree.ElementTree import Element, SubElement

tree = ET.parse('mailings.xml')
root = tree.getroot()
print(root.tag)

for x in root[1].findall('job'):
    id = x.find('id').text
    tracking = x.find('tracking').text
    print(root[1].tag, id, tracking)
The script currently returns the following:
jobs 5654206 None
jobs 5654203 None

Debugging is your friend...
I am struggling with the multiple child/parent relationships.
The right way to resolve this yourself is with a debugger. For example, in VS Code, after setting a breakpoint and running the script under the debugger, execution stops at the breakpoint and I can inspect all the variables in memory, and run commands at the debug console just as if they were in my script, with the Variables window showing each element's tag and contents.
There are various ways to do this at the command line, or with a REPL like IPython, but I find that debugging in a modern IDE like VS Code or PyCharm is definitely the way to go. Their debuggers remove the need to pepper print statements everywhere, or to rewrite your code just to expose more variables to the console.
A debugger allows you to see all the variables as a snapshot, exactly how the Python interpreter sees them, at any point in your code execution. You can:
step through your code line by line and watch the variables change in real time
set up a separate watch window with only the variables you care about
set up breakpoints that trigger only when variables are set to particular values, etc.
Child Hierarchy with the XML find method
Inspecting the variables as I stepped through your code, I saw that x.find('tracking') returns the tracking element, but its text property is empty: the actual data lives in the mailPiece child nodes one level down. If you iterate over the tracking element and print the tag property instead of the text property, you will see 'mailPiece'.
So, one way to resolve your issue is to store each mailPiece element as a variable, then pull out the individual attributes you want from it (e.g. barCode, address, etc.) using find().
Here is some code that pulls all of this into a combined hierarchy of lists and dictionaries that you can then use to build your Excel outputs.
Note: the most efficient way to do this is line by line as you read the XML, but this approach is better for readability, maintainability, and any post-processing that requires knowledge of more than one node at a time.
import xml.etree.ElementTree as ET

tree = ET.parse('mailings.xml')
root = tree.getroot()

jobs = []
for job in root[1].findall('job'):
    jobdict = {}
    jobdict['id'] = job.find('id').text
    jobdict['trackingMailPieces'] = []
    for tracking in job.find('tracking'):
        if tracking.tag == 'mailPiece':
            mailPieceDict = {}
            mailPieceDict['barCode'] = tracking.find('barCode').text
            mailPieceDict['address'] = tracking.find('address').text
            mailPieceDict['status'] = tracking.find('status').text
            mailPieceDict['dateTime'] = tracking.find('dateTime').text
            mailPieceDict['statusLocation'] = tracking.find('statusLocation').text
            jobdict['trackingMailPieces'].append(mailPieceDict)
    jobs.append(jobdict)

for job in jobs:
    print('Job ID: {}'.format(job['id']))
    for mp in job['trackingMailPieces']:
        print('  mailPiece:')
        for key, value in mp.items():
            print('    {} = {}'.format(key, value))
The result is:
Job ID: 5654206
  mailPiece:
    barCode = 00270200802095682022
    address = Accounts Payable,1661 Knott Ave,La Mirada,CA,90638
    status = En Route
    dateTime = 2019-10-12 00:04:21.0
    statusLocation = PONTIAC,MI
Output?
I didn't address what to do with the output, as that is beyond the scope of this question, but consider writing out to a CSV file, or even directly to an Excel file, if you don't need to pass the XML on to another program for some reason. There are Python packages that handle writing both CSV and Excel files, so there is no need to create an intermediate format that you then have to manipulate again after bringing it into Excel.
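For instance, here is a minimal sketch using the standard library's csv module. The jobs list is the one built by the code above (shown here with one hard-coded sample entry), and the choice of columns is just an illustration; add the other mailPiece fields as needed:

```python
import csv

# `jobs` as built by the parsing code above; one sample entry shown here
jobs = [{'id': '5654206',
         'trackingMailPieces': [{'barCode': '00270200802095682022',
                                 'address': 'Accounts Payable,1661 Knott Ave,La Mirada,CA,90638',
                                 'status': 'En Route'}]}]

with open('jobs.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['jobId', 'barCode', 'address', 'status'])
    for job in jobs:
        # one CSV row per mailPiece, repeating the job ID in the first column
        for mp in job['trackingMailPieces']:
            writer.writerow([job['id'], mp['barCode'], mp['address'], mp['status']])
```

The resulting file opens directly in Excel, with the job ID and tracking data lined up row by row.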


How to extract relation members from .osm xml files

All,
I've been trying to build a website (in Django) which is to be an index of all MTB routes in the world. I'm a Pythonian so wherever I can I try to use Python.
I've successfully extracted data from the OSM API (Display relation (trail) in leaflet) but found that doing this for all MTB trails (tag: route=mtb) is too much data (processing takes very long). So I tried to do everything locally by downloading a torrent of the entire OpenStreetMap dataset (from Latest Weekly Planet XML File) and filtering for tag: route=mtb using osmfilter (part of osmctools in Ubuntu 20.04), like this:
osmfilter $unzipped_osm_planet_file --keep="route=mtb" -o=$osm_planet_dir/world_mtb_routes.osm
This produces a file of about 1.2 GB, and on closer inspection it seems to contain all the data I need. My goal was to transform the file into a pandas.DataFrame() so I could do some further filtering and transforming before pushing relevant aspects into my Django DB. I tried to load the file as a regular XML file using Python pandas, but this crashed the Jupyter notebook kernel. I guess the data is too big.
My second approach was this solution: How to extract and visualize data from OSM file in Python. It worked for me; at least, I can get some of the information, like the tags of the relations in the file (and the other specified details). What I'm missing are the relation members (the ways), then the way members (the nodes) and their latitudes/longitudes. I need these to achieve what I did here: Plotting OpenStreetMap relations does not generate continuous lines
I'm open to many solutions; for example, one could break the file up into many different files, each containing one relation and its members, using an osmium-based script. Perhaps then I can move on with pandas.read_xml(). This would be nice for batch processing and filling the database. Loading the whole OSM XML file into a pd.DataFrame would be nice, but I guess this really is a lot of data. Perhaps this can also be done on a per-relation basis with pyosmium?
Any help is appreciated.
Ok, I figured out how to get what I want (all information per relation of the type "route=mtb" stored in an accessible way). It's a multi-step process, which I'll describe here.
First, I downloaded the world file: went to wiki.openstreetmap.org/wiki/Planet.osm and downloaded the planet as .pbf (everything here is on Linux, and this file is referred to as $osm_planet_file below).
I converted this file to o5m using osmconvert (available in Ubuntu 20.04 via apt install osmctools), on the Linux CLI:
osmconvert --verbose --drop-version $osm_planet_file -o=$osm_planet_dir/planet.o5m
The next step is to filter all relations of interest out of this file (in my case I wanted all MTB routes: route=mtb) and store them in a new file, like this:
osmfilter $osm_planet_dir/planet.o5m --keep="route=mtb" -o=$osm_planet_dir/world_mtb_routes.o5m
This creates a much smaller file that contains all information on the relations that are MTB routes.
From there on I switched to a Jupyter notebook and used Python 3 to divide the file further into useful, manageable chunks. I first installed osmium using conda (in an env I created first, but that step can be skipped):
conda install -c conda-forge osmium
Then I wrote a handler based on the recommended osm.SimpleHandler class; it iterates through the large o5m file and can perform actions as it goes. This is the way to deal with these files, because they are far too big for memory. I chose to iterate through the file and store everything I needed in separate JSON files. This generates more than 12,000 JSON files, but it can be done on my laptop with 8 GB of memory. This is the class:
import osmium as osm
import json
import os

data_dump_dir = '../data'

class OSMHandler(osm.SimpleHandler):
    def __init__(self):
        osm.SimpleHandler.__init__(self)
        self.osm_data = []

    def tag_inventory(self, elem, elem_type):
        for tag in elem.tags:
            data = dict()
            # Note: the trailing commas wrap each value in a 1-tuple,
            # which is why data['members'][0] is indexed later on.
            data['version'] = elem.version,
            data['members'] = [int(member.ref) for member in elem.members if member.type == 'w'],  # filter nodes from way list => could be a mistake
            data['visible'] = elem.visible,
            data['timestamp'] = str(elem.timestamp),
            data['uid'] = elem.uid,
            data['user'] = elem.user,
            data['changeset'] = elem.changeset,
            data['num_tags'] = len(elem.tags),
            data['key'] = tag.k,
            data['value'] = tag.v,
            data['deleted'] = elem.deleted
            with open(os.path.join(data_dump_dir, str(elem.id) + '.json'), 'w') as f:
                json.dump(data, f)

    def relation(self, r):
        self.tag_inventory(r, "relation")
Run the class like this:
osmhandler = OSMHandler()
osmhandler.apply_file("../data/world_mtb_routes.o5m")
Now we have JSON files with the relation number as their filename, containing all metadata and a list of the ways. But we also want all the nodes per way, so we can plot the full relations (the MTB routes). To achieve this, we parse the o5m file again (using a class built on osm.SimpleHandler), and this time we extract all way members (the nodes) into a dictionary:
class OSMHandler(osm.SimpleHandler):
    def __init__(self):
        osm.SimpleHandler.__init__(self)
        self.osm_data = dict()

    def tag_inventory(self, elem, elem_type):
        for tag in elem.tags:
            self.osm_data[int(elem.id)] = dict()
            # self.osm_data[int(elem.id)]['is_closed'] = str(elem.is_closed)
            self.osm_data[int(elem.id)]['nodes'] = [str(n) for n in elem.nodes]

    def way(self, w):
        self.tag_inventory(w, "way")
Execute the class:
osmhandler = OSMHandler()
osmhandler.apply_file("../data/world_mtb_routes.o5m")
ways = osmhandler.osm_data
This gives us a dict (called ways) with all way IDs as keys and their node IDs (meaning we still need some more steps!) as values.
len(ways.keys())
>>> 337597
In the next (and almost last) step we add the node IDs for all ways to our relation jsons, so they become part of the files:
all_data = dict()
for relation_file in [
os.path.join(data_dump_dir,file) for file in os.listdir(data_dump_dir) if file.endswith('.json')
]:
with open(relation_file, 'r') as f:
data = json.load(f)
if 'members' in data: # Make sure these steps are never performed twice
try:
data['ways'] = dict()
for way in data['members'][0]:
data['ways'][way] = ways[way]['nodes']
del data['members']
with open(relation_file, 'w') as f:
json.dump(data, f)
except KeyError as err:
print(err, relation_file) # Not sure why some relations give errors?
So now we have relation JSONs with all their ways, and all ways have their node IDs; the last thing to do is to replace the node IDs with their values (latitude and longitude). I also did this in two steps: first I built a nodeID:lat/lon dictionary, again using an osmium.SimpleHandler-based class:
import osmium

class CounterHandler(osmium.SimpleHandler):
    def __init__(self):
        osmium.SimpleHandler.__init__(self)
        self.osm_data = dict()

    def node(self, n):
        self.osm_data[int(n.id)] = [n.location.lat, n.location.lon]
Execute the class:
h = CounterHandler()
h.apply_file("../data/world_mtb_routes.o5m")
nodes = h.osm_data
This gives us a dict with a latitude/longitude pair for every node ID. We can use this on our JSON files to fill the ways with coordinates (where there are currently still only node IDs). I create these final JSON files in a new directory (data/with_coords in my case), because if there is an error, my original (input) JSON file is not affected and I can try again:
import os

relation_files = [file for file in os.listdir('../data/') if file.endswith('.json')]
for relation in relation_files:
    relation_file = os.path.join('../data/', relation)
    relation_file_with_coords = os.path.join('../data/with_coords', relation)
    with open(relation_file, 'r') as f:
        data = json.load(f)
    try:
        for way in data['ways']:
            node_coords_per_way = []
            for node in data['ways'][way]:
                node_coords_per_way.append(nodes[int(node)])
            data['ways'][way] = node_coords_per_way
        with open(relation_file_with_coords, 'w') as f:
            json.dump(data, f)
    except KeyError:
        print(relation)
Now I have what I need and I can start adding the info to my Django database, but that is beyond the scope of this question.
Btw, there are some relations that give an error; I suspect that in some relations ways were labelled as nodes, but I'm not sure. I'll update here if I find out. I also have to repeat this process regularly (when the world file updates, or every now and then), so I'll probably write something more concise later on, but for now this works and the steps are understandable, to me at least, after a lot of thinking.
All of the complexity comes from the fact that the data does not fit in memory; otherwise I'd have created a pandas.DataFrame in step one and been done with it. I could perhaps also have loaded the data into a database in one go, but I'm not that good with databases, yet.
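Once the final per-relation JSON files exist, flattening one of them into a pandas.DataFrame is straightforward. A minimal sketch, assuming the final shape produced by the steps above ({'ways': {way_id: [[lat, lon], ...]}}); the way IDs and coordinates here are made up for illustration:

```python
import pandas as pd

# Assumed final shape of one relation file from the steps above
relation = {'ways': {'8025721': [[52.1, 5.2], [52.2, 5.3]],
                     '8025722': [[52.3, 5.4]]}}

# One row per node, keeping the way ID so continuous lines can be plotted per way
rows = [{'way': way_id, 'lat': lat, 'lon': lon}
        for way_id, coords in relation['ways'].items()
        for lat, lon in coords]
df = pd.DataFrame(rows)
```

From here, grouping by the way column gives one coordinate sequence per way for plotting or for pushing into the Django DB.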

How to know the material name of a PID in an ABAQUS odb file (not MDB file) using Python scripting?

Here I need to know how to find the name of the material for each PID. I'm a beginner, so every answer helps me a lot.
odb = odbAccess.openOdb(path=(path + file), readOnly=True)
tag = "ERT"
step = []
step.append(odb.steps[odb.steps.keys()[0]])
frame = []
frame.append(step[0].frames[-1])
assembly = []
assembly.append(odb.rootAssembly)
instance = []
instance.append(assembly[0].instances[instance_n])
PIDs = []
for key in instance[0].elementSets.keys():
    if tag in key:
        PIDs.append(key)
for PID in PIDs:
    print(material_name[PID])  # here I need to know the name of the material for this PID
The information about the material, or rather the section, is stored within the section object. If you know your sections (you did not mention whether you have access to the model), you can look the material up there. If not, you have to obtain the material and section information from the odb as well.
What you CAN do within your set is this:
odb = odbAccess.openOdb(path=(path + file), readOnly=True)
my_elset = odb.rootAssembly.instances[0].elementSets[setname]
element_from_set = my_elset.elements[0]  # list with mesh element objects, e.g. the first
sec_category = element_from_set[0].sectionCategory
The section category contains several pieces of information about the underlying section. In case you do know your sections and the corresponding materials (maybe stored in a file after creating the model): good.
If not, you have to obtain further section information, e.g. via:
odb.sections[sectionname]
which, amongst other things, contains the material for each section in the odb.
In any case, I think you would make your life easier by obtaining that information directly within the mdb, if possible.
The above examples are only rudimentary; you might need to loop over them in case you don't have explicit set names, but those can also be obtained.
EDIT:
As a general recommendation: you can try all of this interactively, either by opening the Abaqus command prompt and typing
abaqus python
which gives you an Abaqus Python shell; or, with Abaqus open, you can use the shell at the bottom (>>> on yellow background) or the PDE.

Python parser. Need to read the "Name and author" out of a text file and out put all of the collected names into another text file

I'm trying to take two things out of text files that are in folders and output them into a neat list in a single text file. I've never done something like this before, and all of the online resources are either too simple or too complex for my task. I have a feeling this task is specific to what I'm trying to do.
[Info]
name = "bridget"
displayname = "BRIDGET"
versiondate = 04,13,2002
mugenversion = 04,14,2001
author = "[fraya]"
pal.defaults = 1
All I'm trying to do is take the "displayname" and "author" fields and output them to a file as a list in the format "(displayname) by (author)".
A parser was the first thing that came to mind when I wanted to try this (and I heard Python was a good choice for it).
So if anyone could point me in the right direction or give me some building blocks that would be helpful.
You don't need to write a parser; this is (almost) standard .ini file format, which can be read by the configparser module. You'll just need to strip the quotes when you output the values.
To get you started:
import configparser
c = configparser.ConfigParser()
c.read(['myfilename.ini'])
info = c['Info']
displayname = info['displayname'].strip('"')
author = info['author'].strip('"')
print("{} by {}".format(displayname, author))
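To process a whole tree of folders and write all the collected names to another text file, a sketch along these lines should work. The chars/*/*.def glob pattern is an assumption about where your files live; adjust it to your layout:

```python
import configparser
import glob
import os

def name_and_author(path):
    """Return '(displayname) by (author)' for one file with an [Info] section."""
    c = configparser.ConfigParser()
    c.read([path])
    info = c['Info']
    # Strip the surrounding quotes from the raw values
    return '{} by {}'.format(info['displayname'].strip('"'),
                             info['author'].strip('"'))

# Collect one line per file and write them all to a single output file
# ('chars/*/*.def' is a hypothetical layout -- point it at your folders)
with open('names.txt', 'w') as out:
    for path in sorted(glob.glob(os.path.join('chars', '*', '*.def'))):
        out.write(name_and_author(path) + '\n')
```

If some files are not quite valid INI, configparser will raise an error for them; wrapping the read in a try/except and logging the bad paths is an easy extension.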

Constant first row of a .csv file?

I have a Python code which is logging some data into a .csv file.
import datetime

logging_file = 'test.csv'
dt = datetime.datetime.now()
f = open(logging_file, 'a')
f.write('\n"{:%H:%M:%S}",{},{}'.format(dt, x, y))
The above code is the core part and this produces continuous data in .csv file as
"00:34:09" ,23.05,23.05
"00:36:09" ,24.05,24.05
"00:38:09" ,26.05,26.05
... etc.,
Now I wish to add the header line time,data1,data2 as the first row of this data. I expect output as
time, data1, data2
"00:34:09" ,23.05,23.05
"00:36:09" ,24.05,24.05
"00:38:09" ,26.05,26.05
... etc.,
I tried many ways, but none of them produced the result in my preferred format. Please help me solve the problem.
I would recommend writing a class specifically for creating and managing logs. Have it initialize a file, on creation, with the expected first line (don't forget a \n character!), and keep track of any necessary information about that log (the name of the log it created, where it is, etc.). You can then have the class 'write' to the log (append to it, really), create new logs as necessary, and have it check for existing logs and decide whether to update what exists or scrap it and start over.
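A minimal sketch of such a class, using the header and row format from the question (the class name is made up):

```python
import datetime
import os

class CsvLogger:
    """Sketch: writes the header row once, then appends data rows."""

    def __init__(self, path, header=('time', 'data1', 'data2')):
        self.path = path
        # Write the header only if the file does not exist yet (or is empty),
        # so re-running the script never duplicates it
        if not os.path.exists(path) or os.path.getsize(path) == 0:
            with open(path, 'w') as f:
                f.write(','.join(header) + '\n')

    def write(self, x, y):
        dt = datetime.datetime.now()
        with open(self.path, 'a') as f:
            f.write('"{:%H:%M:%S}",{},{}\n'.format(dt, x, y))
```

Usage mirrors the original code: CsvLogger('test.csv').write(23.05, 23.05) appends one row, and the header stays as the constant first line across runs.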

How can I modify the attributes of an SVG file from Python?

I have an svg file that was generated by the map data-visualisation software 'Kartograph'. It contains a large number of paths representing areas on a map. These paths each have some data fields:
<path d=" ...path info... " data-electorate="Canberra" data-id="Canberra" data-no="23" data-nop="0.92" data-percentile="6" data-state="ACT" data-totalvotes="25" data-yes="2" data-yesp="0.08" id="Canberra"/>
So that I don't have to generate a new SVG file every time, I want to modify some attributes, such as the number of 'yes' votes, from within Python. Specifically, I would like to increase the 'yes' vote value by one on each execution of the code.
I have tried lxml and browsed its documentation extensively, but so far this code has not worked:
from lxml import etree

filename = "aus4.svg"
tree = etree.parse(open(filename, 'r'))
for element in tree.iter():
    if element.tag.split("}")[1] == "path":
        if element.get("id") == "Lingiari":
            yes_votes = element.get("data-yes")
            print(yes_votes)
            yes_votes.set(yes_votes, str(int(yes_votes) + 1))
            print(yes_votes)
Is python the best tool to use for this task? If so how might I change the above code or start afresh. Apologies for any confusion. I am new to this 'lxml' module and svg files, so I'm a bit lost.
You are not setting the attribute on the element; you are calling set() on the attribute's value in this line:
yes_votes.set(yes_votes, str(int(yes_votes) + 1))
yes_votes contains the content of the attribute, not a reference to the attribute itself. Change the line to:
element.set( "data-yes", str(int(yes_votes) + 1))
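Putting the fix together, here is a minimal self-contained sketch using the standard library's xml.etree.ElementTree (the lxml API here is nearly identical); the sample SVG content and the data-id value are made up for illustration:

```python
import xml.etree.ElementTree as ET

svg = '''<svg xmlns="http://www.w3.org/2000/svg">
  <path d="M0 0" data-id="Lingiari" data-yes="2"/>
  <path d="M1 1" data-id="Canberra" data-yes="5"/>
</svg>'''

root = ET.fromstring(svg)
for element in root.iter():
    # strip the namespace: '{http://www.w3.org/2000/svg}path' -> 'path'
    if element.tag.split('}')[-1] == 'path' and element.get('data-id') == 'Lingiari':
        element.set('data-yes', str(int(element.get('data-yes')) + 1))

# ET.ElementTree(root).write('aus4.svg')  # persist the change back to disk
```

Using split('}')[-1] instead of split('}')[1] also works for tags that happen to have no namespace prefix; remember to write the tree back out afterwards, or the file on disk stays unchanged.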
