I have an svg file that was generated by the map data-visualisation software 'Kartograph'. It contains a large number of paths representing areas on a map. These paths each have some data fields:
<path d=" ...path info... " data-electorate="Canberra" data-id="Canberra" data-no="23" data-nop="0.92" data-percentile="6" data-state="ACT" data-totalvotes="25" data-yes="2" data-yesp="0.08" id="Canberra"/>
So that I don't have to generate a new svg file every time, I want to modify some attributes, such as the number of 'yes' votes, from within python. Specifically, I would like to increment/increase the 'yes' votes value by one (for each execution of the code).
I have tried lxml and have browsed the documentation for it extensively, but so far this code has not worked:
from lxml import etree
filename = "aus4.svg"
tree = etree.parse(open(filename, 'r'))
for element in tree.iter():
    if element.tag.split("}")[1] == "path":
        if element.get("id") == "Lingiari":
            yes_votes = element.get("data-yes")
            print(yes_votes)
            yes_votes.set(yes_votes, str(int(yes_votes) + 1))
            print(yes_votes)
Is Python the best tool to use for this task? If so, how might I change the above code, or should I start afresh? Apologies for any confusion. I am new to the 'lxml' module and to svg files, so I'm a bit lost.
You never write the attribute back. In this line you call set() on the attribute's value instead of on the element:
yes_votes.set(yes_votes, str(int(yes_votes) + 1))
yes_votes contains the content of the attribute and not a reference to the attribute itself. Change it to:
element.set( "data-yes", str(int(yes_votes) + 1))
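Put together, the fix looks roughly like the sketch below. It uses the standard library's xml.etree.ElementTree (its get() and set() calls work the same way on lxml elements) so that it runs without extra installs, and it parses an inline SAMPLE string rather than aus4.svg. The SAMPLE document and the increment_yes helper are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
# Keep the default SVG namespace unprefixed when serialising.
ET.register_namespace("", SVG_NS)

# Hypothetical stand-in for the Kartograph output.
SAMPLE = (
    '<svg xmlns="http://www.w3.org/2000/svg">'
    '<path d="M0 0" id="Lingiari" data-yes="2"/>'
    '<path d="M1 1" id="Canberra" data-yes="5"/>'
    '</svg>'
)


def increment_yes(root, electorate_id):
    """Add one to data-yes on the path with the given id; return the new value."""
    for element in root.iter("{%s}path" % SVG_NS):
        if element.get("id") == electorate_id:
            yes_votes = int(element.get("data-yes"))
            # set() is called on the element, not on the string returned by get().
            element.set("data-yes", str(yes_votes + 1))
            return yes_votes + 1
    return None


root = ET.fromstring(SAMPLE)
new_value = increment_yes(root, "Lingiari")
updated_svg = ET.tostring(root, encoding="unicode")
```

When working on a file, the change also has to be written back (for example with tree.write(filename) after the set() call), otherwise it is lost when the script exits.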
I'm trying to take an API call response and parse the XML data into list, but I am struggling with the multiple child/parent relationships.
My hope is to export a new XML file that would line up each job ID and tracking number, which I could then import into Excel.
Here is what I have so far
The source XML file looks like this:
<project>
  <name>October 2019</name>
  <jobs>
    <job>
      <id>5654206</id>
      <tracking>
        <mailPiece>
          <barCode>00270200802095682022</barCode>
          <address>Accounts Payable,1661 Knott Ave,La Mirada,CA,90638</address>
          <status>En Route</status>
          <dateTime>2019-10-12 00:04:21.0</dateTime>
          <statusLocation>PONTIAC,MI</statusLocation>
        </mailPiece>
      </tracking>...
Code:
import xml.etree.ElementTree as ET
from xml.etree.ElementTree import Element, SubElement
tree = ET.parse('mailings.xml')
root = tree.getroot()
print(root.tag)
for x in root[1].findall('job'):
    id=x.find('id').text
    tracking=x.find('tracking').text
    print(root[1].tag,id,tracking)
The script currently returns the following:
jobs 5654206 None
jobs 5654203 None
Debugging is your friend...
I am struggling with the multiple child/parent relationships.
The right way to resolve this yourself is to use a debugger. For example, in VS Code, after setting a breakpoint and running the script under the debugger, execution stops at the breakpoint and I can inspect all the variables in memory and run commands in the debug console just as if they were in my script. The Variables window then lists each variable, including every element's tag, text and children.
There are various ways to do this at the command line, or with a REPL like IPython, but I find that debugging in a modern IDE like VS Code or PyCharm is definitely the way to go. Their debuggers remove the need to pepper print statements everywhere to test your code, or to rewrite it to expose more variables that must be printed to the console.
A debugger allows you to see all the variables as a snapshot, exactly how the Python interpreter sees them, at any point in your code execution. You can:
step through your code line by line and watch the variables change in real time in the window
set up a separate watch window with only the variables you care about
and set up breakpoints that trigger only when variables are set to particular values, etc.
Child Hierarchy with the XML find method
Inspecting the variables as I step through your code shows that x.find('tracking') returns the tracking element itself, not its contents. find() matches only direct children of the element it is called on, and an element that holds nothing but child elements has no text of its own, so its text property is None. The data you want sits one level further down, in the mailPiece children of tracking.
So, one way to resolve your issue is to loop over the children of tracking, keep each mailPiece element in a variable, then pull out the individual fields you want from it (e.g. barCode, address) using find.
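The behaviour of find() can be checked on a cut-down copy of the XML from the question. This is a minimal sketch: the SAMPLE string is an assumption standing in for mailings.xml, and it is written without whitespace between the tags.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down version of mailings.xml.
SAMPLE = (
    "<project><name>October 2019</name><jobs><job>"
    "<id>5654206</id>"
    "<tracking><mailPiece>"
    "<barCode>00270200802095682022</barCode>"
    "<status>En Route</status>"
    "</mailPiece></tracking>"
    "</job></jobs></project>"
)

root = ET.fromstring(SAMPLE)
job = root.find("jobs").find("job")

tracking = job.find("tracking")
tag_found = tracking.tag          # the <tracking> element itself
text_found = tracking.text        # None: it holds only child elements

direct = job.find("barCode")      # None: barCode is not a direct child of job
nested = job.find(".//barCode")   # a path starting with .// searches all descendants

piece_tags = [child.tag for child in tracking]  # iterating yields direct children
```

Either approach works for the question: iterate over the children of tracking, or use a path such as './/mailPiece' to reach them from the job element.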
Here is some code that pulls all of this into a combined hierarchy of lists and dictionaries that you can then use to build your Excel outputs.
Note: The most efficient way to do this is line-by-line as you read the xml, but this is better for readability, maintainability, and if you need to do any post-processing that requires knowledge of more than one node at a time.
import xml.etree.ElementTree as ET
tree = ET.parse('mailings.xml')
root = tree.getroot()
# Build a list of jobs, each holding a list of mailPiece dictionaries
jobs = []
for job in root[1].findall('job'):
    jobdict = {}
    jobdict['id'] = job.find('id').text
    jobdict['trackingMailPieces'] = []
    # Iterating over an element yields its direct children
    for tracking in job.find('tracking'):
        if tracking.tag == 'mailPiece':
            mailPieceDict = {}
            mailPieceDict['barCode'] = tracking.find('barCode').text
            mailPieceDict['address'] = tracking.find('address').text
            mailPieceDict['status'] = tracking.find('status').text
            mailPieceDict['dateTime'] = tracking.find('dateTime').text
            mailPieceDict['statusLocation'] = tracking.find('statusLocation').text
            jobdict['trackingMailPieces'].append(mailPieceDict)
    jobs.append(jobdict)
# Print the collected hierarchy
for job in jobs:
    print('Job ID: {}'.format(job['id']))
    for mp in job['trackingMailPieces']:
        print(' mailPiece:')
        for key, value in mp.items():
            print(' {} = {}'.format(key, value))
The result is:
Job ID: 5654206
mailPiece:
barCode = 00270200802095682022
address = Accounts Payable,1661 Knott Ave,La Mirada,CA,90638
status = En Route
dateTime = 2019-10-12 00:04:21.0
statusLocation = PONTIAC,MI
Output?
I didn't address what to do with the output as that is beyond the scope of this question, but consider writing out to a CSV file, or even directly to an Excel file, if you don't need to pass on the XML to another program for some reason. There are Python packages that handle writing CSV and Excel files.
No need to create an intermediate format that you then need to manipulate after bringing it into Excel, for example.
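As a rough illustration of the CSV route, the sketch below flattens a jobs list of the shape built above into one row per mailPiece using the standard csv module. The jobs_to_csv helper, the chosen columns and the in-memory buffer are assumptions for the example; a real script would pass a file opened with newline='' instead.

```python
import csv
import io

# Same shape as the `jobs` list built from the XML, trimmed for the example.
jobs = [
    {
        "id": "5654206",
        "trackingMailPieces": [
            {
                "barCode": "00270200802095682022",
                "status": "En Route",
                "statusLocation": "PONTIAC,MI",
            },
        ],
    },
]


def jobs_to_csv(jobs, out):
    """Write one CSV row per mailPiece, repeating the job id on each row."""
    fields = ["id", "barCode", "status", "statusLocation"]
    writer = csv.DictWriter(out, fieldnames=fields)
    writer.writeheader()
    for job in jobs:
        for piece in job["trackingMailPieces"]:
            row = {"id": job["id"]}
            row.update({key: piece.get(key, "") for key in fields[1:]})
            writer.writerow(row)


buffer = io.StringIO()
jobs_to_csv(jobs, buffer)
csv_text = buffer.getvalue()
```

The csv module quotes values that contain commas (such as the status location), so the columns stay aligned when the file is opened in Excel.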
I have huge XML datasets (2-40GB). Some of the data is confidential, so I am trying to edit the dataset to mask all of the confidential information. I have a long list of each value that needs to be masked, so for example if I have ID 'GYT-1064' I need to find and replace every instance of it. These values can be in different fields/levels/subclasses, so in one object it might have 'Order-ID = GYT-1064' whereas another might say 'PO-Name = GYT-1064'. I have looked into iterparse but cannot figure out how to in-place edit the xml file instead of building the entire new tree in memory, because I have to loop through it multiple times to find each instance of each ID.
Ideal functionality:
For each element, if a given string is in element, replace the text and change the line in the XML file.
I have a solution that works if the dataset is small enough to load into memory, but I can't figure out how to correctly leverage iterparse. I've also looked into every answer that talks about lxml iterparse, but since I need to iterate through the entire file multiple times, I need to be able to edit it in place.
Simple version that works, but has to load the whole xml into memory (and isn't in-place)
values_to_mask = ['val1', 'GMX-103', 'etc-555'] #imported list of vals to mask
with open(dataset_name, encoding='utf8') as f:
    tree = ET.parse(f)
    root = tree.getroot()
    for old in values_to_mask:
        new = mu.generateNew(old, randomnumber) #utility to generate new amt
        for elem in root.iter():
            try:
                elem.text = elem.text.replace(old, new)
            except AttributeError:
                pass
tree.write(output_name, encoding='utf8')
What I attempted with iterparse:
with open(output_name, mode='rb+') as f:
    context = etree.iterparse( f )
    for old in values_to_mask:
        new = mu.generateNew(old, randomnumber)
        mu.fast_iter(context, mu.replace_if_exists, old, new, f)
def replace_if_exists(elem, old, new, xf):
    try:
        if(old in elem.text):
            elem.text = elem.text.replace(old, new)
            xf.write(elem)
    except AttributeError:
        pass
It runs but doesn't replace any text, and I get print(context.root) = 'Null'. Additionally, it doesn't seem like it would correctly write back to the file in place.
Basically how the XML data looks (hierarchical objects with subclasses)
It looks generally like this:
<Master_Data_Object>
<Package>
<PackageNr>1000</PackageNr>
<Quantity>900</Quantity>
<ID>FAKE_CONFIDENTIALGYO421</ID>
<Item_subclass>
<ItemType>C</ItemType>
<MasterPackageID>FAKE_CONFIDENTIALGYO421</MasterPackageID>
<Package>
<Other_Types>
Since there is no sample dataset to test against, I would suggest that you:
1) use readlines() with a size hint in a loop, to read a substantial amount of data at a time
2) use a regular expression to identify the confidential information (if possible), then replace it.
Let me know if it works
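A minimal sketch of that suggestion follows. It treats the XML purely as text, streams it line by line into a second file object, and replaces every value in a single pass by joining them into one regular expression. The mask_lines helper and the sample data are assumptions for illustration, and the approach only holds if no ID is split across lines and the replacement values need no XML escaping.

```python
import io
import re


def mask_lines(src, dst, replacements):
    """Copy src to dst line by line, replacing every key of `replacements`."""
    if not replacements:
        dst.writelines(src)
        return
    # One alternation for all IDs, longest first so longer IDs win over prefixes.
    ordered = sorted(replacements, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(value) for value in ordered))
    for line in src:
        dst.write(pattern.sub(lambda match: replacements[match.group(0)], line))


# In-memory stand-ins for the input and output files.
source = io.StringIO(
    "<Package>\n"
    "<ID>GYT-1064</ID>\n"
    "<MasterPackageID>GYT-1064</MasterPackageID>\n"
    "<Quantity>900</Quantity>\n"
    "</Package>\n"
)
target = io.StringIO()
mask_lines(source, target, {"GYT-1064": "MASK-0001"})
masked = target.getvalue()
```

Because every ID is handled in the same pass, the file is read once rather than once per value, and only one line is held in memory at a time.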
A SAX parser is well suited to XML files this big.
Here is your answer:
Editing big xml files using sax parser
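One way such a SAX approach could look is sketched below: the standard library's XMLGenerator re-emits every event it receives, so subclassing it and overriding characters() gives a streaming copy with the text replaced. The MaskingWriter class and the sample input are assumptions for illustration, not the code from the linked answer. A parser may deliver one text node in several chunks, so an ID could in principle be split across two characters() calls; a production version would need to buffer text per element.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator


class MaskingWriter(XMLGenerator):
    """Re-emit the parsed document, replacing confidential values in text nodes."""

    def __init__(self, out, replacements):
        super().__init__(out, encoding="utf-8")
        self._replacements = replacements

    def characters(self, content):
        for old, new in self._replacements.items():
            content = content.replace(old, new)
        super().characters(content)


# Hypothetical sample; a real run would parse the large file with xml.sax.parse().
SAMPLE = b"<Package><ID>GYT-1064</ID><Quantity>900</Quantity></Package>"

out = io.StringIO()
xml.sax.parseString(SAMPLE, MaskingWriter(out, {"GYT-1064": "MASK-0001"}))
masked_xml = out.getvalue()
```

The output goes to a new file rather than being edited in place; once it is complete it can replace the original.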
I'm trying to insert a picture into a Word document using python-docx but running into errors.
The code is simply:
document.add_picture("test.jpg", width = Cm(2.0))
From looking at the python-docx documentation I can see that the following XML should be generated:
<pic:pic xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture">
  <pic:nvPicPr>
    <pic:cNvPr id="1" name="python-powered.png"/>
    <pic:cNvPicPr/>
  </pic:nvPicPr>
  <pic:blipFill>
    <a:blip r:embed="rId7"/>
    <a:stretch>
      <a:fillRect/>
    </a:stretch>
  </pic:blipFill>
  <pic:spPr>
    <a:xfrm>
      <a:off x="0" y="0"/>
      <a:ext cx="859536" cy="343814"/>
    </a:xfrm>
    <a:prstGeom prst="rect"/>
  </pic:spPr>
</pic:pic>
This does in fact get generated in my document.xml file (when unzipping the docx file). However, looking into the OOXML format, I can see that the image should also be saved under the media folder and the relationship should be mapped in word/_rels/document.xml:
<Relationship Id="rId20"
Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image"
Target="media/image20.png"/>
None of this happens, however, and when I open the Word document I'm met with a "The picture can't be displayed" placeholder.
Can anyone help me understand what is going on?
It looks like the image is not embedded the way it should be, and that I would need to insert it in the media folder and add the mapping for it myself. However, as this is a well-documented feature, it should work as expected.
UPDATE:
Testing it out with an empty docx file, the image does get added as expected, which leads me to believe it might have something to do with the python-docx-template library (https://github.com/elapouya/python-docx-template).
It uses python-docx and jinja to allow templating capabilities but runs and works the same way python-docx should. I added the image to a subdoc which then gets inserted into a full document at a given place.
Sample code can be seen below (from https://github.com/elapouya/python-docx-template/blob/master/tests/subdoc.py):
from docxtpl import DocxTemplate
from docx.shared import Inches
tpl=DocxTemplate('test_files/subdoc_tpl.docx')
sd = tpl.new_subdoc()
sd.add_paragraph('A picture :')
sd.add_picture('test_files/python_logo.png', width=Inches(1.25))
context = {
    'mysubdoc' : sd,
}
tpl.render(context)
tpl.save('test_files/subdoc.docx')
I'll keep this up in case anyone else manages to make the same mistake as I did :) I managed to debug it in the end.
The problem was in how I used the python-docx-template library. I opened up a DocxTemplate like so:
report_output = DocxTemplate(template_path)
DoThings(value,template_path)
report_output.render(dictionary)
report_output.save(output_path)
But I accidentally opened it up twice. Instead of passing the template to a function, when working with it, I passed a path to it and opened it again when creating subdocs and building them.
def DoThings(data,template_path):
    doc = DocxTemplate(template_path)
    temp_finding = doc.new_subdoc()
    #DO THINGS
Finally after I had the subdocs built, I rendered the first template which seemed to work fine for paragraphs and such but I'm guessing the images were added to the "second" opened template and not to the first one that I was actually rendering. After passing the template to the function it started working as expected!
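I ran into this problem too, and it went away once I removed the width=(1.0) parameter from the add_picture call.
With width=(1.0) set, I could not see the picture in test.docx,
so it MIGHT BE caused by an inappropriate size being set for the picture. python-docx lengths are counted in English Metric Units (914400 to the inch), so a bare 1.0 is far too small to see; wrapping the value in Cm() or Inches() avoids that.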
To add pictures, headings and paragraphs to an existing document:
from docx import Document
from docx.shared import Cm
doc = Document(full_path) # open an existing document with existing styles
for row in tableData: # list from the json api ...
    print ('row {}'.format(row))
    level = row['level']
    levelStyle = 'Heading ' + str(level)
    title = row['title']
    heading = doc.add_heading( title , level)
    heading.style = doc.styles[levelStyle]
    p = doc.add_paragraph(row['description'])
    if row['img_http_path']:
        # put the image name and the picture in a paragraph of their own
        ip = doc.add_paragraph()
        r = ip.add_run()
        r.add_text(row['img_name'])
        r.add_text("\n")
        r.add_picture(row['img_http_path'], width = Cm(15.0))
doc.save(full_path)
I'm trying to edit the text inside of all of the tags named "Volume" in an XML file by multiplying that text by a number entered by the user. The text inside of the "Volume" tag will always be a number. My code works so far, but only on the first instance of the "Volume" text.
Here's an example of the XML:
<blah>
<moreblah> sometext </moreblah> ;
<blah2>
<blah3> <blah4> 30 </blah4> <Volume> 15 </Volume> </blah3>
</blah2>
</blah>
<blah>
<moreblah> sometext </moreblah> ;
<blah2>
<blah3> <blah4> 30 </blah4> <Volume> 25 </Volume> </blah3>
</blah2>
</blah>
And here's my Python code:
#import modules
import xml.dom.minidom
from xml.dom.minidom import parse
import os
import fileinput
#create a backup of original file
new_file_name = 'blah.xml'
old_file_name = new_file_name + "_old"
os.rename(new_file_name, old_file_name)
#find all instances of "Volume"
doc = parse(old_file_name)
volume = doc.getElementsByTagName('Volume')[0]
child = volume.childNodes[0]
txt = child.nodeValue
#ask for percentage input
print
percentage = raw_input("Set Volume Percentage (1 - 100): ")
if percentage.isdigit():
    if int(percentage) <101 >1:
        print 'Thank You'
        #append text of <Volume> tag
        child.nodeValue = str(int(float(txt) * (int(percentage)/100.0)))
        #persist changes to new file
        xml_file = open(new_file_name, "w")
        doc.writexml(xml_file)
        xml_file.close()
        #remove XML Declaration
        text = open("blah.xml", "r").read()
        text = text.replace('<?xml version="1.0" ?>', '')
        open("blah.xml", "w").write(text)
    else:
        print
        print 'Please enter a number between 1 and 100.'
        print
        print 'Try again.'
        print
        print 'Exiting.'
        xml_file = open(new_file_name, "w")
        doc.writexml(xml_file)
        xml_file.close()
os.remove(old_file_name)
I know that in my code, I have "doc.getElementsByTagName('Volume')[0]" which denotes the first instance of the "Volume" tag, but I was just doing that as a test to see if it would work. So I'm aware that the code is working exactly as it should. But I'm wondering if anyone has any suggestions, or could tell me the easiest way to apply the user input percentage to all of the instances of the "Volume" tag.
This is also my first attempt at Python, so if you see anything else that seems weird, please let me know.
Thank you for your help!
You'll be much happier if you use a more modern XML API, like ElementTree (in the standard library) or lxml (more advanced).
In ElementTree or lxml you get access to XPath (or something close), which allows for a much more flexible syntax in finding elements and attributes in XML documents.
In ElementTree:
volumes = my_parsed_xml_file.findall('.//Volume')
...will find all occurrences of the Volume element (find() with the same path returns only the first one).
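As a sketch of the ElementTree route in Python 3, the example below scales every Volume in a small inline document. The SAMPLE string (wrapped in a single root element so that it is well-formed), the scale_volumes helper and the percentage of 50 are assumptions for illustration. The range check is written as 1 <= percentage <= 100, because the question's `int(percentage) <101 >1` is a chained comparison that only tests `< 101`.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for blah.xml, wrapped in one root element.
SAMPLE = """<root>
  <blah><blah2><blah3><blah4> 30 </blah4><Volume> 15 </Volume></blah3></blah2></blah>
  <blah><blah2><blah3><blah4> 30 </blah4><Volume> 25 </Volume></blah3></blah2></blah>
</root>"""


def scale_volumes(root, percentage):
    """Multiply the text of every Volume element by percentage / 100."""
    if not 1 <= percentage <= 100:
        raise ValueError("percentage must be between 1 and 100")
    for volume in root.findall(".//Volume"):
        # float() tolerates the surrounding whitespace in ' 15 '.
        volume.text = str(int(float(volume.text) * percentage / 100.0))


root = ET.fromstring(SAMPLE)
scale_volumes(root, 50)
scaled = [volume.text for volume in root.iter("Volume")]
```

When working from a file, ET.parse(...) followed by tree.write(...) would replace fromstring() here; by default tree.write() also leaves out the XML declaration that the original script strips by hand.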
If you stick with the current syntax, by doing:
doc.getElementsByTagName('Volume')[0]
...you're specifically asking for the zero-th (first) Volume. If you want to process them all, you want a loop:
for volume in doc.getElementsByTagName('Volume'):
    child = volume.childNodes[0]
    # ... rest of your code inside the loop
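Filled in, the minidom loop might look like the sketch below (Python 3). The inline document and the fixed percentage of 50 are assumptions standing in for blah.xml and the user's input.

```python
from xml.dom.minidom import parseString

# Hypothetical, minimal document with two Volume elements.
doc = parseString("<root><Volume> 15 </Volume><Volume> 25 </Volume></root>")
percentage = 50  # stands in for the validated user input

for volume in doc.getElementsByTagName("Volume"):
    child = volume.childNodes[0]  # the text node inside <Volume>
    child.nodeValue = str(int(float(child.nodeValue) * (percentage / 100.0)))

result = doc.documentElement.toxml()
```

Reading each node's own value inside the loop matters: the original script read txt once, before asking for input, which would apply the first Volume's number to every element.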
If constructs like loops are unfamiliar to you, you should probably step back and read an introductory programming guide, as things will get pretty complicated quickly without some fundamentals. Best of luck!
I'm really new to Python, but I've picked a problem that actually pertains to work and I think as I figure out how to do it I'll learn along the way.
I have a directory full of JSON-formatted files. I've gotten as far as importing everything in the directory into a list, and iterating through the list to do a simple print that verifies I got the data.
I'm trying to figure out how to actually work with a given JSON object in Python. In JavaScript, it's as simple as
var x = {'asd':'bob'}
alert( x.asd ) //alerts 'bob'
Accessing the various properties on an object is simple dot notation. What's the equivalent for Python?
So this is my code that is doing the import. I'd like to know how to work with the individual objects stored in the list.
#! /usr/local/bin/python2.6
import os, json
#define path to reports
reportspath = "reports/"
# Gets all json files and imports them
dir = os.listdir(reportspath)
jsonfiles = []
for fname in dir:
    with open(reportspath + fname,'r') as f:
        jsonfiles.append( json.load(f) )
for i in jsonfiles:
    print i #prints the contents of each file stored in jsonfiles
What you get when you json.load a file containing the JSON form of a Javascript object such as {'abc': 'def'} is a Python dictionary (normally and affectionately called a dict) (which in this case happens to have the same textual representation as the Javascript object).
To access a specific item, you use indexing, mydict['abc'], while in Javascript you'd use attribute-access notation, myobj.abc. What you get with attribute-access notation in Python are methods that you can call on your dict, for example mydict.keys() would give ['abc'], a list with all the key values that are present in the dictionary (in this case, only one, and it's a string).
Dictionaries are extremely rich in functionality, with a wealth of methods that will make your head spin plus strong support for many Python language structures (for example, you can loop on a dict, for k in mydict:, and k will step through the dictionary's keys, iteratively and sequentially).
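A short sketch of those access patterns in Python 3 follows; the JSON text and the variable names are assumptions for illustration.

```python
import json

# Hypothetical report contents; json.loads() parses a string the way json.load() parses a file.
report = json.loads('{"asd": "bob", "nested": {"count": 3}}')

name = report["asd"]                      # indexing, where JavaScript would use x.asd
count = report["nested"]["count"]         # nested objects become nested dicts
missing = report.get("nope", "default")   # get() avoids a KeyError for absent keys
keys = sorted(report)                     # iterating over a dict yields its keys
```

For a list of loaded files such as jsonfiles in the question, the same indexing applies to each item in the loop.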
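To list all properties, you can eval() the file contents before appending them to a list. Only do this with files you trust, since eval() executes arbitrary code, and note that it fails on the JSON literals true, false and null.
like:
import os
#define path to reports
reportspath = "reports/"
# Gets all json files and imports them
fnames = os.listdir(reportspath) # not named dir, so the dir() builtin stays usable
for fname in fnames:
    json = eval(open(reportspath + fname).read())
    # now, json is a normal python object
    print json
    # list all properties...
    print dir(json)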