Python minidom look for empty text node - python

I am parsing an XML file with the minidom parser. I iterate over the XML and store specific information found between the tags in a dictionary.
Like this:
d={}
dom = parseString(data)
macro=dom.getElementsByTagName('macro')
for node in macro:
    d={}
    id_name=node.getElementsByTagName('id')[0].toxml()
    id_data=id_name.replace('<id>','').replace('</id>','')
    print (id_data)
    cl_name=node.getElementsByTagName('cl')[1].toxml()
    cl_data=cl_name.replace('<cl>','').replace('</cl>','')
    print (cl_data)
    d_source[id_data]=(cl_data)
Now, my problem is that the data I'm looking for in cl_name=node.getElementsByTagName('cl')[1].toxml() is sometimes non-existent!
In this case the part of the XML looks like this:
<cl>blabla</cl>
<cl></cl>
Because of this I receive an "index is out of range"-error.
However, I really need this "nothing" in my dictionary. My dictionary should look like this:
d={blabla:'',xyz:'abc'}
I have to look for the empty text node, which I tried by doing this:
if node.getElementsByTagName('cl')[1].toxml is None:
    print ('')
else:
    cl_name=node.getElementsByTagName('cl')[1].toxml()
    cl_data=cl_name.replace('<cl>','').replace('</cl>','')
    print (cl_data)
    d_target[id_data]=(cl_data)
print(d_target)
I still receive that indexing error... I also thought about inserting a white space into the original source file, but I am not sure whether this would solve the issue. Any ideas?

Unless minidom is a hard requirement, I suggest switching to the standard xml.etree.ElementTree. It is much easier to use.
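A minimal sketch of what that could look like, assuming each macro element holds one id and up to two cl children (the sample document and the function name second_cl_by_id are made up for illustration). In ElementTree an empty <cl></cl> has text None, so it can be mapped to an empty string explicitly:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; the real file layout is an assumption.
data = """<root>
  <macro><id>blabla</id><cl>first</cl><cl></cl></macro>
  <macro><id>xyz</id><cl>first</cl><cl>abc</cl></macro>
</root>"""

def second_cl_by_id(xml_text):
    """Map each macro's <id> text to the text of its second <cl>, or ''."""
    result = {}
    root = ET.fromstring(xml_text)
    for macro in root.iter('macro'):
        id_text = macro.findtext('id', default='')
        cls = macro.findall('cl')
        # A missing second <cl> and an empty <cl></cl> both become ''.
        if len(cls) > 1 and cls[1].text is not None:
            result[id_text] = cls[1].text
        else:
            result[id_text] = ''
    return result

d = second_cl_by_id(data)
print(d)
```

The length check also covers the case where the second cl is missing entirely, which is one way an "index out of range" error can arise.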

I figured out that it works when I add a white space to the original source file. This looks a bit messy, though, so if anyone has a better idea, I'm looking forward to it!

Related

python remove ctrl-character from string

I have a bunch of XML files dumped to disk in batches.
When I tried to parse them, I found that some had a control character inserted into an attribute.
It looked like this:
<root ^KIND="A"></root>
When it was supposed to look like this:
<root KIND="A"></root>
Now in this case it was easily fixed, just some regexp magic:
import re
xml = re.sub(r'<([^>]*)\v([^>]*)>', r'<\1K\2>', xml)
But then the requirements changed, and I had to dump the docs out to disk individually.
Naturally, I ran the substitution before saving so I wouldn't have that problem again.
There are a lot of these documents, you see, many millions...
And so, I was getting ready to extract some data from them again.
This time however I got a new error:
<root KIND="A"><CLASSIFICATION></CLASSIFICATIO^N></root>
When it was supposed to look like this:
<root KIND="A"><CLASSIFICATION></CLASSIFICATION></root>
I am not sure why I keep getting these errors, nor why it's always control characters that are inserted. It might be pure luck so far.
The regexp I used in the first case won't work in general. ^K translates to a vertical tab, so I could match against that. But is there some way to filter out any control character?
Try using a translate table to get rid of ctrl-A through ctrl-Z:
in_chars = ''.join([chr(x) for x in range(1, 27)])
out_chars = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
tr_table = str.maketrans(in_chars, out_chars)
# pass all strings through the translate table:
x = input('Enter text: ')
print(x.translate(tr_table))
Prints:
Enter text: abc^Kdef
abcKdef
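One caveat with the table above: range(1, 27) also covers tab (Ctrl-I), line feed (Ctrl-J) and carriage return (Ctrl-M), so legitimate whitespace in the XML would be turned into the letters I, J and M. A possible variant, assuming those three characters should be left untouched (the function name restore_ctrl_letters and the sample string are made up for illustration):

```python
# Map Ctrl-A..Ctrl-Z back to 'A'..'Z', but leave tab, line feed and
# carriage return alone, since they are legitimate whitespace in XML.
KEEP = {'\t', '\n', '\r'}
table = {
    code: chr(code + 64)  # chr(1) -> 'A', chr(11) -> 'K', chr(14) -> 'N'
    for code in range(1, 27)
    if chr(code) not in KEEP
}

def restore_ctrl_letters(text):
    """Replace stray control characters with the letters they stand for."""
    return text.translate(table)

# \x0b is Ctrl-K and \x0e is Ctrl-N, as in the two broken examples.
sample = '<root \x0bIND="A">\n\t<CLASSIFICATION></CLASSIFICATIO\x0e></root>'
fixed = restore_ctrl_letters(sample)
print(fixed)
```

str.translate accepts a dict that maps code points to replacement strings, so no str.maketrans call is needed here.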

How to do parsing in python?

I'm kinda new to Python, and I'm trying to find out how to do parsing in Python.
I've got a task: parse a file made of symbols that are unfamiliar to me and put the result into a DB. I guess I can create the DB and tables with the help of SQLAlchemy, but I have no idea how to do the parsing or what all the symbols below mean.
http://joxi.ru/YmEVXg6Iq3Q426
http://joxi.ru/E2pvG3NFxYgKrY
$$HDRPUBID 112701130020011127162536
H11127011300UNIQUEPONUMBER120011127
D11127011300UNIQUEPONUMBER100001112345678900000001
D21127011300UNIQUEPONUMBER1000011123456789AR000000001
D11127011300UNIQUEPONUMBER200002123456987X000000001
D21127011300UNIQUEPONUMBER200002123456987XIR000000000This item is inactive. 9781605600000
$$EOFPUBID 1127011300200111271625360000005
Thanks in advance to anyone who can give me some advice on where to start and how the parsing should work.
The best approach is to first figure out where each token begins and ends, and write a regular expression to capture these. The site RegexPal might help you design the regex.
As others suggest, take a look at some regex tutorials and at the help for the re module.
You're probably looking for something like this:
import re
# 1-based, inclusive column ranges for each header field
headerMapping = {'type': (1,5), 'pubid': (6,11), 'batchID': (12,21),
                 'batchDate': (22,29), 'batchTime': (30,35)}
poaBatchHeaders = re.findall(r'\$\$HDR\d{30}', text)
parsedBatchHeaders = []
for poaHeader in poaBatchHeaders:
    batchHeaderDict = {}  # a fresh dict per header, so entries are not shared
    for key in headerMapping:
        start = headerMapping[key][0]-1
        end = headerMapping[key][1]
        batchHeaderDict.update({key: poaHeader[start:end]})
    parsedBatchHeaders.append(batchHeaderDict)
You then have a list of dicts, where each dict holds the attributes of one matched structure (the POA batch header in this example). I assume your data file has been read into text as a string.
If you want to parse further, you have to write a function for each attribute that needs it, for example:
def batchDate(batch):
    return (batch[0:2]+'-'+batch[2:4]+'-20'+batch[4:])
for header in parsedBatchHeaders:
    header.update({'batchDate': batchDate( header['batchDate'] )})
Remember, that's only an example, and I don't know the documentation for your data! I guess it works like that, but the rest is up to you.
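Applied to the first line of the sample in the question, the same column ranges can be used with plain slicing. This is only a sketch: the column ranges are copied from headerMapping, while the function name parse_header and the reading of the date as YYYYMMDD and the time as HHMMSS are assumptions based on that one sample line. Slicing is used instead of \d{30} because the pubid field in the sample ("PUBID ") is not numeric:

```python
from datetime import datetime

# 1-based inclusive column ranges, copied from headerMapping in the answer.
HEADER_FIELDS = {
    'type': (1, 5), 'pubid': (6, 11), 'batchID': (12, 21),
    'batchDate': (22, 29), 'batchTime': (30, 35),
}

def parse_header(line):
    """Slice one fixed-width header line into named fields."""
    record = {
        name: line[start - 1:end]
        for name, (start, end) in HEADER_FIELDS.items()
    }
    # Assumption: date is YYYYMMDD and time is HHMMSS, as the sample suggests.
    record['timestamp'] = datetime.strptime(
        record['batchDate'] + record['batchTime'], '%Y%m%d%H%M%S')
    return record

sample = '$$HDRPUBID 112701130020011127162536'
header = parse_header(sample)
print(header)
```

The other record types (H1, D1, D2 and the $$EOF trailer) would need their own column tables, which have to come from the format's documentation.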

Python XML parsing - equivalent of "grep -v" in bash

This is one of my first forays into Python. I'd normally stick with bash; however, minidom seems to suit my needs for XML parsing perfectly, so I'm giving it a shot.
First question which I can't seem to figure out is, what's the equivalent for 'grep -v' when parsing a file?
Each object I'm pulling begins with a specific tag. If, within said tag, I want to exclude a row of data based off of a certain string embedded within the tag, how do I accomplish this?
Pseudo code that I've got now (no exclusion):
mainTag = xml.getElementsByTagName("network_object")
name = network_object.getElementsByTagName("Name")[0].firstChild.data
I'd like to see the data output all "name" fields, with the exception of strings that contain "cluster". Since I'll be doing multiple searches on network_objects, I believe I need to do it at that level, but don't know how.
Etree is giving me a ton of problems, can you give me some logic to do this with minidom?
This obviously doesn't work:
name = network_object.getElementsByTagName("Name")[0].firstChild.data
if name is not 'cluster' in name
continue
First of all, step away from the minidom module. Minidom is great if you already know the DOM from other languages and really do not want to learn any other API. There are easier alternatives available, right there in the standard library. I'd use the ElementTree API instead.
You generally just loop over matches, and skip over the ones that you want to exclude as you do so:
from xml.etree import ElementTree
tree = ElementTree.parse(somefile)
for name in tree.findall('.//network_object//Name'):
    if name.text is not None and 'cluster' in name.text:
        continue # skip this one
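Since the question asks for minidom specifically, the same skip-while-looping idea can be sketched there too. The sample document below is made up; only the network_object and Name element names come from the question:

```python
from xml.dom.minidom import parseString

# Hypothetical sample document; element names follow the question.
data = """<objects>
  <network_object><Name>fw-cluster-1</Name></network_object>
  <network_object><Name>web01</Name></network_object>
  <network_object><Name></Name></network_object>
</objects>"""

dom = parseString(data)
names = []
for network_object in dom.getElementsByTagName('network_object'):
    name_nodes = network_object.getElementsByTagName('Name')
    # Skip objects without a <Name>, or with an empty one (firstChild is None).
    if not name_nodes or name_nodes[0].firstChild is None:
        continue
    name = name_nodes[0].firstChild.data
    if 'cluster' in name:
        continue  # the "grep -v cluster" part
    names.append(name)

print(names)
```

Filtering inside the loop over network_object keeps the exclusion at the object level, so further lookups on the same object can follow after the check.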

XML creation from a Dictionary in Python

I am quite new to XML as well as Python, so please overlook any mistakes. I am trying to unpack a dictionary straight into XML format. My code fragment is as follows:
from xml.dom.minidom import Document
def make_xml(dictionary):
    doc = Document()
    result = doc.createElement('result')
    doc.appendChild(result)
    for key in dictionary:
        attrib = doc.createElement(key)
        result.appendChild(attrib)
        value = doc.createTextNode(dictionary[key])
        attrib.appendChild(value)
    print doc
I expected an answer of the format
<xml>
<result>
<attrib#1>value#1</attrib#1>
...
However all I am getting is
<xml.dom.minidom.Document instance at 0x01BE6130>
Please help
You have not checked the docs: http://docs.python.org/library/xml.dom.minidom.html
Look at the toxml() or toprettyxml() methods.
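For illustration, here is a Python 3 sketch of the function from the question that returns the serialized text instead of printing the Document object. The sample dictionary is made up, and the keys are assumed to be valid XML element names (something like attrib#1 would not be):

```python
from xml.dom.minidom import Document

def make_xml(dictionary):
    """Build <result> with one child element per dictionary key."""
    doc = Document()
    result = doc.createElement('result')
    doc.appendChild(result)
    for key, value in dictionary.items():
        child = doc.createElement(key)
        # createTextNode expects a string, so convert other types first.
        child.appendChild(doc.createTextNode(str(value)))
        result.appendChild(child)
    # toprettyxml() serializes with indentation; toxml() gives one line.
    return doc.toprettyxml(indent='  ')

xml_text = make_xml({'name': 'example', 'count': 3})
print(xml_text)
```

Printing the Document itself only shows its repr, which is why the original code produced the "instance at 0x..." line.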
You can always use a library like xmler which easily takes a python dictionary and converts it to xml. Full disclosure, I created the package, but I feel like it will probably do what you need.
Also feel free to take a look at the source.

Python: Matching & Stripping port number from socket data

I have data coming in to a python server via a socket. Within this data is the string '<port>80</port>' or which ever port is being used.
I wish to extract the port number into a variable. The data coming in is not XML, I just used the tag approach to identifying data for future XML use if needed. I do not wish to use an XML python library, but simply use something like regexp and strings.
What would you recommend is the best way to match and strip this data?
I am currently using this code with no luck:
p = re.compile('<port>\w</port>')
m = p.search(data)
print m
Thank you :)
Regex can't parse XML and shouldn't be used to parse fake XML. You should do one of the following:
Use a serialization method that is nicer to work with to start with, such as JSON or an ini file with the ConfigParser module.
Really use XML and not something that just sort of looks like XML and really parse it with something like lxml.etree.
Just store the number in a file if this is the entirety of your configuration. This solution isn't really easier than just using JSON or something, but it's better than the current one.
Implementing a bad solution now for future needs that you have no way of defining or accurately predicting is always a bad approach. You will be kept busy enough trying to write and maintain software now that there is no good reason to try to satisfy unknown future needs. I have never seen a case where "I'll put this in for later" has led to less headache later on, especially when I put it in by doing something completely wrong. YAGNI!
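As to what's wrong with your snippet, other than using an entirely wrong approach: \w matches exactly one character, so the pattern cannot match a multi-digit port such as 80, and there is no capturing group to pull the number out.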
Though Mike Graham is correct that using regex for XML is not 'recommended', the following will work:
(I have defined searchType as 'd' for numerals)
import re
searchStr = 'port'
searchType = 'd'
if searchType == 'd':
    retPattern = r'(<%s>)(\d+)(</%s>)'
else:
    retPattern = r'(<%s>)(.+?)(</%s>)'
searchPattern = re.compile(retPattern % (searchStr, searchStr))
found = searchPattern.search(data)  # data is the string received from the socket
retVal = found.group(2)
(note the complete lack of error checking, that is left as an exercise for the user)
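A possible version with the error checking filled in, as a sketch only. The function name extract_port and the choice to return None for a missing or out-of-range port are assumptions, not part of either answer:

```python
import re

# One capturing group around the digits between the tags.
PORT_RE = re.compile(r'<port>(\d+)</port>')

def extract_port(data):
    """Return the port as an int, or None if no valid <port> tag is found."""
    match = PORT_RE.search(data)
    if match is None:
        return None
    port = int(match.group(1))
    # Reject values outside the valid TCP/UDP port range.
    return port if 0 < port < 65536 else None

print(extract_port('hello <port>80</port> world'))
```

Returning None rather than raising keeps the caller's handling of malformed socket data in one place.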
