XML to store system paths in Python with lxml - python

I'm using an XML file to store configuration settings for a piece of software.
One of these settings is a system path like
> set_value = "c:\\test\\3 tests\\test"
I can store it by using:
> setting = etree.SubElement(settings, "setting", name=tmp_set_name, type=set_type, value=set_value)
If I use
doc.write(output_file, method='xml',encoding = 'utf-8', compression=0)
the file would be:
<setting type="str" name="MyPath" value="c:\test\3 tests\test"/>
Now I read it again with the etree.parse method.
I obtain an etree child object with a string value, but the string contains the \3 sequence, and if I try to use it to write to XML again it will be interpreted, so I cannot use it as a path anymore.
Maybe I'm only missing a simple string operation, but I cannot see it =)
How would you solve this in a smart way?
This is just an example, but what do you think is the best way to store paths in XML and parse them with lxml?
Thank you!

> Now I read it again with the etree.parse method. I obtain an etree child object with a string value, but the string contains the \3 sequence, and if I try to use it to write to XML again it will be interpreted!
I just tried that, and it doesn't get "interpreted". The element's attributes, as returned after parsing, are:
{'type': 'str', 'name': 'yowza!', 'value': 'c:\\test\\3 tests\\test'}
So, as you can see, this works just as you expected it to. If you really have this problem, you are doing something other than what you describe. Show us the real code, or make a small example that demonstrates the problem and use that.
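For reference, here is a minimal round-trip sketch with lxml (the file name and element names here are made up for illustration); the backslashes in the attribute value survive the write/parse cycle unchanged:
from lxml import etree

settings = etree.Element("settings")
etree.SubElement(settings, "setting", name="MyPath", type="str",
                 value="c:\\test\\3 tests\\test")

# Write the tree out, then parse it back in.
etree.ElementTree(settings).write("settings.xml", encoding="utf-8")
root = etree.parse("settings.xml").getroot()

print(root[0].get("value"))  # c:\test\3 tests\test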

Related

Working with Parameters containing Escaped Characters in Python Config file

I have a config file that I'm reading using the following code:
import configparser as cp
config = cp.ConfigParser()
config.read('MTXXX.ini')
MT = identify_MT(msgtext)
schema_file = config.get(MT, 'kbfile')
fold_text = config.get(MT, 'fold')
The relevant section of the config file looks like this:
[536]
kbfile=MT536.kb
fold=:16S:TRANSDET\n
Later I try to find text in a dictionary that matches the 'fold' parameter. I've found that if I search for that text using the following function:
def test(find_text):
    return {k for k, v in dictionary.items() if find_text in v}
I get different results if I call that function in one of two ways:
test(fold_text)
Fails to find the data I want, but:
test(':16S:TRANSDET\n')
returns the results I know are there.
And, if I print the content of the dictionary, I can see that it is, as expected, shown as
:16S:TRANSDET\n
So, it matches when I enter the search text directly, but doesn't find a match when I load the same text in from a config file.
I'm guessing that there's some magic being applied here when reading/handling the \n character pattern in from the config file, but don't know how to get it to work the way I want it to.
I want to be able to parameterise using escape characters but it seems I'm blocked from doing this due to some internal mechanism.
Is there some switch I can apply to the config reader, or some extra parsing I can do, to get the behavior I want? Or perhaps there's an alternative solution. I do find the configparser module convenient to use, but perhaps this is a limitation that requires an alternative, or even a self-built module, to lift data out of a parameter file.
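One thing that may help (a sketch, and my own assumption rather than something from the thread): configparser does not process escape sequences, so the file hands you a literal backslash followed by 'n'; you can decode those escapes yourself after reading:
import configparser as cp

config = cp.ConfigParser()
config.read('MTXXX.ini')

raw_fold = config.get('536', 'fold')  # read as a literal backslash plus 'n'
# Interpret Python-style escape sequences; assumes the value is latin-1/ASCII.
fold_text = raw_fold.encode('latin-1').decode('unicode_escape')  # now ends with a real newline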

Python parsing json data

I have a JSON object saved inside test_data and I need to know if the string inside test_data['sign_in_info']['package_type'] contains the string "vacation_package" in it. I assumed that in could help, but I'm not sure how to use it properly or if it's correct to use it. This is an example of the JSON object:
"checkout_details": {
"file_name" : "pnc04",
"test_directory" : "test_pnc04_package_today3_signedout_noinsurance_cc",
"scope": "wdw",
"number_of_adults": "2",
"number_of_children": "0",
"sign_in_info": {
"should_login": false,
**"package_type": "vacation_package"**
},
package_type has "vacation_package" in it, but it's not always this way.
For now I'm only saving the data this way:
package_type = test_data['sign_in_info']['package_type']
Now, is it ok to do something like:
p= "vacation_package"
if(p in package_type):
....
Or do I have to use 're' to cut the string and find it that way?
Your answer depends on what exactly you expect to get from test_data['sign_in_info']['package_type']. Will 'vacation_package' always be by itself? Then in is fine. Could it be part of a larger string? Then you need to use re.search. It might be safer just to use re.search (and it's a good opportunity to practice regular expressions).
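For example, a small sketch of the re.search case (the word-boundary pattern here is only illustrative, not part of the original answer):
import re

package_type = test_data['sign_in_info']['package_type']
# Matches 'vacation_package' even when it is embedded in a longer value,
# but not when it is just part of a bigger word.
if re.search(r'\bvacation_package\b', package_type):
    ...  # handle the vacation-package case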
No need to use re, assuming you are using the json package. Yes, it's okay to do that, but are you trying to see if there is a "package type" listed, or if the package type contains vacation_package, possibly among other things? If not, this might be closer to what you want, as it checks for exact matches:
import json
data = json.load(open('file.json'))
if data['sign_in_info'].get('package_type') == "vacation_package":
    pass  # do something

How to do parsing in python?

I'm kind of new to Python, and I'm trying to find out how to do parsing in Python.
I've got a task: parse a file full of symbols that are unknown to me and put the data into a DB. I guess I can create the DB and tables with the help of SQLAlchemy, but I have no idea how to do the parsing or what all the symbols below mean.
http://joxi.ru/YmEVXg6Iq3Q426
http://joxi.ru/E2pvG3NFxYgKrY
$$HDRPUBID 112701130020011127162536
H11127011300UNIQUEPONUMBER120011127
D11127011300UNIQUEPONUMBER100001112345678900000001
D21127011300UNIQUEPONUMBER1000011123456789AR000000001
D11127011300UNIQUEPONUMBER200002123456987X000000001
D21127011300UNIQUEPONUMBER200002123456987XIR000000000This item is inactive. 9781605600000
$$EOFPUBID 1127011300200111271625360000005
Thanks in advance to anyone who can give me some advice on where to start and on how the parsing should work.
The best approach is to first figure out where each token begins and ends, and write a regular expression to capture these. The site RegexPal might help you design the regex.
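For example, a sketch of that idea against the $$HDR line above; the field names and widths are guesses based on the sample data, not on any spec:
import re

header_re = re.compile(
    r'^\$\$HDR'               # record type
    r'(?P<pubid>.{6})'        # publisher id, padded to 6 characters
    r'(?P<batch_id>\d{10})'
    r'(?P<batch_date>\d{8})'
    r'(?P<batch_time>\d{6})'
)

m = header_re.match('$$HDRPUBID 112701130020011127162536')
if m:
    print(m.groupdict())
    # {'pubid': 'PUBID ', 'batch_id': '1127011300',
    #  'batch_date': '20011127', 'batch_time': '162536'}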
As others suggest, take a look at some regex tutorials, and also at the re module documentation.
You're probably looking for something like this:
import re

headerMapping = {'type': (1, 5), 'pubid': (6, 11), 'batchID': (12, 21),
                 'batchDate': (22, 29), 'batchTime': (30, 35)}

poaBatchHeaders = re.findall(r'\$\$HDR\d{30}', text)

parsedBatchHeaders = []
for poaHeader in poaBatchHeaders:
    batchHeaderDict = {}  # a fresh dict for each matched header
    for key in headerMapping:
        start = headerMapping[key][0] - 1
        end = headerMapping[key][1]
        batchHeaderDict.update({key: poaHeader[start:end]})
    parsedBatchHeaders.append(batchHeaderDict)
Then you have a list of dicts, and each dict contains the data for each attribute. I assume that you have your data file in text, which is a string. One dict is made for each structure that is found (a POA Batch Header in the example).
If you want to parse it further, you have to make a function to parse each value in each attribute, for example the dates:
def batchDate(batch):
    return batch[0:2] + '-' + batch[2:4] + '-20' + batch[4:]

for header in parsedBatchHeaders:
    header.update({'batchDate': batchDate(header['batchDate'])})
Remember, that's just an example and I don't know the documentation for your data! I guess it works like that, but the rest is up to you.

Python XML parsing - equivalent of "grep -v" in bash

This is one of my first forays into Python. I'd normally stick with bash, however Minidom seems to perfectly suit my needs for XML parsing, so I'm giving it a shot.
First question which I can't seem to figure out is, what's the equivalent for 'grep -v' when parsing a file?
Each object I'm pulling begins with a specific tag. If, within said tag, I want to exclude a row of data based off of a certain string embedded within the tag, how do I accomplish this?
Pseudo code that I've got now (no exclusion):
mainTag = xml.getElementsByTagName("network_object")
name = network_object.getElementsByTagName("Name")[0].firstChild.data
I'd like to see the data output all "name" fields, with the exception of strings that contain "cluster". Since I'll be doing multiple searches on network_objects, I believe I need to do it at that level, but don't know how.
Etree is giving me a ton of problems, can you give me some logic to do this with minidom?
This obviously doesn't work:
name = network_object.getElementsByTagName("Name")[0].firstChild.data
if name is not 'cluster' in name
continue
First of all, step away from the minidom module. Minidom is great if you already know the DOM from other languages and really do not want to learn any other API. There are easier alternatives available, right there in the standard library. I'd use the ElementTree API instead.
You generally just loop over matches, and skip over the ones that you want to exclude as you do so:
from xml.etree import ElementTree

tree = ElementTree.parse(somefile)
for name in tree.findall('.//network_object//Name'):
    if name.text is not None and 'cluster' in name.text:
        continue  # skip this one
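And since the question asked for minidom specifically, here is a rough equivalent (a sketch; it assumes the same element names as the pseudo code above):
from xml.dom import minidom

doc = minidom.parse(somefile)
for obj in doc.getElementsByTagName('network_object'):
    name_nodes = obj.getElementsByTagName('Name')
    if not name_nodes or name_nodes[0].firstChild is None:
        continue
    name = name_nodes[0].firstChild.data
    if 'cluster' in name:
        continue  # the "grep -v" part: skip names containing 'cluster'
    print(name)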

Python String parse

I'm working on a data packet retrieval system which will take a packet and process the various parts of the packet, based on a system of tags [similar to HTML tags].
[text based files only, no binary files].
Each part of the packet is contained between two identical tags, and here is a sample packet:
"<PACKET><HEAD><ID><ID><SEQ><SEQ><FILENAME><FILENAME><HEAD><DATA><DATA><PACKET>"
The entire packet is contained within the <PACKET><PACKET> tags.
All meta-data is contained within the <HEAD><HEAD> tags and the filename from which the packet is part of is contained within the, you guessed it, the <FILENAME><FILENAME> tags.
Lets say, for example, a single packet is received and stored in a temporary string variable called sTemp.
How do you efficiently retrieve, for example, only the contents of a single pair of tags, for example the contents of the <FILENAME><FILENAME> tags?
I was hoping for such functionality as saying getTagFILENAME( packetX ), which would return the textual string contents of the <FILENAME><FILENAME> tags of the packet.
Is this possible using Python?
Any suggestions or comments appreciated.
If the packet format effectively uses XML-looking syntax (i.e., if the "closing tags" actually include a slash), the xml.etree.ElementTree module could be used.
This library is part of the Python Standard Library, starting in Python 2.5. I find it a very convenient one for dealing with this kind of data. It provides many ways to read and to modify this kind of tree structure. Thanks to the generic nature of XML and to the XML awareness built into the ElementTree library, the packet syntax could easily evolve, for example to support repeating elements or element attributes.
Example:
>>> import xml.etree.ElementTree
>>> myPacket = '<PACKET><HEAD><ID>123</ID><SEQ>1</SEQ><FILENAME>Test99.txt</FILENAME></HEAD><DATA>spam and cheese</DATA></PACKET>'
>>> xt = xml.etree.ElementTree.fromstring(myPacket)
>>> wrk_ele = xt.find('HEAD/FILENAME')
>>> wrk_ele.text
'Test99.txt'
>>>
Something like this?
import re

def getPacketContent(code, packetName):
    match = re.search('<' + packetName + '>(.*?)<' + packetName + '>', code)
    return match.group(1) if match else ''

# usage
code = "<PACKET><HEAD><ID><ID><SEQ><SEQ><FILENAME><FILENAME><HEAD><DATA><DATA><PACKET>"
print(getPacketContent(code, 'HEAD'))
print(getPacketContent(code, 'SEQ'))
As mjv points out, there's not the least sense in inventing an XML-like format if you can just use XML.
But: If you're going to use XML for your packet format, you need to really use XML for it. You should use an XML library to create your packets, not just to parse them. Otherwise you will come to grief the first time one of your field values contains an XML markup character.
You can, of course, write your own code to do the necessary escaping, filter out illegal characters, guarantee well-formedness, etc. For a format this simple, that may be all you need to do. But going down that path is a way to learn things about XML that you perhaps would rather not have to learn.
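As an illustration of that point, a minimal sketch of building a packet with ElementTree rather than by string concatenation (the field values are made up); the library takes care of escaping markup characters for you:
import xml.etree.ElementTree as ET

packet = ET.Element('PACKET')
head = ET.SubElement(packet, 'HEAD')
ET.SubElement(head, 'ID').text = '123'
ET.SubElement(head, 'SEQ').text = '1'
ET.SubElement(head, 'FILENAME').text = 'Test99.txt'
ET.SubElement(packet, 'DATA').text = 'spam & cheese'  # '&' is escaped to '&amp;' on output

wire = ET.tostring(packet, encoding='utf-8')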
If using an XML library to create your packets is a problem, you're probably better off defining a custom format (and I'd define one that didn't look anything like XML, to keep people from getting ideas) and building a parser for it using pyparsing.
