read the text of a file between 2 words in python - python

I am trying to open an .xml file, find the fragment that contains a keyword I supply and that is delimited by the opening and closing 'profile' tags (tags included), and write only that fragment to a new .xml file that I generate.
My current Python script opens and reads the source .xml file, searches the text for the keyword, and writes every complete line that contains the keyword to a new .xml file, as follows:
keyword = 'Georgia'
occurrences = []
with open('test_input.xml') as lines:
    for line in lines:
        if keyword in line:
            occurrences.append(line)
archi1 = open("test_output.xml", "w")
archi1.write(''.join(occurrences))
archi1.close()
The result I get is a "test_output.xml" file that contains the following:
<id>Georgia-1</id>
<profile>Georgia-p1</profile>
<id>Georgia-2</id>
<profile>Georgia-p2</profile>
The problem is that I need more than the lines that contain the keyword (in this case 'Georgia'). I need the entire fragment that contains those lines, delimited by the opening and closing 'profile' tags. That is, I need the following result:
<profile>
<id>Georgia-1</id>
<activation>
<activeByDefault>false</activeByDefault>
</activation>
<properties>
<profile>Georgia-p1</profile>
<showtitle>Georgia_s1</showtitle>
<ip>000.000.0.3</ip>
<port>00003</port>
<persistencePort>00033</persistencePort>
<defaultLocale>en_GB</defaultLocale>
<webstart.server.name>host_3</webstart.server.name>
<codebaseProtocolServer>T3</codebaseProtocolServer>
</properties>
</profile>
<profile>
<id>Georgia-2</id>
<activation>
<activeByDefault>false</activeByDefault>
</activation>
<properties>
<profile>Georgia-p2</profile>
<showtitle>Georgia_s2</showtitle>
<ip>000.000.0.4</ip>
<port>00004</port>
<persistencePort>00044</persistencePort>
<defaultLocale>en_GB</defaultLocale>
<webstart.server.name>host_4</webstart.server.name>
<codebaseProtocolServer>T4</codebaseProtocolServer>
</properties>
</profile>
The full source .xml I am using is as follows:
<project>
<profile>
<id>Azerbaiyan-1</id>
<activation>
<activeByDefault>false</activeByDefault>
</activation>
<properties>
<profile>Azerbaiyan-p1</profile>
<showtitle>Azerbaiyan_s1</showtitle>
<ip>000.000.0.1</ip>
<port>00001</port>
<persistencePort>00011</persistencePort>
<defaultLocale>en_GB</defaultLocale>
<webstart.server.name>host_1</webstart.server.name>
<codebaseProtocolServer>T1</codebaseProtocolServer>
</properties>
</profile>
<profile>
<id>Azerbaiyan-2</id>
<activation>
<activeByDefault>false</activeByDefault>
</activation>
<properties>
<profile>Azerbaiyan-p2</profile>
<showtitle>Azerbaiyan_s2</showtitle>
<ip>000.000.0.2</ip>
<port>00002</port>
<persistencePort>00022</persistencePort>
<defaultLocale>en_GB</defaultLocale>
<webstart.server.name>host_2</webstart.server.name>
<codebaseProtocolServer>T2</codebaseProtocolServer>
</properties>
</profile>
<profile>
<id>Georgia-1</id>
<activation>
<activeByDefault>false</activeByDefault>
</activation>
<properties>
<profile>Georgia-p1</profile>
<showtitle>Georgia_s1</showtitle>
<ip>000.000.0.3</ip>
<port>00003</port>
<persistencePort>00033</persistencePort>
<defaultLocale>en_GB</defaultLocale>
<webstart.server.name>host_3</webstart.server.name>
<codebaseProtocolServer>T3</codebaseProtocolServer>
</properties>
</profile>
<profile>
<id>Georgia-2</id>
<activation>
<activeByDefault>false</activeByDefault>
</activation>
<properties>
<profile>Georgia-p2</profile>
<showtitle>Georgia_s2</showtitle>
<ip>000.000.0.4</ip>
<port>00004</port>
<persistencePort>00044</persistencePort>
<defaultLocale>en_GB</defaultLocale>
<webstart.server.name>host_4</webstart.server.name>
<codebaseProtocolServer>T4</codebaseProtocolServer>
</properties>
</profile>
<profile>
<id>USA-1</id>
<activation>
<activeByDefault>false</activeByDefault>
</activation>
<properties>
<profile>USA-p1</profile>
<showtitle>USA1_s1</showtitle>
<ip>000.000.0.5</ip>
<port>00005</port>
<persistencePort>00055</persistencePort>
<defaultLocale>en_GB</defaultLocale>
<webstart.server.name>host_5</webstart.server.name>
<codebaseProtocolServer>T5</codebaseProtocolServer>
</properties>
</profile>
<profile>
<id>USA-2</id>
<activation>
<activeByDefault>false</activeByDefault>
</activation>
<properties>
<profile>USA-p2</profile>
<showtitle>USA1_s2</showtitle>
<ip>000.000.0.6</ip>
<port>00006</port>
<persistencePort>00066</persistencePort>
<defaultLocale>en_GB</defaultLocale>
<webstart.server.name>host_6</webstart.server.name>
<codebaseProtocolServer>T6</codebaseProtocolServer>
</properties>
</profile>

Parse the input as XML and select the profile elements whose id child element has a text value containing the string "Georgia".
The following program uses the standard library's ElementTree module and prints the wanted result:
import xml.etree.ElementTree as ET
tree = ET.parse("input.xml")
# Iterate over the top-level 'profile' elements (direct children of the root)
for profile in tree.findall("profile"):
    profile_id = profile.find("id").text  # avoid shadowing the built-in id()
    if "Georgia" in profile_id:
        print(ET.tostring(profile).decode())
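Since the question asks for the fragments to end up in a new file rather than on the screen, here is a minimal sketch of that last step. The inline `SAMPLE` string is a trimmed, hypothetical stand-in for test_input.xml, and `matching_profiles` is a helper name made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical stand-in for test_input.xml.
SAMPLE = """<project>
  <profile>
    <id>Azerbaiyan-1</id>
    <properties><profile>Azerbaiyan-p1</profile></properties>
  </profile>
  <profile>
    <id>Georgia-1</id>
    <properties><profile>Georgia-p1</profile></properties>
  </profile>
</project>"""


def matching_profiles(root, keyword):
    """Return the top-level <profile> elements whose <id> text contains keyword."""
    matches = []
    for profile in root.findall("profile"):  # direct children only, not the nested <profile>
        id_text = profile.findtext("id", default="")
        if keyword in id_text:
            matches.append(profile)
    return matches


root = ET.fromstring(SAMPLE)
fragments = [
    ET.tostring(profile, encoding="unicode")
    for profile in matching_profiles(root, "Georgia")
]
output_text = "".join(fragments)
print(output_text)
```

Writing `output_text` to test_output.xml with `open("test_output.xml", "w")` should then give the file described in the question. Note that several sibling profile blocks without a single enclosing root element do not form a well-formed XML document on their own.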

Related

Python - Construct DF From Nested XML Response

What would be the best way to construct a DF from the below nested XML data?
Each "Properties" element contains three nested "Property" elements holding the "Name" and "Value" of our data. I tried two for loops, the pandas read_xml option, and a few other things, but haven't quite figured out the nested logic. My current approach below is closer, but does not keep the names and values together.
Using Python 3.7+ in Jupyter on Windows.
Sample XML Data:
<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope
xmlns:soap="http://www.w3.org/2003/05/soap-envelope"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/08/addressing"
xmlns:wsse="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd"
xmlns:wsu="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd">
<env:Header
xmlns:env="http://www.w3.org/2003/05/soap-envelope">
<wsa:Action>RetrieveResponse</wsa:Action>
<wsa:MessageID>urn:uuid:1234</wsa:MessageID>
<wsa:RelatesTo>urn:uuid:1234</wsa:RelatesTo>
<wsa:To>http://schemas.xmlsoap.org/ws/2004/08/addressing/role/anonymous</wsa:To>
<wsse:Security>
<wsu:Timestamp wsu:Id="Timestamp-45333">
<wsu:Created>2022-11-07T17:02:44Z</wsu:Created>
<wsu:Expires>2022-11-07T17:07:44Z</wsu:Expires>
</wsu:Timestamp>
</wsse:Security>
</env:Header>
<soap:Body>
<RetrieveResponseMsg
xmlns="http://exacttarget.com/wsdl/partnerAPI">
<OverallStatus>MoreDataAvailable</OverallStatus>
<RequestID>asdfds455</RequestID>
<Results xsi:type="DataExtensionObject">
<PartnerKey xsi:nil="true" />
<ObjectID xsi:nil="true" />
<Type>DataExtensionObject</Type>
<Properties>
<Property>
<Name>FIELD_NAME</Name>
<Value>asdfdfd12</Value>
</Property>
<Property>
<Name>FIELD_NAME_2</Name>
<Value>asdf</Value>
</Property>
<Property>
<Name>FIELD_NAME_3</Name>
<Value>fasdsa</Value>
</Property>
</Properties>
</Results>
<Results xsi:type="DataExtensionObject">
<PartnerKey xsi:nil="true" />
<ObjectID xsi:nil="true" />
<Type>DataExtensionObject</Type>
<Properties>
<Property>
<Name>FIELD_NAME</Name>
<Value>fasd123</Value>
</Property>
<Property>
<Name>FIELD_NAME_2</Name>
<Value>asdfd</Value>
</Property>
<Property>
<Name>FIELD_NAME_3</Name>
<Value>a0A4f</Value>
</Property>
</Properties>
</Results>
<Results xsi:type="DataExtensionObject">
<PartnerKey xsi:nil="true" />
<ObjectID xsi:nil="true" />
<Type>DataExtensionObject</Type>
<Properties>
<Property>
<Name>FIELD_NAME</Name>
<Value>0034P00</Value>
</Property>
<Property>
<Name>FIELD_NAME_2</Name>
<Value>fasdfs</Value>
</Property>
<Property>
<Name>FIELD_NAME_3</Name>
<Value>a0fasd</Value>
</Property>
</Properties>
</Results>
</RetrieveResponseMsg>
</soap:Body>
</soap:Envelope>
What I've Attempted So Far:
data_output = []
for el in soup_de.find_all('Property'):
    dict_ = {el.find('Name').text: el.find('Value').text}
    data_output.append(dict_)
print(len(data_output))
# print(data_output)
testing_de_df = pd.DataFrame(data_output)
display(testing_de_df.info())
display(testing_de_df.head(25))
Desired Output:
details = {'FIELD_NAME': ['asdfdfd12', 'fasd123', '0034P00'],
'FIELD_NAME_2': ['asdf', 'asdfd', 'fasdfs'],
'FIELD_NAME_3': ['fasdsa', 'a0A4f', 'a0fasd']}
desired_output = pd.DataFrame(details)
print(desired_output)
Since <Property> sits at a shallow part of the XML, simply call pandas.read_xml narrowing in on that set of nodes while acknowledging the default namespace (http://exacttarget.com/wsdl/partnerAPI):
property_df = pd.read_xml(
    "Input.xml",
    xpath=".//rrm:Property",
    namespaces={"rrm": "http://exacttarget.com/wsdl/partnerAPI"}
)
print(property_df)
# Name Value
# 0 FIELD_NAME asdfdfd12
# 1 FIELD_NAME_2 asdf
# 2 FIELD_NAME_3 fasdsa
# 3 FIELD_NAME fasd123
# 4 FIELD_NAME_2 asdfd
# 5 FIELD_NAME_3 a0A4f
# 6 FIELD_NAME 0034P00
# 7 FIELD_NAME_2 fasdfs
# 8 FIELD_NAME_3 a0fasd
To delineate by property, consider creating a property group number with groupby().cumcount() and reshaping data wide with pivot_table:
property_wide_df = (
    property_df
    .assign(property_no=lambda x: x.groupby("Name").cumcount().add(1))
    .pivot_table(index="property_no", columns="Name", values="Value", aggfunc="sum")
)
print(property_wide_df)
# Name FIELD_NAME FIELD_NAME_2 FIELD_NAME_3
# property_no
# 1 asdfdfd12 asdf fasdsa
# 2 fasd123 asdfd a0A4f
# 3 0034P00 fasdfs a0fasd
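An alternative that keeps each record's names and values together from the start is to build one dictionary per "Properties" element with the standard library. This is only a sketch: the inline `SAMPLE` is a trimmed, hypothetical stand-in for the SOAP response, and the `et` prefix is an arbitrary label chosen here for the default namespace.

```python
import xml.etree.ElementTree as ET

# "et" is an arbitrary prefix picked for this sketch; only the URI matters.
NS = {"et": "http://exacttarget.com/wsdl/partnerAPI"}

# Trimmed, hypothetical stand-in for the SOAP response.
SAMPLE = """<Envelope xmlns="http://www.w3.org/2003/05/soap-envelope">
  <Body>
    <RetrieveResponseMsg xmlns="http://exacttarget.com/wsdl/partnerAPI">
      <Results>
        <Properties>
          <Property><Name>FIELD_NAME</Name><Value>asdfdfd12</Value></Property>
          <Property><Name>FIELD_NAME_2</Name><Value>asdf</Value></Property>
        </Properties>
      </Results>
      <Results>
        <Properties>
          <Property><Name>FIELD_NAME</Name><Value>fasd123</Value></Property>
          <Property><Name>FIELD_NAME_2</Name><Value>asdfd</Value></Property>
        </Properties>
      </Results>
    </RetrieveResponseMsg>
  </Body>
</Envelope>"""

root = ET.fromstring(SAMPLE)

# One dict per <Properties> block, mapping each Name to its Value.
rows = []
for properties in root.iterfind(".//et:Properties", NS):
    row = {
        prop.findtext("et:Name", namespaces=NS): prop.findtext("et:Value", namespaces=NS)
        for prop in properties.findall("et:Property", NS)
    }
    rows.append(row)

print(rows)
```

Passing `rows` to `pd.DataFrame(rows)` should then give one row per record and one column per field name, without the pivot step.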

retrieve data from XML with namespaces with the help of xml.etree.ElementTree

I have the following XML, which I want to parse to get the 'SetOfFiles' entries into a dictionary of lists. Despite trying many permutations and combinations, I am unable to get that data into such a dictionary.
<?xml version="1.0" encoding="UTF-8"?>
<soapenv:Envelope
xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<soapenv:Body>
<ns1:SelectLogFilesResponse soapenv:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/"
xmlns:ns1="http://schemas.cisco.com/ast/soap/">
<FileSelectionResult xsi:type="ns2:SchemaFileSelectionResult"
xmlns:ns2="http://cisco.com/ccm/serviceability/soap/LogCollection/">
<Node xsi:type="ns2:Node">
<name xsi:type="xsd:string">10.201.196.84</name>
<ServiceList soapenc:arrayType="ns2:ServiceLogs[1]" xsi:type="soapenc:Array"
xmlns:soapenc="http://schemas.xmlsoap.org/soap/encoding/">
<item xsi:type="ns2:ServiceLogs">
<name xsi:type="xsd:string" xsi:nil="true"/>
<SetOfFiles soapenc:arrayType="ns2:file[2]" xsi:type="soapenc:Array">
<item xsi:type="ns2:file">
<name xsi:type="xsd:string">SDL002_200_000179.txt.gzo</name>
<absolutepath xsi:type="xsd:string">/var/log/active/cm/trace/cti/sdl/SDL002_200_000179.txt.gzo</absolutepath>
<filesize xsi:type="xsd:string">262967</filesize>
<modifiedDate xsi:type="xsd:string">Thu Apr 05 13:02:57 CDT 2018</modifiedDate>
</item>
<item xsi:type="ns2:file">
<name xsi:type="xsd:string">SDL002_100_000986.txt.gzo</name>
<absolutepath xsi:type="xsd:string">/var/log/active/cm/trace/ccm/sdl/SDL002_100_000986.txt.gzo</absolutepath>
<filesize xsi:type="xsd:string">912868</filesize>
<modifiedDate xsi:type="xsd:string">Thu Apr 05 13:02:56 CDT 2018</modifiedDate>
</item>
</SetOfFiles>
</item>
</ServiceList>
</Node>
</FileSelectionResult>
<ScheduleList soapenc:arrayType="ns3:Schedule[0]" xsi:type="soapenc:Array"
xmlns:ns3="http://cisco.com/ccm/serviceability/soap/LogCollection/"
xmlns:soapenc="http://schemas.xmlsoap.org/soap/encoding/"/>
</ns1:SelectLogFilesResponse>
</soapenv:Body>
</soapenv:Envelope>
What I have tried so far is the following, which doesn't give me any output (referring to https://docs.python.org/3.6/library/xml.etree.elementtree.html):
import xml.etree.ElementTree as ET
root = ET.fromstring(log)
ns={'ns1': 'http://schemas.cisco.com/ast/soap/', 'soapenv': 'http://schemas.xmlsoap.org/soap/envelope/'}
root.findall('soapenv:Envelope/soapenv:Body/ns1:SelectLogFilesResponse/FileSelectionResult/Node/ServiceList/item/SetOfFiles/item',ns)
One way to get the text is the following, but it doesn't give the related data:
for i in root.iter('absolutepath'):
    print(i.text)
Oh yes, totally. I can then do the following:
d = {}
files = root[0][0][0][0][1][0][1]  # the SetOfFiles element
for i in range(len(files)):
    l = []
    for f in files[i]:
        l.append(f.text)
    d['file' + str(i + 1)] = l
Output:
{'file1': ['SDL002_200_000179.txt.gzo', '/var/log/active/cm/trace/cti/sdl/SDL002_200_000179.txt.gzo', '262967', 'Thu Apr 05 13:02:57 CDT 2018'], 'file2': ['SDL002_100_000986.txt.gzo', '/var/log/active/cm/trace/ccm/sdl/SDL002_100_000986.txt.gzo', '912868', 'Thu Apr 05 13:02:56 CDT 2018']}
Thank you Jean. You made that very simple for me.
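For reference, a path-based lookup avoids the chain of positional indexes. The sketch below assumes a trimmed, hypothetical copy of the response (`SAMPLE`). Two details matter: `root` is already the Envelope element, so a path must start below it, and elements such as FileSelectionResult, Node and SetOfFiles carry no prefix and there is no default namespace, so they are matched by their plain names.

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical copy of the SOAP response.
SAMPLE = """<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
  <soapenv:Body>
    <ns1:SelectLogFilesResponse xmlns:ns1="http://schemas.cisco.com/ast/soap/">
      <FileSelectionResult>
        <Node>
          <name>10.201.196.84</name>
          <ServiceList>
            <item>
              <name/>
              <SetOfFiles>
                <item>
                  <name>SDL002_200_000179.txt.gzo</name>
                  <filesize>262967</filesize>
                </item>
                <item>
                  <name>SDL002_100_000986.txt.gzo</name>
                  <filesize>912868</filesize>
                </item>
              </SetOfFiles>
            </item>
          </ServiceList>
        </Node>
      </FileSelectionResult>
    </ns1:SelectLogFilesResponse>
  </soapenv:Body>
</soapenv:Envelope>"""

ns = {
    "ns1": "http://schemas.cisco.com/ast/soap/",
    "soapenv": "http://schemas.xmlsoap.org/soap/envelope/",
}

root = ET.fromstring(SAMPLE)  # root is the Envelope element itself

# Short form: search anywhere below the root for the file entries.
files = [
    {child.tag: child.text for child in item}
    for item in root.findall(".//SetOfFiles/item")
]

# Full path form: note that it starts at Body, not at Envelope.
same_items = root.findall(
    "soapenv:Body/ns1:SelectLogFilesResponse/FileSelectionResult"
    "/Node/ServiceList/item/SetOfFiles/item",
    ns,
)

print(files)
```

Each entry in `files` is a dictionary keyed by tag name, which keeps the related values of one file together.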

Quick way to Upper every value in xml?

I have the following xml:
<Item>
<Platform>itunes</Platform>
<PlatformID>102224185</PlatformID>
<Info>
<LanguageOfMetadata>EN</LanguageOfMetadata>
<Name>Commander In Chief</Name>
<Studio>abc</Studio>
</Info>
<Type>TVSeries</Type>
</Item>
What would be the quickest way to UPPER all the values? For example:
<Item>
<Platform>ITUNES</Platform>
<PlatformID>102224185</PlatformID>
<Info>
<LanguageOfMetadata>EN</LanguageOfMetadata>
<Name>COMMANDER IN CHIEF</Name>
<Studio>ABC</Studio>
</Info>
<Type>TVSERIES</Type>
</Item>
You can find all elements and call upper() on each element's text:
import lxml.etree as ET
data = """<Item>
<Platform>itunes</Platform>
<PlatformID>102224185</PlatformID>
<Info>
<LanguageOfMetadata>EN</LanguageOfMetadata>
<Name>Commander In Chief</Name>
<Studio>abc</Studio>
</Info>
<Type>TVSeries</Type>
</Item>
"""
root = ET.fromstring(data)
for elm in root.xpath("//*"):  # //* finds all elements recursively
    if elm.text:  # skip elements that have no text
        elm.text = elm.text.upper()
print(ET.tostring(root).decode())
Prints:
<Item>
<Platform>ITUNES</Platform>
<PlatformID>102224185</PlatformID>
<Info>
<LanguageOfMetadata>EN</LanguageOfMetadata>
<Name>COMMANDER IN CHIEF</Name>
<Studio>ABC</Studio>
</Info>
<Type>TVSERIES</Type>
</Item>
This does not cover the case where an element has a tail, e.g. <Studio>ABC</Studio>test instead of just <Studio>ABC</Studio>. To support that as well, add the following line inside the for loop:
elm.tail = elm.tail.upper() if elm.tail else None
Here is a way to uppercase everything, though note that this will include the tags as well:
from lxml import etree
# item is the parsed <Item> element
node = etree.fromstring(etree.tostring(item).upper())
print(etree.tostring(node, pretty_print=True).decode())
<ITEM>
<PLATFORM>ITUNES</PLATFORM>
<PLATFORMID>102224185</PLATFORMID>
<INFO>
<LANGUAGEOFMETADATA>EN</LANGUAGEOFMETADATA>
<NAME>COMMANDER IN CHIEF</NAME>
<STUDIO>ABC</STUDIO>
</INFO>
<TYPE>TVSERIES</TYPE>
</ITEM>
Assuming you can parse the XML file, you can rewrite the contents using the .upper() method that is built into Python strings. You call it like this:
"mystring".upper().
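If lxml is not available, the same idea can be sketched with the standard library's ElementTree. The inline `SAMPLE` below is a shortened, hypothetical version of the question's XML with a tail added, and both text and tail are guarded because either can be None.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical version of the question's XML, with a tail after </Info>.
SAMPLE = (
    "<Item><Platform>itunes</Platform>"
    "<Info><Name>Commander In Chief</Name></Info>tail text</Item>"
)

root = ET.fromstring(SAMPLE)
for elem in root.iter():  # the root element and every descendant
    if elem.text:  # text is None for elements without text content
        elem.text = elem.text.upper()
    if elem.tail:  # tail is None when nothing follows the closing tag
        elem.tail = elem.tail.upper()

result = ET.tostring(root, encoding="unicode")
print(result)
```

Tag names are left untouched because only `text` and `tail` are rewritten.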

Parsing xml with etree

I am trying to parse an XML response from Amazon's Product Advertising API, this is the xml
<?xml version="1.0" ?>
<ItemLookupResponse xmlns="http://webservices.amazon.com/AWSECommerceService/2010-11-01"> <OperationRequest>
<HTTPHeaders>
<Header Name="UserAgent" Value="TSN (Language=Python)"></Header>
</HTTPHeaders>
<RequestId>96ef9bc3-68a8-4bf3-a2c7-c98b8aeae00f</RequestId>
<Arguments>
<Argument Name="Operation" Value="ItemLookup"></Argument>
<Argument Name="Service" Value="AWSECommerceService"></Argument>
<Argument Name="Signature" Value="gjc4wRNum3YT82app1d06vMIDM7v44fOmZTP8Uh3LqE="></Argument><Argument Name="AssociateTag" Value="sneakick-20"></Argument>
<Argument Name="Version" Value="2010-11-01"></Argument>
<Argument Name="ItemId" Value="810056013349,810056013264"></Argument>
<Argument Name="IdType" Value="UPC"></Argument>
<Argument Name="AWSAccessKeyId" Value="AKIAIFMUMJLJOOINRVRA"></Argument>
<Argument Name="Timestamp" Value="2012-01-03T21:26:39Z"></Argument>
<Argument Name="ResponseGroup" Value="ItemIds"></Argument>
<Argument Name="SearchIndex" Value="Apparel"></Argument>
</Arguments>
<RequestProcessingTime>0.0595830000000000</RequestProcessingTime>
</OperationRequest>
<Items>
<Request>
<IsValid>True</IsValid>
<ItemLookupRequest>
<IdType>UPC</IdType>
<ItemId>810056013349</ItemId>
<ItemId>810056013264</ItemId>
<ResponseGroup>ItemIds</ResponseGroup>
<SearchIndex>Apparel</SearchIndex>
<VariationPage>All</VariationPage>
</ItemLookupRequest>
</Request>
<Item>
<ASIN>B000XR4K6U</ASIN>
</Item>
<Item>
<ASIN>B000XR2UU8</ASIN>
</Item>
</Items>
</ItemLookupResponse>
All I am interested in is the Item tags inside Items. Amazon returned all of that XML as a string, which I parsed like so:
from xml.etree.ElementTree import fromstring
response = "xml string returned by amazon"
parsed = fromstring(response)
items = parsed[1] # This is how i get the Items element
# These were my attempts at getting the Item element
items.find('Item')
items.findall('Item')
Here items is the Items element, but so far I have had no success: the calls keep returning None or an empty list. Am I missing something, or is there another way to go about this?
It is a namespace issue. This works:
from xml.etree import ElementTree as ET
XML = """<?xml version="1.0" ?>
<ItemLookupResponse xmlns="http://webservices.amazon.com/AWSECommerceService/2010-11-01">
<OperationRequest>
<HTTPHeaders>
<Header Name="UserAgent" Value="TSN (Language=Python)"></Header>
</HTTPHeaders>
<RequestId>96ef9bc3-68a8-4bf3-a2c7-c98b8aeae00f</RequestId>
<Arguments>
<Argument Name="Operation" Value="ItemLookup"></Argument>
<Argument Name="Service" Value="AWSECommerceService"></Argument>
<Argument Name="Signature" Value="gjc4wRNum3YT82app1d06vMIDM7v44fOmZTP8Uh3LqE="></Argument>
<Argument Name="AssociateTag" Value="sneakick-20"></Argument>
<Argument Name="Version" Value="2010-11-01"></Argument>
<Argument Name="ItemId" Value="810056013349,810056013264"></Argument>
<Argument Name="IdType" Value="UPC"></Argument>
<Argument Name="AWSAccessKeyId" Value="AKIAIFMUMJLJOOINRVRA"></Argument>
<Argument Name="Timestamp" Value="2012-01-03T21:26:39Z"></Argument>
<Argument Name="ResponseGroup" Value="ItemIds"></Argument>
<Argument Name="SearchIndex" Value="Apparel"></Argument>
</Arguments>
<RequestProcessingTime>0.0595830000000000</RequestProcessingTime>
</OperationRequest>
<Items>
<Request>
<IsValid>True</IsValid>
<ItemLookupRequest>
<IdType>UPC</IdType>
<ItemId>810056013349</ItemId>
<ItemId>810056013264</ItemId>
<ResponseGroup>ItemIds</ResponseGroup>
<SearchIndex>Apparel</SearchIndex>
<VariationPage>All</VariationPage>
</ItemLookupRequest>
</Request>
<Item>
<ASIN>B000XR4K6U</ASIN>
</Item>
<Item>
<ASIN>B000XR2UU8</ASIN>
</Item>
</Items>
</ItemLookupResponse>"""
NS = "{http://webservices.amazon.com/AWSECommerceService/2010-11-01}"
doc = ET.fromstring(XML)
Item_elems = doc.findall(".//" + NS + "Item") # All Item elements in document
print(Item_elems)
Output:
[<Element '{http://webservices.amazon.com/AWSECommerceService/2010-11-01}Item' at 0xbf0c50>,
<Element '{http://webservices.amazon.com/AWSECommerceService/2010-11-01}Item' at 0xbf0cd0>]
Variation closer to your own code:
NS = "{http://webservices.amazon.com/AWSECommerceService/2010-11-01}"
doc = ET.fromstring(XML)
items = doc[1] # Items element
first_item = items.find(NS + 'Item') # First direct Item child
all_items = items.findall(NS + 'Item') # List of all direct Item children
Namespace issue.
You can put the namespace in front of every tag name, as the answer above spells out. A possibly simpler solution is to ignore the namespace with a quick hack like this, which renames the default namespace declaration so the parser treats it as an ordinary attribute:
xml_hacked_namespace = raw_xml.replace(' xmlns=', ' xmlnamespace=')
doc = fromstring(xml_hacked_namespace)
item_list = doc.findall('.//Item')
If you find that you are doing a lot of work with xml you may also be interested in checking out lxml. It is faster and provides a few extra methods that some find nice to have.
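Another option that avoids both string concatenation and the replace hack is to pass a prefix-to-URI mapping to the find methods. A minimal sketch, assuming a trimmed, hypothetical copy of the response (`SAMPLE`); the `aws` prefix is an arbitrary label chosen here, only the URI has to match the document.

```python
import xml.etree.ElementTree as ET

# "aws" is an arbitrary prefix picked for this sketch.
NS = {"aws": "http://webservices.amazon.com/AWSECommerceService/2010-11-01"}

# Trimmed, hypothetical copy of the response.
SAMPLE = """<ItemLookupResponse xmlns="http://webservices.amazon.com/AWSECommerceService/2010-11-01">
  <Items>
    <Item><ASIN>B000XR4K6U</ASIN></Item>
    <Item><ASIN>B000XR2UU8</ASIN></Item>
  </Items>
</ItemLookupResponse>"""

doc = ET.fromstring(SAMPLE)

# Collect the ASIN text of every Item element in the document.
asins = [
    item.findtext("aws:ASIN", namespaces=NS)
    for item in doc.findall(".//aws:Item", NS)
]
print(asins)
```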

Editing the XML texts from a XML file using Python

I have an XML file which contains some data as given.
<?xml version="1.0" encoding="UTF-8" ?>
<ParameterData>
<CreationInfo date="10/28/2009 03:05:14 PM" user="manoj" />
<ParameterList count="85">
<Parameter name="Spec 2 Included" type="boolean" mode="both">
<Value>n/a</Value>
<Result>n/a</Result>
</Parameter>
<Parameter name="Spec 2 Label" type="string" mode="both">
<Value>n/a</Value>
<Result>n/a</Result>
</Parameter>
<Parameter name="Spec 3 Included" type="boolean" mode="both">
<Value>n/a</Value>
<Result>n/a</Result>
</Parameter>
<Parameter name="Spec 3 Label" type="string" mode="both">
<Value>n/a</Value>
<Result>n/a</Result>
</Parameter>
</ParameterList>
</ParameterData>
I have one text file with lines as
Spec 2 Included : TRUE
Spec 2 Label: 19-Flat2-HS3
Spec 3 Included : FALSE
Spec 3 Label: 4-1-Bead1-HS3
Now I want to edit the XML text, i.e. replace each n/a field with the corresponding value from the text file.
I want the file to look like this:
<?xml version="1.0" encoding="UTF-8" ?>
<ParameterData>
<CreationInfo date="10/28/2009 03:05:14 PM" user="manoj" />
<ParameterList count="85">
<Parameter name="Spec 2 Included" type="boolean" mode="both">
<Value>TRUE</Value>
<Result>TRUE</Result>
</Parameter>
<Parameter name="Spec 2 Label" type="string" mode="both">
<Value>19-Flat2-HS3</Value>
<Result>19-Flat2-HS3</Result>
</Parameter>
<Parameter name="Spec 3 Included" type="boolean" mode="both">
<Value>FALSE</Value>
<Result>FALSE</Result>
</Parameter>
<Parameter name="Spec 3 Label" type="string" mode="both">
<Value>4-1-Bead1-HS3</Value>
<Result>4-1-Bead1-HS3</Result>
</Parameter>
</ParameterList>
</ParameterData>
I am new to Python XML coding.
I have no idea how to edit the text fields in an XML file.
I am trying to use the elementtree.ElementTree module,
but I don't know which modules need to be imported to read the lines of the XML file and extract the attributes.
Please help.
Thanks and regards.
You can convert your data text into a Python dictionary with a regular expression:
data="""Spec 2 Included : TRUE
Spec 2 Label: 19-Flat2-HS3
Spec 3 Included : FALSE
Spec 3 Label: 4-1-Bead1-HS3"""
#data=open("data.txt").read()
import re
# raw string so that \d and \s reach the regex engine unchanged
data=dict(re.findall(r'(Spec \d+ (?:Included|Label))\s*:\s*(\S+)',data))
data will be as follows:
{'Spec 3 Included': 'FALSE', 'Spec 2 Included': 'TRUE', 'Spec 3 Label': '4-1-Bead1-HS3', 'Spec 2 Label': '19-Flat2-HS3'}
Then you can apply it using any of your favorite XML parsers; I will use minidom here.
from xml.dom import minidom
dom = minidom.parseString(xml_text)  # xml_text: the XML document as a string
params = dom.getElementsByTagName("Parameter")
for param in params:
    name = param.getAttribute("name")
    if name in data:
        for item in param.getElementsByTagName("*"):  # You may change to "Result" or "Value" only
            item.firstChild.replaceWholeText(data[name])
print(dom.toxml())
# write to file
with open("output.xml", "w", encoding="utf-8") as out_file:
    out_file.write(dom.toxml())
Results
<?xml version="1.0" ?><ParameterData>
<CreationInfo date="10/28/2009 03:05:14 PM" user="manoj"/>
<ParameterList count="85">
<Parameter mode="both" name="Spec 2 Included" type="boolean">
<Value>TRUE</Value>
<Result>TRUE</Result>
</Parameter>
<Parameter mode="both" name="Spec 2 Label" type="string">
<Value>19-Flat2-HS3</Value>
<Result>19-Flat2-HS3</Result>
</Parameter>
<Parameter mode="both" name="Spec 3 Included" type="boolean">
<Value>FALSE</Value>
<Result>FALSE</Result>
</Parameter>
<Parameter mode="both" name="Spec 3 Label" type="string">
<Value>4-1-Bead1-HS3</Value>
<Result>4-1-Bead1-HS3</Result>
</Parameter>
</ParameterList>
</ParameterData>
Well, you could start with
import xml.etree.ElementTree as ET
tree = ET.parse("blah.xml")
Find the elements you want to modify.
To replace the contents of an element, just do
element.text = "TRUE"
The import statement above works in Python 2.5 or later. If you have an older version of Python you'll need to install ElementTree as an extension, and then the import statement is different: import elementtree.ElementTree as ET.
Unfortunately, the XPath support in ElementTree isn't complete. Python 2.6 ships an older version in which finding elements by attribute does not work, so Python's own documentation should be your first stop: xml.etree.ElementTree
import xml.etree.ElementTree as ET
original = ET.parse("original.xml")
parameters = original.findall(".//Parameter")
changes = {}
# read changes
with open("changes.txt", "r") as in_file:
    for change in in_file:
        change = change.rstrip()  # remove line endings
        if not change:
            continue  # skip blank lines
        name, value = change.split(":", 1)  # split on the first colon only
        changes[name.strip()] = value.strip()  # remove surrounding whitespace
# find parameter elements and apply changes
for parameter in parameters:
    parameter_name = parameter.get("name")
    if parameter_name in changes:
        value = parameter.find("./Value")
        value.text = changes[parameter_name]
        result = parameter.find("./Result")
        result.text = changes[parameter_name]
original.write("new.xml")
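On current Python versions, ElementTree does support attribute predicates such as `[@name='...']`, so each parameter can be looked up directly. Below is a minimal, self-contained sketch of that variation; `XML_TEXT` and `CHANGES_TEXT` are shortened, hypothetical stand-ins for the two input files, and the predicate assumes that parameter names contain no single quotes.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical stand-ins for the XML file and the text file.
XML_TEXT = """<ParameterData>
  <ParameterList count="2">
    <Parameter name="Spec 2 Included" type="boolean" mode="both">
      <Value>n/a</Value>
      <Result>n/a</Result>
    </Parameter>
    <Parameter name="Spec 2 Label" type="string" mode="both">
      <Value>n/a</Value>
      <Result>n/a</Result>
    </Parameter>
  </ParameterList>
</ParameterData>"""

CHANGES_TEXT = """Spec 2 Included : TRUE
Spec 2 Label: 19-Flat2-HS3
"""

# Parse "name : value" lines into a dict, splitting on the first colon only.
changes = {}
for line in CHANGES_TEXT.splitlines():
    if ":" not in line:
        continue  # ignore blank or malformed lines
    name, value = line.split(":", 1)
    changes[name.strip()] = value.strip()

root = ET.fromstring(XML_TEXT)
for name, value in changes.items():
    # Attribute predicate; assumes the name contains no single quote.
    for parameter in root.findall(f".//Parameter[@name='{name}']"):
        for tag in ("Value", "Result"):
            child = parameter.find(tag)
            if child is not None:
                child.text = value

print(ET.tostring(root, encoding="unicode"))
```

For untrusted or unusual parameter names, looping over all Parameter elements and comparing `parameter.get("name")` against the dictionary, as in the answer above, avoids building the path string at all.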
Here is how you could do it using Amara:
from amara import bindery
doc = bindery.parse(XML)
def cleanup_for_dict(key, value):
    return key.strip(), value.strip()
params = dict(( cleanup_for_dict(*line.split(':', 1))
                for line in TEXT.splitlines()))
for param in doc.ParameterData.ParameterList.Parameter:
    if param.name in params:
        param.Value = params[param.name]
        param.Result = params[param.name]
doc.xml_write()
