Python parsing complex text

I'm struggling to develop an algorithm that can edit the snippet of an XML file below. Can anyone help with ideas? The requirements are to parse the file as input, remove any "cipher" that uses "RC4", and output a new XML file with only the "RC4" cipher removed. The problem is that there are multiple "Connector" sections within the XML file. I need to read all of them, but only edit the one that uses port 443 and a specific IP address. So the script needs to parse each Connector section one at a time and leave alone the ones that don't have the correct IP address and port. I have tried:
1. Using the ElementTree XML parser. The problem is that it doesn't write the new XML file well: the output is a mess. I need it prettified, and I'm on Python 2.6.
<Connector
protocol="org.apache.coyote.http11.Http11NioProtocol"
port="443"
redirectPort="443"
executor="tomcatThreadPool"
disableUploadTimeout="true"
SSLEnabled="true"
scheme="https"
secure="true"
clientAuth="false"
sslEnabledProtocols="TLSv1,TLSv1.1,TLSv1.2"
keystoreType="JKS"
keystoreFile="tomcat.keystore"
keystorePass="XXXXX"
server="XXXX"
ciphers="TLS_DHE_RSA_WITH_AES_128_CBC_SHA,
TLS_DH_RSA_WITH_AES_128_CBC_SHA,
TLS_DHE_DSS_WITH_AES_128_CBC_SHA,
TLS_DH_DSS_WITH_AES_128_CBC_SHA,
TLS_RSA_WITH_AES_128_CBC_SHA,
TLS_DHE_RSA_WITH_3DES_EDE_CBC_SHA,
TLS_DH_RSA_WITH_3DES_EDE_CBC_SHA,
TLS_RSA_WITH_3DES_EDE_CBC_SHA,
TLS_RSA_WITH_RC4_128_SHA"
address="192.168.10.6">
Here was my code:
from xml.etree import ElementTree
print "[+] Checking for removal of RC4 ciphers"
file = "template.xml"
with open(file, 'rt') as f:
    tree = ElementTree.parse(f)
    f.close()
for node in tree.getiterator('Connector'):
    if node.tag == 'Connector':
        address = node.attrib.get('address')
        port = node.attrib.get('port')
        if "EMSNodeMgmtIp" in address and port == "443":
            ciphers = node.attrib.get('ciphers')
            if "RC4" in ciphers:
                # If true, RC4 is enabled somewhere in the cipher suite
                print "[+] Found RC4 enabled ciphers"
                # Find RC4 specific cipher suite string, for replacement
                elements = ciphers.split()
                search_str = ""
                for element in elements:
                    if "RC4" in element:
                        search_str = element
                print "[+] Search removal RC4 string: %s" % search_str
                # Replace string by removing RC4 cipher
                print "[+] Removing RC4 cipher"
                replace_str = ciphers.replace(search_str,"")
                rstrip_str = replace_str.rstrip()
                if rstrip_str.endswith(','):
                    new_cipher_str = rstrip_str[:-1]
                    #print new_cipher_str
                    node.set('ciphers', new_cipher_str)
tree.write('new.xml')

I included comments to explain what is going on.
inb4downvote
from lxml import etree
import re
xml = '''<?xml version="1.0"?>
<data>
<Connector
protocol="org.apache.coyote.http11.Http11NioProtocol"
port="443"
redirectPort="443"
executor="tomcatThreadPool"
disableUploadTimeout="true"
SSLEnabled="true"
scheme="https"
secure="true"
clientAuth="false"
sslEnabledProtocols="TLSv1,TLSv1.1,TLSv1.2"
keystoreType="JKS"
keystoreFile="tomcat.keystore"
keystorePass="XXXXX"
server="XXXX"
ciphers="TLS_DHE_RSA_WITH_AES_128_CBC_SHA,
TLS_DH_RSA_WITH_AES_128_CBC_SHA,
TLS_DHE_DSS_WITH_AES_128_CBC_SHA,
TLS_DH_DSS_WITH_AES_128_CBC_SHA,
TLS_RSA_WITH_AES_128_CBC_SHA,
TLS_DHE_RSA_WITH_3DES_EDE_CBC_SHA,
TLS_DH_RSA_WITH_3DES_EDE_CBC_SHA,
TLS_RSA_WITH_3DES_EDE_CBC_SHA,
TLS_RSA_WITH_RC4_128_SHA"
address="192.168.10.6"></Connector></data>'''
tree = etree.fromstring(xml)
root = tree.getroottree().getroot()
for connector in root.findall('Connector'):
    port = connector.get('port')
    ip = connector.get('address')
    #change this to the port/ip of the connector you want to edit
    if port != '443' or ip != '192.168.10.6':
        #removes every non-matching connector from the output;
        #delete the next line to keep those connectors unchanged instead
        connector.getparent().remove(connector)
        continue
    #here we use list comprehension to remove any cipher with "RC4"
    ciphers = ','.join([x for x in re.split(r',\s*', connector.get('ciphers')) if 'RC4' not in x])
    #set the modified cipher back
    connector.set('ciphers', ciphers)
print etree.tostring(root, pretty_print=True)
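For comparison, here is a minimal sketch of the same filtering idea using only the standard library's xml.etree.ElementTree, written for Python 3. It assumes the other connectors should stay in the output untouched, which is how I read the question. The function name remove_rc4 and its parameters are made up for illustration. It does not solve the pretty-printing complaint, because ElementTree writes all attributes on one line.

```python
import xml.etree.ElementTree as ET


def remove_rc4(xml_text, port, address):
    """Return xml_text with RC4 suites removed from the matching Connector only.

    Hypothetical helper: 'port' and 'address' are compared as plain strings
    against the attributes of each <Connector> element.
    """
    root = ET.fromstring(xml_text)
    for connector in root.iter('Connector'):
        # Leave every connector that is not the target untouched.
        if connector.get('port') != port or connector.get('address') != address:
            continue
        # Line breaks inside an attribute value arrive as whitespace,
        # so strip each suite name after splitting on commas.
        suites = [s.strip() for s in connector.get('ciphers', '').split(',')]
        kept = [s for s in suites if s and 'RC4' not in s]
        connector.set('ciphers', ','.join(kept))
    return ET.tostring(root, encoding='unicode')
```

Something like remove_rc4(open('template.xml').read(), '443', '192.168.10.6') would then give the edited document as a string.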

If the XML tools don't preserve the original structure and formatting, dump them. This is a straightforward text-processing problem, and you can write a Python program to handle it.
Spin through the lines of the file; simply echo to the output anything other than a "cipher" statement. When you hit one of those:
Stuff the string into a variable.
Split the string into a list.
Drop any list element containing "RC4".
Print the resulting "cipher" statement in your desired format.
Return to normal "read-and-echo" processing.
Does this algorithm get you going?
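The steps above could be sketched roughly as follows, in Python 3. This is only an illustration under some assumptions: the attribute is spelled ciphers="...", it may span several lines, and the port/address check from the question is left out to keep the example short. The name drop_rc4 is made up.

```python
import re

# Matches a complete ciphers="..." attribute, possibly spanning several lines.
CIPHERS_RE = re.compile(r'ciphers="([^"]*)"')


def drop_rc4(lines):
    """Echo lines unchanged, but rewrite each ciphers attribute without RC4 suites."""
    out = []
    buffered = None  # lines of a ciphers attribute collected so far, or None
    for line in lines:
        if buffered is None and 'ciphers="' in line:
            buffered = []  # start stuffing the statement into a variable
        if buffered is None:
            out.append(line)  # normal read-and-echo processing
            continue
        buffered.append(line)
        text = ''.join(buffered)
        match = CIPHERS_RE.search(text)
        if match is None:
            continue  # closing quote not seen yet, keep collecting
        # Split into a list and drop any element containing "RC4".
        suites = [s.strip() for s in match.group(1).split(',')]
        kept = [s for s in suites if s and 'RC4' not in s]
        rebuilt = 'ciphers="' + ',\n'.join(kept) + '"'
        out.append(text[:match.start()] + rebuilt + text[match.end():])
        buffered = None  # return to read-and-echo
    return out
```

Because everything outside the ciphers attribute is copied through verbatim, the rest of the file keeps its original formatting.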

Answer below. Basically had to read each of the Connector sections (there were 4) into a temporary list, to check if port and address are correct. If they are, then make a change to the Cipher by removing cipher string but only if RC4 cipher is enabled. So the code had to read in all of the 4 Connectors, one at a time, into a temporary list.
f = open('template.xml', 'r')
lines = f.readlines()
f.close()
new_file = open('new.xml', 'w')
tmp_list = []
connector = False
for line in lines:
    if '<Connector' in line:
        connector = True
        new_file.write(line)
    elif '</Connector>' in line:
        connector = False
        port = False
        address = False
        for a in tmp_list:
            if 'port="443"' in a:
                port = True
            elif 'address="%(EMSNodeMgmtIp)s"' in a:
                address = True
        if port and address:
            new_list = []
            count = 0
            for b in tmp_list:
                if "RC4" in b:
                    print "[+] Found RC4 cipher suite string at line index %d: %s" % (count,b)
                    print "[+] Removing RC4 cipher string from available cipher suites"
                    # check if RC4 cipher string ends with "
                    check = b[:-1]
                    if check.endswith('"'):
                        tmp_str = tmp_list[count-1]
                        tmp_str2 = tmp_str[:-2]
                        tmp_str2+='"\n'
                        new_list[count-1] = tmp_str2
                        replace_line = b.replace(b,"")
                        new_list.append(replace_line)
                    else:
                        replace_line = b.replace(b,"")
                        new_list.append(replace_line)
                else:
                    new_list.append(b)
                count+=1
            for c in new_list:
                new_file.write(c)
            new_file.write(' </Connector>\n')
        else:
            # Not port and address
            for d in tmp_list:
                new_file.write(d)
            new_file.write(' </Connector>\n')
        tmp_list = []
    elif connector:
        tmp_list.append(line)
    else:
        new_file.write(line)
new_file.close()

Related

How to filter characters for comparison of strings in Python

I am very new to coding and not familiar with Python. Could you give me a small example of how you would solve this problem?
Basically, the new device I will be working on has a 2D code (a sort of barcode), and when I scan the code with a 2D scanner, a string like this shows up in my notepad, for example: 58183#99AF0M000F9EF3F800
The last 12 characters are the MAC address and the first 5 characters are the order number.
I need to compare that (58183#99AF0M000F9EF3F800) with the MAC address value I get from the XML page.
here is the terminal output for more reference:
####################################################################################################
Current device information:
Order-number: 58184 Software-version: 1.0.0 ( Build : 1 ) Hardware version: 1.00 MAC address: 00:0F:9E:F4:1A:80
desired-Order-number: 58183 desired-Softwareversion: 1.0.0 ( Build : 1 ) desired-hardwareversion: 1.00 pc-praefix: 7A2F7
PASS
PS C:\Users\Aswin\Python Project>
The MAC address from the XML page looks like this: "00:0F:9E:F4:1A:80", and the 2D scan code looks like this: "58183#99AF0M000F9EF3F800". How can I take the last 12 characters of the scan code and compare them with the MAC address from the XML page to see if they match?
Any example code blocks would be much appreciated.
import sys
import urllib.error
import urllib.request
from xml.dom import minidom
try:
    preflash = urllib.request.urlopen("http://10.10.10.2", timeout=3).getcode()
    print("Web page status code:", preflash, "FAIL")
    sys.exit(0)
except urllib.error.URLError:
    correct = urllib.request.urlopen("http://192.168.100.5", timeout=10).getcode()
    print("Web page status code:", correct)
    print("IP address: 192.168.100.5 is reachable")
print(100*"#")
# Declare url String
url_str = 'http://192.168.100.2/globals.xml'
# open webpage and read values
xml_str = urllib.request.urlopen(url_str).read()
# Parses XML doc to String for Terminal output
xmldoc = minidom.parseString(xml_str)
# prints the order_number from the xmldoc
order_number = xmldoc.getElementsByTagName('order_number')
ord_nmr = order_number[0].firstChild.nodeValue
# prints the firmware_version from the xmldoc
firmware_version = xmldoc.getElementsByTagName('firmware_version')
frm_ver = firmware_version[0].firstChild.nodeValue
# prints the hardware_version from the xmldoc
hardware_version = xmldoc.getElementsByTagName('hardware_version')
hrd_ver = hardware_version[0].firstChild.nodeValue
v = hrd_ver.split()[-1]
# prints the mac_address from the xmldoc
mac_address = xmldoc.getElementsByTagName('mac_address')
mac_addr = mac_address[0].firstChild.nodeValue
print("Current device information: ")
print("Order-number: ",ord_nmr, "Software-version: ",frm_ver, "Hardware version: ",v, "MAC address: ",mac_addr)
d_ordernum = "58183"
d_hw_version = "1.00"
d_sf_version = "1.0.0 ( Build : 1 )"
pc_praefix = "7A2F7"
print("desired-Order-number: 58183 desired-Softwareversion: 1.0.0 ( Build : 1 ) desired-hardwareversion: 1.00 pc-praefix: 7A2F7")
if d_sf_version == frm_ver:
    print("PASS")
else:
    print("FAIL")

You could take the string from the scan code and slice it:
scan_code_cropped = scancode_string[-12:]
A negative start index counts from the end, so this gets you the last 12 characters of the scan code.
Now, to get the MAC address into a format that can be compared to the scan code, split it on ":":
list_of_chars = mac_address_string.split(":")
This gets you a list of the two-character groups, which can be concatenated using
mac_address_string_joined = ''.join(list_of_chars)
and finally, to compare the two strings:
if scan_code_cropped == mac_address_string_joined:
    print("Mac address & Scan code matched !")
If needed in a function format, here you go:
def match_scan_with_mac(scancode_string, mac_address_string):
    # get the last 12 characters of the scan code
    scan_code_cropped = scancode_string[-12:]
    # get the mac address without the ":"
    list_of_chars = mac_address_string.split(":")
    mac_address_string_joined = ''.join(list_of_chars)
    # compare the MAC address and the scan string
    if scan_code_cropped == mac_address_string_joined:
        print("Mac address & Scan code matched !")
        return True
    return False
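If the scanner or the XML page might differ in letter case or use "-" instead of ":" (I am only assuming that could happen; the question does not say so), a slightly more forgiving variant could look like the sketch below. The function names are made up for illustration, and the 5-character order number and 12-character MAC positions are taken from the question.

```python
def order_number_from_scan(scan_code):
    """First 5 characters of the scan code, per the layout described in the question."""
    return scan_code[:5]


def mac_from_scan(scan_code):
    """Last 12 characters of the scan code, upper-cased for comparison."""
    return scan_code[-12:].upper()


def normalize_mac(mac):
    """Strip ':' or '-' separators and upper-case the MAC address."""
    return mac.replace(':', '').replace('-', '').upper()


def scan_matches_mac(scan_code, mac):
    """True when the MAC embedded in the scan code equals the given MAC."""
    return mac_from_scan(scan_code) == normalize_mac(mac)
```

With the values from the question, the scan code carries 000F9EF3F800 while the XML page reports 00:0F:9E:F4:1A:80, so scan_matches_mac would return False for that pair.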

Python:XML List index out of range

I'm having trouble getting some values from an XML file. The error is IndexError: list index out of range
XML
<?xml version="1.0" encoding="UTF-8"?>
<nfeProc xmlns="http://www.portalfiscal.inf.br/nfe" versao="3.10">
<NFe xmlns="http://www.portalfiscal.inf.br/nfe">
<infNFe Id="NFe35151150306471000109550010004791831003689145" versao="3.10">
<ide>
<nNF>479183</nNF>
</ide>
<emit>
<CNPJ>3213213212323</CNPJ>
</emit>
<det nItem="1">
<prod>
<cProd>7030-314</cProd>
</prod>
<imposto>
<ICMS>
<ICMS10>
<orig>1</orig>
<CST>10</CST>
<vICMS>10.35</vICMS>
<vICMSST>88.79</vICMSST>
</ICMS10>
</ICMS>
</imposto>
</det>
<det nItem="2">
<prod>
<cProd>7050-6</cProd>
</prod>
<imposto>
<ICMS>
<ICMS00>
<orig>1</orig>
<CST>00</CST>
<vICMS>7.49</vICMS>
</ICMS00>
</ICMS>
</imposto>
</det>
</infNFe>
</NFe>
</nfeProc>
Reading the values works for XML files in which every item has both the vICMS and vICMSST tags:
vicms = doc.getElementsByTagName('vICMS')[i].firstChild.nodeValue
vicmsst = doc.getElementsByTagName('vICMSST')[1].firstChild.nodeValue
For the first imposto this returns:
print vicms
>> 10.35
print vicmsst
>> 88.79
The second imposto crashes because it has no vICMSST tag:
**IndexError: list index out of range**
What is the best way to test for this? I'm using xml.etree.ElementTree:
My code:
import os
import sys
import subprocess
import base64,xml.dom.minidom
from xml.dom.minidom import Node
import glob
import xml.etree.ElementTree as ET
origem = 0
# only loops over XML documents in folder
for file in glob.glob("*.xml"):
    f = open("%s" % file,'r')
    data = f.read()
    i = 0
    doc = xml.dom.minidom.parseString(data)
    for topic in doc.getElementsByTagName('emit'):
        #Get Fiscal Number
        nnf= doc.getElementsByTagName('nNF')[i].firstChild.nodeValue
        print 'Fiscal Number %s' % nnf
        print '\n'
        for prod in doc.getElementsByTagName('det'):
            vicms = 0
            vicmsst = 0
            #Get value of ICMS
            vicms = doc.getElementsByTagName('vICMS')[i].firstChild.nodeValue
            #Get value of VICMSST
            vicmsst = doc.getElementsByTagName('vICMSST')[i].firstChild.nodeValue
            #PRINT INFO
            print 'ICMS %s' % vicms
            print 'Valor do ICMSST: %s' % vicmsst
            print '\n\n'
            i +=1
        print '\n\n'

There is only one vICMSST tag in your XML document. So, when i=1, the following line returns an IndexError.
vicmsst = doc.getElementsByTagName('vICMSST')[1].firstChild.nodeValue
You can restructure this to:
try:
    vicmsst = doc.getElementsByTagName('vICMSST')[i].firstChild.nodeValue
except IndexError:
    # set a default value or deal with this how you like
    vicmsst = None
It's hard to say what you should do upon an exception without knowing more about what you're trying to do.
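One possible way to avoid the shared counter altogether, sketched here in Python 3 with minidom, is to search inside each det element instead of the whole document, and to check whether anything was found before indexing. The helper name first_text and the trimmed-down SAMPLE document are only for illustration.

```python
from xml.dom import minidom

# Reduced sample: the second item has no vICMSST tag, as in the question.
SAMPLE = """<infNFe>
  <det nItem="1"><vICMS>10.35</vICMS><vICMSST>88.79</vICMSST></det>
  <det nItem="2"><vICMS>7.49</vICMS></det>
</infNFe>"""


def first_text(parent, tag, default=None):
    """Text of the first <tag> under parent, or default when the tag is absent."""
    nodes = parent.getElementsByTagName(tag)
    if nodes and nodes[0].firstChild is not None:
        return nodes[0].firstChild.nodeValue
    return default


doc = minidom.parseString(SAMPLE)
# Each lookup is scoped to one <det>, so a missing tag cannot shift the indexes.
rows = [(first_text(det, 'vICMS'), first_text(det, 'vICMSST'))
        for det in doc.getElementsByTagName('det')]
print(rows)  # [('10.35', '88.79'), ('7.49', None)]
```

The None default is just one choice; "0" or an empty string would work equally well depending on what the values are used for.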

You are making several general mistakes in your code.
Don't use counters to index into lists you don't know the length of. Normally, iteration with for .. in is a lot better than using indexes anyway.
You have many imports you don't seem to use, get rid of them.
You can use minidom, but ElementTree is better for your task because it supports searching for nodes with XPath and it supports XML namespaces.
Don't read an XML file as a string and then use parseString. Let the XML parser handle the file directly. This way all file encoding related issues will be handled without errors.
The following is a lot better than your original approach.
import glob
import xml.etree.ElementTree as ET
def get_text(context_elem, xpath, xmlns=None):
    """ helper function that gets the text value of a node """
    node = context_elem.find(xpath, xmlns)
    if node is not None:
        return node.text
    else:
        return ""
# set up XML namespace URIs
xmlns = {
    "nfe": "http://www.portalfiscal.inf.br/nfe"
}
for path in glob.glob("*.xml"):
    doc = ET.parse(path)
    for infNFe in doc.iterfind('.//nfe:infNFe', xmlns):
        print 'Fiscal Number\t%s' % get_text(infNFe, ".//nfe:nNF", xmlns)
        for det in infNFe.iterfind(".//nfe:det", xmlns):
            print ' ICMS\t%s' % get_text(det, ".//nfe:vICMS", xmlns)
            print ' Valor do ICMSST:\t%s' % get_text(det, ".//nfe:vICMSST", xmlns)
        print '\n\n'

Index Error: list index out of range in python

I have a project in an internet security class. My partner started the project and wrote some Python code, and I have to continue from where he stopped. But I don't know Python, and I was planning to learn by running his code and checking how it works. However, when I execute his code I get the error "IndexError: list index out of range".
import os
# Deauthenticate devices
os.system("python2 ~/Downloads/de_auth.py -s 00:22:b0:07:58:d4 -d & sleep 30; kill $!")
# renew DHCP on linux "sudo dhclient -v -r & sudo dhclient -v"
# Capture DHCP Packet
os.system("tcpdump -lenx -s 1500 port bootps or port bootpc -v > dhcp.txt & sleep 20; kill $!")
# read packet txt file
DHCP_Packet = open("dhcp.txt", "r")
# Get info from txt file of saved packet
line1 = DHCP_Packet.readline()
line1 = line1.split()
sourceMAC = line1[1]
destMAC = line1[3]
TTL = line1[12]
length = line1[8]
#Parse packet
line = DHCP_Packet.readline()
while "0x0100" not in line:
    line = DHCP_Packet.readline()
packet = line + DHCP_Packet.read()
packet = packet.replace("0x0100:", "")
packet = packet.replace("0x0110:", "")
packet = packet.replace("0x0120:", "")
packet = packet.replace("0x0130:", "")
packet = packet.replace("0x0140:", "")
packet = packet.replace("0x0150:", "")
packet = packet.replace("\n", "")
packet = packet.replace(" ", "")
packet = packet.replace(" ", "")
packet = packet.replace("000000000000000063825363", "")
# Locate option (55) = 0x0037
option = "0"
i=0
length = 0
while option != "37":
    option = packet[i:i+2]
    hex_length = packet[i+2:i+4]
    length = int(packet[i+2:i+4], 16)
    i = i+ length*2 + 4
i = i - int(hex_length, 16)*2
print "Option (55): " + packet[i:i+length*2 ] + "\nLength: " + str(length) + " Bytes"
print "Source MAC: " + sourceMAC
Thank you a lot

An IndexError means the code asked a list for a position that does not exist. The loop at the bottom is an unlikely source, because it only slices the packet string, and slices never raise IndexError (running past the end would instead make int() fail with a ValueError on an empty string):
while option != "37":
    option = packet[i:i+2]
    hex_length = packet[i+2:i+4]
    length = int(packet[i+2:i+4], 16)
    i = i+ length*2 + 4
More likely, it happens earlier, when reading your text file:
# Get info from txt file of saved packet
line1 = DHCP_Packet.readline()
line1 = line1.split()
sourceMAC = line1[1]
destMAC = line1[3]
TTL = line1[12]
length = line1[8]
Try actually opening the text file and make sure all the lines are referred to correctly.
If you're new to coding and not used to understanding error messages or using a debugger yet, one way to find the problem area is including print ('okay') between lines in the code, moving it down progressively until the line no longer prints.
I'm pretty new to python as well, but I find it easier to learn by writing your own code and googling what you want to achieve (especially when a partner leaves you code like that...). This website provides documentation on in-built commands (choose your version at the top): https://docs.python.org/3.4/contents.html,
and this website contains more in-depth tutorials for common functions: http://www.tutorialspoint.com/python/index.htm

I think the variable line1, once split, has fewer than 13 fields, so you get the error when executing the statement TTL = line1[12].
Maybe you do not have the same environment your partner worked with, so the result of the os.system("") calls (the file dhcp.txt) may be empty or badly formatted.
You should check the content of the file dhcp.txt, or add the statement print line1 after line1 = DHCP_Packet.readline() to check that it has the correct format.
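A small sketch of that check, in Python 3: count the fields before indexing, so an empty or short capture produces a readable message instead of a bare IndexError. The function name parse_first_line is made up, and the field positions (1, 3, 8, 12) are simply copied from the question's code, so whether they match your tcpdump output is an assumption you would need to verify.

```python
def parse_first_line(line, needed=13):
    """Split the first tcpdump line and pick out the fields the script uses.

    Raises ValueError with the offending line when there are too few fields.
    """
    fields = line.split()
    if len(fields) < needed:
        raise ValueError(
            "expected at least %d fields, got %d: %r" % (needed, len(fields), line)
        )
    return {
        "source_mac": fields[1],   # line1[1] in the original script
        "dest_mac": fields[3],     # line1[3]
        "length": fields[8],       # line1[8]
        "ttl": fields[12],         # line1[12]
    }
```

Calling parse_first_line(DHCP_Packet.readline()) in place of the four index lookups would then show exactly what was read when the capture file is empty.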

Is there a good regular expression for multiline matching of received SIP invites?

I really need python regexp which would give me this information:
Data:
Received from 1.1.1.1 18:41:51:330
(123 bytes):
INVITE: sip:dsafsdf#fsdafas.com To:
sdfasdfasdfas From: "test"
Via:
sdafsdfasdfasd
Sent from 1.1.1.1 18:42:51:330
(123 bytes):
INVITE: sip:dsafsdf#fsdafas.com
From: "test"
To:
sdfasdfasdfas Via:
sdafsdfasdfasd
Received from 1.1.1.1 18:50:51:330
(123 bytes):
INVITE: sip:dsafsdf#fsdafas.com
Via: sdafsdfasdfasd
From: "test"
To:
sdfasdfasdfas
What I need to achieve is to find the newest INVITE that was "Received", in order to get its From: header value. So, searching the data backwards.
Is it possible with a single regexp? :)
Thanks.

One-line answer, assuming you suck the entire header into a string with embedded newlines (or cr/nl's):
sorted(re.findall("Received [^\r\n]+ (\d{2}:\d{2}:\d{2}:\d{3})[^\"]+From: \"([^\r\n]+)\"", data))[-1][1]
The trick to doing it with one RE is using [^\r\n] instead of . when you want to scan over stuff. This works assuming from string always has the double quotes. The double quotes are used to keep the scanner from swallowing the entire string at the first Received... ;)
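To make the idea concrete, here is a sketch in Python 3 on a tidied-up version of the sample data (DATA below is my own reconstruction with one header per line, not the exact text from the question). I also narrowed the captured group to exclude the quote character and used max() rather than sorting; both are small changes of mine. Note that comparing timestamps as strings only works while all messages fall on the same day.

```python
import re

# "Received" line, its timestamp, then everything up to the first quoted From value.
PATTERN = re.compile(
    r'Received [^\r\n]+ (\d{2}:\d{2}:\d{2}:\d{3})[^"]+From: "([^\r\n"]+)"'
)

# Reconstructed sample data (an assumption, see the note above).
DATA = (
    'Received from 1.1.1.1 18:41:51:330\n(123 bytes):\n'
    'INVITE: sip:a@example.com\nTo: x\nFrom: "alice"\nVia: y\n\n'
    'Sent from 1.1.1.1 18:42:51:330\n(123 bytes):\n'
    'INVITE: sip:a@example.com\nFrom: "bob"\nTo: x\nVia: y\n\n'
    'Received from 1.1.1.1 18:50:51:330\n(123 bytes):\n'
    'INVITE: sip:a@example.com\nVia: y\nFrom: "carol"\nTo: x\n'
)


def latest_received_from(data):
    """From value of the Received message with the greatest timestamp, or None."""
    matches = PATTERN.findall(data)  # list of (timestamp, from_value) tuples
    if not matches:
        return None
    return max(matches)[1]


print(latest_received_from(DATA))  # carol
```

The "Sent" block is skipped because the pattern only starts at the literal word "Received".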

I do not think a single regular expression is the answer. I think a stateful line-by-line matcher is what you're looking for here.
import re
import collections
_msg_start_re = re.compile(r'^(Received|Sent)\s+from\s+(\S.*):\s*$')
_msg_field_re = re.compile(r'^([A-Za-z](?:(?:\w|-)+)):\s+(\S(?:.*\S)?)\s*$')
def message_parser():
    hdr = None
    fields = collections.defaultdict(list)
    msg = None
    while True:
        if msg is not None:
            # A complete message is ready: hand it out and reset the state
            line = (yield msg)
            msg = None
            hdr = None
            fields = collections.defaultdict(list)
        else:
            line = (yield None)
        if hdr is None:
            hdr_match = _msg_start_re.match(line)
            hdr = None if hdr_match is None else hdr_match.groups()
        elif len(fields) <= 0:
            field_match = _msg_field_re.match(line)
            if field_match is not None:
                fields[field_match.group(1)].append(field_match.group(2))
        else: # Waiting for the end of the message
            if line.strip() == '':
                msg = (hdr, dict(fields))
            else:
                field_match = _msg_field_re.match(line)
                # Skip lines that are not "Name: value" fields
                if field_match is not None:
                    fields[field_match.group(1)].append(field_match.group(2))
Example of use:
parser = message_parser()
parser.next()
recvd_invites = [msg for msg in (parser.send(line) for line in linelst)
                 if (msg is not None) and
                    (msg[0][0] == 'Received') and
                    ('INVITE' in msg[1])]
You might be able to do this with a multiple line regex, but if you do it this way you get the message nicely parsed into its various fields. Presumably you want to do something interesting with the messages, and this will let you do a whole bunch more with them without having to use more regexps.
This also allows you to parse something other than an already existing file or a giant string with all the messages in it. For example, if you want to parse the output of a pipe that's printing out these requests as they happen you can simply do msg = parser.send(line) every time you receive a line and get a new message out as soon as its all been printed (if the line isn't the end of a message then msg will be None).

Socket Connection: Python

So I'm trying to send several iterations of a barcode file to a device. This is the relevant part of the code:
# Start is the first barcode
start = 1234567
# Number is the quantity
number = 3
with open('barcode2.xml', 'rt') as f:
    tree = ElementTree.parse(f)
# Iterate over all elements in a tree for the root element
for node in tree.getiterator():
    # Looks for the node tag called 'variable', which is the name assigned
    # to the accession number value
    if node.tag == "variable":
        # Iterates over a list whose range is specified by the command
        # line argument 'number'
        for barcode in range(number):
            # The 'A-' prefix and the 'start' argument from the command
            # line are assigned to variable 'accession'
            accession = "A-" + str(start)
            # Start counter is incremented by 1
            start += 1
            # The node ('variable') text is the accession number.
            # The acccession variable is assigned to node text.
            node.text = accession
            # Writes out to an XML file
            tree.write("barcode2.xml")
            header = "<?xml version=\"1.0\" standalone=\"no\"?>\n<!DOCTYPE labels SYSTEM \"label.dtd\">\n"
            with open("barcode2.xml", "r+") as f:
                old = f.read()
                f.seek(0)
                f.write(header + old)
            # Create socket
            sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            # Connect to server
            host = "xxx.xx.xx.x"
            port = 9100
            sock.connect((host, port))
            # Open XML file and read its contents
            target_file = open("barcode2.xml")
            barc_file_text = target_file.read()
            # Send to printer
            sock.sendall(barc_file_text)
            # Close connection
            sock.close()
This is very much a version one.
The device is failing to receive the files after the first one. Could this be because the port is being reused again too quickly? What's a better way to architect this? Thanks so much for your help.

target_file = open("barcode2.xml")
barc_file_text = target_file.read()
sock.sendall(barc_file_text)
sock.close()
The socket gets closed, but the file doesn't. The next time through the loop, there is already a lock on the file when you get to the with open... part.
Solution: Use with open... here as well. Also, you don't need to baby-step everything; don't give something a name (by assigning it to a variable) if it isn't important.
with open("barcode2.xml", "r") as to_send:
    sock.sendall(to_send.read())
sock.close()
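As a rough Python 3 sketch of the same idea, both the file and the socket can be managed with with blocks, so each is closed even if sending fails. The function name send_file is made up, and opening a fresh connection per file is an assumption about what the device expects, not something the question confirms.

```python
import socket


def send_file(path, host, port, timeout=10.0):
    """Send the raw bytes of one file over a new TCP connection.

    Returns the number of bytes sent. Both the file and the socket are
    closed when their 'with' blocks end.
    """
    with open(path, "rb") as source:
        payload = source.read()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)
    return len(payload)
```

Inside the loop this would be called once per generated barcode file, for example send_file("barcode2.xml", host, 9100).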
