Getting multiple children's values using minidom

As you can see from the XML here, there are multiple <item> nodes, each with a set of children such as <summary>, <status> and <key>.
The problem I've encountered is that with minidom it's possible to get the values of the firstChild and lastChild, but not necessarily any of the values in between.
I've written the code below, which doesn't work, but I think it is a close approximation of what I need to be doing:
import xml.dom.minidom
xml = xml.dom.minidom.parse(result) # or xml.dom.minidom.parseString(xml_string)
itemList = xml.getElementsByTagName('item')
for item in itemList[1:]:
    summaryList = item.getElementsByTagName('summary')
    statusList = item.getElementsByTagName('status')
    keyList = item.getElementsByTagName('key')
    lineText = (summaryList[0].nodeValue + " " + statusList[0].nodeValue + " " + keyList[0].nodeValue)
    p = Paragraph(lineText, style)
    Story.append(p)

Define a get_text() function that joins all of the text child nodes (see this answer):
import xml.dom.minidom
def get_text(element):
    # element is the list returned by getElementsByTagName(); only its first match is used
    return " ".join(t.nodeValue for t in element[0].childNodes
                    if t.nodeType == t.TEXT_NODE)
dom = xml.dom.minidom.parseString(data)
itemList = dom.getElementsByTagName('item')
for item in itemList[1:]:
    summaryList = item.getElementsByTagName('summary')
    statusList = item.getElementsByTagName('status')
    keyList = item.getElementsByTagName('key')
    print get_text(summaryList)
    print get_text(statusList)
    print get_text(keyList)
    print "----"
prints:
Unapprove all pull request reviewers after major change
Needs Triage
STASH-4473
----
Allow using left/right arrow to move side by side diff left/right
Needs Triage
STASH-4478
----
Hope that helps.

How about something like this:
for item in itemList:
    lineText = ' '.join(child.nodeValue for child in item.childNodes)
    p = Paragraph(lineText, style)
    Story.append(p)
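One caveat when joining nodeValue over childNodes directly: in minidom, element nodes report a nodeValue of None, and the whitespace between tags appears as separate text nodes. The sketch below is one possible way around that, assuming each child of <item> holds plain text. The sample XML and the child_text() helper are hypothetical and not taken from the question's data.

```python
import xml.dom.minidom

# Hypothetical sample data; only the element names follow the question.
SAMPLE = """<rss><channel>
<item><summary>First summary</summary><status>Open</status><key>DEMO-1</key></item>
<item><summary>Second summary</summary><status>Closed</status><key>DEMO-2</key></item>
</channel></rss>"""

def child_text(item):
    """Join the text of every element child of item, skipping whitespace nodes."""
    parts = []
    for child in item.childNodes:
        if child.nodeType != child.ELEMENT_NODE:
            continue  # whitespace between tags shows up as text nodes
        text = "".join(t.nodeValue for t in child.childNodes
                       if t.nodeType == t.TEXT_NODE)
        parts.append(text.strip())
    return " ".join(parts)

dom = xml.dom.minidom.parseString(SAMPLE)
lines = [child_text(item) for item in dom.getElementsByTagName("item")]
for line in lines:
    print(line)
```

Each resulting line could then be passed to Paragraph(lineText, style) as in the question.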

Related

Revit, Using Python, Can't get the Family "Type Names" for a Family, Just the ID

I've tried for hours to solve this myself so I can learn. I'm able to get the Family I want out of Revit (called familyToUpdate) and list the family (symbol) types, but I can't get the type name itself, only the IDs. I want to compare the actual Type Name against a text parameter I called typeToDelete so that I can delete only the types I know are not being used. I've been through numerous examples but can never get them to work.
Here is my code to date:
import Autodesk.Revit.DB as DB
from Autodesk.Revit.DB import *
uidoc = __revit__.ActiveUIDocument
doc = __revit__.ActiveUIDocument.Document
app = doc.Application
familyToUpdate = "MyFamily"
typeToDelete = "MyFamilyType"
print "Family Name = " + familyToUpdate
print "Type To Delete = " + typeToDelete
#Delete Family Type
Elements = FilteredElementCollector(doc).OfClass(Family).ToElements()
for m in Elements:
    try:
        if m.Name.startswith((familyToUpdate)):
            symbols = list(m.GetFamilySymbolIds())
            for i in symbols:
                print "Family Type Id = " + str(i)
                famsymbol = doc.GetElement(i)
                print "famsymbol = " + str(famsymbol)
                #symbolName = famsymbol.Family.Name
                #print symbolName
                #if symbolName == typeToDelete:
                #    print "I found the type name"
    except:
        pass

Answered it myself. Worked on it for hours, then FINALLY posted a question. Took one more look at it, and there it was!
Here's the code for anyone else in the future fumbling through what I did:
import Autodesk.Revit.DB as DB
from Autodesk.Revit.DB import *
uidoc = __revit__.ActiveUIDocument
doc = __revit__.ActiveUIDocument.Document
app = doc.Application
familyToUpdate = "VA Titleblock Consultant Logo (PIN07)"
typeToDelete = "VA Titleblock Consultant Logo (PIN07) (Hagerman)"
print "Family Name = " + familyToUpdate
print "Type To Delete = " + typeToDelete + "\n\n"
#Delete Family Type
Elements = FilteredElementCollector(doc).OfClass(Family).ToElements()
for m in Elements:
    try:
        if m.Name.startswith((familyToUpdate)):
            symbols = list(m.GetFamilySymbolIds())
            for i in symbols:
                #print "Family Type Id = " + str(i)
                famsymbol = doc.GetElement(i)
                #print "Symbol ID = " + str(famsymbol)
                symbolName = famsymbol.get_Parameter(BuiltInParameter.SYMBOL_NAME_PARAM).AsString()
                print "SymbolName = " + symbolName
    except:
        pass

Thank you for the solution. In it, you are retrieving the symbol name from the SYMBOL_NAME_PARAM built-in parameter. That is perfectly valid. An easier and more direct way to read the symbol name is to simply query the Element.Name property. Element is the parent class of all Revit database resident objects, including FamilySymbol.

Element.Name will not work because of multi-level inheritance. Try the following instead (famsymbol is the FamilySymbol from the code above):
name = Element.Name.GetValue(famsymbol)

How to get the subelement of child using Python's ElementTree

The XML looks more or less like this:
<root>
  <course>
    <reg_num>10577</reg_num>
    <subj>ANTH</subj>
    <crse>211</crse>
    <sect>F01</sect>
    <title>Introduction to Anthropology</title>
    <units>1.0</units>
    <instructor>Brightman</instructor>
    <days>M-W</days>
    <time>
      <start_time>03:10PM</start_time>
      <end_time>04:30</end_time>
    </time>
    <place>
      <building>ELIOT</building>
      <room>414</room>
    </place>
  </course>
</root>
Here is my code to get the title and so on. I would also like to get the contents of the time and place tags, each of which has child elements. How can I do that? I have tried different methods, but none of them seems to work. Thank you! Any help is appreciated.
for c in courses:
    title = c.find('title').text
    num = c.find('crse').text
    days = c.find('days').text
    # time = c.find('time').text
    # for t in c:
    #     timeSlot1 = t.find('start_time')
    #     timeSlot2 = t.find('end_time')
    # format text using {}
    print(' *{} {} [{}] {} {} {}'.format(b, title, days, num, timeSlot1, timeSlot2))
    # how to get date

You're almost there: just select the correct child by specifying a path relative to <course>:
for c in courses:
    title = c.find('title').text
    # [...]
    timeSlot1 = c.find('time/start_time').text
    timeSlot2 = c.find('time/end_time').text
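As a self-contained illustration of the relative-path lookup, here is a minimal sketch run against a trimmed copy of the question's XML. It uses findtext(), which returns the text directly and accepts a default for missing elements; the rows list and its dictionary keys are illustrative names, not part of the original code.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the XML from the question.
XML = """<root>
  <course>
    <title>Introduction to Anthropology</title>
    <crse>211</crse>
    <days>M-W</days>
    <time>
      <start_time>03:10PM</start_time>
      <end_time>04:30</end_time>
    </time>
    <place>
      <building>ELIOT</building>
      <room>414</room>
    </place>
  </course>
</root>"""

root = ET.fromstring(XML)
rows = []
for c in root.findall("course"):
    rows.append({
        "title": c.findtext("title"),
        # Paths are relative to the <course> element.
        "start": c.findtext("time/start_time"),
        "end": c.findtext("time/end_time"),
        "building": c.findtext("place/building"),
        # default= avoids None when an element is missing.
        "room": c.findtext("place/room", default="?"),
    })
print(rows)
```

The same path syntax works with find() and findall(), so the place children can be read the same way as the time children.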

How to parse a single-column text file into a table using python?

I'm new to StackOverflow, but I have found a lot of answers on this site. I'm also a programming newbie, so I figured I'd join and finally become part of this community, starting with a question about a problem that's been plaguing me for hours.
I log in to a website and scrape a big body of text within the b tag that needs to be converted into a proper table. The layout of the resulting Output.txt looks like this:
BIN STATUS
8FHA9D8H 82HG9F RECEIVED SUCCESSFULLY AWAITING STOCKING PROCESS
INVENTORY CODE: FPBC *SOUP CANS LENTILS
BIN STATUS
HA8DHW2H HD0138 RECEIVED SUCCESSFULLY AWAITING STOCKING PROCESS
8SHDNADU 00A123 #2956- INVALID STOCK COUPON CODE (MISSING).
93827548 096DBR RECEIVED SUCCESSFULLY AWAITING STOCKING PROCESS
There are a bunch of pages with the exact same blocks, but I need them to be combined into an actual table that looks like this:
BIN             INV CODE                      STATUS
HA8DHW2HHD0138  FPBC-*SOUP CANS LENTILS       RECEIVED SUCCESSFULLY AWAITING STOCKING PROCESS
8SHDNADU00A123  FPBC-*SOUP CANS LENTILS       #2956- INVALID STOCK COUPON CODE (MISSING).
93827548096DBR  FPBC-*SOUP CANS LENTILS       RECEIVED SUCCESSFULLY AWAITING STOCKING PROCESS
8FHA9D8H82HG9F  SSXR-98-20LM NM CORN CREAM    RECEIVED SUCCESSFULLY AWAITING STOCKING PROCESS
Essentially, all separate text blocks in this example would become part of this table, with the inv code repeating alongside its bin values. I would post my attempts at parsing this data (I have tried Pandas/bs/openpyxl/csv writer), but I'll admit they are a little embarrassing, as I cannot find any information on this specific problem. Is there any benevolent soul out there who can help me out? :)
(Also, I am using Python 2.7.)

A simple custom parser like the following should do the trick.
from __future__ import print_function
def parse_body(s):
    line_sep = '\n'
    getting_bins = False
    inv_code = ''
    for l in s.split(line_sep):
        if l.startswith('INVENTORY CODE:') and not getting_bins:
            inv_data = l.split()
            inv_code = inv_data[2] + '-' + ' '.join(inv_data[3:])
        elif l.startswith('INVENTORY CODE:') and getting_bins:
            print("unexpected inventory code while reading bins:", l)
        elif l.startswith('BIN') and l.endswith('MESSAGE'):
            getting_bins = True
        elif getting_bins and l:
            bin_data = l.split()
            # need to add exception handling here to make sure:
            # 1) we have an inv_code
            # 2) bin_data is at least 3 items big (assuming two for
            #    bin_id and at least one for message)
            # 3) maybe some constraint checking to ensure that we have
            #    a valid instance of an inventory code and bin id
            bin_id = ''.join(bin_data[0:2])
            message = ' '.join(bin_data[2:])
            # we now have a bin, an inv_code, and a message to add to our table
            print(bin_id.ljust(20), inv_code.ljust(30), message, sep='\t')
        elif getting_bins and not l:
            # done getting bins for current inventory code
            getting_bins = False
            inv_code = ''
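If the rows are needed as data rather than printed, the same line-by-line state machine can collect tuples instead. The following is a rough sketch, not the answer's code: parse_blocks() is a hypothetical name, and it assumes each block starts with an INVENTORY CODE line, followed by a header line whose first word is BIN, followed by the bin lines. The sample in the question may be laid out differently, so the conditions would need adjusting to the real file.

```python
def parse_blocks(text):
    """Return (bin_id, inv_code, message) tuples from the report text.

    Assumed layout (hypothetical): an 'INVENTORY CODE:' line, then a header
    line whose first word is 'BIN', then one line per bin.
    """
    rows = []
    inv_code = ""
    in_bins = False
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("INVENTORY CODE:"):
            parts = line.split()
            inv_code = parts[2] + "-" + " ".join(parts[3:])
            in_bins = False
        elif line.split()[:1] == ["BIN"]:
            in_bins = True  # header line; bin rows follow
        elif in_bins and line:
            parts = line.split()
            if len(parts) < 3:
                continue  # not enough fields for bin id plus message
            rows.append(("".join(parts[:2]), inv_code, " ".join(parts[2:])))
        elif not line:
            in_bins = False  # a blank line ends the current block
    return rows

SAMPLE = """INVENTORY CODE: FPBC *SOUP CANS LENTILS
BIN STATUS
HA8DHW2H HD0138 RECEIVED SUCCESSFULLY AWAITING STOCKING PROCESS
8SHDNADU 00A123 #2956- INVALID STOCK COUPON CODE (MISSING).
"""
rows = parse_blocks(SAMPLE)
for row in rows:
    print("\t".join(row))
```

A list of tuples like this can be handed to csv.writer or to a DataFrame constructor to produce the final table.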

A rather complex one, but this might get you started:
import re, pandas as pd
from pandas import DataFrame
rx = re.compile(r'''
    (?:INVENTORY\ CODE:)\s*
    (?P<inv>.+\S)
    [\s\S]+?
    ^BIN.+[\n\r]
    (?P<bin_msg>(?:(?!^\ ).+[\n\r])+)
    ''', re.MULTILINE | re.VERBOSE)
string = your_string_here
# set up the dataframe
df = DataFrame(columns = ['BIN', 'INV', 'MESSAGE'])
for match in rx.finditer(string):
    inv = match.group('inv')
    bin_msg_raw = match.group('bin_msg').split("\n")
    rxbinmsg = re.compile(r'^(?P<bin>(?:(?!\ {2}).)+)\s+(?P<message>.+\S)\s*$', re.MULTILINE)
    for item in bin_msg_raw:
        for m in rxbinmsg.finditer(item):
            # append it to the dataframe
            df.loc[len(df.index)] = [m.group('bin'), inv, m.group('message')]
print(df)
Explanation
It looks for INVENTORY CODE and sets up the groups (inv and bin_msg) for further processing in the loop (note: it would be easier if you had only one line of bin/msg, as you need to split the group here afterwards).
Afterwards, it splits the bin and msg part and appends all to the df object.

I wrote some code for website scraping which may help you.
Basically, what you need to do is right-click on the web page, view the HTML, find the tag for the table you are looking for, and extract the information with a parsing module (I am using Beautiful Soup). I am creating a JSON document because I need to store it in MongoDB; you can create a table instead.
#! /usr/bin/python
import sys
import requests
import re
from BeautifulSoup import BeautifulSoup
import pymongo
def req_and_parsing():
    url2 = 'http://businfo.dimts.in/businfo/Bus_info/EtaByRoute.aspx?ID='
    list1 = ['534UP','534DOWN']
    # fetch and parse each route page
    outdict = [parsing_file(requests.get(url2+Route).text, Route) for Route in list1]
    print outdict
    conn = f_connection()
    for i in range(len(outdict)):
        insert_records(conn, outdict[i])
def parsing_file(txt,Route):
    soup = BeautifulSoup(txt)
    table = soup.findAll("table",{"id" : "ctl00_ContentPlaceHolder1_GridView2"})
    trtddict = {}
    # the error label is filled in when the route page has no data
    divtags = soup.findAll("span",{"id":"ctl00_ContentPlaceHolder1_ErrorLabel"})
    for divtag in divtags:
        print "div tag - " , divtag.text
        # compare against both messages; `x == a or b` would always be true
        if divtag.text in ("Currently no bus is running on this route", "This is not a cluster (orange bus) route"):
            print "Page not displayed. Errored with below message for Route-", Route, " , ", divtag.text
            sys.exit()
    trtags = table[0].findAll('tr')
    for trtag in trtags:
        tdtags = trtag.findAll('td')
        # keep only two-cell rows: first cell is the key, second the value
        if len(tdtags) == 2:
            trtddict[tdtags[0].text] = sub_colon(tdtags[1].text)
    return trtddict
def sub_colon(tag_str):
    # replace semicolons with commas
    return re.sub(';',',',tag_str)
def f_connection():
    try:
        conn = pymongo.MongoClient()
        print "Connected successfully!!!"
    except pymongo.errors.ConnectionFailure as e:
        print "Could not connect to MongoDB: %s" % e
        # stop here; otherwise conn would be undefined below
        sys.exit(1)
    return conn
def insert_records(conn,stop_dict):
    db = conn.test
    print db.collection_names()
    mycoll = db.stopsETA
    mycoll.insert(stop_dict)
if __name__ == "__main__":
    req_and_parsing()

Python List of Dictionaries Only See Last Element

Struggling to figure out why this doesn't work. It should. But when I create a list of dictionaries and then look through that list, I only ever see the final entry from the list:
alerts = []
alertDict = {}
af=open("C:\snort.txt")
for line in af:
    m = re.match(r'([0-9/]+)-([0-9:.]+)\s+.*?(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}):(\d{1,5})\s+->\s+(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}):(\d{1,5})', line)
    if m:
        attacktime = m.group(2)
        srcip = m.group(3)
        srcprt = m.group(4)
        dstip = m.group(5)
        dstprt = m.group(6)
        alertDict['Time'] = attacktime
        alertDict['Source IP'] = srcip
        alertDict['Destination IP'] = dstip
        alerts.append(alertDict)
for alert in alerts:
    if alert["Time"] == "13:13:42.443062":
        print "Found Time"

You create exactly one dict at the beginning of the script, and then append that one dict to the list multiple times.
Try creating multiple individual dicts, by moving the initialization to the inside of the loop.
alerts = []
af=open("C:\snort.txt")
for line in af:
    alertDict = {}
    #rest of loop goes here
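The effect is easy to reproduce without the log file. The sketch below uses made-up time strings (not values from the snort log) to contrast appending one shared dict with creating a fresh dict on every iteration.

```python
times = ["13:13:40", "13:13:41", "13:13:42"]  # made-up sample values

# Buggy pattern: one dict object is mutated and appended repeatedly.
shared = {}
aliased = []
for t in times:
    shared["Time"] = t
    aliased.append(shared)

# Fixed pattern: a new dict is created inside the loop.
fresh = []
for t in times:
    entry = {"Time": t}
    fresh.append(entry)

print([d["Time"] for d in aliased])  # every entry shows the last value
print([d["Time"] for d in fresh])    # each entry keeps its own value
print(aliased[0] is aliased[2])      # the list holds the same object three times
```

The list stores references, so every element of the first list points at the same dict and reflects its final state.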

Getting multiple child values from XML doc using Python

I'm reading XML METAR (Weather) Data using Python. I can read the data, and have also added error checking (only for visibility_statute_mi below!). Here is an example of the XML data:
<METAR>
  <raw_text>
    FALE 201800Z VRB01KT 9999 FEW016 BKN028 23/22 Q1010 NOSIG
  </raw_text>
  <station_id>FALE</station_id>
  <observation_time>2013-01-20T18:00:00Z</observation_time>
  <temp_c>23.0</temp_c>
  <dewpoint_c>22.0</dewpoint_c>
  <wind_dir_degrees>0</wind_dir_degrees>
  <wind_speed_kt>1</wind_speed_kt>
  <altim_in_hg>29.822834</altim_in_hg>
  <quality_control_flags>
    <no_signal>TRUE</no_signal>
  </quality_control_flags>
  <sky_condition sky_cover="FEW" cloud_base_ft_agl="1600"/>
  <sky_condition sky_cover="BKN" cloud_base_ft_agl="2800"/>
  <flight_category>MVFR</flight_category>
  <metar_type>METAR</metar_type>
</METAR>
Here is my Python 2.7 code to parse the data:
# Output the XML in an HTML-friendly manner
def outputHTML(xml):
    # Get the METAR data list
    metar_data = xml.getElementsByTagName("data")
    # Our return string
    outputString = ""
    # Cycle through the metar_data
    for state in metar_data:
        # Get the stations and cycle through them
        stations = state.getElementsByTagName("METAR")
        for station in stations:
            # Grab data from the station element
            raw_text = station.getElementsByTagName("raw_text")[0].firstChild.data
            station_id = station.getElementsByTagName("station_id")[0].firstChild.data
            observation_time = station.getElementsByTagName('observation_time')[0].firstChild.data
            temp_c = station.getElementsByTagName('temp_c')[0].firstChild.data
            dewpoint_c = station.getElementsByTagName('dewpoint_c')[0].firstChild.data
            wind_dir_degrees = station.getElementsByTagName('wind_dir_degrees')[0].firstChild.data
            wind_speed_kt = station.getElementsByTagName('wind_speed_kt')[0].firstChild.data
            visibility_statute_mi = station.getElementsByTagName('visibility_statute_mi')
            if len(visibility_statute_mi) > 0:
                visibility_statute_mi = visibility_statute_mi[0].firstChild.data
            altim_in_hg = station.getElementsByTagName('altim_in_hg')[0].firstChild.data
            metar_type = station.getElementsByTagName('metar_type')[0].firstChild.data
            # Append the data onto the string
            string = "<tr><td>" + str(station_id) + "</td><td>" + str(observation_time) + "</td><td>" + str(raw_text) + "</td><td>" + str(temp_c) + "</td><td>" + str(dewpoint_c) + "</td></tr>"
            outputString+=string
    # Output string
    return outputString
How do I read the sky_condition data and loop to get the sky_cover and cloud_base_ft_agl values?
I'll also need to check if there are any sky-condition values, because quite often there is no cloud cover and then no data.
Andre

I would parse the xml into a tree and query it, e.g. like this:
import xml.etree.ElementTree as et
xmltext = """
<METAR>
<raw_text>
FALE 201800Z VRB01KT 9999 FEW016 BKN028 23/22 Q1010 NOSIG
</raw_text>
<station_id>FALE</station_id>
<observation_time>2013-01-20T18:00:00Z</observation_time>
<temp_c>23.0</temp_c>
<dewpoint_c>22.0</dewpoint_c>
<wind_dir_degrees>0</wind_dir_degrees>
<wind_speed_kt>1</wind_speed_kt>
<altim_in_hg>29.822834</altim_in_hg>
<quality_control_flags>
<no_signal>TRUE</no_signal>
</quality_control_flags>
<sky_condition sky_cover="FEW" cloud_base_ft_agl="1600"/>
<sky_condition sky_cover="BKN" cloud_base_ft_agl="2800"/>
<flight_category>MVFR</flight_category>
<metar_type>METAR</metar_type>
</METAR>
"""
tree = et.fromstring(xmltext)
for sky_con in tree.iterfind('sky_condition'):
    print sky_con.attrib["cloud_base_ft_agl"]
    print sky_con.attrib.keys()
By reading the keys() you can check for the presence of the attribute you're interested in.
Edit: if you want to use xml.dom.minidom, you can add these lines to your stations loop to extract the same attributes (getAttribute() is the public accessor, so there is no need to reach into the private _attrs dict):
for sky_con in station.getElementsByTagName("sky_condition"):
    print sky_con.getAttribute("cloud_base_ft_agl")
    print sky_con.getAttribute("sky_cover")
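To cover the case the question raises, where a station reports no cloud data at all, here is a minimal minidom sketch. It assumes the <data>/<METAR> nesting used in the question's code. The DEMO and NONE stations and the sky_conditions() helper are hypothetical additions for illustration. hasAttribute() guards against a missing cloud_base_ft_agl, and a station without any sky_condition elements simply yields an empty list.

```python
import xml.dom.minidom

# FALE follows the question's data; DEMO and NONE are made-up stations.
SAMPLE = """<data>
  <METAR>
    <station_id>FALE</station_id>
    <sky_condition sky_cover="FEW" cloud_base_ft_agl="1600"/>
    <sky_condition sky_cover="BKN" cloud_base_ft_agl="2800"/>
  </METAR>
  <METAR>
    <station_id>DEMO</station_id>
    <sky_condition sky_cover="CLR"/>
  </METAR>
  <METAR>
    <station_id>NONE</station_id>
  </METAR>
</data>"""

def sky_conditions(station):
    """Return (sky_cover, cloud_base_ft_agl or None) pairs for one METAR element."""
    result = []
    for sky in station.getElementsByTagName("sky_condition"):
        cover = sky.getAttribute("sky_cover")
        if sky.hasAttribute("cloud_base_ft_agl"):
            base = sky.getAttribute("cloud_base_ft_agl")
        else:
            base = None  # attribute absent, e.g. clear sky
        result.append((cover, base))
    return result

dom = xml.dom.minidom.parseString(SAMPLE)
report = {}
for station in dom.getElementsByTagName("METAR"):
    station_id = station.getElementsByTagName("station_id")[0].firstChild.data
    report[station_id] = sky_conditions(station)
print(report)
```

Inside outputHTML(), the same helper could be called once per station and its result formatted into extra table cells.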
