updating metadata for feature classes programmatically using arcpy - python

I would like to be able to take an excel file that contains a record for each feature class, and some metadata fields, like summary, description, etc., and convert that to the feature class metadata. From the research I've done it seems like I need to convert each record in the excel table to xml, and then from there I may be able to import the xml file as metadata. Looks like I could use ElementTree, but I'm a little unsure of how to execute. Has anyone done this before and if so could you provide some guidance?

Man, this can be quite a process! I had to update some metadata information for a project at work the other day, so here goes nothing. It would be helpful to store all of the metadata information from the Excel table as a list of dictionaries or another data structure of your choosing (I work with CSVs and try to stay away from Excel spreadsheets for experience reasons).
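If the spreadsheet is saved as a CSV, the standard csv module can build that list. This is only a minimal sketch: the column names (featureClass, abstract, description, tags), the semicolon tag separator, and the load_meta_info helper are assumptions for illustration, not something the original workflow prescribes.

```python
import csv
import io


def load_meta_info(csv_file, tag_sep=";"):
    """Read one metadata record per feature class from an open CSV file."""
    records = []
    for row in csv.DictReader(csv_file):
        records.append({
            "featureClass": row["featureClass"].strip(),
            "abstract": row["abstract"].strip(),
            "description": row["description"].strip(),
            # Split the tags cell on the separator, dropping empty entries
            "tags": [t.strip() for t in row["tags"].split(tag_sep) if t.strip()],
        })
    return records


# Stand-in for open("metadata.csv", newline="") so the sketch runs on its own
sample = io.StringIO(
    "featureClass,abstract,description,tags\n"
    "fc1,text goes here,text goes here,tag1;tag2;tag3\n"
)
meta_info = load_meta_info(sample)
print(meta_info[0]["tags"])
```

Written out by hand, the resulting structure looks like this: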
metaInfo = [{"featureClass": "fc1",
             "abstract": "text goes here",
             "description": "text goes here",
             "tags": ["tag1", "tag2", "tag3"]},
            {"featureClass": "fc2",
             "abstract": "text goes here",
             "description": "text goes here",
             "tags": ["tag1", "tag2", "tag3"]},...]
From there, I would export the feature class's current metadata with the Export Metadata tool, which converts it into an XML file using the FGDC schema. Here is a code example:
import os
import arcpy

#Directory containing ArcGIS Install files
installDir = arcpy.GetInstallInfo("desktop")["InstallDir"]
#Path to XML schema for FGDC
translator = os.path.join(installDir, "Metadata/Translator/ARCGIS2FGDC.xml")
#Export your metadata
arcpy.ExportMetadata_conversion(featureClassPath, translator, tempXmlExportPath)
From there, you can use the xml module to access the ElementTree class. However, I would recommend using the lxml module (http://lxml.de/index.html#download) because it allows you to incorporate html code into your metadata through the CDATA factory if you needed special elements like line breaks in your metadata. From there, assuming that you have imported lxml, parse your local xml document:
import lxml.etree as ET
tree = ET.parse(tempXmlExportPath)
root = tree.getroot()
If you want to update the tags use the code below:
#Assumes idinfo is the first child of the root, as in the FGDC export
idinfo = root[0]
#Create keywords element
keywords = ET.SubElement(idinfo, "keywords")
#Create theme child
theme = ET.SubElement(keywords, "theme")
#Create themekt and themekey grandchildren/insert tag info
themekt = ET.SubElement(theme, "themekt")
for tag in tags: #tags list from your dictionary
    themekey = ET.SubElement(theme, "themekey")
    themekey.text = tag
#Write the updated tree back to the exported file once all elements are added
tree.write(tempXmlExportPath)
To update the Summary tags, use this code:
#Create descript element
descript = ET.SubElement(idinfo, "descript")
#Create abstract child
abstract = ET.SubElement(descript, "abstract")
#Abstract string from your dictionary (here, the first record in metaInfo)
abstract.text = metaInfo[0]["abstract"]
tree.write(tempXmlExportPath)
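The snippets above always create new elements, so running them against an export that already contains keywords or descript would add duplicates. One way around that is a small find-or-create helper. The sketch below uses the standard library's xml.etree.ElementTree so it runs without lxml (lxml.etree offers the same find and SubElement calls); the get_or_create name and the tiny sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET


def get_or_create(parent, tag):
    """Return the first child named tag, creating it if it is missing."""
    child = parent.find(tag)
    if child is None:
        child = ET.SubElement(parent, tag)
    return child


# Tiny stand-in for an exported FGDC document
root = ET.fromstring(
    "<metadata><idinfo><descript>"
    "<abstract>old text</abstract>"
    "</descript></idinfo></metadata>"
)
idinfo = get_or_create(root, "idinfo")
descript = get_or_create(idinfo, "descript")
# Existing element: only its text changes
get_or_create(descript, "abstract").text = "new text"
# Missing element: created, then filled in
get_or_create(descript, "purpose").text = "summary text"

print(ET.tostring(root, encoding="unicode"))
```

Because the helper returns the existing element when there is one, the same script can be re-run on a feature class without piling up repeated descript blocks.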
If a tag in the xml already exists, store the tag as an object using the parent.find("child") method, and update the text similar to the code examples above. Once you have updated your local xml file, use the Import Metadata method to import the xml file back into the feature class and remove the local xml file.
arcpy.ImportMetadata_conversion(tempXmlExportPath, "FROM_FGDC", featureClassPath)
#tempXmlExportPath is a single file, so use os.remove (shutil.rmtree only deletes directories)
os.remove(tempXmlExportPath)
Keep in mind that these tools in Arc are only for 32 bit, so if you are scripting through the 64 bit background geoprocessor, this will not work. I am working off of ArcMap 10.1. If you have any questions, please let me know or consult the documentation below:
lxml module
http://lxml.de/index.html#documentation
Export Metadata arcpy
http://resources.arcgis.com/en/help/main/10.1/index.html#//00120000000t000000
Import Metadata arcpy
http://resources.arcgis.com/en/help/main/10.1/index.html#//00120000000w000000

Related

parsing xml with namespace from request with lxml in python

I am trying to get some text out of a table from an online xml file. I can find the tables:
from lxml import etree
import requests
main_file = requests.get('https://training.gov.au/TrainingComponentFiles/CUA/CUAWRT601_R1.xml')
main_file.encoding = 'utf-8-sig'
root = etree.fromstring(main_file.content)
tables = root.xpath('//foo:table', namespaces={"foo": "http://www.authorit.com/xml/authorit"})
print(tables)
But I can't get any further than that. The text that I am looking for is:
Prepare to write scripts
Write draft scripts
Produce final scripts
When I paste the xml in here: http://xpather.com/
I can get it using the following expression:
//table[1]/tr/td[@width="2700"]/p[@id="4"][not(*)]/text()
but that doesn't work here and I'm out of ideas. How can I get that text?
Use the namespace prefix you declared (with namespaces={"foo": "http://www.authorit.com/xml/authorit"}), e.g. instead of //table[1]/tr/td[@width="2700"]/p[@id="4"][not(*)]/text() use //foo:table[1]/foo:tr/foo:td[@width="2700"]/foo:p[@id="4"][not(*)]/text().
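To show the prefixing idea in something that runs offline, here is a rough sketch against a small made-up document in the same namespace. It swaps lxml for the standard library's xml.etree.ElementTree, which accepts the same kind of prefix map but only a subset of XPath, so the [not(*)] test and text() step are done in Python instead. The sample XML and the leaf_paragraph_texts name are invented for the example and are not taken from the real training.gov.au file.

```python
import xml.etree.ElementTree as ET

NS = {"foo": "http://www.authorit.com/xml/authorit"}

# Made-up sample: a default namespace applies to every element below the root
SAMPLE = """<root xmlns="http://www.authorit.com/xml/authorit">
  <table>
    <tr>
      <td width="2700"><p id="4">Prepare to write scripts</p></td>
      <td width="6000"><p id="4">Other text</p></td>
    </tr>
    <tr>
      <td width="2700"><p id="4"><b>Heading</b></p></td>
    </tr>
  </table>
</root>"""


def leaf_paragraph_texts(xml_text):
    """Return the text of matching p elements that have no child elements."""
    root = ET.fromstring(xml_text)
    # Every step carries the prefix, because every element is in the namespace
    path = ".//foo:table/foo:tr/foo:td[@width='2700']/foo:p[@id='4']"
    return [p.text for p in root.findall(path, NS) if len(p) == 0]


print(leaf_paragraph_texts(SAMPLE))
```

The prefix itself is arbitrary; what has to match is the namespace URI it maps to.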

How to get a custom MP3 tag via Python?

I am working on an algorithm that uses AcousticBrainz API. Part of the process is assigning an audio file with a specific UUID that refers to a file in a database. The tag is added via Picard and is present among other tags when checking e.g. via VLC Media Player.
Is there any way to access these 'custom' tags? I tried to use eyeD3 and mutagen, however, I think they only enable accessing specific tags like artist or length of the file.
Can I use eyed3 or mutagen to accomplish the goal? Is there any other tool that enables such operation?
Yes, you can use either one. These custom tags are stored as user text frames, with the frame ID "TXXX".
Here's some example code with eyeD3:
import eyed3
file = eyed3.load("test.mp3")
for frame in file.tag.frameiter(["TXXX"]):
    print(f"{frame.description}: {frame.text}")
# get a specific tag
artist_id = file.tag.user_text_frames.get("MusicBrainz Artist Id").text
And with mutagen (it supports multiple values in each frame, but this seems to violate the ID3 spec; see this picard PR for the gory details):
from mutagen.id3 import ID3
audio = ID3("test.mp3")
for frame in audio.getall("TXXX"):
    print(f"{frame.desc}: {frame.text}")
# get a specific tag
artist_id = audio["TXXX:MusicBrainz Artist Id"].text[0]
You can see how Picard uses mutagen to read these tags here: https://github.com/metabrainz/picard/blob/ee06ed20f3b6ec17d16292045724921773dde597/picard/formats/id3.py#L314-L336
Thank you Eric Johnson! I was unaware of the different tag formats. I was interested in accessing the MBID of the recording and I could not get it through ID3, however, your example and reference to Picard really helped. I ended up needing to read UFID tag, so I used the following:
audio = ID3(filepath)
for frame in audio.getall("UFID"):
    print(str(frame.data, 'utf-8'))
Posting it here in case anyone needs it in the future.

What is the best way to store an XML file in a database using sqlalchemy-flask?

I'm working on a Flask app that I'd like to have store xml files in a database. I'd like to use flask-sqlalchemy. I've seen that in regular old sqlalchemy it is possible to use the LONGTEXT type. I believe this would work for my use case.
I would like to know (1) if LONGTEXT would be the best way store xml files and, if so, (2) how to use LONGTEXT within the flask-sqlalchemy syntax.
What should {insert-name-here} be in the code below? Will I need to install additional dependencies to use whatever is suggested?
xml_column = db.Column(db.{insert-name-here})
I have used Python for a while, and I use the xml.etree.ElementTree package to read and write XML data.
I use it like this:
#import
import xml.etree.ElementTree as ET
#xml file
_xml = 'c:/.../test.xml'
#read
tree = ET.parse(_xml)
root = tree.getroot()
h_data = root.findall('h')
#write
root = ET.Element('test')
tree = ET.ElementTree(root)
tree.write(_xml, encoding='utf-8', xml_declaration=True)
See the documentation for more.
An XML file can be saved as plain text, preferably UTF-8 encoded.
I don't think XML is the best data format to work with in Python; JSON is easier.
I hope this helps.

pdfminer doesn't extract data from filled-out pdf form

I'm trying to use pdfminer to extract the filled-out contents in a pdf form. The instructions for accessing the pdf are:
Go to https://www.ffiec.gov/nicpubweb/nicweb/InstitutionProfile.aspx?parID_Rssd=1073757&parDT_END=99991231
Click "Create Report" next to the fourth report from the top (i.e.,Banking Organization Systemic Risk Report (FR Y-15))
Click "Your request for a financial report is ready"
To extract the contents in blue, I copied code from this post:
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdftypes import resolve1
filename = 'FRY15_1073757_20160630.PDF'
fp = open(filename, 'rb')
parser = PDFParser(fp)
doc = PDFDocument(parser)
fields = resolve1(doc.catalog['AcroForm'])['Fields']
for i in fields:
    field = resolve1(i)
    name, value = field.get('T'), field.get('V')
    print('{0}: {1}'.format(name, value))
This didn't extract the data fields as expected -- nothing was printed. I tried the same code on another pdf and it worked so I suspect the failure might have to do with the security setting of the first pdf, which is shown below
For the second pdf on which the code worked, the security setting shows "Allowed" for all the actions. I also tried using pdfminer's pdf2txt.py functionality (see here) but the filled-out data in the fields in the original pdf form (which is what I want) was not in the converted text file; only the "flat" non-fillable part of the pdf was converted. Interestingly, if I use Adobe Reader's Save As Text to convert the pdf to a text file, the fillable part was in the converted text file. This is what I've been doing to get around the failed code.
Any idea how I can extract data directly from the pdf form? Thanks.
I can only explain what the problem is but cannot present a solution because I have no working Python knowledge.
Your code iterates over the immediate children of the AcroForm Fields array and expects them to represent the form fields.
While this expectation is often fulfilled, it only covers a special case: form fields are arranged as a tree structure with the Fields array as the root element, and your sample document contains a large tree.
Thus, you have to descend into the structure, not merely iterate over the immediate children of Fields, to find all form fields.
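As a rough illustration of that descent, the sketch below walks a field tree recursively through the Kids entries and joins the partial names (T) with dots. It runs on plain dictionaries so it can be tried without a PDF; iter_form_fields, its resolve parameter, and the sample data are hypothetical, and decoding names as UTF-8 is an assumption (PDF text strings can use other encodings). It has not been run against the FR Y-15 document from the question.

```python
def iter_form_fields(fields, resolve=lambda obj: obj, prefix=""):
    """Yield (full_name, value) for every terminal field in a form field tree."""
    for ref in fields:
        field = resolve(ref)
        name = field.get("T")
        if isinstance(name, bytes):
            name = name.decode("utf-8", "replace")
        full_name = ".".join(part for part in (prefix, name) if part)
        kids = [resolve(kid) for kid in resolve(field.get("Kids", []))]
        # Kids with their own partial name ("T") are child fields; kids
        # without one are treated here as widgets of a terminal field.
        child_fields = [kid for kid in kids if "T" in kid]
        if child_fields:
            yield from iter_form_fields(child_fields, resolve, full_name)
        else:
            yield full_name, field.get("V")


# Plain-dict stand-in for a resolved AcroForm Fields array
sample = [
    {"T": b"form1", "Kids": [
        {"T": b"page1", "Kids": [
            {"T": b"bank_name", "V": b"Example Bank"},
            {"T": b"total", "V": b"123", "Kids": [{"Subtype": "Widget"}]},
        ]},
    ]},
]
for name, value in iter_form_fields(sample):
    print(name, value)
```

With pdfminer, the idea would be to pass the Fields array from the question's code as fields and resolve1 as resolve, so that indirect references are followed at each level.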

How to retrieve the author of an office file in python?

The title explains the problem: there are doc and docx files whose author information I want to retrieve so that I can restructure my files.
os.stat returns only size and datetime, real-file related information.
open(filename, 'rb').read(200) returns many characters that I could not parse.
There is a module called xlrd for reading xlsx files, but it still doesn't let me read doc or docx files. I am aware that new Office files are not easily read by non-MS Office programs, so if that's impossible, gathering info from old Office files would suffice.
Since docx files are just zipped XML you could just unzip the docx file and presumably pull the author information out of an XML file. Not quite sure where it'd be stored, just looking around at it briefly leads me to suspect it's stored as dc:creator in docProps/core.xml.
Here's how you can open the docx file and retrieve the creator:
import zipfile, lxml.etree
# open zipfile
zf = zipfile.ZipFile('my_doc.docx')
# use lxml to parse the xml file we are interested in
doc = lxml.etree.fromstring(zf.read('docProps/core.xml'))
# retrieve creator
ns={'dc': 'http://purl.org/dc/elements/1.1/'}
creator = doc.xpath('//dc:creator', namespaces=ns)[0].text
You can use COM interop to access the Word object model. This link talks about the technique: http://www.blog.pythonlibrary.org/2010/07/16/python-and-microsoft-office-using-pywin32/
The secret when working with any of the office objects is knowing what item to access from the overwhelming amount of methods and properties. In this case each document has a list of BuiltInDocumentProperties . The property of interest is "Last Author".
After you open the document you will access the author with something like word.ActiveDocument.BuiltInDocumentProperties("Last Author")
How about using the python-docx library? You can pull more information about the file than just the author.
#sudo pip install python-docx
#sudo pip2 install python-docx
#sudo pip3 install python-docx
import docx
file_name = 'file_path_name.docx'
document = docx.Document(docx = file_name)
core_properties = document.core_properties
print(core_properties.author)
print(core_properties.created)
print(core_properties.last_modified_by)
print(core_properties.last_printed)
print(core_properties.modified)
print(core_properties.revision)
print(core_properties.title)
print(core_properties.category)
print(core_properties.comments)
print(core_properties.identifier)
print(core_properties.keywords)
print(core_properties.language)
print(core_properties.subject)
print(core_properties.version)
print(core_properties.content_status)
find more information about the docx library here and the github account is here
For old office documents (.doc, .xls) you can use hachoir-metadata.
It does not work well with the new file formats: for example, it can parse .xlsx files, but will not provide you with an Author name.
The newer Office formats are just zip containers containing xml files. You can have a look here https://github.com/profHajal/Microsoft-Office-Documents-Metadata-with-Python/blob/main/mso_md.py for a very simple straightforward approach.
The code listed is easily extendable for OpenOffice formats.
Pseudocode:
import zipfile
import xml.dom.minidom

z = zipfile.ZipFile(filename, 'r')
#MS Office files:
data = z.read('docProps/core.xml')
#or, for OpenOffice files:
#data = z.read('meta.xml')
doc = xml.dom.minidom.parseString(data)
tag = "data you're interested in"  #e.g. "dc:creator"
metadata_string = doc.getElementsByTagName(tag)[0].childNodes[0].data
Files to search metadata in:
docProps/core.xml for MS Office files
meta.xml for OpenOffice files
A non-exhaustive list of tags you can search for:
From the Dublin core namespace rules: dc
Title: dc:title
Creator (of most recent modification): dc:creator
Description: dc:description
Subject: dc:subject
Date (last modified): dc:date
Language: ???
From the ODF specification: meta
Generator (creating software application): meta:generator
Keywords: meta:keyword
Initial Creator: ???
Creation Date and Time: meta:creation-date
Modification Date and Time: ???
Print Date and Time: ???
Document Template: meta:template (data in attributes)
Document Statistics (word count, page count, etc.): meta:document-statistic (data in attributes)
MS Office specific:
Creation Date and Time: dcterms:created
Date (last modified): dcterms:modified
Creator of most recent modification: cp:lastModifiedBy
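For the MS Office case, the lookup described above could be sketched with the standard library alone, using namespace-aware lookups for a few of the tags listed. The read_core_properties helper and the in-memory sample archive are made up for illustration; a real .docx or .xlsx path could be passed instead of the buffer.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}


def read_core_properties(path_or_file):
    """Return selected core properties from an Office Open XML container."""
    with zipfile.ZipFile(path_or_file) as zf:
        root = ET.fromstring(zf.read("docProps/core.xml"))
    wanted = {
        "creator": "dc:creator",
        "last_modified_by": "cp:lastModifiedBy",
        "created": "dcterms:created",
        "modified": "dcterms:modified",
    }
    # findtext returns None for any property missing from the file
    return {key: root.findtext(path, namespaces=NS) for key, path in wanted.items()}


# Build a minimal stand-in archive in memory so the sketch runs on its own
CORE_XML = (
    '<cp:coreProperties xmlns:cp="{cp}" xmlns:dc="{dc}" '
    'xmlns:dcterms="{dcterms}">'.format(**NS)
    + "<dc:creator>Jane Doe</dc:creator>"
    + "<cp:lastModifiedBy>John Roe</cp:lastModifiedBy>"
    + "</cp:coreProperties>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("docProps/core.xml", CORE_XML)
buffer.seek(0)

props = read_core_properties(buffer)
print(props["creator"], "/", props["last_modified_by"])
```

Matching on namespace URIs rather than literal prefixes keeps the lookup working even if a file declares the same namespaces under different prefixes.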
