Obtain DTD information from XML using Python

Obtain DTD information from XML using Python - python

I'm trying to extract the DTD information from an XML document using Python and preferably the standard library. At first glance, it seems xml.sax.handler.DTDHandler is the way to go, so I wrote the following example code to extract the DTD of a trivial DocBook v4 document:
import xml.sax
from contextlib import closing
XML_CODE = '''<!DOCTYPE example PUBLIC
"-//OASIS//DTD DocBook XML V4.1.2//EN"
"http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd">
<example><title>Hello World in Python</title>
<programlisting>
print('Hello World!')
</programlisting>
</example>'''
class DTDPrinter(xml.sax.handler.DTDHandler):
def notationDecl(self, name, publicId, systemId):
print('name={}, publicId={}'.format(name, publicId))
if __name__ == '__main__':
with closing(xml.sax.make_parser()) as parser:
parser.setFeature(xml.sax.handler.feature_external_pes, False)
parser.setFeature(xml.sax.handler.feature_validation, False)
parser.setDTDHandler(DTDPrinter())
print('---------- before feed')
parser.feed(XML_CODE)
print('---------- after feed')
My expectation was that when running this code with Python 3.5 the output would be something like:
---------- before feed
name=example, publicId=-//OASIS//DTD DocBook XML V4.1.2//EN
---------- after feed
Instead I get an output with DTD's seemingly related to various image formats but not the one specified in the document:
---------- before feed
name=BMP, publicId=+//ISBN 0-7923-9432-1::Graphic Notation//NOTATION Microsoft Windows bitmap//EN
name=CGM-CHAR, publicId=ISO 8632/2//NOTATION Character encoding//EN
name=CGM-BINARY, publicId=ISO 8632/3//NOTATION Binary encoding//EN
...
name=WMF, publicId=+//ISBN 0-7923-9432-1::Graphic Notation//NOTATION Microsoft Windows Metafile//EN
name=WPG, publicId=None
name=linespecific, publicId=None
---------- after feed
Although maybe the last entry with the name linespecific might refer to the document DTD in a crippled way?
I also noticed a couple of seconds delay after the last output despite the document being trivial. Maybe the parsers attempts to connect to the internet? I tried to disable this by settings the features
parser.setFeature(xml.sax.handler.feature_external_pes, False)
parser.setFeature(xml.sax.handler.feature_validation, False)
but to no avail.
How can I convince the DTDHandler to react to the DTD occurring in the document and not connect to the internet?

As I didn't receive an answer and couldn't get this to work properly in the past 2 weeks I resorted to violence and simply used a regular expression to extract the information. This is ugly and won't be able to properly process all valid ways to express a DTD but is good enough for my purpose.
import re
DTD_REGEX = re.compile(
r'<!DOCTYPE\s+(?P<name>[a-zA-Z][a-zA-Z-]*)\s+PUBLIC\s+"(?P<public_id>.+)"')
dtd_match = DTD_REGEX.match(XML_CODE)
if dtd_match is not None:
public_id = dtd_match.group('public_id')
print(public_id)

Related

How to parse doc files on a mac with python? [duplicate]

for working with MS word files in python, there is python win32 extensions, which can be used in windows. How do I do the same in linux?
Is there any library?

Use the native Python docx module. Here's how to extract all the text from a doc:
document = docx.Document(filename)
docText = '\n\n'.join(
paragraph.text for paragraph in document.paragraphs
)
print(docText)
See Python DocX site
Also check out Textract which pulls out tables etc.
Parsing XML with regexs invokes cthulu. Don't do it!

You could make a subprocess call to antiword. Antiword is a linux commandline utility for dumping text out of a word doc. Works pretty well for simple documents (obviously it loses formatting). It's available through apt, and probably as RPM, or you could compile it yourself.

benjamin's answer is a pretty good one. I have just consolidated...
import zipfile, re
docx = zipfile.ZipFile('/path/to/file/mydocument.docx')
content = docx.read('word/document.xml').decode('utf-8')
cleaned = re.sub('<(.|\n)*?>','',content)
print(cleaned)

OpenOffice.org can be scripted with Python: see here.
Since OOo can load most MS Word files flawlessly, I'd say that's your best bet.

I know this is an old question, but I was recently trying to find a way to extract text from MS word files, and the best solution by far I found was with wvLib:
http://wvware.sourceforge.net/
After installing the library, using it in Python is pretty easy:
import commands
exe = 'wvText ' + word_file + ' ' + output_txt_file
out = commands.getoutput(exe)
exe = 'cat ' + output_txt_file
out = commands.getoutput(exe)
And that's it. Pretty much, what we're doing is using the commands.getouput function to run a couple of shell scripts, namely wvText (which extracts text from a Word document, and cat to read the file output). After that, the entire text from the Word document will be in the out variable, ready to use.
Hopefully this will help anyone having similar issues in the future.

Take a look at how the doc format works and create word document using PHP in linux. The former is especially useful. Abiword is my recommended tool. There are limitations though:
However, if the document has complicated tables, text boxes, embedded spreadsheets, and so forth, then it might not work as expected. Developing good MS Word filters is a very difficult process, so please bear with us as we work on getting Word documents to open correctly. If you have a Word document which fails to load, please open a Bug and include the document so we can improve the importer.

(Note: I posted this on this question as well, but it seems relevant here, so please excuse the repost.)
Now, this is pretty ugly and pretty hacky, but it seems to work for me for basic text extraction. Obviously to use this in a Qt program you'd have to spawn a process for it etc, but the command line I've hacked together is:
unzip -p file.docx | grep '<w:t' | sed 's/<[^<]*>//g' | grep -v '^[[:space:]]*$'
So that's:
unzip -p file.docx: -p == "unzip to stdout"
grep '<w:t': Grab just the lines containing '<w:t' (<w:t> is the Word 2007 XML element for "text", as far as I can tell)
sed 's/<[^<]>//g'*: Remove everything inside tags
grep -v '^[[:space:]]$'*: Remove blank lines
There is likely a more efficient way to do this, but it seems to work for me on the few docs I've tested it with.
As far as I'm aware, unzip, grep and sed all have ports for Windows and any of the Unixes, so it should be reasonably cross-platform. Despit being a bit of an ugly hack ;)

If your intention is to use purely python modules without calling a subprocess, you can use the zipfile python modude.
content = ""
# Load DocX into zipfile
docx = zipfile.ZipFile('/home/whateverdocument.docx')
# Unpack zipfile
unpacked = docx.infolist()
# Find the /word/document.xml file in the package and assign it to variable
for item in unpacked:
if item.orig_filename == 'word/document.xml':
content = docx.read(item.orig_filename)
else:
pass
Your content string however needs to be cleaned up, one way of doing this is:
# Clean the content string from xml tags for better search
fullyclean = []
halfclean = content.split('<')
for item in halfclean:
if '>' in item:
bad_good = item.split('>')
if bad_good[-1] != '':
fullyclean.append(bad_good[-1])
else:
pass
else:
pass
# Assemble a new string with all pure content
content = " ".join(fullyclean)
But there is surely a more elegant way to clean up the string, probably using the re module.
Hope this helps.

Unoconv might also be a good alternative: http://linux.die.net/man/1/unoconv

To read Word 2007 and later files, including .docx files, you can use the python-docx package:
from docx import Document
document = Document('existing-document-file.docx')
document.save('new-file-name.docx')
To read .doc files from Word 2003 and earlier, make a subprocess call to antiword. You need to install antiword first:
sudo apt-get install antiword
Then just call it from your python script:
import os
input_word_file = "input_file.doc"
output_text_file = "output_file.txt"
os.system('antiword %s > %s' % (input_word_file, output_text_file))

If you have LibreOffice installed, you can simply call it from the command line to convert the file to text, then load the text into Python.

Is this an old question?
I believe that such thing does not exist.
There are only answered and unanswered ones.
This one is pretty unanswered, or half answered if you wish.
Well, methods for reading *.docx (MS Word 2007 and later) documents without using COM interop are all covered.
But methods for extracting text from *.doc (MS Word 97-2000), using Python only, lacks.
Is this complicated?
To do: not really, to understand: well, that's another thing.
When I didn't find any finished code, I read some format specifications and dug out some proposed algorithms in other languages.
MS Word (*.doc) file is an OLE2 compound file.
Not to bother you with a lot of unnecessary details, think of it as a file-system stored in a file. It actually uses FAT structure, so the definition holds. (Hm, maybe you can loop-mount it in Linux???)
In this way, you can store more files within a file, like pictures etc.
The same is done in *.docx by using ZIP archive instead.
There are packages available on PyPI that can read OLE files. Like (olefile, compoundfiles, ...)
I used compoundfiles package to open *.doc file.
However, in MS Word 97-2000, internal subfiles are not XML or HTML, but binary files.
And as this is not enough, each contains an information about other one, so you have to read at least two of them and unravel stored info accordingly.
To understand fully, read the PDF document from which I took the algorithm.
Code below is very hastily composed and tested on small number of files.
As far as I can see, it works as intended.
Sometimes some gibberish appears at the start, and almost always at the end of text.
And there can be some odd characters in-between as well.
Those of you who just wish to search for text will be happy.
Still, I urge anyone who can help to improve this code to do so.
doc2text module:
"""
This is Python implementation of C# algorithm proposed in:
http://b2xtranslator.sourceforge.net/howtos/How_to_retrieve_text_from_a_binary_doc_file.pdf
Python implementation author is Dalen Bernaca.
Code needs refining and probably bug fixing!
As I am not a C# expert I would like some code rechecks by one.
Parts of which I am uncertain are:
* Did the author of original algorithm used uint32 and int32 when unpacking correctly?
I copied each occurence as in original algo.
* Is the FIB length for MS Word 97 1472 bytes as in MS Word 2000, and would it make any difference if it is not?
* Did I interpret each C# command correctly?
I think I did!
"""
from compoundfiles import CompoundFileReader, CompoundFileError
from struct import unpack
__all__ = ["doc2text"]
def doc2text (path):
text = u""
cr = CompoundFileReader(path)
# Load WordDocument stream:
try:
f = cr.open("WordDocument")
doc = f.read()
f.close()
except: cr.close(); raise CompoundFileError, "The file is corrupted or it is not a Word document at all."
# Extract file information block and piece table stream informations from it:
fib = doc[:1472]
fcClx = unpack("L", fib[0x01a2l:0x01a6l])[0]
lcbClx = unpack("L", fib[0x01a6l:0x01a6+4l])[0]
tableFlag = unpack("L", fib[0x000al:0x000al+4l])[0] & 0x0200l == 0x0200l
tableName = ("0Table", "1Table")[tableFlag]
# Load piece table stream:
try:
f = cr.open(tableName)
table = f.read()
f.close()
except: cr.close(); raise CompoundFileError, "The file is corrupt. '%s' piece table stream is missing." % tableName
cr.close()
# Find piece table inside a table stream:
clx = table[fcClx:fcClx+lcbClx]
pos = 0
pieceTable = ""
lcbPieceTable = 0
while True:
if clx[pos]=="\x02":
# This is piece table, we store it:
lcbPieceTable = unpack("l", clx[pos+1:pos+5])[0]
pieceTable = clx[pos+5:pos+5+lcbPieceTable]
break
elif clx[pos]=="\x01":
# This is beggining of some other substructure, we skip it:
pos = pos+1+1+ord(clx[pos+1])
else: break
if not pieceTable: raise CompoundFileError, "The file is corrupt. Cannot locate a piece table."
# Read info from pieceTable, about each piece and extract it from WordDocument stream:
pieceCount = (lcbPieceTable-4)/12
for x in xrange(pieceCount):
cpStart = unpack("l", pieceTable[x*4:x*4+4])[0]
cpEnd = unpack("l", pieceTable[(x+1)*4:(x+1)*4+4])[0]
ofsetDescriptor = ((pieceCount+1)*4)+(x*8)
pieceDescriptor = pieceTable[ofsetDescriptor:ofsetDescriptor+8]
fcValue = unpack("L", pieceDescriptor[2:6])[0]
isANSII = (fcValue & 0x40000000) == 0x40000000
fc = fcValue & 0xbfffffff
cb = cpEnd-cpStart
enc = ("utf-16", "cp1252")[isANSII]
cb = (cb*2, cb)[isANSII]
text += doc[fc:fc+cb].decode(enc, "ignore")
return "\n".join(text.splitlines())

I'm not sure if you're going to have much luck without using COM. The .doc format is ridiculously complex, and is often called a "memory dump" of Word at the time of saving!
At Swati, that's in HTML, which is fine and dandy, but most word documents aren't so nice!

Just an option for reading 'doc' files without using COM: miette. Should work on any platform.

Aspose.Words Cloud SDK for Python is a platform independent solution to convert MS Word/Open Office files to text. It is a commercial product but free trial plan provides 150 monthly API calls.
P.S: I am a developer evangelist at Aspose.
# For complete examples and data files, please go to https://github.com/aspose-words-cloud/aspose-words-cloud-python
# Import module
import asposewordscloud
import asposewordscloud.models.requests
from shutil import copyfile
# Please get your Client ID and Secret from https://dashboard.aspose.cloud.
client_id='xxxxxxx-xxxx-xxxx-xxxxx-xxxxxxxxxx'
client_secret='xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
words_api = asposewordscloud.WordsApi(client_id,client_secret)
words_api.api_client.configuration.host='https://api.aspose.cloud'
filename = 'C:/Temp/02_pages.docx'
dest_name = 'C:/Temp/02_pages.txt'
#Convert RTF to text
request = asposewordscloud.models.requests.ConvertDocumentRequest(document=open(filename, 'rb'), format='txt')
result = words_api.convert_document(request)
copyfile(result, dest_name)

How to write a python script for downloading?

I want to download some files from this site: http://www.emuparadise.me/soundtracks/highquality/index.php
But I only want to get certain ones.
Is there a way to write a python script to do this? I have intermediate knowledge of python
I'm just looking for a bit of guidance, please point me towards a wiki or library to accomplish this
thanks,
Shrub
Here's a link to my code

I looked at the page. The links seem to redirect to another page, where the file is hosted, clicking which downloads the file.
I would use mechanize to follow the required links to the right page, and then use BeautifulSoup or lxml to parse the resultant page to get the filename.
Then it's a simple matter of opening the file using urlopen and writing its contents out into a local file like so:
f = open(localFilePath, 'w')
f.write(urlopen(remoteFilePath).read())
f.close()
Hope that helps

Make a url request for the page. Once you have the source, filter out and get urls.
The files you want to download are urls that contain a specific extension. It is with this that you can do a regular expression search for all urls that match your criteria.
After filtration, then do a url request for each matched url's data and write it to memory.
Sample code:
#!/usr/bin/python
import re
import sys
import urllib
#Your sample url
sampleUrl = "http://stackoverflow.com"
urlAddInfo = urllib.urlopen(sampleUrl)
data = urlAddInfo.read()
#Sample extensions we'll be looking for: pngs and pdfs
TARGET_EXTENSIONS = "(png|pdf)"
targetCompile = re.compile(TARGET_EXTENSIONS, re.UNICODE|re.MULTILINE)
#Let's get all the urls: match criteria{no spaces or " in a url}
urls = re.findall('(https?://[^\s"]+)', data, re.UNICODE|re.MULTILINE)
#We want these folks
extensionMatches = filter(lambda url: url and targetCompile.search(url), urls)
#The rest of the unmatched urls for which the scrapping can also be repeated.
nonExtMatches = filter(lambda url: url and not targetCompile.search(url), urls)
def fileDl(targetUrl):
#Function to handle downloading of files.
#Arg: url => a String
#Output: Boolean to signify if file has been written to memory
#Validation of the url assumed, for the sake of keeping the illustration short
urlAddInfo = urllib.urlopen(targetUrl)
data = urlAddInfo.read()
fileNameSearch = re.search("([^\/\s]+)$", targetUrl) #Text right before the last slash '/'
if not fileNameSearch:
sys.stderr.write("Could not extract a filename from url '%s'\n"%(targetUrl))
return False
fileName = fileNameSearch.groups(1)[0]
with open(fileName, "wb") as f:
f.write(data)
sys.stderr.write("Wrote %s to memory\n"%(fileName))
return True
#Let's now download the matched files
dlResults = map(lambda fUrl: fileDl(fUrl), extensionMatches)
successfulDls = filter(lambda s: s, dlResults)
sys.stderr.write("Downloaded %d files from %s\n"%(len(successfulDls), sampleUrl))
#You can organize the above code into a function to repeat the process for each of the
#other urls and in that way you can make a crawler.
The above code is written mainly for Python2.X. However, I wrote a crawler that works on any version starting from 2.X

Why yes! 5 years later and, not only is this possible, but you've now got a lot of ways to do it.
I'm going to avoid code-examples here, because mainly want to help break your problem into segments and give you some options for exploration:
Segment 1: GET!
If you must stick to the stdlib, for either python2 or python3, urllib[n]* is what you're going to want to use to pull-down something from the internet.
So again, if you don't want dependencies on other packages:
urllib or urllib2 or maybe another urllib[n] I'm forgetting about.
If you don't have to restrict your imports to the Standard Library:
you're in luck!!!!! You've got:
requests with docs here. requests is the golden standard for gettin' stuff off the web with python. I suggest you use it.
uplink with docs here. It's relatively new & for more programmatic client interfaces.
aiohttp via asyncio with docs here. asyncio got included in python >= 3.5 only, and it's also extra confusing. That said, it if you're willing to put in the time it can be ridiculously efficient for exactly this use-case.
...I'd also be remiss not to mention one of my favorite tools for crawling:
fake_useragent repo here. Docs like seriously not necessary.
Segment 2: Parse!
So again, if you must stick to the stdlib and not install anything with pip, you get to use the extra-extra fun and secure (<==extreme-sarcasm) xml builtin module. Specifically, you get to use the:
xml.etree.ElementTree() with docs here.
It's worth noting that the ElementTree object is what the pip-downloadable lxml package is based on, and made make easier to use. If you want to recreate the wheel and write a bunch of your own complicated logic, using the default xml module is your option.
If you don't have to restrict your imports to the Standard Library:
lxml with docs here. As i said before, lxml is a wrapper around xml.etree that makes it human-usable & implements all those parsing tools you'd need to make yourself. However, as you can see by visiting the docs, it's not easy to use by itself. This brings us to...
BeautifulSoup aka bs4 with docs here. BeautifulSoup makes everything easier. It's my recommendation for this.
Segment 3: GET GET GET!
This section is nearly exactly the same as "Segment 1," except you have a bunch of links not one.
The only thing that changes between this section and "Segment 1" is my recommendation for what to use: aiohttp here will download way faster when dealing with several URLs because it's allows you to download them in parallel.**
* - (where n was decided-on from python-version to ptyhon-version in a somewhat frustratingly arbitrary manner. Look up which urllib[n] has .urlopen() as a top-level function. You can read more about this naming-convention clusterf**k here, here, and here.)
** - (This isn't totally true. It's more sort-of functionally-true at human timescales.)

I would use a combination of wget for downloading - http://www.thegeekstuff.com/2009/09/the-ultimate-wget-download-guide-with-15-awesome-examples/#more-1885 and BeautifulSoup http://www.crummy.com/software/BeautifulSoup/bs4/doc/ for parsing the downloaded file

How do you extract feed urls from an OPML file exported from Google Reader?

I have a piece of software called Rss-Aware that I'm trying to use. It basically desktop feed-checker that checks if RSS feeds are updated and gives a notification through Ubuntu's Notify-OSD system.
However, to know what feeds to check, you have to list out the feed urls in a text file in ~/.rss-aware/rssfeeds.txt one after the other in a list with linebreak between each feed url. Something like:
http://example.com/feed.xml
http://othersite.org/feed.xml
http://othergreatsite.net/rss.xml
...Seems pretty simple right? Well, the list of feeds I'd like to use are exported from Google Reader as an OPML file (it's a type of XML) and I have no clue how to parse it to just output the the feed urls. It seems like it should be pretty straight forward yet I'm stumped.
I'd love if anyone could give an implementation in Python or Ruby or something I could do quickly from a prompt. A bash script would be awesome.
Thanks you so much for the help, I'm a really weak programmer and would love to learn how to do this basic parsing.
EDIT: Also, here is the OPML file I'm trying to extract the feed urls from.

I wrote a subscription list parser for this very purpose. It's called listparser, and it's written in Python. I just tested your OPML file, and it appears to parse the file perfectly. It will also make your feeds' labels available.
If you've ever used feedparser, the interface should be familiar:
>>> import listparser as lp
>>> d = lp.parse('https://dl.dropbox.com/u/670189/google-reader-subscriptions.xml')
>>> len(d.feeds)
112
>>> d.feeds[100].url
u'http://longreads.com/rss'
>>> d.feeds[100].tags
[u'reading']
It's possible to create the file with feed URLs using a script similar to:
import listparser as lp
d = lp.parse('https://dl.dropbox.com/u/670189/google-reader-subscriptions.xml')
f = open('/home/USERNAME/.rss-aware/rssfeeds.txt', 'w')
for i in d.feeds:
f.write(i.url + '\n')
f.close()
Just replace USERNAME with your actual username. Done!

XML parsing was so easy to implement and worked great for me.
from xml.etree import ElementTree
def extract_rss_urls_from_opml(filename):
urls = []
with open(filename, 'rt') as f:
tree = ElementTree.parse(f)
for node in tree.findall('.//outline'):
url = node.attrib.get('xmlUrl')
if url:
urls.append(url)
return urls
urls = extract_rss_urls_from_opml('your_file')

Since it's an XML file, you can use an XPath query to extract the urls.
In the XML file, it looks like the rss feed urls are stored in xmlUrl attributes. The XPath expression //#xmlUrl will select all values of that attribute.
If you want to test this out in your web-browser, you can use an online XPath tester. If you want to perform this XPath query in Python, this question explains how to use XPath in Python. Additionally, the lxml docs have a page on using XPath in lxml that might be helpful.

You could also use a regex. I used the following search-and-replace regex to convert my Google Reader OPML export to a Firefox HTML live-bookmark import:
^\s+<outline.*?title="(.*?)".*?xmlUrl="(.*?)".*?htmlUrl="(.*?)".*?/>
<DT><A FEEDURL="$2" HREF="$3">$1</A>

python libxml2 reader and XML_PARSE_RECOVER

I'm trying to get a reader to recover from broken XML. Using the libxml2.XML_PARSE_RECOVER option with the DOM api (libxml2.readDoc) works and it recovers from entity problems.
However using the option with the reader API (which is essential due to the size of documents we are parsing) does not work. It just gets stuck in a perpetual loop (with reader.Read() returning -1):
Sample code (with small example):
import cStringIO
import libxml2
DOC = "<a>some broken & xml</a>"
reader = libxml2.readerForDoc(DOC, "urn:bogus", None, libxml2.XML_PARSE_RECOVER | libxml2.XML_PARSE_NOERROR)
ret = reader.Read()
while ret:
print 'ret: %d' % ret
print "node name: ", reader.Name(), reader.NodeType()
ret = reader.Read()
Any ideas how to recover correctly?

I'm not too sure about the current state of the libxml2 bindings. Even the libxml2 site suggests using lxml instead. To parse this tree and ignore the & is nice and clean in lxml:
from cStringIO import StringIO
from lxml import etree
DOC = "<a>some broken & xml</a>"
reader = etree.XMLParser(recover=True)
tree = etree.parse(StringIO(DOC), reader)
print etree.tostring(tree.getroot())
The parsers page in the lxml docs goes into more detail about setting up a parser and iterating over the contents.
Edit:
If you want to parse a document incrementally the XMLparser class can be used as well since it is a subclass of _FeedParser:
DOC = "<a>some broken & xml</a>"
reader = etree.XMLParser(recover=True)
for data in StringIO(DOC).read():
reader.feed(data)
tree = reader.close()
print etree.tostring(tree)

Isn't the xml broken in some consistent way? Isn't there some pattern you could follow to repair your xml before parsing?
For example - if the error is caused only by unescaped ampersands and you don't use CDATA or processing instructions, it can be repaired with a regexp.
EDIT: Then take a look at sgmllib in python standard library. BeautifulSoup uses it, so it can be useful in your case. (BeatifulSoup itself offers only the tree representation, not the events).

Consider using xml.sax. When I'm presented really malformed XML that can have a plethora of different problems try dividing the problem into small pieces.
You mentioned that you have a very large XML file, well it probably has many records that you process serially. And each record (e.g. <item>...</item> has a start and end tag, presumably - these will will your recovery points.
In xml.sax you provide the reader, the handler, and the input sources. At worse a single records will be unrecoverable with this technique. Its a little more setup, but incrementally parsing a malformed feed a record at a time logging the bad records is probably the best you can do.
In the logs make sure to give yourself enough information to rebuild the original record so you can add additional recovery code for all the cases that you'll no doubt have to handle (e.g. create a badrecords_today's date.xml so you can reprocess manually).
Good luck.

Or, you could use BeautifulSoup. It does a nice job recovering broken ML.

abstracting the conversion between id3 tags, m4a tags, flac tags

I'm looking for a resource in python or bash that will make it easy to take, for example, mp3 file X and m4a file Y and say "copy X's tags to Y".
Python's "mutagen" module is great for manupulating tags in general, but there's no abstract concept of "artist field" that spans different types of tag; I want a library that handles all the fiddly bits and knows fieldname equivalences. For things not all tag systems can express, I'm okay with information being lost or best-guessed.
(Use case: I encode lossless files to mp3, then go use the mp3s for listening. Every month or so, I want to be able to update the 'master' lossless files with whatever tag changes I've made to the mp3s. I'm tired of stubbing my toes on implementation differences among formats.)

I needed this exact thing, and I, too, realized quickly that mutagen is not a distant enough abstraction to do this kind of thing. Fortunately, the authors of mutagen needed it for their media player QuodLibet.
I had to dig through the QuodLibet source to find out how to use it, but once I understood it, I wrote a utility called sequitur which is intended to be a command line equivalent to ExFalso (QuodLibet's tagging component). It uses this abstraction mechanism and provides some added abstraction and functionality.
If you want to check out the source, here's a link to the latest tarball. The package is actually a set of three command line scripts and a module for interfacing with QL. If you want to install the whole thing, you can use:
easy_install QLCLI
One thing to keep in mind about exfalso/quodlibet (and consequently sequitur) is that they actually implement audio metadata properly, which means that all tags support multiple values (unless the file type prohibits it, which there aren't many that do). So, doing something like:
print qllib.AudioFile('foo.mp3')['artist']
Will not output a single string, but will output a list of strings like:
[u'The First Artist', u'The Second Artist']
The way you might use it to copy tags would be something like:
import os.path
import qllib # this is the module that comes with QLCLI
def update_tags(mp3_fn, flac_fn):
mp3 = qllib.AudioFile(mp3_fn)
flac = qllib.AudioFile(flac_fn)
# you can iterate over the tag names
# they will be the same for all file types
for tag_name in mp3:
flac[tag_name] = mp3[tag_name]
flac.write()
mp3_filenames = ['foo.mp3', 'bar.mp3', 'baz.mp3']
for mp3_fn in mp3_filenames:
flac_fn = os.path.splitext(mp3_fn)[0] + '.flac'
if os.path.getmtime(mp3_fn) != os.path.getmtime(flac_fn):
update_tags(mp3_fn, flac_fn)

I have a bash script that does exactly that, atwat-tagger. It supports flac, mp3, ogg and mp4 files.
usage: `atwat-tagger.sh inputfile.mp3 outputfile.ogg`
I know your project is already finished, but somebody who finds this page through a search engine might find it useful.

Here's some example code, a script that I wrote to copy tags between
files using Quod Libet's music format classes (not mutagen's!). To run
it, just do copytags.py src1 dest1 src2 dest2 src3 dest3, and it
will copy the tags in sec1 to dest1 (after deleting any existing tags
on dest1!), and so on. Note the blacklist, which you should tweak to
your own preference. The blacklist will not only prevent certain tags
from being copied, it will also prevent them from being clobbered in
the destination file.
To be clear, Quod Libet's format-agnostic tagging is not a feature of mutagen; it is implemented on top of mutagen. So if you want format-agnostic tagging, you need to use quodlibet.formats.MusicFile to open your files instead of mutagen.File.
Code can now be found here: https://github.com/DarwinAwardWinner/copytags
If you also want to do transcoding at the same time, use this: https://github.com/DarwinAwardWinner/transfercoder
One critical detail for me was that Quod Libet's music format classes
expect QL's configuration to be loaded, hence the config.init line in my
script. Without that, I get all sorts of errors when loading or saving
files.
I have tested this script for copying between flac, ogg, and mp3, with "standard" tags, as well as arbitrary tags. It has worked perfectly so far.
As for the reason that I didn't use QLLib, it didn't work for me. I suspect it was getting the same config-related errors as I was, but was silently ignoring them and simply failing to write tags.

You can just write a simple app with a mapping of each tag name in each format to an "abstract tag" type, and then its easy to convert from one to the other. You don't even have to know all available types - just those that you are interested in.
Seems to me like a weekend-project type of time investment, possibly less. Have fun, and I won't mind taking a peek at your implementation and even using it - if you won't mind releasing it of course :-) .

There's also tagpy, which seems to work well.

Since the other solutions have mostly fallen off the net, here is what I came up, based on the python mediafile library (python3-mediafile in Debian GNU/Linux).
#!/usr/bin/python3
import sys
from mediafile import MediaFile
src = MediaFile (sys.argv [1])
dst = MediaFile (sys.argv [2])
for field in src.fields ():
try:
setattr (dst, field, getattr (src, field))
except:
pass
dst.save ()
Usage: mediafile-mergetags srcfile dstfile
It copies (merges) all tags from srcfile into dstfile, and seems to work properly with flac, opus, mp3 and so on, including copying album art.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.