Writing metadata to a pdf using pyobjc - python

I'm trying to write metadata to a pdf file using the following python code:
from Foundation import *
from Quartz import *
url = NSURL.fileURLWithPath_("test.pdf")
pdfdoc = PDFDocument.alloc().initWithURL_(url)
assert pdfdoc, "failed to create document"
print "reading pdf file"
attrs = {}
attrs[PDFDocumentTitleAttribute] = "THIS IS THE TITLE"
attrs[PDFDocumentAuthorAttribute] = "A. Author and B. Author"
PDFDocumentTitleAttribute = "test"
pdfdoc.setDocumentAttributes_(attrs)
pdfdoc.writeToFile_("mynewfile.pdf")
print "pdf made"
This appears to work fine (no errors in the console), however when I examine the metadata of the file it is as follows:
PdfID0:
242b7e252f1d3fdd89b35751b3f72d3
PdfID1:
242b7e252f1d3fdd89b35751b3f72d3
NumberOfPages: 4
and the original file had the following metadata:
InfoKey: Creator
InfoValue: PScript5.dll Version 5.2.2
InfoKey: Title
InfoValue: Microsoft Word - PROGRESS ON THE GABION HOUSE Compressed.doc
InfoKey: Producer
InfoValue: GPL Ghostscript 8.15
InfoKey: Author
InfoValue: PWK
InfoKey: ModDate
InfoValue: D:20101021193627-05'00'
InfoKey: CreationDate
InfoValue: D:20101008152350Z
PdfID0: d5fd6d3960122ba72117db6c4d46cefa
PdfID1: 24bade63285c641b11a8248ada9f19
NumberOfPages: 4
So the problems are that it is not adding the new metadata, and it is clearing out the previous metadata. What do I need to do to get this to work? My objective is to add metadata that reference management systems can import.

Mark is on the right track, but there are a few peculiarities that should be accounted for.
First, he is correct that pdfdoc.documentAttributes() is the dictionary that contains the document metadata. You want to modify it, but documentAttributes() returns an NSDictionary, which is immutable, so you first have to copy it into an NSMutableDictionary as follows:
attrs = NSMutableDictionary.alloc().initWithDictionary_(pdfdoc.documentAttributes())
Now you can modify attrs as you did. There is no need to write PDFDocument.PDFDocumentTitleAttribute as Mark suggested; that won't work. PDFDocumentTitleAttribute is exposed as a module-level constant, so just use it the way you did in your own code.
Here is the full code that works for me:
from Foundation import *
from Quartz import *
url = NSURL.fileURLWithPath_("test.pdf")
pdfdoc = PDFDocument.alloc().initWithURL_(url)
attrs = NSMutableDictionary.alloc().initWithDictionary_(pdfdoc.documentAttributes())
attrs[PDFDocumentTitleAttribute] = "THIS IS THE TITLE"
attrs[PDFDocumentAuthorAttribute] = "A. Author and B. Author"
pdfdoc.setDocumentAttributes_(attrs)
pdfdoc.writeToFile_("mynewfile.pdf")
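If you want to double-check the result, a quick sanity check (not part of the fix, just reopening the file written above and printing its attributes) is:
# Reopen the freshly written file and inspect its metadata.
checkurl = NSURL.fileURLWithPath_("mynewfile.pdf")
checkdoc = PDFDocument.alloc().initWithURL_(checkurl)
print(checkdoc.documentAttributes())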

DISCLAIMER: I'm utterly new to Python, but an old hand at PDF.
To avoid smashing all the existing attributes, you need to start attrs from pdfdoc.documentAttributes(), not {}. setDocumentAttributes_ is almost certainly an overwrite rather than a merge (given your output here).
Second, all the PDFDocument*Attribute constants are part of PDFDocument. My Python ignorance is undoubtedly showing, but shouldn't you be referencing them as attributes rather than as bare variables? Like this:
attrs[PDFDocument.PDFDocumentTitleAttribute] = "THIS IS THE TITLE"
That you can assign to PDFDocumentTitleAttribute leads me to believe it's not a constant.
If I'm right, your attrs will have tried to assign numerous values to a null key. My Python is weak, so I don't know exactly how you'd check that, but examining attrs prior to calling pdfdoc.setDocumentAttributes_() should be revealing.
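For instance, a rough version of that check, building on the question's own code (nothing new here beyond a print):
attrs = {}
attrs[PDFDocumentTitleAttribute] = "THIS IS THE TITLE"
attrs[PDFDocumentAuthorAttribute] = "A. Author and B. Author"
print(attrs)  # a None key here would confirm that the constant isn't defined
pdfdoc.setDocumentAttributes_(attrs)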

Related

Python: How to replace every space in a dictionary with an underscore?

I'm still a beginner, so maybe the answer is very easy, but I could not find a solution (at least one I could understand) online.
Currently I am learning famous works of art through the app "Anki", so I imported a deck for it online containing over 700 pieces.
Sadly the names of the pieces are in English and I would like to learn them in my mother tongue (German). So I wanted to write a script to automate the process of translating all the names inside the app. I started out by creating a dictionary with every artist and their art pieces (filling this dictionary automatically by reading the app is a task for another time).
art_dictionary = {
    "Wassily Kandinsky": "Composition VIII",
    "Zhou Fang": "Ladies Wearing Flowers in Their Hair",
}
My plan is to access Wikipedia (or any other database of artworks) that stores the German name of the painting (because translating it with an English-German dictionary often returns wrong results, since the German title can vary drastically):
replacing every space character inside the name with an underscore
letting Python access the Wikipedia page of said painting:
import re
from urllib.request import urlopen
painting_name = "Composition_VIII"  # this is manual input of course
url = "https://en.wikipedia.org/wiki/" + painting_name  # urlopen needs a full URL with scheme
page = urlopen(url)
somehow accessing the German version of the site and extracting the German name of the painting.
html = page.read().decode("utf-8")
pattern = "<title.*?>.*?</title.*?>" #I think Wikipedia stores the title like <i>Title</i>
match_results = re.search(pattern, html, re.IGNORECASE)
title = match_results.group()
title = re.sub("<.*?>", "", title)
storing it in a list or variable
inserting it into the Anki app
maybe this is impossible or "over-engineering", but I'm learning a lot along the way.
I tried to search for a solution online, but could not find anything similar to my problem.
You can use a dictionary comprehension with the str.replace method to update all the values (the names of the art pieces in this case) in the dictionary.
art_dictionary = {
    "Wassily Kandinsky": "Composition VIII",
    "Zhou Fang": "Ladies Wearing Flowers in Their Hair",
}
art_dictionary = {key: value.replace(' ', '_') for key, value in art_dictionary.items()}
print(art_dictionary)
# Output: {'Wassily Kandinsky': 'Composition_VIII', 'Zhou Fang': 'Ladies_Wearing_Flowers_in_Their_Hair'}
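If the next step is building the Wikipedia URLs from those underscored titles, as sketched in the question, that could look like the following, assuming the article title matches the painting name exactly (which won't always hold):
# Build candidate article URLs from the underscored titles.
for artist, title in art_dictionary.items():
    url = "https://en.wikipedia.org/wiki/" + title
    print(artist, "->", url)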

Django precise_bbcode not properly parsing contents of [code] tags?

I just installed precise_bbcode 1.2.6 in my Django 1.10.4 application.
When I provide the string:
>>> s = """
[code]
for i in var:
    print(var[i])
[/code]
"""
the output is just plain text:
for i in var:
    print(var[i])
However if I change the [i] to [i2] it works fine and formats the text as expected.
I am guessing that precise_bbcode thinks [i] has something to do with italic text (even though it is surrounded by [code] tags and the [i] has no associated closing tag). This behavior is also present for [b] and probably any other recognized tag.
I then tried setting the option render_embedded = False, but I still get the same behavior.
I then tried making my own "code" tag:
from pygments import highlight
from pygments.lexers import PythonLexer
from pygments.formatters import HtmlFormatter

# precise_bbcode's tag API
from precise_bbcode.bbcode.tag import BBCodeTag
from precise_bbcode.tag_pool import tag_pool

class PygmentsBBCodeTag(BBCodeTag):
    name = 'code'

    class Options:
        strip = False
        replace_links = False
        render_embedded = False
        transform_newlines = False
        escape_html = False

    def render(self, value, option=None, parent=None):
        print(value)
        return highlight(value, PythonLexer(), HtmlFormatter())

tag_pool.register_tag(PygmentsBBCodeTag)
And got the same result.
Stranger still, whenever the [i] exists I notice that my PygmentsBBCodeTag's render method is never called (the value is never printed).
Is there any way to tell precise_bbcode to treat the contents between the [code] tags purely as a string and to ignore everything except the closing [/code] tag?
Thanks
I came up with a solution that only requires two lines in bbcode/parser.py to be added.
Note: this worked for my application and, based on my testing, this change did not create unwanted side effects. I cannot guarantee this is the best solution, and I encourage you to test your app thoroughly after making this edit. However, if I do find any undesirable behavior in the future I will try to post the details here.
The procedure:
Find the file: .../dist-packages/precise_bbcode/bbcode/parser.py
Open the file and navigate to the line to change:
For django-precise-bbcode 1.2.6, locate line #242
For django-precise-bbcode 1.2.9, locate line #248
Then modify the contents:
From:
if previous_tag_options.end_tag_closes:
    opening_tags.pop()
To:
if not tag_options.render_embedded:
    opening_tags = []
elif previous_tag_options.end_tag_closes:
    opening_tags.pop()
Those are the only changes needed.
After doing this, any tag with render_embedded = False no longer appears to break when something resembling a bbcode tag shows up within the enclosed text, and the entire string is now passed on to be formatted by the syntax highlighter.
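A quick way to check the behaviour after the edit is to render a [code] block containing a lone [i] from a Django shell. This sketch assumes precise_bbcode's get_parser helper, which is the documented way to render BBCode by hand:
from precise_bbcode.bbcode import get_parser

parser = get_parser()
# With the modified parser, the [i] inside [code] should come through untouched.
print(parser.render("[code]\nfor i in var:\n    print(var[i])\n[/code]"))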
[Update]
Nearly a year later (and after upgrading to version 1.2.9) the original issue remains. However, I have been using the procedure illustrated in this post and so far it still works well!

How can you access the original tag or filename text in an mp3 with a MusicBrainz Picard plugin?

I am trying to access the filename or metadata I have added to my music over the years: (Live), (Demo), (Live: In Athens), (Acoustic), (Live In Las Vegas 2005), (Metallica Cover), (Bonus Track), etc. I've done this to distinguish between tracks easily.
I am going through my music trying to fix it and organize/tag it better with MusicBrainz Picard. But Picard doesn't give access to the original tags or filename, relying only on what is pulled from their database. (As you can see my info isn't standard, and it's just for my own personal collection, so most of it would be useless to add to their database.)
So one of the forum admins/programmers (I think) suggested that it might be possible to do this through a plugin.
I've never programmed in Python and don't know the first thing about it. Lately I've barely been getting into regex, of which I have a fairly decent, though not advanced, understanding.
Ideally I'd like to check the original metadata if possible, and then the filename, pull out anything in parentheses, and save each piece to its own variable in the file, in the order they appear: ExtraInfo1, ExtraInfo2, etc. Then check each against the title to make sure it's not already there, since sometimes the titles themselves contain parentheses, and if not, add them back to the title so the files can be tagged and renamed.
I did find the plugin below, which pulls information out of the title and moves it to the version tag. That is almost exactly what I'm looking for, except instead of taking it from the Title tag, I'd like to take it from the original Title tag or the filename, and then add it to the new Title tag.
Could someone maybe help me with this?
here's the plugin I found:
PLUGIN_NAME = 'Move metadata to version tag'
PLUGIN_AUTHOR = 'Jacob Rask'
PLUGIN_DESCRIPTION = 'Moves song metadata such as "demo", "live" from title and titlesort to version tag.'
PLUGIN_VERSION = "0.1.4"
PLUGIN_API_VERSIONS = ["0.12", "0.15"]
from picard.metadata import register_track_metadata_processor
import re
_p_re = re.compile(r"\(.*?\)")
_v_re = re.compile(r"((\s|-)?(acoustic|akustisk|album|bonus|clean|club|cut|C=64|dance|dirty|disco|encore|extended|inch|maxi|live|original|radio|redux|rehearsal|reprise|reworked|ringtone|[Ss]essions?|short|studio|take|variant|version|vocal)(\s|-)?|.*?(capp?ella)\s?|(\s|-)?(alternat|demo|dub|edit|ext|fail|instr|long|orchestr|record|remaster|remix|strument|[Tt]ape|varv).*?|.*?(complete|mix|inspel).*?)")
def add_title_version(tagger, metadata, release, track):
    if metadata["titlesort"]:
        title = metadata["titlesort"]
    else:
        title = metadata["title"]
    pmatch = _p_re.findall(title)
    if pmatch:  # if there's a parenthesis, investigate
        pstr = pmatch[-1][1:-1]  # get last match and strip parentheses
        vmatch = _v_re.search(pstr)
        if vmatch:
            metadata["titlesort"] = re.sub("\(" + pstr + "\)", "", title).strip()
            metadata["title"] = re.sub("\(" + pstr + "\)", "", title).strip()
            metadata["version"] = pstr

register_track_metadata_processor(add_title_version)
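To illustrate the extraction I'm describing, here is a rough standalone sketch with plain re (the function name and sample filename are just placeholders, not Picard API):
import re

def split_extra_info(name):
    # Collect every "(...)" chunk (ExtraInfo1, ExtraInfo2, ...) and
    # return the name with those chunks stripped out.
    extras = re.findall(r"\(([^)]*)\)", name)
    stripped = re.sub(r"\s*\([^)]*\)", "", name).strip()
    return stripped, extras

print(split_extra_info("Some Song (Live In Las Vegas 2005) (Bonus Track).mp3"))
# -> ('Some Song.mp3', ['Live In Las Vegas 2005', 'Bonus Track'])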
Thanks,
-Dev

Using Mutagen to process all accepted file types

What do I need to do in order to process every file type accepted by mutagen: .ogg, .apev2, .wma, .flac, .mp4, and .asf? (I excluded .mp3 because it has the most documentation on it.)
I'd appreciate it if someone who knows how this is done could provide some pseudo-code to explain the techniques used. The main tags I'd want extracted are the title and artist of the files, and the album if available.
Where to start?
Each tag type has different names for the fields, and they don't all map perfectly.
If you just want a handful of the most important fields, Mutagen has "easy" wrappers for ID3v2 and MP4/iTMF. So, for example, you can do this:
>>> m = mutagen.File(path, easy=True)
>>> m['title']
[u'Sunshine Smile']
>>> m['artist']
[u'Adorable']
>>> m['album']
[u'Against Perfection']
But this will only work for these two file formats. Vorbis, Metaflac, APEv2, and WMT tags are essentially free-form key: value or key: [list of values] mappings. Vorbis does have a recommended set of names for common comment fields, and WM has a set of fields that are mapped by the WMP GUI and the .NET API, but Metaflac and APEv2 don't even have that. In fact, it's pretty common to see both "Artist", from the old ID3v1 field name, and "ARTIST", from Vorbis, in Metaflac comments.
And even for ID3v2, the mappings aren't perfect—iTunes shows the "TPE1" frame as "Artist" and "TPE2" as "Album Artist", while Foobar2000 shows TPE2 as "Artist" and TXXX:ALBUM ARTIST as "Album Artist".
So, to do this right, you have to look at the iTMF, VorbisComment, ID3v2 (or see Wikipedia), and WMT documentation, and then look at the files you have and add some heuristics to decide how to get what you want from them.
For example, you might try something like this:
>>> m = mutagen.File(path)
>>> for tag in ('TPE1', 'TPE2', u'©ART', 'Author', 'Artist', 'ARTIST',
...             'TRACK ARTIST', 'TRACKARTIST', 'TrackArtist', 'Track Artist'):
...     try:
...         artist = unicode(m[tag][0])
...         break
...     except KeyError:
...         pass
A better solution would be to switch on the tag type and only try the appropriate fields for the format.
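For instance, a rough sketch of that per-format approach, for the artist field only; the key lists here are common defaults, not a complete mapping:
import mutagen
from mutagen.mp3 import MP3
from mutagen.mp4 import MP4
from mutagen.asf import ASF

# Candidate "artist" keys per container; extend the same idea for title/album.
ARTIST_KEYS = {
    MP3: ['TPE1'],                      # ID3v2
    MP4: ['\xa9ART'],                   # iTunes-style MP4 atoms
    ASF: ['Author', 'WM/AlbumArtist'],  # WMA
}

def get_artist(path):
    m = mutagen.File(path)
    if m is None:
        return None
    # FLAC, Ogg Vorbis and APEv2 tags are free-form, so just try the
    # common spellings directly for anything not listed above.
    keys = ARTIST_KEYS.get(type(m), ['artist', 'ARTIST', 'Artist'])
    for key in keys:
        if key in m:
            return unicode(m[key][0])
    return None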
Fortunately, other people have done this work for you. You can find almost all the information people have gathered about how different players/taggers map values to each format at the Hydrogen Audio forums and wiki, and various other projects have turned that information into simple tag-mapping tables that you can just pick up and borrow for your code, like this one from MusicBrainz. MusicBrainz Picard even has a wrapper around Mutagen that lets you use a consistent set of metadata names (the ones described here) with all tag types.

exporting wikipedia with Python

I am trying to export a category from the Turkish Wikipedia by following http://www.mediawiki.org/wiki/Manual:Parameters_to_Special:Export . Here is the code I am using:
# -*- coding: utf-8 -*-
import requests
from BeautifulSoup import BeautifulStoneSoup
from sys import version
link = "http://tr.wikipedia.org/w/index.php?title=%C3%96zel:D%C4%B1%C5%9FaAktar&action=submit"
def get(pages=[], category=False, curonly=True):
    params = {}
    if pages:
        params["pages"] = "\n".join(pages)
    if category:
        params["addcat"] = 1
        params["category"] = category
    if curonly:
        params["curonly"] = 1
    headers = {"User-Agent": "Wiki Downloader -- Python %s, contact: Yaşar Arabacı: yasar11732#gmail.com" % version}
    r = requests.post(link, headers=headers, data=params)
    return r.text
print get(category="Matematik")
Since I am trying to get data from the Turkish Wikipedia, I have used its URL. Other things should be self-explanatory. I am getting the form page that you can use to export data instead of the actual XML. Can anyone see what I am doing wrong here? I have also tried making a GET request.
There is no parameter named category; the category name should be in the catname parameter.
But Special:Export was not built for bots, it was built for humans. So, if you use catname correctly, it will return the form again, this time with the pages from the category filled in. Then you are supposed to click "Submit" again, which will return the XML you want.
I think doing this in code would be too complicated. It would be easier if you used the API instead. There are some Python libraries that can help you with that: Pywikipediabot or wikitools.
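If you'd rather not pull in a library, a rough sketch of the same idea straight against api.php could look like this (continuation and batching are glossed over, and "Kategori:" is assumed to be the right namespace prefix on tr.wikipedia):
# -*- coding: utf-8 -*-
import requests

API = "https://tr.wikipedia.org/w/api.php"

# 1. List the pages in the category.
r = requests.get(API, params={
    "action": "query",
    "list": "categorymembers",
    "cmtitle": "Kategori:Matematik",
    "cmlimit": "500",
    "format": "json",
})
titles = [m["title"] for m in r.json()["query"]["categorymembers"]]

# 2. Export the current revisions of those pages as XML.
xml = requests.get(API, params={
    "action": "query",
    "titles": "|".join(titles[:50]),  # the API caps how many titles fit in one request
    "export": 1,
    "exportnowrap": 1,
}).text
print(xml)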
Sorry my original answer was horribly flawed. I misunderstood the original intent.
I did some more experimenting because I was curious. It seems that the code you have above is not necessarily incorrect; it is, in fact, the Special:Export documentation that is misleading. The documentation states that using catname and addcat will add the categories to the output, but instead it only lists the pages and categories within the specified catname inside an HTML form. It seems that Wikipedia actually requires that the pages you wish to download be specified explicitly. Granted, the documentation doesn't appear to be very thorough on that matter. I would suggest that you parse the page for the pages within the category and then explicitly download those pages with your script. I do see an issue with this approach in terms of efficiency: due to the nature of Wikipedia's data, you'll get a lot of pages which are simply category pages of other pages.
As an aside, it could possibly be faster to use the actual corpus of data from Wikipedia which is available for download.
Good luck!
