Saving NLTK Alignments

Saving NLTK Alignments - python

I am using NLTK 3.2 and I was wondering how you save NLTK alignments. I have found this link: How to save Python NLTK alignment models for later use?, but it seems that there is no align() method. Also, I figured out that nltk.align has been renamed to nltk.translate, but I still cannot access the align() method. Thanks!

Yeah, you are right. The method align became private in the current version. So, if you want to use that method, you have to modify the source code.
To modify the source code, you have to get to the directory of the file. You can find that directory by:
Open your terminal
Type these commands:
>>> python
>>> import nltk
>>> nltk.translate.ibm1.__file__
Here is a screen-shot of what it should look like:
Now, you have to go to that directory and find the file 'ibm1.py'. Open the file and modify the method __align to align.
It's the last method in the file.
CAUTION:
The align method returns Alignment class instead of AlignedSent in earlier versions.

Related

How can I convert docx to doc using Python?

How can I convert file.docx to file.doc using Python? I have a code that outputs a file in docx format, but this program is for someone who can only use Word 2003, so I need to convert that file to .doc using Python. How can I do it? Thank you in advance.

Bit clunky, but you could use pywinauto to programatically open your .docx documents in word, then save-as .doc. It'd be using word to do the conversion, so it should be as clean as you could get.
This is a snippet of what I've used for converting to pdf within word (it was just a test). You'd have to follow the keystrokes necessary to save as a .doc
import pywinauto
from pywinauto.application import Application
app1 = Application(backend="uia").connect(path="C:\\Program Files (x86)\\Microsoft Office\\root\\Office16\\WINWORD.EXE")
wordhndl = app1.top_window()
wordhndl.type_keys('^o')
wordhndl.type_keys('%f')
wordhndl.type_keys('^o')
wordhndl.type_keys('^o')
#Now that we're in a sub-window, using the top_window() handle doesn't work...
#Instead refer to absolute (using friendly_class_name())
app1.Dialog.Open.type_keys("Y:\\996.Software\\04.Python\\Test\\SampleDoc1.docx")
app1.Dialog.Open.type_keys('~')
#Publish it to pdf
app1.SampleDoc1docx.type_keys('%f')
app1.SampleDoc1docx.type_keys('e')
app1.SampleDoc1docx.type_keys('p')
app1.SampleDoc1docx.type_keys('a')
app1.SampleDoc1docx.PublishasPDForXPS.Publish.type_keys('~')
#Deal with popups & prompts
if app1.Dialog.PublishasPDForXPS.ConfirmSaveAs.exists():
app1.Dialog.PublishasPDForXPS.ConfirmSaveAs.Yes.click() #This line can take some time...
I think the .doc keystrokes would be (not tested)
app1.SampleDoc1docx.type_keys('%f')
app1.SampleDoc1docx.type_keys('a')
app1.SampleDoc1docx.type_keys('y')
app1.SampleDoc1docx.type_keys('4')
app1.SampleDoc1docx.type_keys('{DOWN}')
app1.SampleDoc1docx.type_keys('{DOWN}')
app1.SampleDoc1docx.type_keys('~')
app1.SampleDoc1docx.type_keys('{RIGHT}')
app1.SampleDoc1docx.type_keys('~')
But... the better solution is to use word. I've used VBA within word to do this exact thing before. Don't have the code to hand, but a good pointer would be:
https://www.datanumen.com/blogs/3-quick-ways-to-batch-convert-word-doc-to-docx-files-and-vice-versa/

you can do something of the likes of:
import docx
doc = docx.Document("myWordxFile.docx")
doc.save('myNewWordFile.doc')
check this too. Good luck!

How to write lists as a file attribute using PyGTK

A comment by the Linux Mint founder stated that file emblems in newer versions of Nemo can be programatically accessed, as seen in the following example using Python and PyGTK:
import gio
file = gio.File("/home/guest/Documents/Todo")
emblems = file.query_info("metadata::emblems")
print emblems.get_attribute_as_string("metadata::emblems")
Which outputs something in the format
[emblem-important, emblem-urgent]
The object stored as metadata::emblems, as you can see, is a list (I guess of strings). However, in the PyGTK documentation on Gio.FileInfo, I can't find a method to access (read or write) attributes of array types.
Is there any method to do this (i.e. reading single emblems or setting new emblems programatically)? If yes, how can I accomplish this?

It's strange that there's no convenience method to do this, but you can call File.set_attribute directly, using STRINGV as the type.
f.set_attribute('metadata::emblems', gio.FILE_ATTRIBUTE_TYPE_STRINGV, ['emblem-important', 'emblem-urgent'])

Storing a file in the clipboard in python

Is there a way to use the win32clipboard module to store a reference to a file in the windows clipboard in python. My goal is to paste an image in a way that allows transparency. If I drag and drop a 'png' file into OneNote or I copy the file and then paste it into OneNote, this seems to preserve transparency. As far as I can tell, the clipboard can't store transparent images which is why it has to be a reference to a file.
My research suggests that it might involve the win32clipboard.CF_HDrop attribute but I'm not sure.
So, just to summarize, my goal is to have a bit of python code which I can click and which uses a specific file on my Desktop named 'img.png' for instance. The result is that 'img.png' gets stored in the clipboard and can be pasted into other programs. Essentially, the same behavior as if I selected the file on the Desktop myself, right-clicked and selected 'Copy'.
EDIT:
This page seems to suggest there is a way using win32clipboard.CF_HDrop somehow:
http://timgolden.me.uk/pywin32-docs/win32clipboard__GetClipboardData_meth.html
It says "CF_HDROP" is associated with "a tuple of Unicode filenames"

from PythonMagick import Image
Image("img.png").write("clipboard:")
Grab the windows binaries for PythonMagick

I write this as an answer, although it's just a step that might help you, because comments don't have a lot of formatting options.
I wrote this sample script:
import win32clipboard as clp, win32api
clp.OpenClipboard(None)
rc= clp.EnumClipboardFormats(0)
while rc:
try: format_name= clp.GetClipboardFormatName(rc)
except win32api.error: format_name= "?"
print "format", rc, format_name
rc= clp.EnumClipboardFormats(rc)
clp.CloseClipboard()
Then I selected an image file in explorer and copied it; then, the script reports the following available clipboard formats:
format 49161 DataObject
format 49268 Shell IDList Array
format 15 ?
format 49519 DataObjectAttributes
format 49292 Preferred DropEffect
format 49329 Shell Object Offsets
format 49158 FileName
format 49159 FileNameW
format 49171 Ole Private Data
This “Preferred DropEffect” seems suspicious, although I'm far from a Windows expert. I would try first with FileNameW, though, since this might do the job for you (I don't have OneNote installed, sorry). It seems it expects as data only the full pathname encoded as 'utf-16-le' with a null character (i.e encoded as '\0\0') at the end.

python: edit ISO file directly

Is it possible to take an ISO file and edit a file in it directly, i.e. not by unpacking it, changing the file, and repacking it?
It is possible to do 1. from Python? How would I do it?

You can use for listing and extracting, I tested the first.
https://github.com/barneygale/iso9660/blob/master/iso9660.py
import iso9660
cd = iso9660.ISO9660("/Users/murat/Downloads/VisualStudio6Enterprise.ISO")
for path in cd.tree():
print path
https://github.com/barneygale/isoparser
import isoparser
iso = isoparser.parse("http://www.microsoft.com/linux.iso")
print iso.record("boot", "grub").children
print iso.record("boot", "grub", "grub.cfg").content

Have you seen Hachoir, a Python library to "view and edit a binary stream field by field"? I haven't had a need to try it myself, but ISO 9660 is listed as a supported parser format.

PyCdlib can read and edit ISO-9660 files, as well as Joliet, Rock Ridge, UDF, and El Torito extensions. It has a bunch of detailed examples in its documentation, including one showing how to edit a file in-place. At the time of writing, it cannot add or remove files, or edit directories. However, it is still actively maintained, in contrast to the libraries linked in older answers.

Of course, as with any file.
It can be done with open/read/write/seek/tell/close operations on a file. Pack/unpack the data with struct/ctypes. It would require serious knowledge of the contents of ISO, but I presume you already know what to do. If you're lucky you can try using mmap - the interface to file contents string-like.

abstracting the conversion between id3 tags, m4a tags, flac tags

I'm looking for a resource in python or bash that will make it easy to take, for example, mp3 file X and m4a file Y and say "copy X's tags to Y".
Python's "mutagen" module is great for manupulating tags in general, but there's no abstract concept of "artist field" that spans different types of tag; I want a library that handles all the fiddly bits and knows fieldname equivalences. For things not all tag systems can express, I'm okay with information being lost or best-guessed.
(Use case: I encode lossless files to mp3, then go use the mp3s for listening. Every month or so, I want to be able to update the 'master' lossless files with whatever tag changes I've made to the mp3s. I'm tired of stubbing my toes on implementation differences among formats.)

I needed this exact thing, and I, too, realized quickly that mutagen is not a distant enough abstraction to do this kind of thing. Fortunately, the authors of mutagen needed it for their media player QuodLibet.
I had to dig through the QuodLibet source to find out how to use it, but once I understood it, I wrote a utility called sequitur which is intended to be a command line equivalent to ExFalso (QuodLibet's tagging component). It uses this abstraction mechanism and provides some added abstraction and functionality.
If you want to check out the source, here's a link to the latest tarball. The package is actually a set of three command line scripts and a module for interfacing with QL. If you want to install the whole thing, you can use:
easy_install QLCLI
One thing to keep in mind about exfalso/quodlibet (and consequently sequitur) is that they actually implement audio metadata properly, which means that all tags support multiple values (unless the file type prohibits it, which there aren't many that do). So, doing something like:
print qllib.AudioFile('foo.mp3')['artist']
Will not output a single string, but will output a list of strings like:
[u'The First Artist', u'The Second Artist']
The way you might use it to copy tags would be something like:
import os.path
import qllib # this is the module that comes with QLCLI
def update_tags(mp3_fn, flac_fn):
mp3 = qllib.AudioFile(mp3_fn)
flac = qllib.AudioFile(flac_fn)
# you can iterate over the tag names
# they will be the same for all file types
for tag_name in mp3:
flac[tag_name] = mp3[tag_name]
flac.write()
mp3_filenames = ['foo.mp3', 'bar.mp3', 'baz.mp3']
for mp3_fn in mp3_filenames:
flac_fn = os.path.splitext(mp3_fn)[0] + '.flac'
if os.path.getmtime(mp3_fn) != os.path.getmtime(flac_fn):
update_tags(mp3_fn, flac_fn)

I have a bash script that does exactly that, atwat-tagger. It supports flac, mp3, ogg and mp4 files.
usage: `atwat-tagger.sh inputfile.mp3 outputfile.ogg`
I know your project is already finished, but somebody who finds this page through a search engine might find it useful.

Here's some example code, a script that I wrote to copy tags between
files using Quod Libet's music format classes (not mutagen's!). To run
it, just do copytags.py src1 dest1 src2 dest2 src3 dest3, and it
will copy the tags in sec1 to dest1 (after deleting any existing tags
on dest1!), and so on. Note the blacklist, which you should tweak to
your own preference. The blacklist will not only prevent certain tags
from being copied, it will also prevent them from being clobbered in
the destination file.
To be clear, Quod Libet's format-agnostic tagging is not a feature of mutagen; it is implemented on top of mutagen. So if you want format-agnostic tagging, you need to use quodlibet.formats.MusicFile to open your files instead of mutagen.File.
Code can now be found here: https://github.com/DarwinAwardWinner/copytags
If you also want to do transcoding at the same time, use this: https://github.com/DarwinAwardWinner/transfercoder
One critical detail for me was that Quod Libet's music format classes
expect QL's configuration to be loaded, hence the config.init line in my
script. Without that, I get all sorts of errors when loading or saving
files.
I have tested this script for copying between flac, ogg, and mp3, with "standard" tags, as well as arbitrary tags. It has worked perfectly so far.
As for the reason that I didn't use QLLib, it didn't work for me. I suspect it was getting the same config-related errors as I was, but was silently ignoring them and simply failing to write tags.

You can just write a simple app with a mapping of each tag name in each format to an "abstract tag" type, and then its easy to convert from one to the other. You don't even have to know all available types - just those that you are interested in.
Seems to me like a weekend-project type of time investment, possibly less. Have fun, and I won't mind taking a peek at your implementation and even using it - if you won't mind releasing it of course :-) .

There's also tagpy, which seems to work well.

Since the other solutions have mostly fallen off the net, here is what I came up, based on the python mediafile library (python3-mediafile in Debian GNU/Linux).
#!/usr/bin/python3
import sys
from mediafile import MediaFile
src = MediaFile (sys.argv [1])
dst = MediaFile (sys.argv [2])
for field in src.fields ():
try:
setattr (dst, field, getattr (src, field))
except:
pass
dst.save ()
Usage: mediafile-mergetags srcfile dstfile
It copies (merges) all tags from srcfile into dstfile, and seems to work properly with flac, opus, mp3 and so on, including copying album art.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.