data exchange between R and python (music21) - python

My goal is to take a text file with a number list generated by R (e.g. 1 2 3 4), and "translate" the numbers into music21 notes (that is, to compose a melody where each note is identified by a number).
Having the number list, one idea I had was creating an R vector of strings that match music21 note names, and trying to get a new output with the note names instead of the numbers. But I'm not very sure about that. Besides, I don't know how to proceed after that.
I also read some topics talking about using R as a subprocess in Python, but again, I couldn't clearly understand how that works (the fact that running the subprocess almost makes my poor old laptop crash had something to do with that...)
How can I proceed here?

Personally, I would try to use only Python. I realize you have little experience with it, but Python is more general purpose than R and should be able to do anything R can do. Trying to use both at the same time seems like it would generate additional complexity and overhead you simply don't need.
It looks like music21 takes notes and lengths; there are also rests. Let's say you have a list of durations called "durations" and a list of notes (and rests) called "notes":
from music21 import note, stream

mymusic = stream.Stream()
notes = ["F4", "F4", "rest", "F4"]
durations = [0.25, 1, 0.25, 1]  # lengths in quarter notes
for n, d in zip(notes, durations):
    if n == "rest":
        mymusic.append(note.Rest(quarterLength=d))
    else:
        mymusic.append(note.Note(n, quarterLength=d))
mymusic.show("midi")
Music21 uses a special kind of list called a stream. We make an empty stream first, and then populate it with notes and durations. zip lets us walk through both lists at the same time. We check whether the entry is supposed to be a rest; if it is, we add a rest with the right duration, otherwise we add a note of the right duration. (Note that I am not a composer; you could generate the notes and durations any way you like :-) ).
If you really wanted to, you could write a CSV file of notes and durations in R and read that in Python. However, I think generating the lists in Python itself is a cleaner approach.
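If the numbers do come from R, the translation step itself could look roughly like the sketch below. The number-to-note table and the sample text are assumptions made up for illustration; the real mapping depends on which pitch each number is supposed to stand for.

```python
# Hypothetical mapping from the numbers in the R output to note names
# in the "F4" style used above; replace it with the scale you want.
NUMBER_TO_NOTE = {1: "C4", 2: "D4", 3: "E4", 4: "F4"}


def numbers_to_note_names(text, mapping=NUMBER_TO_NOTE):
    """Turn whitespace-separated numbers (e.g. "1 2 3 4") into note names."""
    return [mapping[int(token)] for token in text.split()]


# In practice the text would be read from the file written by R, e.g.
#   with open("numbers.txt") as handle:
#       text = handle.read()
note_names = numbers_to_note_names("1 2 3 4")
print(note_names)
```

The resulting list of names can then take the place of the "notes" list in the loop that fills the stream.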
Thanks for introducing me to this music21 library, it looks very neat.

Related

Pyenchant Module - Spell checker

How do I trim the output of the Python PyEnchant module's suggested-words list?
Quite often it gives me a huge list of 20 suggested words that looks awkward when displayed on the screen and also tends to run off the screen.
Like sentinel, I'm not sure if the problem you're having is specific to pyenchant or a python-familiarity issue. If I assume the latter, you could simply select the number of values you'd like as part of your program. In simple form, this could be as easy as:
suggestion_list = pyenchant_function(document_filled_with_typos)
number_of_suggestions = len(suggestion_list)
MAX_SUGGESTIONS = 3  # you choose what you like
if number_of_suggestions > MAX_SUGGESTIONS:
    answer = suggestion_list[0:MAX_SUGGESTIONS]  # the slice end is exclusive, so this keeps items 0 to MAX_SUGGESTIONS-1
else:
    answer = suggestion_list
Note: I'm choosing to be clear rather than concise here, since I'm guessing that will be valued by asker, if asker is unclear on using list indices.
Hope this helps and good luck with python.
Assuming it returns a standard Python list, you use standard Python slicing syntax. E.g. suggestedwords[:10] gets just the first 10.
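As a minimal sketch of the slicing approach: the helper name and the sample words below are made up, and with PyEnchant the list would typically come from something like enchant.Dict("en_US").suggest(word) instead.

```python
def trim_suggestions(suggestions, limit=10):
    """Return at most `limit` suggestions; slicing is safe on shorter lists."""
    return suggestions[:limit]


# Stand-in for the list a spell checker would return (hypothetical data).
suggestions = ["hello", "help", "helot", "hell", "held"]
shortlist = trim_suggestions(suggestions, limit=3)
print(", ".join(shortlist))
```

Because a slice never raises when the list is shorter than the limit, no explicit length check is needed.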

Most efficient way in Python to iterate over a large file (10GB+)

I'm working on a Python script to go through two files - one containing a list of UUIDs, the other containing a large amount of log entries - each line containing one of the UUIDs from the other file. The purpose of the program is to create a list of the UUIDs from file1 and then, each time one of those UUIDs is found in the log file, increment its associated value.
So long story short, count how many times each UUID appears in the log file.
At the moment, I have a list which is populated with UUID as the key, and 'hits' as the value. Then another loop iterates over each line of the log file and checks whether the UUID in the log matches a UUID in the UUID list. If it matches, it increments the value.
for i, logLine in enumerate(logHandle):  # start matching UUID entries in log file to UUID from rulebase
    if logFunc.progress(lineCount, logSize):  # check progress
        print(logFunc.progress(lineCount, logSize))  # print progress in 10% intervals
    for uid in uidHits:
        if logLine.count(uid) == 1:  # for each UUID, check the current line of the log for a match in the UUID list
            uidHits[uid] += 1  # if matched, increment the relevant value in the uidHits list
            break  # as we've already found the match, don't process the rest
    lineCount += 1
It works as it should - but I'm sure there is a more efficient way of processing the file. I've been through a few guides and found that using 'count' is faster than using a compiled regex. I thought reading files in chunks rather than line by line would improve performance by reducing the amount of disk I/O time, but the performance difference on a test file of ~200MB was negligible. If anyone has any other methods I would be very grateful :)
Think functionally!
Write a function which will take a line of the log file and return the uuid. Call it uuid, say.
Apply this function to every line of the log file. If you are using Python 3 you can use the built-in function map; otherwise, you need to use itertools.imap.
Pass this iterator to a collections.Counter.
collections.Counter(map(uuid, open("log.txt")))
This will be pretty much optimally efficient.
A couple comments:
This completely ignores the list of UUIDs and just counts the ones that appear in the log file. You will need to modify the program somewhat if you don't want this.
Your code is slow because you are using the wrong data structures. A dict is what you want here.
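A rough sketch of that idea, extended to respect the list of known UUIDs, might look like this. The UUID pattern and the sample data are assumptions, since the real log format is not shown; the point is that each line costs one extraction plus one set lookup instead of a scan over every known UUID.

```python
import collections
import io
import re

# Assumed UUID layout (8-4-4-4-12 hex digits); adjust to the real log format.
UUID_PATTERN = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


def extract_uuid(line):
    """Return the first UUID found in a log line, or None if there is none."""
    match = UUID_PATTERN.search(line)
    return match.group(0) if match else None


def count_hits(log_lines, known_uuids):
    """Count log lines per known UUID using one set lookup per line."""
    known = set(known_uuids)
    hits = collections.Counter({uid: 0 for uid in known})
    for line in log_lines:  # a file object works here and is read lazily
        uid = extract_uuid(line)
        if uid in known:
            hits[uid] += 1
    return hits


# Tiny in-memory stand-ins for the two files (hypothetical data).
UID_A = "11111111-1111-1111-1111-111111111111"
UID_B = "22222222-2222-2222-2222-222222222222"
UID_UNKNOWN = "33333333-3333-3333-3333-333333333333"
log = io.StringIO(
    "GET /a rule={}\nGET /b rule={}\nGET /c rule={}\n".format(UID_A, UID_UNKNOWN, UID_A)
)
hits = count_hits(log, [UID_A, UID_B])
print(hits[UID_A], hits[UID_B])
```

Seeding the Counter with zeros keeps UUIDs that never appear in the log visible in the result.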
This is not a 5-line answer to your question, but there was an excellent tutorial given at PyCon'08 called Generator Tricks for System Programmers. There is also a followup tutorial called A Curious Course on Coroutines and Concurrency.
The Generator tutorial specifically uses big log file processing as its example.
Like folks above have said, with a 10GB file you'll probably hit the limits of your disk pretty quickly. For code-only improvements, the generator advice is great. In python 2.x it'll look something like
uuid_generator = (line.split(SPLIT_CHAR)[UUID_FIELD] for line in file)
It sounds like this doesn't actually have to be a python problem. If you're not doing anything more complex than counting UUIDs, Unix might be able to solve your problems faster than python can.
cut -d"${SPLIT_CHAR}" -f"${UUID_FIELD}" log_file.txt | sort | uniq -c
Have you tried mincemeat.py? It is a Python implementation of the MapReduce distributed computing framework. I'm not sure whether you'll see a performance gain, since I have not yet processed 10GB of data with it, but you might explore this framework.
Try measuring where most time is spent, using a profiler: http://docs.python.org/library/profile.html
Where best to optimise will depend on the nature of your data: If the list of uuids isn't very long, you may find, for example, that a large proportion of time is spent on the "if logFunc.progress(lineCount, logSize)". If the list is very long, it could help to save the result of uidHits.keys() to a variable outside the loop and iterate over that instead of the dictionary itself, but Rosh Oxymoron's suggestion of finding the id first and then checking for it in uidHits would probably help even more.
In any case, you can eliminate the lineCount variable, and use i instead. And find(uid) != -1 might be better than count(uid) == 1 if the lines are very long.
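The profiling suggestion can be tried with the standard library's cProfile and pstats modules. The function and data below are hypothetical stand-ins for the real matching loop; this is only a sketch of how to collect and print the timings.

```python
import cProfile
import io
import pstats


def count_matches(lines, needle):
    """Stand-in for the real per-line matching work."""
    return sum(1 for line in lines if needle in line)


sample_lines = ["abc", "xbz", "abd"] * 1000  # hypothetical log lines

profiler = cProfile.Profile()
profiler.enable()
total = count_matches(sample_lines, "ab")
profiler.disable()

# Collect the report in a string, sorted by cumulative time per function.
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats()
report = buffer.getvalue()
print(report)
```

For a whole script, running python -m cProfile -s cumulative yourscript.py gives a similar report without any code changes.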

A better way to assign list into a var

Was coding something in Python. Have a piece of code, wanted to know if it can be done more elegantly...
# Statistics format is - done|remaining|200's|404's|size
statf = open(STATS_FILE, 'r').read()
starf = statf.strip().split('|')
done = int(starf[0])
rema = int(starf[1])
succ = int(starf[2])
fails = int(starf[3])
size = int(starf[4])
...
This goes on. I wanted to know whether, after splitting the line into a list, there is a better way to assign each list item to a variable. I have close to 30 lines assigning index values to vars. Just trying to learn more about Python, that's it...
done, rema, succ, fails, size, ... = [int(x) for x in starf]
Better:
labels = ("done", "rema", "succ", "fails", "size")
data = dict(zip(labels, [int(x) for x in starf]))
print(data['done'])
What I don't like about the answers so far is that they stick everything in one expression. You want to reduce the redundancy in your code, without doing too much at once.
If all of the items on the line are ints, then convert them all together, so you don't have to write int(...) each time:
starf = [int(i) for i in starf]
If only certain items are ints--maybe some are strings or floats--then you can convert just those:
for i in 0, 1, 2, 3, 4:
    starf[i] = int(starf[i])
Assigning in blocks is useful; if you have many items--you said you had 30--you can split it up:
done, rema, succ = starf[0:3]
fails, size = starf[3:5]
I might use the csv module with a separator of | (though that might be overkill if you're "sure" the format will always be super-simple, single-line, no-strings, etc, etc). Like your low-level string processing, the csv reader will give you strings, and you'll need to call int on each (with a list comprehension or a map call) to get integers. Other tips include using the with statement to open your file, to ensure it won't cause a "file descriptor leak" (not indispensable in current CPython version, but an excellent idea for portability and future-proofing).
But I question the need for 30 separate barenames to represent 30 related values. Why not, for example, make a collections.namedtuple type with appropriately-named fields, and initialize an instance thereof, then use qualified names for the fields, i.e., a nice namespace? Remember the last koan in the Zen of Python (import this at the interpreter prompt): "Namespaces are one honking great idea -- let's do more of those!"... barenames have their (limited;-) place, but representing dozens of related values is not one -- rather, this situation "cries out" for the "let's do more of those" approach (i.e., add one appropriate namespace grouping the related fields -- a much better way to organize your data).
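A minimal sketch of the namedtuple idea, using only the five fields named in the question's format comment; the type name, the helper, and the sample line are made up for illustration.

```python
import collections

# Field names follow the "done|remaining|200's|404's|size" format from the question.
Stats = collections.namedtuple("Stats", ["done", "rema", "succ", "fails", "size"])


def parse_stats(line):
    """Split a '|'-separated stats line and convert every field to int."""
    return Stats(*(int(field) for field in line.strip().split("|")))


stats = parse_stats("12|30|10|2|4096\n")  # hypothetical sample line
print(stats.done, stats.size)
```

With 30 fields the list of names grows, but the parsing line stays the same, and a line with the wrong number of fields raises a TypeError instead of silently misassigning values.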
Using a Python dict is probably the most elegant choice.
If you put your keys in a list as such:
keys = ("done", "rema", "succ" ... )
somedict = dict(zip(keys, [int(v) for v in values]))
That would work. :-) Looks better than 30 lines too :-)
EDIT: I think there are dict comprehensions now, so that may look even better too! :-)
EDIT Part 2: Also, for the keys collection, you'd want to break that into multiple lines.
EDIT Again: fixed buggy part :)
Thanks for all the answers. So here's the summary -
1. Glenn's answer was to handle this issue in blocks, i.e. done, rema, succ = starf[0:3] etc.
2. Leoluk's approach was more short & sweet, taking advantage of Python's immensely powerful dict comprehensions.
3. Alex's answer was more design oriented. Loved this approach. I know it should be done the way Alex suggested, but a lot of code refactoring needs to take place for that. Not a good time to do it now.
4. townsean - same as 2
I have taken up Leoluk's approach. I am not sure what the speed implication of this is. I have no idea if list/dict comprehensions take a hit on speed of execution. But it reduces the size of my code considerably for now. I'll optimize when the need comes :) Going by - "Premature optimization is the root of all evil"...

How to Sort Arrays in Dictionary?

I'm currently writing a program in Python to track statistics on video games. An example of the dictionary I'm using to track the scores :
ten = 1
sec = 9
fir = 10
thi5 = 6
sec5 = 8
games = {
    'adom': [ten+fir+sec+sec5, "Ancient Domain of Mysteries"],
    'nethack': [fir+fir+fir+sec+thi5, "Nethack"]
}
Right now, I'm going about this the hard way, and making a big long list of nested ifs, but I don't think that's the proper way to go about it. I was trying to figure out a way to sort the dictionary, via the arrays, and then, finding a way to display the first ten that pop up... instead of having to work deep in the if statements.
So... basically, my question is: Do you have any ideas that I could use to make this easier, instead of wayyyy, way harder?
===== EDIT ====
The ten+fir expressions produce numbers. I want to find a way to sort the lists (I lack the knowledge of proper terminology) by that number (basically, whichever ones have the highest number in the first part of the array go first).
Here's an example of my current way of going about it (though it's incomplete, due to it being very tiresome): Example Nests (paste2) (let's try this one?)
==== SECOND EDIT ====
In case someone doesn't see my comment below :
ten, fir, et cetera - these are just variables for scores. Basically, it goes from a top ten list into a variable number.
ten = 1, nin = 2, fir = 10, fir5 = 10, sec5 = 8, sec = 9...
so : 'adom': [ten+fir+sec+sec5, "Ancient Domain of Mysteries"] actually registers as : 'adom': [1+10+9+8, "Ancient Domain of Mysteries"] , which ends up looking like :
'adom': [28, "Ancient Domain of Mysteries"]
So, basically, if I ended up doing the "top two" out of my example, it'd be :
((1)) Nethack (48)
((2)) ADOM (28)
I'd write an actual number, but I'm thinking of changing a few things up, so the numbers might be a touch different, and I wouldn't want to rewrite it.
== THIRD (AND HOPEFULLY THE FINAL) EDIT ==
Fixed my original code example.
How about something like this:
scores = sorted(games.items(), key=lambda item: item[1][0], reverse=True)
top_ten = scores[:10]
This gives the ten highest-scoring items, sorted by the first element of each array.
I'm not sure if this is what you want though, please update the question (and fix the example link) if you need something else...
import heapq
top_ten = heapq.nlargest(10, games.items(), key=lambda item: item[1][0])
is the most direct way to get the top ten key / value pairs, sorted by the first item of each "value" list. If you can define more precisely what output you want (just the names, the name / value pairs, or what else?) and the sorting criterion, this is easy to adjust, of course.
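Put together with the dictionary from the question, a small runnable sketch could look like this. The helper name score is made up, and the printed layout only imitates the "top two" example from the question. The key function receives one (key, value) pair per item, so it has to take a single argument.

```python
import heapq

ten, sec, fir, thi5, sec5 = 1, 9, 10, 6, 8
games = {
    'adom': [ten + fir + sec + sec5, "Ancient Domain of Mysteries"],
    'nethack': [fir + fir + fir + sec + thi5, "Nethack"],
}


def score(item):
    """item is one (key, [points, title]) pair from games.items()."""
    return item[1][0]


# The ten entries with the highest points, highest first.
top = heapq.nlargest(10, games.items(), key=score)
for rank, (key, (points, title)) in enumerate(top, start=1):
    print("(({})) {} ({})".format(rank, title, points))
```

For a dictionary this small, sorted(games.items(), key=score, reverse=True)[:10] gives the same result; heapq.nlargest mainly pays off when there are many more entries than the ten being kept.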
Wim's solution is good, but I'd say that you should probably go the extra mile and push this work off onto a database, rather than relying on Python. Python interfaces well with most types of databases, where much of what you're exploring is already a solved problem.
For example, instead of worrying about shifting your dictionaries to various other data types in order to properly sort them, you can simply get all the data for each pertinent entry pre-sorted based on the criteria of your query. There goes the need for convoluted sorting and resorting right there.
While dictionaries are tempting to use, because they give the illusion of database-like abilities to access data based on its attributes, I still think they stumble quite a bit with respect to implementation. I don't really have any numbers to throw at you, but just from personal experience, anything you do in Python when it comes to manipulating large amounts of data, you can do much faster and more efficiently, both in code and computation, with something like MySQL.
I'm not sure what you have planned as far as the structure of your data goes, but along with adding data, changing its structure is a lot easier using a database, too.
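To give a feel for the database route, here is a sketch that swaps in the standard library's sqlite3 module for MySQL so that it runs without a server. The table layout, column names, and scores are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    "CREATE TABLE games (slug TEXT PRIMARY KEY, title TEXT, score INTEGER)"
)
conn.executemany(
    "INSERT INTO games (slug, title, score) VALUES (?, ?, ?)",
    [
        ("adom", "Ancient Domain of Mysteries", 28),
        ("nethack", "Nethack", 45),
    ],
)

# The database does the sorting and the top-ten cut in one query.
top_rows = conn.execute(
    "SELECT title, score FROM games ORDER BY score DESC LIMIT 10"
).fetchall()
for rank, (title, score) in enumerate(top_rows, start=1):
    print(rank, title, score)
conn.close()
```

Whether this is worth it for a handful of games is a judgment call; it starts to pay off once the data no longer fits comfortably in a hand-written dictionary.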

How should I organise my functions with pyparsing?

I am parsing a file with Python and pyparsing (it's the report file for PSAT in Matlab, but that isn't important). Here is what I have so far. I think it's a mess and would like some advice on how to improve it. Specifically, how should I organise my grammar definitions with pyparsing?
Should I have all my grammar definitions in one function? If so, it's going to be one huge function. If not, then how do I break it up? At the moment I have split it at the sections of the file. Is it worth making loads of functions that only ever get called once from one place? Neither really feels right to me.
Should I place all my input and output code in a separate file to the other class functions? It would make the purpose of the class much clearer.
I'm also interested to know if there is an easier way to parse a file, do some sanity checks and store the data in a class. I seem to spend a lot of my time doing this.
(I will accept answers of it's good enough or use X rather than pyparsing if people agree)
I could go either way on using a single big method to create your parser vs. taking it in steps the way you have it now.
I can see that you have defined some useful helper utilities, such as slit ("suppress Literal", I presume), stringtolits, and decimaltable. This looks good to me.
I like that you are using results names, they really improve the robustness of your post-parsing code. I would recommend using the shortcut form that was added in pyparsing 1.4.7, in which you can replace
busname.setResultsName("bus1")
with
busname("bus1")
This can declutter your code quite a bit.
I would look back through your parse actions to see where you are using numeric indexes to access individual tokens, and go back and assign results names instead. Here is one case, where GetStats returns (ngroup + sgroup).setParseAction(self.process_stats). process_stats has references like:
self.num_load = tokens[0]["loads"]
self.num_generator = tokens[0]["generators"]
self.num_transformer = tokens[0]["transformers"]
self.num_line = tokens[0]["lines"]
self.num_bus = tokens[0]["buses"]
self.power_rate = tokens[1]["rate"]
I like that you have Group'ed the values and the stats, but go ahead and give them names, like "network" and "soln". Then you could write this parse action code as (I've also converted to the - to me - easier-to-read object-attribute notation instead of dict element notation):
self.num_load = tokens.network.loads
self.num_generator = tokens.network.generators
self.num_transformer = tokens.network.transformers
self.num_line = tokens.network.lines
self.num_bus = tokens.network.buses
self.power_rate = tokens.soln.rate
Also, a style question: why do you sometimes use the explicit And constructor, instead of using the '+' operator?
busdef = And([busname.setResultsName("bus1"),
              busname.setResultsName("bus2"),
              integer.setResultsName("linenum"),
              decimaltable("pf qf pl ql".split())])
This is just as easily written:
busdef = (busname("bus1") + busname("bus2") +
          integer("linenum") +
          decimaltable("pf qf pl ql".split()))
Overall, I think this is about par for a file of this complexity. I have a similar format (proprietary, unfortunately, so cannot be shared) in which I built the code in pieces similar to the way you have, but in one large method, something like this:
def parser():
    header = Group(...)
    inputsummary = Group(...)
    jobstats = Group(...)
    measurements = Group(...)
    return header("hdr") + inputsummary("inputs") + jobstats("stats") + measurements("meas")
The Group constructs are especially helpful in a large parser like this, to establish a sort of namespace for results names within each section of the parsed data.
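As a small self-contained illustration of named Groups acting as namespaces: the input text and grammar below are invented for this sketch and are not the PSAT report format. Only the results names "network", "soln", "buses", "lines" and "rate" echo the discussion above.

```python
from pyparsing import Group, Suppress, Word, nums

# Convert matched digits to int at parse time.
integer = Word(nums).setParseAction(lambda tokens: int(tokens[0]))

# Each Group gets its own results names, so names cannot clash across sections.
network = Group(
    Suppress("buses") + integer("buses") + Suppress("lines") + integer("lines")
)
soln = Group(Suppress("rate") + integer("rate"))
report = network("network") + soln("soln")

result = report.parseString("buses 14 lines 20 rate 100")  # made-up input
print(result.network.buses, result.network.lines, result.soln.rate)
```

The attribute-style access on the last line is the same notation used in the rewritten parse action above, which keeps the post-parsing code independent of token positions.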
