I am parsing a file with Python and pyparsing (it's the report file for PSAT in MATLAB, but that isn't important). Here is what I have so far. I think it's a mess and would like some advice on how to improve it. Specifically, how should I organise my grammar definitions with pyparsing?
Should I have all my grammar definitions in one function? If so, it's going to be one huge function. If not, how do I break it up? At the moment I have split it at the sections of the file. Is it worth making loads of functions that only ever get called once from one place? Neither option really feels right to me.
Should I place all my input and output code in a separate file from the other class functions? It would make the purpose of the class much clearer.
I'm also interested to know if there is an easier way to parse a file, do some sanity checks and store the data in a class. I seem to spend a lot of my time doing this.
(I will accept answers of "it's good enough" or "use X rather than pyparsing" if people agree.)
I could go either way on using a single big method to create your parser vs. taking it in steps the way you have it now.
I can see that you have defined some useful helper utilities, such as slit ("suppress Literal", I presume), stringtolits, and decimaltable. This looks good to me.
I like that you are using results names, they really improve the robustness of your post-parsing code. I would recommend using the shortcut form that was added in pyparsing 1.4.7, in which you can replace
busname.setResultsName("bus1")
with
busname("bus1")
This can declutter your code quite a bit.
I would look through your parse actions to see where you are using numeric indexes to access individual tokens, and assign results names instead. Here is one case: GetStats returns (ngroup + sgroup).setParseAction(self.process_stats), and process_stats has references like:
self.num_load = tokens[0]["loads"]
self.num_generator = tokens[0]["generators"]
self.num_transformer = tokens[0]["transformers"]
self.num_line = tokens[0]["lines"]
self.num_bus = tokens[0]["buses"]
self.power_rate = tokens[1]["rate"]
I like that you have Group'ed the values and the stats, but go ahead and give them names, like "network" and "soln". Then you could write this parse action code as follows (I've also converted to object-attribute notation, which I find easier to read than dict-element notation):
self.num_load = tokens.network.loads
self.num_generator = tokens.network.generators
self.num_transformer = tokens.network.transformers
self.num_line = tokens.network.lines
self.num_bus = tokens.network.buses
self.power_rate = tokens.soln.rate
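For illustration, here is a minimal, runnable sketch of attaching those group names (the grammar below is a made-up stand-in for your ngroup/sgroup, not the actual PSAT format):

from pyparsing import Group, Suppress, Word, nums

integer = Word(nums).setParseAction(lambda t: int(t[0]))

# made-up stand-ins for the real ngroup/sgroup expressions
ngroup = Group(Suppress("loads:") + integer("loads") +
               Suppress("buses:") + integer("buses"))("network")
sgroup = Group(Suppress("rate:") + integer("rate"))("soln")

tokens = (ngroup + sgroup).parseString("loads: 10 buses: 42 rate: 100")
print(tokens.network.loads)   # -> 10
print(tokens.soln.rate)       # -> 100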
Also, a style question: why do you sometimes use the explicit And constructor, instead of using the '+' operator?
busdef = And([busname.setResultsName("bus1"),
              busname.setResultsName("bus2"),
              integer.setResultsName("linenum"),
              decimaltable("pf qf pl ql".split())])
This is just as easily written:
busdef = (busname("bus1") + busname("bus2") +
          integer("linenum") +
          decimaltable("pf qf pl ql".split()))
Overall, I think this is about par for a file of this complexity. I have a similar format (proprietary, unfortunately, so it cannot be shared) in which I built the code in pieces similar to the way you have, but in one large method, something like this:
def parser():
    header = Group(...)
    inputsummary = Group(...)
    jobstats = Group(...)
    measurements = Group(...)
    return header("hdr") + inputsummary("inputs") + jobstats("stats") + measurements("meas")
The Group constructs are especially helpful in a large parser like this, to establish a sort of namespace for results names within each section of the parsed data.
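Here is a toy illustration of that namespacing (not your format; both sections deliberately reuse the results name "n" without colliding):

from pyparsing import Group, Suppress, Word, nums

integer = Word(nums).setParseAction(lambda t: int(t[0]))

# both sections use the results name "n", but each Group gets its own namespace
inputsummary = Group(Suppress("INPUTS") + integer("n"))("inputs")
jobstats = Group(Suppress("JOBS") + integer("n"))("stats")

tokens = (inputsummary + jobstats).parseString("INPUTS 12 JOBS 3")
print(tokens.inputs.n)   # -> 12
print(tokens.stats.n)    # -> 3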
I'm building a module in Python that focuses mainly on mathematics. I thought it would be a nice touch to add support for mathematical series. I had no issues implementing arithmetic progressions and geometric series, but I stumbled upon a problem when attempting to implement recursive series. I've come up with a solution to that, but for it I first need to extract the elements of the series from a user-input string that represents the series. I think that regex might be the best option, but it is my biggest phobia in the world, so I'd really appreciate the help.
For example, for a string like
"a_n = a_{n-1} + a_{n-2}"
I want to have a set
{"a_n","a_{n-1}","a_{n-2}"}
It also needs to support more complicated recursive definitions, like:
"a_n*a_{n-1} = ln(a_{n-2} * a_n)*a_{n-3}"
the set will be:
{"a_n","a_{n-1}","a_{n-2}","a_{n-3}"}
Feel free to make some minor syntax changes if you think it'll make the task easier.
The regex is easy: a_(?:n|{n-\d})
a_
then either n
or {n-\d}
import re
ptn = re.compile(r"a_(?:n|{n-\d})")
print(set(ptn.findall("a_n = a_{n-1} + a_{n-2}")))
# {'a_{n-1}', 'a_n', 'a_{n-2}'}
print(set(ptn.findall("a_n*a_{n-1} = ln(a_{n-2} * a_n)*a_{n-3}")))
# {'a_{n-1}', 'a_{n-3}', 'a_n', 'a_{n-2}'}
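One caveat I'd add: \d matches a single digit, so a_{n-10} would not be caught. If offsets past 9 are possible, a \d+ variant (my addition, not part of the original pattern) handles them:

import re

ptn = re.compile(r"a_(?:n|\{n-\d+\})")   # \d+ allows multi-digit offsets
print(set(ptn.findall("a_n = a_{n-12} + a_{n-2}")))
# {'a_n', 'a_{n-12}', 'a_{n-2}'}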
I have recently stumbled upon a task involving some CSV files that are, to say the least, very poorly organized, with one cell containing what should be multiple separate columns. I would like to use this data in a Python script, but I want to know if it is possible to delete a portion of each row (all of it after a certain point) and then write that to a dictionary.
Although I can't show the exact contents of the CSV, it looks like this:
useful. useless useless useless useless
I understand that this will most likely require either a regular expression or an endswith check, but doing all of that to a CSV file is beyond me. Also, the period written after useful in the CSV should be removed as well; it is not a typo.
If you know the character you want to split on you can use this simple method:
good_data = bad_data.split(".")[0]
good_data = good_data.strip() # remove excess whitespace at start and end
This method will always work: split returns a list which always has at least one entry (the full string), whereas using index may throw an exception. You can also limit the number of splits if necessary using split(".", N).
https://docs.python.org/2/library/stdtypes.html#str.split
>>> "good.bad.ugly".split(".", 1)
['good', 'bad.ugly']
>>> "nothing bad".split(".")
['nothing bad']
>>> stuff = "useful useless"
>>> stuff = stuff[:stuff.index(".")]
ValueError: substring not found
Actual Answer
OK, then notice that you can use indexing for strings just like you do for lists, i.e. "this is a very long string but we only want the first 4 letters"[:4] gives "this". If we now knew the index of the dot, we could just get what you want like that. For exactly that, strings have the index method. So in total you do:
stuff = "useful. useless useless useless useless"
stuff = stuff[:stuff.index(".")]
Now stuff is very useful :).
In case we are talking about a file containing multiple lines like that, you could do it for each line: split the line at "," and put everything into a dictionary.
data = {}
with open("./test.txt") as f:
    for i, line in enumerate(f.read().split("\n")):
        csv_line = line[:line.index(".")]   # keep only the part before the dot
        for j, col in enumerate(csv_line.split(",")):
            data[(i, j)] = col              # key the dict by (row, column)
How one would do this
Notice that most people would not want to do this by hand. Working with tabular data is a common task, and there is a library called pandas for that. Maybe it would be a good idea to familiarise yourself a bit more with Python before you dive into pandas, though. I think a good point to start is this. Using pandas, your task would look like this:
import pandas as pd
pd.read_csv("./test.txt", comment=".")
giving you what is called a dataframe.
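One caveat: judging from the sample row, the remaining fields look whitespace-separated rather than comma-separated, so you would probably also pass a separator (an assumption on my part; adjust to the real file):

import pandas as pd

# comment="." drops everything from the dot onwards on each line;
# sep=r"\s+" splits the remaining fields on whitespace
df = pd.read_csv("./test.txt", comment=".", sep=r"\s+", header=None)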
I'm performing data analysis on a large number of variables contained in an hdf5 file. The code I've written loops over a list of variables and then performs analyses and outputs some graphs. It would be nice to be able to use the code for combinations of variables (like A+B or sqrt((A**2)+(B**2))) without having to put in a bunch of if statements, i.e. to execute the statement in the string when loading the variables from my hdf5 file. If possible, I would like to avoid using pandas, but I'm not completely against it if that's the only efficient way to do what I want.
My hdf5 file looks something like this :
HDF5 "blahblah.hdf5" {
FILE_CONTENTS {
group /
group /all
dataset /all/blargle
dataset /all/blar
}
}
And what I would like to do is this (this functionality doesn't exist in h5py, so it fails):
myfile = h5py.File('/myfile/blahblah.hdf5')
varlist = ['blargle', 'blar', 'blargle+blar']
savelist = [None]*len(varlist)
for ido, varname in enumerate(varlist):
    savelist[ido] = myfile['all'][varname]
    # would like to evaluate varname upon loading
First you have to ask yourself: do I know the arithmetic operations only at runtime, or already at programming time? If you already know them now, just write a Python function for them.
If you know them only at runtime, you will need a parser. While there are libraries specialized in this out there (example), Python itself is already a parser: with exec you can execute strings containing Python statements, and with eval you can evaluate strings containing Python expressions.
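A quick illustration of that difference, since it matters below: exec discards the result, while eval returns it.

x = 2
exec("x + 40")         # runs, but the result is thrown away (exec returns None)
print(eval("x + 40"))  # eval returns the expression's value -> 42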
Now all you need to define is some sort of grammar for your specific language; you need some conventions. You have them already: it seems you want to convert myfile['all']['blargle+blar'] into myfile['all']['blargle']+myfile['all']['blar']. In order to make life easier, I recommend putting the names of the data sets in brackets:
varlist = ['[blargle]', '[blar]', '[blargle]+[blar]', 'sqrt(([blargle]**2)+([blar]**2))']
Then simply replace each term in brackets by myfile['all'][name_in_brackets] and evaluate the resulting string with eval:
import re

for ido, varname in enumerate(varlist):
    # replace every [name] with a lookup in myfile, then evaluate the expression
    # (for the sqrt example you would also need e.g. from numpy import sqrt)
    term = re.sub(r'\[(.*?)\]', lambda m: "myfile['all']['{}']".format(m.group(1)), varname)
    savelist[ido] = eval(term)
The re.sub line that matches the variable names is actually not tested by me.
And still another drawback: I'm not sure that reading data sets from an hdf5 object is fast. Since the same data set may be read multiple times, and hdf5 may not be caching, it might be better to store the data sets intermediately before doing computations on them.
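A sketch of that caching idea, under the question's assumptions (the bracket convention from above; mapping sqrt to numpy.sqrt is my choice, not something from the question):

import re
import h5py
import numpy as np

myfile = h5py.File('/myfile/blahblah.hdf5', 'r')
varlist = ['[blargle]', '[blar]', '[blargle]+[blar]']

# read each referenced data set into memory exactly once
cache = {}
for var in varlist:
    for name in re.findall(r'\[(.*?)\]', var):
        if name not in cache:
            cache[name] = myfile['all'][name][...]   # [...] loads it as a numpy array

# evaluate each expression against the in-memory cache instead of the file
savelist = []
for var in varlist:
    term = re.sub(r'\[(.*?)\]', lambda m: "cache['{}']".format(m.group(1)), var)
    savelist.append(eval(term, {'cache': cache, 'sqrt': np.sqrt}))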
My goal is to take a text file with a number list generated by R (e.g. 1 2 3 4) and "translate" the numbers into music21 notes (that is, to compose a melody where each note is identified with a number).
Having the number list, one idea I had was creating an R vector of strings that match music21 note names, and trying to get a new output with the note names instead of the numbers. But I'm not very sure of that. Besides, I don't know how to proceed after that.
I also read some topics talking about using R as a subprocess in Python, but again, I couldn't clearly understand how that works (the fact that running the subprocess almost makes my poor old laptop crash had something to do with that...)
How can I proceed here?
Personally, I would try to use only Python. I realize you have little experience with it, but Python is more general-purpose than R and should be able to do anything R can do. Trying to use both at the same time seems like it would generate additional complexity and overhead you simply don't need.
It looks like music21 takes notes and durations; however, there are also rests. Let's say you have a list of durations called durations, and a list of notes (and rests) called notes:
from music21 import stream, note

mymusic = stream.Stream()
notes = ["F4", "F4", "rest", "F4"]
durations = [0.25, 1, 0.25, 1]
for n, d in zip(notes, durations):
    if n == "rest":
        mymusic.append(note.Rest(quarterLength=d))
    else:
        mymusic.append(note.Note(n, quarterLength=d))
mymusic.show("midi")
Music21 uses a special kind of list called a stream. We're making an empty stream first, and then populating it with notes and durations. zip lets us walk through both lists at the same time. We check if the note is supposed to be a rest; if it is, we add a rest with the right duration, else we add a note of the right duration. (Notice I am not a composer; you could generate the notes and durations any way you like :-) )
If you really wanted to, you could write a csv file or something of notes and durations in R and read that in Python. However, I think generating the lists in Python itself is a cleaner approach.
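If you did go the csv route, the Python side might look something like this minimal sketch (melody.csv is a hypothetical two-column file of note name and duration written from R):

import csv
from music21 import stream, note

mymusic = stream.Stream()
with open("melody.csv") as f:
    for name, dur in csv.reader(f):   # rows like: F4,0.25
        d = float(dur)
        if name == "rest":
            mymusic.append(note.Rest(quarterLength=d))
        else:
            mymusic.append(note.Note(name, quarterLength=d))
mymusic.show("midi")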
Thanks for introducing me to this music21 library, it looks very neat.
Was coding something in Python. I have a piece of code and wanted to know if it can be done more elegantly...
# Statistics format is - done|remaining|200's|404's|size
statf = open(STATS_FILE, 'r').read()
starf = statf.strip().split('|')
done = int(starf[0])
rema = int(starf[1])
succ = int(starf[2])
fails = int(starf[3])
size = int(starf[4])
...
This goes on. I wanted to know if, after splitting the line into a list, there is any better way to assign each list item to a variable. I have close to 30 lines assigning index values to variables. Just trying to learn more about Python, that's it...
done, rema, succ, fails, size, ... = [int(x) for x in starf]
Better:
labels = ("done", "rema", "succ", "fails", "size")
data = dict(zip(labels, [int(x) for x in starf]))
print data['done']
What I don't like about the answers so far is that they stick everything in one expression. You want to reduce the redundancy in your code, without doing too much at once.
If all of the items on the line are ints, then convert them all together, so you don't have to write int(...) each time:
starf = [int(i) for i in starf]
If only certain items are ints--maybe some are strings or floats--then you can convert just those:
for i in 0, 1, 2, 3, 4:
    starf[i] = int(starf[i])
Assigning in blocks is useful; if you have many items (you said you had 30) you can split it up:
done, rema, succ = starf[0:3]
fails, size = starf[3:5]
I might use the csv module with a separator of | (though that might be overkill if you're "sure" the format will always be super-simple, single-line, no-strings, etc, etc). Like your low-level string processing, the csv reader will give you strings, and you'll need to call int on each (with a list comprehension or a map call) to get integers. Another tip: use the with statement to open your file, to ensure it won't cause a "file descriptor leak" (not indispensable in current CPython versions, but an excellent idea for portability and future-proofing).
But I question the need for 30 separate barenames to represent 30 related values. Why not, for example, make a collections.namedtuple type with appropriately-named fields, and initialize an instance thereof, then use qualified names for the fields, i.e., a nice namespace? Remember the last koan in the Zen of Python (import this at the interpreter prompt): "Namespaces are one honking great idea -- let's do more of those!"... barenames have their (limited;-) place, but representing dozens of related values is not one -- rather, this situation "cries out" for the "let's do more of those" approach (i.e., add one appropriate namespace grouping the related fields -- a much better way to organize your data).
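A minimal sketch of that namespace approach (only the five fields named in the question's comment; extend the field list to match the real 30-field format):

from collections import namedtuple

Stats = namedtuple("Stats", "done rema succ fails size")

with open(STATS_FILE) as f:   # STATS_FILE as in the question
    stats = Stats(*[int(x) for x in f.read().strip().split('|')])

print(stats.done)    # qualified names instead of 30 barenames
print(stats.fails)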
Using a Python dict is probably the most elegant choice.
If you put your keys in a list as such:
keys = ("done", "rema", "succ" ... )
somedict = dict(zip(keys, [int(v) for v in values]))
That would work. :-) Looks better than 30 lines too :-)
EDIT: I think there are dict comprehensions now, so that may look even better too! :-)
EDIT Part 2: Also, for the keys collection, you'd want to break that into multiple lines.
EDIT Again: fixed buggy part :)
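For the record, the dict-comprehension version mentioned above would look like this (with a hypothetical sample line for illustration):

labels = ("done", "rema", "succ", "fails", "size")
values = "10|20|30|40|50".split("|")   # hypothetical sample line
data = {k: int(v) for k, v in zip(labels, values)}
print(data["done"])   # -> 10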
Thanks for all the answers. So here's the summary -
Glenn's answer was to handle this issue in blocks, i.e. done, rema, succ = starf[0:3] etc.
Leoluk's approach was shorter and sweeter, taking advantage of Python's immensely powerful dict comprehensions.
Alex's answer was more design-oriented. Loved this approach. I know it should be done the way Alex suggested, but a lot of code refactoring needs to take place for that. Not a good time to do it now.
townsean - same as 2
I have taken up Leoluk's approach. I am not sure what the speed implication of this is; I have no idea if list/dict comprehensions take a hit on execution speed. But it reduces the size of my code considerably for now. I'll optimize when the need comes :) Going by "Premature optimization is the root of all evil"...