Haskell equivalent of Python's "Construct"

Construct is a DSL implemented in Python used to describe data structures (binary and textual). Once you have described the data structure, Construct can parse and build it for you, which is good ("DRY", "Declarative", "Denotational-Semantics"...).
Usage example:
# code from construct.formats.graphics.png
itxt_info = Struct("itxt_info",
    CString("keyword"),
    UBInt8("compression_flag"),
    compression_method,
    CString("language_tag"),
    CString("translated_keyword"),
    OnDemand(
        Field("text",
            lambda ctx: ctx._.length - (len(ctx.keyword) +
                len(ctx.language_tag) + len(ctx.translated_keyword) + 5),
        ),
    ),
)
I am in need of such a tool for Haskell and wonder whether something like this exists.
I know of:
Data.Binary: User implements parsing and building separately
Parsec: Only for parsing? Only for text?
I guess one must use Template Haskell to achieve this?

I'd say it depends on what you want to do, and whether you need to comply with any existing format.
Data.Binary will (surprise!) help you with binary data, both reading and writing.
You can either write the code to read/write yourself, or let go of the details and generate the required code for your data structures using some additional tools like DrIFT or Derive. DrIFT works as a preprocessor, while Derive can work as a preprocessor and with TemplateHaskell.
Parsec will only help you with parsing text. No binary data (as easily), and no writing. Work is done with regular Strings. There are ByteString equivalents on hackage.
For your example above I'd use Data.Binary and write the custom put/get functions myself.
Have a look at the parser category on Hackage for more options.

Currently (AFAIK) there is no equivalent of Construct in Haskell.
One could be implemented using Template Haskell.

I don't know anything about Python or Construct, so this is probably not what you are searching for, but for simple data structures you can always just derive Read:
data Test a = I Int | S a deriving (Read,Show)
Now, for the expression
read "S 123" :: Test Double
GHCi will emit: S 123.0
For anything more complex, you can make an instance of Read using Parsec.

Related

Is there something like a reverse eval() function?

Imagine the following problem: you have a dictionary of some content in Python and want to generate Python code that would create this dict (which is like eval, but in reverse).
Is there something that can do this?
Scenario:
I am working with a remote python interpreter. I can give source files to it but no input. So I am now looking for a way to encode my input data into a python source file.
Example:
d = {'a': [1,4,7]}
str_d = reverse_eval(d)
# "{'a': [1, 4, 7]}"
eval(str_d) == d
repr(thing)
will output text that, when passed to eval, will (in most cases) reproduce the dictionary.
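A quick round-trip check for the dictionary from the question, using nothing beyond repr and eval:
d = {'a': [1, 4, 7]}
str_d = repr(d)            # "{'a': [1, 4, 7]}"
assert eval(str_d) == d    # round-trips for built-in types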
Actually, it matters which types of data you want this reverse function to work on. If you're talking about built-in/standard classes, their .__repr__() method usually returns the code you want. But if your goal is to save something in a human-readable format and then load it back into Python with an eval-like function, there is the json library.
It's better to use json for this, because using eval is not safe.
json's limitation is that it can only save standard objects; for types that are not built in, you never know what their .__repr__() returns, so there is no way to rely on repr/eval for that kind of data.
So there is no reverse function for all types of data. You can use repr/eval for built-in types, but for built-in data the json library is better, not least because it's safe.
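A minimal sketch of that trade-off, using only the standard json module on the example dict from the question:
import json

d = {'a': [1, 4, 7]}
text = json.dumps(d)            # '{"a": [1, 4, 7]}' - human-readable
assert json.loads(text) == d    # no eval, so no arbitrary code execution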

Perform basic math operations upon loading variables with h5py

I'm performing data analysis on a large number of variables contained in an hdf5 file. The code I've written loops over a list of variables and then performs analyses and outputs some graphs. It would be nice to be able to use the code for combinations of variables (like A+B or sqrt((A**2)+(B**2))) without having to put in a bunch of if statements, i.e. to evaluate the expression in the string when loading the variables from my hdf5 file. If possible, I would like to avoid using pandas, but I'm not completely against it if that's the only efficient way to do what I want.
My hdf5 file looks something like this:
HDF5 "blahblah.hdf5" {
FILE_CONTENTS {
 group      /
 group      /all
 dataset    /all/blargle
 dataset    /all/blar
 }
}
And what I would like to do is this (this functionality doesn't exist in h5py, so it fails):
myfile = h5py.File('/myfile/blahblah.hdf5')
varlist = ['blargle', 'blar', 'blargle+blar']
savelist = [None]*len(varlist)

for ido, varname in enumerate(varlist):
    savelist[ido] = myfile['all'][varname]
    # would like to evaluate varname upon loading
First you have to ask yourself: Do I know the arithmetic operations only at runtime or already at programming time?
If you know it already now, just write a function in Python for it.
If you know it only at runtime, you will need a parser. While there are libraries specialized in this (example), Python itself is already a parser: with eval and exec you can evaluate or execute strings containing Python code.
Now all you need to define is some sort of grammar for a little language of your own; you need some conventions. It seems you already have them: you want to convert myfile['all']['blargle+blar'] into myfile['all']['blargle']+myfile['all']['blar']. In order to make life easier I recommend the following:
Put names of data sets in brackets.
varlist = ['[blargle]', '[blar]', '[blargle]+[blar]', 'sqrt(([blargle]**2)+([blar]**2))']
Then simply replace all terms in brackets by myfile['all'][name_in_brackets] and evaluate the resulting string with eval.
import re

for ido, varname in enumerate(varlist):
    # rewrite every [name] as myfile['all']['name'], then evaluate the expression
    term = re.sub(r'\[(.*?)\]', lambda m: "myfile['all']['{}']".format(m.group(1)), varname)
    savelist[ido] = eval(term)
The re.sub line that matches the variable names is not actually tested by me.
And still another drawback: I'm not sure reading datasets from an HDF5 object is fast when the same dataset is read multiple times; if HDF5 does not cache them, it might be better to store the datasets in memory before doing computations on them.
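As a sketch of that combination, the loop below reads each referenced dataset once into a dict and evaluates the bracket expressions against it (the caching step and the numpy sqrt binding are my additions, not part of the original suggestion):
import re
import h5py
import numpy as np

myfile = h5py.File('/myfile/blahblah.hdf5', 'r')
varlist = ['[blargle]', '[blar]', '[blargle]+[blar]']
savelist = []

for varname in varlist:
    names = re.findall(r'\[(.*?)\]', varname)
    # read each dataset once into memory
    cache = {name: myfile['all'][name][...] for name in names}
    # rewrite "[name]" as "cache['name']" and evaluate the expression
    expr = re.sub(r'\[(.*?)\]', lambda m: "cache['{}']".format(m.group(1)), varname)
    savelist.append(eval(expr, {'cache': cache, 'sqrt': np.sqrt}))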

Python regex to extract data from C header file

I have a C header file with a lot of enums, typedefs and function prototypes. I want to extract this data using Python regex (re). I really need help with the syntax, because I constantly seem to forget it every time I learn it.
ENUMS
-----
enum
{
(tab character)(stuff to be extracted - multiple lines)
};
TYPES
-----
typedef struct (extract1) (extract2)
FUNCTIONS
---------
(return type)
(name)
(
(tab character)(arguments - multiple lines)
);
If anyone could point me in the right direction, I would be grateful.
I imagine something like this is what you're after?
>>> import re
>>> re.findall(r'enum\s*{\s*([^}]*)};', 'enum {A,B,C};')
['A,B,C']
>>> re.findall(r"typedef\s+struct\s+(\w+)\s+(\w+);", "typedef struct blah blah;")
[('blah', 'blah')]
There are of course numerous variations on the syntax, and functions are much more complicated, so I'll leave those for you, as frankly these regexps are already fragile and inelegant enough. I would urge you to use an actual parser unless this is just a one-off project where robustness is totally unimportant and you can be sure of the format of your inputs.
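For completeness, here is an equally fragile sketch for the function-prototype layout described above (the sample header text is mine, and the pattern assumes exactly that layout: return type and name on their own lines, tab-indented arguments, and a closing ");"):
import re

header = "int\nmy_function\n(\n\tint a,\n\tchar *b\n);\n"

func_re = re.compile(
    r"^(?P<ret>[\w\s\*]+?)\s*\n"      # return type on its own line
    r"(?P<name>\w+)\s*\n"             # function name
    r"\(\n"                           # opening parenthesis
    r"(?P<args>(?:\t[^\n]*\n)*)"      # tab-indented argument lines
    r"\);", re.MULTILINE)

for m in func_re.finditer(header):
    print(m.group('ret').strip(), m.group('name'), m.group('args').split())
# prints: int my_function ['int', 'a,', 'char', '*b']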

How should I organise my functions with pyparsing?

I am parsing a file with Python and pyparsing (it's the report file for PSAT in Matlab, but that isn't important). Here is what I have so far. I think it's a mess and would like some advice on how to improve it. Specifically, how should I organise my grammar definitions with pyparsing?
Should I have all my grammar definitions in one function? If so, it's going to be one huge function. If not, then how do I break it up? At the moment I have split it at the sections of the file. Is it worth making loads of functions that only ever get called once from one place? Neither really feels right to me.
Should I place all my input and output code in a separate file from the other class functions? It would make the purpose of the class much clearer.
I'm also interested to know if there is an easier way to parse a file, do some sanity checks and store the data in a class. I seem to spend a lot of my time doing this.
(I will accept answers of "it's good enough" or "use X rather than pyparsing" if people agree)
I could go either way on using a single big method to create your parser vs. taking it in steps the way you have it now.
I can see that you have defined some useful helper utilities, such as slit ("suppress Literal", I presume), stringtolits, and decimaltable. This looks good to me.
I like that you are using results names, they really improve the robustness of your post-parsing code. I would recommend using the shortcut form that was added in pyparsing 1.4.7, in which you can replace
busname.setResultsName("bus1")
with
busname("bus1")
This can declutter your code quite a bit.
I would look back through your parse actions to see where you are using numeric indexes to access individual tokens, and go back and assign results names instead. Here is one case, where GetStats returns (ngroup + sgroup).setParseAction(self.process_stats). process_stats has references like:
self.num_load = tokens[0]["loads"]
self.num_generator = tokens[0]["generators"]
self.num_transformer = tokens[0]["transformers"]
self.num_line = tokens[0]["lines"]
self.num_bus = tokens[0]["buses"]
self.power_rate = tokens[1]["rate"]
I like that you have Group'ed the values and the stats, but go ahead and give them names, like "network" and "soln". Then you could write this parse action code as (I've also converted to the - to me - easier-to-read object-attribute notation instead of dict element notation):
self.num_load = tokens.network.loads
self.num_generator = tokens.network.generators
self.num_transformer = tokens.network.transformers
self.num_line = tokens.network.lines
self.num_bus = tokens.network.buses
self.power_rate = tokens.soln.rate
Also, a style question: why do you sometimes use the explicit And constructor, instead of using the '+' operator?
busdef = And([busname.setResultsName("bus1"),
              busname.setResultsName("bus2"),
              integer.setResultsName("linenum"),
              decimaltable("pf qf pl ql".split())])
This is just as easily written:
busdef = (busname("bus1") + busname("bus2") +
          integer("linenum") +
          decimaltable("pf qf pl ql".split()))
Overall, I think this is about par for a file of this complexity. I have a similar format (proprietary, unfortunately, so cannot be shared) in which I built the code in pieces similar to the way you have, but in one large method, something like this:
def parser():
    header = Group(...)
    inputsummary = Group(...)
    jobstats = Group(...)
    measurements = Group(...)
    return header("hdr") + inputsummary("inputs") + jobstats("stats") + measurements("meas")
The Group constructs are especially helpful in a large parser like this, to establish a sort of namespace for results names within each section of the parsed data.
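A minimal, self-contained illustration of that idea (the grammar and field names below are made up for the example, not taken from the PSAT report format):
from pyparsing import Word, nums, Group

integer = Word(nums)
network = Group(integer("loads") + integer("generators"))("network")
soln = Group(integer("rate"))("soln")
stats = network + soln

tokens = stats.parseString("5 3 100")
print(tokens.network.loads)   # -> 5
print(tokens.soln.rate)       # -> 100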

Pythonic way to implement a tokenizer

I'm going to implement a tokenizer in Python and I was wondering if you could offer some style advice?
I've implemented a tokenizer before in C and in Java so I'm fine with the theory, I'd just like to ensure I'm following pythonic styles and best practices.
Listing Token Types:
In Java, for example, I would have a list of fields like so:
public static final int TOKEN_INTEGER = 0
But, obviously, there's no way (I think) to declare a constant variable in Python, so I could just replace this with normal variable declarations, but that doesn't strike me as a great solution since the declarations could be altered.
Returning Tokens From The Tokenizer:
Is there a better alternative to just simply returning a list of tuples e.g.
[ (TOKEN_INTEGER, 17), (TOKEN_STRING, "Sixteen")]?
Cheers,
Pete
There's an undocumented class in the re module called re.Scanner. It's very straightforward to use for a tokenizer:
import re

scanner = re.Scanner([
    (r"[0-9]+",  lambda scanner, token: ("INTEGER", token)),
    (r"[a-z_]+", lambda scanner, token: ("IDENTIFIER", token)),
    (r"[,.]+",   lambda scanner, token: ("PUNCTUATION", token)),
    (r"\s+", None),  # None == skip token.
])

results, remainder = scanner.scan("45 pigeons, 23 cows, 11 spiders.")
print(results)
will result in
[('INTEGER', '45'),
('IDENTIFIER', 'pigeons'),
('PUNCTUATION', ','),
('INTEGER', '23'),
('IDENTIFIER', 'cows'),
('PUNCTUATION', ','),
('INTEGER', '11'),
('IDENTIFIER', 'spiders'),
('PUNCTUATION', '.')]
I used re.Scanner to write a pretty nifty configuration/structured data format parser in only a couple hundred lines.
Python takes a "we're all consenting adults" approach to information hiding. It's OK to use variables as though they were constants, and trust that users of your code won't do something stupid.
In many situations, especially when parsing long input streams, you may find it more useful to implement your tokenizer as a generator function. That way you can easily iterate over all the tokens without needing lots of memory to build the whole list of tokens first.
For generators, see the original proposal (PEP 255) or other online docs.
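A minimal sketch of the generator approach (the token names and regexes are illustrative, not taken from the question):
import re

TOKEN_RE = re.compile(r"""
      (?P<INTEGER>\d+)
    | (?P<IDENTIFIER>[A-Za-z_]\w*)
    | (?P<SKIP>\s+)
""", re.VERBOSE)

def tokenize(text):
    # yield (type, value) pairs lazily instead of building a list
    for match in TOKEN_RE.finditer(text):
        if match.lastgroup != 'SKIP':
            yield (match.lastgroup, match.group())

for token in tokenize("45 pigeons 23 cows"):
    print(token)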
Thanks for your help; I've started to bring these ideas together, and I've come up with the following. Is there anything terribly wrong with this implementation (in particular, I'm concerned about passing a file object to the tokenizer)?
class Tokenizer(object):
    def __init__(self, file):
        self.file = file

    def __get_next_character(self):
        return self.file.read(1)

    def __peek_next_character(self):
        character = self.file.read(1)
        if character:  # at EOF there is nothing to seek back over
            self.file.seek(self.file.tell() - 1, 0)
        return character

    def __read_number(self):
        value = ""
        while self.__peek_next_character().isdigit():
            value += self.__get_next_character()
        return value

    def next_token(self):
        character = self.__peek_next_character()
        if character.isdigit():
            return self.__read_number()
"Is there a better alternative to just simply returning a list of tuples?"
Nope. It works really well.
"Is there a better alternative to just simply returning a list of tuples?"
That's the approach used by the "tokenize" module for parsing Python source code. Returning a simple list of tuples can work very well.
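For Python source specifically, the standard tokenize module can be driven directly and yields exactly such tuples (Python 3 shown; the source string is just an example):
import io
import tokenize

source = "result = [(TOKEN_INTEGER, 17), (TOKEN_STRING, 'Sixteen')]"
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))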
I have recently built a tokenizer too, and ran into some of the same issues.
Token types are declared as "constants", i.e. variables with ALL_CAPS names, at the module level. For example,
_INTEGER = 0x0007
_FLOAT = 0x0008
_VARIABLE = 0x0009
and so on. I have used an underscore in front of the names to point out that those fields are somehow "private" to the module, but I really don't know whether this is typical or advisable, or even how Pythonic it is. (Also, I'll probably ditch numbers in favour of strings, because they are much more readable during debugging.)
Tokens are returned as named tuples.
from collections import namedtuple
Token = namedtuple('Token', ['value', 'type'])
# so that e.g. somewhere in a function/method I can write...
t = Token(n, _INTEGER)
# ...and return it properly
I have used named tuples because the tokenizer's client code (e.g. the parser) seems a little clearer while using names (e.g. token.value) instead of indexes (e.g. token[0]).
Finally, I've noticed that sometimes, especially when writing tests, I prefer to pass a string to the tokenizer instead of a file object. I call it a "reader", and have a specific method to open it and let the tokenizer access it through the same interface.
def open_reader(self, source):
    """
    Produces a file object from source.
    The source can be either a file object already, or a string.
    """
    if hasattr(source, 'read'):
        return source
    else:
        from io import StringIO
        return StringIO(source)
When I start something new in Python I usually look first for modules or libraries to use. There's a 90%+ chance that something is already available.
For tokenizers and parsers this is certainly so. Have you looked at PyParsing?
I've implemented a tokenizer for a C-like programming language. What I did was to split up the creation of tokens into two layers:
a surface scanner: this one actually reads the text and uses regular expressions to split it up into only the most primitive tokens (operators, identifiers, numbers, ...); it yields tuples (tokenname, scannedstring, startpos, endpos).
a tokenizer: this consumes the tuples from the first layer, turning them into token objects (named tuples would do as well, I think). Its purpose is to detect some long-range dependencies in the token stream, particularly strings (with their opening and closing quotes) and comments (with their opening and closing lexemes; yes, I wanted to retain comments!), and to coerce them into single tokens. The resulting stream of token objects is then returned to a consuming parser.
Both are generators. The benefits of this approach were:
Reading of the raw text is done only in the most primitive way, with simple regexps - fast and clean.
The second layer is already implemented as a primitive parser, to detect string literals and comments - re-use of parser technology.
You don't have to strain the surface scanner with complex detections.
But the real parser gets tokens on the semantic level of the language to be parsed (again strings, comments).
I feel quite happy with this layered approach.
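A condensed sketch of that layering (the token names, the tiny grammar, and the string-merging rule are all illustrative; the real thing also handled comments):
import re

SURFACE_RE = re.compile(r'(?P<QUOTE>")|(?P<WORD>\w+)|(?P<OP>[^\w\s])|(?P<WS>\s+)')

def surface_scan(text):
    # layer 1: primitive (name, text, startpos, endpos) tuples
    for m in SURFACE_RE.finditer(text):
        if m.lastgroup != 'WS':
            yield (m.lastgroup, m.group(), m.start(), m.end())

def tokenize(text):
    # layer 2: merge quote-delimited runs into single STRING tokens
    stream = surface_scan(text)
    for name, value, start, end in stream:
        if name == 'QUOTE':
            parts, last = [], end
            for name2, value2, start2, end2 in stream:
                last = end2
                if name2 == 'QUOTE':
                    break
                parts.append(value2)
            # whitespace inside the string is normalised; fine for a sketch
            yield ('STRING', ' '.join(parts), start, last)
        else:
            yield (name, value, start, end)

print(list(tokenize('print "hello world" ;')))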
I'd turn to the excellent Text Processing in Python by David Mertz
This being a late answer: there is now something in the official documentation, Writing a tokenizer with the re standard library. This is content in the Python 3 documentation that isn't in the Python 2.7 docs, but it is still applicable to older Pythons.
It includes both short code and an easy setup, and it writes a generator, as several answers here have proposed.
If the docs are not Pythonic, I don't know what is :-)
"Is there a better alternative to just simply returning a list of tuples"
I had to implement a tokenizer, but it required a more complex approach than a list of tuples, so I implemented a class for each token. You can then return a list of class instances or, if you want to save resources, return something implementing the iterator interface and generate the next token while you progress through the parsing.
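A small sketch of that idea, with a class per token and a generator so the next token is produced only as the parse progresses (the class names and the whitespace-splitting rule are illustrative):
class Token(object):
    def __init__(self, value):
        self.value = value
    def __repr__(self):
        return '{}({!r})'.format(type(self).__name__, self.value)

class Integer(Token): pass
class Identifier(Token): pass

def tokenize(text):
    # generator: produces one token instance at a time
    for word in text.split():
        if word.isdigit():
            yield Integer(int(word))
        else:
            yield Identifier(word)

print(list(tokenize("spam 42 eggs")))
# [Identifier('spam'), Integer(42), Identifier('eggs')]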
