I have a number of Python scripts that manipulate large files. In some of them I perform operations among columns or select rows based on their content. As the input files can have different structures, the operations are provided through the command line with a syntax like c3 + c5 - 1 or (c3 < 4) & (c5 > 4) (or combinations), where c4 is interpreted as the column of the input file with index 4 (zero-based, as in NumPy).
My files look something like this ('input_file.txt'):
21.3 4321.34 34.12 4 343.3 2 324
34.34 67.56 764.45 2 54.768 6 45265
986.96 87.98 234.09 1 54.456 3 5262
[...]
Let's say that I want to sum the columns with indices 3 and 5 and subtract 1.
I would do
import re
import numpy as np
operation = "c3 + c5 -1" #in reality given from command line
pattern = re.compile(r"c(\d+?)") # compile the regex that matches the column number
# get the actual expression to evaluate
to_evaluate = pattern.sub("ifile[:,\\1]", operation)
#to_evaluate is: "ifile[:,3] + ifile[:,5] -1"
ifile = np.loadtxt('input_file.txt')
result = eval(to_evaluate) #evaluate the operation required
print(result)
# do the rest
Output
[ 5.  7.  3. ...]
I came up with this implementation because:
it's easy to write and to modify if I want to change the method for reading files (at the moment I can choose between numpy and pandas) or if I want to add operations
it gives me a lot of freedom in what I can do: I can treat c3 + c5 - 1, (c3 < 4) & (c5 > 4) or (c2 + c4) > 0 all in the same way
all my scripts share the same signature, so I'm less likely to make mistakes
I'm aware that eval can be unsafe (although for now I'm the only user of these scripts) and can be slower than the equivalent hard-coded expression, but I couldn't think of a better way.
Is anyone aware of better/safer ways to implement such operations?
extra edit: if it matters, I'm running python 2.7
You can make a safer eval:
def safe_eval(eval_str, variable_dict=None):
    '''well... mostly safe:
    http://lybniz2.sourceforge.net/safeeval.html
    '''
    if variable_dict is None:
        variable_dict = {}
    return eval(eval_str, {"__builtins__": None}, variable_dict)
Although it will never be perfectly safe from someone who knows Python very well (see http://nedbatchelder.com/blog/201206/eval_really_is_dangerous.html for an example).
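Applied to the question's use case, a minimal sketch (assuming safe_eval as above and NumPy): if you map the names c0, c1, ... to the column arrays, the expression can even be evaluated directly, without the regex substitution.
import numpy as np

ifile = np.loadtxt('input_file.txt')
# expose each column under the name the expressions use: c0, c1, ...
columns = {'c%d' % i: ifile[:, i] for i in range(ifile.shape[1])}

print(safe_eval("c3 + c5 - 1", columns))
print(safe_eval("(c3 < 4) & (c5 > 4)", columns))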
Your application is confusing to me though, so I'm not sure how much more I can help you!
I'm not sure if this will help solve what you are doing, but one thing you can do is compile all the functions in a module into a dictionary.
So you could compile the functions you want to use through something like:
module_dict = {}
for n in dir(module):
    module_dict[n] = getattr(module, n)  # getattr avoids the eval here
(I believe this functionality is already built in: a module's dictionary can be accessed directly as vars(module) or module.__dict__.) This puts all the function calls in dictionary form, which speeds up calls. It also avoids the eval safety issues.
If you are trying to use operations like '+' or '==', you can get their function calls from object.__add__ and object.__eq__. You could map those calls to your string syntax.
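For example, here is a hypothetical dispatch table; the operator module exposes those same special methods as plain functions:
import operator

# map the string syntax to ready-made function calls
ops = {
    '+':  operator.add,   # object.__add__
    '-':  operator.sub,   # object.__sub__
    '==': operator.eq,    # object.__eq__
    '<':  operator.lt,    # object.__lt__
}

print(ops['+'](3, 5))   # 8
print(ops['<'](3, 5))   # True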
Not sure if that helps.
Having declared assertions on a solver, how can I get at and exploit single assertions out of all of them? If s.assertions() could be transformed into a list, we could access a single statement, but this cannot be done directly. I explain with the following assertions on BitVecs and what I'd like to get out of them.
from z3 import *
s = Solver()
x,y,z,w = BitVecs("x y z w",7) #rows
a,b,c,d,e = BitVecs("a b c d e",7) #cols
constr = [x&a==a,x&b==b,x&c!=c,x&d!=d,x&e!=e,
y&a==a,y&b!=b,y&c!=c,y&d!=d,y&e==e,
z&a!=a,z&b==b,z&c==c,z&d==d,z&e!=e,
w&a!=a,w&b==b,w&c!=c,w&d==d,w&e==e ]
s.add(constr)
R = [x,y,z,w]
C = [a,b,c,d,e]
s.assertions()
I need a matrix (a list of lists) that indicates whether an R,C pair has the == or != type of constraint in constr. So the matrix for the declared constr is:
[[1,1,0,0,0],
[1,0,0,0,1],
[0,1,1,1,0],
[0,1,0,1,1]]
This is rather an odd thing to want to do. You are constructing these assertions yourself, so it's much better to simply keep track of how you constructed them to find out what they contain.
If these are coming from some other source (possible, I suppose), then you'll have to "parse" them back into AST form and walk their structure to answer questions of the form "what are the variables", "what are the connectives" etc. Doing so will require an understanding of how z3py internally represents these objects. While this is possible, I very much doubt it's something you want to do unless you're working on a library that's supposed to handle everything. (i.e., since you know what you're constructing, simply keep track of it elsewhere.)
But, if you do want to analyze these expressions, the way to go is to study the AST structure. You'll have to become familiar with the contents of this file: https://github.com/Z3Prover/z3/blob/master/src/api/python/z3/z3.py, and especially the functions decl and children amongst others.
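Along those lines, here is a minimal sketch (eq_matrix is a hypothetical helper, not z3 API) that walks the assertions with decl() and children(), assuming every assertion has the shape (row & col) == col or (row & col) != col as in the question:
from z3 import *

x, y, z, w = BitVecs("x y z w", 7)      # rows
a, b, c, d, e = BitVecs("a b c d e", 7) # cols
R = [x, y, z, w]
C = [a, b, c, d, e]
constr = [x&a==a, x&b==b, x&c!=c, x&d!=d, x&e!=e,
          y&a==a, y&b!=b, y&c!=c, y&d!=d, y&e==e,
          z&a!=a, z&b==b, z&c==c, z&d==d, z&e!=e,
          w&a!=a, w&b==b, w&c!=c, w&d==d, w&e==e]
s = Solver()
s.add(constr)

def eq_matrix(assertions, rows, cols):
    row_idx = {r.get_id(): i for i, r in enumerate(rows)}
    col_idx = {c.get_id(): j for j, c in enumerate(cols)}
    m = [[None] * len(cols) for _ in rows]
    for f in assertions:
        # '==' has declaration kind Z3_OP_EQ; '!=' shows up as Z3_OP_DISTINCT
        is_eq = f.decl().kind() == Z3_OP_EQ
        lhs = f.children()[0]   # the (row & col) term
        r, c = lhs.children()   # the two bit-vector variables
        m[row_idx[r.get_id()]][col_idx[c.get_id()]] = 1 if is_eq else 0
    return m

for row in eq_matrix(s.assertions(), R, C):
    print(row)  # prints the four rows of the 0/1 matrix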
edit: FYI for all you paranoid people, the repro code no longer uses eval.
I'm not going to say I discovered a bug in Python (which would get me instantly downvoted), but this is some preeetty weird behavior. I have a list pairs and call sort on it with a custom key function that does not change state. Then I take a subset of pairs (in the same order), and call sort again with the same key function. The result is different from the original subset. Is this possible?
I have provided a repro for you all as a GitHub Gist. Steps to prepare:
Download all 4 files (dpd.txt, index_map.txt, ids.txt and weirdsortbehavior.py) and place them in the same directory
Run the Python program (note: with Python 3; I have not tested it with Python 2). For me it printed out:
0 1916
1 0
Marvel at this behavior.
What is the explanation for this and what can I do to fix it? Thanks.
I think I found out why: there are some NaN values in dpd.txt, and NaN cannot be meaningfully compared:
float('nan') > 1  # False
float('nan') < 1  # also False
So this totally breaks the comparisons that sort relies on.
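A quick demonstration of how a NaN key poisons sorting (the exact output is implementation-dependent, which is exactly the problem):
nan = float('nan')
print(nan < 1, nan > 1, nan == nan)  # False False False

# The same values in different initial orders may "sort" to different,
# only partially ordered results, because every comparison with nan is False.
print(sorted([3.0, nan, 1.0, 2.0]))
print(sorted([nan, 3.0, 1.0, 2.0]))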
If you change your key function to:
import math

def _key(id_):
    result = -dpd[index_map[id_]], id_.lower()
    if math.isnan(result[0]):
        # replace NaN with a well-defined sentinel so comparisons behave
        result = 0, id_.lower()
    return result
It will work.
I have an application that validates a CSV file against a set of rules. The application checks whether some columns/fields in the CSV are marked as mandatory, and for others it checks whether their mandatory status depends on another field. E.g. column 2 has a conditional check against column 5 such that if column 5 has a value, then column 2 must also have a value.
I have already implemented this using VB and Python. The problem is that this logic is hard-coded in the application. What I want is to move these rules into, say, an XML file that the application reads before processing the CSV. If the rules change (and they change often), then the application remains the same and only the XML changes.
Here are two sample rules in Python:
Sample One
current_column_data = 5  # data from the current position in the CSV
if not validate_data_type(current_column_data, expected_data_type):
    return error_message

index_to_check_against = 10  # column against which there is a "logical" test
text_to_check = get_text(index_to_check_against)
if not validate_data_type(text_to_check, expected_data_type):
    return error_message

# This test could be comparing string vs. string, so keep in mind that
# current_column_data could be a string value (to avoid errors)
if current_column_data > 10:
    if text_to_check <= 0:
        return "Text to check should be greater than 0 if current column data is greater than 10"
Sample Two
current_column_data = "Self Employed" #data from the current position in the CSV
if validate_data_type(current_column_data, expected_data_type) == False:
return error_message
index_to_check_against = 10 #Column against which there is a "logical" test
text_to_check = get_text(index_to_check_against)
if validate_data_type(text_to_check, expected_data_type) == False:
return error_message
if text_to_check == "A": #Here we expect if A is provided in the index to check, then current column should have a value hence we raise an error message
if len(current_column_data) = 0:
return "Current column is mandatory given that "A" is provided in Column_to_check""
Note: for each column in the CSV, we already know the data type to expect, the expected length of that field, whether it's mandatory, optional or conditional, and if it's conditional, the other column the condition is based on.
Now I just need some guidance on how I can possibly do this in XML, such that the application reads the XML and knows what to do with each column.
Someone suggested the following sample elsewhere, but I still can't wrap my head around the concept:
<check left="" right="9" operation="GTE" value="3" error_message="logical failure for something" />
# Meaning: column 9 should be "GTE", i.e. greater than or equal to the value 3
Is there a different way to go about achieving this kind of logic or even a way to improve what I have here?
Suggestions and pointers welcome
This concept is called a Domain Specific Language (DSL) - you are effectively creating a mini-programming language for validating your CSV files. Your DSL allows you to express succinctly the rules for a valid CSV file.
This DSL could be expressed using XML, or an alternative approach would be to develop a library of functions in python instead. Then your DSL could be expressed as a mini-python program which is a sequence of these functions. This approach is called an in-language or "internal" DSL - and has the benefit that you have the full power of python at your disposal within your language.
Looking at your samples - you're very close to this already. When I read them, they're almost like an English description of the CSV validation rules.
Don't feel you have to go down the XML route; there's nothing wrong with keeping everything in Python.
You can split your code so that you have a Python file with the "CSV validation rules" expressed in your DSL, which you need to update/redistribute frequently, and separate files which define your DSL functions, which change less frequently (see the sketch below).
In some cases it's even possible to develop the DSL to the point where non-programmers can update/maintain "programs" written in it.
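As a concrete illustration, here is a minimal sketch of such an internal DSL (all names here are hypothetical, not from the question's code): each rule is a small function returning an error message or None, and the frequently-changing "program" is just a list of rule calls.
def mandatory(col):
    def rule(row):
        if not row[col]:
            return "column %d is mandatory" % col
    return rule

def conditional(col, depends_on):
    def rule(row):
        if row[depends_on] and not row[col]:
            return "column %d is mandatory when column %d has a value" % (col, depends_on)
    return rule

def greater_than(col, value):
    def rule(row):
        if float(row[col]) <= value:
            return "column %d must be greater than %s" % (col, value)
    return rule

# the frequently-changing "program" written in the DSL
RULES = [
    mandatory(0),
    conditional(2, depends_on=5),
    greater_than(9, 3),
]

def validate(row):
    return [err for err in (rule(row) for rule in RULES) if err]

print(validate(["a", "", "x", "", "", "y", "", "", "", "12"]))  # -> []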
The problem you are solving is not necessarily tied to XML. OK, you can do validation of XML with XSD, but that means your data would need to be XML, and I'm not sure you can express rules like "if A > 3, the following rule applies" to that extent.
A little less elegant, but maybe a simpler approach than Ross's answer, is to simply define the set of rules as data and have a specific function process them, which is basically what your XML example does: storing (i.e. serializing) the rules as XML. But you could use any other serialization format, like JSON, YAML, INI or even CSV (not that the last would be advisable).
So you could concentrate on the data model of the rules. I'll try to illustrate that with XML (but using elements rather than attributes):
<cond name="some explanatory name">
<if><expr>...</expr>
<and>
<expr>
<left><column>9</column></left>
<op>ge</op>
<right>3</right>
</expr>
<expr>
<left><column>1</column></left>
<op>true</op>
<right></right>
</expr>
</and>
</cond>
Then, you can load that into Python and traverse over it for each row, raising a nice explanatory exception as appropriate.
Edit: You mentioned that the file might need to be human-writable. Note that YAML has been designed for that.
A similar (not the same; changed to better illustrate the language) structure:
# comments, explanations...
conds:
  - name: some explanatory name
    # seen that? no quotes needed (unless you include some of
    # quite a limited set of special chars)
    if:
      # a lone expr could go here; the 'and' below combines two of them
      and:
        - expr:
            left:
              column: 9
            op: ge
            right: 3
        - expr:
            left:
              column: 1
            op: true
  - name: some other explanatory name
    # using flow mappings for the columns below just to keep things
    # better indented (not sure about support in libs)
    if:
      and:
        - expr:
            left: { column: 9 }
            op: ge
            right: 3
        - expr:
            left: { column: 1 }
            op: true
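To close the loop, here is a minimal sketch of the load-and-traverse step, assuming PyYAML (yaml.safe_load) and the rule layout above; check_row and OPS are hypothetical names:
import yaml

RULES_YAML = '''
conds:
  - name: some explanatory name
    if:
      and:
        - expr:
            left: { column: 9 }
            op: ge
            right: 3
        - expr:
            left: { column: 1 }
            op: "true"   # quoted: bare true would be loaded as a boolean
'''

OPS = {
    'ge': lambda left, right: float(left) >= float(right),
    'true': lambda left, right: bool(left),
}

def check_row(row, rules):
    for cond in rules['conds']:
        for item in cond['if']['and']:
            expr = item['expr']
            left = row[expr['left']['column']]
            if not OPS[expr['op']](left, expr.get('right')):
                raise ValueError('rule %r failed on %r' % (cond['name'], row))

rules = yaml.safe_load(RULES_YAML)
check_row(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', '5'], rules)  # passes quietly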
Was coding something in Python. Have a piece of code, wanted to know if it can be done more elegantly...
# Statistics format is - done|remaining|200's|404's|size
statf = open(STATS_FILE, 'r').read()
starf = statf.strip().split('|')
done = int(starf[0])
rema = int(starf[1])
succ = int(starf[2])
fails = int(starf[3])
size = int(starf[4])
...
This goes on. I wanted to know if, after splitting the line into a list, there is any better way to assign each list element to a variable. I have close to 30 lines assigning index values to variables. Just trying to learn more about Python, that's it...
done, rema, succ, fails, size, ... = [int(x) for x in starf]
Better:
labels = ("done", "rema", "succ", "fails", "size")
data = dict(zip(labels, [int(x) for x in starf]))
print data['done']
What I don't like about the answers so far is that they stick everything in one expression. You want to reduce the redundancy in your code, without doing too much at once.
If all of the items on the line are ints, then convert them all together, so you don't have to write int(...) each time:
starf = [int(i) for i in starf]
If only certain items are ints--maybe some are strings or floats--then you can convert just those:
for i in 0, 1, 2, 3, 4:
    starf[i] = int(starf[i])
Assigning in blocks is useful; if you have many items (you said you had 30), you can split it up:
done, rema, succ = starf[0:3]
fails, size = starf[3:5]
I might use the csv module with a separator of | (though that might be overkill if you're "sure" the format will always be super-simple, single-line, no-strings, etc, etc). Like your low-level string processing, the csv reader will give you strings, and you'll need to call int on each (with a list comprehension or a map call) to get integers. Other tips include using the with statement to open your file, to ensure it won't cause a "file descriptor leak" (not indispensable in current CPython version, but an excellent idea for portability and future-proofing).
But I question the need for 30 separate barenames to represent 30 related values. Why not, for example, make a collections.namedtuple type with appropriately-named fields, initialize an instance thereof, and then use qualified names for the fields, i.e., a nice namespace? Remember the last koan in the Zen of Python (import this at the interpreter prompt): "Namespaces are one honking great idea -- let's do more of those!"... barenames have their (limited;-) place, but representing dozens of related values is not one -- rather, this situation "cries out" for the "let's do more of those" approach (i.e., add one appropriate namespace grouping the related fields -- a much better way to organize your data).
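A minimal sketch of that namedtuple approach (field names taken from the question; only the first five shown):
import collections

Stats = collections.namedtuple('Stats', 'done rema succ fails size')

with open(STATS_FILE) as f:  # 'with' also takes care of closing the file
    starf = f.read().strip().split('|')

stats = Stats(*(int(x) for x in starf[:5]))
print(stats.done)
print(stats.fails)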
Using a Python dict is probably the most elegant choice.
If you put your keys in a list as such:
keys = ("done", "rema", "succ" ... )
somedict = dict(zip(keys, [int(v) for v in starf]))
That would work. :-) Looks better than 30 lines too :-)
EDIT: I think there are dict comprehensions now, so that may look even better too! :-)
EDIT Part 2: Also, for the keys collection, you'd want to break that into multiple lines.
EDIT Again: fixed buggy part :)
Thanks for all the answers. So here's the summary -
Glenn's answer was to handle this issue in blocks, i.e. done, rema, succ = starf[0:3] etc.
Leoluk's approach was more short & sweet taking advantage of python's immensely powerful dict comprehensions.
Alex's answer was more design-oriented. Loved this approach. I know it should be done the way Alex suggested, but a lot of code refactoring needs to take place for that. Not a good time to do it now.
townsean - same as 2
I have taken up Leoluk's approach. I am not sure what the speed implication of this is; I have no idea whether list/dict comprehensions take a hit on execution speed. But it reduces the size of my code considerably for now. I'll optimize when the need comes :) Going by "Premature optimization is the root of all evil"...
I am parsing a file with Python and pyparsing (it's the report file for PSAT in MATLAB, but that isn't important). Here is what I have so far. I think it's a mess and would like some advice on how to improve it. Specifically, how should I organise my grammar definitions with pyparsing?
Should I have all my grammar definitions in one function? If so, it's going to be one huge function. If not, then how do I break it up? At the moment I have split it at the sections of the file. Is it worth making loads of functions that only ever get called once from one place? Neither really feels right to me.
Should I place all my input and output code in a separate file from the other class functions? It would make the purpose of the class much clearer.
I'm also interested to know if there is an easier way to parse a file, do some sanity checks and store the data in a class. I seem to spend a lot of my time doing this.
(I will accept answers of it's good enough or use X rather than pyparsing if people agree)
I could go either way on using a single big method to create your parser vs. taking it in steps the way you have it now.
I can see that you have defined some useful helper utilities, such as slit ("suppress Literal", I presume), stringtolits, and decimaltable. This looks good to me.
I like that you are using results names, they really improve the robustness of your post-parsing code. I would recommend using the shortcut form that was added in pyparsing 1.4.7, in which you can replace
busname.setResultsName("bus1")
with
busname("bus1")
This can declutter your code quite a bit.
I would look back through your parse actions to see where you are using numeric indexes to access individual tokens, and go back and assign results names instead. Here is one case, where GetStats returns (ngroup + sgroup).setParseAction(self.process_stats). process_stats has references like:
self.num_load = tokens[0]["loads"]
self.num_generator = tokens[0]["generators"]
self.num_transformer = tokens[0]["transformers"]
self.num_line = tokens[0]["lines"]
self.num_bus = tokens[0]["buses"]
self.power_rate = tokens[1]["rate"]
I like that you have Group'ed the values and the stats, but go ahead and give them names, like "network" and "soln". Then you could write this parse action code as (I've also converted to the - to me - easier-to-read object-attribute notation instead of dict element notation):
self.num_load = tokens.network.loads
self.num_generator = tokens.network.generators
self.num_transformer = tokens.network.transformers
self.num_line = tokens.network.lines
self.num_bus = tokens.network.buses
self.power_rate = tokens.soln.rate
Also, a style question: why do you sometimes use the explicit And constructor, instead of using the '+' operator?
busdef = And([busname.setResultsName("bus1"),
              busname.setResultsName("bus2"),
              integer.setResultsName("linenum"),
              decimaltable("pf qf pl ql".split())])
This is just as easily written:
busdef = (busname("bus1") + busname("bus2") +
integer("linenum") +
decimaltable("pf qf pl ql".split()))
Overall, I think this is about par for a file of this complexity. I have a similar format (proprietary, unfortunately, so cannot be shared) in which I built the code in pieces similar to the way you have, but in one large method, something like this:
def parser():
    header = Group(...)
    inputsummary = Group(...)
    jobstats = Group(...)
    measurements = Group(...)
    return (header("hdr") + inputsummary("inputs") +
            jobstats("stats") + measurements("meas"))
The Group constructs are especially helpful in a large parser like this, to establish a sort of namespace for results names within each section of the parsed data.
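A tiny self-contained illustration of that namespacing effect (the grammar here is hypothetical, just to show the mechanics):
from pyparsing import Group, Word, nums

integer = Word(nums)
network = Group(integer("loads") + integer("buses"))("network")
soln = Group(integer("rate"))("soln")
stats = network + soln

tokens = stats.parseString("12 34 100")
print(tokens.network.loads)  # '12'
print(tokens.soln.rate)      # '100'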