Having declared assertions on a solver, how can I get at and make use of individual assertions? If s.assertions() could be transformed into a list, we could access a single statement, but that does not seem to be possible. Let me explain with the following assertions on BitVecs and what I'd like to get out of them.
from z3 import *
s = Solver()
x,y,z,w = BitVecs("x y z w",7) #rows
a,b,c,d,e = BitVecs("a b c d e",7) #cols
constr = [x&a==a, x&b==b, x&c!=c, x&d!=d, x&e!=e,
          y&a==a, y&b!=b, y&c!=c, y&d!=d, y&e==e,
          z&a!=a, z&b==b, z&c==c, z&d==d, z&e!=e,
          w&a!=a, w&b==b, w&c!=c, w&d==d, w&e==e]
s.add(constr)
R = [x,y,z,w]
C = [a,b,c,d,e]
s.assertions()
I need a matrix (list of lists) that indicates whether an (R, C) pair has an == or a != type of constraint in constr. So the matrix for the constr declared above is
[[1,1,0,0,0],
[1,0,0,0,1],
[0,1,1,1,0],
[0,1,0,1,1]].
This is rather an odd thing to want to do. You are constructing these assertions yourself, so it's much better to simply keep track of how you constructed them to find out what they contain.
If these are coming from some other source (possible, I suppose), then you'll have to "parse" them back into AST form and walk their structure to answer questions of the form "what are the variables", "what are the connectives" etc. Doing so will require an understanding of how z3py internally represents these objects. While this is possible, I very much doubt it's something you want to do unless you're working on a library that's supposed to handle everything. (i.e., since you know what you're constructing, simply keep track of it elsewhere.)
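For instance, here is a minimal sketch (my own, not from the question) of what keeping track elsewhere could look like: build constr from the pattern matrix you want, so the matrix never has to be recovered from the solver. It assumes R, C, and the solver s are defined as above:
pattern = [[1,1,0,0,0],
           [1,0,0,0,1],
           [0,1,1,1,0],
           [0,1,0,1,1]]
# same shape of constraints as in the question, generated from the matrix
constr = [(row_var & col_var == col_var) if bit else (row_var & col_var != col_var)
          for row_var, row_bits in zip(R, pattern)
          for col_var, bit in zip(C, row_bits)]
s.add(constr)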
But, if you do want to analyze these expressions, the way to go is to study the AST structure. You'll have to become familiar with the contents of this file: https://github.com/Z3Prover/z3/blob/master/src/api/python/z3/z3.py, and especially the functions decl and children amongst others.
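If you do go down that road, here is a minimal sketch (my own, using the s, R, C defined above) that walks s.assertions() with children() and rebuilds the matrix, assuming every assertion keeps the exact shape row & col == col or row & col != col:
from z3 import is_eq

def constraint_matrix(s, R, C):
    M = [[None] * len(C) for _ in R]
    for f in s.assertions():
        lhs, _rhs = f.children()          # e.g. (x & a, a)
        row, col = lhs.children()         # e.g. (x, a)
        i = next(k for k, r in enumerate(R) if r.eq(row))   # .eq() is structural equality
        j = next(k for k, c in enumerate(C) if c.eq(col))
        M[i][j] = 1 if is_eq(f) else 0    # '!=' is built by z3py as a Distinct node, not Eq
    return M

print(constraint_matrix(s, R, C))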
I am a programming teacher, and I would like to write a script that detects the amount of repetition in a C/C++/Python file. I guess I can treat any file as pure text.
The script's output would be the number of similar sequences that repeat. Ultimately, I am only interested in a DRY metric (how well the code satisfies the DRY principle).
Naively, I tried a simple autocorrelation, but it is hard to find a proper threshold.
import numpy as np
import matplotlib.pyplot as plt

u = open("find.c").read()
v = [ord(x) for x in u]                 # treat the file as a sequence of character codes
y = np.correlate(v, v, mode="same")     # autocorrelation of the text with itself
y = y[: int(len(y) / 2)]
x = range(len(y))
z = np.polyval(np.polyfit(x, y, 3), x)  # cubic trend, used to detrend the signal
f = (y - z)[:-5]
plt.plot(f)
plt.show()
So I am looking at different strategies... I also tried to compare the similarities between each line, each group of 2 lines, each group of 3 lines ...
import difflib
import numpy as np

lines = open("b.txt").readlines()
lines = [line.strip() for line in lines]
n = 3
d = []
for i in range(len(lines)):
    a = lines[i:i+n]
    for j in range(len(lines)):
        b = lines[j:j+n]
        if i == j:
            continue  # skip comparing a group with itself
        group_size = np.sum([len(x) for x in a])
        if group_size < 5:
            continue  # skip short groups
        ratio = 0
        for u, v in zip(a, b):
            r = difflib.SequenceMatcher(None, u, v).ratio()
            ratio += r if r > 0.7 else 0
        d.append(ratio)
dry = sum(d) / len(lines)
In the following, we can identify some repetition at a glance:
w = int(len(d) / 100)
e = np.convolve(d, np.ones(w), "valid") / w * 10
plt.plot(range(len(d)), d, range(len(e)), e)
plt.show()
Why not use:
d = np.exp(np.array(d))
Thus, the difflib module looks promising; SequenceMatcher does some magic (Levenshtein?), but I would also need some magic constants (the 0.7)... However, this code is worse than O(n^2) and runs very slowly on long files.
What is funny is that the amount of repetition is quite easily identified with attentive eyes (sorry to this student for having taken his code as a good bad example).
I am sure there is a more clever solution out there.
Any hint?
I would build a system based on compressibility, because that is essentially what repetition means. Modern compression algorithms already look for ways to reduce repetition, so let's piggyback on that work.
Things that are similar will compress well under any reasonable compression algorithm, e.g. LZ. Under the hood, the compressed output is essentially the text plus back-references into itself, which you might be able to pull out.
Write a program that feeds lines [0:n] into the compression algorithm and compares the resulting output length to that for lines [0:n+1].
When the incremental length of the compressed output grows by a lot less than the incremental input, note down that you potentially have a DRY candidate at that location; and if you can figure out the format, you can see what previous text it was deemed similar to.
If you can figure out the compression format, you don't need to rely on the "size doesn't grow as much" heuristic; you can just pull out the references directly.
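As a rough sketch of that incremental heuristic (my own illustration; zlib is used as the compressor, and the file name and the 20/0.3 thresholds are arbitrary placeholders):
import zlib

lines = open("find.c").read().splitlines(keepends=True)
prev = 0
for n in range(1, len(lines) + 1):
    size = len(zlib.compress("".join(lines[:n]).encode()))
    added_input = len(lines[n - 1])
    added_output = size - prev
    # If the compressed size barely grows while the raw input grows a lot,
    # line n is probably repeating something that appeared earlier.
    if added_input > 20 and added_output < 0.3 * added_input:
        print("possible repetition around line", n)
    prev = size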
If needed, you can find similar structures with different names by pre-processing the input, for instance by normalizing the names. However I foresee this getting a bit messy, so it's a v2 feature. Pre-processing can also be used to normalize the formatting.
Looks like you're choosing a long path. I wouldn't go there.
I would look into minifying the code before analyzing it, to completely remove any influence of variable names, extra spacing, formatting, and even slight logic reshuffling.
Another approach would be comparing the students' bytecode, but it may not be a very good idea, since the result will likely have to be cleaned up further.
The dis module would be an interesting option.
I would most likely settle on comparing their ASTs. But the AST is likely to give false positives for short functions, because their structure may be too similar; so consider checking short functions with something else, something more trivial.
On top of that, I would consider using Levenshtein distance or something similar to numerically quantify the differences between the students' bytecode/sources/AST/dis output; a rough sketch is below. That would be roughly O(N^2), but that shouldn't matter.
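For example, here is a minimal sketch of the AST-plus-distance idea (the file names are purely illustrative, and SequenceMatcher's ratio stands in for a true Levenshtein distance):
import ast
import difflib

def node_types(source):
    # Sequence of AST node type names; identifiers and literal values are ignored.
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]

def similarity(src_a, src_b):
    return difflib.SequenceMatcher(None, node_types(src_a), node_types(src_b)).ratio()

# a value close to 1.0 suggests structurally similar code
print(similarity(open("student_a.py").read(), open("student_b.py").read()))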
Or, if needed, make it more complex and calculate the distance between each function of student A and each function of student B, highlighting cases where the distance is too short. That may not be needed, though.
With simplification and normalization of the input, more algorithms should start returning good results. If a student is good enough to take someone's code and reshuffle not only the variables but also the logic, and maybe even improve the algorithm, then that student understands the code well enough to defend it and use it without help in the future. I guess that's the kind of help a teacher would want to see exchanged between students.
You can treat this as a variant of the longest common subsequence problem between the input and itself, where the trivial matching of each element with itself is disallowed. This retains the optimal substructure of the standard algorithm, since it can be phrased as a non-transitive “equality” and the algorithm never relies on transitivity.
As such, we can write this trivial implementation:
import operator

class Repeat:
    def __init__(self, l):
        self.l = list(l)
        self.memo = {}

    def __call__(self, m, n):
        l = self.l
        memo = self.memo
        k = m, n
        ret = memo.get(k)
        if not ret:
            if not m or not n:
                ret = 0, None
            elif m != n and l[m-1] == l[n-1]:  # critical change here!
                z, tail = self(m-1, n-1)
                ret = z + 1, ((m-1, n-1), tail)
            else:
                ret = max(self(m-1, n), self(m, n-1), key=operator.itemgetter(0))
            memo[k] = ret
        return ret

    def go(self):
        n = len(self.l)
        v = self(n, n)[1]
        ret = []
        while v:
            x, v = v
            ret.append(x)
        ret.reverse()
        return ret

def repeat(l): return Repeat(l).go()
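As a hypothetical usage (the file name is illustrative), it could be driven on canonicalized source lines like this:
lines = [line.strip() for line in open("student.c") if line.strip()]
matches = repeat(lines)                 # list of (i, j) index pairs of matching lines
print(len(matches), "repeated line pairs found")
# Note: the memoized recursion is O(n^2) in time and space and recurses deeply,
# so as written it is only practical for fairly short files.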
You might want to canonicalize lines of code by removing whitespace (except perhaps between letters), removing comments, and/or replacing each unique identifier with a standardized label. You might also want to omit trivial lines like } in C/C++ to reduce noise. Finally, the symmetry should let you restrict attention to cases with, say, m >= n.
Of course, there are also "real" answers and real research on this issue!
Frame challenge: I’m not sure you should do this
It'd be a fun programming challenge for yourself, but if you intend to use it as a teaching tool, I'm not sure I would. There's not a good definition of "repeat" from the DRY principle that would be easy to test for fully in a computer program. The human definition, which I'd say is basically "failure to properly abstract your code at an appropriate level, manifested via some type of repetition of code, whether repeating exact blocks, repeating the same idea over and over again, or somewhere in between", isn't something I think anyone will be able to get working well enough at this time to use as a tool that teaches good habits with respect to DRY without confusing the student or teaching bad habits too. For now I'd argue this is a job for humans, because it's easy for us and hard for computers.
That said, if you want to give it a try, first define for yourself requirements for what errors you want to catch, what they'll look like, and what good code looks like; then define acceptable false-positive and false-negative rates, and test your code on a wide variety of representative inputs, validating it against human judgement to see if it performs well enough for your intended use. But I'm guessing you're really looking for more than simple repetition of tokens, and if you want to have a chance at succeeding I think you need to clearly define what you're looking for and how you'll measure success, and then validate your code. A teaching tool can do great harm if it doesn't actually teach the correct lesson: for example, if your tool simply encourages students to obfuscate their code so it doesn't get flagged as violating DRY, if it doesn't flag bad code so the student assumes it's ok, or if it flags code that is actually very well written.
More specifically, what types of repetition are ok and what aren't? Is it good or bad to use "if" or "for" or other syntax repeatedly in code? Is it ok for variables and functions/methods to have names with common substrings (e.g. average_age, average_salary, etc.)? How many times is repetition ok before abstraction should happen, and when it does, what kind of abstraction is needed and at what level (e.g. a simple method, or a functor, or a whole other class, or a whole other module)? Is more abstraction always better, or is perfect sometimes the enemy of on-time and on-budget? This is a really interesting problem, but it's also a very hard problem, and honestly I think a research problem, which is the reason for my frame challenge.
Edit:
Or if you definitely want to try this anyway, you can make it a teaching tool--not necessarily as you may have intended, but rather by showing your students your adherence to DRY in the code you write when creating your tool, and by introducing them to the nuances of DRY and the shortcomings of automated code quality assessment by being transparent with them about the limitations of your quality assessment tool. What I wouldn’t do is use it like some professors use plagiarism detection tools, as a digital oracle whose assessment of the quality of the students’ code is unquestioned. That approach is likely to cause more harm than good toward the students.
I suggest the following approach: let's say that repetitions should be at least 3 lines long. Then we hash every window of 3 consecutive lines. If a hash repeats, we write down the line number where it occurred. All that is left is to join adjacent duplicated line numbers together to get longer sequences.
For example, if you have duplicate blocks on lines 100-103 and 200-203, you will get {HASH1: (100, 200), HASH2: (101, 201)} (lines 100-102 and 200-202 produce the same value HASH1, and HASH2 covers lines 101-103 and 201-203). When you join the results, it will produce the sequence (100, 101, 200, 201). Finding the monotonic subsequences in that, you get ((100, 101), (200, 201)).
As no nested loops are used, the time complexity is linear (hashing every line group and the dictionary insertions are O(n) overall).
Algorithm:
read text line by line
transform it by
removing blanks
removing empty lines, saving mapping to original for the future
for each group of 3 transformed lines, join them and calculate a hash of the result
filter the lines whose hashes occur more than once (these are repetitions of at least 3 lines)
find longest sequences and present repetitive text
Code:
from itertools import groupby, cycle
import re

def sequences(l):
    x2 = cycle(l)
    next(x2)
    grps = groupby(l, key=lambda j: j + 1 == next(x2))
    yield from (tuple(v) + (next((next(grps)[1])),) for k, v in grps if k)

with open('program.cpp') as fp:
    text = fp.readlines()

# remove white spaces
processed, text_map = [], {}
proc_ix = 0
for ix, line in enumerate(text):
    line = re.sub(r"\s+", "", line, flags=re.UNICODE)
    if line:
        processed.append(line)
        text_map[proc_ix] = ix
        proc_ix += 1

# calc hashes
hashes, hpos = [], {}
for ix in range(len(processed) - 2):
    h = hash(''.join(processed[ix:ix + 3]))  # join 3 lines
    hashes.append(h)
    hpos.setdefault(h, []).append(ix)  # this list will reflect lines that are duplicated

# filter duplicated three liners
seqs = []
for k, v in hpos.items():
    if len(v) > 1:
        seqs.extend(v)
seqs = sorted(list(set(seqs)))

# find longer sequences
result = {}
for seq in sequences(seqs):
    result.setdefault(hashes[seq[0]], []).append((text_map[seq[0]], text_map[seq[-1] + 3]))

print('Duplicates found:')
for v in result.values():
    print('-' * 20)
    vbeg, vend = v[0]
    print(''.join(text[vbeg:vend]))
    print(f'Found {len(v)} duplicates, lines')
    for line_numbers in v:
        print(f'{1 + line_numbers[0]} : {line_numbers[1]}')
I have a number of Python codes that manipulate large files. In some of them I perform operations among columns or select rows based on their content. As the input files can have different structures, the operations are provided through the command line with a syntax like this: c3 + c5 -1 or (c3<4) & (c5>4) (or combinations). c4 is interpreted as column 4 of the input file.
My files look something like this ('input_file.txt'):
21.3 4321.34 34.12 4 343.3 2 324
34.34 67.56 764.45 2 54.768 6 45265
986.96 87.98 234.09 1 54.456 3 5262
[...]
Let's say that I want to sum column 4 with column 5 and subtract 1.
I would do
import re
import numpy as np
operation = "c3 + c5 -1" #in reality given from command line
pattern = re.compile(r"c(\d+)") # compile the regex that matches the column number (greedy, so multi-digit columns work)
# get the actual expression to evaluate
to_evaluate = pattern.sub("ifile[:,\\1]", operation)
#to_evaluate is: "ifile[:,3] + ifile[:,5] -1"
ifile = np.loadtxt('input_file.txt')
result = eval(to_evaluate) #evaluate the operation required
print(result)
# do the rest
Output
[5, 7, 3, ...]
I came up with this implementation because:
it's easy to write and to modify if I want to change the method for reading files (at the moment I can decide to use numpy or pandas) or if I want to add operations
gives me a lot of freedom on what I can do. I can treat c3 + c5 -1, (c3<4) & (c5>4) or (c2+c4)>0 in the same way.
I have the same signature in all my codes, so mistakes are less likely
I'm aware that eval can be unsafe (although for now I'm the only user of these codes) and can be slower than the corresponding code, but I couldn't think of a better way.
Is anyone aware of better/safer ways to implement such operations?
extra edit: if it matters, I'm running python 2.7
You can make a safer eval
def safe_eval(eval_str, variable_dict=None):
    '''welll... mostly safe:
    http://lybniz2.sourceforge.net/safeeval.html
    '''
    if variable_dict is None:
        variable_dict = {}
    return eval(eval_str, {"__builtins__": None}, variable_dict)
Although it will never make it perfectly safe from someone who knows Python very well (see http://nedbatchelder.com/blog/201206/eval_really_is_dangerous.html for an example).
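For the column expressions in the question, a minimal sketch of how it might be wired up (reusing the ifile and to_evaluate names from the question):
import numpy as np

ifile = np.loadtxt('input_file.txt')
to_evaluate = "ifile[:,3] + ifile[:,5] -1"
result = safe_eval(to_evaluate, {"ifile": ifile})   # builtins are disabled inside the eval
print(result)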
Your application is confusing to me though, so I'm not sure how much more I can help you!
I'm not sure if this will help solve what you are doing, but one thing you can do is compile all the functions in a module into a dictionary.
So you could compile the functions you want to use through something like:
module_dict = {}
for n in dir(module):
    module_dict[n] = eval('module.' + n)
(I believe this functionality is standard in Python 3, i.e. all modules' dictionaries can be accessed.) This puts all the function calls in dictionary form, which speeds up calls. It also solves the eval safety issues.
If you are trying to use operations like '+' or '==', you can get their function calls from object.__add__ and object.__eq__. You could map the tokens of your string syntax onto those calls.
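As an illustration (my own sketch, using the standard operator module, which amounts to the same idea as the dunder methods), such a mapping could look like:
import operator

# hypothetical mapping from tokens in the command-line syntax to functions
ops = {'+': operator.add, '-': operator.sub, '<': operator.lt, '>': operator.gt}

print(ops['+'](3, 5))   # 8
print(ops['<'](3, 5))   # True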
Not sure if that helps.
I'm fairly new to python and have found that I need to query a list about whether it contains a certain item.
The majority of the postings I have seen on various websites (including this similar stackoverflow question) have all suggested something along the lines of
for i in list:
    if i == thingIAmLookingFor:
        return True
However, I have also found from one lone forum that
if thingIAmLookingFor in list:
    # do work
works.
I am wondering if the if thing in list method is shorthand for the for i in list method, or if it is implemented differently.
I would also like to know which, if either, is preferred.
In your simple example it is of course better to use in.
However... in the question you link to, in doesn't work (at least not directly) because the OP does not want to find an object that is equal to something, but an object whose attribute n is equal to something.
One answer does mention using in on a list comprehension, though I'm not sure why a generator expression wasn't used instead:
if 5 in (data.n for data in myList):
    print "Found it"
But this is hardly much of an improvement over the other approaches, such as this one using any:
if any(data.n == 5 for data in myList):
    print "Found it"
The "if x in thing:" format is strongly preferred, not just because it takes less code, but because it also works on other data types and is (to me) easier to read.
I'm not sure how it's implemented, but I'd expect it to be quite a lot more efficient on data types that are stored in a more searchable form, e.g. sets or dictionary keys.
Using if thing in somelist is the preferred and fastest way.
Under-the-hood that use of the in-operator translates to somelist.__contains__(thing) whose implementation is equivalent to: any((x is thing or x == thing) for x in somelist).
Note the condition tests identity and then equality.
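For example, the three forms below are equivalent for a list (a tiny demonstration, not production code):
somelist = [1, 2, 3]
thing = 2
print(thing in somelist)                                     # True
print(somelist.__contains__(thing))                          # True, what the operator calls
print(any((x is thing or x == thing) for x in somelist))     # True, the equivalent logic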
for i in list:
    if i == thingIAmLookingFor:
        return True
The above is a terrible way to test whether an item exists in a collection. It returns True from the function, so if you need the test as part of some code you'd need to move this into a separate utility function, or add thingWasFound = False before the loop and set it to True in the if statement (and then break), either of which is several lines of boilerplate for what could be a simple expression.
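Spelled out, that flag-and-break boilerplate looks something like this (illustrative names):
thingWasFound = False
for item in somelist:
    if item == thingIAmLookingFor:
        thingWasFound = True
        break

# ...versus the single expression:
thingWasFound = thingIAmLookingFor in somelist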
Plus, if you just use thingIAmLookingFor in list, this might execute more efficiently by doing fewer Python level operations (it'll need to do the same operations, but maybe in C, as list is a builtin type). But even more importantly, if list is actually bound to some other collection like a set or a dictionary thingIAmLookingFor in list will use the hash lookup mechanism such types support and be much more efficient, while using a for loop will force Python to go through every item in turn.
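A quick, unscientific way to see that difference (sizes and repeat counts are arbitrary):
import timeit

data_list = list(range(100000))
data_set = set(data_list)

# membership test on a list scans every element; on a set it is a hash lookup
print(timeit.timeit(lambda: 99999 in data_list, number=1000))
print(timeit.timeit(lambda: 99999 in data_set, number=1000))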
Obligatory post-script: list is a terrible name for a variable that contains a list as it shadows the list builtin, which can confuse you or anyone who reads your code. You're much better off naming it something that tells you something about what it means.
Was coding something in Python. Have a piece of code, wanted to know if it can be done more elegantly...
# Statistics format is - done|remaining|200's|404's|size
statf = open(STATS_FILE, 'r').read()
starf = statf.strip().split('|')
done = int(starf[0])
rema = int(starf[1])
succ = int(starf[2])
fails = int(starf[3])
size = int(starf[4])
...
This goes on. I wanted to know if, after splitting the line into a list, there is any better way to assign each list item to a variable. I have close to 30 lines assigning index values to vars. Just trying to learn more about Python, that's it...
done, rema, succ, fails, size, ... = [int(x) for x in starf]
Better:
labels = ("done", "rema", "succ", "fails", "size")
data = dict(zip(labels, [int(x) for x in starf]))
print data['done']
What I don't like about the answers so far is that they stick everything in one expression. You want to reduce the redundancy in your code, without doing too much at once.
If all of the items on the line are ints, then convert them all together, so you don't have to write int(...) each time:
starf = [int(i) for i in starf]
If only certain items are ints--maybe some are strings or floats--then you can convert just those:
for i in 0, 1, 2, 3, 4:
    starf[i] = int(starf[i])
Assigning in blocks is useful; if you have many items--you said you had 30--you can split it up:
done, rema, succ = starf[0:3]
fails, size = starf[3:5]
I might use the csv module with a separator of | (though that might be overkill if you're "sure" the format will always be super-simple, single-line, no-strings, etc, etc). Like your low-level string processing, the csv reader will give you strings, and you'll need to call int on each (with a list comprehension or a map call) to get integers. Other tips include using the with statement to open your file, to ensure it won't cause a "file descriptor leak" (not indispensable in current CPython version, but an excellent idea for portability and future-proofing).
But I question the need for 30 separate barenames to represent 30 related values. Why not, for example, make a collections.namedtuple type with appropriately-named fields, initialize an instance thereof, then use qualified names for the fields, i.e., a nice namespace? Remember the last koan in the Zen of Python (import this at the interpreter prompt): "Namespaces are one honking great idea -- let's do more of those!"... barenames have their (limited;-) place, but representing dozens of related values is not one -- rather, this situation "cries out" for the "let's do more of those" approach (i.e., add one appropriate namespace grouping the related fields -- a much better way to organize your data).
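A minimal sketch of that namedtuple idea (the field names are illustrative, and starf is the already-split list from the question):
import collections

Stats = collections.namedtuple("Stats", "done rema succ fails size")
stats = Stats(*[int(x) for x in starf[:5]])
print stats.done, stats.rema        # qualified names instead of 30 barenames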
Using a Python dict is probably the most elegant choice.
If you put your keys in a list as such:
keys = ("done", "rema", "succ" ... )
somedict = dict(zip(keys, [int(v) for v in values]))
That would work. :-) Looks better than 30 lines too :-)
EDIT: I think there are dict comprehensions now, so that may look even better too! :-)
EDIT Part 2: Also, for the keys collection, you'd want to break that into multiple lines.
EDIT Again: fixed buggy part :)
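For instance, with a dict comprehension (Python 2.7+; keys and values as in the snippet above):
somedict = {k: int(v) for k, v in zip(keys, values)}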
Thanks for all the answers. So here's the summary -
Glenn's answer was to handle this issue in blocks, i.e. done, rema, succ = starf[0:3] etc.
Leoluk's approach was more short & sweet taking advantage of python's immensely powerful dict comprehensions.
Alex's answer was more design oriented. Loved this approach. I know it should be done the way Alex suggested, but a lot of code refactoring would need to take place for that. Not a good time to do it now.
townsean - same as 2
I have taken up Leoluk's approach. I am not sure what the speed implications of this are; I have no idea if list/dict comprehensions take a hit on speed of execution. But it reduces the size of my code considerably for now. I'll optimize when the need comes :) Going by "Premature optimization is the root of all evil"...
I am parsing a file with Python and pyparsing (it's the report file for PSAT in Matlab, but that isn't important). Here is what I have so far. I think it's a mess and would like some advice on how to improve it. Specifically, how should I organise my grammar definitions with pyparsing?
Should I have all my grammar definitions in one function? If so, it's going to be one huge function. If not, then how do I break it up? At the moment I have split it at the sections of the file. Is it worth making loads of functions that only ever get called once from one place? Neither really feels right to me.
Should I place all my input and output code in a separate file from the other class functions? It would make the purpose of the class much clearer.
I'm also interested to know if there is an easier way to parse a file, do some sanity checks and store the data in a class. I seem to spend a lot of my time doing this.
(I will accept answers of it's good enough or use X rather than pyparsing if people agree)
I could go either way on using a single big method to create your parser vs. taking it in steps the way you have it now.
I can see that you have defined some useful helper utilities, such as slit ("suppress Literal", I presume), stringtolits, and decimaltable. This looks good to me.
I like that you are using results names, they really improve the robustness of your post-parsing code. I would recommend using the shortcut form that was added in pyparsing 1.4.7, in which you can replace
busname.setResultsName("bus1")
with
busname("bus1")
This can declutter your code quite a bit.
I would look back through your parse actions to see where you are using numeric indexes to access individual tokens, and go back and assign results names instead. Here is one case, where GetStats returns (ngroup + sgroup).setParseAction(self.process_stats). process_stats has references like:
self.num_load = tokens[0]["loads"]
self.num_generator = tokens[0]["generators"]
self.num_transformer = tokens[0]["transformers"]
self.num_line = tokens[0]["lines"]
self.num_bus = tokens[0]["buses"]
self.power_rate = tokens[1]["rate"]
I like that you have Group'ed the values and the stats, but go ahead and give them names, like "network" and "soln". Then you could write this parse action code as (I've also converted to the - to me - easier-to-read object-attribute notation instead of dict element notation):
self.num_load = tokens.network.loads
self.num_generator = tokens.network.generators
self.num_transformer = tokens.network.transformers
self.num_line = tokens.network.lines
self.num_bus = tokens.network.buses
self.power_rate = tokens.soln.rate
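To illustrate with a tiny self-contained grammar (the element names here are made up, not taken from the question's parser):
from pyparsing import Group, Word, nums

ngroup = Group(Word(nums)("loads") + Word(nums)("generators"))
sgroup = Group(Word(nums)("rate"))
stats = ngroup("network") + sgroup("soln")

tokens = stats.parseString("12 34 100")
print(tokens.network.loads, tokens.soln.rate)   # -> 12 100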
Also, a style question: why do you sometimes use the explicit And constructor, instead of using the '+' operator?
busdef = And([busname.setResultsName("bus1"),
busname.setResultsName("bus2"),
integer.setResultsName("linenum"),
decimaltable("pf qf pl ql".split())])
This is just as easily written:
busdef = (busname("bus1") + busname("bus2") +
integer("linenum") +
decimaltable("pf qf pl ql".split()))
Overall, I think this is about par for a file of this complexity. I have a similar format (proprietary, unfortunately, so cannot be shared) in which I built the code in pieces similar to the way you have, but in one large method, something like this:
def parser():
    header = Group(...)
    inputsummary = Group(...)
    jobstats = Group(...)
    measurements = Group(...)
    return header("hdr") + inputsummary("inputs") + jobstats("stats") + measurements("meas")
The Group constructs are especially helpful in a large parser like this, to establish a sort of namespace for results names within each section of the parsed data.