How do I implement Levenshtein distance on records in a database table using Python? I know how to connect Python to the database, the coding itself should not be a problem, and I already have the records in a database table. I understand the theory and the dynamic programming behind Levenshtein distance. The problem is how to write the code so that, after connecting to the database table, I can compare two records having up to three fields and output their similarity score. Below is a snippet of my database table:
Record 1:
Author : Michael I James
Title : Advancement in networking
Journal: ACM
Record 2:
Author: Michael J Inse
Title: Advancement in networking
Journal: ACM
Any ideas are welcome. I'm a newbie in this area, so please explain in a little detail.
Thanks.
My understanding of your problem is that you need to identify very similar records that are potentially duplicates.
I would solve this in the database itself. No need to do programming. If you don't have the Levenshtein function available in your DB you may want to create a User Defined Function.
Here is an example for MySQL:
CREATE FUNCTION `levenshtein`(s1 VARCHAR(255), s2 VARCHAR(255)) RETURNS int(11) DETERMINISTIC
BEGIN
DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
DECLARE s1_char CHAR; DECLARE cv0, cv1 VARBINARY(256);
SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
IF s1 = s2 THEN
RETURN 0;
ELSEIF s1_len = 0 THEN
RETURN s2_len;
ELSEIF s2_len = 0 THEN
RETURN s1_len;
ELSE
WHILE j <= s2_len DO
SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
END WHILE;
WHILE i <= s1_len DO
SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
WHILE j <= s2_len DO
SET c = c + 1;
IF s1_char = SUBSTRING(s2, j, 1) THEN
SET cost = 0;
ELSE
SET cost = 1;
END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
IF c > c_temp THEN
SET c = c_temp;
END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
IF c > c_temp THEN
SET c = c_temp;
END IF;
SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
END WHILE;
SET cv1 = cv0, i = i + 1;
END WHILE;
END IF;
RETURN c;
END
Then you need to compare all of your records to each other. This requires joining the table with itself (a self cross join), which of course can be a bit heavy. If it is too heavy, you will need to go the Python way, which lets you avoid repetitions (comparing the same pair of records several times).
Here is what I would do. Note that I would rather use an ID for easier identification:
SELECT a.ID AS IDa,
b.ID AS IDb,
a.Author AS AuthorA,
b.Author AS AuthorB,
ap.levenshtein(a.Author, b.Author) AS Lev_Aut,
a.Title AS TitleA, b.Title AS TitleB, ap.levenshtein(a.Title, b.Title) AS Lev_Title,
a.Journal AS JournalA, b.Journal AS JournalB, ap.levenshtein(a.Journal, b.Journal) AS Lev_Journal,
ap.levenshtein(a.Author, b.Author) + ap.levenshtein(a.Title, b.Title) + ap.levenshtein(a.Journal, b.Journal) AS Composite
FROM test.zzz AS a, test.zzz AS b
WHERE a.ID != b.ID
ORDER BY Composite;
This returns a list of Levenshtein values ordered from the best match to the worst (by the Composite column). The WHERE condition prevents a record from being compared with itself.
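If the join is too heavy and you go the Python way instead, the comparison could look roughly like the sketch below. It is only an illustration: the `records` list stands in for rows fetched from your table (the IDs and dictionary keys are assumptions), `levenshtein` is a plain dynamic-programming implementation, and `itertools.combinations` visits each pair of records only once.

```python
from itertools import combinations

def levenshtein(s1, s2):
    """Edit distance between two strings, computed row by row."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1          # keep the shorter string in the inner loop
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

# Hypothetical rows; in practice they would come from cursor.fetchall().
records = [
    {"ID": 1, "Author": "Michael I James", "Title": "Advancement in networking", "Journal": "ACM"},
    {"ID": 2, "Author": "Michael J Inse", "Title": "Advancement in networking", "Journal": "ACM"},
]
FIELDS = ("Author", "Title", "Journal")

def compare(records):
    """Return (id_a, id_b, per-field distances, composite) per pair, best match first."""
    results = []
    for a, b in combinations(records, 2):   # each unordered pair exactly once
        distances = {f: levenshtein(a[f], b[f]) for f in FIELDS}
        results.append((a["ID"], b["ID"], distances, sum(distances.values())))
    return sorted(results, key=lambda r: r[3])

for id_a, id_b, distances, composite in compare(records):
    print(id_a, id_b, distances, composite)
```

A lower composite value means a closer match. To turn a distance into a similarity between 0 and 1, one common choice is 1 - distance / max(len(s1), len(s2)).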
I have a structure made of the letters A, Z, F, G.
A pairs with Z, and F pairs with G.
For example, I have AAAFFZZGGFFGGAAAAZZZZ. I need to "curve" (fold) the structure so that the fold creates pairs:
A A A F F Z Z G G F F -
* * * * * * |
Z Z Z Z A A A A G G -
There are 6 pairs in the example above.
But we can also create a structure like this:
A A A F F Z Z G G -
* * * |
Z Z Z Z A A A A G G F F -
The head is the fragment of the structure where pairs cannot exist. For example, with a head of size 2:
F F
A A A F F Z Z G G / \
* * * * |
Z Z Z Z A A A A \ /
G G
I don't know how to find the maximal number of pairs that can be created in this structure.
To make the matching go faster you can prepare a second string (R) with the letters flipped to their corresponding values. This will allow direct comparisons of paired positions instead of going through an indirection n^2 times:
S = "AAAFFZZGGFFGGAAAAZZZZ"
match = str.maketrans("AZFG","ZAGF")
R = S.translate(match)                   # flipped matching letters
mid = len(S)//2
maxCount,maxPos = 0,0
for d in range(mid+1):                   # moving away from middle
    for p in (mid+d,mid-d):              # right then left (a tuple keeps this order)
        # S[:p][::-1] is the left strip reversed; it is empty when p == 0
        pairs = sum(a==b for a,b in zip(S[:p][::-1],R[p:]))  # match fold
        if pairs>maxCount:
            maxCount,maxPos = pairs,p    # track best so far and fold position
    if p <= maxCount: break              # stop when impossible to improve
print(maxCount) # 6 matches
print(maxPos)   # folding at 11
print(S[:maxPos].rjust(len(S)))        # AAAFFZZGGFF
print(S[maxPos:][::-1].rjust(len(S)))  #  ZZZZAAAAGG
                                       #  **  **  **
You could start with testing the folding point in the center, and then fan out (in zig-zag manner) putting the folding point further from the center.
Then for a given folding point, count the allowed pairs. You can use slicing to create the two strips, one in reversed order, and then zip those to get the pairs.
The outer iteration (determining the folding point) can stop when the size of the shortest strip is shorter than the size of the best answer found so far.
Here is a solution, which also returns the actual fold, so it can be printed for verification:
def best_fold(chain):
    allowed = set(("AZ","ZA","FG","GF"))
    revchain = chain[::-1]
    maxnumpairs = 0
    fold = chain  # This represents "no solution"
    n = len(chain)
    head = n // 2
    for diff in range(n):
        head += diff if diff % 2 else -diff
        if head - 2 < maxnumpairs or n - head - 2 < maxnumpairs:
            break
        numpairs = sum(a+b in allowed
            for a, b in zip(revchain[-head+2:], chain[head+2:])
        )
        if numpairs > maxnumpairs:
            maxnumpairs = numpairs
            fold = chain[:head].rjust(n) + "\n" + revchain[:-head].rjust(n)
    return maxnumpairs, fold
Here is how to run it on the example string:
numpairs, fold = best_fold("AAAFFZZGGFFGGAAAAZZZZ")
print(numpairs) # 5
print(fold) # AAAFFZZGGF
# ZZZZAAAAGGF
I would start with a suitable data type to implement your creasable string. First things first: as you mention in your comment, we can use a list of characters, which is better than a string. Then a pair of arrays like [[Char],[Char]] would perhaps be sufficient.
Also creasing from left or right shouldn't matter so for simplicity we start with;
[ ["A","A","F","F","Z","Z","G","G","F","F","G","G","A","A","A","A","Z","Z","Z","Z"]
, ["A"]
]
Then in every step we can map our list of characters into a tuple of creases like
chars.map((_,i,a) => [a.slice(i+1),a.slice(0,i+1).reverse()])
Compare and count for pairs.
In order to make an efficient comparison of corresponding items being a valid pair, a simple look up table as below can be used
{ "A": "Z"
, "Z": "A"
, "G": "F"
, "F": "G"
}
Finally we can filter the longest one(s)
An implementation of creases could be like;
function pairs(cs){
var lut = { "A": "Z"
, "Z": "A"
, "G": "F"
, "F": "G"
};
return cs.map(function(_,i,a){
var crs = [a.slice(i+1),a.slice(0,i+1).reverse()]; // console.log(crs) to see all creases)
return crs[0].reduce( (ps,c,j) => lut[c] === crs[1][j] ? ( ps.res.push([c,crs[1][j],j])
, ps.crs ??= crs.concat(i+1)
, ps
)
: ps
, { res: []
, crs: null
}
);
})
.reduce( (r,d) => r.length ? r[r.length-1].res.length > d.res.length ? r :
r[r.length-1].res.length < d.res.length ? [d] : r.concat(d)
: [d]
, []
);
}
var chars = ["A","A","A","F","F","Z","Z","G","G","F","F","G","G","A","A","A","A","Z","Z","Z","Z"],
result = pairs(chars);
console.log(result);
Now, this is a straightforward algorithm and possibly not the most efficient one. It could be made faster by testing pairs with index (modular) arithmetic instead of building additional arrays; however, I believe that would be overkill and very hard to explain.
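One simple form of that index-based idea, sketched here in Python rather than JavaScript and using plain index arithmetic: for a fold placed just before index p, the k-th pair is `s[p-1-k]` and `s[p+k]`, so the pairs can be counted in place without building sliced or reversed copies. The names `count_pairs` and `best_fold_index` are made up for this illustration.

```python
PARTNER = {"A": "Z", "Z": "A", "F": "G", "G": "F"}

def count_pairs(s, p):
    """Count valid pairs when s is folded between s[p-1] and s[p]."""
    overlap = min(p, len(s) - p)   # length of the shorter strip
    return sum(PARTNER.get(s[p - 1 - k]) == s[p + k] for k in range(overlap))

def best_fold_index(s):
    """Try every fold position and return (best pair count, fold position)."""
    best_count, best_p = 0, 0
    for p in range(1, len(s)):
        pairs = count_pairs(s, p)
        if pairs > best_count:
            best_count, best_p = pairs, p
    return best_count, best_p

print(best_fold_index("AAAFFZZGGFFGGAAAAZZZZ"))  # (6, 11)
```

This still does quadratic work in the worst case; it only avoids the temporary arrays.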
I'm playing around with OR-Tools to create puzzles like a < b, a < c, b < c, c = d. The idea is to give people these equations and they need to find the numbers that need to be entered into the variables a-d to create a valid solution. OR-Tools should create solutions and I remove the numbers to create the puzzle. Some numbers will be left to give a start. For example, in the above equations b=2 might be given, the player can then figure out that a must be 1. Players are given the variables, the equations with operators, and some start values and need to figure out the set of variable assignments that make the set of equations valid.
Creating variables and constraints like this works like a charm:
a = model.NewIntVar(min_number, max_number, 'a')
b = model.NewIntVar(min_number, max_number, 'b')
c = model.NewIntVar(min_number, max_number, 'c')
d = model.NewIntVar(min_number, max_number, 'd')
model.Add(a < b)
model.Add(a < c)
model.Add(b < c)
model.Add(c == d)
This makes or-tools find the operands.
Now I also want the mathematical operators to be "calculated", so that I don't have to provide them.
My current workaround is randomizing them with operator before creating the constraints like this:
def rand_op():
return random.choice(['<', '>', '='])
ops = {
'<': operator.lt,
'>': operator.gt,
'=': operator.eq
}
model.Add(ops.get(rand_op())(a, b))
model.Add(ops.get(rand_op())(b, c))
...
That does solve my problem, but it obviously takes some of the magic out of OR-Tools: with this randomness, many problems that have no solution are created, and it takes quite a lot of loops to find solutions.
So I'm wondering how I could achieve this in a better way. One naive approach would be to work with allowed assignments. So I would take two variables for the operands and one variable for the operator (with encoded operator signs) and then make allowed assignments:
ops = {
'<': 0,
'>': 1,
'=': 2
}
a = model.NewIntVar(min_number, max_number, 'a')
b = model.NewIntVar(min_number, max_number, 'b')
op_ab = model.NewIntVar(min_op_number, max_op_number, 'op_ab')
model.AddAllowedAssignments([a, b, op_ab], [
(1, 2, 0),
(2, 1, 1),
(1, 1, 2),
...
])
I'm pretty sure this will work, but it would create quite a lot of allowed constraints. Depending on the size of the puzzle it could go into the millions. The more variables and operators available the more constraints would be needed and the larger the domain for the variables the larger the list of allowed assignments per constraint. So it would grow horizontally and vertically in size, probably exponentially.
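To get a feel for the table size, the rows can be enumerated in plain Python, with no OR-Tools involved. Under the encoding above (0 for <, 1 for >, 2 for =), every combination of values for a and b has exactly one matching operator, so a single constraint needs one row per combination of operand values. The helper `allowed_rows` is a name invented for this sketch.

```python
from itertools import product

OPS = {"<": 0, ">": 1, "=": 2}

def allowed_rows(min_number, max_number):
    """All (a, b, op) rows for one pair of variables over the given domain."""
    rows = []
    for a, b in product(range(min_number, max_number + 1), repeat=2):
        op = "<" if a < b else ">" if a > b else "="
        rows.append((a, b, OPS[op]))
    return rows

rows = allowed_rows(1, 10)
print(len(rows))   # 100
print(rows[:3])    # [(1, 1, 2), (1, 2, 0), (1, 3, 0)]
```

So, at least for this encoding, the table for one constraint grows with the square of the domain size, and one such table is needed per pair of variables that is compared.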
Is there another approach I could use? Any ideas on a more performant way? Any constraints that I should look into?
Here's my take on this; I hope I haven't missed some important requirement.
The following model uses the OR-Tools CP-SAT solver in Python. It is a quite direct approach to the problem: it connects the operators and the variables using OnlyEnforceIf via an array of Booleans of length 3 (for the 3 operators) and then requires that the sum of these is 1 (exactly one operator).
The model generates all possible solutions given n (number of variables) and their domain (min_val and max_val).
from __future__ import print_function
from ortools.sat.python import cp_model as cp

"""
Print solutions
"""
class SolutionPrinter(cp.CpSolverSolutionCallback):
    def __init__(self, x, ops, ops_d, num_sols=0):
        cp.CpSolverSolutionCallback.__init__(self)
        self.__x = x
        self.__ops = ops
        self.__ops_d = ops_d
        self.__num_sols = num_sols
        self.__solution_count = 0

    def OnSolutionCallback(self):
        self.__solution_count += 1
        x = [self.Value(v) for v in self.__x]
        n = len(x)
        # ops = [self.Value(v) for v in self.__ops]
        ops_d = self.__ops_d
        num_sols = self.__num_sols
        print("===================")
        print("Problem:")
        for i in range(n):
            for j in range(i+1,n):
                op = ops_d[self.Value(self.__ops[i,j])]
                print(f"x{i} {op} x{j}")
        print()
        print("Solution:")
        print("x:",x)
        for i in range(n):
            for j in range(i+1,n):
                op = ops_d[self.Value(self.__ops[i,j])]
                print(f"{x[i]} {op} {x[j]}")
        print()
        # Stop early once the requested number of solutions has been printed
        if num_sols > 0 and self.__solution_count >= num_sols:
            self.StopSearch()

    def SolutionCount(self):
        return self.__solution_count


def generate_math_puzzle(n=3,min_val=1,max_val=10,num_sols=0):
    model = cp.CpModel()

    lt = 0 # <
    eq = 1 # =
    gt = 2 # >
    ops_a = [lt,eq,gt]
    num_ops = len(ops_a)

    # lookup for presentation
    ops_d = {0:"<",
             1:"=",
             2:">"}

    num_pairs = (n*(n-1)) // 2

    # variables
    x = [model.NewIntVar(min_val, max_val, f'x[{i}]') for i in range(n)]

    # One operator variable per pair (i, j) with i < j
    ops = {}
    for i in range(n):
        for j in range(i+1,n):
            ops[i,j] = model.NewIntVar(lt,gt,f"op_{i}_{j}")
            # One Boolean per operator; exactly one of them is true
            bs = [model.NewBoolVar(f"bs_{i}_{j}_{k}") for k in range(num_ops)]
            for op in ops_a:
                model.Add(ops[i,j]==op).OnlyEnforceIf(bs[op])
            model.Add(sum(bs) == 1) # exactly one operation
            model.Add(x[i] < x[j]).OnlyEnforceIf(bs[lt])
            model.Add(x[i] == x[j]).OnlyEnforceIf(bs[eq])
            model.Add(x[i] > x[j]).OnlyEnforceIf(bs[gt])

    print("ModelStats:", model.ModelStats())

    solver = cp.CpSolver()
    # solver.parameters.num_search_workers = 8
    solution_printer = SolutionPrinter(x,ops,ops_d,num_sols)
    status = solver.SearchForAllSolutions(model, solution_printer)

    if status not in [cp.OPTIMAL,cp.FEASIBLE]:
        print("No solution!")

    print()
    print("NumSolutions:", solution_printer.SolutionCount())
    print("NumConflicts:", solver.NumConflicts())
    print("NumBranches:", solver.NumBranches())
    print("WallTime:", solver.WallTime())


n=3
min_val=1
max_val=10
num_sols = 0  # 0 means: print all solutions
generate_math_puzzle(n,min_val,max_val,num_sols)
Here are some solutions for the simple problem with n=3, min_val=1, and max_val=10.
Problem:
x0 > x1
x0 < x2
x1 < x2
Solution:
x: [2, 1, 3]
2 > 1
2 < 3
1 < 3
===================
Problem:
x0 = x1
x0 < x2
x1 < x2
Solution:
x: [1, 1, 2]
1 = 1
1 < 2
1 < 2
In all, it's 1000 solutions and takes about 0.1s to generate them all.
Here is a result for a larger problem with n=10 (the output below lists the ten variables x0 to x9), min_val=1, and max_val=100:
===================
Problem:
x0 = x1
x0 < x2
x0 < x3
x0 < x4
x0 < x5
x0 < x6
x0 < x7
x0 < x8
x0 < x9
x1 < x2
x1 < x3
x1 < x4
x1 < x5
x1 < x6
x1 < x7
x1 < x8
x1 < x9
x2 < x3
x2 < x4
x2 < x5
x2 < x6
x2 < x7
x2 < x8
x2 < x9
x3 < x4
x3 < x5
x3 < x6
x3 < x7
x3 < x8
x3 < x9
x4 < x5
x4 < x6
x4 < x7
x4 < x8
x4 < x9
x5 < x6
x5 < x7
x5 < x8
x5 < x9
x6 < x7
x6 < x8
x6 < x9
x7 < x8
x7 < x9
x8 < x9
Solution:
x: [1, 1, 2, 3, 4, 5, 6, 7, 15, 65]
1 = 1
1 < 2
1 < 3
1 < 4
1 < 5
1 < 6
1 < 7
1 < 15
1 < 65
1 < 2
1 < 3
1 < 4
1 < 5
1 < 6
1 < 7
1 < 15
1 < 65
2 < 3
2 < 4
2 < 5
2 < 6
2 < 7
2 < 15
2 < 65
3 < 4
3 < 5
3 < 6
3 < 7
3 < 15
3 < 65
4 < 5
4 < 6
4 < 7
4 < 15
4 < 65
5 < 6
5 < 7
5 < 15
5 < 65
6 < 7
6 < 15
6 < 65
7 < 15
7 < 65
15 < 65
Note that for some problem instances (i.e. pairs of xi op xj) there might be multiple solutions.
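How many solutions a given operator pattern has can be checked by brute force for small instances. The sketch below is independent of the CP-SAT model above and uses only the standard library; `count_solutions` and the `pattern` dictionary format are assumptions made for this illustration.

```python
from itertools import product
import operator

OPS = {"<": operator.lt, "=": operator.eq, ">": operator.gt}

def count_solutions(pattern, n, min_val, max_val):
    """pattern maps (i, j) to an operator symbol; count the x that satisfy all of it."""
    return sum(
        all(OPS[op](x[i], x[j]) for (i, j), op in pattern.items())
        for x in product(range(min_val, max_val + 1), repeat=n)
    )

# The two problem instances shown above for n=3 and values 1..10
print(count_solutions({(0, 1): "=", (0, 2): "<", (1, 2): "<"}, 3, 1, 10))  # 45
print(count_solutions({(0, 1): ">", (0, 2): "<", (1, 2): "<"}, 3, 1, 10))  # 120
```

A puzzle with a single answer therefore needs either a tighter domain or some given start values, as described in the question.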
@bechtold, your problem was interesting enough that I attempted a solution. It is based on my last comment on your post.
Caveat: It's not exactly what you want because
I use C#, not Python (I simply don't have enough experience with Python and don't have a working installation of it), and
I used the old solver (Google.OrTools.ConstraintSolver) instead of the newer SAT solver (Google.OrTools.Sat); I'll explain the reasons for that below.
Class Term
First I created a class Term with some overloaded operators to make it easier to create the expressions for the puzzle.
The overloaded operators + and * allow creation of terms like 2 * a or a + 2 to make the puzzles a little more interesting. These create a new IntVar and constrain it to be equal to the arithmetic operation on the original variable.
The overloaded operators >, == and < are a little more subtle. They create a constraint like a < b, but do not add it directly to the model. Instead, they also create a new enforced BoolVar (the IsInPuzzle variable of my comment), and constrain it to be less than or equal to the Var() of the constraint. This effectively makes the constraint optional, because depending on whether enforced == 1 or is 0 or has an open domain, the constraint does or doesn't have to be fulfilled. This is pretty much the equivalent of the OnlyEnforceIf() method on constraints in the new SAT solver.
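The effect of `constraint.Var() >= enforced` can be checked by listing the combinations it permits. The snippet below is a plain Python truth table, not OR-Tools code; `enforced` and `holds` are stand-in names for the two 0/1 variables.

```python
from itertools import product

# enforced: 1 if the expression is part of the puzzle
# holds:    1 if the underlying comparison (e.g. a < b) is true
allowed = [(enforced, holds)
           for enforced, holds in product((0, 1), repeat=2)
           if holds >= enforced]
print(allowed)  # [(0, 0), (0, 1), (1, 1)]
```

Only the combination enforced = 1, holds = 0 is ruled out, which is exactly the implication "if enforced, then the comparison must hold".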
The overloaded operators return Expression objects (see below).
The class includes some housekeeping methods and an overridden ToString() method for easier output.
using Google.OrTools.ConstraintSolver;
using System;
namespace SO69225141_create_puzzles
{
#pragma warning disable CS0660 // Type defines operator == or operator != but does not override Object.Equals(object o)
#pragma warning disable CS0661 // Type defines operator == or operator != but does not override Object.GetHashCode()
internal class Term
#pragma warning restore CS0661 // Type defines operator == or operator != but does not override Object.GetHashCode()
#pragma warning restore CS0660 // Type defines operator == or operator != but does not override Object.Equals(object o)
{
protected Google.OrTools.ConstraintSolver.Solver solver;
public string name { get; private set; }
public IntVar intVar { get; private set; }
private string basedOnVariableName;
public Term(Solver solver_, string name_, long minNumber_, long maxNumber_)
{
this.solver = solver_ ?? throw new ArgumentNullException(nameof(solver_));
if (string.IsNullOrEmpty(name_))
{
name = "unknown";
}
else
{
name = name_;
}
intVar = solver.MakeIntVar(minNumber_, maxNumber_, name);
basedOnVariableName = name;
}
public Term(Solver solver_, string name_) : this(solver_, name_, PuzzleCreator.minNumber, PuzzleCreator.maxNumber) { }
// Returns true if this term is based on the same variable as the other term (e.g. true for "(a + 2)" given "a" but not for "(3 * b)".
public bool isBasedOnSameVariableAs(Term other)
{
return this.basedOnVariableName == other.basedOnVariableName;
}
// Creates a new Term whose IntVar is constrained to be term + val
public static Term operator +(Term term, long val)
{
if (object.ReferenceEquals(term, null)) { throw new ArgumentNullException(nameof(term)); }
Term newTerm = new Term(term.solver, $"({term.name} + {val})", term.intVar.Min() + val, term.intVar.Max() + val);
newTerm.basedOnVariableName = term.basedOnVariableName;
term.solver.Add(newTerm.intVar == (term.intVar + val));
return newTerm;
}
// Creates a new Term whose IntVar is constrained to be val * term
public static Term operator *(long val, Term term)
{
if (object.ReferenceEquals(term, null)) { throw new ArgumentNullException(nameof(term)); }
Term newTerm = new Term(term.solver, $"({val} * {term.name})", val * term.intVar.Min(), val * term.intVar.Max());
newTerm.basedOnVariableName = term.basedOnVariableName;
term.solver.Add(newTerm.intVar == (val * term.intVar));
return newTerm;
}
public static Expression operator >(Term left, Term right)
{
if (object.ReferenceEquals(left, null)) { throw new ArgumentNullException(nameof(left)); }
if (object.ReferenceEquals(right, null)) { throw new ArgumentNullException(nameof(right)); }
Solver solver = left.intVar.solver();
Constraint constraint = solver.MakeGreater(left.intVar, right.intVar);
string name = $"{left.name} > {right.name}";
IntVar enforced = solver.MakeBoolVar(name);
solver.Add(constraint.Var() >= enforced);
Expression expr = new Expression(name, enforced, constraint.Var());
return expr;
}
public static Expression operator <(Term left, Term right)
{
if (object.ReferenceEquals(left, null)) { throw new ArgumentNullException(nameof(left)); }
if (object.ReferenceEquals(right, null)) { throw new ArgumentNullException(nameof(right)); }
Solver solver = left.intVar.solver();
Constraint constraint = solver.MakeLess(left.intVar, right.intVar);
string name = $"{left.name} < {right.name}";
IntVar enforced = solver.MakeBoolVar(name);
solver.Add(constraint.Var() >= enforced);
Expression expr = new Expression(name, enforced, constraint.Var());
return expr;
}
public static Expression operator ==(Term left, Term right)
{
if (object.ReferenceEquals(left, null)) { throw new ArgumentNullException(nameof(left)); }
if (object.ReferenceEquals(right, null)) { throw new ArgumentNullException(nameof(right)); }
Solver solver = left.intVar.solver();
Constraint constraint = solver.MakeEquality(left.intVar, right.intVar);
string name = $"{left.name} == {right.name}";
IntVar enforced = solver.MakeBoolVar(name);
solver.Add(constraint.Var() >= enforced);
Expression expr = new Expression(name, enforced, constraint.Var());
return expr;
}
public static Expression operator !=(Term left, Term right)
{
if (object.ReferenceEquals(left, null)) { throw new ArgumentNullException(nameof(left)); }
if (object.ReferenceEquals(right, null)) { throw new ArgumentNullException(nameof(right)); }
Solver solver = left.intVar.solver();
Constraint constraint = solver.MakeNonEquality(left.intVar, right.intVar);
string name = $"{left.name} != {right.name}";
IntVar enforced = solver.MakeBoolVar(name);
solver.Add(constraint.Var() >= enforced);
Expression expr = new Expression(name, enforced, constraint.Var());
return expr;
}
public override string ToString()
{
if (intVar.Bound())
{
return $"{name} = {intVar.Value()}";
}
else
{
return intVar.ToString();
}
}
}
}
Class Expression
The class Expression keeps track of the enforced and constraint variables, as well as some helper methods to aid decision building and output.
The isEnforced property is true when the expression is being enforced as part of a puzzle.
The Bound() method returns true either after a decision to enforce the expression or after refutation of the decision, but also if the truth value of the expression is already determined via the propagation of the previously enforced expressions.
using System;
using System.Text;
using Google.OrTools.ConstraintSolver;
namespace SO69225141_create_puzzles
{
internal class Expression
{
public Expression(string name_, IntVar enforcedVar_, IntVar constraintVar_)
{
this.name = name_ ?? throw new ArgumentNullException(nameof(name_));
this.enforcedVar = enforcedVar_ ?? throw new ArgumentNullException(nameof(enforcedVar_));
this.constraintVar = constraintVar_ ?? throw new ArgumentNullException(nameof(constraintVar_));
}
public string name { get; private set; }
public IntVar enforcedVar { get; private set; }
public IntVar constraintVar { get; private set; }
public bool Bound()
{
return enforcedVar.Bound() || constraintVar.Bound();
}
public bool isEnforced
{
get
{
return enforcedVar.Bound() && (enforcedVar.Value() == 1);
}
}
public override string ToString()
{
var sb = new StringBuilder();
sb.Append(name);
if (!isEnforced)
{
sb.Append("; not enforced");
if (constraintVar.Bound())
{
switch (constraintVar.Value())
{
case 0:
sb.Append("; not fulfillable");
break;
case 1:
sb.Append("; fulfilled");
break;
}
}
}
return sb.ToString();
}
}
}
Class PuzzleCreator
The main class that assembles the model and controls the search is called PuzzleCreator.
The method InitModel(Solver) does the following:
Creates the base variables a, b, c (as many as are specified in numberOfVariables) as Term objects.
Creates combinations of the base variables and some constants as derived Term objects like a + 2 or 2 * a
Creates an optional constraint for each combination of a pair of terms and a comparison operator, skipping pairs of terms based on the same variable so that expressions like a < a + 2 don't clutter up the model.
Adds the enforced variables of the optional constraints to the decision variables list.
There is a dump() method to output everything to the screen for checking.
The method solveModel(Solver) searches for new puzzles. It creates a custom decision builder (described below) to control the search, and continues looping as long as solutions are found or until the limit on number of solutions is reached. Each solution of the model corresponds to a new puzzle.
The method Main of course, puts things together to allow everything to be executed.
using Google.OrTools.ConstraintSolver;
using System;
using System.Collections.Generic;
namespace SO69225141_create_puzzles
{
public class PuzzleCreator
{
// Parameters for model creation
public const int numberOfVariables = 3;
public const long minNumber = 1;
public const long maxNumber = 3;
public const long minConstant = 2;
public const long maxConstant = 2;
public const int numberOfPuzzlesToCreate = 10;
public static void Main(string[] args)
{
PuzzleCreator model = new PuzzleCreator();
using (Google.OrTools.ConstraintSolver.Solver solver = new Google.OrTools.ConstraintSolver.Solver("Test"))
{
model.InitModel(solver);
model.dump();
model.solveModel(solver);
}
Console.WriteLine("Press any key to continue...");
Console.ReadKey();
}
// List of all the terms (a, b, ... 2 * a, a + 2, ...)
List<Term> allTerms;
// List of all the base variables only (a, b, c...)
List<Term> baseVariables;
// The decision variables for the solver, namely whether a certain expression is enforced.
Expression[] allExpressions;
public void InitModel(Google.OrTools.ConstraintSolver.Solver solver)
{
allTerms = new List<Term>();
baseVariables = new List<Term>();
// Make the base variables
for (int i = 0; i < numberOfVariables; i++)
{
string name = ((char)((int)'a' + i)).ToString();
Term variable = new Term(solver, name);
allTerms.Add(variable);
baseVariables.Add(variable);
}
// Just to make it more interesting, also create some terms
// like (a + 2) and (2 * a)
foreach (Term term in baseVariables)
{
foreach (arithmeticOperators op in Enum.GetValues(typeof(arithmeticOperators)))
{
switch (op)
{
case arithmeticOperators.plus:
for (long addend = Math.Max(1L, minConstant); addend <= maxConstant; addend++)
{
Term sum = (term + addend);
allTerms.Add(sum);
}
break;
case arithmeticOperators.times:
for (long factor = Math.Max(2L, minConstant); factor <= maxConstant; factor++)
{
Term product = (factor * term);
allTerms.Add(product);
}
break;
}
}
}
// Now create all combinations of terms and comparison operators
var allExpressionsList = new List<Expression>();
foreach (Term left in allTerms)
{
foreach (Term right in allTerms)
{
if (!left.isBasedOnSameVariableAs(right))
{
foreach (comparisonOperators op in Enum.GetValues(typeof(comparisonOperators)))
{
switch (op)
{
case comparisonOperators.less:
{
string name = $"{left.name} < {right.name}";
Expression comparison = (left < right);
allExpressionsList.Add(comparison);
}
break;
case comparisonOperators.equals:
{
string name = $"{left.name} == {right.name}";
Expression comparison = (left == right);
allExpressionsList.Add(comparison);
}
break;
case comparisonOperators.greater:
{
string name = $"{left.name} > {right.name}";
Expression comparison = (left > right);
allExpressionsList.Add(comparison);
}
break;
}
}
}
}
}
allExpressions = allExpressionsList.ToArray();
}
public void dump()
{
Console.WriteLine("Complete model:");
Console.WriteLine($"{baseVariables.Count} variables:");
foreach (Term term in baseVariables)
{
Console.WriteLine(term.ToString());
}
Console.WriteLine($"{allTerms.Count} arithmetic expressions:");
foreach (Term term in allTerms)
{
Console.WriteLine(term.ToString());
}
Console.WriteLine($"{allExpressions.Length} comparison expressions:");
foreach (Expression expr in allExpressions)
{
Console.WriteLine(expr.name);
}
}
public void solveModel(Solver solver)
{
// Could add limits if things are taking too long
// SearchLimit limits = solver.MakeLimit(maxTime, Int64.MaxValue, Int64.MaxValue, maxSolutions, true, true);
// Start a new solution search
CustomDecisionBuilder decisionBuilder = new CustomDecisionBuilder(allExpressions, baseVariables);
solver.NewSearch(decisionBuilder);
bool solutionFound = true;
int nsols = 0;
string indent = " ";
while (solutionFound && (nsols < numberOfPuzzlesToCreate))
{
solutionFound = solver.NextSolution();
if (solutionFound)
{
// TODO: Verify that the given expressions really result in a unique solution for the base variables.
// TODO: Check if expressions can be removed and still result in a unique solution.
Console.WriteLine($"\r\nPuzzle {nsols}:");
Console.Write(indent);
Console.WriteLine($"Find {numberOfVariables} numbers from {minNumber} to {maxNumber} such that");
foreach (Expression expr in decisionBuilder.decisionsMadeFor)
{
if (expr.isEnforced)
{
Console.Write(indent);
Console.WriteLine(expr.ToString());
}
}
Console.WriteLine("Solution:");
foreach (Term term in baseVariables)
{
Console.Write(indent);
Console.WriteLine(term.ToString());
}
nsols++;
}
}
// Output solver statistics
Console.WriteLine("Solver:" + solver.ToString());
Console.WriteLine(string.Format("State: {0}", solver.State()));
Console.WriteLine(string.Format("Number of solutions: {0}", solver.Solutions()));
Console.WriteLine(string.Format("WallTime: {0} ms", solver.WallTime()));
Console.WriteLine(string.Format("Failures: {0}", solver.Failures()));
Console.WriteLine(string.Format("Branches: {0} ", solver.Branches()));
Console.WriteLine(string.Format("Memory: {0} ", Solver.MemoryUsage()));
}
}
enum comparisonOperators
{
less,
equals,
greater
}
enum arithmeticOperators
{
plus,
times
}
}
Class CustomDecisionBuilder
You may have noted that the base variables a, b, c ... are not part of the decision variables. The decision variables only control which expressions are active in the puzzle; the base variables are a result of those decisions.
This custom decision builder is designed to keep returning decisions on which expressions to enforce until the base variables are all bound (i.e. have specific values). After (almost) each decision, the Google solver propagates the constraints; in other words, it makes inferences about the domains of all the variables in the problem to fulfill the constraints. This includes the enforced variables as well! (For example, if a < b was enforced, the solver can infer that b < a can no longer be enforced and sets its enforced variable to 0.) When the domains of the base variables have been reduced to a single value (i.e. they are Bound()), the expressions being enforced are sufficient to define a puzzle having a unique solution, so the decision builder stops returning decisions to indicate that a new puzzle has been found.
The new Google.OrTools.Sat solver does not have custom decision builders, which I believe is a significant shortcoming, so this approach could not be used with it.
Their absence completely prevents its use in the real-world application with hundreds of users for which I productively use the old solver.
The propagation is not perfect, for example after having enforced a < b, the propagation did not recognize that (a + 2) < (b + 2) must also then always be true (see puzzle 5 in the output below). Therefore, an additional decision was taken to also enforce (a + 2) < (b + 2). The puzzles generated with this version are therefore overconstrained and contain redundant constraints. Some constraints could be removed and still have a unique solution for the puzzle.
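Removing such redundant constraints afterwards can be done by brute force when the domains are small. The following Python sketch is not part of the C# solution; it takes Puzzle 1 from the sample output, written by hand as (label, predicate) pairs, and greedily drops every constraint whose removal leaves the solution set unchanged. The names `solutions` and `prune` are invented for this illustration.

```python
from itertools import product

DOMAIN = range(1, 4)   # numbers from 1 to 3, as in the sample run

# Puzzle 1 from the sample output, as (label, predicate) pairs.
puzzle = [
    ("a < b",       lambda a, b, c: a < b),
    ("a < c",       lambda a, b, c: a < c),
    ("b == c",      lambda a, b, c: b == c),
    ("b < (a + 2)", lambda a, b, c: b < a + 2),
    ("b < (2 * a)", lambda a, b, c: b < 2 * a),
]

def solutions(constraints):
    """All (a, b, c) in the domain that satisfy every constraint."""
    return [v for v in product(DOMAIN, repeat=3)
            if all(pred(*v) for _, pred in constraints)]

def prune(constraints):
    """Drop each constraint whose removal leaves the solution set unchanged."""
    target = solutions(constraints)
    kept = list(constraints)
    for item in constraints:
        trial = [c for c in kept if c is not item]
        if solutions(trial) == target:
            kept = trial
    return kept

print(solutions(puzzle))                       # [(2, 3, 3)]
print([label for label, _ in prune(puzzle)])   # ['a < c', 'b == c', 'b < (2 * a)']
```

The result depends on the order in which constraints are tried, so this finds one irredundant subset, not necessarily the smallest one.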
The relevant method in the class is NextWrapper(Solver) which
checks if all the base variables are bound and returns null to indicate "no more decisions"
otherwise looks for the first undecided expression and
returns a decision to enforce the expression.
So it basically continues enforcing expressions until a new puzzle with a unique solution has been found.
When a solution has been found, the Google solver cuts that branch off of the search space by refuting the last decision. At that point, it also restores the domains of all the variables to the situation before the decision was made. Here, that means that the last expression to be enforced is freed up again, the base variables now have enlarged domains again, and the decision builder proceeds to select other expressions to enforce.
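The core loop (enforce the next undecided expression until the base variables are pinned down) can be imitated in a few lines of Python. This is a simplified sketch under strong assumptions: it replaces the solver's propagation with brute-force enumeration of all assignments, uses only the base variables with <, == and >, and produces just the first puzzle without any backtracking. The function name `first_puzzle` is hypothetical.

```python
from itertools import permutations, product
import operator

COMPARISONS = [("<", operator.lt), ("==", operator.eq), (">", operator.gt)]

def first_puzzle(names, min_number, max_number):
    # Candidate expressions in a fixed order: every ordered pair, every operator
    candidates = [(l, sym, r, fn)
                  for l, r in permutations(range(len(names)), 2)
                  for sym, fn in COMPARISONS]
    # Every assignment of the base variables that is still possible
    feasible = list(product(range(min_number, max_number + 1), repeat=len(names)))
    enforced = []
    for l, sym, r, fn in candidates:
        if len(feasible) == 1:          # all base variables are "bound"
            break
        kept = [v for v in feasible if fn(v[l], v[r])]
        if not kept or len(kept) == len(feasible):
            continue                    # cannot hold, or already implied
        feasible = kept                 # "enforce" the expression
        enforced.append(f"{names[l]} {sym} {names[r]}")
    return enforced, feasible

print(first_puzzle(["a", "b", "c"], 1, 3))
# (['a < b', 'a < c', 'b < c'], [(1, 2, 3)])
```

For three variables from 1 to 3 this reproduces Puzzle 0 of the sample output; the real decision builder continues from there by letting the solver refute the last decision and explore other branches.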
Here is the code of the custom decision builder.
using System;
using System.Collections.Generic;
using Google.OrTools.ConstraintSolver;
namespace SO69225141_create_puzzles
{
/// <summary>
/// Custom decision builder for puzzle creation expression enforcing variables.
/// Decides to enforce expressions until the base variables are all bound, then no more decisions
/// </summary>
internal class CustomDecisionBuilder : DecisionBuilder
{
Expression[] expressions;
List<Term> baseVariables;
private HashSet<Expression> decisionsMadeFor_ = new HashSet<Expression>();
public CustomDecisionBuilder(Expression[] expressions_, List<Term> baseVariables_)
{
this.expressions = expressions_ ?? throw new ArgumentNullException(nameof(expressions_));
this.baseVariables = baseVariables_ ?? throw new ArgumentNullException(nameof(baseVariables_));
}
public IEnumerable<Expression> decisionsMadeFor
{
get
{
return decisionsMadeFor_;
}
}
public override Decision NextWrapper(Solver solver)
{
// Are the base variables all bound?
bool allBaseVariablesBound = true;
foreach (Term baseVariable in baseVariables)
{
if (!baseVariable.intVar.Bound())
{
allBaseVariablesBound = false;
break;
}
}
if (allBaseVariablesBound)
{
// Done with decisions, all variables are bound.
return null;
}
// Find first expression that is neither enforced nor already fixed via propagation
for (int i = 0; i < expressions.Length; i++)
{
if (!expressions[i].Bound())
{
decisionsMadeFor_.Add(expressions[i]);
return solver.MakeAssignVariableValue(expressions[i].enforcedVar, 1L);
}
}
// All expressions are bound, so there are no more decisions to make.
// This means all expressions are definitely enforced (or definitely not)
// but the solution is still not unique. Consider this as failure
return solver.MakeFailDecision();
}
}
}
Here is a sample output of the problem:
Puzzle 0:
Find 3 numbers from 1 to 3 such that
a < b
a < c
b < c
Solution:
a = 1
b = 2
c = 3
Puzzle 1:
Find 3 numbers from 1 to 3 such that
a < b
a < c
b == c
b < (a + 2)
b < (2 * a)
Solution:
a = 2
b = 3
c = 3
Puzzle 2:
Find 3 numbers from 1 to 3 such that
a < b
a < c
b == c
b < (a + 2)
b == (2 * a)
Solution:
a = 1
b = 2
c = 2
Puzzle 3:
Find 3 numbers from 1 to 3 such that
a < b
a < c
b == c
b < (a + 2)
c < (a + 2)
c < (2 * a)
Solution:
a = 2
b = 3
c = 3
Puzzle 4:
Find 3 numbers from 1 to 3 such that
a < b
a < c
b == c
b < (a + 2)
c < (a + 2)
c == (2 * a)
Solution:
a = 1
b = 2
c = 2
Puzzle 5:
Find 3 numbers from 1 to 3 such that
a < b
a < c
b == c
b < (a + 2)
c < (a + 2)
(a + 2) < (b + 2)
(a + 2) < (2 * b)
(a + 2) < (c + 2)
(a + 2) < (2 * c)
(2 * a) == b
Solution:
a = 1
b = 2
c = 2
Puzzle 6:
Find 3 numbers from 1 to 3 such that
a < b
a < c
b == c
b < (a + 2)
c < (a + 2)
(a + 2) < (b + 2)
(a + 2) < (2 * b)
(a + 2) < (c + 2)
(a + 2) < (2 * c)
(2 * a) > b
Solution:
a = 2
b = 3
c = 3
Puzzle 7:
Find 3 numbers from 1 to 3 such that
a < b
a < c
b == c
b < (a + 2)
c < (a + 2)
(a + 2) < (b + 2)
(a + 2) < (2 * b)
(a + 2) < (c + 2)
(a + 2) < (2 * c)
(2 * a) == c
Solution:
a = 1
b = 2
c = 2
Puzzle 8:
Find 3 numbers from 1 to 3 such that
a < b
a < c
b == c
b < (a + 2)
c < (a + 2)
(a + 2) < (b + 2)
(a + 2) < (2 * b)
(a + 2) < (c + 2)
(a + 2) < (2 * c)
(2 * a) > c
Solution:
a = 2
b = 3
c = 3
Puzzle 9:
Find 3 numbers from 1 to 3 such that
a < b
a < c
b == c
b < (a + 2)
c < (a + 2)
(a + 2) < (b + 2)
(a + 2) < (2 * b)
(a + 2) < (c + 2)
(a + 2) < (2 * c)
(2 * a) < (b + 2)
(2 * a) < (2 * b)
(2 * a) < (c + 2)
(2 * a) < (2 * c)
(b + 2) == (c + 2)
(b + 2) < (2 * c)
Solution:
a = 2
b = 3
c = 3
The method isn't 100% accurate, because the propagation done by the solver after a decision is not perfect. Some dependencies can only be discovered by backtracking, and it might also not be recognized that a particular expression has become redundant. For this reason, the puzzles generated may have more expressions in them than would be strictly necessary to find a unique solution.
Although I did a cursory check of the output, the program is not 100% tested by any means. It has a couple of TODO's in it to verify the solutions in place and remove expressions that are redundant...
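The verification mentioned in the TODOs can be sketched outside the solver with a brute-force check. This is a minimal sketch in Python, not part of the C# program above: the names solutions, redundant_indices and puzzle_1 are hypothetical, and the constraints are copied by hand from Puzzle 1 in the sample output.

```python
from itertools import product

def solutions(constraints, low=1, high=3):
    """Enumerate every (a, b, c) in [low, high]^3 that satisfies all constraints."""
    values = range(low, high + 1)
    return [abc for abc in product(values, repeat=3)
            if all(check(*abc) for check in constraints)]

def redundant_indices(constraints):
    """Indices of constraints that can be dropped one at a time
    without changing the solution set."""
    full = solutions(constraints)
    return [i for i in range(len(constraints))
            if solutions(constraints[:i] + constraints[i + 1:]) == full]

# Puzzle 1 from the sample output, transcribed by hand.
puzzle_1 = [
    lambda a, b, c: a < b,
    lambda a, b, c: a < c,
    lambda a, b, c: b == c,
    lambda a, b, c: b < a + 2,
    lambda a, b, c: b < 2 * a,
]

print(solutions(puzzle_1))          # a single entry means the puzzle is unique
print(redundant_indices(puzzle_1))  # constraints that are individually removable
```

With only three variables over 1 to 3 there are 27 candidates, so exhaustive enumeration is cheap here. It would not scale to large domains, where the solver itself would have to do the checking.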
For my Computer Graphics class I was tasked with writing a polygon filler, and my software renderer is currently being written in Python. Right now, I want to test this pointInPolygon code I found at "How can I determine whether a 2D Point is within a Polygon?", so that I can later write my own method based on it.
The code is:
int pnpoly(int nvert, float *vertx, float *verty, float testx, float testy)
{
int i, j, c = 0;
for (i = 0, j = nvert-1; i < nvert; j = i++) {
if ( ((verty[i]>testy) != (verty[j]>testy)) &&
(testx < (vertx[j]-vertx[i]) * (testy-verty[i]) / (verty[j]-verty[i]) + vertx[i]) )
c = !c;
}
return c;
}
And my attempt to recreate it in Python is as follows:
def pointInPolygon(self, nvert, vertx, verty, testx, testy):
    c = 0
    j = nvert-1
    for i in range(nvert):
        if(((verty[i]>testy) != (verty[j]>testy)) and (testx < (vertx[j]-vertx[i]) * (testy-verty[i]) / (verty[j]-verty[i] + vertx[i]))):
            c = not c
        j += 1
    return c
But this obviously raises an index out of range error in the second iteration, because by then j = nvert.
Thanks in advance.
You're reading the tricky C code incorrectly. The point of j = i++ is to both increment i by one and assign the old value to j. Similar python code would do j = i at the end of the loop:
j = nvert - 1
for i in range(nvert):
    ...
    j = i
The idea is that for nvert == 3, the values would go
j | i
---+---
2 | 0
0 | 1
1 | 2
Another way to achieve this is to compute j as (i - 1) % nvert:
for i in range(nvert):
    j = (i - 1) % nvert
    ...
i.e. j lags one behind i, and the indices form a ring (like the vertices do).
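To see that the two index schemes agree, here is a small sketch that collects the (j, i) pairs produced by each one. The function names are made up for this illustration.

```python
def lagged_index_pairs(nvert):
    """(j, i) pairs where j is updated to i at the end of each iteration."""
    pairs = []
    j = nvert - 1
    for i in range(nvert):
        pairs.append((j, i))
        j = i
    return pairs

def modulo_index_pairs(nvert):
    """(j, i) pairs where j is computed as (i - 1) % nvert."""
    return [((i - 1) % nvert, i) for i in range(nvert)]

print(lagged_index_pairs(3))   # [(2, 0), (0, 1), (1, 2)]
print(modulo_index_pairs(3))   # same pairs
```

The modulo version relies on Python's % returning a non-negative result for a positive modulus, so (0 - 1) % nvert is nvert - 1.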
More Pythonic code would use itertools and iterate over the coordinates themselves. You'd have a list of pairs (tuples) called vertices, and two iterators, one of which is one vertex ahead of the other and cycles back to the beginning because of itertools.cycle, something like:
import itertools
# make one iterator that goes one ahead and wraps around at the end
next_ones = itertools.cycle(vertices)
next(next_ones)
c = False
for ((x1, y1), (x2, y2)) in zip(vertices, next_ones):
    # unchecked...
    # note: "+ x1" sits outside the division, as in the C original
    if (((y1 > testy) != (y2 > testy))
            and (testx < (x2 - x1) * (testy - y1) / (y2 - y1) + x1)):
        c = not c
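Putting the pieces together, a complete port of pnpoly could look like the sketch below. It assumes Python 3 (so that / is true division) and takes the polygon as a list of (x, y) tuples rather than two separate arrays. The name point_in_polygon and that signature are choices made for this example, not something from the original C code.

```python
def point_in_polygon(vertices, testx, testy):
    """Ray-casting test: toggle `inside` each time a horizontal ray from
    the test point crosses a polygon edge."""
    inside = False
    j = len(vertices) - 1          # j trails i by one and wraps around
    for i in range(len(vertices)):
        xi, yi = vertices[i]
        xj, yj = vertices[j]
        # Only edges that straddle the horizontal line y = testy can be crossed,
        # which also guarantees yj != yi below.
        if (yi > testy) != (yj > testy):
            x_cross = (xj - xi) * (testy - yi) / (yj - yi) + xi
            if testx < x_cross:
                inside = not inside
        j = i
    return inside

square = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(point_in_polygon(square, 2, 2))  # True
print(point_in_polygon(square, 5, 2))  # False
```

Splitting the condition into two nested ifs makes the operator precedence explicit: the + xi is added after the division, which is where the one-line translation went wrong.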
I am working on an evolutionary algorithm and need a way to generate reference points (the Das and Dennis approach) in Python. I am stuck on this part; can somebody please help me write this code? Thanks in advance. I have MATLAB code, but I don't understand how to convert it to Python.
function Zr = GenerateReferencePoints(M, p)
Zr = GetFixedRowSumIntegerMatrix(M, p)' / 4; % I don't understand the use of '
end
function A = GetFixedRowSumIntegerMatrix(M, RowSum)
if M < 1
error('M cannot be less than 1.');
end
if floor(M) ~= M
error('M must be an integer.');
end
if M == 1
A = RowSum;
return;
end
A = [];
for i = 0:RowSum
B = GetFixedRowSumIntegerMatrix(M - 1, RowSum - i);
A = [A; i*ones(size(B,1),1) B]; % I don't understand what this line is doing
end
end
I don't understand what RowSum is used for.
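Reading the MATLAB code, GetFixedRowSumIntegerMatrix appears to list every way of writing RowSum as an ordered sum of M non-negative integers, one combination per row. The line A = [A; i*ones(size(B,1),1) B] prepends a column of i to each row of the smaller result B and stacks that under what has been collected so far, and ' is the transpose operator. A rough pure-Python sketch under those assumptions follows. The function names are invented here, each point is kept as a row instead of being transposed into a column, and the division uses p where the MATLAB code has the literal 4 (the two agree only when p is 4).

```python
def fixed_row_sum_integer_matrix(m, row_sum):
    """Return every list of m non-negative integers that sums to row_sum."""
    if m < 1 or int(m) != m:
        raise ValueError("m must be a positive integer")
    if m == 1:
        return [[row_sum]]
    rows = []
    for i in range(row_sum + 1):
        # Fix the first entry to i, then fill the remaining m - 1 entries
        # so that they sum to what is left.
        for rest in fixed_row_sum_integer_matrix(m - 1, row_sum - i):
            rows.append([i] + rest)
    return rows

def generate_reference_points(m, p):
    """Scale each integer combination by 1/p so every point sums to 1."""
    return [[value / p for value in row]
            for row in fixed_row_sum_integer_matrix(m, p)]

points = generate_reference_points(3, 4)
print(len(points))   # 15 points for 3 objectives and 4 divisions
print(points[0])     # [0.0, 0.0, 1.0]
```

If NumPy is available, wrapping the result in numpy.array(...) and taking .T would reproduce the transposed layout of the MATLAB version.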
I've written this very badly optimized C code that does a simple math calculation:
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#define MIN(a, b) (((a) < (b)) ? (a) : (b))
#define MAX(a, b) (((a) > (b)) ? (a) : (b))
unsigned long long int p(int);
float fullCheck(int);
int main(int argc, char **argv){
int i, g, maxNumber;
unsigned long long int diff = 1000;
if(argc < 2){
fprintf(stderr, "Usage: %s maxNumber\n", argv[0]);
return 0;
}
maxNumber = atoi(argv[1]);
for(i = 1; i < maxNumber; i++){
for(g = 1; g < maxNumber; g++){
if(i == g)
continue;
if(p(MAX(i,g)) - p(MIN(i,g)) < diff && fullCheck(p(MAX(i,g)) - p(MIN(i,g))) && fullCheck(p(i) + p(g))){
diff = p(MAX(i,g)) - p(MIN(i,g));
printf("We have a couple %llu %llu with diff %llu\n", p(i), p(g), diff);
}
}
}
return 0;
}
float fullCheck(int number){
float check = (-1 + sqrt(1 + 24 * number))/-6;
float check2 = (-1 - sqrt(1 + 24 * number))/-6;
if(check/1.00 == (int)check)
return check;
if(check2/1.00 == (int)check2)
return check2;
return 0;
}
unsigned long long int p(int n){
return n * (3 * n - 1 ) / 2;
}
Then I tried (just for fun) to port it to Python to see how it would perform. My first version was an almost 1:1 conversion that ran terribly slowly (120+ secs in Python vs <1 sec in C).
I've done a bit of optimization, and this is what I obtained:
#!/usr/bin/env python
from cmath import sqrt
import cProfile
from pstats import Stats
def quickCheck(n):
    partial_c = (sqrt(1 + 24 * (n)))/-6
    c = 1/6 + partial_c
    if int(c.real) == c.real:
        return True
    c = c - 2*partial_c
    if int(c.real) == c.real:
        return True
    return False
def main():
    maxNumber = 5000
    diff = 1000
    for i in range(1, maxNumber):
        p_i = i * (3 * i - 1 ) / 2
        for g in range(i, maxNumber):
            if i == g:
                continue
            p_g = g * (3 * g - 1 ) / 2
            if p_i > p_g:
                ma = p_i
                mi = p_g
            else:
                ma = p_g
                mi = p_i
            if ma - mi < diff and quickCheck(ma - mi):
                if quickCheck(ma + mi):
                    print ('New couple ', ma, mi)
                    diff = ma - mi
cProfile.run('main()','script_perf')
perf = Stats('script_perf').sort_stats('time', 'calls').print_stats(10)
This runs in about 16secs which is better but also almost 20 times slower than C.
Now, I know C is better than Python for this kind of calculation, but what I would like to know is whether there is something I've missed (Python-wise, like a horribly slow function or such) that could have made this function faster.
Please note that I'm using Python 3.1.1, if this makes a difference.
Since quickCheck is being called close to 25,000,000 times, you might want to use memoization to cache the answers.
You can do memoization in C as well as Python. Things will be much faster in C, also.
You're computing 1/6 in each iteration of quickCheck. I'm not sure if this will be optimized out by Python, but if you can avoid recomputing constant values, you'll find things are faster. C compilers do this for you.
Doing things like if condition: return True; else: return False is silly -- and time consuming. Simply do return condition.
In Python 3.x, /2 must create floating-point values. You appear to need integers for this. You should be using //2 division. It will be closer to the C version in terms of what it does, but I don't think it's significantly faster.
Finally, Python is generally interpreted. The interpreter will always be significantly slower than C.
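A few of the points above (memoization, hoisting the constant, returning the condition directly, and // for integer division) could be combined roughly as follows. This is only a sketch: it uses functools.lru_cache, which needs Python 3.2 or later and so is not available on the 3.1.1 mentioned in the question, and the names quick_check, p and ONE_SIXTH are chosen here for illustration. It also uses math.sqrt instead of cmath.sqrt, which assumes n is never negative.

```python
from functools import lru_cache
from math import sqrt

ONE_SIXTH = 1 / 6  # computed once instead of on every call

@lru_cache(maxsize=None)
def quick_check(n):
    """Same floating-point test as quickCheck, with cached results."""
    partial_c = sqrt(1 + 24 * n) / -6
    first = ONE_SIXTH + partial_c
    second = ONE_SIXTH - partial_c
    # Return the condition itself rather than branching to True/False.
    return first == int(first) or second == int(second)

def p(n):
    # // keeps the result an int, as in the C version.
    return n * (3 * n - 1) // 2

quick_check(3)
quick_check(3)                   # second call is answered from the cache
print(quick_check.cache_info())  # hits=1, misses=1
```

Whether the cache pays off depends on how often the same argument recurs. A plain dict works the same way on older Python versions.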
I made it go from ~7 seconds to ~3 seconds on my machine:
Precomputed i * (3 * i - 1 ) / 2 for each value, in yours it was computed twice quite a lot
Cached calls to quickCheck
Removed if i == g by adding +1 to the range
Removed if p_i > p_g since p_i is always smaller than p_g
Also put the quickCheck-function inside main, to make all variables local (which have faster lookup than global).
I'm sure there are more micro-optimizations available.
def main():
    maxNumber = 5000
    diff = 1000
    p = {}
    quickCache = {}
    for i in range(maxNumber):
        p[i] = i * (3 * i - 1 ) / 2
    def quickCheck(n):
        if n in quickCache: return quickCache[n]
        partial_c = (sqrt(1 + 24 * (n)))/-6
        c = 1/6 + partial_c
        if int(c.real) == c.real:
            quickCache[n] = True
            return True
        c = c - 2*partial_c
        if int(c.real) == c.real:
            quickCache[n] = True
            return True
        quickCache[n] = False
        return False
    for i in range(1, maxNumber):
        mi = p[i]
        for g in range(i+1, maxNumber):
            ma = p[g]
            if ma - mi < diff and quickCheck(ma - mi) and quickCheck(ma + mi):
                print('New couple ', ma, mi)
                diff = ma - mi
Because the function p() is monotonically increasing, you can avoid comparing the values, as g > i implies p(g) > p(i). Also, the inner loop can be broken out of early, because p(g) - p(i) >= diff implies p(g+1) - p(i) >= diff.
Also for correctness, I changed the equality comparison in quickCheck to compare difference against an epsilon because exact comparison with floating point is pretty fragile.
On my machine this reduced the runtime to 7.8ms using Python 2.6. Using PyPy with JIT reduced this to 0.77ms.
This shows that before turning to micro-optimization it pays to look for algorithmic optimizations. Micro-optimizations make spotting algorithmic changes much harder for relatively tiny gains.
from math import sqrt
EPS = 0.00000001
def quickCheck(n):
    partial_c = sqrt(1 + 24*n) / -6
    # 1.0/6 keeps the division floating-point on Python 2 as well
    c = 1.0/6 + partial_c
    if abs(int(c) - c) < EPS:
        return True
    c = 1.0/6 - partial_c
    if abs(int(c) - c) < EPS:
        return True
    return False
def p(i):
    return i * (3 * i - 1 ) / 2
def main(maxNumber):
    diff = 1000
    for i in range(1, maxNumber):
        for g in range(i+1, maxNumber):
            # p() is increasing, so no later g can get below diff either
            if p(g) - p(i) >= diff:
                break
            if quickCheck(p(g) - p(i)) and quickCheck(p(g) + p(i)):
                print('New couple ', p(g), p(i), p(g) - p(i))
                diff = p(g) - p(i)
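As an alternative to the epsilon comparison, the check can be done entirely in integers, which sidesteps floating-point rounding. The sketch below is a different technique from the code above, not a rewrite of it. It assumes Python 3.8 or later for math.isqrt, and that the intent is to test whether n equals i * (3 * i - 1) / 2 for some positive integer i; the name is_pentagonal is chosen for this example. Unlike quickCheck, it rejects values that only satisfy the formula for a negative i (such as 2).

```python
from math import isqrt

def p(i):
    return i * (3 * i - 1) // 2

def is_pentagonal(n):
    """True if n == i * (3 * i - 1) / 2 for some integer i >= 1.

    Solving for i gives i = (1 + sqrt(1 + 24 * n)) / 6, so 1 + 24 * n must
    be a perfect square s * s and 1 + s must be divisible by 6.
    """
    if n < 1:
        return False
    s = isqrt(1 + 24 * n)
    return s * s == 1 + 24 * n and (1 + s) % 6 == 0

print([n for n in range(1, 40) if is_pentagonal(n)])  # [1, 5, 12, 22, 35]
```

It can be dropped into main() in place of quickCheck, provided p() returns integers (hence the // in this version).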
There are some python compilers that might actually do a good bit for you. Have a look at Psyco.
Another way of dealing with math intensive programs is to rewrite the majority of the work into a math kernel, such as NumPy, so that heavily optimized code is doing the work, and your python code only guides the calculation. To get the most out of this strategy, avoid doing calculations in loops, and instead let the math kernel do all of that.
The other respondents have already mentioned several optimizations that will help. However, ultimately, you're not going to be able to match the performance of C in Python. Python is a nice tool, but since it's interpreted, it isn't really suited for heavy number crunching or other apps where performance is key.
Also, even in your C version, your inner loop could use quite a bit of help. Updated version:
for(i = 1; i < maxNumber; i++){
for(g = 1; g < maxNumber; g++){
if(i == g)
continue;
max=i;
min=g;
if (max<min) {
// xor swap - could use swap(p_max,p_min) instead.
max=max^min;
min=max^min;
max=max^min;
}
p_max=P(max);
p_min=P(min);
p_i=P(i);
p_g=P(g);
if(p_max - p_min < diff && fullCheck(p_max-p_min) && fullCheck(p_i + p_g)){
diff = p_max - p_min;
printf("We have a couple %llu %llu with diff %llu\n", p_i, p_g, diff);
}
}
}
///////////////////////////
float fullCheck(int number){
float den=sqrt(1+24*number)/6.0;
float check = 1/6.0 - den;
float check2 = 1/6.0 + den;
if(check == (int)check)
return check;
if(check2 == (int)check2)
return check2;
return 0.0;
}
Division, function calls, etc are costly. Also, calculating them once and storing in vars such as I've done can make things a lot more readable.
You might consider declaring P() as inline or rewrite as a preprocessor macro. Depending on how good your optimizer is, you might want to perform some of the arithmetic yourself and simplify its implementation.
Your implementation of fullCheck() would return what appear to be invalid results, since 1/6==0, where 1/6.0 would return 0.166... as you would expect.
This is a very brief take on what you can do to your C code to improve performance. This will, no doubt, widen the gap between C and Python performance.
20x difference between Python and C for a number crunching task seems quite good to me.
Check the usual performance differences for some CPU intensive tasks (keep in mind that the scale is logarithmic).
But look on the bright side, what's 1 minute of CPU time compared with the brain and typing time you saved writing Python instead of C? :-)