I am doing some OCR with Python. In order to get the coordinates of the letters in an image, I take the centroid of each region (returned by regionprops from skimage.measure), and if the distance between one centroid and the other centroids is less than some value, I drop that region. I thought this would solve the problem of several regions nested one inside another, but I missed that if a region with a smaller area is detected first (like just a part of a letter), all the bigger regions (that may contain the whole letter) are ignored. Here is my code:
centroids = []
for region in regionprops(label_image):
    if len(centroids) == 0:
        centroids.append(region.centroid[1])
        # do some stuff...
    if len(centroids) != 0:
        distances = []
        for centroid in centroids:
            distance = abs(centroid - region.centroid[1])
            distances.append(distance)
        if all(i >= 0.5 * region_width for i in distances):
            # do some stuff...
            centroids.append(region.centroid[1])
Now the question is: is there a way to order the list returned by regionprops by area, and how do I do it? Or can you suggest a better way to avoid the problem of a region inside another region? Thanks in advance.
The Python built-in sorted() takes a key= argument, a function by which to sort, and a reverse= argument to sort in decreasing order. So you can change your loop to:
for region in sorted(
    regionprops(label_image),
    key=lambda r: r.area,
    reverse=True,
):
To check whether one region is completely contained in another, you can use r.bbox, and check whether one box is inside another, or overlaps it.
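For instance, since regionprops gives bbox as (min_row, min_col, max_row, max_col) for a 2D image, a containment test could look like this small sketch (the helper name is mine, not part of the answer):

def bbox_contains(outer, inner):
    # outer and inner are (min_row, min_col, max_row, max_col) tuples from region.bbox
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])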
Finally, if you have a lot of regions, I recommend you build a scipy.spatial.cKDTree with all the centroids before running your loop, as this will make it much faster to check whether a region is close to existing ones.
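A hedged sketch of that idea, reusing label_image from the question; min_dist is an assumed distance threshold and the other names are mine. The tree is built once over all centroids and then queried instead of looping over the accepted centroids:

from scipy.spatial import cKDTree
from skimage.measure import regionprops

regions = sorted(regionprops(label_image), key=lambda r: r.area, reverse=True)
tree = cKDTree([r.centroid for r in regions])  # one tree over every centroid

kept = set()
for i, region in enumerate(regions):
    # indices of all centroids within min_dist of this one (the query includes i itself)
    neighbours = tree.query_ball_point(region.centroid, r=min_dist)
    if not any(j in kept for j in neighbours if j != i):
        kept.add(i)

filtered = [regions[i] for i in sorted(kept)]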
I'm having issues sorting contours in the order I want them to be sorted.
I have two columns and K (variable) rows in an image that I'm trying to draw contours around and then crop out. I can do this fine, but I need them to be returned in a specific order.
I'm halfway there, since I managed to sort them as such:
the 1st column gets returned first in the contours list, then when it's done it goes to the second column and goes through all its rows as well.
These are the lines of code that are sorting the list:
contours, hierarchy = cv2.findContours(dilated_value, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
contours.sort(key=lambda c: np.min(c[:, :, 0]))
contours.reverse()
I need to get a return value of: [[Column1, row1], [Column2, row1], ...] and so on.
Here's a more specific example. Right now the contours are being stored column by column, as described above. What should I change in that sort instruction to get them row by row?
Basically what you can do is simply get all the contours and then find all the polygons that can be formed from them. You can remove the larger ones, as you don't need the contour of the entire table or of an entire row/column. Once you filter those out, you will be left with the small ones, i.e. the cells. Then you can extract them easily and index them according to their centroid. Using those centroids you will know the position of each cell and can put them anywhere according to your choice.
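For illustration, a rough Python sketch of that idea, reusing contours and image from the question; the area cutoff and row_height below are assumptions I picked, not values from the answer:

import cv2

def cell_centroid(contour):
    # centroid of a contour from its image moments (assumes a non-degenerate contour)
    m = cv2.moments(contour)
    return (m["m10"] / m["m00"], m["m01"] / m["m00"])

# drop very large contours (whole table, whole rows/columns) and keep the small cells
max_area = 0.25 * image.shape[0] * image.shape[1]  # assumed cutoff
cells = [c for c in contours if cv2.contourArea(c) < max_area]

# order the cells row by row: quantise the y coordinate into row bands, then sort by x
row_height = 50  # assumed approximate cell height in pixels
cells.sort(key=lambda c: (int(cell_centroid(c)[1] // row_height), cell_centroid(c)[0]))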
Found the answer in another question.
It's working like magic; make sure to upvote and show gratitude to the original answer, please.
def get_contour_precedence(contour, cols):
    origin = cv2.boundingRect(contour)
    return origin[1] * cols + origin[0]

contours.sort(key=lambda x: get_contour_precedence(x, image.shape[1]))
A well-written explanation can be found in the original answer, but the code is self-explanatory.
I have a bunch of lines, each described by its direction as well as a point that gives its origin. I have to combine these lines to make them form rectangles that can lie within each other, but their edges cannot overlap. I also know that the origin of each line lies on an edge of a rectangle, but it does not necessarily lie in the middle of that edge. Basically, the input I have could be something like this:
And what I'm trying to achieve looks something like this:
Where every line is now described by the points where it intersected the other lines to form the correct rectangles.
I'm looking for an algorithm that finds the relevant intersection points and links them to the lines that describe the rectangles.
First of all, this problem, as stated, may have multiple solutions. For example, I don't see any constraint that invalidates the following:
So, you need to define an objective, for example:
maximize total covered area
maximize number of rectangles
maximize number of used lines
...
Here I'm trying to maximize the number of rectangles using a greedy approach. Keep in mind that a greedy algorithm never guarantees finding the optimum solution, but it finds a sub-optimal one in a reasonable time.
Now, there are two steps in my algorithm:
Find all possible rectangles
Select a set of rectangles that satisfies the constraints
Step 1: Find all possible rectangles
Two vertical lines (l & r) plus two horizontal lines (b & t) can form a valid rectangle if:
l.x < r.x and b.y < t.y
l.y and r.y are between b.y and t.y
b.x and t.x are between l.x and r.x
In the following pseudocode, Xs and Ys are sorted lists of vertical and horizontal lines respectively:
function findRectangles
    for i1 from 1 to (nx-1)
        for i2 from (i1+1) to nx
            for j1 from 1 to (ny-1)
                if (Ys[j1].x >= Xs[i1].x and
                    Ys[j1].x <= Xs[i2].x and
                    Ys[j1].y <= Xs[i1].y and
                    Ys[j1].y <= Xs[i2].y)
                    for j2 from (j1+1) to ny
                        if (Ys[j2].x >= Xs[i1].x and
                            Ys[j2].x <= Xs[i2].x and
                            Ys[j2].y >= Xs[i1].y and
                            Ys[j2].y >= Xs[i2].y)
                            add [i1 j1 i2 j2] to results
                        end if
                    end for
                end if
            end for
        end for
    end for
end
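For concreteness, here is a small Python transcription of that pseudocode (mine, not the answer's); it assumes each vertical line is given as its origin (x, y) in a list verticals sorted by x, and each horizontal line as its origin (x, y) in a list horizontals sorted by y:

def find_rectangles(verticals, horizontals):
    results = []
    nx, ny = len(verticals), len(horizontals)
    for i1 in range(nx - 1):
        for i2 in range(i1 + 1, nx):
            lx, ly = verticals[i1]
            rx, ry = verticals[i2]
            for j1 in range(ny - 1):
                bx, by = horizontals[j1]
                # bottom line must start between the two verticals and below both of their origins
                if not (lx <= bx <= rx and by <= ly and by <= ry):
                    continue
                for j2 in range(j1 + 1, ny):
                    tx, ty = horizontals[j2]
                    # top line must start between the two verticals and above both of their origins
                    if lx <= tx <= rx and ty >= ly and ty >= ry:
                        results.append((i1, j1, i2, j2))
    return results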
Step 2: Select valid rectangles
Valid rectangles, as stated in the problem, cannot overlap partially and also cannot share an edge. In the previous step, too many rectangles are found. But, as I said before, there may be more than one combination of these rectangles that satisfies the constraints. To maximize the number of rectangles, I suggest the following algorithm, which tends to accept smaller rectangles first:
function selectRects( Xs, Ys, rects )
    results[];
    sort rectangles by their area;
    for i from 1 to rects.count
        if (none of the edges of rects[i] are eliminated) &
           (rects[i] does not partially overlap any of the items in results)
            add rects[i] to results;
            Xs[rects[i].left].eliminated = true;
            Xs[rects[i].right].eliminated = true;
            Ys[rects[i].bottom].eliminated = true;
            Ys[rects[i].top].eliminated = true;
        end if
    end for
end
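And a matching Python sketch of step 2, under the same assumed representation as the sketch above; the partial-overlap test is my own reading of the constraint that rectangles may nest but must not cross or share edges:

def partially_overlaps(a, b):
    # a and b are (left, bottom, right, top) rectangles in coordinates
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    disjoint = ax2 <= bx1 or bx2 <= ax1 or ay2 <= by1 or by2 <= ay1
    a_in_b = bx1 <= ax1 and ax2 <= bx2 and by1 <= ay1 and ay2 <= by2
    b_in_a = ax1 <= bx1 and bx2 <= ax2 and ay1 <= by1 and by2 <= ay2
    return not (disjoint or a_in_b or b_in_a)

def select_rects(verticals, horizontals, rects):
    def coords(r):
        i1, j1, i2, j2 = r
        return (verticals[i1][0], horizontals[j1][1], verticals[i2][0], horizontals[j2][1])

    def area(r):
        left, bottom, right, top = coords(r)
        return (right - left) * (top - bottom)

    results, used_v, used_h = [], set(), set()
    for r in sorted(rects, key=area):  # accept smaller rectangles first
        i1, j1, i2, j2 = r
        if {i1, i2} & used_v or {j1, j2} & used_h:
            continue  # one of its lines has already been eliminated
        if any(partially_overlaps(coords(r), coords(s)) for s in results):
            continue
        results.append(r)
        used_v.update((i1, i2))
        used_h.update((j1, j2))
    return results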
I have a large (>200,000) list of objects (of type RegionProperties, produced by skimage.measure.regionprops). The attributes of each object can be accessed with [] or with the . operator. For example:
my_list = skimage.measure.regionprops(...)
my_list[0].area
gets the area.
I want to filter this list to extract elements which have area > 300 to then act on them. I have tried:
# list comprehension
selection = [x for x in my_list if x.area > 300]
for foo in selection:
    ...

# filter (with a predefined function rather than a lambda, for speed)
def my_condition(x):
    return x.area > 300

selection = filter(my_condition, my_list)
for foo in selection:
    ...

# generator
def filter_by_area(x):
    for el in x:
        if el.area > 300:
            yield el

for foo in filter_by_area(my_list):
    ...
I find that generator ~ filter > comprehension in terms of speed, but only marginally (4.15 s, 4.16 s, 4.3 s). I have to repeat such a filter thousands of times, resulting in hours of CPU time just filtering a list. This simple operation is currently the bottleneck of the whole image analysis process.
Is there another way of doing this? Possibly involving C, or some peculiarity of RegionProperties objects?
Or maybe a completely different algorithm? I thought about eroding the image to make small particles disappear and only keep large ones, but the measurements have to be done on the non-eroded image, and finding the correspondence between the two is slow too.
Thank you very much in advance for any pointer!
As suggested by Mr. F, I tried isolating the filtering part by doing some dumb operation in the loop:
selection = [x for x in my_list if x.area > 300]
for foo in selection:
    a = 1 + 1
This resulted in exactly the same times as before, even though I was extracting a few properties of the particles in the loop before. This pushed me to look more into how the area property of particles, on which I am doing the filtering, is extracted.
It turns out that skimage.measure.regionprops just prepares the data to compute the properties; it does not compute them explicitly. Extracting one property (such as area) triggers the computation of all the properties needed to get to the extracted property. It turns out that the area is computed as the first moment of the particle image, which, in turn, triggers the computation of all the moments, which triggers other computations, etc. So just doing x.area is not just extracting a pre-computed value but actually computing plenty of stuff.
There is a simpler way to compute the area. For the record, I do it this way:
numpy.sum(x._label_image[x._slice] == x.label)  # count this label's pixels inside the region's bounding-box slice
So my problem is actually very specific to scikit-image RegionProperties objects. By using the formula above to compute the area, instead of using x.area, I get the filtering time down from 4.3s to ~1s.
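For anyone wanting to plug this in, here is a minimal sketch of the whole filtering step (mine, not from the original post); it relies on the private attributes _label_image and _slice of RegionProperties, which may change between scikit-image versions, and binary_image stands in for your thresholded input image:

import numpy as np
from skimage.measure import label, regionprops

def fast_area(region):
    # count the pixels carrying this region's label inside its bounding-box slice
    return np.sum(region._label_image[region._slice] == region.label)

label_image = label(binary_image)  # binary_image: your thresholded image (assumed)
selection = [r for r in regionprops(label_image) if fast_area(r) > 300]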
Thanks to Mr. F for the comment which prompted me to go on this exploration of the code of scikit-image and solve my performance problem (the whole image processing routine went from several days to several hours!).
PS: by the way, with this code, it seems the list comprehension gets a (very small) edge over the other two methods. And it's clearer, so that's perfect!
I have a dictionary which has coordinates as keys. They are by default in 3 dimensions, like dictionary[(x,y,z)]=values, but may be in any dimension, so the code can't be hard coded for 3.
I need to find if there are other values within a certain radius of a new coordinate, and I ideally need to do it without having to import any plugins such as numpy.
My initial thought was to split the input into a cube and check no points match, but obviously that is limited to integer coordinates, and would grow exponentially slower (radius of 5 would require 729x the processing), and with my initial code taking at least a minute for relatively small values, I can't really afford this.
I heard finding the nearest neighbour may be the best way, and ideally, cutting down the keys used to a range of +- a certain amount would be good, but I don't know how you'd do that when there's more than one point being used. Here's how I'd do it with my current knowledge:
dimensions = 3
minimumDistance = 0.9

# example dictionary + input
dictionary = {}
dictionary[(0, 0, 0)] = []
dictionary[(0, 0, 1)] = []
keyToAdd = [0, 1, 1]

closestMatch = float("inf")
tooClose = False
for key in dictionary:
    # calculate the distance from this existing key to the new point (Pythagoras)
    squaredDistance = 0
    for i in range(dimensions):
        squaredDistance += (key[i] - keyToAdd[i]) ** 2
    distanceToPoint = squaredDistance ** 0.5

    # if you want the overall closest match
    if distanceToPoint < closestMatch:
        closestMatch = distanceToPoint

    # if you just want to check it's not within that radius
    if distanceToPoint < minimumDistance:
        tooClose = True
        break
However, performing calculations this way may still run very slowly (it must do this for millions of values). I've searched for the problem, but most people seem to have simpler sets of data to do this to. If anyone can offer any tips, I'd be grateful.
You say you need to determine IF there are any keys within a given radius of a particular point. Thus, you only need to scan the keys, computing the distance of each to the point until you find one within the specified radius. (And if you do comparisons to the square of the radius, you can avoid the square roots needed for the actual distance.)
One optimization would be to sort the keys based on their "Manhattan distance" from the point (that is, the sum of the absolute component offsets), since the Euclidean distance is never greater than this: any key whose Manhattan distance is already within the radius is guaranteed to be within the radius. This would avoid some of the more expensive calculations (and you don't need any trigonometry).
If, as you suggest later in the question, you need to handle multiple points, you can obviously process each individually, or you could find the center of those points and sort based on that.
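A minimal sketch of the scan-with-early-exit idea, comparing squared distances so no square roots are needed (the function and parameter names are mine):

def too_close(dictionary, newPoint, radius):
    radiusSquared = radius * radius
    for key in dictionary:
        # squared Euclidean distance works in any number of dimensions
        squaredDistance = sum((k - p) ** 2 for k, p in zip(key, newPoint))
        if squaredDistance < radiusSquared:
            return True  # found a key within the radius, stop scanning
    return False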
I'm trying to calculate a kind of fuzzy Jaccard index between two sets with the following rationale: as with the Jaccard index, I want to calculate the ratio between the number of items that are common to both sets and the total number of different items in both sets. The problem is that I want to use a similarity function with a threshold to determine what counts as the "same" item being in both sets, so that items that are similar:
Aren't counted twice in the union
Are counted in the intersection.
I have a working implementation here (in python):
import itertools

def fuzzy_jaccard(set1, set2, similarity, threshold):
    intersection_size = union_size = len(set1 & set2)
    shorter_difference, longer_difference = sorted([set2 - set1, set1 - set2], key=len)
    while len(shorter_difference) > 0:
        item1, item2 = max(
            itertools.product(longer_difference, shorter_difference),
            key=lambda pair: similarity(*pair)
        )
        longer_difference.remove(item1)
        shorter_difference.remove(item2)
        if similarity(item1, item2) > threshold:
            union_size += 1
            intersection_size += 1
        else:
            union_size += 2
    union_size = union_size + len(longer_difference)
    return intersection_size / union_size
The problem here is that this is quadratic in the size of the sets, because with itertools.product I iterate over all possible pairs of items, taking one from each set. Now, I think I must do this because I want to match each item a from set1 with the best possible candidate b from set2 that isn't more similar to another item a' from set1.
I have a feeling that there should be an O(n) way of doing this that I'm not grasping. Do you have any suggestions?
There are other issues too, like recalculating the similarity for each pair once I get the best match, but I don't care too much about them.
I doubt there's any way that would be O(n) in the general case, but you can probably do a lot better than O(n^2) at least for most cases.
Does your similarity behave like a distance that satisfies the triangle inequality? By this I mean: can you assume that distance(a, c) <= distance(a, b) + distance(b, c)? If not, this answer probably won't help. I'm treating similarities like distances.
Try clumping the data:
Pick a radius r. Based on intuition, I suggest setting r to one-third of the average of the first 5 similarities you calculate, or something.
The first point you pick in set1 becomes the centre of your first clump. Classify the points in set2 as being in the clump (similarity to the centre point <= r) or outside the clump. Also keep track of points that are within 2r of the clump centre.
You can require that clump centre points be at least a distance of 2r from each other; in that case some points may not be in any clump. I suggest making them at least r from each other. (Maybe less if you're dealing with a large number of dimensions.) You could treat every point as a clump centre but then you wouldn't save any processing time.
When you pick a new point, first compare it with the clump centre points (even though they're in the same set). Either it's in an already existing clump, or it becomes a new clump centre, (or perhaps neither if it's between r and 2r of a clump centre). If it's within r of a clump centre, then compare it with all points in the other set that are within 2r of that clump centre. You may be able to ignore points further than 2r from the clump centre. If you don't find a similar point within the clump (perhaps because the clump has no points left), then you may have to scan all the rest of the points for that case. Hopefully this would mostly happen only when there aren't many points left in the set. If this works well, then in most cases you'd find the most similar point within the clump and would know that it's the most similar point.
This idea may require some tweaking.
If there are a large number of dimensions involved, then you might find that for a given radius r, frustratingly many points are within 2r of each other while few are within r of each other.
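A rough sketch of the clump-building step under these assumptions, treating the similarity as a distance; all the names below are mine rather than the answer's:

def build_clumps(set1, set2, similarity, r):
    # each clump is (a centre point from set1, the points of set2 within 2r of that centre)
    clumps = []
    for p in set1:
        # if p is within r of an existing centre, it already belongs to that clump
        if any(similarity(p, centre) <= r for centre, _ in clumps):
            continue
        # otherwise p becomes a new clump centre; remember the set2 points near it
        nearby = [q for q in set2 if similarity(p, q) <= 2 * r]
        clumps.append((p, nearby))
    return clumps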
Here's another algorithm. The more time-consuming it is to calculate your similarity function (as compared to the time it takes to maintain sorted lists of points) the more index points you might want to have. If you know the number of dimensions, it might make sense to use that number of index points. You might reject a point as a candidate index point if it's too similar to another index point.
For the first point you use, and any others you decide to use as index points, generate a list of all the remaining points in the other set, sorted in order of distance from the index point.
When you're comparing a point P1 to points in the other set, I think you can skip over points for two possible reasons. Consider the most similar point P2 you've found so far to P1. If P2 is similar to an index point, then you can skip all points which are sufficiently dissimilar from that index point. If P2 is dissimilar to an index point, then you can skip over all points which are sufficiently similar to that index point. I think in some cases you can skip over some of both types of point for the same index point.
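A hedged sketch of the index-point bookkeeping (my own interpretation, assuming the similarity is a distance obeying the triangle inequality): pre-sort the other set by distance to each index point, then, given the best match found so far for P1, only points whose distance to the index point lies within a window around P1's distance can possibly beat it.

import bisect

def build_index(index_points, other_set, distance):
    index = {}
    for ip in index_points:
        # pairs of (distance to the index point, point), sorted by that distance
        index[ip] = sorted(((distance(ip, q), q) for q in other_set), key=lambda t: t[0])
    return index

def candidates(index, ip, dist_p1_to_ip, best_so_far):
    # by the triangle inequality, a point q with |distance(q, ip) - distance(p1, ip)| >= best_so_far
    # cannot be closer to p1 than the current best match, so it can be skipped
    dists = [d for d, _ in index[ip]]
    lo = bisect.bisect_left(dists, dist_p1_to_ip - best_so_far)
    hi = bisect.bisect_right(dists, dist_p1_to_ip + best_so_far)
    return [q for _, q in index[ip][lo:hi]]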