Order randomly distributed data into regular grid - python

I am trying to order data scattered over the globe onto a regular lat/lon grid.
I could go through all the data points, check which grid cell each one belongs to, and append it to a per-cell list. This approach seems slow and inefficient to me.
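Concretely, the approach I have in mind looks something like this (a rough sketch, assuming a 1-degree global grid and plain lat/lon sequences in degrees):
cells = {}  # (row, col) -> list of points falling in that grid cell
for lat, lon in zip(lats, lons):
    key = (int(lat + 90.0), int(lon + 180.0))  # offsets make the indices non-negative
    cells.setdefault(key, []).append((lat, lon))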
I am sure that this is not a new problem, but I have not been able to find a solution on the web. Does anyone have a suggestion, or can anyone point me to an example or tutorial?

I believe geohashing would be useful; in this particular case you could use a Morton number. The blog post titled "Spatial Keys – Memory Efficient Geohashes" has an example implementation. It's in Java, but a Python version wouldn't look much different.
// lat/lon are the point's coordinates; minLatI/maxLatI/minLonI/maxLonI are the
// grid bounds, and iterations sets the precision.
long hash = 0;
double minLat = minLatI;
double maxLat = maxLatI;
double minLon = minLonI;
double maxLon = maxLonI;
int i = 0;
while (true) {
    // one latitude bit per iteration ...
    double midLat = (minLat + maxLat) / 2;
    if (lat > midLat) {
        hash |= 1;
        minLat = midLat;
    } else
        maxLat = midLat;
    hash <<= 1;
    // ... interleaved with one longitude bit
    double midLon = (minLon + maxLon) / 2;
    if (lon > midLon) {
        hash |= 1;
        minLon = midLon;
    } else
        maxLon = midLon;
    i++;
    if (i < iterations)
        hash <<= 1;
    else
        break;
}
return hash;
The advantage of the Morton code is that you can compute fewer bits (fewer iterations above) for a coarser grid and more bits for a finer grid. Or compute fine-grained codes once and just use their prefixes for coarser grids.
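A Python port of the same idea might look like this (a sketch; lat/lon are in degrees and iterations controls the resolution):
def morton_code(lat, lon, iterations=16):
    # Binary subdivision of the bounding box, interleaving one lat bit
    # and one lon bit per iteration.
    min_lat, max_lat = -90.0, 90.0
    min_lon, max_lon = -180.0, 180.0
    code = 0
    for _ in range(iterations):
        mid_lat = (min_lat + max_lat) / 2
        code <<= 1
        if lat > mid_lat:
            code |= 1
            min_lat = mid_lat
        else:
            max_lat = mid_lat
        mid_lon = (min_lon + max_lon) / 2
        code <<= 1
        if lon > mid_lon:
            code |= 1
            min_lon = mid_lon
        else:
            max_lon = mid_lon
    return code
Points sharing a code fall into the same cell, and shifting a code right by 2*k bits gives the cell of the grid that is k iterations coarser.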

Related

How to make every vehicle do something in OR-Tools?

In several VRP problems, when you specify the total number of vehicles, the expectation is that all of them are used and that each one visits at least a node. In reality this may not even be the best solution, but I would like to understand why it happens and how to adapt the model to my needs.
The example below is the simple VRP example from OR-Tools, with a small edit to the distance matrix and some changes following this blog post: https://activimetrics.com/blog/ortools/counting_dimension/. According to the latter, it is possible to achieve a fair distribution of routes, which seemed totally appealing, since, as a rule, the solver minimizes the longest route and ends up using fewer vehicles, assigning several nodes to each. What I needed was an approach that makes every vehicle act, ensuring that each one is used at least once.
With 5 vehicles this works: by the logic and the result obtained, the solver places a node on each vehicle's route, which was not possible without this edit. The problem is that with only 4 vehicles the solver no longer manages; it distributes routes, but always leaves one vehicle out.
using System;
using System.Collections.Generic;
using Google.OrTools.ConstraintSolver;
public class VrpGlobalSpan
{
class DataModel
{
public long[,] DistanceMatrix = {
{0, 9777, 10050,7908,10867,16601},
{9777, 0, 4763, 4855, 19567,31500},
{10050, 4763,0,2622,11733,35989},
{7908,4855,2622,0,10966,27877},
{10867,19567,11733,10966,0,27795},
{16601,31500,35989,27877,27795,0},
};
public int VehicleNumber = 4;
public int Depot = 0;
};
/// <summary>
/// Print the solution.
/// </summary>
static void PrintSolution(
in DataModel data,
in RoutingModel routing,
in RoutingIndexManager manager,
in Assignment solution)
{
// Inspect solution.
long maxRouteDistance = 0;
for (int i = 0; i < data.VehicleNumber; ++i)
{
Console.WriteLine("Route for Vehicle {0}:", i);
long routeDistance = 0;
var index = routing.Start(i);
while (routing.IsEnd(index) == false)
{
Console.Write("{0} -> ", manager.IndexToNode((int)index));
var previousIndex = index;
index = solution.Value(routing.NextVar(index));
routeDistance += routing.GetArcCostForVehicle(previousIndex, index, 0);
}
Console.WriteLine("{0}", manager.IndexToNode((int)index));
Console.WriteLine("Distance of the route: {0}m", routeDistance);
maxRouteDistance = Math.Max(routeDistance, maxRouteDistance);
}
Console.WriteLine("Maximum distance of the routes: {0}m", maxRouteDistance);
}
public static void Main(String[] args)
{
// Instantiate the data problem.
DataModel data = new DataModel();
// Create Routing Index Manager
RoutingIndexManager manager = new RoutingIndexManager(
data.DistanceMatrix.GetLength(0),
data.VehicleNumber,
data.Depot);
// Create Routing Model.
RoutingModel routing = new RoutingModel(manager);
// Create and register a transit callback.
int transitCallbackIndex = routing.RegisterTransitCallback(
(long fromIndex, long toIndex) => {
// Convert from routing variable Index to distance matrix NodeIndex.
var fromNode = manager.IndexToNode(fromIndex);
var toNode = manager.IndexToNode(toIndex);
return data.DistanceMatrix[fromNode, toNode];
}
);
// Define cost of each arc.
routing.SetArcCostEvaluatorOfAllVehicles(transitCallbackIndex);
double answer = 5/data.VehicleNumber +1;
//double Math.Ceiling(answer);
//double floor = (int)Math.Ceiling(answer);
routing.AddConstantDimension(
1,
(int)Math.Ceiling(answer),
true, // start cumul to zero
"Distance") ;
RoutingDimension distanceDimension = routing.GetDimensionOrDie("Distance");
//distanceDimension.SetGlobalSpanCostCoefficient(100);
for (int i = 0; i < data.VehicleNumber; ++i)
{
distanceDimension.SetCumulVarSoftLowerBound(routing.End(i), 2, 1000000);
}
// Setting first solution heuristic.
RoutingSearchParameters searchParameters =
operations_research_constraint_solver.DefaultRoutingSearchParameters();
//searchParameters.FirstSolutionStrategy =
// FirstSolutionStrategy.Types.Value.PathCheapestArc;
searchParameters.TimeLimit = new Google.Protobuf.WellKnownTypes.Duration { Seconds = 5 };
searchParameters.LocalSearchMetaheuristic = LocalSearchMetaheuristic.Types.Value.Automatic;
// Solve the problem.
Assignment solution = routing.SolveWithParameters(searchParameters);
// Print solution on console.
PrintSolution(data, routing, manager, solution);
}
}
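For comparison, the counting-dimension part looks roughly like this in the Python API (a minimal sketch; the soft lower bound of 2 and the large penalty mirror the C# above):
from ortools.constraint_solver import pywrapcp

def solve(distance_matrix, num_vehicles, depot=0):
    manager = pywrapcp.RoutingIndexManager(len(distance_matrix), num_vehicles, depot)
    routing = pywrapcp.RoutingModel(manager)

    def transit(from_index, to_index):
        return distance_matrix[manager.IndexToNode(from_index)][manager.IndexToNode(to_index)]

    routing.SetArcCostEvaluatorOfAllVehicles(routing.RegisterTransitCallback(transit))
    # Every arc adds 1, so the cumul at a vehicle's end counts the arcs on its
    # route: an unused vehicle ends at 1 (start -> end), so a soft lower bound
    # of 2 pushes each vehicle to visit at least one node.
    routing.AddConstantDimension(1, len(distance_matrix), True, "count")
    count = routing.GetDimensionOrDie("count")
    for v in range(num_vehicles):
        count.SetCumulVarSoftLowerBound(routing.End(v), 2, 1000000)
    params = pywrapcp.DefaultRoutingSearchParameters()
    params.time_limit.FromSeconds(5)
    return routing.SolveWithParameters(params)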
This has probably been discussed before, but I would like to understand what the best path to follow is, and what measures would transform this example (and others like it) into a better approach.
I thank you in advance for your attention and look forward to feedback.
Thank you.

Optimising an NBT parser that's only searching for a fraction of the data in Python

The NBT file is created by the Schematica mod for Minecraft.
With content.find(f'\xYY\xZZ..') and content = content[lower:upper] I was able to slice the data down to the most important part: the block data, which has the following form when translated (my data is NOT translated!):
TAG_List("blocks (Compound)"): (xdim*ydim*zdim) entries of type TAG_Compound
{
TAG_Compound:
{
TAG_List("pos (Int)"): 3 entries of type TAG_Int
{
TAG_Int: X
TAG_Int: Y
TAG_Int: Z
}
TAG_Int("state"): S
}
//...
TAG_Compound:
{
TAG_Compound("nbt"):
{
//Some uninteresting stuff
}
TAG_List("pos (Int)"): 3 entries of type TAG_Int
{
TAG_Int: X
TAG_Int: Y
TAG_Int: Z
}
TAG_Int("state"): S
}
}
My goal is to extract all "state" information from the raw byte data as fast as possible while ignoring all "pos" and "nbt" data.
Since not every "block" compound has an "nbt" subcompound, I can't just save every Xth byte: the length of each "block" compound is unknown.
So my current solution is to search for the keyword "state" (data.find(b'\x73\x74\x61\x74\x65')), save the following 4 bytes, cut off the word "state" from the data (data = data[index+4:]), and iterate.
I'm by no means an expert, but I feel like there has to be a better way.
To give you a better idea of the data I'm dealing with:
b'\t\x00\x06blocks\n\x00\x00\x00\x08\t\x00\x03pos\x03\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x03\x00\x05state\x00\x00\x00\x00\x00\t\x00\x03pos\x03\x00\x00\x00\x03\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x03\x00\x05state\x00\x00\x00\x01\x00\t\x00\x03pos\x03\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x03\x00\x05state\x00\x00\x00\x02\x00\t\x00\x03pos\x03\x00\x00\x00\x03\x00\x00\x00\x01\x00\x00\x00\x01\x00\x00\x00\x00\x03\x00\x05state\x00\x00\x00\x03\x00\n\x00\x03nbt\t\x00\x05Items\x00\x00\x00\x00\x00\x08\x00\x02id\x00\x0fminecraft:chest\x08\x00\x04Lock\x00\x00\x00\t\x00\x03pos\x03\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x03\x00\x05state\x00\x00\x00\x04\x00\t\x00\x03pos\x03\x00\x00\x00\x03\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x01\x03\x00\x05state\x00\x00\x00\x05\x00\t\x00\x03pos\x03\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x01\x03\x00\x05state\x00\x00\x00\x03\x00\t\x00\x03pos\x03\x00\x00\x00\x03\x00\x00\x00\x01\x00\x00\x00\x01\x00\x00\x00\x01\x03\x00\x05state\x00\x00\x00\x03\x00'
(I want the 4 bytes after each "state")
//Edit: I just realized slicing takes a lot of time, so if I just searched for all occurrences instead of re-slicing the buffer, I'd be a lot faster.
But I still hope there's a better way.
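For reference, the search-only version without re-slicing (a sketch; it assumes the byte sequence b'state' never occurs inside other payloads, as in the dump above):
import struct

def extract_states(data):
    """Collect the 4-byte big-endian int that follows each b'state' tag name."""
    states = []
    pos = data.find(b'state')
    while pos != -1:
        states.append(struct.unpack_from('>i', data, pos + 5)[0])  # payload starts after the 5 name bytes
        pos = data.find(b'state', pos + 5)
    return states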

Wandering star - codeabbey task

I'm trying to solve this problem and I'm not sure what to do next.
Link to the problem
Problem statement:
Suppose that some preliminary image preprocessing was already done and you have the data in the form of coordinates of stars on two pictures. These pictures are about 100x100 millimeters, and coordinates are also given in millimeters relative to their center. Look at the schematic representation below:
You see that in both pictures the stars are shown in a roughly circular area (think of it as the aperture of our telescope), and you can find out that they represent the same piece of the sky - though slightly rotated and slightly shifted.
You also can see that one of the stars (marked with red arrows) has changed its position relative to the others.
Your task is to find such a "wandering star", as it could be a comet or asteroid with high probability.
Note that some stars which are close to the edge could be absent from one of the pictures (due to the shift) - but the "wandering star" is not that far from the center and is therefore present in both images.
Input data contains two sections corresponding to two images.
Each section starts with a single integer - the number of stars listed.
Then the coordinates (X and Y) of the stars follow.
The answer should give the two indexes (0-based) of the wandering star in the first and second sections respectively.
The example is the same as pictures above. The star #29 from the first section with coordinates (-18.2, 11.1) is the same as the star #3 from the second section with coordinates (-19.7, 6.9).
Example input data:
94 # section 1 contains 94 stars
-47.5 -10.4
19.1 25.9
18.9 -10.4
-2.1 -47.6
...
...
92 # section 2 contains 92 stars
-14.8 10.9
18.8 -0.1
-11.3 5.7
-19.7 6.9
-11.5 -16.7
-45.4 -15.3
6.0 -46.9
-24.1 -26.3
30.2 27.4
...
...
The problem I'm facing
The problem is that the vectors do not match, and the two sections don't even have the same size. For example, the first vector in the first section doesn't match the first in the second section, so I can't calculate the rotation matrix from that. I also tried calculating it based on the centroids of each section, but some points near the edge might be absent, so the centroids differ (I tried including only the vectors whose length is < 40; the sizes still don't match).
So my question is what should I base my calculations on? How do I find some matching vectors so I can calculate the rotation matrix on them? I need a push in the right direction.
What I did is implement the functions to find the rotation matrix between two given vectors. The formula I'm using:
transformed_vector = R * original_vector + t
where R is the rotation matrix, and since the vector also moves along the axes a bit, I'm also adding t
Now all I need is two vectors to base my calculations on.
Edit: I should probably mention that I'm actually given two arrays of vectors, one for each image, I'm not actually given the images. I need to find the star that moved based on those vectors.
Thanks!
First, for simplicity let's initially pretend that there is no wandering star, and that every star that is in one image is in the other -- that is, that image B has been produced from image A by applying just a (possibly zero) shift and a (possibly zero) rotation.
Signatures
I think the best way to approach this is to "forget about" the position of each star in each image, and instead record for each star only a signature: some information that will not be changed by a shift (translation) or rotation. Then you can look at every signature in image A, and try to match it with a signature in image B. If we make the signatures in a sensible way, then they should do a good job of distinguishing the stars, so that no two stars in the same image get the same or highly similar signatures.
For a given star, something that is not changed by shifting or rotation is its distance to any other star in the same image. So you could use the complete set of distances to all the other n-1 stars as a star's signature (stored as an unordered set, since we don't know "which stars are which"). But it's probably sufficient (and faster, and ultimately more robust) to just use a subset of these, and in that case the distances to its nearest neighbours are probably the most informative (unless you have repeating patterns of stars appearing). How many to use? I would start by calculating the distance of each star to its nearest neighbour in the same image; if in image A all of these distances are "different enough" (i.e., pairwise differ by more than some threshold that you choose), and likewise in image B, then you're done -- that is, the signatures that you have calculated already distinguish the stars well enough -- otherwise go back and for each star insert the distance to its second-nearest neighbour into its signature, repeating this until "uniqueness" of signatures is achieved.
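For instance, the nearest-neighbour distances can be collected all at once (a sketch assuming scipy and an (n, 2) numpy array of star coordinates):
from scipy.spatial import cKDTree

def signatures(points, k=3):
    """Distances from every star to its k nearest neighbours in the same image."""
    dists, _ = cKDTree(points).query(points, k=k + 1)  # k+1: the nearest hit is the point itself
    return dists[:, 1:]  # drop the self-distance column; rows come back sorted ascending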
Comparing signatures
This brings up the question of how to decide whether two signatures are "different enough" when a signature contains more than one distance. Now you are comparing two sets of numbers instead of just two numbers. One way would be to sort the distances in each signature, and then compare them by calculating some measure of distance between points like the sum of squared differences (e.g., if each signature contains 3 distances, this would be (u[1]-v[1])^2 + (u[2]-v[2])^2 + (u[3]-v[3])^2 for two stars u and v, with u[i] being the distance from star u to its ith-nearest neighbour in the same image, and likewise for v.) On your example dataset, I would guess that probably 3-4 distances per star would be enough.
Robust and efficient signature comparison
However this will turn out to be not particularly robust once wandering and missing stars are considered: Suppose that some star u in image A has the wandering star v as its nearest neighbour, and in image B, v has moved away so that it is now the 5th-nearest neighbour to u. Then if we compare u's signature in image A with its signature in image B, we will get an incorrectly large distance, because we will be "blindly" comparing uA[1] with uB[1] and so on, instead of comparing uA[2] with uB[1], uA[3] with uB[2] and so on as we would like to (since these distances will be equal). So a better way to compare two signatures would be by taking the ordered list of distances of each one and allowing the numbers to "slide along until they fit" the numbers in the other one.
Happily this can be done in linear time with a list merge. Given two sorted lists of numbers X and Y, we perform the usual list merge strategy of choosing the smallest element from either list, removing it from there and writing it out, but we also keep track of two candidate partial solutions: Bm, the score of the best partial solution in which the most recently output number is matched with an earlier number, and Bf, the score of the best partial solution in which it is free (not matched with anything yet). Choose a penalty cost p to assign to numbers that are not matched with any number in the other list, and a scoring function score(a, b) that assigns a value of 0 when a = b and higher values the more different a and b are (e.g., score(a, b) = (a - b)^2). Then the optimal score sigDiff(X, Y) can be computed using:
1. Set Bm = Bf = 0.
2. If X and Y are both empty, then halt. The final score is min(Bf + p, Bm).
3. Call the element at the front of X x, and the element at the front of Y y. Let z be the smaller of x and y, and set whichList to the ID of the list (X or Y) that z came from.
4. Remove the first element (z) from the list named by whichList.
5. If whichList == prevWhichList (i.e. if the previous smallest number, prev, was also from the same list as z), or if this is the first iteration, then set newBm = min(Bf + p, Bm) + p.
6. Otherwise, it's possible to match z with the previous number written out, so set newBm = Bf + score(z, prev).
7. Set Bf = min(Bf + p, Bm).
8. Set Bm = newBm.
9. Set prev = z and prevWhichList = whichList.
10. Goto 2.
For example, suppose we have the two distance lists (i.e., two signatures) X = [10, 20, 30, 40] and Y = [11, 18, 41, 50], the penalty is 20, and score() is the sum of squared differences as above. The above algorithm will "align" the two sequences as follows:
X 10 20 30 40
matches \ / \
Y 11 18 41 50
cost 1 4 20 1 20 Total cost: 46
By contrast, the "naive" sum-of-squares method would give:
X 10 20 30 40
matches | | | |
Y 11 18 41 50
cost 1 4 121 100 Total cost: 226
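The merge above transcribes almost line for line into Python (a sketch; on X and Y from the example with p = 20 it returns the expected 46):
def sig_diff(xs, ys, p=20.0, score=lambda a, b: (a - b) ** 2):
    """Alignment cost of two sorted lists; unmatched numbers cost p each."""
    bm = bf = 0.0                                  # step 1
    prev = prev_list = None
    i = j = 0
    while i < len(xs) or j < len(ys):
        if j == len(ys) or (i < len(xs) and xs[i] <= ys[j]):
            z, which = xs[i], 0                    # steps 3-4: pop the smaller front element
            i += 1
        else:
            z, which = ys[j], 1
            j += 1
        if prev is None or which == prev_list:
            new_bm = min(bf + p, bm) + p           # step 5: z cannot be matched with prev
        else:
            new_bm = bf + score(z, prev)           # step 6: match z with prev
        bf = min(bf + p, bm)                       # step 7
        bm = new_bm                                # step 8
        prev, prev_list = z, which                 # step 9
    return min(bf + p, bm)                         # step 2: settle the final number

print(sig_diff([10, 20, 30, 40], [11, 18, 41, 50]))  # 46.0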
Matching signatures between the two images
We have built all this machinery so that we can try to match stars between the two images by matching their signatures. Arguably the "right" way to do this would be to solve the maximum weighted bipartite matching problem on the graph in which the vertices on one side are the signatures of the stars in image A, the vertices on the other side are the signatures of the stars in image B, and there is an edge from every vertex X on one side to every vertex Y on the other side with weight -sigDiff(X, Y) (negated because we want to minimise the total difference).
However, solving this might take a while, and it's also possible that this algorithm will find some incorrect "approximate" matches in an attempt to get an overall minimum cost. We would prefer to focus on just those pairs of stars which we are certain correspond to one another. That can be done much faster with a heuristic: if the closest star (based on sigDiff()) in image B to some star X in image A is Y, and Y's closest star in image A is in turn X (i.e., if the best match of X's best match is X), then we can be confident that X and Y really match each other. These confident matching pairs can be found quickly, and used to estimate an affine transform from image A to image B using least squares.
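In code, the mutual-best-match heuristic and the least-squares fit might look like this (a sketch assuming numpy and the sig_diff() above; the quadratic loop is fine for ~100 stars):
import numpy as np

def confident_pairs(sigs_a, sigs_b):
    """(i, j) pairs where j is i's best match in B and i is j's best match in A."""
    best_b = [min(range(len(sigs_b)), key=lambda j: sig_diff(sigs_a[i], sigs_b[j]))
              for i in range(len(sigs_a))]
    best_a = [min(range(len(sigs_a)), key=lambda i: sig_diff(sigs_a[i], sigs_b[j]))
              for j in range(len(sigs_b))]
    return [(i, j) for i, j in enumerate(best_b) if best_a[j] == i]

def fit_affine(src, dst):
    """Least-squares affine map taking src points (n, 2) onto dst points (n, 2)."""
    A = np.hstack([src, np.ones((len(src), 1))])   # rows of [x, y, 1]
    M, *_ = np.linalg.lstsq(A, dst, rcond=None)    # (3, 2) coefficient matrix
    return lambda pts: np.hstack([pts, np.ones((len(pts), 1))]) @ M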
Ignoring missing stars
We first compute the convex hull of image B's stars. Then, using the transform determined from the confident star pairs, we transform every star in image A to its estimated corresponding location in image B, and take the convex hull of this transformed image. We then intersect these two convex hulls: every star in either image that falls inside this intersection is a star that we expect to find in both images. Finally we throw away all stars from either image that are outside this intersected convex hull, and start again!
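One way to realise that filter (a sketch assuming scipy and matplotlib; a point lies in the intersection of two convex hulls exactly when it lies inside both of them):
import numpy as np
from scipy.spatial import ConvexHull
from matplotlib.path import Path

def in_hull_intersection(points, hull_points_1, hull_points_2):
    """Boolean mask over points: inside both convex hulls?"""
    h1 = Path(hull_points_1[ConvexHull(hull_points_1).vertices])
    h2 = Path(hull_points_2[ConvexHull(hull_points_2).vertices])
    return np.array([h1.contains_point(p) and h2.contains_point(p) for p in points])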
After running everything again (maybe it's not necessary to rerun everything, actually), we should find that there is exactly 1 star in image A and 1 star in image B that has poor similarity to every star in the other image (as measured by sigDiff(), as usual). These "two" stars are of course the single "wandering" star :)
[edit2] complete reedit
I have found some time/mood for this, to make it more robust.
let xy0[],xy1[] be the input star lists
let max_r be the nearby-search area threshold
let max_err be the max acceptable cluster match error
So here is the algorithm:
sort xy0[] by x asc
this makes the search faster and easier
find star clusters in xy0[]
loop through all stars
and cross-reference them with stars nearby
because of the sorting, the nearby stars will also be near the current star index
so just search the close area before and after this star in the array
until the x-distance crosses max_r
add the cluster to the cl0[] cluster list if found
(a cluster is 2 or more close stars)
before adding a new cluster
check that no existing cluster is near it
and merge with that cluster instead if too close
fully recompute the found clusters
avg coordinate
distances between all stars inside
sort them by distance asc
do steps 1.,2.,3. also for xy1[],cl1[]
find matches between clusters
so check if the distance lists inside are the same
(remember the lowest sum of abs error)
if the error is bigger than max_err, reject this cluster match
this is strong matching; I have tested it on many clusters (big max_r) without a mismatch on this dataset
take 2 clusters from the found clusters in cl0[] that have a match
take also the matched clusters
compute the transformation between xy0[],xy1[] from these 4 points
I use the avg coordinate of each cluster and it matches perfectly
This is the visual result:
the left side is the xy0[] set
the middle is the xy1[] set
and on the right are both, where the bigger blue dots are xy0[]
and the smaller green dots are the transformed xy1[]
the numbers are the errors of the cluster matches (-1 means no match found)
[notes]
I use my own List<T> template ...
it is just a dynamically reallocating linear array
List<int> x; is the same as int x[];
where x[i] is item access
x.num is number of items in the array
x.add(5); is the same as x[x.num]=5; x.num++;
From this point you can check for matches between xy0 and the transformed xy1
so flag matched stars to avoid duplicate use of them
use some threshold for this like max_err
from the stars that are left
find the two that are closest to each other
this should be the wandering star ... have fun
(you can sort the transformed xy1 again)
do not forget to use the original star indexes ix0[],ix1[] for the result output
[edit3] also the rest works
//---------------------------------------------------------------------------
// answer: 29 3
// input data:
const int n0=94; double xy0[n0][2]=
{
-47.5,-10.4,19.1,25.9,18.9,-10.4,-2.1,-47.6,41.8,-12.1,-15.7,12.1,-11.0,-0.6,
-15.6,-7.6,14.9,43.5,16.6,0.1,3.6,-33.5,-14.2,20.8,17.8,-29.8,-2.2,-12.8,
44.6,19.7,17.9,-41.3,24.6,37.0,43.9,14.5,23.8,19.6,-4.2,-40.5,32.0,17.2,
22.6,-26.9,9.9,-33.4,-13.6,6.6,48.5,-3.5,-9.9,-39.9,-28.2,20.7,7.1,15.5,
-36.2,-29.9,-18.2,11.1,-1.2,-13.7,9.3,9.3,39.2,15.8,-5.2,-16.2,-34.9,5.0,
-13.4,-31.8,24.7,-29.1,1.4,24.0,-24.4,18.0,11.9,-29.1,36.3,18.6,30.3,38.4,
4.8,-20.5,-46.8,12.1,-44.2,-6.0,-1.4,-39.7,-1.0,-13.7,13.3,23.6,37.4,-7.0,
-22.3,37.8,17.6,-3.3,35.0,-9.1,-44.5,13.1,-5.1,19.7,-12.1,1.7,-30.9,-1.9,
-19.4,-15.0,10.8,31.9,19.7,3.1,29.9,-16.6,31.7,-26.8,38.1,30.2,3.5,25.1,
-14.8,19.6,2.1,29.0,-9.6,-32.9,24.8,4.9,-2.2,-24.7,-4.3,-37.4,-3.0,37.4,
-34.0,-21.2,-18.4,34.6,9.3,-45.2,-21.1,-10.3,-19.8,29.1,31.3,37.7,27.2,19.3,
-1.6,-45.6,35.3,-23.5,-39.9,-19.8,-3.8,40.6,-15.7,12.5,-0.8,-16.3,-5.1,13.1,
-13.8,-25.7,43.8,5.6,9.2,38.6,42.2,0.2,-10.0,-48.6,14.1,-6.5,34.6,-26.8,
11.1,-6.7,-6.1,25.1,-38.3,8.1,
};
const int n1=92; double xy1[n1][2]=
{
-14.8,10.9,18.8,-0.1,-11.3,5.7,-19.7,6.9,-11.5,-16.7,-45.4,-15.3,6.0,-46.9,
-24.1,-26.3,30.2,27.4,21.4,-27.2,12.1,-36.1,23.8,-38.7,41.5,5.3,-8.7,25.5,
36.6,-5.9,43.7,-14.6,-9.7,-8.6,34.7,-19.3,-15.5,19.3,21.4,3.9,34.0,29.8,
6.5,19.5,28.2,-21.7,13.4,-41.8,-25.9,-6.9,37.5,27.8,18.1,44.7,-43.0,-19.9,
-15.7,18.0,2.4,-31.6,9.6,-37.6,15.4,-28.8,43.6,-11.2,4.6,-10.2,-8.8,38.2,
8.7,-34.6,-4.7,14.1,-1.7,31.3,0.6,27.9,26.3,13.7,-1.2,26.3,32.1,-17.7,
15.5,32.6,-14.4,-12.6,22.3,-22.5,7.0,48.5,-6.4,20.5,-42.9,4.2,-23.0,31.6,
-24.6,14.0,-30.2,-26.5,-29.0,15.7,6.0,36.3,44.3,13.5,-27.6,33.7,13.4,-43.9,
10.5,28.9,47.0,1.4,10.2,14.0,13.3,-15.9,-3.4,-25.6,-14.7,10.5,21.6,27.6,
21.8,10.6,-37.8,-14.2,7.6,-21.8,-8.6,1.3,6.8,-13.3,40.9,-15.3,-10.3,41.1,
6.0,-10.8,-1.5,-31.4,-35.6,1.0,2.5,-14.3,24.4,-2.6,-24.1,-35.3,-29.9,-34.7,
15.9,-1.0,19.5,7.0,44.5,19.1,39.7,2.7,2.7,42.4,-23.0,25.9,25.0,28.2,31.2,-32.8,
3.9,-38.4,-44.8,2.7,-39.9,-19.3,-7.0,-0.6,5.8,-10.9,-44.5,19.9,-31.5,-1.2,
};
//---------------------------------------------------------------------------
struct _dist // distance structure
{
int ix; // star index
double d; // distance to it
_dist(){}; _dist(_dist& a){ *this=a; }; ~_dist(){}; _dist* operator = (const _dist *a) { *this=*a; return this; }; /*_dist* operator = (const _dist &a) { ...copy... return this; };*/
};
struct _cluster // star cluster structure
{
double x,y; // avg coordinate
int iy; // ix of cluster match in the other set or -1
double err; // error of cluster match
List<int> ix; // star ix
List<double> d; // distances of stars ix[] against each other
_cluster(){}; _cluster(_cluster& a){ *this=a; }; ~_cluster(){}; _cluster* operator = (const _cluster *a) { *this=*a; return this; }; /*_cluster* operator = (const _cluster &a) { ...copy... return this; };*/
};
const double max_r=5.0; // find cluster max radius
const double max_err=0.2; // match cluster max distance error threshold
const double max_rr=max_r*max_r;
const double max_errr=max_err*max_err;
int wi0,wi1; // result wandering star ix ...
int ix0[n0],ix1[n1]; // original star indexes
List<_cluster> cl0,cl1; // found clusters
double txy1[n1][2]; // transformed xy1[]
//---------------------------------------------------------------------------
double atanxy(double x,double y)
{
const double pi=M_PI;
const double pi2=2.0*M_PI;
int sx,sy;
double a;
const double _zero=1.0e-30;
sx=0; if (x<-_zero) sx=-1; if (x>+_zero) sx=+1;
sy=0; if (y<-_zero) sy=-1; if (y>+_zero) sy=+1;
if ((sy==0)&&(sx==0)) return 0;
if ((sx==0)&&(sy> 0)) return 0.5*pi;
if ((sx==0)&&(sy< 0)) return 1.5*pi;
if ((sy==0)&&(sx> 0)) return 0;
if ((sy==0)&&(sx< 0)) return pi;
a=y/x; if (a<0) a=-a;
a=atan(a);
if ((x>0)&&(y>0)) a=a;
if ((x<0)&&(y>0)) a=pi-a;
if ((x<0)&&(y<0)) a=pi+a;
if ((x>0)&&(y<0)) a=pi2-a;
return a;
}
//---------------------------------------------------------------------------
void compute()
{
int i0,i1,e,f;
double a,x,y;
// original indexes (to keep track)
for (e=0;e<n0;e++) ix0[e]=e;
for (e=0;e<n1;e++) ix1[e]=e;
// sort xy0[] by x asc
for (e=1;e;) for (e=0,i0=0,i1=1;i1<n0;i0++,i1++)
if (xy0[i0][0]>xy0[i1][0])
{
e=ix0[i0] ; ix0[i0] =ix0[i1] ; ix0[i1] =e; e=1;
a=xy0[i0][0]; xy0[i0][0]=xy0[i1][0]; xy0[i1][0]=a;
a=xy0[i0][1]; xy0[i0][1]=xy0[i1][1]; xy0[i1][1]=a;
}
// sort xy1[] by x asc
for (e=1;e;) for (e=0,i0=0,i1=1;i1<n1;i0++,i1++)
if (xy1[i0][0]>xy1[i1][0])
{
e=ix1[i0] ; ix1[i0] =ix1[i1] ; ix1[i1] =e; e=1;
a=xy1[i0][0]; xy1[i0][0]=xy1[i1][0]; xy1[i1][0]=a;
a=xy1[i0][1]; xy1[i0][1]=xy1[i1][1]; xy1[i1][1]=a;
}
_dist d;
_cluster c,*pc,*pd;
List<_dist> dist;
// find star clusters in xy0[]
for (cl0.num=0,i0=0;i0<n0;i0++)
{
for (dist.num=0,i1=i0+1;(i1<n0)&&(fabs(xy0[i0][0]-xy0[i1][0])<=max_r);i1++) // stars nearby
{
x=xy0[i0][0]-xy0[i1][0]; x*=x;
y=xy0[i0][1]-xy0[i1][1]; y*=y; a=x+y;
if (a<=max_rr) { d.ix=i1; d.d=a; dist.add(d); }
}
if (dist.num>=2) // add/compute cluster if found
{
c.ix.num=0; c.err=-1.0;
c.ix.add(i0); for (i1=0;i1<dist.num;i1++) c.ix.add(dist[i1].ix); c.iy=-1;
c.x=xy0[i0][0]; for (i1=0;i1<dist.num;i1++) c.x+=xy0[dist[i1].ix][0]; c.x/=dist.num+1;
c.y=xy0[i0][1]; for (i1=0;i1<dist.num;i1++) c.y+=xy0[dist[i1].ix][1]; c.y/=dist.num+1;
for (e=1,i1=0;i1<cl0.num;i1++)
{
pc=&cl0[i1];
x=c.x-pc->x; x*=x;
y=c.y-pc->y; y*=y; a=x+y;
if (a<max_rr) // merge if too close to another cluster
{
pc->x=0.5*(pc->x+c.x);
pc->y=0.5*(pc->y+c.y);
for (e=0;e<c.ix.num;e++)
{
for (f=0;f<pc->ix.num;f++)
if (pc->ix[f]==c.ix[e]) { f=-1; break; }
if (f>=0) pc->ix.add(c.ix[e]);
}
e=0; break;
}
}
if (e) cl0.add(c);
}
}
// full recompute clusters
for (f=0,pc=&cl0[f];f<cl0.num;f++,pc++)
{
// avg coordinate
pc->x=0.0; for (i1=0;i1<pc->ix.num;i1++) pc->x+=xy0[pc->ix[i1]][0]; pc->x/=pc->ix.num;
pc->y=0.0; for (i1=0;i1<pc->ix.num;i1++) pc->y+=xy0[pc->ix[i1]][1]; pc->y/=pc->ix.num;
// distances
for (pc->d.num=0,i0= 0;i0<pc->ix.num;i0++)
for ( i1=i0+1;i1<pc->ix.num;i1++)
{
x=xy0[pc->ix[i1]][0]-xy0[pc->ix[i0]][0]; x*=x;
y=xy0[pc->ix[i1]][1]-xy0[pc->ix[i0]][1]; y*=y;
pc->d.add(sqrt(x+y));
}
// sort by distance asc
for (e=1;e;) for (e=0,i0=0,i1=1;i1<pc->d.num;i0++,i1++)
if (pc->d[i0]>pc->d[i1])
{
a=pc->d[i0]; pc->d[i0]=pc->d[i1]; pc->d[i1]=a; e=1;
}
}
// find star clusters in xy1[]
for (cl1.num=0,i0=0;i0<n1;i0++)
{
for (dist.num=0,i1=i0+1;(i1<n1)&&(fabs(xy1[i0][0]-xy1[i1][0])<=max_r);i1++) // stars nearby
{
x=xy1[i0][0]-xy1[i1][0]; x*=x;
y=xy1[i0][1]-xy1[i1][1]; y*=y; a=x+y;
if (a<=max_rr) { d.ix=i1; d.d=a; dist.add(d); }
}
if (dist.num>=2) // add/compute cluster if found
{
c.ix.num=0; c.err=-1.0;
c.ix.add(i0); for (i1=0;i1<dist.num;i1++) c.ix.add(dist[i1].ix); c.iy=-1;
c.x=xy1[i0][0]; for (i1=0;i1<dist.num;i1++) c.x+=xy1[dist[i1].ix][0]; c.x/=dist.num+1;
c.y=xy1[i0][1]; for (i1=0;i1<dist.num;i1++) c.y+=xy1[dist[i1].ix][1]; c.y/=dist.num+1;
for (e=1,i1=0;i1<cl1.num;i1++)
{
pc=&cl1[i1];
x=c.x-pc->x; x*=x;
y=c.y-pc->y; y*=y; a=x+y;
if (a<max_rr) // merge if too close to another cluster
{
pc->x=0.5*(pc->x+c.x);
pc->y=0.5*(pc->y+c.y);
for (e=0;e<c.ix.num;e++)
{
for (f=0;f<pc->ix.num;f++)
if (pc->ix[f]==c.ix[e]) { f=-1; break; }
if (f>=0) pc->ix.add(c.ix[e]);
}
e=0; break;
}
}
if (e) cl1.add(c);
}
}
// full recompute clusters
for (f=0,pc=&cl1[f];f<cl1.num;f++,pc++)
{
// avg coordinate
pc->x=0.0; for (i1=0;i1<pc->ix.num;i1++) pc->x+=xy1[pc->ix[i1]][0]; pc->x/=pc->ix.num;
pc->y=0.0; for (i1=0;i1<pc->ix.num;i1++) pc->y+=xy1[pc->ix[i1]][1]; pc->y/=pc->ix.num;
// distances
for (pc->d.num=0,i0= 0;i0<pc->ix.num;i0++)
for ( i1=i0+1;i1<pc->ix.num;i1++)
{
x=xy1[pc->ix[i1]][0]-xy1[pc->ix[i0]][0]; x*=x;
y=xy1[pc->ix[i1]][1]-xy1[pc->ix[i0]][1]; y*=y;
pc->d.add(sqrt(x+y));
}
// sort by distance asc
for (e=1;e;) for (e=0,i0=0,i1=1;i1<pc->d.num;i0++,i1++)
if (pc->d[i0]>pc->d[i1])
{
a=pc->d[i0]; pc->d[i0]=pc->d[i1]; pc->d[i1]=a; e=1;
}
}
// find matches
for (i0=0,pc=&cl0[i0];i0<cl0.num;i0++,pc++) if (pc->iy<0){ e=-1; x=0.0;
for (i1=0,pd=&cl1[i1];i1<cl1.num;i1++,pd++) if (pc->d.num==pd->d.num)
{
for (y=0.0,f=0;f<pc->d.num;f++) y+=fabs(pc->d[f]-pd->d[f]);
if ((e<0)||(x>y)) { e=i1; x=y; }
}
x/=pc->d.num;
if ((e>=0)&&(x<max_err))
{
if (cl1[e].iy>=0) cl0[cl1[e].iy].iy=-1;
pc->iy =e; cl1[e].iy=i0;
pc->err=x; cl1[e].err=x;
}
}
// compute transform
double tx0,tx1,ty0,ty1,tc,ts;
tx0=0.0; tx1=0.0; ty0=0.0; ty1=0.0; tc=1.0; ts=0.0; i0=-1; i1=-1;
for (e=0,f=0,pc=&cl0[e];e<cl0.num;e++,pc++) if (pc->iy>=0) // find 2 clusters with match
{
if (f==0) i0=e;
if (f==1) { i1=e; break; }
f++;
}
if (i1>=0)
{
pc=&cl0[i0]; // translation and offset from xy0 set
pd=&cl0[i1];
tx1=pc->x;
ty1=pc->y;
a =atanxy(pd->x-pc->x,pd->y-pc->y);
pc=&cl1[pc->iy]; // translation and offset from xy1 set
pd=&cl1[pd->iy];
tx0=pc->x;
ty0=pc->y;
a-=atanxy(pd->x-pc->x,pd->y-pc->y);
tc=cos(a);
ts=sin(a);
}
// transform xy1 -> txy1 (in xy0 coordinate system)
for (i1=0;i1<n1;i1++)
{
x=xy1[i1][0]-tx0;
y=xy1[i1][1]-ty0;
txy1[i1][0]=x*tc-y*ts+tx1;
txy1[i1][1]=x*ts+y*tc+ty1;
}
// sort txy1[] by x asc (after transform)
for (e=1;e;) for (e=0,i0=0,i1=1;i1<n1;i0++,i1++)
if (txy1[i0][0]>txy1[i1][0])
{
e= ix1[i0] ; ix1[i0] = ix1[i1] ; ix1[i1] =e; e=1;
a=txy1[i0][0]; txy1[i0][0]=txy1[i1][0]; txy1[i1][0]=a;
a=txy1[i0][1]; txy1[i0][1]=txy1[i1][1]; txy1[i1][1]=a;
}
// find match between xy0,txy1 (this can be sped up by exploiting the sorted order)
int ix01[n0],ix10[n1];
for (i0=0;i0<n0;i0++) ix01[i0]=-1;
for (i1=0;i1<n1;i1++) ix10[i1]=-1;
for (i0=0;i0<n0;i0++){ a=-1.0;
for (i1=0;i1<n1;i1++)
{
x=xy0[i0][0]-txy1[i1][0]; x*=x;
y=xy0[i0][1]-txy1[i1][1]; y*=y; x+=y;
if (x<max_errr)
if ((a<0.0)||(a>x)) { a=x; ix01[i0]=i1; ix10[i1]=i0; }
}}
// find the closest stars from unmatched stars
a=-1.0; wi0=-1; wi1=-1;
for (i0=0;i0<n0;i0++) if (ix01[i0]<0)
for (i1=0;i1<n1;i1++) if (ix10[i1]<0)
{
x=xy0[i0][0]-txy1[i1][0]; x*=x;
y=xy0[i0][1]-txy1[i1][1]; y*=y; x+=y;
if ((wi0<0)||(a>x)) { a=x; wi0=i0; wi1=i1; }
}
}
//---------------------------------------------------------------------------
void draw()
{
bmp->Canvas->Font->Charset=OEM_CHARSET;
bmp->Canvas->Font->Name="System";
bmp->Canvas->Font->Pitch=fpFixed;
bmp->Canvas->Font->Color=0x00FFFF00;
bmp->Canvas->Brush->Color=0x00000000;
bmp->Canvas->FillRect(TRect(0,0,xs,ys));
_cluster *pc;
int i,x0,y0,x1,y1,x2,y2,xx,yy,r,_r=4;
double x,y,m;
x0=xs/6; x1=3*x0; x2=5*x0;
y0=ys/2; y1= y0; y2= y0;
x=x0/60.0; y=y0/60.0; if (x<y) m=x; else m=y;
// clusters match
bmp->Canvas->Pen ->Color=clAqua;
bmp->Canvas->Brush->Color=0x00303030;
for (i=0,pc=&cl0[i];i<cl0.num;i++,pc++)
if (pc->iy>=0)
{
x=pc->x*m; xx=x0+x;
y=pc->y*m; yy=y0-y;
bmp->Canvas->MoveTo(xx,yy);
x=cl1[pc->iy].x*m; xx=x1+x;
y=cl1[pc->iy].y*m; yy=y1-y;
bmp->Canvas->LineTo(xx,yy);
}
// clusters area
for (i=0,pc=&cl0[i];i<cl0.num;i++,pc++)
{
x=pc->x*m; xx=x0+x;
y=pc->y*m; yy=y0-y;
r=pc->d[pc->d.num-1]*m*0.5+_r;
bmp->Canvas->Ellipse(xx-r,yy-r,xx+r,yy+r);
bmp->Canvas->TextOutA(xx+r,yy+r,AnsiString().sprintf("%.3lf",pc->err));
}
for (i=0,pc=&cl1[i];i<cl1.num;i++,pc++)
{
x=pc->x*m; xx=x1+x;
y=pc->y*m; yy=y1-y;
r=pc->d[pc->d.num-1]*m*0.5+_r;
bmp->Canvas->Ellipse(xx-r,yy-r,xx+r,yy+r);
bmp->Canvas->TextOutA(xx+r,yy+r,AnsiString().sprintf("%.3lf",pc->err));
}
// stars
r=_r;
bmp->Canvas->Pen ->Color=clAqua;
bmp->Canvas->Brush->Color=clBlue;
for (i=0;i<n0;i++)
{
x=xy0[i][0]*m; xx=x0+x;
y=xy0[i][1]*m; yy=y0-y;
bmp->Canvas->Ellipse(xx-r,yy-r,xx+r,yy+r);
}
for (i=0;i<n1;i++)
{
x=xy1[i][0]*m; xx=x1+x;
y=xy1[i][1]*m; yy=y1-y;
bmp->Canvas->Ellipse(xx-r,yy-r,xx+r,yy+r);
}
// merged sets
r=_r;
bmp->Canvas->Pen ->Color=clBlue;
bmp->Canvas->Brush->Color=clBlue;
for (i=0;i<n0;i++)
{
x=xy0[i][0]*m; xx=x2+x;
y=xy0[i][1]*m; yy=y2-y;
bmp->Canvas->Ellipse(xx-r,yy-r,xx+r,yy+r);
}
r=_r-2;
bmp->Canvas->Pen ->Color=clGreen;
bmp->Canvas->Brush->Color=clGreen;
for (i=0;i<n1;i++)
{
x=txy1[i][0]*m; xx=x2+x;
y=txy1[i][1]*m; yy=y2-y;
bmp->Canvas->Ellipse(xx-r,yy-r,xx+r,yy+r);
}
// wandering star
r=_r+5;
bmp->Canvas->Pen ->Color=0x00FF8000;
bmp->Canvas->Font ->Color=0x00FF8000;
bmp->Canvas->Brush->Style=bsClear;
x=xy0[wi0][0]*m; xx=x2+x;
y=xy0[wi0][1]*m; yy=y2-y;
bmp->Canvas->Ellipse(xx-r,yy-r,xx+r,yy+r);
bmp->Canvas->TextOutA(xx+r,yy+r,ix0[wi0]);
bmp->Canvas->Pen ->Color=0x0040FF40;
bmp->Canvas->Font ->Color=0x0040FF40;
x=txy1[wi1][0]*m; xx=x2+x;
y=txy1[wi1][1]*m; yy=y2-y;
bmp->Canvas->Ellipse(xx-r,yy-r,xx+r,yy+r);
bmp->Canvas->TextOutA(xx+r,yy+r,ix1[wi1]);
bmp->Canvas->Brush->Style=bsSolid;
Form1->Canvas->Draw(0,0,bmp);
}
//---------------------------------------------------------------------------
And here is the final result:
As you can see, the result matches the answer from the source link.

Rewriting this python function in c++ seems to make it run a lot slower. Is this reasonable?

I'm implementing some graph traversal functions in Python, but I need better performance, so I decided to try rewriting the functions in C++; however, they seem to run slower. I'm a C++ beginner, so I'm not sure if this is expected behavior.
The following python functions implements a Breadth-First Search on an unweighted graph. Its objective is to visit every vertex once and measure how many hops away each vertex is from the source.
graph is dict {vertex : set(neighbor1, neighbor2 ... , neighbor n) }
return is dict {vertex : distance_to_source}
def shortest_path_lengths(graph,source):
seen={}
level=0
nextlevel={source}
while nextlevel:
thislevel=nextlevel
nextlevel=set()
for v in thislevel:
if v not in seen:
seen[v]=level
nextlevel.update(graph[v])
level=level+1
return seen
And runs:
%timeit seen = shortest_path_lengths(G,0)
10 loops, best of 3: 79.7 ms per loop
For my c++ implementation:
graph is map< long vertex, set < long > vertex neighbors >
return is map < long vertex ,int distance_from_source >
map<long,int> spl(graph G, long source)
{
int level = 0;
map<long, int> seen;
set<long> nextlevel;
set<long> thislevel;
nextlevel.insert(source);
while (! nextlevel.empty())
{
thislevel = nextlevel;
nextlevel.clear();
for (auto it = thislevel.begin(); it != thislevel.end(); ++it)
{
if (! seen.count(*it))
{
seen[*it] = level;
//cout << G[*it];
nextlevel.insert(G[*it].begin(), G[*it].end());
}
}
level++;
}
return seen;
}
and I measure its execution time with:
clock_t begin = clock();
seen = spl(graph1,0);
clock_t end = clock();
double elapsed_secs = double(end - begin) / CLOCKS_PER_SEC;
printf("\nTime taken: %.8fs\n", elapsed_secs);
and get output:
Time taken: 0.38512900s
which is almost 5 times slower than the Python implementation for the same graph. As a beginner in C++, I'm not really sure whether I'm measuring time wrong, implementing something wrong, or whether this is simply expected behavior.
EDIT:
After converting the maps to unordered_maps, using the -O2 -flto compile flags, and passing the graph by const reference, the run time of the C++ function for a graph of size 65k drops to 0.09s, which is still a bit slower than Python's 0.08s for the same graph.
On a bigger graph of 75k nodes (but over twice as many edges), C++ falls further behind, at 0.3s to Python's 0.2s.
EDIT2:
After changing the nested set inside the map to unordered_set as well, and changing the thislevel/nextlevel sets to unordered_sets too, the C++ code beats the Python code on the smaller graph (0.063 to 0.081 sec) but only matches it on the bigger one (0.2 to 0.2).
EDIT3:
On an even bigger graph (85k nodes, over 1.5M edges), Python needs 0.9s for the operation, while the C++ code needs 0.75s.
The first thing that jumped out at me is that you used a hash map in Python, which is what dictionaries are, and tree-based containers in C++, which is what map and set are. The C++ equivalent of a dict is unordered_map.
thislevel = nextlevel;
In C++, this makes a full copy of the set on every level. Instead, replace the copy-then-clear with std::swap(thislevel, nextlevel); nextlevel.clear(); since swapping two sets is O(1). (Equivalently, keep pointers to the two sets and swap the pointers.)

From C to Python

I'm really sorry if this is a lame question, but I think it may help others making the same transition from C to Python. I have a program that I started writing in C, but I think it's best if I did it in Python, because it would make my life a lot easier.
My program retrieves intraday stock data from Yahoo! Finance and stores it inside a struct. Since I'm so used to programming in C, I generally try to do things the hard way. What I want to know is: what's the most "Pythonesque" way of storing the data in an organized fashion? I was thinking an array of tuples?
Here's a bit of my C program.
// Parses intraday stock quote data from a Yahoo! Finance .csv file.
void parse_intraday_data(struct intraday_data *d, char *path)
{
char cur_line[100];
char *csv_value;
int i;
FILE *data_file = fopen(path, "r");
if (data_file == NULL)
{
perror("Error opening file.");
return;
}
// Ignore the first 15 lines.
for (i = 0; i < 15; i++)
fgets(cur_line, 100, data_file);
i = 0;
while (fgets(cur_line, 100, data_file) != NULL) {
csv_value = strtok(cur_line, ",");
csv_value = strtok(NULL, ",");
d->close[i] = atof(csv_value);
csv_value = strtok(NULL, ",");
d->high[i] = atof(csv_value);
csv_value = strtok(NULL, ",");
d->low[i] = atof(csv_value);
csv_value = strtok(NULL, ",");
d->open[i] = atof(csv_value);
csv_value = strtok(NULL, "\n");
d->volume[i] = atoi(csv_value);
i++;
}
d->close[i] = 0;
d->high[i] = 0;
d->low[i] = 0;
d->open[i] = 0;
d->volume[i] = 0;
d->count = i - 1;
i = 0;
fclose(data_file);
}
So far my Python program retrieves the data like this.
response = urllib2.urlopen('https://www.google.com/finance/getprices?i=' + interval + '&p=' + period + 'd&f=d,o,h,l,c,v&df=cpct&q=' + ticker)
Question is, what's the best or most elegant way of storing this data in Python?
Keep it simple. Read the line, split it by commas, and store the values inside a (named)tuple. That’s pretty close to using a struct in C.
If your program gets more elaborate it might (!) make sense to replace the tuple by a class, but not immediately.
Here’s an outline:
from collections import namedtuple
IntradayData = namedtuple('IntradayData',
    ['close', 'high', 'low', 'open', 'volume', 'count'])
response = urllib2.urlopen('https://www.google.com/finance/getprices?q=AAPL')
result = response.read().split('\n')
result = result[15:]  # Your code does this, too. Not sure why.
all_data = []
for i, data in enumerate(result):
    if data == '':
        continue
    d, o, h, l, c, v = data.split(',')  # d,o,h,l,c,v column order, as in your URL
    all_data.append(IntradayData(float(c), float(h), float(l), float(o), int(v), i))
I believe it depends on how much data manipulation you want to do after retrieving the data.
If, for example, you plan to just print it to the screen, then an array of tuples would do.
However, if you need to sort, search, and do other kinds of data manipulation, I believe a custom class could help: you would then work with a list (or even a home-brew container) of custom objects, making it easy to add custom methods based on your needs.
Note that this is just my opinion, and I'm not an advanced Python developer.
Pandas (http://pandas.pydata.org/pandas-docs/stable/) is particularly well suited to this. Numpy is a little lower level, but may also suit your purposes. I really recommend going the pandas route, though. Either way you shouldn't lose too much of C's speed, so that's a plus.
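For example (a sketch; 'intraday.csv' is a hypothetical saved copy of the response, and the column names assume the d,o,h,l,c,v layout requested in the question):
import pandas as pd

df = pd.read_csv('intraday.csv', skiprows=15, header=None,
                 names=['date', 'open', 'high', 'low', 'close', 'volume'])
print(df['close'].mean(), df['volume'].sum())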
