Given that we have two lines on a graph (I just noticed that I inverted the numbers on the Y axis; this was a mistake, it should go from 11 to 1)
And we only care about whole number X axis intersections
We need to order these points from highest Y value to lowest Y value regardless of their position on the X axis (Note I did these pictures by hand so they may not line up perfectly).
I have a couple of questions:
1) I have to assume this is a known problem, but does it have a particular name?
2) Is there a known optimal solution when dealing with tens of billions (or hundreds of millions) of lines? Our current process of manually calculating each point and then comparing it to a giant list requires hours of processing. Even though we may have a hundred million lines, we typically only want the top 100 or 50,000 results; some lines are so far "below" other lines that calculating their points is unnecessary.
Your data structure is a set of tuples
lines = {(y0, Δy0), (y1, Δy1), ...}
You need only the ntop points, hence build a set containing only
the top ntop yi values, with a single pass over the data
top_points = choose(lines, ntop)
EDIT --- to choose the ntop we had to keep track of the smallest
one, and this is interesting info, so let's return also this value
from choose, also we need to initialize decremented
top_points, smallest = choose(lines, ntop)
decremented = top_points
and start a loop...
while True:
Generate a set of decremented values
decremented = {(y-Δy, Δy) for y, Δy in top_points}
decremented = {(y-Δy, Δy) for y, Δy in decremented if y>smallest}
if not decremented: break
Generate a set of candidates
candidates = top_points.union(decremented)
generate a new set of top points
new_top_points, smallest = choose(candidates, ntop)
The following is no longer necessary
check if new_top_points == top_points
if new_top_points == top_points: break
top_points = new_top_points
of course we are in a loop...
The difficult part is the choose function, but I think that this answer to the question "How can I sort 1 million numbers, and only print the top 10 in Python?" could help you.
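As a rough illustration of what choose could look like, here is a minimal sketch built on the standard library's heapq.nlargest, which keeps only ntop items in memory while scanning the data once. The (y, Δy) tuple layout follows the answer above; the exact return shape (a set plus the smallest retained y) is an assumption made to match how choose is called in the loop.

```python
import heapq

def choose(lines, ntop):
    """Return the ntop tuples with the largest y, plus the smallest y kept.

    `lines` is any iterable of (y, dy) tuples. Assumed return shape:
    (set_of_tuples, smallest_y), with smallest_y None for empty input.
    """
    top = heapq.nlargest(ntop, lines, key=lambda line: line[0])
    smallest = top[-1][0] if top else None
    return set(top), smallest

lines = {(10, 1), (7, 2), (3, 1), (9, 3)}
top_points, smallest = choose(lines, 2)
```

heapq.nlargest runs in roughly O(N log ntop) time, so for a small ntop it is much cheaper than sorting the whole input.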
It's not a really complicated thing, just a "normal" sorting problem.
Usually sorting requires a large amount of computing time. But your case is one where you don't need to use complex sorting techniques.
The values on both graphs rise or fall monotonically; there are no "jumps". You can use this to your advantage. The basic algorithm:
identify if a graph is rising or falling.
write a generator that generates the values: from left to right if rising, from right to left if falling.
get the first value from both graphs
insert the lower one into the result list
get a new value from the graph that had the lower value
repeat the last two steps until one generator is "empty"
append the leftover items from the other generator.
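The steps above amount to merging two already-sorted streams. A minimal sketch, assuming each line is described by a hypothetical (y0, dy) pair with y = y0 + dy * x over whole-number x values, can lean on heapq.merge from the standard library, which performs exactly this lazy merge:

```python
import heapq

def line_values(y0, dy, xs):
    """Yield (y, x) pairs in ascending y order for the line y = y0 + dy * x.

    A rising line is walked left to right, a falling one right to left,
    so the generator always produces non-decreasing y values.
    """
    ordered = xs if dy >= 0 else reversed(xs)
    for x in ordered:
        yield (y0 + dy * x, x)

xs = range(0, 5)
rising = line_values(1, 2, xs)     # y = 1, 3, 5, 7, 9
falling = line_values(10, -1, xs)  # y = 6, 7, 8, 9, 10 (walked from x = 4 down to 0)

# heapq.merge pulls one value at a time from whichever stream is lower.
merged = list(heapq.merge(rising, falling))
ys = [y for y, _ in merged]
```

This produces the points from lowest to highest Y; for the highest-to-lowest order asked for in the question, the finished list can simply be reversed.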
Let's say I generate a pack, i.e., a one-dimensional array of 10 random numbers, with a random generator. Then I generate another array of 10 random numbers. I do this X times. How can I generate unique arrays, so that even after a trillion generations no array is equal to another?
In one array, the elements can be duplicates. The array just has to differ from the other arrays with at least one different element from all its elements.
Is there any numpy method for this? Is there some special algorithm which works differently by exploring some space for the random generation? I don’t know.
One easy answer would be to write the arrays to a file and check whether they were generated already, but the I/O operations on an ever-growing file take far too much time.
This is a difficult request, since one of the properties of an RNG is that it should repeat sequences randomly.
You also have the problem of trying to record terabytes of prior results. One thing you could try is to form a hash table (for search speed) of the existing arrays. Using this depends heavily on whether you have sufficient RAM to hold the entire list.
If not, you might consider disk-mapping a fast search structure of some sort. For instance, you could implement an on-disk binary tree of hash keys, re-balancing whenever you double the size of the tree (with insertions). This lets you keep the file open and find entries via seek, rather than needing to represent the full file in memory.
You could also maintain an in-memory index to the table, using that to drive your seek to the proper file section, then reading only a small subset of the file for the final search.
Does that help focus your implementation?
Assume that the 10 numbers in a pack are each in the range [0..max]. Each pack can then be considered as a 10 digit number in base max+1. Obviously, the size of max determines how many unique packs there are. For example, if max=9 there are 10,000,000,000 possible unique packs from [0000000000] to [9999999999].
The problem then comes down to generating unique numbers in the correct range.
Given your "trillions" then the best way to generate guaranteed unique numbers in the range is probably to use an encryption with the correct size output. Unless you want 64 bit (DES) or 128 bit (AES) output then you will need some sort of format preserving encryption to get output in the range you want.
For input, just encrypt the numbers 0, 1, 2, ... in turn. Encryption guarantees that, given the same key, the output is unique for each unique input. You just need to keep track of how far you have got with the input numbers. Given that, you can generate more unique packs as needed, within the limit imposed by max. After that point the output will start repeating.
Obviously as a final step you need to convert the encryption output to a 10 digit base max+1 number and put it into an array.
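To make the idea concrete, here is a minimal sketch of the "encrypt a counter" approach. It is not a vetted format-preserving encryption scheme: it uses a small hash-based Feistel network plus cycle-walking purely to illustrate that a keyed permutation of [0, N) yields unique outputs for unique inputs. The function names, the key format and the round count are all assumptions made for this example.

```python
import hashlib

def _round_value(half, key, round_no, mask):
    # Keyed round function; any deterministic function works for a Feistel network.
    digest = hashlib.sha256(f"{key}:{round_no}:{half}".encode()).digest()
    return int.from_bytes(digest, "big") & mask

def permute(index, key, n_values, rounds=4):
    """Map index in [0, n_values) to a unique value in [0, n_values)."""
    half_bits = max(1, ((n_values - 1).bit_length() + 1) // 2)
    mask = (1 << half_bits) - 1
    value = index
    while True:
        left, right = value >> half_bits, value & mask
        for round_no in range(rounds):
            left, right = right, left ^ _round_value(right, key, round_no, mask)
        value = (left << half_bits) | right
        if value < n_values:  # cycle-walk until the result is back in range
            return value

def index_to_pack(index, key, max_value, pack_length=10):
    """Convert a counter value into a pack of digits in base max_value + 1."""
    base = max_value + 1
    value = permute(index, key, base ** pack_length)
    pack = []
    for _ in range(pack_length):
        value, digit = divmod(value, base)
        pack.append(digit)
    return pack

# Small demonstration: all 1000 packs of length 3 over the digits 0..9 are distinct.
packs = {tuple(index_to_pack(i, "demo-key", 9, pack_length=3)) for i in range(1000)}
```

Only the counter (how many packs have been produced so far) needs to be stored, rather than the packs themselves.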
Important caveat:
This will not allow you to generate "arbitrarily" many unique packs. Please see the limits highlighted by @Prune.
Note that as the number of requested packs approaches the number of unique packs, it takes longer and longer to find a new pack. I also put in a safety limit so that after a certain number of tries it just gives up.
Feel free to adjust:
import random

## -----------------------
## Build a unique pack generator
## -----------------------
def build_pack_generator(pack_length, min_value, max_value, max_attempts):
    # Only hashes are stored to save memory; a hash collision simply makes
    # a new pack look like a duplicate and triggers a retry.
    existing_packs = set()

    def _generator():
        pack = tuple(random.randint(min_value, max_value) for _ in range(pack_length))
        pack_hash = hash(pack)
        attempts = 1
        while pack_hash in existing_packs:
            if attempts >= max_attempts:
                raise KeyError("Unable to find a valid pack")
            pack = tuple(random.randint(min_value, max_value) for _ in range(pack_length))
            pack_hash = hash(pack)
            attempts += 1
        existing_packs.add(pack_hash)
        return list(pack)

    return _generator

generate_unique_pack = build_pack_generator(2, 1, 9, 1000)
## -----------------------
for _ in range(50):
    print(generate_unique_pack())
The Birthday problem suggests that at some point you don't need to bother checking for duplicates. For example, if each value in a 10 element "pack" can take on more than ~250 values then you only have a 50% chance of seeing a duplicate after generating 1e12 packs. The more distinct values each element can take on the lower this probability.
You've not specified what these random values are in this question (other than being uniformly distributed) but your linked question suggests they are Python floats. Hence each number has 2**53 distinct values it can take on, and the resulting probability of seeing a duplicate is practically zero.
There are a few ways of rearranging this calculation:
1) for a given amount of state and number of iterations, what's the probability of seeing at least one collision
2) for a given amount of state, how many iterations can you generate to stay below a given probability of seeing at least one collision
3) for a given number of iterations and probability of seeing a collision, what state size is required
The Python code below calculates option 3 as it seems closest to your question. The other options are available on the birthday attack page.
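from math import log2, log1p

def birthday_state_size(size, p):
    # size is the number of iterations (packs generated), p the collision probability
    # -log1p(-p) is a numerically stable version of log(1/(1-p))
    return size**2 / (2*-log1p(-p))

log2(birthday_state_size(1e12, 1e-6)) # => ~98.7, i.e. roughly 100 bits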
So as long as you have more than 100 uniform bits of state in each pack everything should be fine. For example, two or more Python floats is OK (2 * 53), as is 10 integers with >= 1000 distinct values (10*log2(1000)).
You can of course reduce the probability down even further, but as noted in the Wikipedia article going below 1e-15 quickly approaches the reliability of a computer. This is why I say "practically zero" given the 530 bits of state provided by 10 uniformly distributed floats.
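For completeness, the first two rearrangements of the calculation can be sketched with the same standard birthday approximation, p ≈ 1 - exp(-n² / (2N)), where n is the number of packs generated and N the number of distinct packs. The function names here are made up for this example.

```python
from math import expm1, log1p, sqrt

def collision_probability(iterations, state_size):
    """Approximate chance of at least one duplicate after `iterations` draws."""
    # -expm1(-x) is a numerically stable version of 1 - exp(-x)
    return -expm1(-iterations**2 / (2 * state_size))

def max_iterations(state_size, p):
    """Approximate number of draws that keeps the collision chance below p."""
    return sqrt(2 * state_size * -log1p(-p))

# 10 elements with 250 distinct values each, 1e12 packs generated
p_dup = collision_probability(1e12, 250**10)

# How many packs of 10 Python floats (530 bits of state) before a 1e-15 risk
n_safe = max_iterations(2.0**530, 1e-15)
```

With 250 values per element the probability comes out at roughly 0.4, in line with the "about 50%" rule of thumb given earlier.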
There exists a set of points (or items, it doesn't matter). Each point a is at a specific distance from other points in the set. The distance can be retrieved via the function retrieve_dist(a, b).
This question is about programming (in Python) an algorithm to pick a point, with replacement, from this set of points. The picked point:
i) has to be at the maximum possible distance from all already-selected points, while adhering to the requirement in (ii)
ii) the number of times an already-selected point occurs in the sample must carry weight in this calculation. I.e. more frequently-selected points should be weighted more heavily.
E.g. imagine a and b have already been selected (100 and 10 times respectively). Then when the next point is to be selected, its distance from a matters more than its distance from b, in line with the frequency of occurrence of a in the already-selected sample.
What I can try:
This would have been easy to accomplish if weights/frequencies weren't in play. I could do:
from collections import defaultdict

distances = defaultdict(int)
for new_point in set_of_points:
    for already_selected_point in selected_points:
        distances[new_point] += retrieve_dist(new_point, already_selected_point)
Then I'd sort distances.items() by the second entry in each tuple, and would get the desired item to select.
However, when frequencies of already-selected points come into play, I just can't seem to wrap my head around this problem.
Can an expert help out? Thanks in advance.
A solution to your problem would be to make selected_points a list rather than a set. In this case, each new point is compared to a and b (and all other points) as many times as they have already been found.
If each point is typically found many times, it might be possible to improve performance by using a dict instead, with the keys being the points and the values being the number of times each point has been selected. In that case I think your algorithm would be
distances = defaultdict(int)
for new_point in set_of_points:
    for already_selected_point, occurrences in selected_points.items():
        distances[new_point] += occurrences * retrieve_dist(new_point, already_selected_point)
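A small self-contained sketch of the dict-based version follows. It assumes 2D points and uses Euclidean distance as a stand-in for retrieve_dist, since the real function is not shown in the question; collections.Counter holds the selection frequencies.

```python
from collections import Counter
from math import dist

def retrieve_dist(a, b):
    # Hypothetical stand-in for the question's distance lookup
    return dist(a, b)

def pick_farthest(points, selected_counts):
    """Return the point with the largest frequency-weighted total distance."""
    def weighted_total(p):
        return sum(count * retrieve_dist(p, s)
                   for s, count in selected_counts.items())
    return max(points, key=weighted_total)

points = [(0, 0), (10, 0), (4, 0)]
selected = Counter({(0, 0): 100, (10, 0): 10})

best = pick_farthest(points, selected)  # (0, 0) was picked 100 times, so it dominates
selected[best] += 1                     # record the new selection (with replacement)
```

Here the weighted totals are 100 for (0, 0), 460 for (4, 0) and 1000 for (10, 0), so the point farthest from the heavily selected (0, 0) wins.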
I have the following code which randomly generates a list of (X,Y) tuples:
import random
coords = []
for i in range(10):
    x = random.randint(85,939)
    y = random.randint(75,693)
    coords.append((x,y))
In the final list, the X values of two tuples are considered to overlap if the absolute difference between them is less than 85, and the Y values are considered to overlap if the absolute difference is less than 75. How can I make sure that none of the tuples in the final list will overlap in both dimensions?
The easiest way to do this is to just keep sampling and discarding coordinates which will create overlap. This will however become very inefficient when you come close to filling the available space. If that's not an issue you should go with this solution.
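A minimal sketch of this sample-and-discard approach, using the ranges and thresholds from the question, might look like the following. The max_attempts cap and the rng parameter are additions for this example, so that the loop cannot run forever once the space fills up and so that results can be reproduced.

```python
import random

def overlaps(a, b, min_dx=85, min_dy=75):
    """Two points clash only if they are too close in BOTH dimensions."""
    return abs(a[0] - b[0]) < min_dx and abs(a[1] - b[1]) < min_dy

def sample_coords(count, max_attempts=10_000, rng=random):
    coords = []
    attempts = 0
    while len(coords) < count:
        attempts += 1
        if attempts > max_attempts:
            raise RuntimeError("could not place all points")
        candidate = (rng.randint(85, 939), rng.randint(75, 693))
        # Keep the candidate only if it clashes with nothing placed so far.
        if not any(overlaps(candidate, placed) for placed in coords):
            coords.append(candidate)
    return coords

coords = sample_coords(10, rng=random.Random(0))
```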
A bit more efficient and as far as I can tell statistically equivalent is to sample one coordinate, like the row, first. Then compute the occupied area in that row and sample from the remaining positions if there are any.
To avoid the same problem as in the easy solution, if there are no available spaces in a row, it should be removed from the possible sampling outcomes for the row (plus the margin of 75 in both directions).
Ideally, you don't compute the occupied regions each time but you keep a mapping from the row to the occupied space in that row and the amount of non-full rows and just update this mapping when inserting new images. You will need storage for n_rows + 1 extra numbers.
To clarify: When sampling from a restricted space just subtract the occupied positions and get a sampling result n. Then find the correct position for n by walking along the coordinate axis n steps, skipping all the occupied positions.
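The "walk n steps, skipping occupied positions" step can be sketched as below. This assumes the occupied space along one axis is stored as non-overlapping, inclusive (start, end) integer intervals that lie within the axis range; the function name and that representation are assumptions for this example.

```python
def nth_free_position(n, lo, hi, occupied):
    """Return the n-th (0-based) free integer in [lo, hi], skipping occupied intervals."""
    pos = lo
    for start, end in sorted(occupied):
        free_before = max(0, start - pos)  # free slots before this interval
        if n < free_before:
            return pos + n
        n -= free_before
        pos = max(pos, end + 1)            # jump past the occupied interval
    if pos + n > hi:
        raise IndexError("fewer free positions than requested")
    return pos + n

# Positions 0..9 with 2..4 and 7 occupied leave 0, 1, 5, 6, 8, 9 free.
occupied = [(2, 4), (7, 7)]
free = [nth_free_position(n, 0, 9, occupied) for n in range(6)]
```

To sample uniformly from the free space, draw n with random.randrange over the number of free positions and pass it to this function.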
Imagine you have a list of points in the 2D-space. I am trying to find symmetric points.
To do that, I iterate over my list of points and apply symmetry operations. Suppose I apply one of these operations to the first point and the result is equal to another point in the list. These 2 points are symmetric.
What I want is to erase this other point from the list that I am iterating over, so that my iteration variable, say "i", won't take this value, because I already know that it is symmetric with the first point.
I have seen similar posts, but they remove a value from the list that they have already visited. What I want is to remove subsequent values.
Add every point that turns out to be symmetric to a set. Since a set maintains unique elements and lookup is O(1), you can use an if point not in set condition to skip points you have already handled.
if point not in s:
    #test for symmetry
    if symmetric:
        s.add(point)
In general it is a bad idea to remove values from a list you are iterating over. There are, however, other ways to skip the symmetric points. For example, you can check for each point if you have seen a symmetric one before:
for i, point in enumerate(points):
    if symmetric(point) not in points[:i]:
        pass  # Do whatever you want to do
Here symmetric produces a point according to your symmetry operation. If your symmetry operation connects more than two points you can do
for i, point in enumerate(points):
    for sympoint in symmetric(point):
        if sympoint in points[:i]:
            break
    else:
        pass  # Do whatever you want to do
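Combining both answers, a runnable sketch could keep a set of points already seen, which avoids the repeated points[:i] slice scans. The mirror_x function is a made-up symmetry operation (reflection across the y-axis) used only for illustration, and the points are assumed to be hashable tuples.

```python
def mirror_x(point):
    # Hypothetical symmetry operation: reflect across the y-axis
    x, y = point
    return (-x, y)

def unique_under_symmetry(points, symmetric=mirror_x):
    """Keep each point unless it or its symmetric partner was already seen."""
    seen = set()
    kept = []
    for point in points:
        if point in seen or symmetric(point) in seen:
            continue  # symmetric to an earlier point, so skip it
        seen.add(point)
        kept.append(point)
    return kept

points = [(1, 2), (-1, 2), (3, 0), (0, 5), (-3, 0)]
kept = unique_under_symmetry(points)
```

The original list is never modified during iteration; the symmetric duplicates are simply not copied into the result.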
I have a dictionary which has coordinates as keys. They are by default in 3 dimensions, like dictionary[(x,y,z)]=values, but may be in any dimension, so the code can't be hard coded for 3.
I need to find if there are other values within a certain radius of a new coordinate, and I ideally need to do it without having to import any plugins such as numpy.
My initial thought was to split the input into a cube and check no points match, but obviously that is limited to integer coordinates, and would grow exponentially slower (radius of 5 would require 729x the processing), and with my initial code taking at least a minute for relatively small values, I can't really afford this.
I heard finding the nearest neighbor may be the best way, and ideally, cutting down the keys used to a range of +- a certain amount would be good, but I don't know how you'd do that when there's more than one point being used. Here's how I'd do it with my current knowledge:
dimensions = 3
minimumDistance = 0.9

#example dictionary + input
dictionary = {}
dictionary[(0,0,0)]=[]
dictionary[(0,0,1)]=[]
keyToAdd = [0,1,1]

closestMatch = float("inf")
tooClose = False
for key in dictionary:
    #calculate distance to new point (Pythagoras in any number of dimensions)
    distanceToPoint = sum((key[i] - keyToAdd[i])**2 for i in range(dimensions)) ** 0.5

    #if you want the overall closest match
    if distanceToPoint < closestMatch:
        closestMatch = distanceToPoint

    #if you want to just check it's not within that radius
    if distanceToPoint < minimumDistance:
        tooClose = True
        break
However, performing calculations this way may still run very slow (it must do this to millions of values). I've searched the problem, but most people seem to have simpler sets of data to do this to. If anyone can offer any tips I'd be grateful.
You say you need to determine IF there are any keys within a given radius of a particular point. Thus, you only need to scan the keys, computing the distance of each to the point until you find one within the specified radius. (And if you do comparisons to the square of the radius, you can avoid the square roots needed for the actual distance.)
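A minimal sketch of that scan, using only built-ins and comparing squared distances so no square root is needed, could look like this. The function names are made up for the example, and it works for keys of any dimension.

```python
def squared_distance(a, b):
    """Sum of squared component offsets; works for any number of dimensions."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def any_within_radius(keys, point, radius):
    """True as soon as one key lies strictly within `radius` of `point`."""
    radius_sq = radius * radius
    # any() stops at the first hit, so the scan ends early when a key is too close.
    return any(squared_distance(key, point) < radius_sq for key in keys)

dictionary = {(0, 0, 0): [], (0, 0, 1): []}
too_close = any_within_radius(dictionary, (0, 1, 1), 0.9)
```

In this example the nearest key is exactly 1 unit away, so nothing falls within the 0.9 radius.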
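One optimization would be to sort the keys based on their "Manhattan distance" from the point (that is, add the absolute component offsets), so that the most likely hits are checked first. The Euclidean distance will never be greater than this, so a key whose Manhattan distance is within the radius is certainly within it, and a key with any single component offset larger than the radius is certainly outside it. This would avoid some of the more expensive calculations (though I don't think you need any trigonometry).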
If, as you suggest later in the question, you need to handle multiple points, you can obviously process each individually, or you could find the center of those points and sort based on that.