I have two or more HLLs that are unioned, and I want to get the intersection count of those unions.
I have used the code from the hll-python example here.
Following is my code:
from aerospike_helpers.operations import hll_operations as hll_ops

# union of the first set of HLLs
ops = [hll_ops.hll_get_union(HLL_BIN, records)]
_, _, result1 = client.operate(getKey(value), ops)
# union of the second set of HLLs
ops = [hll_ops.hll_get_union(HLL_BIN, records2)]
_, _, result2 = client.operate(getKey(value2), ops)
# intersection count of the two unions, computed once against each key
ops = [hll_ops.hll_get_intersect_count(HLL_BIN, [result1[HLL_BIN]] + [result2[HLL_BIN]])]
_, _, resultVal = client.operate(getKey(value), ops)
print(f'intersectAll={resultVal}')
_, _, resultVal2 = client.operate(getKey(value2), ops)
print(f'intersectAll={resultVal2}')
I get two different results when I use different keys for the intersection with hll_get_intersect_count, i.e. resultVal and resultVal2 are not the same. This does not happen for the union count using hll_get_union_count. Ideally the intersection value should be the same.
Can anyone tell me why this is happening and what the right way to do it is?
I was able to figure out the solution (with the help of Aerospike support; the same question was posted and discussed in more detail on the Aerospike forum).
Posting my code for others having the same issue.
Intersection of HLLs is not directly supported in Aerospike. To get the intersection of multiple HLLs, I have to save one union into Aerospike and then get the intersection count of that stored union versus the rest.
The key we provide to client.operate for hll_get_intersect_count identifies the record whose HLL bin gets intersected with the HLLs we pass in.
Following is the code I came up with:
ops = [hll_ops.hll_get_union(HLL_BIN, records)]
_, _, result1 = client.operate(getKey(value), ops)

# init an HLL bucket under a scratch key
ops = [hll_ops.hll_init(HLL_BIN, NUM_INDEX_BITS, NUM_MH_BITS)]
_, _, _ = client.operate(getKey('dummy'), ops)

# compute the second union, insert it into the inited HLL bucket, and keep it for 5 mins (300 sec)
# use hll_set_union to save the HLL into Aerospike temporarily
ops = [hll_ops.hll_set_union(HLL_BIN, records2)]
_, _, _ = client.operate(getKey('dummy'), ops, meta={"ttl": 300})

# intersect the stored union with the first union
ops = [hll_ops.hll_get_intersect_count(HLL_BIN, [result1[HLL_BIN]])]
_, _, resultVal = client.operate(getKey('dummy'), ops)
print(f'intersectAll={resultVal}')
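Putting the steps together, here is a minimal sketch of the workaround as one reusable helper (assuming a connected client and the HLL_BIN, NUM_INDEX_BITS, and NUM_MH_BITS constants from above; temp_key is a hypothetical scratch key):

from aerospike_helpers.operations import hll_operations as hll_ops

def intersect_count_of_unions(client, key1, records, records2, temp_key):
    # Union the first list of HLL objects client-side.
    ops = [hll_ops.hll_get_union(HLL_BIN, records)]
    _, _, first_union = client.operate(key1, ops)

    # Persist the second union in a short-lived scratch record...
    client.operate(temp_key, [hll_ops.hll_init(HLL_BIN, NUM_INDEX_BITS, NUM_MH_BITS)])
    client.operate(temp_key, [hll_ops.hll_set_union(HLL_BIN, records2)], meta={"ttl": 300})

    # ...then intersect the stored union with the first one.
    ops = [hll_ops.hll_get_intersect_count(HLL_BIN, [first_union[HLL_BIN]])]
    _, _, result = client.operate(temp_key, ops)
    return result[HLL_BIN]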
For more reference, you can look at the hll_set_union reference here.
A more elaborate discussion can be found here.
Morning, folks.
I have two equal sets of layers, arranged in subgroups in my ArcGIS Pro (2.9.0) project, as shown here.
It's important that they have the same names (Layer1, Layer2, ...) in both groups.
Now, I'm writing an ArcPy script that applies a definition query, but I want to apply it only to one specific sublayer (e.g. Compare\Layer1 and Compare\Layer2).
For now, I have this piece of code that, I hope, can help.
import arcpy

p = arcpy.mp.ArcGISProject('current')
m = p.listMaps()[0]
l = m.listLayers()

for row in l:
    print(row.name)

COD_QUERY = 123

for row in l:
    if row.name in ('Compare\Layer1'):
        row.definitionQuery = "CODIGO_EOL = {}".format(COD_QUERY)
        print('ok')
When I write 'Compare\Layer1', which is supposed to select only the Layer1 placed in the Compare group, the code doesn't work as expected and applies the query to both Compare\Layer1 and Base\Layer1. That's the exact problem that I'm having.
Hope I can find some help from you guys. XD
The layer's name does not include the group layer's name (the longName property, however, does).
Try using a wildcard (follow the link and search for listLayers) and filter for the particular group layer. A group layer object has a listLayers method too; you can leverage it again to get a specific layer.
import arcpy

COD_QUERY = 123

project = arcpy.mp.ArcGISProject("current")
map = project.listMaps()[0]

filtered_group_layers = map.listLayers("Compare")

if filtered_group_layers and filtered_group_layers[0].isGroupLayer:
    filtered_layers = filtered_group_layers[0].listLayers("Layer1")
    if filtered_layers:
        filtered_layers[0].definitionQuery = f"CODIGO_EOL = {COD_QUERY}"
Or you can use loops. The key here is to filter out the group layers using the isGroupLayer property before accessing each layer's listLayers method.
import arcpy

COD_QUERY = 123

project = arcpy.mp.ArcGISProject("current")
map = project.listMaps()[0]

group_layers = (layer for layer in map.listLayers() if layer.isGroupLayer)

for group_layer in group_layers:
    if group_layer.name == "Compare":
        for layer in group_layer.listLayers():
            if layer.name == "Layer1":
                layer.definitionQuery = f"CODIGO_EOL = {COD_QUERY}"
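Since longName includes the group path, filtering on it is another option. A minimal sketch, assuming the group and layer names from the question:

import arcpy

COD_QUERY = 123

project = arcpy.mp.ArcGISProject("current")
for layer in project.listMaps()[0].listLayers():
    # longName includes group layer names, e.g. "Compare\\Layer1"
    if layer.longName == "Compare\\Layer1":
        layer.definitionQuery = f"CODIGO_EOL = {COD_QUERY}"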
I come from the engineering CAD world and I'm creating some designs in CadQuery. What I want to do is this (pseudocode):
edges = part.edges()
edges[n].fillet(r)
Or ideally have the ability to do something like this (though I can't find any methods for edge properties). Pseudocode:
edges = part.edges()
for edge in edges:
if edge.length() > x:
edge.fillet(a)
else:
edge.fillet(b)
This would be very useful when a design contains non-orthogonal faces. I understand that I can select edges with selectors, but I find them unnecessarily complicated, and they work best with orthogonal faces. FreeCAD lets you treat edges as a list.
I believe there might be a method to select the closest edge to a point, but I can't seem to track it down.
If someone can provide guidance that would be great -- thank you!
Bonus question: Is there a way to return coordinates of geometry as a list or vector? e.g.:
origin = cq.workplane.center().val
>> [x,y,z]
(or something like the above)
Take a look at this code; I hope it will be helpful.
import cadquery as cq

plane1 = cq.Workplane()
block = plane1.rect(10, 12).extrude(10)
edges = block.edges("|Z")  # select the edges parallel to the Z axis
filleted_block = edges.all()[0].fillet(0.5)  # fillet the first of those edges
show(filleted_block)  # 'show' comes from your viewer, e.g. jupyter-cadquery (CQ-editor uses show_object)
For posterity: to select multiple edges, e.g. for chamfering, you can use newObject() on the Workplane. The argument is a list of edges (they have to be cq.occ_impl.shapes.Edge instances, NOT cq.Workplane instances).
import cadquery as cq
model = cq.Workplane().box(10, 10, 5)
edges = model.edges()
# edges.all() returns Workplanes; we have to get the underlying geometry
selected = list(map(lambda x: x.objects[0], edges.all()))
model_with_chamfer = model.newObject(selected).chamfer(1)
To get edge length you can do something like this:
edge = model.edges().all()[0]  # this selects one 'random' edge (as a Workplane)
length = edge.objects[0].Length()
edge.Length() doesn't work, since edge is a Workplane instance, not a geometry instance.
To get edges of a certain length, you can pair each edge's geometry with its length and filter the pairs using Python's built-in filter(). Here is a snippet of my implementation for chamfering the short edges on the topmost face:
from statistics import mean

top_edges = model.edges(">Z and #Z")

def get_length(edge):
    try:
        return edge.vals()[0].Length()
    except Exception:
        return 0.0

# Inside edges are shorter - filter only those
edge_len_list = list(map(
    lambda x: (x.objects[0], get_length(x)),
    top_edges.all()))
avg = mean([a for _, a in edge_len_list])
selected = filter(lambda x: x[1] < avg, edge_len_list)
selected = [e for e, _ in selected]

vertical_edges = model.edges("|Z").vals()  # .vals() gives Edge geometry, not Workplanes
selected.extend(vertical_edges)

model = model.newObject(selected)
model = model.chamfer(chamfer_size)  # chamfer_size is defined elsewhere
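Regarding the bonus question from the original post: Shape.Center() returns a cq.Vector, which can be turned into a plain tuple with toTuple(). A minimal sketch:

import cadquery as cq

wp = cq.Workplane().box(10, 10, 5)
center = wp.val().Center()  # cq.Vector at the solid's center
print(center.toTuple())     # (x, y, z), here (0.0, 0.0, 0.0)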
I think this is a simple operation, but for some reason I'm not finding immediate indicators in my quick perusal of the Pandas docs.
I have prototype working code below, but it seems kinda dumb IMO. I'm sure that there are much better ways to do this, and concepts to describe it.
Is there a better way? If not, is there at least a better way to describe it?
Abstract Problem
Basically, I have columns p0, p1, y0, y1, and some others. The other columns are things I'd like held constant (they remain separate in the table); p0, p1 are the columns I'd like to reduce against, and y0, y1 are the columns I'd like to be reduced.
DataFrame.groupby didn't seem like what I wanted, and when perusing the code I wasn't sure whether anything else was. Multi-indexing also seemed like a possible fit, but I didn't immediately see an example of what I desired.
Here's the code that does what I want:
from collections import defaultdict

import pandas as pd

def merge_into(*, from_, to_):
    for k, v in from_.items():
        to_[k] = v

def reduce_along(df, along_cols, reduce_cols, df_reduce=pd.DataFrame.mean):
    hold_cols = set(df.columns) - set(along_cols) - set(reduce_cols)
    # dumb way to remember Dict[HeldValues, ValuesToReduce]
    to_reduce_map = defaultdict(list)
    for i in range(len(df)):
        row = df.iloc[i]
        # can I instead use a series? is that hashable?
        key = tuple(row[list(hold_cols)])
        to_reduce = row[reduce_cols]
        to_reduce_map[key].append(to_reduce)
    rows = []
    for key, to_reduce_list in to_reduce_map.items():
        # ... yuck?
        row = pd.Series({k: v for k, v in zip(hold_cols, key)})
        reduced = df_reduce(pd.DataFrame(to_reduce_list))
        merge_into(from_=reduced, to_=row)
        rows.append(row)
    return pd.DataFrame(rows)

reducto = reduce_along(summary, ["p0", "p1"], ["y0", "y1"])
display(reducto)
Background
I am running some sweeps for ML stuff; in them, I sweep over a model architecture param, as well as the dataset size and the seed that controls random initialization of the model parameters.
I'd like to reduce along the seed to get a "feel" for which architectures are possibly more robust to initialization; for now, I'd like to see which dataset size helps the most. In the future, I'd like to do a (heuristic) reduction along dataset size as well.
Actually, it looks like DataFrame.groupby(hold_cols).agg({k: ["mean"] for k in reduce_cols}) is what I want. Source: https://jamesrledoux.com/code/group-by-aggregate-pandas
import functools

import numpy as np
import pandas as pd

# See: https://stackoverflow.com/a/47699378/7829525
std = functools.partial(np.std)

def reduce_along(df, along_cols, reduce_cols, agg=[np.mean, std]):
    hold_cols = list(set(df.columns) - set(along_cols) - set(reduce_cols))
    hold_cols = [x for x in df.columns if x in hold_cols]  # Preserve order
    # From: https://jamesrledoux.com/code/group-by-aggregate-pandas
    df = df.groupby(hold_cols).agg({k: agg for k in reduce_cols})
    df = df.reset_index()
    return df
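For illustration, a minimal sketch with made-up data (the p/y column names are from the question; arch stands in for a held-constant column):

summary = pd.DataFrame({
    "arch": ["a", "a", "b", "b"],  # held constant
    "p0": [1, 2, 1, 2],            # reduced along
    "p1": [0, 0, 1, 1],
    "y0": [0.1, 0.3, 0.5, 0.7],    # reduced
    "y1": [1.0, 2.0, 3.0, 4.0],
})
reducto = reduce_along(summary, ["p0", "p1"], ["y0", "y1"])
print(reducto)  # one row per value of 'arch', with mean and std for y0 and y1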
I am having a problem with a genetic feature-optimization algorithm that I am attempting to build. The idea is that a specific combination of features is tested, and if the model accuracy using those features is higher than the previous maximum, the combination replaces the previous maximum combination. By running through the remaining potential features in this way, the final combination should be the optimal combination of features given the feature vector size. Currently, the code that looks to achieve this is:
import random

def mutate_features(features, feature):
    new_features = features
    index = random.randint(0, len(features) - 1)
    new_features[index] = feature
    return new_features

def run_series(n, f_list, df):
    features_list = []
    results_list = []
    max_results_list = [[0, 0, 0, 0, 0]]
    max_feature_list = []
    features = [0, 0, 0, 0, 1]
    for i in range(0, 5):  # 5 has just been chosen as the range for testing purposes
        results = run_algorithm(df, f_list, features)  # defined elsewhere
        features_list.append(features)
        results_list.append(results)
        if check_result_vector(max_results_list, results):  # defined elsewhere
            max_results_list.append(results)
            max_feature_list.append(features)
        else:
            print("Revert to previous: " + str(max_feature_list[-1]))
            features = max_feature_list[-1]
        features = mutate_features(features, f_list[i])
    print("Feature List = " + str(features_list))
    print("Results List = " + str(results_list))
    print("Max Results List = " + str(max_results_list))
    print("Max Feature List = " + str(max_feature_list))
The output from this code was attached as a screenshot in the original post (not reproduced here).
The part that I do not understand is the output of max_feature_list and features_list.
If anything is appended to max_feature_list or features_list inside the for loop, all items already in the list seem to change to match the latest addition. I may not fully understand the syntax/logic around this and would really appreciate any feedback as to why the program is doing this.
It happens because you change the values of features inside the mutate_features function; since append stores a reference to the list rather than a copy, the entries already in max_feature_list change too when the underlying list changes.
One way to prevent this behaviour is to deepcopy features inside mutate_features, mutate the copy as you want, and then return it (for a flat list like this one, a shallow copy such as list(features) would also suffice).
For example:
import random
from copy import deepcopy

def mutate_features(features, feature):
    new_features = deepcopy(features)  # copy, so the caller's list is untouched
    index = random.randint(0, len(features) - 1)
    new_features[index] = feature
    return new_features

features = [1, 2, 3]
res = []
res.append(features)
features = mutate_features(features, 99)  # 99 is an arbitrary example value
res.append(features)
print(res)  # the first entry keeps its original values, e.g. [[1, 2, 3], [1, 99, 3]]
I am implementing a hierarchical clustering algorithm (with similarity) using Python 3.6. The following basically builds a new empty graph and keeps connecting the groups (represented by lists here) with the largest similarity on the original, recursively.
At position 1 in the code, I want to return the best partition; however, what the function returns is exactly the same as comminity_list. It looks like best_partition = comminity_list makes best_partition point to the address of comminity_list. How come this happens, what did I get wrong here, and how should I fix it?
import networkx as nx

def pearson_clustering(G):
    H = nx.create_empty_copy(G)  # build an empty copy of G (same nodes, no edges)
    best = 0     # best modularity
    current = 0  # current modularity
    A = nx.adj_matrix(G)  # get the adjacency matrix
    org_deg = deg_dict(A, G.nodes())  # degrees of G (helper defined elsewhere)
    org_E = G.number_of_edges()  # number of edges of G
    comminity_list = intial_commnity_list(G)  # returns a list of lists (defined elsewhere)
    best_partition = None
    p_table = pearson_table(G)  # returns a dictionary with each pair's Pearson correlation
    l = len(comminity_list)
    while True:
        if l == 2:
            break
        current = modualratiry(H, org_deg, org_E)  # find the current modularity
        l = len(comminity_list)
        p_build_cluster(p_table, H, G, comminity_list)  # build the clustering on H
        if best < current:
            best_partition = comminity_list  # position 1
            best = current  # keep the clustering with the largest modularity
    return best_partition  # position 2
That is just Python's assignment behaviour. When you do best_partition = comminity_list, you bind the name best_partition to the very same list object as comminity_list.
If you want to explicitly copy the list you can use this (which replaces the contents of best_partition with those of comminity_list):
best_partition[:] = comminity_list
or the copy function from the copy module (or the list's .copy() method). If comminity_list has sublists, you will need the deepcopy function from the same module instead (otherwise you will get a copy of the outer list, but the sublists will still be shared references).
best_partition = comminity_list.copy()
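A quick standalone illustration of the difference between a shallow and a deep copy (not tied to the clustering code):

from copy import deepcopy

nested = [[1, 2], [3, 4]]
shallow = nested.copy()   # new outer list, but the sublists are shared
deep = deepcopy(nested)   # fully independent copy

nested[0].append(99)
print(shallow[0])  # [1, 2, 99] - the shared sublist changed
print(deep[0])     # [1, 2]     - the deep copy is unaffected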